WO2022045401A1 - Deep learning bitstream encoder and decoder, and method therefor - Google Patents

Deep learning bitstream encoder and decoder, and method therefor

Info

Publication number
WO2022045401A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
deep learning
bitstream
unit
kernel
Prior art date
Application number
PCT/KR2020/011528
Other languages
English (en)
Korean (ko)
Inventor
김성제
정진우
최병호
홍민수
이승호
Original Assignee
한국전자기술연구원 (Korea Electronics Technology Institute)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from KR1020200108297A external-priority patent/KR102517449B1/ko
Priority claimed from KR1020200108298A external-priority patent/KR102654690B1/ko
Application filed by 한국전자기술연구원 (Korea Electronics Technology Institute)
Publication of WO2022045401A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/13Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/423Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation characterised by memory arrangements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/625Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards

Definitions

  • the present invention relates to a hardware deep learning accelerator, and more particularly to a low-power deep learning accelerator.
  • Deep learning technology is showing encouraging performance that surpasses existing traditional methods in various application fields such as image recognition, voice signal processing, and natural language processing. Deep learning technology improves performance by learning a network model composed of weights and biases to suit a given application field, and then re-applying the learned network model to the application field.
  • However, the size of the deep learning network model grows as the amount of training increases. This becomes a problem in fields that must extend usage time on low power, such as mobile applications: as the model grows, the amount of data increases, so processing time becomes a problem, and power consumption also rises with the amount of computation.
  • the inventors of the present invention have researched ways to solve the high power consumption of the prior-art deep learning accelerator and the difficulty of managing the deep learning network model. After much effort to provide a deep learning accelerator that can be used in a mobile environment by alleviating the high power consumption problem of the prior-art deep learning accelerator while maintaining the performance of the deep learning network model, and to provide a method of producing a multimedia bitstream in which the deep learning network model can be managed together with the multimedia data, the present invention has been completed.
  • An object of the present invention is to provide a deep learning accelerator capable of reducing power required for deep learning operations without degrading the performance of the deep learning network model.
  • Another object of the present invention is to provide a method for generating a multimedia bitstream that is easy to manage the multimedia bitstream and the deep learning network model by including the deep learning network model information in the multimedia bitstream and generating it as a single file.
  • an encoding unit that compresses the kernel or activation data and generates a bitstream
  • an insertion unit that inserts into the bitstream a delimiter dividing the compressed data into any one unit among a layer, a tile, and a block, and outputs the resulting bitstream
  • the encoding unit includes: a quantization unit for quantizing kernel or activation data; a context modeling unit for deriving context information based on the quantized kernel or activation data and previous kernel or activation data; an entropy coding unit that compresses the kernel or activation data quantized by the quantization unit, based on the context information of the context modeling unit, and generates a bitstream; and an insertion unit that inserts into the bitstream a delimiter dividing the compressed data into any one unit among a layer, a tile, and a block, and outputs the resulting bitstream.
  • the entropy coding unit is characterized by using run-length coding, Huffman coding, or arithmetic coding.
  • a prediction unit that predicts the kernel or activation data using correlation with previously encoded kernel or activation data; a transform unit that converts the kernel or activation data from the time domain to the frequency domain; and a decomposition unit that decomposes the kernel or activation data into a plurality of small-dimensional data.
  • the inserting unit may further insert, into the bitstream, a flag indicating whether or not to selectively activate the predictor, the transform unit, and the decomposition unit according to an encoding method.
  • the transform unit may perform Discrete Cosine Transform (DCT), Discrete Sine Transform (DST), or Discrete Fourier Transform (DFT) transform on the kernel or activation data.
  • the decomposition unit is characterized in that it uses Low Rank Decomposition or CP Decomposition (Canonical Polyadic Decomposition).
  • a parser that divides the bitstream into units by using a delimiter, inserted into the bitstream, that separates the kernel or activation data into any one unit among layers, tiles, and blocks; and a decoding unit that decodes the unit divided by the parser from the bitstream and stores the decoded unit in the memory.
  • the decoding unit may include: an entropy decoding unit for entropy-decoding the kernel or activation data of the divided unit using context information; a context modeling unit for deriving context information based on the entropy-decoded kernel or activation data and previously entropy-decoded kernel or activation data; and an inverse quantization unit for inverse quantizing the entropy-decoded kernel or activation data.
  • the decoding unit may include: an inverse prediction unit configured to inversely predict the entropy-decoded kernel or activation data using correlation with previously entropy-decoded kernel or activation data; an inverse transform unit for inversely transforming the entropy-decoded kernel or activation data from the frequency domain to the time domain; and an inverse decomposition unit that restores the kernel or activation data from a plurality of low-dimensional data to higher-dimensional data.
  • a deep learning acceleration device according to another embodiment of the present invention.
  • a parser that divides the bitstream into units by using a delimiter, inserted into the bitstream, that separates the kernel or activation data into any one unit among layers, tiles, and blocks; a decoder that decodes the units identified by the parser from the bitstream and stores them in the memory; and an accelerator for performing a deep learning operation using the decoded data.
  • quantizing the kernel or activation data; deriving context information based on the quantized kernel or activation data and previous kernel or activation data; generating a bitstream by compressing the quantized kernel or activation data based on the derived context information; and inserting into the bitstream, and outputting, a delimiter for dividing the compressed kernel or activation data into any one unit among a layer, a tile, and a block.
  • A decoding method according to the present invention includes: dividing the bitstream into units by using a delimiter, inserted into the bitstream, that separates the kernel or activation data into any one unit among layers, tiles, and blocks; entropy-decoding the kernel or activation data of the divided unit using context information; deriving context information based on the entropy-decoded kernel or activation data and previously entropy-decoded kernel or activation data; and inverse quantizing the entropy-decoded kernel or activation data.
  • a first decoder for decoding, in that unit, a kernel data bitstream divided into any one unit among a layer, a tile, and a block, and storing the decoded unit in the memory; an encoder that compresses the activation data and generates an activation data bitstream into which a delimiter distinguishing each unit of a layer, a tile, and a block is inserted; a parser that divides the activation data bitstream into any one unit among a layer, a tile, and a block using the delimiter; a second decoder that decodes the divided activation data bitstream in that unit and stores it in the memory; and an accelerator for performing a deep learning operation using the kernel or activation decoding data stored in the memory.
  • a method for generating a bitstream according to the present invention comprises:
  • generating a payload type field that identifies a deep learning network model; generating a use case field that identifies an algorithm to be applied to multimedia data; generating a location index field that specifies a location in the multimedia data to which the deep learning network model is applied; generating a parameter index field that designates a parameter to be applied to the multimedia data; generating a property field indicating a property of the multimedia data after the deep learning network model is applied; and generating a payload field including deep learning network model data according to the payload type.
  • the bitstream is characterized in that it is included in the additional information area of the multimedia bitstream.
  • the bitstream is characterized in that it is included in the multimedia data area of the multimedia bitstream.
  • the position in the multimedia data to which the deep learning network model is applied is characterized in that it is divided into frames, regions, or scenes of multimedia data.
  • the parameter index to be applied to the multimedia data is a video parameter set (VPS) index, a sequence parameter set (SPS) index, or a picture parameter set (PPS) index.
  • the algorithm to be applied to the multimedia data is characterized in that it is super resolution or noise reduction.
  • the properties of the multimedia data after the deep learning network model is applied include image resolution, frame rate, or color space.
  • An encoding apparatus according to the present invention includes: a memory; and a bitstream generator, comprising one or more processors, that generates a bitstream including multimedia data and a deep learning network model to be applied to the multimedia data, and stores it in the memory. The bitstream generator generates a payload type field for classifying the deep learning network model, a use case field for classifying an algorithm to be applied to the multimedia data, a location index field for designating the location in the multimedia data to which the deep learning network model is applied, and a property field indicating the properties of the multimedia data after the deep learning network model is applied, and generates a payload field including the deep learning network model data according to the payload type.
  • classifying the type of deep learning network model by the payload type field; classifying an algorithm to be applied to the multimedia data by the use case field; distinguishing the location in the multimedia data to which the deep learning network model is applied by the location index field; distinguishing the properties of the multimedia data after the deep learning network model is applied by the property field; and distinguishing the deep learning network model data according to the payload type by the payload field.
  • a decoding apparatus according to another embodiment of the present invention.
  • a bitstream parsing unit, comprising one or more processors, that parses a bitstream including a deep learning network model to be applied to multimedia data and stores it in the memory. The bitstream parsing unit distinguishes the type of deep learning network model by the payload type field, distinguishes the algorithm to be applied to the multimedia data by the use case field, distinguishes the location in the multimedia data to which the deep learning network model is applied by the location index field, distinguishes the properties of the multimedia data after the deep learning network model is applied by the property field, and distinguishes the deep learning network model data according to the payload type by the payload field.
  • Since the entire data can be used in units of layers/tiles/blocks without modification of the deep learning network model, the present invention has the advantage of implementing a deep learning accelerator without degrading the performance of the deep learning network model.
  • FIG. 1 is a structural diagram of a deep learning acceleration apparatus according to the prior art.
  • FIG. 2 is a structural diagram of an entire system using a deep learning acceleration device according to a preferred embodiment of the present invention.
  • FIG. 3 is an example of a bitstream parsing method according to a preferred embodiment of the present invention.
  • FIG. 4 is a structural diagram of an encoder according to a preferred embodiment of the present invention.
  • FIG. 5 is a structural diagram of a decoder according to a preferred embodiment of the present invention.
  • FIG. 6 is a flowchart of an encoding method according to a preferred embodiment of the present invention.
  • FIG. 7 is a flowchart of a decoding method according to a preferred embodiment of the present invention.
  • FIG. 8 shows an example in which a multimedia bitstream and a deep learning network model according to the prior art are applied.
  • FIG. 9 is an example of applying a deep learning network model to a multimedia bitstream according to a preferred embodiment of the present invention.
  • FIG. 10 shows the structure of a multimedia bitstream to which the multimedia bitstream generating method according to a preferred embodiment of the present invention is applied.
  • FIG. 11 is a flowchart of a method for generating a multimedia bitstream according to another preferred embodiment of the present invention.
  • FIG. 12 is a flowchart of a multimedia bitstream decoding method according to another preferred embodiment of the present invention.
  • FIG. 13 is a structural diagram of an encoding apparatus according to a preferred embodiment of the present invention.
  • FIG. 14 is a structural diagram of a decoding apparatus according to a preferred embodiment of the present invention.
  • terms such as “or” and “at least one” may indicate one of the words listed together, or a combination of two or more.
  • “A or B” and “at least one of A and B” may include only one of A or B, or both A and B.
  • 'first' and 'second' may be used to describe various components, but the components should not be limited by the above terms.
  • the above terms should not be construed as limiting the order of each component, and may be used for the purpose of distinguishing one component from another.
  • a 'first component' may be referred to as a 'second component'
  • a 'second component' may also be referred to as a 'first component'.
  • FIG. 1 is a schematic configuration diagram of a mobile processor chip (AP: Application Processor) including a deep learning accelerator of the prior art.
  • the mobile processor chip 10 includes a central processing unit (CPU) 11 , an external interface 12 , a memory controller 13 , a deep learning accelerator 14 , and an on-chip memory 15 .
  • the learned kernel data 22 and activation data 24 are stored in the external memory 20 .
  • the kernel data 22 is also called a deep learning network model and is composed of several layers, each consisting of weight and bias values. Since each layer of the kernel data 22 has a different number of weight and bias values depending on the layer type (Input / Output / Convolutional / Residual / Fully-Connected / Recurrent / Batch Normalization, etc.), the kernel data 22 varies in size.
  • For example, the size of the deep learning network model is about 240 MB for the AlexNet network and about 552 MB for the VGG-16 network.
  • A model of such a large capacity cannot be kept resident in the internal on-chip memory 15, which uses SRAM, so it must be stored in, and used from, an external memory 20 based on DRAM or the like.
  • Since the kernel data 22 must therefore be fetched frequently from the external memory 20, power consumption is high (in a 45 nm CMOS process, one 32-bit DRAM memory access consumes about 640 pJ of energy).
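  • As a rough back-of-the-envelope illustration (our own arithmetic, not a figure from the specification): fetching a 240 MB model once over a 32-bit interface takes about 240×10⁶ / 4 = 60×10⁶ accesses, i.e., roughly 60×10⁶ × 640 pJ ≈ 38 mJ for a single full pass over the model, repeated every time the model cannot stay on-chip.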
  • the activation data 24 is used as an input for the next layer, and in some cases, it is also used for a layer after several steps.
  • the activation data 24 varies depending on the size of the input data. As the input resolution increases, the amount of the activation data 24 also increases. Therefore, when the activation data 24 is stored in the external memory 20 for later use, power consumption also increases.
  • the prior art uses a method of compressing and storing the kernel/activation data 22 and 24 in order to reduce the number of accesses to the external memory 20 and the amount of data.
  • Even when this compression method is used, power is still used inefficiently, because the process of fetching the entire compressed data from the external memory 20, using only the necessary part, and discarding the rest is repeated.
  • FIG. 2 is a block diagram of an entire system to which a deep learning acceleration device according to a preferred embodiment of the present invention is applied.
  • the deep learning acceleration apparatus 100 includes an on-chip memory 110 , a deep learning accelerator 120 , a CPU 130 , an interface 140 , and a memory controller 150 .
  • the on-chip memory 110 stores the data needed for operations in the deep learning accelerator 120. Since the on-chip memory 110 uses high-speed SRAM and therefore has a small capacity, large amounts of data are stored in the external memory 20, which uses large-capacity DRAM.
  • the deep learning accelerator 120 includes a kernel data decoder 122 and an activation data decoder/encoder 124 .
  • the CPU 130 may include a bitstream parser 132 .
  • the interface 140 is used for communication with the outside of the deep learning acceleration device 100 .
  • the memory controller 150 controls data transmission/reception with the external memory 20 .
  • the kernel data encoder 30 for kernel data compression exists outside the deep learning acceleration apparatus 100 .
  • If the entire compressed kernel data 22 in the external memory 20 were fetched at once, the on-chip memory 110 might be insufficient, and power would be consumed on every fetch, which is inefficient. Therefore, in the present invention, the compressed kernel data 22 is divided into predetermined units and fetched unit by unit.
  • the bitstream parser 132 parses only the necessary part of the externally stored compressed kernel data 22 or compressed activation data 24 and transmits it to the deep learning accelerator 120, which decodes the compressed data.
  • FIG. 3 is an example of fetching only the necessary data from compressed data using the bitstream parser.
  • the bitstream parser 132 divides the compressed kernel data or activation data into only as much as is needed and transmits it to the kernel data decoder 122 or the activation data decoder 124. To this end, the bitstream parser 132 can separate the compressed kernel data or activation data into specific units such as layers, tiles, and blocks by means of a delimiter 221 (Start Code Prefix) inserted into the bitstream. The delimiter is then removed from, for example, the m-th layer data divided in this way, and the data is transmitted to the deep learning accelerator 120.
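  • A minimal sketch of this delimiter mechanism is shown below. The three-byte 0x000001 start code value and the helper names are illustrative assumptions (the text only requires some Start Code Prefix); a real codec would also add emulation-prevention bytes so the delimiter cannot occur inside a payload.

```python
START_CODE = b"\x00\x00\x01"  # assumed Start Code Prefix value

def insert_delimiters(units):
    """Concatenate compressed units (e.g., one per layer), each preceded
    by the start code prefix, into a single bitstream (insertion unit)."""
    return b"".join(START_CODE + u for u in units)

def parse_unit(bitstream, m):
    """Return only the m-th unit, delimiter removed: the role the
    bitstream parser 132 plays for the deep learning accelerator."""
    units = bitstream.split(START_CODE)[1:]  # [0] is empty before the 1st code
    return units[m]

layers = [b"layer0-payload", b"layer1-payload", b"layer2-payload"]
stream = insert_delimiters(layers)
assert parse_unit(stream, 1) == b"layer1-payload"
```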
  • the bitstream parser 132 may be implemented in software by the CPU 130 of the deep learning acceleration device 100 or may be implemented in the form of hardware inside the deep learning acceleration device 100 .
  • FIG. 4 is a structural diagram of an activation data encoder.
  • the activation data encoder 126 may also encode kernel data in the same structure.
  • the activation data encoder 126 includes a preprocessor 1261, a quantization unit 1262, a decomposition unit 1263, a transform unit 1264, a prediction unit 1265, a context modeling unit 1266, an entropy coding unit 1267, and an insertion unit 1268.
  • the encoder 126 receives kernel data or activation data as input and generates a bitstream in which a delimiter is inserted.
  • kernel data can also be encoded in the same way.
  • When activation data is input, it first passes through the preprocessor 1261. The preprocessor 1261 selectively replaces data values with 0 according to the importance of the input data, using a pruning technique or the like.
  • Quantization, performed by the quantization unit 1262, is the process of converting the real-valued activation data into integers in order to increase calculation speed and reduce the amount of computation.
  • An example is to convert a real value, usually expressed as a 32-bit floating point, into a 16/8/6/4-bit integer value.
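  • A minimal sketch of such quantization, assuming symmetric max-abs scaling (the text does not fix a particular scale derivation); the inverse function corresponds to the inverse quantization performed on the decoder side:

```python
import numpy as np

def quantize(x, bits=8):
    """Uniformly quantize float32 data to `bits`-bit integers (bits <= 8 here)."""
    m = float(np.max(np.abs(x)))
    scale = m / (2 ** (bits - 1) - 1) if m > 0 else 1.0
    return np.round(x / scale).astype(np.int8), scale

def dequantize(q, scale):
    """Inverse quantization: integer data back to approximate real values."""
    return q.astype(np.float32) * scale

activation = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize(activation)
max_error = np.max(np.abs(activation - dequantize(q, scale)))
```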
  • the decomposition unit 1263 decomposes the activation data into several small-dimensional data.
  • the decomposition unit 1263 may decompose the input data using low rank decomposition, canonical polyadic decomposition (CP), or the like.
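  • A minimal sketch of low-rank decomposition for a 2-D kernel matrix, using a truncated SVD; the rank and the SVD-based factorization are illustrative choices (CP decomposition plays the same role for higher-order tensors):

```python
import numpy as np

def low_rank_decompose(W, rank):
    """Approximate W by two small factors so that W ~= A @ B."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # shape (out_dim, rank)
    B = Vt[:rank, :]             # shape (rank, in_dim)
    return A, B

W = np.random.randn(64, 64).astype(np.float32)
A, B = low_rank_decompose(W, rank=8)
# 64*64 = 4096 stored values shrink to 64*8 + 8*64 = 1024
rel_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```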
  • the transform unit 1264 transforms the activation data into frequency domain data.
  • the transform unit 1264 converts data into block-unit frequency domain data by using transforms such as the Discrete Cosine Transform (DCT), Discrete Sine Transform (DST), and Discrete Fourier Transform (DFT). Owing to the characteristics of activation data, the data tends to be concentrated in a certain frequency band; frequency domain transformation is therefore performed because processing the signal in the frequency domain is more effective for compression.
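  • A minimal sketch of the block-unit transform, assuming 8×8 blocks and the DCT-II (the block size is our assumption; the text allows DCT, DST, or DFT):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix; rows index frequency."""
    i = np.arange(n)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i[None, :] + 1) * i[:, None] / (2 * n))
    C[0, :] /= np.sqrt(2.0)
    return C

def block_dct(x, n=8):
    """2-D DCT applied per n x n block (dimensions assumed multiples of n)."""
    C = dct_matrix(n)
    y = np.empty_like(x, dtype=np.float64)
    for r in range(0, x.shape[0], n):
        for c in range(0, x.shape[1], n):
            y[r:r + n, c:c + n] = C @ x[r:r + n, c:c + n] @ C.T
    return y  # energy concentrates in each block's low-frequency corner

coeffs = block_dct(np.random.randn(16, 16))
```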
  • the prediction unit 1265 predicts the activation data by using a correlation with the previously encoded activation data.
  • When the prediction unit 1265 is used, only the residual of the activation data needs to be encoded and transmitted, so the amount of data can be reduced.
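  • A minimal sketch of this residual coding, assuming the previous block serves as the predictor (the text leaves the exact predictor open); from_residuals is the matching inverse prediction used on the decoder side:

```python
import numpy as np

def to_residuals(blocks):
    """Replace each block with its difference from the previous block."""
    prev = np.zeros_like(blocks[0])
    residuals = []
    for b in blocks:
        residuals.append(b - prev)  # small residuals entropy-code well
        prev = b
    return residuals

def from_residuals(residuals):
    """Inverse prediction: accumulate residuals back into the blocks."""
    prev = np.zeros_like(residuals[0])
    blocks = []
    for r in residuals:
        prev = prev + r
        blocks.append(prev)
    return blocks
```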
  • the prediction unit 1265 , the transform unit 1264 , and the decomposition unit 1263 may be positioned before or after the quantization unit 1262 to process activation data.
  • the prediction unit 1265 , the transform unit 1264 , and the decomposition unit 1263 may be deactivated if necessary. There may be cases in which some blocks are deactivated for real-time processing of activation data. Accordingly, information indicating which block is activated or deactivated is transmitted to the entropy coding unit 1267 and inserted into the bitstream.
  • the context modeling unit 1266 compares the activation data that has passed through the prediction unit 1265, the transform unit 1264, the decomposition unit 1263, or the quantization unit 1262 with the previously encoded activation data to perform context modeling. Context modeling analyzes the trend of the data: by identifying how the activation data changes, it derives context information (Context_id), the ID of the probability table to be used, and delivers it to the entropy coding unit 1267.
  • the entropy coding unit 1267 compresses the activation data that has gone through all the previous processes.
  • Entropy coding is lossless compression coding, and compression methods such as Run-Length Coding, Huffman Coding, and Arithmetic Coding may be used.
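  • A minimal sketch of one of these options, run-length coding, which directly exploits the zero runs that pruning and quantization create; the decoder shows the lossless round trip:

```python
def rle_encode(symbols):
    """Collapse runs of equal symbols into (value, run-length) pairs."""
    out, i = [], 0
    while i < len(symbols):
        j = i
        while j < len(symbols) and symbols[j] == symbols[i]:
            j += 1
        out.append((symbols[i], j - i))
        i = j
    return out

def rle_decode(pairs):
    """Expand (value, run-length) pairs back into the symbol list."""
    out = []
    for value, run in pairs:
        out.extend([value] * run)
    return out

data = [0, 0, 0, 0, 5, 0, 0, 7, 7, 0]
assert rle_decode(rle_encode(data)) == data
```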
  • the inserter 1268 inserts a delimiter (Start Code Prefix) into the compressed bitstream.
  • a bitstream can be divided into a specific unit such as a layer, a tile, or a block by the delimiter.
  • FIG. 5 is a schematic structural diagram of the kernel data decoder 122 .
  • the kernel data decoder 122 decodes the compressed kernel data
  • the activation data decoder 124 for decoding the activation data compressed by the activation data encoder 126 also has the same structure.
  • the kernel data decoder 122 includes an entropy decoding unit 1221, a context modeling unit 1222, an inverse quantization unit 1223, a synthesis unit 1224, an inverse transform unit 1225, an inverse prediction unit 1226, and a post-processing unit 1227.
  • the entropy decoding unit 1221 receives data from which the delimiter is removed from the bitstream parser 132 and performs entropy decoding. It decompresses the compressed bitstream and converts it into kernel data.
  • the entropy decoding unit 1221 performs entropy decoding using the context information (Context_id) of the context modeling unit 1222 .
  • the inverse quantization unit 1223 converts the data quantized into integer data by the encoder back into real data.
  • Kernel data that has passed through the inverse quantization unit 1223 is restored through a post-processing process of the post-processing unit 1227 and used by the deep learning accelerator.
  • the entropy-decoded kernel data may pass through the synthesis unit 1224 , the inverse transform unit 1225 , or the inverse prediction unit 1226 . Since which block is activated is included in the bitstream, it is possible to determine whether to activate each block based on this information.
  • the synthesis unit 1224, the inverse transform unit 1225, and the inverse prediction unit 1226 perform the reverse processes of the decomposition unit 1263, the transform unit 1264, and the prediction unit 1265 of the encoder. That is, the synthesis unit 1224 restores the data decomposed into several pieces back to the original dimensional data, the inverse transform unit 1225 restores the data transformed into the frequency domain back to the time domain, and the inverse prediction unit 1226 adds the predicted data to the residual data to restore the original data.
  • By using the encoder/decoder of the present invention to selectively encode/decode only the data required for the deep learning operation, rather than the entire data, the power and time consumed in transmitting the entire deep learning network model data can be reduced.
  • FIG. 6 is a schematic flowchart of a data encoding method according to the present invention.
  • the input kernel/activation data is first quantized (S10), converting real-valued data into integer data.
  • The quantized data may then be subjected to prediction (S20), transformation (S30), or decomposition (S40) steps. These three steps may be selectively performed according to the amount of computation, and the position of the quantization step (S10) in this order may also be changed.
  • Data prediction may be performed on the quantized data (S20). It predicts the next data by using the correlation with the previously encoded data.
  • the frequency domain transform transforms data into block unit frequency domain data by using transforms such as Discrete Cosine Transform (DCT), Discrete Sine Transform (DST), and Discrete Fourier Transform (DFT) (S30).
  • Data decomposition is performed by decomposing the data into several small-dimensional data using Low Rank Decomposition, Canonical Polyadic Decomposition, and the like (S40).
  • Context modeling is performed to determine the context by comparing the data processed in this way with the previously encoded data (S50). The context information produced by context modeling is passed to the next step.
  • the quantized kernel data is entropy-coded using context information (S60). Compression methods such as Run-Length Coding, Huffman Coding, and Arithmetic Coding may be used.
  • Finally, a delimiter (Start Code Prefix) is inserted into the compressed bitstream.
  • FIG. 7 is a schematic flowchart of a method of decoding encoded kernel data or activation data. First, bitstream parsing is performed (S110).
  • the bitstream parser finds the necessary data using the delimiter and supplies it to the decoder.
  • Compressed data is decompressed by entropy decoding (S120). It is to reversely decode data compressed by compression methods such as Run-Length Coding, Huffman Coding, and Arithmetic Coding.
  • Context modeling is performed by comparing entropy-decoded data with previously decoded data (S130). Context information is derived by context modeling, which is again used in the entropy decoding step.
  • Data synthesis (S140), time domain transformation (S150), and data inverse prediction (S160) may be selectively performed. Since the coding parameter indicating whether each block is activated is included in the transmitted bitstream of the compressed data, whether each step is performed is determined from this information. In addition, the data synthesis (S140), time domain transformation (S150), and data inverse prediction (S160) steps may be located after the inverse quantization step (S170).
  • the data synthesis step restores the data that was decomposed into several low-dimensional pieces back to its original dimensions (S140).
  • data in the frequency domain is converted into data in the time domain (S150).
  • Inverse transform methods such as Inverse Discrete Cosine Transform (iDCT), Inverse Discrete Sine Transform (iDST), and Inverse Discrete Fourier Transform (iDFT) may be used.
  • In the inverse quantization step, the kernel data is restored by converting the integer data back into real-valued data (S170).
  • This encoding/decoding method encodes/decodes only the units of data required for deep learning acceleration, so the deep learning network model data can be used without performance degradation even with insufficient on-chip memory, and only a portion of the entire data needs to be stored in or loaded from the external memory. Accordingly, the amount of power consumed for data transmission can be effectively reduced.
  • FIG. 8 shows an example in which a multimedia bitstream and a deep learning network model according to the prior art are applied.
  • the multimedia bitstream decoding apparatus 1 includes a decoder 2 and a deep learning processing unit 3 .
  • the decoder 2 may decode the multimedia bitstream 4 to produce video/audio.
  • the deep learning processing unit 3 applies the deep learning network model 5 to the decoded video/audio content to create video/audio content with improved picture quality or sound quality.
  • Deep learning network model standardization efforts such as NNR (Neural Network Representation) only describe how to define a deep learning network model and exchange its weights effectively; they are composed independently, without any relation to the multimedia streams to which these deep learning network models are applied.
  • While the decoder 2 processes the bitstream 4 and the deep learning processing unit 3 applies the deep learning network model 5, the reality is that the bitstream 4 and the deep learning network model 5 are managed independently, without any connection between them.
  • When the bitstream 4 and the deep learning network model 5 are managed independently in this way, it is not easy to keep track of which deep learning network model 5 is applied to which bitstream 4.
  • When the deep learning network model 5 is applied in units of scenes or frames even within the same bitstream 4, the management difficulty only increases.
  • FIG. 9 shows the structure of a multimedia bitstream including a deep learning network model according to a preferred embodiment of the present invention, which solves the problems of the prior art.
  • multimedia data composed of video or audio consists of a very large amount of data. Therefore, in order to efficiently store and transmit multimedia data, compression is mainly performed.
  • Various international standards exist for compressing multimedia data and the data compressed for transmission or storage is called a multimedia stream.
  • International standards for video data include H.264 and HEVC (High Efficiency Video Coding); international standards for still-image data include JPEG (Joint Picture Experts Group) and JPEG2000; and international standards for audio data include MP3 and AAC (Advanced Audio Coding).
  • A multimedia bitstream is divided into a payload area 310, which is the area storing the multimedia data itself, and an additional information area 320 containing other information.
  • In some cases, the additional information area 320 is transmitted empty.
  • An unspecified area exists in the additional information area 320, and this area is an area vacated for a use that is not determined at the time the standard is defined. Therefore, it is an object of the present invention to transmit a deep learning network model using some of these areas.
  • the multimedia payload and the deep learning network model can be simultaneously included in one stream.
  • This additional information area 320 may include not only the deep learning network model defined in each multimedia standard, but also the deep learning network model standard defined previously.
  • the above-mentioned ONNX, NNEF, and NNR formats may be included in the additional information area 320 , and a flag indicating them may also be added.
  • The additional information area 320 may include a use case 321 indicating which deep learning model is used, a header 322 that distinguishes the start of each payload, and payloads 323, 324, and 325 containing the deep learning network model data.
  • A deep learning network model can also be added by assigning a new NAL unit type in the unspecified range among NAL unit types.
  • the NAL unit may include not only the deep learning network model format defined in each multimedia standard but also the previously defined deep learning network model standard.
  • a bitstream definition for a deep learning network model may include a flag indicating which frame, which scene, etc., the deep learning network is applied to.
  • Table 1 below shows an example of applying the above method to HEVC.
  • the HEVC standard defines a Supplemental Enhancement Information (SEI) area for transmitting additional information.
  • the present invention proposes a method of adding a deep learning network model to an unassigned value among payloadTypes of the SEI area.
  • deep learning model formats such as ONNX, NNEF, NNR, and DNM (Deep Learning Network Model) can each be assigned a payloadType value.
  • DNM stands for a newly defined deep learning network model.
  • Table 2 below shows the syntax structure when payloadType is ONNX.
  • ONNX(payloadSize) | explanation
    use_case | Super Resolution, Noise Reduction, etc.
  • use_case shows the application to which the deep learning network model is applied.
  • Applications such as super resolution to increase resolution or noise reduction to remove noise may be applicable.
  • frame_id, vps_id (video parameter set), sps_id (sequence parameter set), and pps_id (picture parameter set) indicate the frame, sequence, and picture to which the deep learning network model is applied.
  • frame_resolution, frame_rate, and color_space represent the image properties after the deep learning network model is applied, i.e., frame resolution, frame rate, and color space.
  • ONNX_payload refers to the data that the ONNX standard has, and includes all data from the ONNX standard.
  • use cases such as NNEF, NNR, and DNM may also have payloads defined in the same format.
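  • The sketch below collects these fields into one structure to make the payload layout concrete. The field names follow the text above; the numeric type codes and the Python representation are illustrative assumptions, not part of any standard:

```python
from dataclasses import dataclass

PAYLOAD_TYPES = {"ONNX": 0, "NNEF": 1, "NNR": 2, "DNM": 3}  # assumed codes

@dataclass
class DeepLearningSEI:
    payload_type: str        # which model format the payload carries
    use_case: str            # e.g., "super_resolution" or "noise_reduction"
    frame_id: int            # location fields: where the model applies
    vps_id: int
    sps_id: int
    pps_id: int
    frame_resolution: tuple  # properties after the model is applied
    frame_rate: int
    color_space: str
    model_payload: bytes     # the model data itself, e.g., a full ONNX file

sei = DeepLearningSEI(
    payload_type="ONNX", use_case="super_resolution",
    frame_id=0, vps_id=0, sps_id=0, pps_id=0,
    frame_resolution=(3840, 2160), frame_rate=30, color_space="YUV 4:2:0",
    model_payload=b"<onnx bytes>",
)
```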
  • FIG. 10 shows the structure of a multimedia bitstream to which the multimedia bitstream generating method according to a preferred embodiment of the present invention is applied.
  • The additional information data 335, whose payloadType is ONNX, shows that its use_case is SR (Super Resolution) and that it is applied to the immediately preceding INTRA frame 334.
  • the additional information data 338 whose payloadType is NNEF indicates that use_case is NR (Noise Reduction) and is applied to the two INTER frames 336 and 337 .
  • various deep learning network models can be applied even within one stream, and it can be set so that the deep learning model can be applied only to a specific frame.
  • FIG. 11 is a flowchart of a method for generating a multimedia bitstream including the additional information area as described above.
  • Payload types are assigned to deep learning network models such as ONNX, NNEF, NNR, and DNM. It is possible to determine which deep learning network model is included in the additional information area by the payload type value.
  • the use case field (use_case) indicates which application the deep learning network model is used for.
  • Super Resolution technology for upscaling SD (Standard Definition) images to HD (High Definition) images, or noise reduction (noise_reduction) for removing noise, are examples of use cases.
  • a location index field indicating a location to which the use case is to be applied is generated (S230). It can be applied to the entire video, or various application ranges such as frame units, scene units, and picture units can be configured.
  • an index of the parameter to be applied is generated (S240).
  • the parameters include a video parameter set (VPS), a sequence parameter set (SPS), and a picture parameter set (PPS).
  • a field indicating the properties of the image after the deep learning network is applied is generated (S250).
  • the frame resolution (frame_resolution), the frame rate (frame_rate), the color space (color_space), and the like may indicate the properties of the image.
  • the ONNX payload includes the entire deep learning network model included in the ONNX standard.
  • the ONNX standard includes all data such as the structure and weights of the deep learning model.
  • a new area for data that is not included in the ONNX model data but can be used later may be added.
  • the deep learning network model can be included in the SEI message, but can also be defined and configured as a separate NAL unit type.
  • Table 3 below shows an example of newly allocating the deep learning network model field to nal_unit_type in the HEVC standard.
  • nal_unit_type | Name of nal_unit_type | Content of NAL unit and RBSP syntax structure | NAL unit type class
    0, 1 | TRAIL_N, TRAIL_R | Coded slice segment of a non-TSA, non-STSA trailing picture, slice_segment_layer_rbsp( ) | VCL
    2, 3 | TSA_N, TSA_R | Coded slice segment of a TSA picture, slice_segment_layer_rbsp( ) | VCL
    ... | ... | ... | ...
    39, 40 | PREFIX_SEI_NUT, SUFFIX_SEI_NUT | Supplemental enhancement information, sei_rbsp( ) | non-VCL
    41..47 | RSV_NVCL41..RSV_NVCL47 | Reserved | non-VCL
    48 | ONNX | ONNX-format deep learning network model, onnx_rbsp( ) | non-VCL
    49 | NNEF | NNEF-format deep learning network model, nnef_rbsp( ) | non-VCL
    50 | NNR | NNR-format deep learning network model, nnr_rbsp( ) | non-VCL
  • nal_unit_type values are not defined from 48 to 63. Therefore, deep learning network models can be applied to this area.
  • 48 may indicate ONNX
  • 49 may indicate NNEF
  • 50 may indicate NNR
  • 51 may indicate DNM.
  • the configuration of each deep learning network model is as described above.
  • the configuration of onnx_rbsp() is similar to the configuration included in the SEI discussed above.
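  • The sketch below shows how a decoder could route such NAL units, assuming the HEVC NAL header layout (a 6-bit nal_unit_type after the forbidden bit) and the 48..51 assignments proposed above, which are this document's proposal rather than an adopted standard:

```python
MODEL_NAL_TYPES = {48: "ONNX", 49: "NNEF", 50: "NNR", 51: "DNM"}

def nal_unit_type(nal_bytes):
    """Extract the 6-bit nal_unit_type from the first HEVC NAL header byte."""
    return (nal_bytes[0] >> 1) & 0x3F

def route(nal_bytes):
    """Decide whether a NAL unit carries video or a deep learning model."""
    t = nal_unit_type(nal_bytes)
    if t in MODEL_NAL_TYPES:
        return f"deep learning network model ({MODEL_NAL_TYPES[t]})"
    return "regular video NAL unit"

print(route(bytes([48 << 1, 0x01])))  # -> deep learning network model (ONNX)
```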
  • Table 4 below shows the case in which the deep learning network model is included in the H.264 SEI message.
  • Table 5 below is an example of including the deep learning network model in a separate NAL unit type, rather than in the additional information area, in H.264.
  • nal_unit_type | Content of NAL unit and RBSP syntax structure | C
    0 | Unspecified |
    1 | Coded slice of a non-IDR picture, slice_layer_without_partitioning_rbsp( ) | 2, 3, 4
    2 | Coded slice data partition A, slice_data_partition_a_layer_rbsp( ) | 2
    ... | ... | ...
    6 | Supplemental enhancement information, sei_rbsp( ) | 5
    ... | ... | ...
    12 | Filler data, filler_data_rbsp( ) | 9
    13 .. | ... | ...
  • The bitstream generation method is not limited to HEVC or H.264 and may be applied in a similar form to VVC or any video stream standard defined later.
  • FIG. 12 is a flowchart of a multimedia bitstream decoding method according to another preferred embodiment of the present invention.
  • a method of decoding the deep learning network model included in the additional information area or a separate NAL unit type is as follows.
  • the deep learning network model type is classified by payloadType (S310).
  • the deep learning network model type includes ONNX, NNEF, NNR, and DNM, but is not limited to these models; various other types of deep learning models may be added.
  • use_case is used to classify which algorithm is applied (S320).
  • various algorithms may be included.
  • Next, the location to which the deep learning network model is applied is identified (S330). In order for the deep learning network model to be applied not only to the entire image but also in frame, picture, and scene units, the location ID indicates which deep learning model is applied at which position.
  • Next, the property field is parsed; this field may include image resolution (frame_resolution), frame rate, color space, and the like.
  • Finally, the deep learning network model data to be applied to the multimedia bitstream is distinguished by the payload field (S350), and deep learning processing can then be performed.
  • FIG. 13 is a structural diagram of an encoding apparatus according to a preferred embodiment of the present invention.
  • the encoding apparatus 400 includes an encoder 410 , a deep learning network model 420 , a memory 430 , and a bitstream generator 440 .
  • the encoding apparatus 400 When a video/audio signal is input, the encoding apparatus 400 encodes it into compressed data through the encoder 410 .
  • the bitstream generator 440 includes a deep learning network model to be applied to the multimedia signal together with the encoded multimedia signal, generates a bitstream, and stores it in the memory 430 .
  • the bitstream generation unit 440 generates a payloadType field that identifies the deep learning network model.
  • the payload type may be ONNX, NNEF, NNR, DNM, or the like.
  • Use cases may be applications such as Super Resolution, Noise Reduction, and the like.
  • an index field indicating the location to which the deep learning network model is applied is created, and a property field indicating the properties of the multimedia file after the deep learning network model is applied is created.
  • the index field indicating the position may be divided into units of frames, scenes, pictures, and videos.
  • the property field corresponds to properties of the multimedia file such as frame resolution, frame rate, and color space.
  • bitstream generation is completed by including the deep learning network model data according to the payload type in the payload.
  • the deep learning network model data may additionally contain undefined new data.
  • FIG. 14 is a structural diagram of a decoding apparatus according to a preferred embodiment of the present invention.
  • the decoding apparatus 500 includes a bitstream parser 510 , a memory 520 , a decoder 530 , and a deep learning processing unit 540 .
  • The decoding device 500 parses the bitstream, which contains the multimedia bitstream and the deep learning network model together, through the bitstream parser 510; it then decodes the multimedia file through the decoder 530 and reads the deep learning network model data. Using this model, deep learning processing is performed in the deep learning processing unit 540, and a video/audio file is finally generated.
  • the memory 520 stores the program code for parsing the bitstream or decoding the multimedia file, data for the operations such as deep learning model data and multimedia data, and intermediate and final results.
  • the bitstream parser 510 classifies the deep learning network model included in the bitstream in the following order.
  • the deep learning network model is classified by the payload type field.
  • the deep learning network model may be ONNX, NNEF, NNR, DNM, or the like.
  • a use case refers to an application to which a deep learning network model is applied.
  • the index on the location in the video/audio file to which the deep learning network model is applied is distinguished.
  • the location in the file may be divided into units such as frames, scenes, and pictures.
  • the attribute field is parsed. Properties include frame resolution, frame rate, and color space.
  • Since the deep learning network model can be transmitted together with the multimedia stream to which it is applied, it is easy to manage, and the correct deep learning network model can be applied accurately.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Discrete Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present invention relates to a deep learning accelerator. When encoding/decoding deep learning network model data, the deep learning accelerator according to the present invention encodes/decodes only as much as necessary, in a specific unit such as a layer, rather than the entire data, thereby saving the power consumed in storing the entire data to, or loading it from, an external memory. In addition, since a deep learning network model can be included in an additional information area of a multimedia bitstream and transmitted together with the multimedia bitstream, managing the deep learning model data is easy. Furthermore, since the deep learning network model can be applied in units of frames, scenes, and pictures, deep learning performance can be improved, and the deep learning network model can be applied precisely.
PCT/KR2020/011528 2020-08-27 2020-08-28 Deep learning bitstream encoder and decoder, and method therefor WO2022045401A1

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2020-0108298 2020-08-27
KR10-2020-0108297 2020-08-27
KR1020200108297A KR102517449B1 (ko) 2020-08-27 2020-08-27 Method for generating a multimedia bitstream including a deep learning network model, and apparatus therefor
KR1020200108298A KR102654690B1 (ko) 2020-08-27 2020-08-27 Deep learning acceleration apparatus and method therefor

Publications (1)

Publication Number Publication Date
WO2022045401A1 2022-03-03

Family

ID=80355242

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/011528 WO2022045401A1 2020-08-27 2020-08-28 Deep learning bitstream encoder and decoder, and method therefor

Country Status (1)

Country Link
WO (1) WO2022045401A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024003191A1 (fr) * 2022-06-29 2024-01-04 Fondation B-Com Procédé et dispositif de décodage, programme d'ordinateur et flux de données associés

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190062273A * 2017-11-28 2019-06-05 한국전자통신연구원 Image processing method and apparatus using an image transformation neural network and an image inverse transformation neural network
KR20200050433A * 2018-11-01 2020-05-11 한국전자통신연구원 Method and apparatus for performing processing on an image using a neural network
KR20200093404A * 2019-01-28 2020-08-05 포항공과대학교 산학협력단 Neural network accelerator and method of operating the same

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang, Shuang; Yue, Bo; Liang, Xuefeng; Jiao, Licheng: "How Does the Low-Rank Matrix Decomposition Help Internal and External Learnings for Super-Resolution", IEEE Transactions on Image Processing, vol. 27, no. 3, March 2018, pp. 1086-1099, ISSN 1057-7149, DOI: 10.1109/TIP.2017.2768185 *
Zhou, Mingyi; Liu, Yipeng; Long, Zhen; Chen, Longxi; Zhu, Ce: "Tensor rank learning in CP decomposition via convolutional neural network", Signal Processing: Image Communication, vol. 73, April 2019, pp. 12-21, ISSN 0923-5965, DOI: 10.1016/j.image.2018.03.017 *

Similar Documents

Publication Publication Date Title
WO2014163241A1 Method and apparatus for processing video
WO2011034372A2 Method and apparatus for encoding and decoding mode information
WO2020101321A1 Method for coding transform coefficients on the basis of high-frequency zeroing, and apparatus therefor
EP2524508A2 Method and apparatus for encoding and decoding an image by using a large transform unit
WO2013162249A1 Video encoding method, video decoding method, and apparatus implementing the same
WO2014042460A1 Method and apparatus for encoding/decoding images
WO2015012600A1 Method and apparatus for encoding/decoding an image
WO2016076677A1 Method and device for entropy encoding or entropy decoding of a video signal for high-capacity parallel processing
WO2014038905A2 Image decoding method and apparatus using the same
WO2017065592A1 Method and apparatus for encoding and decoding a video signal
WO2021225338A1 Image decoding method and apparatus therefor
WO2022045401A1 Deep learning bitstream encoder and decoder, and method therefor
WO2014051372A1 Image decoding method and apparatus using the same
WO2021137588A1 Image decoding method and apparatus for coding image information including a picture header
WO2021040488A1 Image or video coding based on escape binarization in palette mode
WO2020130514A1 Method and device for deciding the transform coefficient scan order on the basis of high-frequency zeroing
WO2020071672A1 Method for compressing motion vectors and apparatus therefor
WO2018026028A1 Method and device for encoding/decoding a residual signal using a sub-coefficient group
WO2021091260A1 Image or video coding based on chroma quantization parameter offset information
WO2023075564A1 Feature encoding/decoding method and apparatus, and recording medium storing a bitstream
WO2021235895A1 Image coding method and device therefor
WO2021091261A1 Image or video coding based on quantization-related information
WO2021096057A1 Image coding method on the basis of entry-point-related information in a video or image coding system
WO2016195378A1 Method, apparatus, and computer-readable recording medium for efficiently performing embedded compression of data
WO2021091263A1 Image or video coding based on signaling of quantization-parameter-related information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20951633

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20951633

Country of ref document: EP

Kind code of ref document: A1