WO2022075754A1

WO2022075754A1 - Method and apparatus for processing image for machine vision

Info

Publication number: WO2022075754A1
Application number: PCT/KR2021/013723
Authority: WO
Inventors: 김재곤; 윤용욱; 박도현; 천승문
Original assignee: 한국항공대학교산학협력단; (주)인시그널
Priority date: 2020-10-06
Filing date: 2021-10-06
Publication date: 2022-04-14
Also published as: KR20220045920A

Abstract

Disclosed are a method and apparatus for processing an image for machine vision. The method for processing an image for machine vision, according to an embodiment, comprises: an extraction step of extracting feature information from the image; a pre-processing step of reducing the size of the feature information by pre-processing the feature information extracted in the extraction step; and an encoding/decoding step including an encoding process of encoding the pre-processed feature information by applying an existing video codec technique, wherein, in the encoding/decoding step, pre-processing-related information, which is related to a pre-processing method used in the pre-processing step, is received and encoded together.

Description

Image processing method and apparatus for machine vision

The present invention relates to image processing technology, and more particularly to image processing technology for encoding and/or decoding feature information of an input image in order to efficiently process or transmit the input image and/or its feature information for performing a machine vision task.

With the development of deep learning and machine learning, various artificial intelligence platforms and applications are being developed. Accordingly, as the amount of data processed by machines or humans through deep learning or machine learning increases, the demand for technologies that encode and decode images and/or feature information for efficient information transmission and real-time signal processing is increasing.

Conventional image (especially video) compression technology using deep learning or machine learning extracts feature information (a feature map, feature vector, or latent vector) from an input image. The extracted feature information is then transmitted as is, or is encoded and transmitted. In this case, the size of the extracted feature information may be defined according to the designed deep learning or machine learning network. That is, the feature information may be defined by one or more parameters among various sizes, a plurality of channels, and various data types.

However, although the feature information defined in this way may be smaller than the input image, smaller feature information loses information relative to the original image. Therefore, in order to reduce the loss of information, feature information that is as large as possible is required. In this case, since the amount of data increases, transmission and real-time data processing become difficult. Accordingly, methods have been proposed that perform entropy encoding and decoding on larger feature information, thereby reducing the size of the data to be transmitted or processed while also reducing information loss.

As an example, Korean Patent Application Laid-Open No. 2018-0087264, “Image processing method and apparatus using feature map compression” (Patent Document 1), discloses a technique in which a feature map generated by performing a convolution operation on an input image is processed or compressed in units of at least one line, reducing the amount of computation by reducing the number of filter parameters. In addition, Korean Patent Application Laid-Open No. 2020-0026026, “An electronic device and control method for high-speed compression processing of a feature map of a CNN utilization system” (Patent Document 2), discloses a technique of acquiring a feature map for an input image, transforming the acquired feature map through a lookup table corresponding to the feature map, and then compressing the transformed feature map through a corresponding compression mode.

[Prior art literature]

[Patent Literature]

(Patent Document 1) Korean Patent Publication No. 2018-0087264

(Patent Document 2) Korean Patent Publication No. 2020-0026026

Since conventional image feature compression methods process the entire input image and encode and decode it, not only information necessary for a machine or a person (e.g., object information, object location, etc.) but also unnecessary information such as the background is transmitted or processed. As a result, the extracted feature information is large, and its compression is often inefficient.

The problem to be solved by the present invention is to provide a method and apparatus for processing an input image for machine vision that can efficiently encode and/or decode feature information of the input image for a machine vision network performing a specific task such as image restoration, object detection, object tracking, or object classification, and can thereby efficiently transmit or process the information necessary to perform the task.

An embodiment of the present invention for solving the above problems relates to a method of encoding/decoding feature information of an input image in order to process the feature information efficiently. First, the feature information extracted from the input image is converted into an image (imaged). An existing deep learning network or an existing feature extractor may be applied to extract the feature information, and there is no particular limitation on its type.

In addition, the imaged feature information is compressed (encoded) and/or restored (decoded) using a conventional video codec. According to an embodiment of the present invention, prior to encoding/decoding with the conventional video codec, a predetermined pre-processing process is first performed on the imaged feature information, so that the size of the data constituting the extracted feature information is reduced or compression efficiency is improved. The data size of the feature information to be transmitted or processed can thereby be effectively reduced.

To this end, embodiments of the present invention described later propose various preprocessing methods for compressing feature information so that the size (i.e., amount of information) of the feature information can be reduced. In addition, prediction, transformation, and/or entropy encoding/decoding methods for efficient encoding/decoding of the preprocessed feature information are also proposed.

In addition, an embodiment of the present invention proposes an efficient encoding and decoding method and apparatus, together with a system architecture (structure), for performing a machine vision task as well as for efficient compression of feature information.

An embodiment of the present invention for solving the above problems is an image processing method for machine vision, comprising: an extraction step of extracting feature information from the image; a pre-processing step of reducing the size of the feature information by pre-processing the feature information extracted in the extraction step; and an encoding/decoding step including an encoding process of encoding the pre-processed feature information by applying an existing video codec technique, wherein, in the encoding/decoding step, pre-processing-related information related to the pre-processing method used in the pre-processing step is received and encoded together.

According to one aspect of the embodiment, in the pre-processing step, the pre-processing may be performed by applying one or more methods among normalization, scaling, rearrangement, reduction in the number of expression bits, quantization, transformation, and filtering.

According to another aspect of the embodiment, the encoding/decoding step further includes a decoding process of decoding the bitstream generated as a result of the encoding process, and in the encoding/decoding step, encoding/decoding may be performed by one or more methods among prediction-based encoding/decoding and transformation-based encoding/decoding.

Another embodiment of the present invention for solving the above problem is an image processing apparatus including a processor for processing an image for machine vision, wherein the processor includes: a feature extraction unit for extracting feature information from the image; a preprocessing unit for reducing the size of the feature information by preprocessing the feature information extracted by the feature extraction unit; and an encoding/decoding unit for encoding the feature information preprocessed by the preprocessing unit by applying an existing video codec technique, and wherein the encoding/decoding unit receives, from the preprocessing unit, preprocessing-related information related to the preprocessing method used by the preprocessing unit and encodes it together.

According to an aspect of the embodiment, the preprocessor may perform the preprocessing by applying one or more methods of normalization, scaling, rearrangement, reduction in the number of expression bits, quantization, transformation, and filtering.

According to another aspect of the embodiment, the encoding/decoding unit also performs a decoding process of decoding the bitstream generated as a result of the encoding process, and the encoding/decoding unit may perform encoding/decoding by one or more methods among prediction-based encoding/decoding and transformation-based encoding/decoding.

According to an embodiment of the present invention, a predetermined pre-processing is performed on the imaged feature information extracted from the input image prior to encoding, and the metadata and pre-processing related information generated in the pre-processing process are encoded together with the pre-processed feature information, so that the feature information to be encoded can be efficiently reduced. In addition, the amount of data to be transmitted or processed can be reduced through efficient compression and/or restoration of the feature information, so that the machine vision task in the machine vision network can be performed efficiently.

FIG. 1 is a block diagram showing a schematic configuration of an image processing apparatus for machine vision according to an embodiment.

FIG. 2 is a block diagram schematically showing an example of a detailed configuration of the processor of FIG. 1.

FIG. 3 is a diagram schematically showing an example of a method for performing feature extraction based on deep learning in a network for performing a specific task.

FIG. 4 is a diagram showing an example of a case in which normalized feature values are divided into four sections.

FIG. 5 is a diagram illustrating an example in which a preprocessed feature map is expressed as an n-bit symbol map.

FIG. 6 is a diagram showing an example of a binary map formed according to the order of bins.

FIG. 7 is a configuration diagram schematically illustrating an example of a system configuration when an image processing apparatus according to an embodiment of the present invention is implemented using a network model that performs machine vision.

FIG. 8 is a configuration diagram schematically illustrating an example of a system configuration when an image processing apparatus according to an embodiment of the present invention is implemented using a separate network model distinguished from a machine vision network.

FIG. 9 is a configuration diagram schematically illustrating another example of a system configuration when an image processing apparatus according to an embodiment of the present invention is implemented using a separate network model distinguished from a machine vision network.

FIG. 10 is a diagram illustrating the configuration of a system corresponding to the system of FIG. 7 from a different point of view, in a case in which an image restoration network is not provided.

FIG. 11 is a diagram illustrating the configuration of a system corresponding to the system of FIG. 7 from a different point of view, in a case in which an image restoration network is provided.

FIG. 12 is a diagram illustrating the configuration of a system corresponding to the system of FIG. 8 from a different point of view.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. Terms and words used in this specification are terms selected in consideration of functions in the embodiments, and the meaning of the terms may vary according to the intention or custom of the invention. Therefore, terms used in the embodiments described below follow their definitions when specifically defined in this specification, and otherwise should be interpreted as having the meaning commonly recognized by those skilled in the art.

In addition, a module in the present specification may mean hardware capable of performing functions and operations according to each name described in this specification, or may mean computer program code capable of performing specific functions and operations. Or, it may refer to an electronic recording medium on which a computer program code capable of performing a specific function and operation is loaded, for example, a processor or a microprocessor. That is, a module may mean a functional and/or structural combination of hardware for carrying out the technical idea of the present invention and/or software for driving the hardware.

Referring to FIG. 1, the image processing apparatus 10 encodes and decodes feature information of an input image, and can perform a process of encoding and then decoding feature information extracted from a received image in order to perform a specific task in a machine vision network (machine vision system). For example, the image processing apparatus 10 may encode the feature information extracted from the received image by a predetermined method to generate a bitstream, and restore the encoded bitstream, thereby outputting image data that the machine vision network can use to effectively perform a predetermined task such as object tracking, object recognition, or image restoration.

To this end, the image processing apparatus 10 includes at least a receiver 12 , a memory 14 , and a processor 16 . According to an embodiment, the image processing apparatus 10 may further include a transmitter 18 .

The receiver 12 may receive an image, that is, a moving image or a still image. The receiver 12 receives an original image that can be utilized in a machine vision network. The image is not limited to one directly captured by an imaging device, and may be an image captured in advance and transmitted through another communication network or stored in another storage device. The image received by the receiver 12 may be input to the processor 16 directly, or may be stored in the memory 14 and then input to the processor 16.

The memory 14 stores some or all of the image data received by the receiver 12. According to an embodiment, all or part of the image data received by the receiver 12 may be directly input to the processor 16 without being stored in the memory 14. In addition, the memory 14 stores image data that has been processed or is being processed by the processor 16. For example, the memory 14 may store metadata and/or preprocessing related information generated during preprocessing of the feature map in the processor 16 to be described later, preprocessing result data, and the like. In addition, the memory 14 may store encoded feature map data (a bitstream) generated during the encoding/decoding process of the processor 16 to be described later, a feature map restored from the bitstream by the processor 16, data related to these feature maps, and image data restored from the feature map (e.g., data for a machine vision network).

The processor 16 processes the image data received by the receiver 12 and/or the image data stored in the memory 14 to generate image data for the machine vision network. The kind or type of image data generated by the processor 16 may vary depending on the function of the machine vision network. For example, it may be image data restored through compression and restoration of feature information, and/or a bitstream generated by compressing feature information. In the former case, the machine vision network performs a specific task using the restored image. In the latter case, the machine vision network performs a specific task by restoring the feature information from the received bitstream and then generating the restored image therefrom.

More specifically, the processor 16 extracts feature information from the input image supplied by the receiver 12 or the memory 14, and then encodes (compresses) the extracted feature information by applying a conventional video codec compression technique to generate a bitstream. At this time, the processor 16 converts the extracted feature information into an image and compresses it. In addition, the processor 16 may perform a predetermined pre-processing on the imaged feature information prior to applying the conventional video codec compression technique, thereby improving compression efficiency in the subsequent compression process.

In addition, the processor 16 may restore the feature map by decoding the generated bitstream. In the decoding process, a decoding technique corresponding to the coding technique of the existing video codec used in the above-described compression process may be applied. According to an embodiment, the processor 16 may restore a predetermined image from the restored feature information after restoring the feature information. And, if necessary, the processor 16 performs predetermined post processing on the restored feature information or image.

According to the present embodiment, in the process of compressing and restoring the feature information in the processor 16, metadata used or generated in the pre-processing process and/or pre-processing related information (hereinafter referred to as 'pre-processing metadata') are utilized. More specifically, in the preprocessing of the feature information, the processor 16 may generate the preprocessing technique itself or various information related thereto (hereinafter referred to as 'preprocessing related information') as preprocessing metadata. In addition, the generated pre-processing metadata may be utilized in a compression and/or decoding process of the pre-processed feature information in the processor 16.

The transmitter 18 transmits the feature information compressed and restored by the processor 16 or image data restored therefrom (ie, feature information or image data that has been restored or post-processed) to the machine vision network. According to an embodiment, the machine vision network may be implemented as a system separate from the image processing apparatus 10 , or may be implemented integrally therewith. In the latter case, there is no need for the transmitter 18 in the image processing apparatus 10 , and the output of the processor 16 may be directly input to the machine vision network. The machine vision network that has received the image data from the transmitter 18 uses the received image data to perform predetermined tasks, for example, object tracking, object recognition, object segmentation, image restoration, and the like.

FIG. 2 is a block diagram schematically showing an example of a detailed configuration of the processor 16 of FIG. 1 . Referring to FIG. 2 , the processor 16 includes a feature extraction unit 102 , a preprocessing unit 104 , and an encoding/decoding unit 106 . Depending on the embodiment, the processor 16 may further include a post-processing unit 108 .

The feature extraction unit 102 extracts feature information or a feature map from the input image. As described above, the feature information may be referred to as a feature map or sparse map, a feature vector, a latent vector, or the like. Such feature information may be in all possible forms extracted from the feature extractor, and may be defined according to at least one of a user/device/network type, environment, and definition.

When the feature extraction unit 102 performs feature extraction from the input image, various well-known image processing-based feature extraction techniques may be applied. Alternatively, the feature extraction unit 102 may extract features by using one or more of deep learning or machine learning-based feature extraction techniques.

As such, the subject performing feature extraction may vary according to the embodiment of the feature extraction unit 102. For example, the feature information may be extracted by at least one of an application, a device, a deep learning network itself that performs at least one specific task such as image restoration, object recognition, object detection, and object tracking, or a feature extraction unit inside such a deep learning network. In this case, feature extraction may use at least one of an image processing technique for the purpose of feature extraction, a deep learning network, a partial network of a network for performing a specific task, and an intermediate stage of a network for performing a specific task.

FIG. 3 is a diagram schematically showing an example of a method for performing feature extraction based on deep learning in a network for performing a specific task. Referring to FIG. 3, feature information (a feature map) is extracted from an input image by a backbone network or a feature extractor. The extracted feature map may be expressed as an image 202 of a predetermined shape. In addition, the extracted feature map can be used to perform a predetermined task (image restoration, object recognition, object tracking, object classification, etc.) in a machine vision network, indicated as an evaluation network or a verification network.

There is no particular limitation on a method of expressing the feature information extracted by the feature extraction unit 102, and the extracted feature information may be expressed in various ways. For example, the expression of the extracted feature information may vary according to at least one of the data type and size, the network type and size, and the network layer type and size.

And, according to an aspect of the present embodiment, the extracted feature map may be formed of data having at least one characteristic among high correlation data such as a general image, sparse data, and dense data. In this case, the post-processing method may be applied in consideration of the characteristics of the extracted feature map data.

The pre-processing unit 104 performs a predetermined pre-processing on the feature information extracted by the feature extraction unit 102. The preprocessor 104 converts the extracted feature information into a form suitable for compression and restoration of the feature information in the encoder/decoder 106, and the output of the preprocessor 104 may vary according to at least one of the conversion method, process, form, and size of the feature information. For example, the output of the preprocessor 104 may be a result of an integer number of dimensions, such as 1D, 2D, or 3D, and its size may be maintained, increased, or reduced compared to the input.

The preprocessor 104 may convert the extracted feature information into another form by applying various methods. For example, the preprocessor 104 can transform the feature information by applying one or more methods among normalization, scaling, reordering, expression bit reduction, quantization, and filtering.

A specific method of performing preprocessing transformation of feature information using normalization is not particularly limited. For example, one or more methods among methods such as maximum/minimum value normalization and normal distribution normalization may be used. In this case, one or more values or pieces of information among the normalization parameters, for example, the minimum value, the maximum value, the feature value, the standard deviation, and the interval information, may be transmitted to the encoding/decoding unit 106 in the form of syntax, semantics, metadata, or the like, and used for encoding/decoding.

In the case of maximum/minimum value normalization, the feature information (feature value) may be normalized by using the maximum value and the minimum value among the feature information. More specifically, normalization may be performed using a calculation formula (feature value - minimum value)/(maximum value - minimum value), and a normalization value according to the calculation result may be between 0 and 1.
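The calculation formula above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `min_max_normalize` and the handling of a constant input (mapped to all zeros) are hypothetical choices made for the example. The minimum and maximum are returned alongside the result because the text describes them as normalization parameters that accompany the pre-processed data.

```python
def min_max_normalize(values):
    """Map each feature value to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature map is mapped to all zeros.
        return [0.0 for _ in values], lo, hi
    return [(x - lo) / (hi - lo) for x in values], lo, hi


features = [-2.0, 0.0, 1.0, 6.0]
normalized, min_value, max_value = min_max_normalize(features)
print(normalized)            # [0.0, 0.25, 0.375, 1.0]
print(min_value, max_value)  # -2.0 6.0
```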

In this case, the application range of the normalization may vary according to the dimension, characteristic, size, number, and type of the feature information, the data type of the feature values, and the like. Here, the dimension of the feature information may be n and the number of channels may be C, where n and C may each be a positive integer such as 1, 2, or 3.

For example, if the feature information has a three-dimensional shape, that is, W×H×C (horizontal×vertical×channel), normalization may be performed using the maximum/minimum values over all channels. As another example, for the same three-dimensional shape W×H×C, normalization may be performed using the maximum/minimum values of each channel. As another example, if the feature information is n-dimensional and comprises a plurality of features of different sizes, such as F1 = W1×H1×C1, F2 = W2×H2×C2, ..., normalization may be performed using the maximum/minimum values of each feature, such as normalizing F1 using the maximum/minimum values of F1 and normalizing F2 using the maximum/minimum values of F2. As yet another example, when the feature information is n-dimensional and comprises a plurality of features such as F1 = W1×H1×C1, F2 = W2×H2×C2, ..., normalization may be performed using the maximum/minimum values over all of the features.
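The difference between the normalization scopes described above can be illustrated with a short Python sketch. It assumes, purely for illustration, that a feature map is held as a list of channels, each channel being a flat list of values; the function names are hypothetical. The same pattern would apply to a set of features F1, F2, ... by treating each feature as one group.

```python
def normalize(values, lo, hi):
    """Apply (x - lo) / (hi - lo); a zero range maps to 0.0 (assumption)."""
    span = hi - lo
    return [(x - lo) / span if span else 0.0 for x in values]


def normalize_global(channels):
    """Use one minimum/maximum computed over all channels."""
    flat = [x for channel in channels for x in channel]
    lo, hi = min(flat), max(flat)
    return [normalize(channel, lo, hi) for channel in channels]


def normalize_per_channel(channels):
    """Use a separate minimum/maximum for each channel."""
    return [normalize(channel, min(channel), max(channel)) for channel in channels]


feature_map = [[0.0, 2.0, 4.0], [10.0, 15.0, 20.0]]  # two channels
global_result = normalize_global(feature_map)
channel_result = normalize_per_channel(feature_map)
print(global_result)   # [[0.0, 0.1, 0.2], [0.5, 0.75, 1.0]]
print(channel_result)  # [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]]
```

With a global range only one minimum/maximum pair has to be signalled, whereas the per-channel variant needs one pair per channel but uses the full [0, 1] range in every channel.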

In the case of normal distribution normalization, normalization may be performed using the mean and standard deviation of the feature information, by the formula (feature value - mean)/(standard deviation). In this case, the feature values are normalized so that they have a mean of 0 and a standard deviation of 1. The normalized feature value z takes negative and positive values, and the occurrence probability can be calculated according to the range of z. For example, for normally distributed values, the probability that z falls between -1 and +1 is about 68%, and the probability that z falls between -2 and +2 is about 95%. The occurrence probability of each section may be used as a probabilistic model in the entropy encoding/decoding process in the encoding/decoding unit 106.
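As a rough sketch of the normal distribution normalization above, the following Python code computes z values and the occurrence probability of a range of z under a standard normal model. The function names are hypothetical, the population standard deviation is used as an assumption, and the probabilities hold only to the extent that the feature values actually follow a normal distribution.

```python
import math
import statistics


def z_score_normalize(values):
    """Return (z values, mean, std) using (x - mean) / std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(x - mean) / std for x in values], mean, std


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interval_probability(lower, upper):
    """Probability that a standard normal value lies in [lower, upper]."""
    return normal_cdf(upper) - normal_cdf(lower)


z_values, mean, std = z_score_normalize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mean, std)                                   # 5.0 2.0
print(round(interval_probability(-1.0, 1.0), 4))   # 0.6827
print(round(interval_probability(-2.0, 2.0), 4))   # 0.9545
```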

According to an aspect of the present embodiment, normalized feature values obtained through maximum/minimum value normalization or normal distribution normalization may be divided into m intervals (m is an integer greater than or equal to 2).

For example, when m is 3, normalized feature values outside a specific range may be regarded as singular values, and the sections may be divided accordingly. More specifically, in the case of maximum/minimum value normalization, normalized feature values of 0.2 or less or 0.8 or more are regarded as singular values, and the normalized feature values may be divided into three sections: 0.2 or less (first section), 0.2 to 0.8 (second section), and 0.8 or more (third section). In the case of normal distribution normalization, as shown in FIG. 4, normalized feature values of -1 or less or +1 or more are regarded as singular values, and the normalized feature values may be divided into three sections: -1 or less (first section), -1 to +1 (second section), and +1 or more (third section).

As another example, the sections may be divided based on arbitrary criteria. More specifically, when m is 4, in the case of maximum/minimum value normalization, the sections may be divided at 0.2, 0.5, and 0.8. In this case, a normalized feature value may be classified into section 1 when it is 0.2 or less, section 2 when it is 0.2 or more and 0.5 or less, section 3 when it is 0.5 or more and 0.8 or less, and section 4 when it is 0.8 or more. In the case of normal distribution normalization, a normalized feature value may be classified into section 1 when it is -1 or less, section 2 when it is -1 or more and 0 or less, section 3 when it is 0 or more and 1 or less, and section 4 when it is 1 or more.
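The section classification above can be sketched with a threshold lookup. The text leaves open which section a value exactly on a boundary belongs to; this sketch assumes it belongs to the lower section. The function name `classify` is hypothetical.

```python
from bisect import bisect_left


def classify(values, thresholds):
    """Return the 1-based section number of each value.

    Assumption: a value equal to a threshold falls into the lower section.
    """
    return [bisect_left(thresholds, x) + 1 for x in values]


# Maximum/minimum value normalization, m = 4, boundaries 0.2, 0.5, 0.8
min_max_sections = classify([0.05, 0.2, 0.35, 0.5, 0.79, 0.95], [0.2, 0.5, 0.8])
print(min_max_sections)  # [1, 1, 2, 2, 3, 4]

# Normal distribution normalization, m = 4, boundaries -1, 0, +1
z_sections = classify([-1.7, -0.3, 0.0, 0.4, 2.1], [-1.0, 0.0, 1.0])
print(z_sections)        # [1, 2, 2, 3, 4]

# m = 3 with singular values outside 0.2 .. 0.8
three_sections = classify([0.1, 0.5, 0.9], [0.2, 0.8])
print(three_sections)    # [1, 2, 3]
```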

In this case, each designated section may be expressed with n bits (n is a positive integer) based on at least one of the number of designated sections and the probability of each section. In addition, the expressed section information may be encoded using one or more methods among binarization and entropy encoding. For example, when the values are divided into four sections, the section information can be expressed with 2 bits: section 1 can be binarized as 00, section 2 as 01, section 3 as 10, and section 4 as 11. This binarized representation (the corresponding bins) can then be entropy-encoded. Alternatively, when the values are divided into four sections, an appropriate bin string may be assigned to each section based on its probability through Huffman coding. For example, the section with the highest probability can be expressed as 0, the section with the next highest probability as 10, the section with the next highest probability as 110, and finally the section with the lowest probability as 111.
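The two binarizations above can be compared in a small sketch. The section probabilities are made-up values chosen for illustration, and the helper names are hypothetical. Code lengths are obtained with the standard Huffman merge procedure, and the bit patterns are then assigned canonically (shorter codes first), which reproduces the 0, 10, 110, 111 pattern described in the text for this particular probability set.

```python
import heapq


def huffman_code_lengths(probabilities):
    """Return {symbol: code length} of a Huffman code for the given probabilities."""
    # Heap entries: (probability, unique tie-breaker, symbols in this subtree)
    heap = [(p, i, [s]) for i, (s, p) in enumerate(sorted(probabilities.items()))]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probabilities}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1  # every merge adds one bit to all merged symbols
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths


def canonical_codes(lengths):
    """Assign prefix-free bit strings: shorter codes first, then by symbol."""
    codes, code, prev_len = {}, 0, 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


# Hypothetical occurrence probabilities of sections 1..4
section_probs = {1: 0.15, 2: 0.5, 3: 0.3, 4: 0.05}

fixed_codes = {s: format(s - 1, "02b") for s in section_probs}
huffman_codes = canonical_codes(huffman_code_lengths(section_probs))
print(fixed_codes)    # {1: '00', 2: '01', 3: '10', 4: '11'}
print(huffman_codes)  # {2: '0', 3: '10', 1: '110', 4: '111'}
```

For these probabilities the expected length of the Huffman code is 0.5×1 + 0.3×2 + 0.15×3 + 0.05×3 = 1.7 bits per section, compared with 2 bits for the fixed-length binarization.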

When the sections are divided in this way, an n-bit symbol map can be formed by designating a representative value for each section. In this case, the representative value may be designated as at least one of a positive integer, the average value of each section, and the maximum value or minimum value of each section. FIG. 5 is a diagram showing an example in which a preprocessed feature map is expressed as an n-bit symbol map, in which an 8-bit feature value is expressed by three symbols (three pixels).

On the other hand, when 8-bit feature values are divided into four sections through normalization and the representative values 0, 1, 2, and 3 are designated for the sections, a 2-bit symbol map can be formed by designating section 1 = 0, section 2 = 1, section 3 = 2, and section 4 = 3. Alternatively, when 8-bit feature values are divided into four sections through normalization and the average of each section is designated as its representative value, a 2-bit symbol map can be formed by designating section 1 = average of section 1, section 2 = average of section 2, section 3 = average of section 3, and section 4 = average of section 4.
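A minimal sketch of forming a 2-bit symbol map and its representative values is given below. It assumes, for illustration only, that the 8-bit range is split into four equal-width sections; the section boundaries, function names, and sample values are hypothetical.

```python
def to_symbol_map(values, num_sections=4, bit_depth=8):
    """Map each value to a section index 0..num_sections-1 (equal-width sections)."""
    width = (1 << bit_depth) // num_sections
    return [min(v // width, num_sections - 1) for v in values]


def section_means(values, symbols, num_sections=4):
    """Average of the original values in each section (None for an empty section)."""
    means = []
    for s in range(num_sections):
        members = [v for v, sym in zip(values, symbols) if sym == s]
        means.append(sum(members) / len(members) if members else None)
    return means


feature_values = [10, 30, 70, 90, 130, 150, 200, 240]  # 8-bit feature values
symbols = to_symbol_map(feature_values)
means = section_means(feature_values, symbols)
restored = [means[s] for s in symbols]

print(symbols)   # [0, 0, 1, 1, 2, 2, 3, 3]
print(means)     # [20.0, 80.0, 140.0, 220.0]
print(restored)  # [20.0, 20.0, 80.0, 80.0, 140.0, 140.0, 220.0, 220.0]
```

In this sketch the symbol map itself carries only 2 bits per element, while the per-section averages would have to be conveyed separately, for example as part of the preprocessing metadata.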

When the normalized feature value is divided into a plurality of sections and each section is expressed in binarized form, one or more binary maps may be formed according to the order of the bins. FIG. 6 shows an example of such a binary map. Referring to FIG. 6, the feature values are divided into 4 sections, that is, section 1 of 0-5, section 2 of 5-10, section 3 of 10-15, and section 4 of 15-20, and each section is assigned 2 bits, for example, 10 for section 1, 00 for section 2, 01 for section 3, and 11 for section 4, to form a 2-bit symbol map. One binary map can then be constructed from the first bin of each symbol, and another binary map can be constructed from the second bin of each symbol.
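
A minimal Python sketch of splitting a symbol map into per-bin binary maps is given below. It assumes the natural binary assignment 00, 01, 10, 11 for symbols 0 to 3 rather than any particular assignment from FIG. 6, and the function names are hypothetical.

```python
def binary_maps(symbols, n_bits=2):
    """Split an n-bit symbol map into n binary maps, first (most significant) bin first."""
    maps = []
    for b in range(n_bits):
        shift = n_bits - 1 - b
        maps.append([(s >> shift) & 1 for s in symbols])
    return maps

def symbols_from_maps(maps):
    """Inverse operation: recombine the binary maps into the symbol map."""
    symbols = [0] * len(maps[0])
    for bin_map in maps:
        symbols = [(s << 1) | bit for s, bit in zip(symbols, bin_map)]
    return symbols

symbols = [0, 1, 2, 3, 3, 0]                  # hypothetical 2-bit symbol map
first_bin, second_bin = binary_maps(symbols)
restored = symbols_from_maps([first_bin, second_bin])
```

Each binary map has the same spatial layout as the symbol map, so each can be handled as a separate 1-bit plane.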

Continuing to refer to FIG. 2, the preprocessing unit 104 may perform preprocessing using scaling or quantization. In this case, m-bit scaling or quantization may be performed. One or more values or pieces of information among scaling parameters or quantization parameters, interval information, singular value information, and range information may be transmitted to the encoding/decoding unit 106 in the form of syntax, semantics, metadata, etc.

When the preprocessor 104 performs m-bit scaling or quantization, the encoder/decoder 106 may perform scaling or quantization with bits having a size suitable for encoding or decoding. For example, when encoding or decoding is performed by the encoder/decoder 106 using a conventional video codec, scaling or quantization may be performed with an 8-10 bit image suitable as an input of the conventional video codec. If the feature value has a 32-bit floating-point data type, scaling or quantization can be performed in an 8-bit or 10-bit unsigned integer type. In this case, in order to convert to an unsigned integer type, the normalization method described above may be performed.
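
As one possible reading of this step, the Python sketch below maps floating-point feature values to m-bit unsigned integers by min-max normalization and returns the range needed to invert the mapping. The function names are hypothetical, and min-max normalization is only one of the normalization options, chosen here for brevity.

```python
def quantize_to_uint(features, bits=8):
    """Min-max normalize floats and quantize them to unsigned integers of `bits` bits."""
    lo, hi = min(features), max(features)
    levels = (1 << bits) - 1
    if hi == lo:                       # constant input: every value maps to 0
        return [0] * len(features), lo, hi
    scale = levels / (hi - lo)
    return [int(round((x - lo) * scale)) for x in features], lo, hi

def dequantize(quantized, lo, hi, bits=8):
    """Approximate inverse, using the range (lo, hi) sent as preprocessing metadata."""
    levels = (1 << bits) - 1
    if hi == lo:
        return [lo] * len(quantized)
    step = (hi - lo) / levels
    return [lo + q * step for q in quantized]

features = [0.0, 0.5, 2.0]                        # hypothetical float feature values
q8, lo, hi = quantize_to_uint(features, bits=8)   # suitable for an 8-bit codec input
q10, _, _ = quantize_to_uint(features, bits=10)   # suitable for a 10-bit codec input
restored = dequantize(q8, lo, hi, bits=8)
```

The pair (lo, hi) plays the role of the range information that would accompany the bitstream; without it the decoder cannot undo the scaling.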

According to the present embodiment, there is no particular limitation on a method of quantizing and preprocessing the feature map. For example, quantization may be performed using one or more quantization methods among vector quantization, uniform quantization, and non-uniform quantization.

For example, elements (feature values) that are not singular values may be quantized using relatively large quantization parameters, and elements (feature values) corresponding to singular values may be quantized using relatively small quantization parameters. Conversely, elements (feature values) that are not singular values may be quantized using relatively small quantization parameters, and elements (feature values) corresponding to singular values may be quantized using relatively large quantization parameters.

Alternatively, quantization may be performed at regular intervals while varying the quantization parameter according to the range of the feature value. Alternatively, the quantization parameter may be varied according to the range of the feature value, with quantization performed at irregular intervals. In the latter case, the quantization sections can be determined by finding the sections that minimize the quantization error of the quantization target. The related information may be transmitted to the encoding/decoding unit 106.
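
One well-known way to obtain irregular quantization sections that reduce the quantization error is a Lloyd-Max style iteration. The text does not name a specific algorithm, so the Python sketch below is only an assumed illustration, with hypothetical function names and data.

```python
def lloyd_max(values, num_levels, iterations=20):
    """Iteratively refine representative levels to reduce mean squared error."""
    lo, hi = min(values), max(values)
    # Start from uniformly spaced representative levels.
    levels = [lo + (hi - lo) * (i + 0.5) / num_levels for i in range(num_levels)]
    for _ in range(iterations):
        # Section boundaries lie halfway between neighboring levels.
        bounds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        buckets = [[] for _ in levels]
        for v in values:
            buckets[sum(v > b for b in bounds)].append(v)
        # Each level moves to the mean of its section (unchanged if the section is empty).
        levels = [sum(b) / len(b) if b else l for b, l in zip(buckets, levels)]
    bounds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
    return levels, bounds

def quantize(values, bounds):
    """Section index of each value given the section boundaries."""
    return [sum(v > b for b in bounds) for v in values]

values = [0.0, 0.1, 0.2, 9.8, 9.9, 10.0]   # hypothetical clustered feature values
levels, bounds = lloyd_max(values, num_levels=2)
indices = quantize(values, bounds)
```

Here the boundaries and representative levels would be the kind of interval information that has to reach the decoder.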

Depending on the embodiment, the pre-processing unit 104 may perform pre-processing by rearranging the feature information (feature map). In this case, at least one method among average, variance, histogram, and direction may be used. Information related to the rearrangement that is required for decoding, for example, one or more values or pieces of information among rearrangement information, rearrangement index, rearrangement order, average, variance, histogram, and direction, may be transmitted to the encoding/decoding unit 106 in the form of syntax, semantics, metadata, etc.

According to an aspect, in performing the rearrangement using the mean/variance, the rearrangement may be performed in one or more units of channels, regions, and lines of feature information. In this case, one or more values of mean and variance may be used.

For example, when a feature consists of multiple channels, one or more of the average and variance of each channel may be calculated, and channels of the feature may be rearranged in ascending or descending order. Alternatively, when a feature consists of multiple channels, one or more of the average and variance of each channel may be calculated and rearranged in an ascending or descending order in a 2D form.
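
The channel rearrangement by average can be sketched in Python as follows. The function names and sample channels are hypothetical; each channel is flattened to a list for simplicity, and the returned `order` stands in for the rearrangement index that would be sent as metadata.

```python
def rearrange_channels(channels, descending=False):
    """Sort channels by their mean value and return them with the rearrangement index."""
    means = [sum(ch) / len(ch) for ch in channels]
    order = sorted(range(len(channels)), key=lambda i: means[i], reverse=descending)
    return [channels[i] for i in order], order

def restore_channels(rearranged, order):
    """Undo the rearrangement on the decoder side using the transmitted index."""
    restored = [None] * len(order)
    for position, source in enumerate(order):
        restored[source] = rearranged[position]
    return restored

channels = [[5, 7], [1, 1], [3, 3]]          # hypothetical 3-channel feature
rearranged, order = rearrange_channels(channels)
restored = restore_channels(rearranged, order)
```

Sorting by variance, or by both statistics, would only change the sort key; the need to convey `order` to the decoder stays the same.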

As another example, when features appear in various regions, the channels of the features may be rearranged in ascending or descending order by calculating one or more of the average and variance of each region. Alternatively, when a feature has a plurality of lines, one or more of the average and variance of each line may be calculated, and channels of the feature may be rearranged in ascending or descending order.

As another example, one or more of the mean and variance may be calculated, correlations between channels may be derived based on the calculations, and the channels may be rearranged in order of decreasing correlation. In this case, the information used for rearrangement may be used for temporal encoding/decoding in video encoding; as an example, the unit of a group of pictures (GOP) used for temporal encoding/decoding may be adjusted according to the characteristics. That is, for efficient video encoding, sub-coding may be performed between feature channels having high correlation, and the decoder decodes the information transmitted from the encoder.

According to another aspect, in performing the rearrangement using the histogram, the rearrangement may be performed in one or more units of channels, regions, and lines of feature information. For example, a histogram of feature values may be formed, and rearrangement may be performed using a correlation between the histogram and a representative value such as the average or variance of each channel. Alternatively, a histogram of feature values may be formed, and rearrangement may be performed by distinguishing between frequently occurring feature values and infrequently occurring feature values.

According to another aspect, in performing the rearrangement using the directionality, the rearrangement may be performed in one or more units of channels, regions, and lines of feature information. For example, the directionality may be defined using the gradient of each channel and rearrangement of similar directions may be performed.

When rearrangement is performed by any one of the methods described above, scaling or quantization may be performed on the rearranged feature information according to the characteristics of the applied rearrangement method.

According to an embodiment, the preprocessor 104 may perform the preprocessing using a transformation technique. For example, as the transformation technique, one or more of Principal Component Analysis (PCA), Karhunen-Loeve Transform (KLT), Discrete Cosine Transform (DCT), Discrete Sine Transform (DST), Fast Fourier Transform (FFT), and Fourier Transform (FT) may be applied.

Such a transformation technique may be applied after first applying the above-described pre-processing methods including normalization. For example, after performing preprocessing using normalization and/or quantization, the size or type of input data may be reduced by applying a PCA transformation technique.

As described above, variables and/or information related to the pre-processing in the pre-processing unit 104 (i.e., pre-processing-related data) may be generated as a result of the pre-processing in the form of syntax, semantics, metadata, etc. for decoding. In addition, such pre-processing-related data may also be encoded by one or more methods of binarization and entropy encoding (depending on the embodiment, the encoding of such pre-processing-related data may be performed by the preprocessor 104 or by the encoding/decoding unit 106).

In performing binarization, one or more of the following binarization techniques may be used. The binarization techniques that can be used include a truncated Rice binarization method, a k-th order exponential-Golomb binarization method, a limited k-th order exponential-Golomb binarization method, a fixed-length binarization method, a unary binarization method, a truncated unary binarization method, a truncated binary binarization method, etc., but are not limited thereto.
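
To make a few of these binarization methods concrete, the Python sketch below implements unary, truncated unary, fixed-length, and k-th order exponential-Golomb binarization for non-negative integers. The bit conventions (ones terminated by a zero for unary, and a zero prefix for exponential-Golomb) are assumptions; actual codec specifications may use different polarity or prefix conventions.

```python
def unary(n):
    """n ones followed by a terminating zero."""
    return "1" * n + "0"

def truncated_unary(n, c_max):
    """Unary code whose terminating zero is dropped when n equals c_max."""
    if n > c_max:
        raise ValueError("n must not exceed c_max")
    return "1" * n + ("0" if n < c_max else "")

def fixed_length(n, num_bits):
    """Plain binary representation padded to num_bits."""
    return format(n, "0{}b".format(num_bits))

def exp_golomb(n, k=0):
    """k-th order exponential-Golomb code: zero prefix, then n + 2**k in binary."""
    bits = format(n + (1 << k), "b")
    return "0" * (len(bits) - k - 1) + bits

examples = {
    "unary(3)": unary(3),
    "truncated_unary(3, 3)": truncated_unary(3, 3),
    "fixed_length(5, 4)": fixed_length(5, 4),
    "exp_golomb(3)": exp_golomb(3),
    "exp_golomb(2, 1)": exp_golomb(2, 1),
}
```

Short codes go to small values in the unary and exponential-Golomb families, which suits metadata such as indices or parameters that are usually small.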

Entropy encoding may be performed not only on the aforementioned preprocessing-related data, but also on the information generated through binarization. For such entropy encoding, one or more of run-length encoding, Huffman encoding, arithmetic coding, context-adaptive binary arithmetic coding (CABAC), context-adaptive variable length coding (CAVLC), and bypass coding may be applied, and this is merely exemplary.
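
Of the listed methods, run-length encoding is the simplest to show. The Python sketch below encodes a bin sequence as (value, run length) pairs and decodes it again; the function names and the sample sequence are hypothetical.

```python
def run_length_encode(bins):
    """Collapse consecutive equal bins into (value, run length) pairs."""
    runs = []
    for b in bins:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return [(value, count) for value, count in runs]

def run_length_decode(runs):
    """Expand (value, run length) pairs back into the bin sequence."""
    bins = []
    for value, count in runs:
        bins.extend([value] * count)
    return bins

bins = [0, 0, 0, 1, 1, 0]            # hypothetical bin sequence
runs = run_length_encode(bins)
decoded = run_length_decode(runs)
```

Run-length encoding pays off mainly when long runs are common, for example in binary maps dominated by a single value.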

The encoding/decoding unit 106 encodes and decodes pre-processed feature information and pre-processing-related information such as metadata. According to this embodiment, there is no particular limitation on the technique applied by the encoding/decoding unit 106; the encoding/decoding methods of existing image compression standards, as well as encoding/decoding methods to be developed or adopted in the future, may be applied, as will be described later.

According to an embodiment, the encoder of the encoder/decoder 106 may include the above-described feature extractor 102 . For example, in the case of the deep learning-based feature extraction unit 102, one or more convolutional layers and fully connected layers may be included. In this case, the filter for the convolutional layer may be changed according to one or more of a type of a feature, a learning form, a size, and the like, and one or more coefficients may be defined. In addition, the filter for the fully connected layer may be changed according to one or more of the type of feature, the learning form, the number of inputs, the size, and the like, and one or more coefficients may be defined.

In addition, the decoding unit of the encoding/decoding unit 106 may include one or more convolutional layers and fully connected layers when a learning-based compression and machine vision network is configured. In this case, the filter for the convolutional layer may be changed according to one or more of a feature type, a learning form, a size, and the like, and one or more coefficients may be defined. In addition, the filter for the fully connected layer may be changed according to one or more of the type of feature, the learning form, the number of inputs, the size, and the like, and one or more coefficients may be defined.

According to the embodiment of the present invention, the feature input to the encoding/decoding unit 106 may be k-dimensional (kD). Here, k may be a positive integer such as 1, 2, 3, and so on. For example, when k is 1 (1D), the input feature may be a feature vector, a hidden vector, or the like. When k is 2 (2D), the input feature may be in the form of a frame having horizontal and vertical lengths. When k is 3 (3D), the input feature may have horizontal, vertical, and channel dimensions. Alternatively, when k is 3 (3D), the input feature may have horizontal, vertical, and time dimensions.

When the encoding/decoding unit 106 encodes/decodes the feature information, encoding/decoding may be performed in units of at least one of a sample, a line, a block, and a frame. In this case, encoding/decoding may be performed in units of at least one of a sample, a line, a block, and a frame according to at least one of the shape, size, and dimension of the input feature.

In this case, information on the corresponding unit may also be encoded according to the coding unit, and the encoded information may be decoded and used for decoding. For example, when encoding/decoding is performed in units of blocks, one or more pieces of information among the size, division, and shape of a block may be encoded, and the encoded information may be decoded and utilized for decoding of feature information. As another example, when encoding/decoding is performed on a line-by-line basis, one or more pieces of information among the length, number, and shape of a line may be encoded, and the encoded information may be decoded and utilized for decoding of feature information. As another example, when encoding/decoding is performed on a frame-by-frame basis, one or more pieces of information among frame shape, size, number, and division information may be encoded, and the encoded information may be decoded and utilized for decoding of feature information.

When the encoding/decoding unit 106 encodes/decodes the feature information, prediction-based encoding/decoding may be performed. In this prediction encoding/decoding, prediction may be performed in units of one or more of samples, lines, blocks, and frames.

According to an embodiment, when the encoding/decoding unit 106 performs prediction-based encoding/decoding, a prediction method may vary according to data characteristics of feature information. As the prediction method, at least one of spatial prediction, temporal prediction, and channel prediction may be used.

In the case of spatial prediction, prediction may be performed according to coding units from previously encoded/decoded neighboring samples or prediction values. In this case, the neighboring sample may be a sample adjacent to the current coding unit or a neighboring sample that is not adjacent to the current coding unit, but is specified according to a predetermined rule. Such a neighboring sample may be one sample or a set of samples. When spatial prediction is applied, the signal of the current coding unit may be reduced.

In spatial prediction, when prediction is performed according to coding units from neighboring samples already encoded/decoded, prediction can be performed using one or more of directional prediction, template prediction, dictionary prediction, and matrix multiplication prediction. More specifically, in the case of directional prediction, prediction of a current unit (e.g., a block) may be performed by copying or interpolating already encoded/decoded neighboring samples according to a direction. In the case of template prediction, prediction may be performed by searching for the template most similar to the current prediction unit among previously encoded/decoded neighboring samples. In the case of dictionary prediction, prediction blocks of neighboring samples or units that have already been encoded/decoded are stored in a separate memory, and prediction of the current unit can be performed from the information (samples or blocks) stored in the separate memory. In the case of matrix multiplication prediction, prediction of the current unit may be performed by multiplying the already encoded/decoded neighboring samples by an arbitrary matrix.
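
A minimal Python sketch of directional prediction is shown below. It supports only two directions (copying the row above or the column to the left) and selects between them by the sum of absolute differences; the function names, the selection criterion, and the sample block are assumptions for illustration and are not taken from the text.

```python
def predict(top, left, size, mode):
    """Build a size x size prediction by copying neighboring samples along a direction."""
    if mode == "vertical":       # copy the row above downward
        return [list(top[:size]) for _ in range(size)]
    if mode == "horizontal":     # copy the left column rightward
        return [[left[r]] * size for r in range(size)]
    raise ValueError("unsupported mode: " + mode)

def sad(block, prediction):
    """Sum of absolute differences between the block and its prediction."""
    return sum(abs(o - p) for ro, rp in zip(block, prediction) for o, p in zip(ro, rp))

def best_directional_prediction(block, top, left):
    """Pick the direction with the smallest SAD and return mode, prediction, and residual."""
    size = len(block)
    candidates = {m: predict(top, left, size, m) for m in ("vertical", "horizontal")}
    mode = min(candidates, key=lambda m: sad(block, candidates[m]))
    prediction = candidates[mode]
    residual = [[o - p for o, p in zip(ro, rp)] for ro, rp in zip(block, prediction)]
    return mode, prediction, residual

block = [[10, 20], [11, 21]]     # hypothetical current block
top, left = [10, 20], [15, 15]   # already decoded neighboring samples
mode, prediction, residual = best_directional_prediction(block, top, left)
```

The residual here is much smaller in magnitude than the block itself, which is the sense in which spatial prediction reduces the signal of the current coding unit.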

In the case of temporal prediction, the current unit may be predicted from a temporally front frame and/or a rear frame with a size corresponding to the coding unit.

In the case of channel prediction, when the feature information has a multi-channel form, prediction on the current unit may be performed from a channel other than the current channel with a size corresponding to the coding unit.

In this case, the coding unit may vary according to the configuration of the feature information. More specifically, if the feature information is configured in a multi-channel form and is arranged according to the correlation between channels, the coding unit may also be determined based on the correlation. For example, if feature information is arranged by an average value, a coding unit may be configured by a bundle of similar average values. Alternatively, if they are arranged based on the degree of similarity between channels, the coding unit may be configured as a bundle of similarities.

When the encoding/decoding unit 106 performs prediction-based encoding/decoding, the encoding/decoding unit 106 may encode/decode a difference (residual signal) between the original signal and the prediction signal. In this case, as a prediction method for generating a prediction signal, one or more methods among the above-described prediction methods may be applied. In this case, the residual signal may be a difference between the original signal and the prediction signal using one or more of the aforementioned prediction methods.
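
As a small illustration of residual coding combined with channel prediction, the Python sketch below predicts each channel from the previous channel and keeps only the difference. Using the immediately preceding channel as the predictor, and the function names, are assumptions for this example; no quantization is applied, so reconstruction is exact.

```python
def channel_residuals(channels):
    """Residual of each channel against the previous channel (first one against zeros)."""
    residuals = []
    previous = [0] * len(channels[0])
    for channel in channels:
        residuals.append([o - p for o, p in zip(channel, previous)])
        previous = channel
    return residuals

def reconstruct_channels(residuals):
    """Decoder side: add each residual to the previously reconstructed channel."""
    channels = []
    previous = [0] * len(residuals[0])
    for residual in residuals:
        channel = [r + p for r, p in zip(residual, previous)]
        channels.append(channel)
        previous = channel
    return channels

channels = [[10, 12, 14], [11, 12, 15], [11, 13, 15]]   # hypothetical similar channels
residuals = channel_residuals(channels)
reconstructed = reconstruct_channels(residuals)
```

When neighboring channels are similar, for instance after rearrangement by correlation, the residuals stay close to zero and are cheaper to binarize and entropy-encode than the original values.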

The encoding/decoding unit 106 may encode prediction information and a residual signal generated in performing prediction-based encoding/decoding by applying one or more methods of binarization and entropy encoding. More specifically, in performing binarization, one or more of the following binarization methods may be applied, but the present invention is not limited thereto: truncated Rice binarization method, k-th order exponential-Golomb binarization method, limited k-th order exponential-Golomb binarization method, fixed-length binarization method, unary binarization method, truncated unary binarization method, truncated binary binarization method.

In addition, the encoding/decoding unit 106 performs entropy encoding/decoding on variables or related information used in the above-described prediction process and residual signal generation process, and on the information generated through binarization. One or more of the following methods may be used for entropy encoding/decoding, which is merely exemplary: run-length encoding, Huffman encoding, arithmetic coding, context-adaptive binary arithmetic coding (CABAC), context-adaptive variable length coding (CAVLC), bypass coding.

The encoding/decoding unit 106 may also perform transformation-based encoding/decoding in encoding/decoding the feature information. In this case, the input to be transformed may be one or more of the feature information extracted by the feature extraction unit 102, the feature information preprocessed by the preprocessor 104, and the residual signal generated through the aforementioned prediction process. By performing such transformation-based encoding/decoding, the signal of the current coding unit may be transformed into a signal that can be encoded more efficiently. One or more of the following transformation methods may be applied, which are merely exemplary: PCA (Principal Component Analysis), KLT (Karhunen-Loeve Transform), DCT (Discrete Cosine Transform), DST (Discrete Sine Transform), FFT (Fast Fourier Transform), FT (Fourier Transform).
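
To illustrate why a transformed signal can be easier to encode, the Python sketch below applies an orthonormal one-dimensional DCT-II and its inverse to a short signal. The one-dimensional form, the orthonormal scaling, and the function names are choices made for this example; the text does not fix a particular DCT variant.

```python
import math

def dct_ii(x):
    """Orthonormal 1D DCT-II of a list of samples."""
    n = len(x)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                                  for i in range(n)))
    return coeffs

def idct_ii(coeffs):
    """Inverse of dct_ii (the transpose of the orthonormal DCT-II matrix)."""
    n = len(coeffs)
    samples = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
        samples.append(total)
    return samples

flat = dct_ii([4, 4, 4, 4])            # a constant signal: energy goes to one coefficient
signal = [3, 1, 4, 1, 5, 9, 2, 6]      # hypothetical residual samples
round_trip = idct_ii(dct_ii(signal))
```

For smooth or slowly varying inputs most of the energy lands in a few low-frequency coefficients, so the remaining coefficients can be coded with few bits.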

The encoding/decoding unit 106 may also encode one or more pieces of information generated in the transform-based encoding/decoding process, such as transform information, transform kernel information, transform coefficient information, and coefficients of the transformed signal, by applying at least one method among binarization and entropy encoding. In performing binarization, one or more of the following binarization methods may be applied, which are merely exemplary: truncated Rice binarization method, k-th order exponential-Golomb binarization method, limited k-th order exponential-Golomb binarization method, fixed-length binarization method, unary binarization method, truncated unary binarization method, truncated binary binarization method.

In addition, the encoding/decoding unit 106 may perform entropy encoding/decoding on variables and related information used in the above-described transformation process and on the information generated through binarization. One or more of the following methods may be applied to entropy encoding/decoding, which is merely exemplary: run-length encoding, Huffman encoding, arithmetic coding, context-adaptive binary arithmetic coding (CABAC), context-adaptive variable length coding (CAVLC), bypass coding.

Continuing to refer to FIG. 2, the post-processing unit 108 performs predetermined post-processing on the output signal from the encoding/decoding unit 106. According to the present embodiment, there is no particular limitation on the specific post-processing method applied in the post-processing unit 108. For example, the post-processing unit 108 may perform a post-processing procedure by applying a predetermined method corresponding to the method applied by the aforementioned pre-processing unit 104. In addition, the post-processing unit 108 may additionally perform a predetermined signal processing procedure suitable for performing various machine vision tasks in a machine vision network, for example, object tracking, object recognition, image restoration, and the like. Accordingly, the data subjected to post-processing in the post-processing unit 108 may be a video signal decoded after being encoded in the encoder/decoder 106, but the format of the post-processed result may vary depending on the performance or characteristics of the machine vision network that utilizes it.

The image processing apparatus according to the embodiment of the present invention described above is for performing machine vision tasks in a machine vision system or a machine vision network, that is, image restoration, object recognition, object tracking, object classification, object segmentation, and so on. Such an image processing apparatus corresponds to a kind of compression network, and may be integrated with the machine vision network and implemented as a single network (hereinafter referred to as a 'combined network'), or may be implemented as a network separate from the machine vision network.

FIG. 7 is a configuration diagram schematically illustrating an example of a system configuration when an image processing apparatus according to an embodiment of the present invention is implemented using a network model that performs machine vision. Although not shown in FIG. 7, an image restoration network for restoring an image from a bitstream may be additionally provided in the entire system, and a detailed description thereof will be omitted (see FIG. 9 or FIG. 11 and the related description).

Referring to FIG. 7 , the system may preprocess the feature information of an input image generated by the feature extractor of the corresponding network model, and then perform encoding and decoding on the preprocessed feature information. As the preprocessing method or the encoding/decoding method, any one or more of the various methods described above may be used, respectively. In addition, the bitstream generated as a result of encoding may be one or more of feature information output from one or more layers of a network model performing machine vision, a hidden vector, or an encoded form thereof. Also, the corresponding bitstream may be a result of performing encoding by using one or more of the aforementioned preprocessing of the characteristic information and the encoding method of the characteristic information. In this case, information generated by at least one of the preprocessing process and the encoding process may be transmitted to the decoder.

FIG. 8 is a configuration diagram schematically illustrating an example of a system configuration when an image processing apparatus according to an embodiment of the present invention is implemented using a separate network model distinguished from a machine vision network. Referring to FIG. 8, processing such as feature extraction, preprocessing, and encoding/decoding is performed in a compression network separate from the machine vision network. Then, image data generated by performing predetermined post-processing after encoding/decoding (e.g., feature information decoded after being encoded into a bitstream, an image restored from feature information, etc.) is input to the machine vision network and used to perform machine vision tasks.

FIG. 9 is a configuration diagram schematically illustrating another example of a system configuration when an image processing apparatus according to an embodiment of the present invention is implemented using a separate network model distinguished from a machine vision network. The configuration of the system shown in FIG. 9 differs from the configuration of the system shown in FIG. 8 in that it further includes an image restoration network.

Referring to FIG. 9 , processing such as feature extraction, preprocessing, encoding/decoding, etc. is performed in a compression network separate from the machine vision network. Then, image data (eg, feature information decoded after encoding into a bitstream, an image reconstructed from feature information, etc.) generated by performing predetermined post-processing after encoding/decoding is input to the machine vision network. And the image restoration network restores the image by using the bitstream generated by the encoder/decoder. The image restored in the image restoration network can be used together to perform a predetermined task in the machine vision network.

FIGS. 10 and 11 each show the configuration of a system corresponding to the system of FIG. 7 from a different perspective. FIG. 10 shows a case in which an image restoration network is not provided, and FIG. 11 shows a case in which an image restoration network is provided.

As shown in FIG. 10 or FIG. 11 , when both a network model performing compression and a network model performing machine vision are used, the two network models may be combined and trained. In this case, the output of the network model performing compression may be in the form of an image in which the input image is reconstructed or reconstructed feature information, and the corresponding output may be an input of the network model performing machine vision. The bitstream may be in the form of feature information/hidden vectors output from at least one or more layers of the two network models or an encoded form thereof. The corresponding bitstream may be a result of performing encoding by using one or more of the pre-processing and feature encoding methods described above with reference to FIG. 2 . In this case, information generated in one or more of the preprocessing process and the encoding process may be transmitted to the decoder.

As shown in FIG. 11, when the reconstructed image is not otherwise output, the entire system may be configured to further include a separate image restoration network model capable of performing decoding, so that an image reconstructed from the input image can be output. The image restoration network model may perform decoding by applying one or more of the above-described feature decoding methods to the information used in compressing the input image, and may include filter coefficients of one or more network layers for decoding.

FIG. 12 is a view showing the configuration of a system corresponding to the system of FIG. 8 from another viewpoint. In such a system, both a network model performing compression and a network model performing machine vision are used, and each network model can be trained separately. In this case, the output of the network performing compression may be an image reconstructed from the input image, and the corresponding output may be an input of the network model performing machine vision. The bitstream may be in the form of feature information output from one or more layers of the compression network, a hidden vector, or an encoded form thereof. The corresponding bitstream may be a result of performing encoding using one or more of the above-described pre-processing methods and encoding methods for the feature information. In this case, information generated in one or more of the pre-processing process and the encoding process may be transmitted to the decoder.

Although the present invention has been described in detail with reference to preferred embodiments, the present invention is not limited to the above-described embodiments, and various modifications may be made by those skilled in the art within the scope of the technical spirit of the present invention.

Claims

An image processing method for machine vision, comprising:

an extraction step of extracting feature information from the image;

a pre-processing step of pre-processing the feature information extracted in the extraction step to reduce the size of the feature information; and

an encoding/decoding step including an encoding process of encoding the pre-processed feature information by applying an existing video codec technique,

wherein, in the encoding/decoding step, pre-processing-related information, which is related to the pre-processing method used in the pre-processing step, is received and encoded together.
The method of claim 1,

wherein, in the pre-processing step, the pre-processing is performed by applying one or more methods of normalization, scaling, rearrangement, reduction in the number of expression bits, quantization, transformation, and filtering.
The method of claim 1,

wherein the encoding/decoding step includes a decoding process of decoding the bitstream generated as a result of the encoding process, and

wherein, in the encoding/decoding step, encoding/decoding is performed by at least one method among prediction-based encoding/decoding and transformation-based encoding/decoding.
An image processing apparatus comprising a processor for processing an image for machine vision, the processor comprising:

a feature extracting unit for extracting feature information from the image;

a preprocessor for preprocessing the feature information extracted by the feature extractor to reduce the size of the feature information; and

an encoder/decoder that performs an encoding process by applying an existing video codec technique to the feature information preprocessed by the preprocessor,

wherein the encoder/decoder receives, from the preprocessor, preprocessing-related information related to the preprocessing method used by the preprocessor and encodes it together.
The apparatus of claim 4,

wherein the preprocessor performs preprocessing by applying one or more methods of normalization, scaling, rearrangement, reduction in the number of expression bits, quantization, transformation, and filtering.
The apparatus of claim 4,

wherein the encoder/decoder also performs a decoding process of decoding the bitstream generated as a result of the encoding process, and

wherein the encoder/decoder performs encoding/decoding by at least one of prediction-based encoding/decoding and transformation-based encoding/decoding.