US10990852B1 - Method and apparatus for training model for object classification and detection - Google Patents

Method and apparatus for training model for object classification and detection

Info

Publication number
US10990852B1
Authority
US
United States
Prior art keywords
image
classification model
training
classification
model
Prior art date
Legal status
Active, expires
Application number
US16/666,051
Other versions
US20210125000A1 (en)
Inventor
Byoung-Jip Kim
Jong-Won Choi
Young-joon Choi
Seong-Won Bak
Ji-Hoon Kim
Current Assignee
Samsung SDS Co Ltd
Original Assignee
Samsung SDS Co Ltd
Priority date
Filing date
Publication date
Application filed by Samsung SDS Co Ltd filed Critical Samsung SDS Co Ltd
Assigned to SAMSUNG SDS CO., LTD. Assignment of assignors interest (see document for details). Assignors: BAK, SEONG-WON; CHOI, JONG-WON; CHOI, YOUNG-JOON; KIM, BYOUNG-JIP; KIM, JI-HOON
Application granted
Publication of US10990852B1
Publication of US20210125000A1

Classifications

    • G06T 7/11 Region-based segmentation
    • G06K 9/6257
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06F 18/2115 Selection of the most significant subset of features by evaluating different subsets according to an optimisation criterion, e.g. class separability, forward selection or backward elimination
    • G06F 18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the process organisation or structure, e.g. boosting cascade
    • G06F 18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G06F 18/24 Classification techniques
    • G06F 18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06K 9/3275
    • G06K 9/6231
    • G06K 9/6259
    • G06N 3/08 Neural networks; Learning methods
    • G06T 7/001 Industrial image inspection using an image reference approach
    • G06V 10/7747 Generating sets of training patterns; Organisation of the process, e.g. bagging or boosting
    • G06V 10/809 Fusion of classification results, e.g. where the classifiers operate on the same input data
    • G06K 2209/21
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06V 2201/07 Target detection

Definitions

  • Embodiments of the present invention relate to a technique of training a model for object classification and detection.
  • training data which is labeled with a bounding box and a class of an object existing in each image is needed to train an object detection model.
  • since a lot of effort is required to mark the bounding box in the image, it is difficult to obtain a large amount of training data marked with the bounding box.
  • weakly-supervised object detection capable of training a model without a bounding box has been proposed recently.
  • however, the weakly-supervised object detection has a problem of requiring a lot of images labeled with a class. Accordingly, it is required to provide a method of training an object classification model even when only a small number of images are labeled with a class in a situation without the bounding box.
  • Embodiments of the present invention are to provide a method and an apparatus for effectively training a model for object classification and detection.
  • a method of training a model for object classification and detection including training a first classification model including a shared feature extractor shared by a plurality of classification models and a first classifier for outputting a classification result of an object included in a first input image on the basis of feature values of the first input image extracted by the shared feature extractor, using a first training image set including an image assigned with a class; training a second classification model including the shared feature extractor and a second classifier for outputting a classification result about authenticity of a second input image on the basis of feature values of the second input image extracted by the shared feature extractor, by using a second training image set including a fake image and a real image; and training a third classification model including the shared feature extractor and a third classifier for outputting a classification result about a rotation angle of a third input image on the basis of feature values of the third input image extracted by the shared feature extractor, using a third training image set including images rotated at one or more angles.
  • the training of the first classification model may include training the first classification model, by using the image assigned with a class as an input data of the first classification model and the class as a target data of the first classification model.
  • the first classification model may further include a global average pooling (GAP) layer for outputting a location of the object in the first input image on the basis of the feature values of the first input image.
  • the first training image set may further include an image assigned with location information
  • the training of the first classification model may include training the first classification model, by using the image assigned with a class and the image assigned with location information as an input data of the first classification model and the class and the location information as a target data of the first classification model.
  • the training of the second classification model may include generating the fake image by using a generative model based on a generative adversarial network (GAN).
  • the training of the second classification model may include training the second classification model, by using the fake image and the real image as an input data of the second classification model and authenticity corresponding to each of the fake image and the real image as a target data of the second classification model, and training the generative model to generate an image the same as the real image.
  • the third classification model may include an image rotator for generating the third training image set by rotating an image not assigned with a label at the one or more angles.
  • the training of the third classification model may include training the third classification model, by using the rotated images as an input data of the third model and a rotation angle of each of the rotated images as a target data of the third model.
  • the first classification model, the second classification model, and the third classification model may be trained to minimize a weighted sum of a loss function of the first classification model, a loss function of the second classification model, and a loss function of the third classification model.
  • an apparatus for training a model for object classification and detection comprising: a memory for storing one or more commands; and one or more processors for executing the one or more commands, wherein the one or more processors configured to train a first classification model including a shared feature extractor shared by a plurality of classification models and a first classifier for outputting a classification result of an object included in a first input image on the basis of feature values of the first input image extracted by the shared feature extractor, using a first training image set including an image assigned with a class, train a second classification model including the shared feature extractor and a second classifier for outputting a classification result about authenticity of a second input image on the basis of feature values of the second input image extracted by the shared feature extractor, by using a second training image set including a fake image and a real image, and train a third classification model including the shared feature extractor and a third classifier for outputting a classification result about a rotation angle of a third input image on the basis of feature values of the third input image extracted by the shared feature extractor, using a third training image set including images rotated at one or more angles.
  • the one or more processors may train the first classification model, by using the image assigned with a class as an input data of the first classification model and the class as a target data of the first classification model.
  • the first classification model may further include a global average pooling (GAP) layer for outputting a location of the object in the first input image on the basis of the feature values of the first input image.
  • the first training image set may further include an image assigned with location information
  • the one or more processors may train the first classification model, by using the image assigned with a class and the image assigned with location information as an input data of the first classification model and the class and the location information as a target data of the first classification model.
  • the one or more processors may generate the fake image by using a generative model based on a generative adversarial network (GAN).
  • the one or more processors may train the second classification model, by using the fake image and the real image as an input data of the second classification model and authenticity corresponding to each of the fake image and the real image as a target data of the second classification model, and train the generative model to generate an image the same as the real image.
  • the third classification model may include an image rotator for generating the third training image set by rotating an image not assigned with a label at the one or more angles.
  • the one or more processors may train the third classification model, by using the rotated images as an input data of the third classification model and a rotation angle of each of the rotated images as a target data of the third classification model.
  • the first classification model, the second classification model, and the third classification model may be trained to minimize a weighted sum of a loss function of the first classification model, a loss function of the second classification model, and a loss function of the third classification model.
  • according to the disclosed embodiments, as a plurality of classification models sharing the same shared feature extractor is trained individually, the shared feature extractor may be sufficiently trained several times. Accordingly, since the feature extractor used for an object classification model, an object detection model and the like based on supervised learning using labeled training data is sufficiently trained, performance of the models can be enhanced. In addition, the effort, time and cost required for constructing a labeled training dataset can be reduced.
  • FIG. 1 is a block diagram showing an example of a computing environment including a computing device appropriate to be used in exemplary embodiments.
  • FIG. 2 is a flowchart illustrating a method of training a model for object classification and detection according to an embodiment.
  • FIG. 3 is a view schematically showing the configuration of a first classification model according to an embodiment.
  • FIG. 4 is a view schematically showing the configuration of a generative model and a second classification model according to an embodiment.
  • FIG. 5 is a view schematically showing the configuration of a third classification model according to an embodiment.
  • FIG. 6 is a view showing the overall configuration of a first classification model, a second classification model, and a third classification model according to an embodiment.
  • the neural network may use artificial neurons simplifying the functions of biological neurons, and the artificial neurons may be interconnected through connection lines having a connection weight.
  • the connection weight which is a parameter of the neural network, is a specific value that the connection line has and may be expressed as connection strength.
  • the neural network may perform a recognition action or a learning process of a human being through the artificial neurons.
  • the artificial neuron may also be referred to as a node.
  • the neural network may include a plurality of layers.
  • the neural network may include an input layer, a hidden layer and an output layer.
  • the input layer may receive an input for performing learning and transfer the input to the hidden layer, and the output layer may generate an output of the neural network on the basis of the signals received from the nodes of the hidden layer.
  • the hidden layer is positioned between the input layer and the output layer and may convert the training data transferred through the input layer into a value easy to estimate.
  • the nodes included in the input layer and the hidden layer are connected to each other through connection lines having a connection weight, and the nodes included in the hidden layer and the output layer may also be connected to each other through connection lines having a connection weight.
  • the input layer, the hidden layer and the output layer may include a plurality of nodes.
  • the neural network may include a plurality of hidden layers.
  • the neural network including a plurality of hidden layers is referred to as a deep neural network, and training the deep neural network is referred to as deep learning.
  • the nodes included in the hidden layer are referred to as hidden nodes.
  • training a neural network may be understood as training parameters of the neural network.
  • a trained neural network may be understood as a neural network to which the trained parameters are applied.
  • the neural network may be trained using a loss function as an objective.
  • the loss function may be an objective of the neural network for determining an optimum weight parameter through the training.
  • the neural network may be trained for the purpose of making a result value of the loss function to be the smallest.
  • the neural network may be trained through supervised learning or unsupervised learning.
  • the supervised learning is a method of inputting a training data including an input data and a target data corresponding to the input data into the neural network and updating connection weights of connection lines so that the target data corresponding to the input data may be outputted.
  • the unsupervised learning is a method of inputting only an input data into the neural network as a training data without a target data corresponding to the input data, and updating the connection weights of the connection lines to find out the features or the structure of the input data.
  • FIG. 1 is a block diagram showing an example of a computing environment including a computing device appropriate to be used in exemplary embodiments.
  • each of the components may have a different function and ability in addition to those described below, and additional components other than those described below may be included.
  • the computing environment 10 shown in the figure includes a computing device 12 .
  • the computing device 12 may be an apparatus for training a model for object classification and detection.
  • the computing device 12 includes at least a processor 14 , a computer-readable storage medium 16 , and a communication bus 18 .
  • the processor 14 may direct the computing device 12 to operate according to the exemplary embodiments described above.
  • the processor 14 may execute one or more programs stored in the computer-readable storage medium 16 .
  • the one or more programs may include one or more computer executable commands, and the computer executable commands may be configured to direct the computing device 12 to perform operations according to the exemplary embodiment when the commands are executed by the processor 14 .
  • the computer-readable storage medium 16 is configured to store computer-executable commands and program codes, program data and/or information of other appropriate forms.
  • the programs 20 stored in the computer-readable storage medium 16 include a set of commands that can be executed by the processor 14 .
  • the computer-readable storage medium 16 may be memory (volatile memory such as random access memory, non-volatile memory, or an appropriate combination of these), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, other forms of storage media that can be accessed by the computing device 12 and is capable of storing desired information, or an appropriate combination of these.
  • the communication bus 18 interconnects various different components of the computing device 12 , including the processor 14 and the computer-readable storage medium 16 .
  • the computing device 12 may also include one or more input and output interfaces 22 and one or more network communication interfaces 26 , which provide an interface for one or more input and output devices 24 .
  • the input and output interfaces 22 and the network communication interfaces 26 are connected to the communication bus 18 .
  • the input and output devices 24 may be connected to other components of the computing device 12 through the input and output interfaces 22 .
  • Exemplary input and output devices 24 may include input devices such as a pointing device (a mouse, a track pad, etc.), a keyboard, a touch input device (a touch pad, a touch screen, etc.), a voice or sound input device, various kinds of sensor devices and/or photographing devices, and/or output devices such as a display device, a printer, a speaker and/or a network card.
  • the exemplary input and output devices 24 may be included inside the computing device 12 as a component configuring the computing device 12 or may be connected to the computing device 12 as a separate apparatus distinguished from the computing device 12 .
  • FIG. 2 is a flowchart illustrating a method of training a model for object classification and detection according to an embodiment.
  • the method shown in FIG. 2 may be executed by the computing device 12 provided with, for example, one or more processors and a memory for storing one or more programs executed by the one or more processors.
  • although the method is described as being divided into a plurality of operations in the flowchart shown in the figure, at least some of the operations may be performed in a different order, combined with other operations, omitted, divided into detailed operations, or performed together with one or more operations not shown in the figure.
  • the computing device 12 trains a first classification model including a shared feature extractor shared by a plurality of classification models and a first classifier for outputting a classification result of an object included in a first input image on the basis of feature values of the first input image extracted by the shared feature extractor, using a first training image set including an image assigned with a class.
  • the class included in the first training image set may be assigned by a user.
  • the shared feature extractor may be configured of one or more layers to extract feature values of an input image.
  • although the shared feature extractor may include a convolution layer, a pooling layer and a fully connected layer in an embodiment, it is not necessarily limited thereto and may be configured in a variety of forms according to embodiments.
  • the shared feature extractor may extract a feature vector including one or more feature values of an input image.
  • the shared feature extractor is shared among a first classification model, a second classification model and a third classification model and may be used by the first classification model, the second classification model and the third classification model to extract feature values of an input image.
  • the first classifier may be configured of one or more layers for outputting a classification result of an object included in the input image.
  • the first classifier may output a probability of an object included in the input image to be classified as a specific class among a plurality of classes set in advance, on the basis of the feature values of the input image.
  • the computing device 12 may train the first classification model by using a supervised learning technique using a first training dataset.
  • the computing device 12 may train the first classification model, by using an image assigned with a class as an input data of the first classification model and the class assigned to the image as a target data of the first classification model.
  • the computing device 12 may output a classification result of an object from an input image by using the first classification model and train the first classification model on the basis of the classification result of the object and the class assigned to the input image. At this point, the computing device 12 may update the parameters of the shared feature extractor and the first classifier.
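
As a minimal sketch of this supervised step (assuming PyTorch; the layer sizes, the class count, and the names SharedFeatureExtractor and FirstClassifier are illustrative assumptions rather than the patent's reference implementation), the shared backbone and the first classifier can be wired together and updated with class-labeled images as follows.

```python
# Illustrative sketch only: a shared convolutional backbone and a class head,
# updated once with a class-labeled mini-batch. Shapes and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedFeatureExtractor(nn.Module):
    """Convolutional backbone shared by all three classification models."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.features(x)              # feature map of shape (N, 64, H/4, W/4)

class FirstClassifier(nn.Module):
    """Object-class head applied to the shared feature map."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(64, num_classes)

    def forward(self, feat):
        pooled = feat.mean(dim=(2, 3))       # global average over spatial positions
        return self.fc(pooled)               # class logits

extractor = SharedFeatureExtractor()
cls_head = FirstClassifier(num_classes=10)
opt_cls = torch.optim.SGD(
    list(extractor.parameters()) + list(cls_head.parameters()), lr=0.01)

# One supervised update: the class assigned to each image is the target data.
images = torch.randn(8, 3, 64, 64)           # stand-in for a class-labeled mini-batch
labels = torch.randint(0, 10, (8,))
loss_i = F.cross_entropy(cls_head(extractor(images)), labels)
opt_cls.zero_grad()
loss_i.backward()
opt_cls.step()
```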
  • the first classification model may further include a global average pooling (GAP) layer for outputting the location of an object in the first input image on the basis of the feature values of the first input image.
  • the global average pooling layer may filter the feature values of the input image to include the location information of the object.
  • the first classification model may output location information of the object using, for example, a class activation map (CAM) algorithm.
  • although location information of an object is outputted using a class activation map technique in the embodiment described above, the method of outputting location information of an object may be diverse according to embodiments.
  • the computing device 12 may train the first classification model, by using an image assigned with a class and an image assigned with location information as an input data of the first classification model and using the class and the location information assigned to the image as a target data of the first classification model.
  • the location information outputted through the class activation map technique may be used as the location information assigned to the image, or the location information may be assigned in a variety of methods.
  • the computing device 12 may train the first classification model to output location information the same as the location information included in the target data.
  • the computing device 12 may update the parameters of the shared feature extractor, the first classifier and the global average pooling layer by using a first training dataset including an image assigned with a class and an image assigned with location information.
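
One plausible way to realize the localization step with a class activation map is sketched below; it reuses the extractor and classifier head assumed in the previous sketch, and the upsampling details are assumptions rather than the patent's exact procedure.

```python
# Illustrative CAM computation: weight the shared feature maps by the first
# classifier's weights for one class to obtain a coarse object-location map.
# Reuses `extractor` and `cls_head` from the previous sketch.
import torch
import torch.nn.functional as F

def class_activation_map(extractor, cls_head, image, target_class):
    feat = extractor(image)                          # (1, C, h, w) shared features
    weights = cls_head.fc.weight[target_class]       # (C,) class-specific weights
    cam = (weights[None, :, None, None] * feat).sum(dim=1)   # (1, h, w)
    cam = F.relu(cam)                                # keep positive evidence only
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                        mode="bilinear", align_corners=False)
    return cam.squeeze(1)                            # heat map at input resolution

image = torch.randn(1, 3, 64, 64)
heat_map = class_activation_map(extractor, cls_head, image, target_class=3)
```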
  • the computing device 12 trains the second classification model including the shared feature extractor and a second classifier for outputting a classification result about the authenticity of a second input image on the basis of feature values of the second input image extracted by the shared feature extractor, by using a second training image set including a fake image and a real image.
  • the computing device 12 may generate a fake image using a generative model based on a generative adversarial network (GAN).
  • the generative model may be a neural network which generates a fake image on the basis of probability distribution of a random latent variable.
  • the computing device 12 may generate a fake image by using the generative model and generate a second training image set by assigning labels of fake and real to the fake image and the real image.
  • the second classifier may be configured of one or more layers for outputting a classification result about the authenticity of the input image.
  • the second classifier may output a probability of whether the input image corresponds to a fake image or a real image on the basis of the feature values of the input image.
  • the computing device 12 may train the second classification model, by using the fake image and the real image as an input data of the second classification model and authenticity corresponding to each of the fake image and the real image as a target data of the second classification model.
  • the computing device 12 may input a fake image and a real image into the second classification model and output a classification result about the authenticity of the fake image and the real image. At this point, the computing device 12 may compare the outputted classification result and the authenticity of each of the fake image and the real image and update the parameters of the shared feature extractor and the second classifier through the result of the comparison.
  • the computing device 12 may train the generative model by using an unsupervised learning algorithm. Specifically, the computing device 12 may train the generative model to generate an image the same as the real image. At this point, the computing device 12 may update the parameters of the generative model on the basis of the classification result outputted from the second classification model.
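
A minimal adversarial training step for the second classification model might look as follows; the generator is a deliberately small placeholder, a plain binary real/fake cross-entropy is used for brevity in place of the entropy-based objectives described with FIG. 4, and all names are assumptions. The shared `extractor` from the earlier sketch is reused so that this step also updates the shared feature extractor.

```python
# Illustrative GAN-style step: a generator produces fake images, and a real/fake
# head on the shared extractor is trained to separate them. Reuses `extractor`.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Maps a random latent vector to a fake image (toy architecture)."""
    def __init__(self, latent_dim=100):
        super().__init__()
        self.latent_dim = latent_dim
        self.net = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh())

    def forward(self, z):
        return self.net(z).view(-1, 3, 64, 64)

class SecondClassifier(nn.Module):
    """Authenticity (real/fake) head applied to the shared feature map."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(64, 1)

    def forward(self, feat):
        return self.fc(feat.mean(dim=(2, 3)))        # authenticity logit

generator = Generator()
disc_head = SecondClassifier()
opt_d = torch.optim.SGD(
    list(extractor.parameters()) + list(disc_head.parameters()), lr=0.01)
opt_g = torch.optim.SGD(generator.parameters(), lr=0.01)

real = torch.randn(8, 3, 64, 64)                     # stand-in for unlabeled real images
fake = generator(torch.randn(8, generator.latent_dim))

# Second classification model update: real images -> 1, generated images -> 0.
d_loss = (F.binary_cross_entropy_with_logits(disc_head(extractor(real)), torch.ones(8, 1))
          + F.binary_cross_entropy_with_logits(disc_head(extractor(fake.detach())),
                                               torch.zeros(8, 1)))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generative model update: push generated images toward being classified as real.
g_loss = F.binary_cross_entropy_with_logits(disc_head(extractor(fake)), torch.ones(8, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```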
  • the computing device 12 trains the third classification model including the shared feature extractor and a third classifier for outputting a classification result about a rotation angle of a third input image on the basis of feature values of the third input image extracted by the shared feature extractor, by using a third training image set including images rotated at one or more angles.
  • the third classifier may be configured of one or more layers for outputting a classification result about the rotation angle of the input image.
  • the third classifier may output a probability of the input image to have been rotated at a specific angle among a plurality of rotation angles on the basis of the feature values of the input image.
  • the third classification model may further include an image rotator for generating a third training image set by rotating an image not assigned with a label at one or more angles.
  • the image rotator may receive an image not assigned with a label and rotate the image at one or more angles. In addition, the image rotator may assign a rotation angle to each rotated image as a label.
  • the computing device 12 may train the third classification model using a self-supervised learning algorithm.
  • the computing device 12 may generate images rotated at one or more angles by using the image rotator of the third classification model, and train the third classification model by using the rotated images as an input data of the third classification model and the rotation angle of each of the rotated images as a target data of the third classification model.
  • the computing device 12 may input the rotated images into the third classification model and output a classification result about the rotation angle of each of the rotated images. At this point, the computing device 12 may compare the outputted classification result and the rotation angle of each of the rotated images and update the parameters of the shared feature extractor and the third classification model through the result of the comparison.
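
The rotation pretext task can be sketched as follows; the choice of four right-angle rotations and the names used here are assumptions, and the shared `extractor` from the earlier sketches is reused.

```python
# Illustrative rotation self-supervision: unlabeled images are rotated at a fixed
# set of angles, the angle index is the target, and a rotation head on the shared
# extractor is trained with cross entropy. Reuses `extractor`.
import torch
import torch.nn as nn
import torch.nn.functional as F

ANGLES = (0, 90, 180, 270)                            # assumed rotation set

def rotate_batch(images):
    """Rotate every unlabeled image at each angle and label it with the angle index."""
    rotated, targets = [], []
    for k, _ in enumerate(ANGLES):
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        targets.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(targets)

class ThirdClassifier(nn.Module):
    """Rotation-angle head applied to the shared feature map."""
    def __init__(self, num_angles=len(ANGLES)):
        super().__init__()
        self.fc = nn.Linear(64, num_angles)

    def forward(self, feat):
        return self.fc(feat.mean(dim=(2, 3)))

rot_head = ThirdClassifier()
opt_r = torch.optim.SGD(
    list(extractor.parameters()) + list(rot_head.parameters()), lr=0.01)

unlabeled = torch.randn(8, 3, 64, 64)                 # stand-in for unlabeled images
rotated, angle_targets = rotate_batch(unlabeled)
r_loss = F.cross_entropy(rot_head(extractor(rotated)), angle_targets)
opt_r.zero_grad()
r_loss.backward()
opt_r.step()
```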
  • FIG. 3 is a view schematically showing the configuration of a first classification model according to an embodiment.
  • the first classification model 300 may include a shared feature extractor 310, a first classifier 320, and a global average pooling layer 330.
  • a first training image set including an image assigned with a class and location information is inputted into the first classification model 300.
  • the shared feature extractor 310 may extract feature values of an inputted image.
  • the first classifier 320 may output a classification result of an object included in the image on the basis of the feature values of the image.
  • the global average pooling layer 330 may filter the feature values of the image and output the location information of the object.
  • the computing device 12 may calculate a loss function of the first classification model 300 by using the result outputted through the first classifier 320 and the global average pooling layer 330 and the class and the location information assigned to the inputted image.
  • the loss function of the first classification model may be a loss function based on cross entropy.
  • the loss function L_I of the first classification model may be as shown below in Equation 1.
  • L_I = -Σ_i y_i log p(y_i)   [Equation 1]
  • in Equation 1, y denotes a class assigned to the image, i denotes an index of the class, and p(y) denotes a probability of the classification result of an object to be outputted as y.
  • although the loss function based on cross entropy is set in the first classification model in the embodiment, it is not necessarily limited thereto, and the loss function of the first classification model may be diverse according to embodiments.
  • FIG. 4 is a view schematically showing the configuration of a generative model and a second classification model according to an embodiment.
  • the second classification model 420 may be trained together with the generative model 410 .
  • the second classification model 420 may include a shared feature extractor 310 and a second classifier 421 .
  • the generative model 410 may generate a fake image and determine an image not assigned with a label as a real image.
  • the second classification model 420 may receive a second training dataset including the fake image and the real image.
  • the shared feature extractor 310 may extract feature values of the fake image and the real image.
  • the second classifier 421 may output a classification result about the authenticity of each of the fake image and the real image on the basis of the feature values of the fake image and the real image.
  • the computing device 12 may calculate a loss function of the second classification model 420 by using the result outputted through the second classifier 421 and the authenticity corresponding to the fake image and the real image.
  • the loss function L_D of the second classification model 420 may be as shown below in Equation 2.
  • L_D = -H_X[p(y)] + E_{x~X}[H[p(y|x)]] - E_{z~p(z)}[H[p(y|G(z))]]   [Equation 2]
  • in Equation 2, H denotes the entropy function, E denotes an expectation value of the function, p(y) denotes a probability of the classification result about the authenticity of the fake image and the real image to be outputted as y from the second classifier 421, p(y|x) denotes the corresponding probability for a real image x, and G(z) denotes a fake image generated by the generative model 410 from a latent variable z.
  • the loss function L_G of the generative model 410 may be as shown below in Equation 3.
  • L_G = -H_G[p(y)] + E_{z~p(z)}[H[p(y|G(z))]]   [Equation 3]
  • FIG. 5 is a view schematically showing the configuration of a third classification model according to an embodiment.
  • the third classification model 500 may include an image rotator 510 , a shared feature extractor 310 , and a third classifier 520 .
  • the image rotator 510 may generate a third training dataset by using the image not assigned with a label.
  • the third training dataset may include rotated images and rotation angles of the rotated images.
  • the shared feature extractor 310 may extract feature values of the rotated images.
  • the third classifier 520 may output a classification result about a rotation angle of a rotated image on the basis of the feature values of the rotated image.
  • the computing device 12 may calculate a loss function of the third classification model 500 by using the result outputted through the third classifier 520 and the rotation angle of the image rotated by the image rotator 510 .
  • the loss function L_R of the third classification model 500 may be as shown below in Equation 4.
  • L_R = -Σ_i r_i log p(r_i)   [Equation 4]
  • in Equation 4, r denotes the rotation angle of an image rotated by the image rotator 510, and p(r) denotes a probability of the classification result about the rotation angle of the rotated image to be outputted as r from the third classifier 520.
  • FIG. 6 is a view showing the overall configuration of a first classification model, a second classification model, and a third classification model according to an embodiment.
  • the computing device 12 may train the shared feature extractor 310 several times by simultaneously training the first classification model 300 , the generative model 410 , the second classification model 420 , and the third classification model 500 .
  • although the computing device 12 may train the first classification model 300, the generative model 410, the second classification model 420, and the third classification model 500 by using a round-robin method, the method of training each deep neural network model may be diverse according to embodiments.
  • the computing device 12 may individually train the shared feature extractor 310 , the first classifier 320 , and the global average pooling layer 330 included in the first classification model 300 by using the first training dataset.
  • the computing device 12 may individually train the generative model 410 , and the shared feature extractor 310 and the second classifier 421 included in the second classification model 420 by using the second training dataset.
  • the computing device 12 may individually train the shared feature extractor 310 and the third classifier 520 included in the third classification model 500 by using the third training dataset.
  • the computing device 12 may sequentially train the first classification model 300 , the generative model 410 , the second classification model 420 , and the third classification model 500 .
  • although the computing device 12 may train the first classification model 300, the generative model 410, the second classification model 420, and the third classification model 500 by using the stochastic gradient descent (SGD) algorithm, it is not necessarily limited thereto, and the training method may be diverse according to embodiments.
  • the first classification model 300 , the generative model 410 , the second classification model 420 , and the third classification model 500 may be trained to minimize the weighted sum of the loss function of the first classification model 300 , the loss function of the generative model 410 , the loss function of the second classification model 420 , and the loss function of the third classification model 500 .
  • the weighted sum of the loss function of the first classification model 300, the loss function of the generative model 410, the loss function of the second classification model 420, and the loss function of the third classification model 500 may be expressed as shown below in Equation 5.
  • L_total = λ_I · L_I + λ_GAN · (L_D + L_G) + λ_R · L_R   [Equation 5]
  • in Equation 5, λ denotes the weight value.
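
A compact way to combine the three objectives along the lines of Equation 5 is sketched below; the weight values are illustrative assumptions, and the commented loop indicates one possible joint update schedule rather than the patent's prescribed round-robin procedure.

```python
# Illustrative weighted total loss in the spirit of Equation 5.
lambda_i, lambda_gan, lambda_r = 1.0, 0.5, 0.5        # assumed weight values

def weighted_total_loss(l_i, l_d, l_g, l_r):
    """Weighted sum of the classification, GAN, and rotation losses."""
    return lambda_i * l_i + lambda_gan * (l_d + l_g) + lambda_r * l_r

# One possible schedule: recompute the individual losses on fresh mini-batches
# (as in the sketches above) and take a single SGD step on the weighted sum, so
# the shared feature extractor receives gradients from every classification model.
# for step in range(num_steps):
#     l_i, l_d, l_g, l_r = compute_losses(batch)      # hypothetical helper
#     joint_opt.zero_grad()
#     weighted_total_loss(l_i, l_d, l_g, l_r).backward()
#     joint_opt.step()
```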
  • the user may determine the first classification model including the shared feature extractor 310 , the first classifier 320 and the global average pooling layer 330 as an object detection model.
  • the shared feature extractor 310 may be sufficiently trained several times.
  • performance of the shared feature extractor 310 for extracting features may be enhanced although the amount of images assigned with a label is small.
  • although each classification model is trained using a neural network configured of the first classification model, the second classification model and the third classification model in the embodiments described above, it is not necessarily limited thereto.
  • a classification model that can perform training by using an image not assigned with a label may be further included, in addition to the first classification model, the second classification model and the third classification model.
  • a classification model performing training by using weakly-supervised learning, semi-supervised learning, self-supervised learning, unsupervised learning or the like may be included.
  • the embodiments of the present invention may include programs for performing the methods described in this specification on a computer and computer-readable recording media including the programs.
  • the computer-readable recording media may store program commands, local data files, local data structures and the like independently or in combination.
  • the media may be specially designed and configured for the present invention or may be commonly used in the field of computer software.
  • Examples of the computer-readable recording media include magnetic media such as a hard disk, a floppy disk and a magnetic tape, optical recording media such as CD-ROM and DVD, and hardware devices specially configured to store and execute program commands, such as ROM, RAM, flash memory and the like.
  • An example of the program may include a high-level language code that can be executed by a computer using an interpreter or the like, as well as a machine code generated by a compiler.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

A method of training a model for object classification and detection includes training a first classification model including a shared feature extractor shared by a plurality of classification models and a first classifier for outputting a classification result of an object included in a first input image based on feature values of the first input image, training a second classification model including the shared feature extractor and a second classifier for outputting a classification result about authenticity of a second input image based on feature values of the second input image, and training a third classification model including the shared feature extractor and a third classifier for outputting a classification result about a rotation angle of a third input image on the basis of feature values of the third input image extracted by the shared feature extractor, using a third training image set including images rotated at one or more angles.

Description

CROSS REFERENCE TO RELATED APPLICATIONS AND CLAIM OF PRIORITY
This application claims the benefit of Korean Patent Application No. 10-2019-0132151 filed on Oct. 23, 2019 in the Korean Intellectual Property Office, the disclosure of which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
Embodiments of the present invention relate to a technique of training a model for object classification and detection.
BACKGROUND ART
Generally, training data which is labeled with a bounding box and a class of an object existing in each image is needed to train an object detection model. In particular, since a lot of effort is required to mark the bounding box in the image, it is difficult to obtain a large amount of training data marked with the bounding box.
To solve this problem, weakly-supervised object detection (WSOD), which is capable of training a model without a bounding box, has recently been proposed. However, weakly-supervised object detection has a problem of requiring a lot of images labeled with a class. Accordingly, it is required to provide a method of training an object classification model even when only a small number of images are labeled with a class in a situation without the bounding box.
SUMMARY
Embodiments of the present invention are to provide a method and an apparatus for effectively training a model for object classification and detection.
In one general aspect, there is provided a method of training a model for object classification and detection, the method including training a first classification model including a shared feature extractor shared by a plurality of classification models and a first classifier for outputting a classification result of an object included in a first input image on the basis of feature values of the first input image extracted by the shared feature extractor, using a first training image set including an image assigned with a class; training a second classification model including the shared feature extractor and a second classifier for outputting a classification result about authenticity of a second input image on the basis of feature values of the second input image extracted by the shared feature extractor, by using a second training image set including a fake image and a real image; and training a third classification model including the shared feature extractor and a third classifier for outputting a classification result about a rotation angle of a third input image on the basis of feature values of the third input image extracted by the shared feature extractor, using a third training image set including images rotated at one or more angles.
The training of the first classification model may include training the first classification model, by using the image assigned with a class as an input data of the first classification model and the class as a target data of the first classification model.
The first classification model may further include a global average pooling (GAP) layer for outputting a location of the object in the first input image on the basis of the feature values of the first input image.
The first training image set may further include an image assigned with location information, and the training of the first classification model may include training the first classification model, by using the image assigned with a class and the image assigned with location information as an input data of the first classification model and the class and the location information as a target data of the first classification model.
The training of the second classification model may include generating the fake image by using a generative model based on a generative adversarial network (GAN).
The training of the second classification model may include training the second classification model, by using the fake image and the real image as an input data of the second classification model and authenticity corresponding to each of the fake image and the real image as a target data of the second classification model, and training the generative model to generate an image the same as the real image.
The third classification model may include an image rotator for generating the third training image set by rotating an image not assigned with a label at the one or more angles. The training of the third classification model may include training the third classification model, by using the rotated images as an input data of the third model and a rotation angle of each of the rotated images as a target data of the third model.
The first classification model, the second classification model, and the third classification model may be trained to minimize a weighted sum of a loss function of the first classification model, a loss function of the second classification model, and a loss function of the third classification model.
In another general aspect, there is provided an apparatus for training a model for object classification and detection, the apparatus comprising: a memory for storing one or more commands; and one or more processors for executing the one or more commands, wherein the one or more processors configured to train a first classification model including a shared feature extractor shared by a plurality of classification models and a first classifier for outputting a classification result of an object included in a first input image on the basis of feature values of the first input image extracted by the shared feature extractor, using a first training image set including an image assigned with a class, train a second classification model including the shared feature extractor and a second classifier for outputting a classification result about authenticity of a second input image on the basis of feature values of the second input image extracted by the shared feature extractor, by using a second training image set including a fake image and a real image, and train a third classification model including the shared feature extractor and a third classifier for outputting a classification result about a rotation angle of a third input image on the basis of feature values of the third input image extracted by the shared feature extractor, using a third training image set including images rotated at one or more angles.
The one or more processors may train the first classification model, by using the image assigned with a class as an input data of the first classification model and the class as a target data of the first classification model.
The first classification model may further include a global average pooling (GAP) layer for outputting a location of the object in the first input image on the basis of the feature values of the first input image.
The first training image set may further include an image assigned with location information, and the one or more processors may train the first classification model, by using the image assigned with a class and the image assigned with location information as an input data of the first classification model and the class and the location information as a target data of the first classification model.
The one or more processors may generate the fake image by using a generative model based on a generative adversarial network (GAN).
The one or more processors may train the second classification model, by using the fake image and the real image as an input data of the second classification model and authenticity corresponding to each of the fake image and the real image as a target data of the second classification model, and train the generative model to generate an image the same as the real image.
The third classification model may include an image rotator for generating the third training image set by rotating an image not assigned with a label at the one or more angles.
The one or more processors may train the third classification model, by using the rotated images as an input data of the third classification model and a rotation angle of each of the rotated images as a target data of the third classification model.
The first classification model, the second classification model, and the third classification model may be trained to minimize a weighted sum of a loss function of the first classification model, a loss function of the second classification model, and a loss function of the third classification model.
According to the disclosed embodiments, as a plurality of classification models sharing the same shared feature extractor is trained individually, the shared feature extractor may be sufficiently trained several times. Accordingly, since the feature extractor used for an object classification model, an object detection model and the like based on supervised learning using labeled training data is sufficiently trained, performance of the models can be enhanced. In addition, the effort, time and cost required for constructing a labeled training dataset can be reduced.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing an example of a computing environment including a computing device appropriate to be used in exemplary embodiments.
FIG. 2 is a flowchart illustrating a method of training a model for object classification and detection according to an embodiment.
FIG. 3 is a view schematically showing the configuration of a first classification model according to an embodiment.
FIG. 4 is a view schematically showing the configuration of a generative model and a second classification model according to an embodiment.
FIG. 5 is a view schematically showing the configuration of a third classification model according to an embodiment.
FIG. 6 is a view showing the overall configuration of a first classification model, a second classification model, and a third classification model according to an embodiment.
DETAILED DESCRIPTION
Hereafter, specific embodiments of the present invention will be described with reference to the accompanying drawings. The detailed description is provided below to help comprehensive understanding of the methods, apparatuses and/or systems described in this specification. However, these are only an example, and the present invention is not limited thereto.
In describing the embodiments of the present invention, when it is determined that specific description of known techniques related to the present invention unnecessarily blurs the gist of the present invention, the detailed description will be omitted. In addition, the terms described below are terms defined considering the functions of the present invention, and these may vary according to user, operator's intention, custom or the like. Therefore, definitions thereof should be determined on the basis of the full text of the specification. The terms used in the detailed description are only for describing the embodiments of the present invention and should not be restrictive. Unless clearly used otherwise, expressions of singular forms include meanings of plural forms. In the description, expressions such as “include”, “provide” and the like are for indicating certain features, numerals, steps, operations, components, some of these, or a combination thereof, and they should not be interpreted to preclude the presence or possibility of one or more other features, numerals, steps, operations, components, some of these, or a combination thereof, in addition to those described above.
The neural network may use artificial neurons simplifying the functions of biological neurons, and the artificial neurons may be interconnected through connection lines having a connection weight. The connection weight, which is a parameter of the neural network, is a specific value that the connection line has and may be expressed as connection strength. The neural network may perform a recognition action or a learning process of a human being through the artificial neurons. The artificial neuron may also be referred to as a node.
The neural network may include a plurality of layers. For example, the neural network may include an input layer, a hidden layer and an output layer. The input layer may receive an input for performing learning and transfer the input to the hidden layer, and the output layer may generate an output of the neural network on the basis of the signals received from the nodes of the hidden layer. The hidden layer is positioned between the input layer and the output layer and may convert the training data transferred through the input layer into a value easy to estimate. The nodes included in the input layer and the hidden layer are connected to each other through connection lines having a connection weight, and the nodes included in the hidden layer and the output layer may also be connected to each other through connection lines having a connection weight. The input layer, the hidden layer and the output layer may include a plurality of nodes.
The neural network may include a plurality of hidden layers. The neural network including a plurality of hidden layers is referred to as a deep neural network, and training the deep neural network is referred to as deep learning. The nodes included in the hidden layer are referred to as hidden nodes. Hereinafter, training a neural network may be understood as training parameters of the neural network. In addition, a trained neural network may be understood as a neural network to which the trained parameters are applied.
At this point, the neural network may be trained using a loss function as an objective. The loss function is an objective of the neural network for determining optimum weight parameters through the training. The neural network may be trained to minimize the resulting value of the loss function.
The neural network may be trained through supervised learning or unsupervised learning. Supervised learning is a method of inputting training data, including input data and target data corresponding to the input data, into the neural network and updating the connection weights of the connection lines so that the target data corresponding to the input data is outputted. Unsupervised learning is a method of inputting only input data into the neural network as training data, without target data corresponding to the input data, and updating the connection weights of the connection lines to find the features or structure of the input data.
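For illustration only, and not as part of the claimed subject matter, a single supervised update of the connection weights described above may be sketched as follows in PyTorch; the network size, data, and learning rate are arbitrary assumptions.

import torch
import torch.nn as nn

# Toy network: input layer -> hidden layer -> output layer.
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()                       # loss function used as the training objective
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

x = torch.randn(4, 8)                                 # input data (4 samples, 8 features)
y = torch.tensor([0, 2, 1, 0])                        # target data (class indices)

optimizer.zero_grad()
loss = loss_fn(net(x), y)                             # compare the network output with the target data
loss.backward()                                       # gradients with respect to the connection weights
optimizer.step()                                      # update the connection weights to reduce the loss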
FIG. 1 is a block diagram showing an example of a computing environment including a computing device appropriate to be used in exemplary embodiments. In the embodiment shown in the figure, each of the components may have a different function and ability in addition to those described below, and additional components other than those described below may be included.
The computing environment 10 shown in the figure includes a computing device 12. In an embodiment, the computing device 12 may be an apparatus for training a model for object classification and detection.
The computing device 12 includes at least a processor 14, a computer-readable storage medium 16, and a communication bus 18. The processor 14 may direct the computing device 12 to operate according to the exemplary embodiments described above. For example, the processor 14 may execute one or more programs stored in the computer-readable storage medium 16. The one or more programs may include one or more computer executable commands, and the computer executable commands may be configured to direct the computing device 12 to perform operations according to the exemplary embodiment when the commands are executed by the processor 14.
The computer-readable storage medium 16 is configured to store computer-executable commands and program codes, program data and/or information of other appropriate forms. The programs 20 stored in the computer-readable storage medium 16 include a set of commands that can be executed by the processor 14. In an embodiment, the computer-readable storage medium 16 may be memory (volatile memory such as random access memory, non-volatile memory, or an appropriate combination of these), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, other forms of storage media that can be accessed by the computing device 12 and is capable of storing desired information, or an appropriate combination of these.
The communication bus 18 interconnects various different components of the computing device 12, including the processor 14 and the computer-readable storage medium 16.
The computing device 12 may also include one or more input and output interfaces 22 and one or more network communication interfaces 26, which provide an interface for one or more input and output devices 24. The input and output interfaces 22 and the network communication interfaces 26 are connected to the communication bus 18. The input and output devices 24 may be connected to other components of the computing device 12 through the input and output interfaces 22. Exemplary input and output devices 24 may include input devices such as a pointing device (a mouse, a track pad, etc.), a keyboard, a touch input device (a touch pad, a touch screen, etc.), a voice or sound input device, various kinds of sensor devices and/or photographing devices, and/or output devices such as a display device, a printer, a speaker and/or a network card. The exemplary input and output devices 24 may be included inside the computing device 12 as a component configuring the computing device 12 or may be connected to the computing device 12 as a separate apparatus distinguished from the computing device 12.
FIG. 2 is a flowchart illustrating a method of training a model for object classification and detection according to an embodiment.
The method shown in FIG. 2 may be executed by the computing device 12 provided with, for example, one or more processors and a memory for storing one or more programs executed by the one or more processors. Although the method is described as being divided into a plurality of operations in the flowchart shown in the figure, at least some of the operations may be performed in a different order, combined and performed together with other operations, omitted, divided into detailed operations, or accompanied by one or more operations not shown in the figure.
Referring to FIG. 2, at step 210, the computing device 12 trains a first classification model including a shared feature extractor shared by a plurality of classification models and a first classifier for outputting a classification result of an object included in a first input image on the basis of feature values of the first input image extracted by the shared feature extractor, using a first training image set including an image assigned with a class. At this point, the class included in the first training image set may be assigned by a user.
The shared feature extractor may be configured of one or more layers to extract feature values of an input image. Although the shared feature extractor may include a convolution layer, a pooling layer and a fully connected layer in an embodiment, it is not necessarily limited thereto and may be configured in a variety of forms according to embodiments.
Specifically, the shared feature extractor may extract a feature vector including one or more feature values of an input image.
In addition, the shared feature extractor is shared among a first classification model, a second classification model and a third classification model and may be used by the first classification model, the second classification model and the third classification model to extract feature values of an input image.
The first classifier may be configured of one or more layers for outputting a classification result of an object included in the input image.
Specifically, the first classifier may output the probability that an object included in the input image is classified as a specific class among a plurality of classes set in advance, on the basis of the feature values of the input image.
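As a non-authoritative sketch (not the claimed implementation), a shared feature extractor built from convolution and pooling layers and a first classifier head producing class probabilities might look as follows; the channel counts, image resolution, and number of classes are assumptions introduced only for illustration.

import torch
import torch.nn as nn

class SharedFeatureExtractor(nn.Module):
    # Convolution and pooling layers shared by the first, second, and third classification models.
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )

    def forward(self, image):
        return self.layers(image)                     # feature maps (feature values) of the input image

class FirstClassifier(nn.Module):
    # Outputs the probability of each preset class for the object in the input image.
    def __init__(self, num_classes=10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, features):
        logits = self.fc(self.pool(features).flatten(1))
        return torch.softmax(logits, dim=1)           # per-class probabilities

extractor = SharedFeatureExtractor()
first_classifier = FirstClassifier()
probabilities = first_classifier(extractor(torch.randn(1, 3, 64, 64)))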
In an embodiment, the computing device 12 may train the first classification model by using a supervised learning technique using a first training dataset.
Specifically, the computing device 12 may train the first classification model, by using an image assigned with a class as an input data of the first classification model and the class assigned to the image as a target data of the first classification model.
For example, the computing device 12 may output a classification result of an object from an input image by using the first classification model and train the first classification model on the basis of the classification result of the object and the class assigned to the input image. At this point, the computing device 12 may update the parameters of the shared feature extractor and the first classifier.
In addition, in an embodiment, the first classification model may further include a global average pooling (GAP) layer for outputting the location of an object in the first input image on the basis of the feature values of the first input image.
The global average pooling layer may filter the feature values of the input image to include the location information of the object.
At this point, the first classification model may output location information of the object using, for example, a class activation map (CAM) algorithm.
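A minimal sketch of the class activation map idea is shown below, under the assumption that the classifier consists of a global average pooling layer followed by a single fully connected layer whose weights can be reused to weight the feature maps; the tensor shapes are placeholders.

import torch

def class_activation_map(feature_maps, fc_weight, class_index):
    # feature_maps: (C, H, W) feature values from the shared feature extractor.
    # fc_weight: (num_classes, C) weights of the fully connected layer after global average pooling.
    channel_weights = fc_weight[class_index]                          # contribution of each channel to the class
    cam = (channel_weights[:, None, None] * feature_maps).sum(dim=0)  # weighted sum over channels -> (H, W) map
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8)                                   # normalized coarse location map of the object

feature_maps = torch.randn(64, 16, 16)    # hypothetical extractor output
fc_weight = torch.randn(10, 64)           # hypothetical classifier weights
location_map = class_activation_map(feature_maps, fc_weight, class_index=3)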
Meanwhile, although it is described in the above example that location information of an object is outputted using a class activation map technique, it is not necessarily limited thereto, and the method of outputting location information of an object may be diverse according to embodiments.
For example, the computing device 12 may train the first classification model, by using an image assigned with a class and an image assigned with location information as an input data of the first classification model and using the class and the location information assigned to the image as a target data of the first classification model. At this point, the location information outputted through the class activation map technique may be used as the location information assigned to the image, or the location information may be assigned in a variety of methods.
Specifically, the computing device 12 may train the first classification model to output location information identical to the location information included in the target data. The computing device 12 may update the parameters of the shared feature extractor, the first classifier, and the global average pooling layer by using a first training dataset including an image assigned with a class and an image assigned with location information.
At step 220, the computing device 12 trains the second classification model including the shared feature extractor and a second classifier for outputting a classification result about the authenticity of a second input image on the basis of feature values of the second input image extracted by the shared feature extractor, by using a second training image set including a fake image and a real image.
In an embodiment, the computing device 12 may generate a fake image using a generative model based on a generative adversarial network (GAN).
At this point, the generative model may be a neural network which generates a fake image on the basis of probability distribution of a random latent variable.
Specifically, the computing device 12 may generate a fake image by using the generative model and generate a second training image set by assigning labels of fake and real to the fake image and the real image.
The second classifier may be configured of one or more layers for outputting a classification result about the authenticity of the input image.
Specifically, the second classifier may output a probability of whether the input image corresponds to a fake image or a real image on the basis of the feature values of the input image.
In an embodiment, the computing device 12 may train the second classification model, by using the fake image and the real image as an input data of the second classification model and authenticity corresponding to each of the fake image and the real image as a target data of the second classification model.
For example, the computing device 12 may input a fake image and a real image into the second classification model and output a classification result about the authenticity of the fake image and the real image. At this point, the computing device 12 may compare the outputted classification result and the authenticity of each of the fake image and the real image and update the parameters of the shared feature extractor and the second classifier through the result of the comparison.
In an embodiment, the computing device 12 may train the generative model by using an unsupervised learning algorithm. Specifically, the computing device 12 may train the generative model to generate an image the same as the real image. At this point, the computing device 12 may update the parameters of the generative model on the basis of the classification result outputted from the second classification model.
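An illustrative sketch of one adversarial training step is given below; for brevity it uses a plain binary real/fake objective rather than the entropy-based losses of Equations 2 and 3 presented later, and the generator architecture, latent dimension, and image size are assumptions.

import torch
import torch.nn as nn

latent_dim = 64
generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                          nn.Linear(256, 3 * 32 * 32), nn.Tanh())
# Second classification model: a feature extractor followed by a real/fake head.
extractor = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
second_classifier = nn.Linear(128, 1)

bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.SGD(list(extractor.parameters()) + list(second_classifier.parameters()), lr=0.01)
opt_g = torch.optim.SGD(generator.parameters(), lr=0.01)

real = torch.rand(8, 3 * 32 * 32) * 2 - 1            # stand-in for unlabeled real images
z = torch.randn(8, latent_dim)                       # random latent variable
fake = generator(z)                                  # fake images generated from the latent variable

# Train the second classification model to classify authenticity (real vs. fake).
opt_d.zero_grad()
d_real = second_classifier(extractor(real))
d_fake = second_classifier(extractor(fake.detach()))
loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
loss_d.backward()
opt_d.step()

# Train the generative model to produce images the second classifier judges as real.
opt_g.zero_grad()
loss_g = bce(second_classifier(extractor(fake)), torch.ones_like(d_fake))
loss_g.backward()
opt_g.step()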
At step 230, the computing device 12 trains the third classification model including the shared feature extractor and a third classifier for outputting a classification result about a rotation angle of a third input image on the basis of feature values of the third input image extracted by the shared feature extractor, by using a third training image set including images rotated at one or more angles.
The third classifier may be configured of one or more layers for outputting a classification result about the rotation angle of the input image.
Specifically, the third classifier may output the probability that the input image has been rotated by a specific angle among a plurality of rotation angles, on the basis of the feature values of the input image.
In an embodiment, the third classification model may further include an image rotator for generating a third training image set by rotating an image not assigned with a label at one or more angles.
The image rotator may receive an image not assigned with a label and rotate the image at one or more angles. In addition, the image rotator may assign a rotation angle to each rotated image as a label.
In an embodiment, the computing device 12 may train the third classification model using a self-supervised learning algorithm.
Specifically, the computing device 12 may generate images rotated at one or more angles by using the image rotator included in the third classification model, and train the third classification model by using the rotated images as input data of the third classification model and the rotation angle of each of the rotated images as target data of the third classification model.
For example, the computing device 12 may input the rotated images into the third classification model and output a classification result about the rotation angle of each of the rotated images. At this point, the computing device 12 may compare the outputted classification result and the rotation angle of each of the rotated images and update the parameters of the shared feature extractor and the third classifier on the basis of the result of the comparison.
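Below is an illustrative sketch of such a rotation-based self-supervised step: unlabeled images are rotated by 0, 90, 180, and 270 degrees, the rotation index serves as the label, and the shared extractor plus a rotation head are updated. The architectures and sizes are assumptions, not taken from the specification.

import torch
import torch.nn as nn

def rotate_batch(images):
    # Rotate each unlabeled image by 0/90/180/270 degrees and label it with the rotation index.
    rotated, labels = [], []
    for k in range(4):                                   # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

extractor = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten())
third_classifier = nn.Linear(32, 4)                      # four rotation-angle classes

images = torch.randn(8, 3, 32, 32)                       # stand-in for images not assigned with a label
rotated, angles = rotate_batch(images)

opt = torch.optim.SGD(list(extractor.parameters()) + list(third_classifier.parameters()), lr=0.01)
opt.zero_grad()
loss_r = nn.CrossEntropyLoss()(third_classifier(extractor(rotated)), angles)
loss_r.backward()
opt.step()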
Meanwhile, although the method is described as being divided into a plurality of steps in the flowchart shown in FIG. 2, at least some of the steps may be performed in a different order, combined and performed together with other steps, omitted, divided into detailed steps, or accompanied by one or more steps not shown in the figure.
FIG. 3 is a view schematically showing the configuration of a first classification model according to an embodiment.
Referring to FIG. 3, the first classification model 300 may include a shared feature extractor 310, a first classifier 320, and a global average pooling layer 330.
Specifically, it is assumed that a first training image set including an image assigned with a class and location information is inputted into the first classification model 300.
The shared feature extractor 310 may extract feature values of an inputted image.
The first classifier 320 may output a classification result of an object included in the image on the basis of the feature values of the image.
The global average pooling layer 330 may filter the feature values of the image and output the location information of the object.
At this point, the computing device 12 may calculate a loss function of the first classification model 300 by using the result outputted through the first classifier 320 and the global average pooling layer 330 and the class and the location information assigned to the inputted image.
In an embodiment, the loss function of the first classification model may be a loss function based on cross entropy.
For example, the loss function LI of the first classification model may be as shown below in Equation 1.
L_I = -\sum_{i=1}^{K} y_i \log p(y_i)    [Equation 1]
In Equation 1, y denotes the class assigned to the image, i denotes an index of the class, K denotes the number of classes, and p(y) denotes the probability that the classification result of the object is outputted as y.
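For instance, under the illustrative assumption that K = 3 and the image is assigned to the second class, so that (y_1, y_2, y_3) = (0, 1, 0), a first classifier output of (p(y_1), p(y_2), p(y_3)) = (0.2, 0.7, 0.1) yields L_I = -\log 0.7 ≈ 0.36 from Equation 1; the loss approaches zero as the probability assigned to the correct class approaches one.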
Meanwhile, although it is described in the above example that the loss function based on cross entropy is set in the first classification model, it is not necessarily limited thereto, and the loss function of the first classification model may be diverse according to embodiments.
FIG. 4 is a view schematically showing the configuration of a generative model and a second classification model according to an embodiment.
As shown in FIG. 4, the second classification model 420 may be trained together with the generative model 410.
Referring to FIG. 4, the second classification model 420 may include a shared feature extractor 310 and a second classifier 421.
Specifically, the generative model 410 may generate a fake image, and an image not assigned with a label may be used as a real image.
Then, the second classification model 420 may receive a second training dataset including the fake image and the real image.
The shared feature extractor 310 may extract feature values of the fake image and the real image.
The second classifier 421 may output a classification result about the authenticity of each of the fake image and the real image on the basis of the feature values of the fake image and the real image.
At this point, the computing device 12 may calculate a loss function of the second classification model 420 by using the result outputted through the second classifier 421 and the authenticity corresponding to the fake image and the real image.
For example, the loss function LD of the second classification model 420 may be as shown below in Equation 2.
L_D = -H_X[p(y)] + E_{x \sim X}[H[p(y|F(x))]] - E_{z \sim p(z)}[H[p(y|G(z))]]    [Equation 2]
In Equation 2, H denotes the entropy function, E denotes an expectation value of the function, p(y) denotes the probability that the classification result about the authenticity of the fake image and the real image is outputted as y from the second classifier 421, p(y|F(x)) denotes the probability that the classification result about the authenticity of the real image is outputted as y from the second classifier 421, p(y|G(z)) denotes the probability that the classification result about the authenticity of the fake image is outputted as y, x~X denotes data sampled from the probability distribution of the real image, and z~p(z) denotes data sampled from a latent space using a Gaussian distribution.
In addition, the loss function LG of the generative model 410 may be as shown below in Equation 3.
L_G = -H_G[p(y)] + E_{z \sim p(z)}[H[p(y|G(z))]]    [Equation 3]
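As a non-authoritative sketch, the entropy terms of Equations 2 and 3 may be estimated from the softmax outputs of the second classifier over a mini-batch roughly as follows; the batch size, number of classes, and random placeholder outputs are assumptions, and p(y|·) is interpreted here as the class-probability output appearing in the equations.

import torch
import torch.nn.functional as F

def entropy(p, dim=-1, eps=1e-8):
    # Shannon entropy H[p] of a (batch of) probability distribution(s).
    return -(p * (p + eps).log()).sum(dim=dim)

# Placeholder classifier outputs: class probabilities for real and generated images.
p_real = F.softmax(torch.randn(16, 10), dim=1)     # p(y | F(x)), x ~ X
p_fake = F.softmax(torch.randn(16, 10), dim=1)     # p(y | G(z)), z ~ p(z)

marginal_real = p_real.mean(dim=0)                 # p(y) estimated over the real samples
marginal_fake = p_fake.mean(dim=0)                 # p(y) estimated over the generated samples

# Equation 2: marginal entropy over real data, conditional entropies of real and fake samples.
L_D = -entropy(marginal_real) + entropy(p_real, dim=1).mean() - entropy(p_fake, dim=1).mean()
# Equation 3: marginal entropy over generated data plus conditional entropy of fake samples.
L_G = -entropy(marginal_fake) + entropy(p_fake, dim=1).mean()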
FIG. 5 is a view schematically showing the configuration of a third classification model according to an embodiment.
Referring to FIG. 5, the third classification model 500 may include an image rotator 510, a shared feature extractor 310, and a third classifier 520.
When an image not assigned with a label is inputted, the image rotator 510 may generate a third training dataset by using the image not assigned with a label. At this point, the third training dataset may include rotated images and rotation angles of the rotated images.
The shared feature extractor 310 may extract feature values of the rotated images.
The third classifier 520 may output a classification result about a rotation angle of a rotated image on the basis of the feature values of the rotated image.
At this point, the computing device 12 may calculate a loss function of the third classification model 500 by using the result outputted through the third classifier 520 and the rotation angle of the image rotated by the image rotator 510.
For example, the loss function LR of the third classification model 500 may be as shown below in Equation 4.
L_R = -\sum_{i=1}^{4} r_i \log p(r_i)    [Equation 4]
In Equation 4, r denotes the rotation angle of an image rotated by the image rotator 510, i denotes an index of the rotation angle, and p(r) denotes the probability that the classification result about the rotation angle of the rotated image is outputted as r from the third classifier 520.
FIG. 6 is a view showing the overall configuration of a first classification model, a second classification model, and a third classification model according to an embodiment.
Referring to FIG. 6, even when the amount of images assigned with a label is small, the computing device 12 may repeatedly train the shared feature extractor 310 by training the first classification model 300, the generative model 410, the second classification model 420, and the third classification model 500 together.
At this point, although the computing device 12 may train the first classification model 300, the generative model 410, the second classification model 420, and the third classification model 500 by using a round-robin method, the method of training each model may be diverse according to embodiments.
Specifically, the computing device 12 may individually train the shared feature extractor 310, the first classifier 320, and the global average pooling layer 330 included in the first classification model 300 by using the first training dataset.
After training the first classification model, the computing device 12 may individually train the generative model 410, and the shared feature extractor 310 and the second classifier 421 included in the second classification model 420 by using the second training dataset.
In addition, after training the second classification model, the computing device 12 may individually train the shared feature extractor 310 and the third classifier 520 included in the third classification model 500 by using the third training dataset.
As described above, the computing device 12 may sequentially train the first classification model 300, the generative model 410, the second classification model 420, and the third classification model 500.
Meanwhile, although the computing device 12 may train the first classification model 300, the generative model 410, the second classification model 420, and the third classification model 500 by using the stochastic gradient descent (SGD) algorithm, it is not necessarily limited thereto, and the training method may be diverse according to embodiments.
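A minimal sketch of the round-robin schedule mentioned above is given below, under the assumption that each per-task step updates the shared feature extractor together with its own head (and the generator for the second task); the step functions and batches are placeholders, not an implementation of the claimed method.

# Placeholder per-task training steps.
def train_first_classification_step(labeled_batch):
    pass  # supervised step with class (and location) labels

def train_second_classification_step(unlabeled_batch):
    pass  # adversarial step with real and generated images

def train_third_classification_step(unlabeled_batch):
    pass  # self-supervised step with rotated images

labeled_batches = [None] * 10      # stand-ins for batches drawn from the training image sets
unlabeled_batches = [None] * 10

# Round-robin schedule: one step of each task in turn, so the shared feature
# extractor receives an update from every task in every cycle.
for labeled, unlabeled in zip(labeled_batches, unlabeled_batches):
    train_first_classification_step(labeled)
    train_second_classification_step(unlabeled)
    train_third_classification_step(unlabeled)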
In addition, in an embodiment, the first classification model 300, the generative model 410, the second classification model 420, and the third classification model 500 may be trained to minimize the weighted sum of the loss function of the first classification model 300, the loss function of the generative model 410, the loss function of the second classification model 420, and the loss function of the third classification model 500.
At this point, the weighted sum of the loss function of the first classification model 300, the loss function of the generative model 410, the loss function of the second classification model 420, and the loss function of the third classification model 500 may be expressed below as shown in Equation 5.
L_{total} = \lambda_I L_I + \lambda_{GAN}(L_D + L_G) + \lambda_R L_R    [Equation 5]
At this point, in Equation 5, \lambda_I, \lambda_{GAN}, and \lambda_R denote the weight values applied to the respective loss functions.
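Under the same illustrative assumptions, Equation 5 corresponds to combining the individual losses into one weighted objective before back-propagation; the numeric loss values and weights below are placeholders.

import torch

# Placeholder scalar losses, as if computed by the sketches above.
L_I = torch.tensor(0.9, requires_grad=True)      # first classification model loss
L_D = torch.tensor(0.7, requires_grad=True)      # second classification model loss
L_G = torch.tensor(0.5, requires_grad=True)      # generative model loss
L_R = torch.tensor(0.3, requires_grad=True)      # third classification model loss

lambda_I, lambda_GAN, lambda_R = 1.0, 0.5, 0.5   # assumed weight values

L_total = lambda_I * L_I + lambda_GAN * (L_D + L_G) + lambda_R * L_R
L_total.backward()                                # one backward pass over the weighted sum of Equation 5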
Then, the user may determine the first classification model including the shared feature extractor 310, the first classifier 320 and the global average pooling layer 330 as an object detection model.
Accordingly, as a plurality of classification models sharing the shared feature extractor 310 is individually trained, the shared feature extractor 310 may be sufficiently trained several times. In addition, since the second classification model 420 and the third classification model 500 can be trained by using images not assigned with a label, the performance of the shared feature extractor 310 for extracting features may be enhanced even when the amount of images assigned with a label is small.
Meanwhile, although it is described in FIG. 6 that each classification model is trained using a neural network configured of the first classification model, the second classification model and the third classification model, it is not necessarily limited thereto.
For example, in addition to the first classification model, the second classification model, and the third classification model, a further classification model that can be trained by using an image not assigned with a label may be included, such as a classification model trained through weakly-supervised learning, semi-supervised learning, self-supervised learning, unsupervised learning, or the like.
Meanwhile, the embodiments of the present invention may include programs for performing the methods described in this specification on a computer and computer-readable recording media including the programs. The computer-readable recording media may store program commands, local data files, local data structures and the like independently or in combination. The media may be specially designed and configured for the present invention or may be commonly used in the field of computer software. Examples of the computer-readable recording media include magnetic media such as a hard disk, a floppy disk and a magnetic tape, optical recording media such as CD-ROM and DVD, and hardware devices specially configured to store and execute program commands, such as ROM, RAM, flash memory and the like. An example of the program may include a high-level language code that can be executed by a computer using an interpreter or the like, as well as a machine code generated by a compiler.
The technical features have been described above focusing on embodiments. However, the disclosed embodiments should be considered from the descriptive viewpoint, not the restrictive viewpoint, and the scope of the present invention is defined by the claims, not by the descriptions described above, and all the differences within the equivalent scope should be interpreted as being included in the scope of the present invention.

Claims (18)

The invention claimed is:
1. A method of training a model for object classification and detection, the method comprising:
training a first classification model comprising a shared feature extractor shared by a plurality of classification models and a first classifier for outputting a classification result of an object included in a first input image on the basis of feature values of the first input image extracted by the shared feature extractor, using a first training image set including an image assigned with a class;
training a second classification model comprising the shared feature extractor and a second classifier for outputting a classification result about authenticity of a second input image on the basis of feature values of the second input image extracted by the shared feature extractor, by using a second training image set including a fake image and a real image; and
training a third classification model comprising the shared feature extractor and a third classifier for outputting a classification result about a rotation angle of a third input image on the basis of feature values of the third input image extracted by the shared feature extractor, using a third training image set including images rotated at one or more angles.
2. The method of claim 1, wherein the training of the first classification model comprises training the first classification model, by using the image assigned with a class as an input data of the first classification model and the class as a target data of the first classification model.
3. The method of claim 2, wherein the first classification model further comprises a global average pooling (GAP) layer for outputting a location of the object in the first input image on the basis of the feature values of the first input image.
4. The method of claim 3, wherein the first training image set further comprises an image assigned with location information, and the training of the first classification model comprises training the first classification model, by using the image assigned with a class and the image assigned with location information as an input data of the first classification model and the class and the location information as a target data of the first classification model.
5. The method of claim 1, wherein the training of the second classification model comprises generating the fake image by using a generative model based on a generative adversarial network (GAN).
6. The method of claim 5, wherein the training of the second classification model comprises training the second classification model, by using the fake image and the real image as an input data of the second classification model and authenticity corresponding to each of the fake image and the real image as a target data of the second classification model, and training the generative model to generate an image the same as the real image.
7. The method of claim 1, wherein the third classification model comprises an image rotator for generating the third training image set by rotating an image not assigned with a label at the one or more angles.
8. The method of claim 1, wherein the training of the third classification model comprises training the third classification model, by using the rotated images as an input data of the third model and a rotation angle of each of the rotated images as a target data of the third model.
9. The method of claim 1, wherein the first classification model, the second classification model, and the third classification model are trained to minimize a weighted sum of a loss function of the first classification model, a loss function of the second classification model, and a loss function of the third classification model.
10. An apparatus for training a model for object classification and detection, the apparatus comprising:
a memory for storing one or more commands; and
one or more processors for executing the one or more commands,
wherein the one or more processors are configured to:
train a first classification model comprising a shared feature extractor shared by a plurality of classification models and a first classifier for outputting a classification result of an object included in a first input image on the basis of feature values of the first input image extracted by the shared feature extractor, using a first training image set including an image assigned with a class;
train a second classification model comprising the shared feature extractor and a second classifier for outputting a classification result about authenticity of a second input image on the basis of feature values of the second input image extracted by the shared feature extractor, by using a second training image set including a fake image and a real image; and
train a third classification model comprising the shared feature extractor and a third classifier for outputting a classification result about a rotation angle of a third input image on the basis of feature values of the third input image extracted by the shared feature extractor, using a third training image set including images rotated at one or more angles.
11. The apparatus of claim 10, wherein the one or more processors are further configured to train the first classification model, by using the image assigned with a class as an input data of the first classification model and the class as a target data of the first classification model.
12. The apparatus of claim 11, wherein the first classification model further comprises a global average pooling (GAP) layer for outputting a location of the object in the first input image on the basis of the feature values of the first input image.
13. The apparatus of claim 12, wherein the first training image set further comprises an image assigned with location information, and the one or more processors are further configured to train the first classification model, by using the image assigned with a class and the image assigned with location information as an input data of the first classification model and the class and the location information as a target data of the first classification model.
14. The apparatus of claim 10, wherein the one or more processors are further configured to generate the fake image by using a generative model based on a generative adversarial network (GAN).
15. The apparatus of claim 14, wherein the one or more processors are further configured to train the second classification model, by using the fake image and the real image as an input data of the second classification model and authenticity corresponding to each of the fake image and the real image as a target data of the second classification model, and train the generative model to generate an image the same as the real image.
16. The apparatus of claim 10, wherein the third classification model further comprises an image rotator for generating the third training image set by rotating an image not assigned with a label at the one or more angles.
17. The apparatus of claim 10, wherein the one or more processors are further configured to train the third classification model, by using the rotated images as an input data of the third model and a rotation angle of each of the rotated images as a target data of the third model.
18. The apparatus of claim 10, wherein the first classification model, the second classification model, and the third classification model are trained to minimize a weighted sum of a loss function of the first classification model, a loss function of the second classification model, and a loss function of the third classification model.
US16/666,051 2019-10-23 2019-10-28 Method and apparatus for training model for object classification and detection Active 2039-12-30 US10990852B1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020190132151A KR20210048187A (en) 2019-10-23 2019-10-23 Method and apparatus for training model for object classification and detection
KR10-2019-0132151 2019-10-23

Publications (2)

Publication Number Publication Date
US10990852B1 true US10990852B1 (en) 2021-04-27
US20210125000A1 US20210125000A1 (en) 2021-04-29

Family

ID=75587129

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/666,051 Active 2039-12-30 US10990852B1 (en) 2019-10-23 2019-10-28 Method and apparatus for training model for object classification and detection

Country Status (2)

Country Link
US (1) US10990852B1 (en)
KR (1) KR20210048187A (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210149530A (en) * 2020-06-02 2021-12-09 삼성에스디에스 주식회사 Method for training image classification model and apparatus for executing the same
WO2022245191A1 (en) * 2021-05-21 2022-11-24 Endoai Co., Ltd. Method and apparatus for learning image for detecting lesions
CN114120420B (en) * 2021-12-01 2024-02-13 北京百度网讯科技有限公司 Image detection method and device
KR20230135383A (en) 2022-03-16 2023-09-25 삼성에스디에스 주식회사 Method and apparatus for training fake image discriminatve model
KR102507273B1 (en) * 2022-07-21 2023-03-08 (주)트루엔 Electronic device and operation method of electronic device for detecting and classifying object

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170364733A1 (en) * 2015-08-26 2017-12-21 Digitalglobe, Inc. System for simplified generation of systems for broad area geospatial object detection
US10664722B1 (en) * 2016-10-05 2020-05-26 Digimarc Corporation Image processing arrangements
US20180129930A1 (en) 2016-11-07 2018-05-10 Korea Advanced Institute Of Science And Technology Learning method based on deep learning model having non-consecutive stochastic neuron and knowledge transfer, and system thereof
US20180285663A1 (en) * 2017-03-31 2018-10-04 Here Global B.V. Method and apparatus for augmenting a training data set

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220092812A1 (en) * 2020-09-24 2022-03-24 Eagle Technology, Llc Artificial intelligence (ai) system and methods for generating estimated height maps from electro-optic imagery
US11587249B2 (en) * 2020-09-24 2023-02-21 Eagle Technology, Llc Artificial intelligence (AI) system and methods for generating estimated height maps from electro-optic imagery
US11747468B2 (en) 2020-09-24 2023-09-05 Eagle Technology, Llc System using a priori terrain height data for interferometric synthetic aperture radar (IFSAR) phase disambiguation and related methods
CN113807281A (en) * 2021-09-23 2021-12-17 深圳信息职业技术学院 Image detection model generation method, detection method, terminal and storage medium
CN113807281B (en) * 2021-09-23 2024-03-29 深圳信息职业技术学院 Image detection model generation method, detection method, terminal and storage medium
CN114283290A (en) * 2021-09-27 2022-04-05 腾讯科技(深圳)有限公司 Training of image processing model, image processing method, device, equipment and medium
CN114283290B (en) * 2021-09-27 2024-05-03 腾讯科技(深圳)有限公司 Training of image processing model, image processing method, device, equipment and medium
CN114360008A (en) * 2021-12-23 2022-04-15 上海清鹤科技股份有限公司 Generation method of face authentication model, authentication method, equipment and storage medium
CN114937086A (en) * 2022-07-19 2022-08-23 北京鹰瞳科技发展股份有限公司 Training method and detection method for multi-image target detection and related products
CN114937086B (en) * 2022-07-19 2022-11-01 北京鹰瞳科技发展股份有限公司 Training method and detection method for multi-image target detection and related products
CN117746482A (en) * 2023-12-20 2024-03-22 北京百度网讯科技有限公司 Training method, detection method, device and equipment of face detection model

Also Published As

Publication number Publication date
US20210125000A1 (en) 2021-04-29
KR20210048187A (en) 2021-05-03

Similar Documents

Publication Publication Date Title
US10990852B1 (en) Method and apparatus for training model for object classification and detection
US10860928B2 (en) Generating output data items using template data items
US11087086B2 (en) Named-entity recognition through sequence of classification using a deep learning neural network
CN109328362B (en) Progressive neural network
US20200134455A1 (en) Apparatus and method for training deep learning model
US20200134454A1 (en) Apparatus and method for training deep learning model
US20170213150A1 (en) Reinforcement learning using a partitioned input state space
US11010664B2 (en) Augmenting neural networks with hierarchical external memory
CN106170800A (en) Student DNN is learnt via output distribution
JP7494316B2 (en) Self-supervised representation learning using bootstrapped latent representations
KR20190056940A (en) Method and device for learning multimodal data
JP6521440B2 (en) Neural network and computer program therefor
US11823480B2 (en) Method for training image classification model and apparatus for executing the same
US11176424B2 (en) Method and apparatus for measuring confidence
Pulgar et al. On the impact of imbalanced data in convolutional neural networks performance
US11468267B2 (en) Apparatus and method for classifying image
KR20230141828A (en) Neural networks using adaptive gradient clipping
KR20200134813A (en) Apparatus and method for image processing for machine learning
KR102408186B1 (en) Method for preprocessing structured learning data
US20240212329A1 (en) Method for learning artificial neural network based knowledge distillation and computing device for executing the same
US20230140444A1 (en) Document classification method and document classification device
US20220398833A1 (en) Information processing device, learning method, and recording medium
CN118786440A (en) Discovering neural networks and feature representation neural networks using self-supervised learning training objects

Legal Events

Date Code Title Description
AS Assignment

Owner name: SAMSUNG SDS CO., LTD., KOREA, REPUBLIC OF

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:KIM, BYOUNG-JIP;CHOI, JONG-WON;CHOI, YOUNG-JOON;AND OTHERS;REEL/FRAME:050848/0018

Effective date: 20191025

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4