WO2023085190A1 - Teaching data generation method, teaching data generation program, information processing device, information processing method and information processing program - Google Patents

Teaching data generation method, teaching data generation program, information processing device, information processing method and information processing program

Info

Publication number
WO2023085190A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
default box
data
information processing
label data
Prior art date
Application number
PCT/JP2022/041036
Other languages
French (fr)
Japanese (ja)
Inventor
貴裕 平野
Original Assignee
ソニーセミコンダクタソリューションズ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ソニーセミコンダクタソリューションズ株式会社
Priority to JP2023559595A (JPWO2023085190A1)
Publication of WO2023085190A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis

Definitions

  • the present disclosure relates to a teacher data generation method, a teacher data generation program, an information processing device, an information processing method, and an information processing program.
  • SSD: Single Shot MultiBox Detector
  • CNN: Convolutional Neural Network
  • GT: rectangular ground truth
  • When generating teacher data, first, a plurality of input images in which objects appear and labeled images, to which rectangular label data surrounding the object in each input image has been added, are prepared. Then, image features are extracted from each input image to generate a feature map.
  • In the matching between a default box and the label data, an IoU (Intersection over Union) value is calculated, indicating the area of the overlap between the default box and the label data relative to the total area that the two occupy in the feature map.
  • A default box whose IoU value is greater than or equal to a threshold is determined as the GT to generate teacher data. As the GT, therefore, a default box whose position, size, and aspect ratio in the feature map are similar to those of the label data is necessarily selected.
  • During machine learning, a large amount of teacher data generated in this way is input to the CNN.
  • The CNN repeats the process of detecting an object from each input teacher data image.
  • The CNN then adjusts each parameter of the network so that the difference between the position, shape, and size of the detected object in the feature map and the position, shape, and size of the GT in the feature map is reduced.
  • As a result, the CNN can place the default boxes obtained through learning on a feature map extracted from an unknown input image, and can derive the accuracy of the position, shape, and type of objects appearing in the input image from the pixel data inside those default boxes.
  • the present disclosure proposes a teacher data generation method, a teacher data generation program, an information processing device, an information processing method, and an information processing program that can improve object detection accuracy.
  • A teacher data generation method according to the present disclosure is a teacher data generation method executed by a computer, in which a shape transformation is performed on a default box arranged on a feature map extracted from an image and on label data given to an object in the image, and a default box to be used as the ground truth of the image is determined by matching the shape-transformed default box against the label data, thereby generating teacher data.
  • FIG. 1 is a block diagram showing a configuration example of a vehicle control system according to the present disclosure.
  • FIG. 2 is a diagram showing an example of a general teacher data generation method.
  • Several subsequent figures show verification results of object detection.
  • Two figures show verification results of default box matching during learning.
  • One figure shows an example of the shape conversion according to the embodiment.
  • One figure shows a verification result of default box matching before the shape conversion, and another shows the verification result after the shape conversion.
  • Further figures show verification results of object detection.
  • FIG. 1 is a block diagram showing a configuration example of a vehicle control system 11, which is an example of a mobile device control system to which the present technology is applied.
  • the vehicle control system 11 is provided in the vehicle 1 and performs processing related to driving support and automatic driving of the vehicle 1.
  • The vehicle control system 11 includes a vehicle control ECU (Electronic Control Unit) 21, a communication unit 22, a map information accumulation unit 23, a position information acquisition unit 24, an external recognition sensor 25, an in-vehicle sensor 26, a vehicle sensor 27, a storage unit 28, a driving support/automatic driving control unit 29, a DMS (Driver Monitoring System) 30, an HMI (Human Machine Interface) 31, and a vehicle control unit 32.
  • The vehicle control ECU 21, communication unit 22, map information accumulation unit 23, position information acquisition unit 24, external recognition sensor 25, in-vehicle sensor 26, vehicle sensor 27, storage unit 28, driving support/automatic driving control unit 29, driver monitoring system (DMS) 30, human machine interface (HMI) 31, and vehicle control unit 32 are connected via a communication network 41 so as to be able to communicate with each other.
  • The communication network 41 is composed of an in-vehicle communication network, a bus, or the like conforming to a digital two-way communication standard such as CAN (Controller Area Network), LIN (Local Interconnect Network), LAN (Local Area Network), FlexRay (registered trademark), or Ethernet (registered trademark).
  • The communication network 41 may be used selectively depending on the type of data to be transmitted; for example, CAN may be applied to data related to vehicle control, and Ethernet may be applied to large-capacity data.
  • Each part of the vehicle control system 11 may also be connected directly, without going through the communication network 41, using wireless communication intended for relatively short ranges, such as NFC (Near Field Communication) or Bluetooth (registered trademark).
  • the communication unit 22 communicates with various devices inside and outside the vehicle, other vehicles, servers, base stations, etc., and transmits and receives various data.
  • The map information accumulation unit 23 accumulates one or both of a map obtained from the outside and a map created by the vehicle 1. For example, the map information accumulation unit 23 accumulates a three-dimensional high-precision map, a global map that is lower in accuracy than the high-precision map but covers a wide area, and the like.
  • the position information acquisition unit 24 receives GNSS signals from GNSS (Global Navigation Satellite System) satellites and acquires the position information of the vehicle 1 .
  • the acquired position information is supplied to the driving support/automatic driving control unit 29 .
  • the location information acquisition unit 24 is not limited to the method using GNSS signals, and may acquire location information using beacons, for example.
  • the external recognition sensor 25 includes various sensors used for recognizing situations outside the vehicle 1 and supplies sensor data from each sensor to each part of the vehicle control system 11 .
  • the type and number of sensors included in the external recognition sensor 25 are arbitrary.
  • the external recognition sensor 25 includes a camera 51, a radar 52, a LiDAR (Light Detection and Ranging, Laser Imaging Detection and Ranging) 53, and an ultrasonic sensor 54.
  • the in-vehicle sensor 26 includes various sensors for detecting information inside the vehicle, and supplies sensor data from each sensor to each part of the vehicle control system 11 .
  • the types and number of various sensors included in the in-vehicle sensor 26 are not particularly limited as long as they are the types and number that can be realistically installed in the vehicle 1 .
  • For example, the in-vehicle sensor 26 may comprise one or more of cameras, radars, seat sensors, steering wheel sensors, microphones, and biometric sensors.
  • the vehicle sensor 27 includes various sensors for detecting the state of the vehicle 1, and supplies sensor data from each sensor to each section of the vehicle control system 11.
  • the types and number of various sensors included in the vehicle sensor 27 are not particularly limited as long as the types and number are practically installable in the vehicle 1 .
  • the vehicle sensor 27 includes a velocity sensor, an acceleration sensor, an angular velocity sensor (gyro sensor), and an inertial measurement unit (IMU (Inertial Measurement Unit)) integrating them.
  • the storage unit 28 includes at least one of a nonvolatile storage medium and a volatile storage medium, and stores data and programs.
  • The storage unit 28 is used as, for example, an EEPROM (Electrically Erasable Programmable Read Only Memory) and a RAM (Random Access Memory); as the storage medium, a magnetic storage device such as an HDD (Hard Disc Drive), a semiconductor storage device, an optical storage device, or a magneto-optical storage device can be applied.
  • the storage unit 28 stores various programs and data used by each unit of the vehicle control system 11 .
  • the driving support/automatic driving control unit 29 controls driving support and automatic driving of the vehicle 1 .
  • the driving support/automatic driving control unit 29 includes an analysis unit 61 , an action planning unit 62 and an operation control unit 63 .
  • the analysis unit 61 analyzes the vehicle 1 and its surroundings.
  • the analysis unit 61 includes a self-position estimation unit 71 , a sensor fusion unit 72 and a recognition unit 73 .
  • the self-position estimation unit 71 estimates the self-position of the vehicle 1 based on the sensor data from the external recognition sensor 25 and the high-precision map accumulated in the map information accumulation unit 23.
  • The sensor fusion unit 72 performs sensor fusion processing that combines a plurality of different types of sensor data (for example, image data supplied from the camera 51 and sensor data supplied from the LiDAR 53 and the radar 52) to obtain new information. Methods for combining different types of sensor data include integration, fusion, federation, and the like.
  • the recognition unit 73 executes a detection process for detecting the situation outside the vehicle 1 and a recognition process for recognizing the situation outside the vehicle 1 .
  • The recognition unit 73 performs detection processing and recognition processing of the situation outside the vehicle 1 based on information from the external recognition sensor 25, information from the self-position estimation unit 71, information from the sensor fusion unit 72, and the like.
  • the recognition unit 73 performs detection processing and recognition processing of objects around the vehicle 1 .
  • Object detection processing is, for example, processing for detecting the presence or absence, size, shape, position, movement, and the like of an object.
  • Object recognition processing is, for example, processing for recognizing an attribute such as the type of an object or identifying a specific object.
  • detection processing and recognition processing are not always clearly separated, and may overlap.
  • The recognition unit 73 detects objects around the vehicle 1 by performing clustering that classifies a point cloud based on sensor data from the radar 52, the LiDAR 53, or the like into clusters of points. As a result, the presence or absence, size, shape, and position of objects around the vehicle 1 are detected.
  • The recognition unit 73 detects the movement of objects around the vehicle 1 by performing tracking that follows the movement of the clusters of points classified by the clustering. As a result, the speed and traveling direction (movement vector) of objects around the vehicle 1 are detected.
  • the recognition unit 73 detects or recognizes vehicles, people, bicycles, obstacles, structures, roads, traffic lights, traffic signs, road markings, etc. based on image data supplied from the camera 51 . Further, the recognition unit 73 may recognize types of objects around the vehicle 1 by performing recognition processing such as semantic segmentation.
  • the action plan section 62 creates an action plan for the vehicle 1.
  • the action planning unit 62 creates an action plan by performing route planning and route following processing.
  • Route planning (global path planning) is the process of planning a rough route from the start to the goal. This route planning also includes what is called trajectory planning: processing that, along the planned route, generates a trajectory (local path planning) on which the vehicle 1 can proceed safely and smoothly in its vicinity, in consideration of the motion characteristics of the vehicle 1.
  • the motion control unit 63 controls the motion of the vehicle 1 in order to implement the action plan created by the action planning unit 62.
  • the DMS 30 performs driver authentication processing, driver state recognition processing, etc., based on sensor data from the in-vehicle sensor 26 and input data input to the HMI 31, which will be described later.
  • As the state of the driver to be recognized, for example, physical condition, wakefulness, concentration, fatigue, gaze direction, drunkenness, driving operation, posture, and the like are assumed.
  • the HMI 31 inputs various data, instructions, etc., and presents various data to the driver or the like.
  • the vehicle control unit 32 controls each unit of the vehicle 1.
  • the vehicle control section 32 includes a steering control section 81 , a brake control section 82 , a drive control section 83 , a body system control section 84 , a light control section 85 and a horn control section 86 .
  • the steering control unit 81 detects and controls the state of the steering system of the vehicle 1 .
  • the steering system includes, for example, a steering mechanism including a steering wheel, an electric power steering, and the like.
  • the steering control unit 81 includes, for example, a steering ECU that controls the steering system, an actuator that drives the steering system, and the like.
  • the brake control unit 82 detects and controls the state of the brake system of the vehicle 1 .
  • the brake system includes, for example, a brake mechanism including a brake pedal, an ABS (Antilock Brake System), a regenerative brake mechanism, and the like.
  • the brake control unit 82 includes, for example, a brake ECU that controls the brake system, an actuator that drives the brake system, and the like.
  • the drive control unit 83 detects and controls the state of the drive system of the vehicle 1 .
  • the drive system includes, for example, an accelerator pedal, a driving force generator for generating driving force such as an internal combustion engine or a driving motor, and a driving force transmission mechanism for transmitting the driving force to the wheels.
  • the drive control unit 83 includes, for example, a drive ECU that controls the drive system, an actuator that drives the drive system, and the like.
  • the body system control unit 84 detects and controls the state of the body system of the vehicle 1 .
  • the body system includes, for example, a keyless entry system, smart key system, power window device, power seat, air conditioner, air bag, seat belt, shift lever, and the like.
  • the body system control unit 84 includes, for example, a body system ECU that controls the body system, an actuator that drives the body system, and the like.
  • the light control unit 85 detects and controls the states of various lights of the vehicle 1 .
  • Lights to be controlled include, for example, headlights, backlights, fog lights, turn signals, brake lights, projections, bumper displays, and the like.
  • the light control unit 85 includes a light ECU that controls the light, an actuator that drives the light, and the like.
  • the horn control unit 86 detects and controls the state of the car horn of the vehicle 1 .
  • the horn control unit 86 includes, for example, a horn ECU for controlling the car horn, an actuator for driving the car horn, and the like.
  • a general object detection model is an SSD (Single Shot MultiBox Detector).
  • the SSD comprises a Convolutional Neural Network (CNN) that is machine-learned to detect objects from input images.
  • CNN machine learning uses teacher data in which the type (class) of an object included in the image and a rectangular ground truth (GT) indicating the area of the object in the image are given to the image.
  • FIG. 2 is a diagram showing an example of a general training data generation method.
  • When generating teacher data, first, an input image Pa showing an object Tg and a labeled image Pb, to which rectangular label data Ld surrounding the object Tg shown in the input image Pa has been added, are prepared. Then, image features are extracted from the input image Pa to generate a feature map Fm.
  • a plurality of default boxes Db with different aspect ratios and sizes for each layer of CNN are sequentially arranged at arbitrary positions on the feature map Fm.
  • the arranged default box Db and the label data Ld are matched to determine the default box Db to be the GT of the input image Pa, thereby generating teacher data.
  • In the matching between the default box Db and the label data Ld, an IoU (Intersection over Union) value is calculated, indicating the area of the overlap between the default box Db and the label data Ld relative to the total area that the two occupy in the feature map Fm.
  • a default box Db whose IoU value is equal to or greater than the threshold value is determined as GT to generate teacher data. Therefore, as GT, the default box Db whose position, size and aspect ratio in the feature map Fm are similar to those of the label data Ld is necessarily selected.
  • a large amount of teacher data generated in this way is input to CNN.
  • The CNN repeats the process of detecting the object Tg from each of the multiple input images Pa serving as teacher data. Then, the CNN adjusts each parameter of the network so that the difference between the detected position, shape, and size of the object Tg in the feature map Fm and the position, shape, and size of the GT in the feature map Fm becomes small.
  • As a result, the CNN can arrange the default box Db obtained through learning on a feature map extracted from an unknown input image, and can derive, from the pixel data in the default box Db, the accuracy of the position, shape, and type of the object appearing in the input image.
  • In the verification, the position of the image (GT) of the object Tg was shifted by 1 [pix] at a time in the horizontal direction (both left and right) with respect to the arranged default box Db, and object detection results were obtained for all shift amounts.
  • As a result, the evaluation result shown in FIG. 3 was obtained.
  • The width of the distribution shown in FIG. 3 represents the amount of data at that portion (shift position).
  • the shape of the automobile is a substantially square or a horizontally long rectangle.
  • the shape of the motorcycle is a vertically long rectangle when the direction of travel of the motorcycle is the same as or opposite to that of the own vehicle.
  • As for the shape of a bicycle, a bicycle often crosses in front of the own vehicle, and only rarely travels ahead of the own vehicle in the same direction as, or the direction opposite to, the own vehicle. Therefore, the shape of a bicycle in the image is approximately a square or a horizontally long rectangle.
  • the shape of a person is a vertically long rectangle.
  • It can be inferred that the score of the detection result does not decrease significantly for an object whose shape in the image is approximately a square or a horizontally long rectangle, whereas the score of the detection result decreases greatly for an object whose shape is a vertically long rectangle.
  • When the label data Ld is a vertically long rectangle and the default box Db is a vertically long rectangle having the same aspect ratio and size as the label data Ld, the Jaccard value is 100% when the positions of the label data Ld and the default box Db coincide. If the label data Ld is shifted rightward by 4 [pix] from this state, the Jaccard value is reduced to 53.2%, roughly half.
  • In that case, the feature amount of the object (GT) extracted from the feature map Fm also becomes only about half of the whole, and the detection accuracy decreases.
  • Therefore, the information processing device included in the recognition unit 73 performs shape conversion on the default box Db arranged on the feature map extracted from the image and on the label data Ld given to the object in the image. Then, the information processing device determines the default box Db to be used as the ground truth (GT) of the image by matching the shape-converted default box Db against the label data Ld, and generates teacher data.
  • the information processing device can improve Jaccard in default box matching during learning by performing shape conversion that brings the shapes of the default box Db and the label data Ld closer to a square. Therefore, the information processing apparatus can improve object detection accuracy by performing machine learning using the teacher data according to the present embodiment.
  • the initial label data Ld is a vertically long rectangle with an aspect ratio of 2:1 surrounding the image of a person
  • the default box Db has the same aspect ratio and shape as the label data Ld.
  • the information processing device changes the aspect ratios of the default box Db and the label data Ld.
  • the information processing device can approximate the shapes of the default box Db and the label data Ld to squares of the same size.
  • the change of the aspect ratio performed by the information processing device includes the inverse conversion of the aspect ratio.
  • As a result, the information processing device can approximate the shapes of the default box Db and the label data Ld to squares of the same size. For example, in the case of the default box Db and the label data Ld shown in the figure, their aspect ratios are converted to generate a default box Db' and label data Ld'.
  • The information processing device performs this shape conversion without changing the center position P of the default box Db or of the label data Ld, generating the default box Db' and the label data Ld' (a short code sketch of this conversion, and of its effect on the Jaccard value, appears after this list).
  • By keeping the label data Ld and the label data Ld' aligned in this way, even if the label data Ld' is slightly misaligned, the label data Ld' can still surround almost the entire area of the object.
  • In the verification, the default boxes Db and Db', having the same shapes as the label data Ld and Ld', were arranged shifted by N [pix] in the horizontal direction with respect to the label data Ld and Ld' at the time of learning.
  • As shown in FIG. 14, with the default box Db and the label data Ld before shape conversion, the Jaccard value was 42%, whereas with the shape-converted default box Db' and label data Ld' the Jaccard value improves to 57%.
  • The information processing device performs the shape conversion of the default box Db and the label data Ld in this way, and then performs default box matching at the time of learning to generate teacher data.
  • the information processing apparatus generates teacher data for each layer of the CNN using the learning method described above.
  • the information processing device performs machine learning for each layer of the network using teacher data corresponding to each layer, and detects objects from images by CNN after learning. Thereby, the information processing device can improve the detection accuracy of objects of various sizes.
  • the information processing device executes the information processing program stored in the storage unit 28 to perform the above-described CNN machine learning and object detection processing.
  • the information processing apparatus can improve the object detection accuracy by performing machine learning using the teacher data according to the present embodiment.
  • As described above, the teacher data generation method according to the embodiment is a teacher data generation method executed by a computer, in which shape conversion is performed on the default box Db arranged on the feature map Fm extracted from an image and on the label data Ld assigned to the object Tg in the image, and a default box Db' to be used as the ground truth GT of the image is determined by matching the shape-converted default box Db against the label data Ld, thereby generating teacher data.
  • the information processing apparatus can improve object detection accuracy by machine-learning the CNN using the teacher data generated by the teacher data generation method according to the embodiment.
  • the shape conversion includes changing the aspect ratio of the default box Db and the label data Ld.
  • the information processing device can approximate the shapes of the default box Db and the label data Ld to squares of the same size.
  • changing the aspect ratio includes inverse conversion of the aspect ratio.
  • the information processing device can approximate the shapes of the default box Db and the label data Ld to squares of the same size.
  • shape conversion is performed without changing the center positions of the default box Db and the label data Ld.
  • By keeping the label data Ld and the label data Ld' aligned, even if the label data Ld' is slightly misaligned, the information processing device can ensure that the label data Ld' surrounds almost the entire area of the object.
  • teacher data is generated for each layer of the convolutional neural network.
  • the information processing device can improve the detection accuracy of objects of various sizes.
  • The teacher data generation program causes a computer to execute a procedure of performing shape conversion on the default box Db arranged on the feature map Fm extracted from an image and on the label data Ld given to the object Tg in the image, and a procedure of determining the default box Db to be used as the ground truth of the image by matching the shape-converted default box Db against the label data Ld, thereby generating teacher data.
  • the computer can improve object detection accuracy by machine-learning the CNN using the teacher data generated by the teacher data generation method according to the embodiment.
  • the information processing device includes an information processing unit.
  • The information processing unit performs shape conversion on the default box Db arranged on the feature map Fm extracted from an image and on the label data Ld assigned to the object Tg in the image, determines the default box Db to be used as the ground truth of the image by matching the shape-converted default box Db against the label data Ld to generate teacher data, trains the convolutional neural network using the teacher data, and detects the object Tg from an image input to the convolutional neural network. Thereby, the information processing device can improve the object detection accuracy.
  • The information processing method is a method of detecting the object Tg executed by a computer, in which shape conversion is performed on the default box Db arranged on the feature map Fm extracted from an image and on the label data Ld given to the object Tg in the image, a default box Db to be used as the ground truth of the image is determined by matching the shape-converted default box Db against the label data Ld to generate teacher data, a convolutional neural network is trained using the teacher data, and the object Tg is detected from an image input to the convolutional neural network. This allows the computer to improve the object detection accuracy.
  • The information processing program causes a computer to execute a procedure of performing shape conversion on the default box Db arranged on the feature map Fm extracted from an image and on the label data Ld given to the object Tg in the image, a procedure of determining the default box Db to be used as the ground truth of the image by matching the shape-converted default box Db against the label data Ld to generate teacher data, a procedure of training a convolutional neural network using the teacher data, and a procedure of detecting the object Tg from an image input to the convolutional neural network. This allows the computer to improve the object detection accuracy.
  • (1) A teacher data generation method executed by a computer, comprising: performing shape transformation on a default box placed on a feature map extracted from an image and on label data given to an object in the image; and determining a default box to be the ground truth of the image by matching the shape-transformed default box against the label data, and generating teacher data.
  • (2) The teacher data generation method according to (1), wherein the shape transformation includes changing aspect ratios of the default box and the label data.
  • (3) The teacher data generation method according to (2), wherein changing the aspect ratio includes inverse transformation of the aspect ratio.
  • The teacher data generation method including generating the teacher data for each layer of the convolutional neural network.
  • (6) A teacher data generation program for causing a computer to execute: a procedure of performing shape transformation on a default box placed on a feature map extracted from an image and on label data given to an object in the image; and a procedure of determining a default box to be the ground truth of the image by matching the shape-transformed default box against the label data and generating teacher data.
  • An information processing apparatus comprising an information processing unit that detects an object from an image input to the convolutional neural network.
  • A computer-executed information processing method comprising: performing shape transformation on a default box placed on a feature map extracted from an image and on label data given to an object in the image; determining a default box to be the ground truth of the image by matching the shape-transformed default box against the label data to generate teacher data; training a convolutional neural network using the teacher data; and detecting an object from an image input to the convolutional neural network.
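The shape conversion and its effect on the Jaccard (IoU) value described in the items above can be illustrated with a short sketch. This is a minimal illustration rather than the patent's reference implementation: the box representation (x_min, y_min, x_max, y_max in feature-map pixels), the conversion rule (replacing width and height with their geometric mean so that the area and the center position P are preserved while the aspect ratio becomes 1:1), and the example sizes and shift amount are assumptions made for the example.

```python
# Minimal sketch of a center-preserving shape conversion toward a square, and of
# the Jaccard (IoU) value before and after the conversion under a horizontal
# shift. Illustrative only; box format and sizes are assumptions.
import math

Box = tuple  # (x_min, y_min, x_max, y_max) in feature-map pixels

def jaccard(a: Box, b: Box) -> float:
    """Overlap area divided by the area of the union of the two boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def to_square(box: Box) -> Box:
    """Shape conversion: same center, same area, aspect ratio forced to 1:1."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    side = math.sqrt((box[2] - box[0]) * (box[3] - box[1]))
    return (cx - side / 2, cy - side / 2, cx + side / 2, cy + side / 2)

def shift_x(box: Box, n: float) -> Box:
    """Shift a box horizontally by n pixels."""
    return (box[0] + n, box[1], box[2] + n, box[3])

if __name__ == "__main__":
    label_ld = (0.0, 0.0, 8.0, 16.0)          # vertically long label data Ld
    default_db = shift_x(label_ld, 4.0)       # default box Db shifted by 4 pix
    before = jaccard(label_ld, default_db)
    after = jaccard(to_square(label_ld), shift_x(to_square(label_ld), 4.0))
    print(f"Jaccard before shape conversion: {before:.1%}")  # about 33%
    print(f"Jaccard after shape conversion:  {after:.1%}")   # about 48%
```

With these illustrative numbers, the same 4-pixel shift costs much less overlap once both boxes have been converted toward squares, mirroring the 42% to 57% improvement reported above for default box matching before and after the shape conversion.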

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A teaching data generation method according to the present disclosure to be executed by a computer, said method including: the shape transformation of a default box (Db) positioned on a feature map (Fm) extracted from an image (Pa) and of label data (Ld) to be applied to an object in an image; and the generation of teaching data by determining the default box which is to be the ground truth of the image (Pa) by subjecting the shape-transformed default box (Db') and label data (Ld') to matching.

Description

Teacher data generation method, teacher data generation program, information processing device, information processing method, and information processing program
The present disclosure relates to a teacher data generation method, a teacher data generation program, an information processing device, an information processing method, and an information processing program.
There is an object detection model called SSD (Single Shot MultiBox Detector) (see Patent Document 1, for example). The SSD comprises a convolutional neural network (CNN) that is machine-learned to detect objects from input images. CNN machine learning uses teacher data in which the type (class) of an object included in the image and a rectangular ground truth (GT) indicating the area of the object in the image are given to the image.
When generating teacher data, first, a plurality of input images in which objects appear and labeled images, to which rectangular label data surrounding the object in each input image has been added, are prepared. Then, image features are extracted from each input image to generate a feature map.
Next, multiple default boxes with different aspect ratios and sizes are sequentially placed at arbitrary positions on the feature map. After that, the placed default boxes and the label data are matched to determine the default box to be the GT of the input image, thereby generating teacher data.
In the matching between a default box and the label data, an IoU (Intersection over Union) value is calculated, indicating the area of the overlapping portion of the default box and the label data with respect to the total area occupied by the default box and the label data in the feature map.
Then, a default box whose IoU value is greater than or equal to a threshold value is determined as the GT to generate teacher data. For this reason, as the GT, a default box whose position, size, and aspect ratio in the feature map are similar to those of the label data is necessarily selected.
During machine learning, a large amount of teacher data generated in this way is input to the CNN. The CNN repeats the process of detecting an object from each input teacher data image. The CNN then adjusts each parameter of the network so that the difference between the position, shape, and size of the detected object in the feature map and the position, shape, and size of the GT in the feature map is reduced.
As a result, the CNN can place the default boxes obtained through learning on feature maps extracted from unknown input images, and can derive the accuracy of the position, shape, and type of objects appearing in the input images from the pixel data in those default boxes.
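As one way to picture the inference step just described, the sketch below applies an SSD-style offset, predicted by the network for each default box, to that default box and keeps the detections whose class confidence is high enough. The (dcx, dcy, dw, dh) offset parameterisation and the 0.5 threshold are common SSD conventions assumed for illustration; they are not details taken from this patent.

```python
# Minimal sketch of SSD-style decoding at inference time: for each default box
# the network outputs class confidences and an offset; applying the offset to
# the default box yields the predicted object box. Illustrative assumptions only.
import math
from typing import Iterator, List, Tuple

CxCyWH = Tuple[float, float, float, float]  # (center x, center y, width, height)

def decode(default_box: CxCyWH, offset: CxCyWH) -> CxCyWH:
    """Apply a predicted (dcx, dcy, dw, dh) offset to a default box."""
    cx0, cy0, w0, h0 = default_box
    dcx, dcy, dw, dh = offset
    return (cx0 + dcx * w0, cy0 + dcy * h0, w0 * math.exp(dw), h0 * math.exp(dh))

def detect(default_boxes: List[CxCyWH],
           confidences: List[List[float]],
           offsets: List[CxCyWH],
           score_threshold: float = 0.5) -> Iterator[Tuple[int, float, CxCyWH]]:
    """Yield (class_id, score, box) for each sufficiently confident default box
    (non-maximum suppression is omitted for brevity)."""
    for db, conf, off in zip(default_boxes, confidences, offsets):
        best_class = max(range(len(conf)), key=conf.__getitem__)
        if conf[best_class] >= score_threshold:
            yield best_class, conf[best_class], decode(db, off)
```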
Patent Document 1: JP 2020-98455 A
However, when the CNN is trained with teacher data in which a default box whose aspect ratio in the feature map is similar to that of the label data is determined as the GT, the object detection accuracy may decrease.
Therefore, the present disclosure proposes a teacher data generation method, a teacher data generation program, an information processing device, an information processing method, and an information processing program that can improve object detection accuracy.
A teacher data generation method according to the present disclosure is a teacher data generation method executed by a computer, in which a shape transformation is performed on a default box arranged on a feature map extracted from an image and on label data given to an object in the image, and a default box to be used as the ground truth of the image is determined by matching the shape-transformed default box against the label data, thereby generating teacher data.
FIG. 1 is a block diagram showing a configuration example of a vehicle control system according to the present disclosure. FIG. 2 is a diagram showing an example of a general teacher data generation method. Several subsequent figures show verification results of object detection. Two figures show verification results of default box matching during learning. One figure shows an example of the shape conversion according to the embodiment. One figure shows a verification result of default box matching before the shape conversion, and another shows the verification result after the shape conversion. Further figures show verification results of object detection.
Below, embodiments of the present disclosure will be described in detail based on the drawings. In each of the following embodiments, the same parts are denoted by the same reference numerals, and redundant explanations are omitted.
[1. Configuration example of vehicle control system]
FIG. 1 is a block diagram showing a configuration example of a vehicle control system 11, which is an example of a mobile device control system to which the present technology is applied.
The vehicle control system 11 is provided in the vehicle 1 and performs processing related to driving support and automatic driving of the vehicle 1.
The vehicle control system 11 includes a vehicle control ECU (Electronic Control Unit) 21, a communication unit 22, a map information accumulation unit 23, a position information acquisition unit 24, an external recognition sensor 25, an in-vehicle sensor 26, a vehicle sensor 27, a storage unit 28, a driving support/automatic driving control unit 29, a DMS (Driver Monitoring System) 30, an HMI (Human Machine Interface) 31, and a vehicle control unit 32.
The vehicle control ECU 21, communication unit 22, map information accumulation unit 23, position information acquisition unit 24, external recognition sensor 25, in-vehicle sensor 26, vehicle sensor 27, storage unit 28, driving support/automatic driving control unit 29, driver monitoring system (DMS) 30, human machine interface (HMI) 31, and vehicle control unit 32 are connected via a communication network 41 so as to be able to communicate with each other. The communication network 41 is composed of an in-vehicle communication network, a bus, or the like conforming to a digital two-way communication standard such as CAN (Controller Area Network), LIN (Local Interconnect Network), LAN (Local Area Network), FlexRay (registered trademark), or Ethernet (registered trademark). The communication network 41 may be used selectively depending on the type of data to be transmitted; for example, CAN may be applied to data related to vehicle control, and Ethernet may be applied to large-capacity data. Each part of the vehicle control system 11 may also be connected directly, without going through the communication network 41, using wireless communication intended for relatively short ranges, such as NFC (Near Field Communication) or Bluetooth (registered trademark).
The communication unit 22 communicates with various devices inside and outside the vehicle, other vehicles, servers, base stations, and the like, and transmits and receives various data.
The map information accumulation unit 23 accumulates one or both of a map obtained from the outside and a map created by the vehicle 1. For example, the map information accumulation unit 23 accumulates a three-dimensional high-precision map, a global map that is lower in accuracy than the high-precision map but covers a wide area, and the like.
The position information acquisition unit 24 receives GNSS signals from GNSS (Global Navigation Satellite System) satellites and acquires the position information of the vehicle 1. The acquired position information is supplied to the driving support/automatic driving control unit 29. Note that the position information acquisition unit 24 is not limited to the method using GNSS signals, and may acquire position information using beacons, for example.
The external recognition sensor 25 includes various sensors used for recognizing situations outside the vehicle 1 and supplies sensor data from each sensor to each part of the vehicle control system 11. The type and number of sensors included in the external recognition sensor 25 are arbitrary.
For example, the external recognition sensor 25 includes a camera 51, a radar 52, a LiDAR (Light Detection and Ranging, Laser Imaging Detection and Ranging) 53, and an ultrasonic sensor 54.
The in-vehicle sensor 26 includes various sensors for detecting information inside the vehicle, and supplies sensor data from each sensor to each part of the vehicle control system 11. The types and number of sensors included in the in-vehicle sensor 26 are not particularly limited as long as they can realistically be installed in the vehicle 1. For example, the in-vehicle sensor 26 may comprise one or more of cameras, radars, seat sensors, steering wheel sensors, microphones, and biometric sensors.
The vehicle sensor 27 includes various sensors for detecting the state of the vehicle 1, and supplies sensor data from each sensor to each section of the vehicle control system 11. The types and number of sensors included in the vehicle sensor 27 are not particularly limited as long as they can practically be installed in the vehicle 1. For example, the vehicle sensor 27 includes a speed sensor, an acceleration sensor, an angular velocity sensor (gyro sensor), and an inertial measurement unit (IMU) integrating them.
The storage unit 28 includes at least one of a nonvolatile storage medium and a volatile storage medium, and stores data and programs. The storage unit 28 is used as, for example, an EEPROM (Electrically Erasable Programmable Read Only Memory) and a RAM (Random Access Memory); as the storage medium, a magnetic storage device such as an HDD (Hard Disc Drive), a semiconductor storage device, an optical storage device, or a magneto-optical storage device can be applied. The storage unit 28 stores various programs and data used by each unit of the vehicle control system 11.
The driving support/automatic driving control unit 29 controls driving support and automatic driving of the vehicle 1. For example, the driving support/automatic driving control unit 29 includes an analysis unit 61, an action planning unit 62, and an operation control unit 63.
The analysis unit 61 analyzes the vehicle 1 and its surroundings. The analysis unit 61 includes a self-position estimation unit 71, a sensor fusion unit 72, and a recognition unit 73.
The self-position estimation unit 71 estimates the self-position of the vehicle 1 based on the sensor data from the external recognition sensor 25 and the high-precision map accumulated in the map information accumulation unit 23.
The sensor fusion unit 72 performs sensor fusion processing that combines a plurality of different types of sensor data (for example, image data supplied from the camera 51 and sensor data supplied from the LiDAR 53 and the radar 52) to obtain new information. Methods for combining different types of sensor data include integration, fusion, federation, and the like.
The recognition unit 73 executes detection processing for detecting the situation outside the vehicle 1 and recognition processing for recognizing the situation outside the vehicle 1.
For example, the recognition unit 73 performs detection processing and recognition processing of the situation outside the vehicle 1 based on information from the external recognition sensor 25, information from the self-position estimation unit 71, information from the sensor fusion unit 72, and the like.
Specifically, for example, the recognition unit 73 performs detection processing and recognition processing of objects around the vehicle 1. Object detection processing is, for example, processing for detecting the presence or absence, size, shape, position, movement, and the like of an object. Object recognition processing is, for example, processing for recognizing an attribute such as the type of an object or identifying a specific object. However, detection processing and recognition processing are not always clearly separated, and may overlap.
For example, the recognition unit 73 detects objects around the vehicle 1 by performing clustering that classifies a point cloud based on sensor data from the radar 52, the LiDAR 53, or the like into clusters of points. As a result, the presence or absence, size, shape, and position of objects around the vehicle 1 are detected.
For example, the recognition unit 73 detects the movement of objects around the vehicle 1 by performing tracking that follows the movement of the clusters of points classified by the clustering. As a result, the speed and traveling direction (movement vector) of objects around the vehicle 1 are detected.
For example, the recognition unit 73 detects or recognizes vehicles, people, bicycles, obstacles, structures, roads, traffic lights, traffic signs, road markings, and the like based on image data supplied from the camera 51. Further, the recognition unit 73 may recognize the types of objects around the vehicle 1 by performing recognition processing such as semantic segmentation.
The action planning unit 62 creates an action plan for the vehicle 1. For example, the action planning unit 62 creates an action plan by performing route planning and route following processing.
Route planning (global path planning) is the process of planning a rough route from the start to the goal. This route planning also includes what is called trajectory planning: processing that, along the planned route, generates a trajectory (local path planning) on which the vehicle 1 can proceed safely and smoothly in its vicinity, in consideration of the motion characteristics of the vehicle 1.
The operation control unit 63 controls the operation of the vehicle 1 in order to implement the action plan created by the action planning unit 62.
The DMS 30 performs driver authentication processing, driver state recognition processing, and the like based on sensor data from the in-vehicle sensor 26 and input data input to the HMI 31, which will be described later. As the state of the driver to be recognized, for example, physical condition, wakefulness, concentration, fatigue, gaze direction, drunkenness, driving operation, posture, and the like are assumed. The HMI 31 inputs various data, instructions, and the like, and presents various data to the driver and others.
The vehicle control unit 32 controls each unit of the vehicle 1. The vehicle control unit 32 includes a steering control unit 81, a brake control unit 82, a drive control unit 83, a body system control unit 84, a light control unit 85, and a horn control unit 86.
The steering control unit 81 detects and controls the state of the steering system of the vehicle 1. The steering system includes, for example, a steering mechanism including a steering wheel, an electric power steering, and the like. The steering control unit 81 includes, for example, a steering ECU that controls the steering system, an actuator that drives the steering system, and the like.
The brake control unit 82 detects and controls the state of the brake system of the vehicle 1. The brake system includes, for example, a brake mechanism including a brake pedal, an ABS (Antilock Brake System), a regenerative brake mechanism, and the like. The brake control unit 82 includes, for example, a brake ECU that controls the brake system, an actuator that drives the brake system, and the like.
The drive control unit 83 detects and controls the state of the drive system of the vehicle 1. The drive system includes, for example, an accelerator pedal, a driving force generator for generating driving force such as an internal combustion engine or a driving motor, and a driving force transmission mechanism for transmitting the driving force to the wheels. The drive control unit 83 includes, for example, a drive ECU that controls the drive system, an actuator that drives the drive system, and the like.
The body system control unit 84 detects and controls the state of the body system of the vehicle 1. The body system includes, for example, a keyless entry system, a smart key system, a power window device, power seats, an air conditioner, airbags, seat belts, a shift lever, and the like. The body system control unit 84 includes, for example, a body system ECU that controls the body system, an actuator that drives the body system, and the like.
The light control unit 85 detects and controls the states of various lights of the vehicle 1. Lights to be controlled include, for example, headlights, back lights, fog lights, turn signals, brake lights, projections, bumper displays, and the like. The light control unit 85 includes a light ECU that controls the lights, an actuator that drives the lights, and the like.
The horn control unit 86 detects and controls the state of the car horn of the vehicle 1. The horn control unit 86 includes, for example, a horn ECU that controls the car horn, an actuator that drives the car horn, and the like.
[2.認識部が使用する物体検出モデルの一例]
 一般的な物体検出モデルとして、SSD(Single Shot MultiBox Detector)がある。SSDは、入力画像から物体を検出するように機械学習された畳み込みニューラルネットワーク(CNN:Convolutional Neural Network)を備える。CNNの機械学習には、画像に対して、画像に含まれる物体の種類(クラス)と画像における物体の領域を示す矩形状のグランドトゥルース(GT)とが付与された教師データが使用される。
[2. An example of an object detection model used by the recognition unit]
A general object detection model is an SSD (Single Shot MultiBox Detector). The SSD comprises a Convolutional Neural Network (CNN) that is machine-learned to detect objects from input images. CNN machine learning uses teacher data in which the type (class) of an object included in the image and a rectangular ground truth (GT) indicating the area of the object in the image are given to the image.
 図2は、一般的な教師データ生成方法の一例を示す図である。教師データを生成する場合には、まず、物体Tgが写る入力画像Paと、入力画像Paに写る物体Tgを囲む矩形状のラベルデータLdを付与したラベルデータ付画像Pbとを用意する。そして、入力画像Paから画像の特徴量を抽出して特徴マップFmを生成する。 FIG. 2 is a diagram showing an example of a general training data generation method. When generating teacher data, first, an input image Pa showing an object Tg and an image Pb with label data added with rectangular label data Ld surrounding the object Tg shown in the input image Pa are prepared. Then, the feature amount of the image is extracted from the input image Pa to generate the feature map Fm.
 続いて、特徴マップFm上の任意の位置に、CNNの階層毎にアスペクト比および大きさが異なる複数のデフォルトボックスDbを順次配置する。その後、配置したデフォルトボックスDbとラベルデータLdとをマッチングして、入力画像PaのGTにするデフォルトボックスDbを決定して教師データを生成する。 Subsequently, a plurality of default boxes Db with different aspect ratios and sizes for each layer of CNN are sequentially arranged at arbitrary positions on the feature map Fm. After that, the arranged default box Db and the label data Ld are matched to determine the default box Db to be the GT of the input image Pa, thereby generating teacher data.
 デフォルトボックスDbとラベルデータLdとのマッチングでは、デフォルトボックスDbとラベルデータLdとが特徴マップFmに占める総面積に対するデフォルトボックスDbとラベルデータLdとの重なり部分(図2に示す黒塗り領域)の面積を示すIoU(Intersection over Union)値を算出する。 In the matching between the default box Db and the label data Ld, an IoU (Intersection over Union) value is calculated that indicates the area of the overlapping portion of the default box Db and the label data Ld (the black area shown in FIG. 2) relative to the total area that the default box Db and the label data Ld occupy in the feature map Fm.
 そして、IoU値が閾値以上のデフォルトボックスDbをGTとして決定して教師データを生成する。このため、GTとしては、特徴マップFmにおける位置、大きさ、およびアスペクト比がラベルデータLdと類似したデフォルトボックスDbが必然的に選択される。 Then, a default box Db whose IoU value is equal to or greater than the threshold value is determined as GT to generate teacher data. Therefore, as GT, the default box Db whose position, size and aspect ratio in the feature map Fm are similar to those of the label data Ld is necessarily selected.
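 By way of a non-limiting illustration, the matching described above can be sketched in a few lines of Python. The (x_min, y_min, x_max, y_max) box format, the helper names, and the 0.5 threshold are assumptions made for this example only and are not taken from the present publication.

    # Minimal sketch of the conventional default-box matching described above.
    # Box format (x_min, y_min, x_max, y_max) and the 0.5 threshold are assumptions.
    from typing import List, Tuple

    Box = Tuple[float, float, float, float]

    def iou(a: Box, b: Box) -> float:
        """Intersection over Union of two axis-aligned boxes."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def match_default_boxes(default_boxes: List[Box], label: Box,
                            threshold: float = 0.5) -> List[int]:
        """Return the indices of the default boxes selected as GT for one label box."""
        return [i for i, db in enumerate(default_boxes) if iou(db, label) >= threshold]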
 機械学習時には、こうして生成された多数の教師データがCNNに入力される。CNNは、教師データとなる複数の各入力画像Paから物体Tgを検出する処理を繰り返す。そして、CNNは、検出した特徴マップFmにおける物体Tgの位置、形状、および大きさと、特徴マップFmにおけるGTの位置、形状、および大きさとの差が小さくなるように、ネットワークの各パラメータを調整する。 During machine learning, a large number of pieces of teacher data generated in this way are input to the CNN. The CNN repeats the process of detecting the object Tg from each of the input images Pa serving as teacher data. The CNN then adjusts each parameter of the network so that the difference between the position, shape, and size of the detected object Tg in the feature map Fm and the position, shape, and size of the GT in the feature map Fm becomes small.
 これにより、CNNは、未知の入力画像から抽出する特徴マップに、学習によって修得したデフォルトボックスDbを配置し、デフォルトボックスDb内の画素データから入力画像に写る物体の位置、形状、および種類の確度を導出することができる。 As a result, the CNN can place the default boxes Db acquired through learning on a feature map extracted from an unknown input image, and can derive, from the pixel data within each default box Db, the position, shape, and class confidence of an object appearing in the input image.
[3.一般的なSSDによる物体検出の検証]
 しかしながら、SSDは、走行する自車両の揺れや検出対象物の進路変更が発生する場合に、物体検出精度が低下することがある。これは、入力画像から抽出した特徴マップ上に配置するデフォルトボックスDbの中心位置と、入力画像における物体Tgの中心位置との距離が拡大することが原因と推測できる。
[3. Verification of object detection by general SSD]
However, in the case of the SSD, the accuracy of object detection may deteriorate when the own vehicle shakes or the course of the object to be detected changes. It can be assumed that this is because the distance between the center position of the default box Db arranged on the feature map extracted from the input image and the center position of the object Tg in the input image increases.
 そこで、配置したデフォルトボックスDbに対して、横方向(左右両方向)へ1[pix]ずつ物体Tgの画像(GT)の位置をずらしながら物体検出する検証を行い、それぞれのGTにマッチした検出結果を全てのずれに対して取得した。 Therefore, verification of object detection was performed while shifting the position of the image (GT) of the object Tg by 1 [pix] at a time in the horizontal direction (both left and right) with respect to the arranged default box Db, and the detection results matched to each GT were obtained for all of the shifts.
 そして、検出結果のスコアの最大値が1.0となるように上界を正規化した結果、図3に示す評価結果が得られた。図3に示すデータの膨らみは、その部分(ずれ位置)におけるデータの多さを表している。図3に示すように、評価結果では、デフォルトボックスDbに対するGTの横方向へのずれ量が大きいほど、検出結果のスコアが低くなっている。 Then, as a result of normalizing the upper bound so that the maximum value of the score of the detection result is 1.0, the evaluation result shown in FIG. 3 was obtained. The swelling of the data shown in FIG. 3 represents the amount of data in that portion (shift position). As shown in FIG. 3, in the evaluation results, the larger the amount of lateral displacement of GT with respect to the default box Db, the lower the score of the detection result.
 さらに、物体のクラス(種類)別に同様の検証を行った結果、図4~図7に示す評価結果が得られた。図4に示すように、自動車の画像では、デフォルトボックスDbに対するGTの横方向へのずれ量が増大しても、検出結果のスコアは大きくは低下しない。 Furthermore, as a result of conducting the same verification for each object class (type), the evaluation results shown in Figures 4 to 7 were obtained. As shown in FIG. 4, in the automobile image, the score of the detection result does not decrease significantly even if the amount of lateral displacement of GT with respect to the default box Db increases.
 これに対して、図5に示すように、オートバイの画像では、デフォルトボックスDbに対するGTの横方向へのずれ量が増大するほど、検出結果のスコアは大きく低下する。 On the other hand, as shown in FIG. 5, in the motorcycle image, the greater the amount of lateral displacement of the GT with respect to the default box Db, the more the score of the detection result drops.
 一方、図6に示すように、自転車の画像では、デフォルトボックスDbに対するGTの横方向へのずれ量が増大しても、検出結果のスコアに変動は少ない。これに対して、図7に示すように、人の画像では、デフォルトボックスDbに対するGTの横方向へのずれ量が増大するほど、検出結果のスコアは大きく低下する。 On the other hand, as shown in FIG. 6, in the image of the bicycle, even if the amount of lateral displacement of the GT with respect to the default box Db increases, the score of the detection result does not fluctuate much. On the other hand, as shown in FIG. 7, in the image of a person, the score of the detection result greatly decreases as the amount of lateral displacement of GT with respect to the default box Db increases.
 ここで、画像中の自動車、オートバイ、自転車、および人の画像における形状に注目すると、自動車の形状は、略正方形または横長矩形である。オートバイの形状は、オートバイの進行方向が自車両と同一方向または逆方向の場合、縦長矩形である。 Here, focusing on the shapes of the automobile, motorcycle, bicycle, and person in the images, the shape of an automobile is substantially square or a horizontally long rectangle. The shape of a motorcycle is a vertically long rectangle when the direction of travel of the motorcycle is the same as or opposite to that of the own vehicle.
 自転車の形状は、自転車は、自車両の前方を横切る場合は多いが、自車両の前方で自車両と同一方向または逆方向に走行する場合は少ない。このため、自転車の形状は、略正方形または横長矩形である。人の形状は、縦長矩形である。 As for the shape of the bicycle, it often crosses in front of the own vehicle, but rarely runs in front of the own vehicle in the same or opposite direction as the own vehicle. Therefore, the shape of the bicycle is approximately square or oblong rectangle. The shape of a person is a vertically long rectangle.
 このことから、デフォルトボックスDbに対するGTの横方向へのずれ量が増大した場合、画像中の形状が略正方形または横長矩形の物体では検出結果のスコアは大きくは低下せず、縦長矩形の物体では検出結果のスコアは大きくは低下することが推測できる。 From this, it can be inferred that, when the amount of lateral displacement of the GT with respect to the default box Db increases, the score of the detection result does not decrease significantly for an object whose shape in the image is substantially square or a horizontally long rectangle, whereas the score decreases greatly for an object whose shape is a vertically long rectangle.
 このため、画像中の物体の形状毎に同様の検証を行った結果、図8~図10に示す評価結果が得られた。この検証結果によって、図8~図10に示すように、デフォルトボックスDbに対するGTの横方向へのずれ量が増大した場合、画像中の形状が略正方形または横長矩形の物体では検出結果のスコアは大きくは低下せず、縦長矩形の物体では検出結果のスコアは大きくは低下することが実証された。 For this reason, similar verification was performed for each shape of object in the image, and the evaluation results shown in FIGS. 8 to 10 were obtained. As shown in FIGS. 8 to 10, these verification results demonstrated that, when the amount of lateral displacement of the GT with respect to the default box Db increases, the score of the detection result does not decrease significantly for an object whose shape in the image is substantially square or a horizontally long rectangle, whereas the score decreases greatly for a vertically long rectangular object.
[4.学習時におけるデフォルトボックスマッチングの検証]
 上記した現象の発生は、学習時におけるデフォルトボックスDbのマッチングに問題があるという仮説に基づいて、学習時にラベルデータLdと同一形状のデフォルトボックスDbをラベルデータLdに対して横方向に4[pix]移動させてJaccard(IoU値)を算出すると、図11~図12に示す結果が得られた。
[4. Verification of default box matching during learning]
 The above phenomenon was examined based on the hypothesis that there is a problem with the matching of the default box Db during learning. When a default box Db having the same shape as the label data Ld was shifted by 4 [pix] in the horizontal direction with respect to the label data Ld during learning and the Jaccard index (IoU value) was calculated, the results shown in FIGS. 11 and 12 were obtained.
 図11に示すように、ラベルデータLdが正方形であり、デフォルトボックスDbがラベルデータLdと同じ大きさの正方形である場合に、ラベルデータLdとデフォルトボックスDbとの位置が一致していれば、Jaccardは、100%になる。この状態からラベルデータLdを右方向へ4[pix]ずらすと、Jaccardは、78.3%になるが、大きくは低下しない。 As shown in FIG. 11, when the label data Ld is a square and the default box Db is a square of the same size as the label data Ld, if the positions of the label data Ld and the default box Db match, Jaccard goes to 100%. If the label data Ld is shifted to the right by 4 [pix] from this state, Jaccard becomes 78.3%, but does not drop significantly.
 一方、図12に示すように、ラベルデータLdが縦長矩形であり、デフォルトボックスDbがラベルデータLdと同じアスペクト比で大きさの縦長矩形である場合に、ラベルデータLdとデフォルトボックスDbとの位置が一致していれば、Jaccardは、100%になる。この状態からラベルデータLdを右方向へ4[pix]ずらすと、Jaccardは、53.2%と、約半分まで低下する。 On the other hand, as shown in FIG. 12, when the label data Ld is a vertically long rectangle and the default box Db is a vertically long rectangle having the same aspect ratio and size as the label data Ld, the Jaccard index is 100% if the positions of the label data Ld and the default box Db match. If the label data Ld is shifted to the right by 4 [pix] from this state, the Jaccard index drops to 53.2%, roughly half.
 このように、ラベルデータLdおよびデフォルトボックスDbが同一形状で同一サイズの縦長矩形の場合、実際の物体検出時にデフォルトボックスDbを適用すると、特徴マップFmから抽出される物体(GT)の特徴量が全体の約半分になり検出精度が低下する。 Thus, when the label data Ld and the default box Db are vertically long rectangles of the same shape and size, applying the default box Db during actual object detection means that the feature amount of the object (GT) extracted from the feature map Fm becomes only about half of the whole, and the detection accuracy decreases.
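 The sensitivity to a horizontal shift can also be checked numerically. For two identical boxes of width w offset horizontally by a shift of Δ pixels, the IoU reduces to (w - Δ)/(w + Δ) regardless of the box height, so a narrow vertically long box loses far more overlap than a wider square. The box sizes used in the sketch below (a 33 x 33 square and a 13 x 26 rectangle) are illustrative assumptions chosen to roughly reproduce the 78.3% and 53.2% figures above; the publication does not state the exact dimensions.

    # Sketch reproducing the Jaccard (IoU) drop for a 4 px horizontal shift.
    # Box sizes are illustrative assumptions; only the 4 px shift comes from the text.
    def shifted_iou(width: float, height: float, shift: float) -> float:
        """IoU of two identical width x height boxes offset horizontally by `shift`."""
        inter = max(0.0, width - shift) * height
        union = 2.0 * width * height - inter
        return inter / union

    print(shifted_iou(33, 33, 4))   # square of side 33:            ~0.784 (cf. 78.3% in FIG. 11)
    print(shifted_iou(13, 26, 4))   # 13 x 26 vertically long box:  ~0.529 (cf. 53.2% in FIG. 12)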
[5.デフォルトボックスおよびラベルデータの形状変換]
 そこで、本開示に係る教師データ生成方法では、認識部73に含まれる情報処理装置が、画像から抽出される特徴マップ上に配置されるデフォルトボックスDbおよび画像における物体に付与されるラベルデータLdに対して形状変換を行う。そして、情報処理装置は、形状変換後のデフォルトボックスDbとラベルデータLdとのマッチングによって、画像のグランドトゥルース(GT)にするデフォルトボックスDbを決定して教師データを生成する。
[5. Shape conversion of default box and label data]
 Therefore, in the teacher data generation method according to the present disclosure, the information processing device included in the recognition unit 73 performs shape transformation on the default box Db arranged on the feature map extracted from the image and on the label data Ld given to the object in the image. The information processing device then determines the default box Db to be used as the ground truth (GT) of the image by matching the shape-transformed default box Db against the label data Ld, and generates teacher data.
 これにより、情報処理装置は、デフォルトボックスDbおよびラベルデータLdの形状を正方形に近付ける形状変換を行うことで、学習時のデフォルトボックスマッチングにおいて、Jaccardを向上させることができる。したがって、情報処理装置は、本実施形態に係る教師データを使用して機械学習することにより、物体検出精度を向上させることができる。 As a result, the information processing device can improve Jaccard in default box matching during learning by performing shape conversion that brings the shapes of the default box Db and the label data Ld closer to a square. Therefore, the information processing apparatus can improve object detection accuracy by performing machine learning using the teacher data according to the present embodiment.
 例えば、図13に示すように、当初のラベルデータLdが人の画像を囲む縦横比が2対1の縦長矩形であり、デフォルトボックスDbがラベルデータLdと同一のアスペクト比であり、同一の形状であるとする。この場合、情報処理装置は、デフォルトボックスDbおよびラベルデータLdのアスペクト比を変更する。これにより、情報処理装置は、デフォルトボックスDbおよびラベルデータLdの形状を同じ大きさの正方形に近付けることができる。 For example, as shown in FIG. 13, suppose that the initial label data Ld is a vertically long rectangle with an aspect ratio of 2:1 surrounding the image of a person, and that the default box Db has the same aspect ratio and the same shape as the label data Ld. In this case, the information processing device changes the aspect ratios of the default box Db and the label data Ld. As a result, the information processing device can bring the shapes of the default box Db and the label data Ld closer to squares of the same size.
 情報処理装置が行うアスペクト比の変更は、アスペクト比の逆変換を含む。これにより、情報処理装置は、デフォルトボックスDbおよびラベルデータLdの形状を同じ大きさの正方形に近付けることができる。例えば、情報処理装置は、図13に示すデフォルトボックスDbおよびラベルデータLdの場合、デフォルトボックスDbおよびラベルデータLdの縦の長さを√2倍にし、横の長さを1/√2倍にして、デフォルトボックスDb´およびラベルデータLd´を生成する。 The aspect ratio change performed by the information processing device includes an inverse transformation of the aspect ratio. As a result, the information processing device can bring the shapes of the default box Db and the label data Ld closer to squares of the same size. For example, in the case of the default box Db and the label data Ld shown in FIG. 13, the information processing device multiplies the vertical length of the default box Db and the label data Ld by √2 and the horizontal length by 1/√2 to generate the default box Db' and the label data Ld'.
 このとき、情報処理装置は、デフォルトボックスDbおよびラベルデータLdの中心位置Pを変化させずに形状変換を行い、デフォルトボックスDb´およびラベルデータLd´を生成する。これにより、情報処理装置は、ラベルデータLdとラベルデータLd´とをマッチングすることで、ラベルデータLd´の配置が多少ずれていても、ラベルデータLd´によって物体の領域のほぼ全体を囲むことができる。 At this time, the information processing device performs the shape transformation without changing the center position P of the default box Db and the label data Ld, and generates the default box Db' and the label data Ld'. As a result, by matching the label data Ld against the label data Ld', the information processing device can surround almost the entire area of the object with the label data Ld' even if the placement of the label data Ld' is slightly off.
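 One plausible way to realize such a transformation is to rescale each box toward a square of the same area while keeping its center fixed; for a 2:1 box this amounts to scaling one side by √2 and the other by 1/√2. The sketch below illustrates that idea under an assumed (x_min, y_min, x_max, y_max) box format and is not code taken from the publication.

    # Sketch of an area- and center-preserving shape transformation toward a square.
    # Box format (x_min, y_min, x_max, y_max) is an assumption for illustration.
    import math
    from typing import Tuple

    Box = Tuple[float, float, float, float]

    def to_square(box: Box) -> Box:
        """Return a square box with the same area and the same center as `box`."""
        w = box[2] - box[0]
        h = box[3] - box[1]
        cx = (box[0] + box[2]) / 2.0
        cy = (box[1] + box[3]) / 2.0
        side = math.sqrt(w * h)  # geometric mean: a 2:1 box has its sides scaled by 1/sqrt(2) and sqrt(2)
        return (cx - side / 2.0, cy - side / 2.0,
                cx + side / 2.0, cy + side / 2.0)

    # Example: a 16 x 32 (2:1) label box becomes roughly 22.6 x 22.6 around the same center.
    print(to_square((0.0, 0.0, 16.0, 32.0)))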
 デフォルトボックスDbおよびラベルデータLdの形状変換を行う前後で、学習時にラベルデータLd,Ld´と同一形状のデフォルトボックスDb,Db´をラベルデータLd,Ld´に対して横方向にN[pix]移動させてJaccardを算出すると、図14~図15に示す結果が得られた。 Before and after the shape transformation of the default box Db and the label data Ld, the default boxes Db and Db', which have the same shapes as the label data Ld and Ld', were moved by N [pix] in the horizontal direction with respect to the label data Ld and Ld' during learning, and the Jaccard index was calculated, yielding the results shown in FIGS. 14 and 15.
 図14に示すように、形状変換前のデフォルトボックスDbおよびラベルデータLdでは、Jaccardが42%であったのに対して、図15に示すように、形状変換後のラベルデータLdとラベルデータLd´では、Jaccardが57%まで向上できる。 As shown in FIG. 14, the Jaccard index was 42% for the default box Db and the label data Ld before the shape transformation, whereas, as shown in FIG. 15, the Jaccard index improves to 57% for the label data Ld and the label data Ld' after the shape transformation.
 情報処理装置は、このようにデフォルトボックスDbおよびラベルデータLdの形状変換を行った上で、学習時にデフォルトボックスマッチングを行うことで、Jaccardがより高いデフォルトボックスDb´をGTに決定して教師データを生成する。 The information processing device performs the shape transformation of the default box Db and the label data Ld in this way and then performs default box matching during learning, thereby determining the default box Db' with the higher Jaccard index as the GT and generating teacher data.
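 Putting the pieces of this section together, the flow of transforming both the default boxes and the label data and then keeping the default boxes whose Jaccard index clears a threshold could look like the following self-contained sketch; the helper names, the box format, and the 0.5 threshold are assumptions made for illustration only.

    # Self-contained sketch of the teacher-data generation flow in this section:
    # shape-transform the boxes, then select GT default boxes by IoU matching.
    # Box format, helper names, and the 0.5 threshold are assumptions.
    import math
    from typing import List, Tuple

    Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

    def iou(a: Box, b: Box) -> float:
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1]) +
                 (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def to_square(box: Box) -> Box:
        w, h = box[2] - box[0], box[3] - box[1]
        cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
        side = math.sqrt(w * h)
        return (cx - side / 2, cy - side / 2, cx + side / 2, cy + side / 2)

    def generate_teacher_data(default_boxes: List[Box], label: Box,
                              threshold: float = 0.5) -> List[int]:
        """Indices of the default boxes chosen as GT after the shape transformation."""
        label_t = to_square(label)
        return [i for i, db in enumerate(default_boxes)
                if iou(to_square(db), label_t) >= threshold]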
 また、CNNでは、ネットワークの浅い階層ほど小さな物体を検出し、階層が深くなるほど大きな物体を検出する。そこで、情報処理装置は、上記した学習方法による教師データの生成をCNNの階層毎に行う。 In addition, in CNN, smaller objects are detected in shallower layers of the network, and larger objects are detected in deeper layers. Therefore, the information processing apparatus generates teacher data for each layer of the CNN using the learning method described above.
 そして、情報処理装置は、ネットワークの階層毎に、各階層に対応した教師データを使用して機械学習を行い、学習後のCNNによって画像から物体検出を行う。これにより、情報処理装置は、様々な大きさの物体の検出精度を向上させることができる。情報処理装置は、記憶部28に記憶された情報処理プログラムを実行することによって、上記したCNNの機械学習および物体検出処理を行う。 Then, the information processing device performs machine learning for each layer of the network using teacher data corresponding to each layer, and detects objects from images by CNN after learning. Thereby, the information processing device can improve the detection accuracy of objects of various sizes. The information processing device executes the information processing program stored in the storage unit 28 to perform the above-described CNN machine learning and object detection processing.
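 Because shallow layers handle small objects and deep layers handle large ones, the default boxes placed on each layer's feature map are scaled per layer. The sketch below follows the commonly used SSD convention of linearly spaced scales between s_min and s_max; the concrete values of s_min, s_max, and the aspect-ratio set are assumptions, not parameters taken from the publication.

    # Sketch of per-layer default-box sizes in the usual SSD style.
    # s_min, s_max, and the aspect-ratio set are illustrative assumptions.
    import math
    from typing import Dict, List, Tuple

    def layer_default_box_sizes(num_layers: int,
                                s_min: float = 0.2,
                                s_max: float = 0.9,
                                aspect_ratios: Tuple[float, ...] = (1.0, 2.0, 0.5)
                                ) -> Dict[int, List[Tuple[float, float]]]:
        """(width, height) of the default boxes for each layer, relative to the image size."""
        sizes: Dict[int, List[Tuple[float, float]]] = {}
        for k in range(1, num_layers + 1):
            s_k = s_min + (s_max - s_min) * (k - 1) / (num_layers - 1)  # scale of layer k
            sizes[k] = [(s_k * math.sqrt(ar), s_k / math.sqrt(ar)) for ar in aspect_ratios]
        return sizes

    # Shallow layers (small s_k) cover small objects; deep layers cover large ones.
    for layer, boxes in layer_default_box_sizes(6).items():
        print(layer, [(round(w, 2), round(h, 2)) for w, h in boxes])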
 デフォルトボックスDbおよびラベルデータLdの形状変換前の教師データを使用して機械学習したCNNと、形状変換後の教師データを使用して機械学習したCNNとによって、図8および図10と同様の検証を行った結果、図16~図19の結果が得られた。 Verification similar to that in FIGS. 8 and 10 was performed using a CNN machine-learned with teacher data from before the shape transformation of the default box Db and the label data Ld and a CNN machine-learned with teacher data from after the shape transformation, and the results shown in FIGS. 16 to 19 were obtained.
 図16および図17に示すように、物体およびデフォルトボックスDbの形状が正方形の場合、形状変換前後で物体検出結果のスコアに大きな違いは見られない。一方、図18に示すように、物体およびデフォルトボックスDbの形状が縦長矩形の場合、形状変換前では、物体とデフォルトボックスDbとの横方向のズレ量が増大するほど物体検出結果のスコアが低下している。 As shown in FIGS. 16 and 17, when the shapes of the object and the default box Db are square, there is no significant difference in the score of the object detection result before and after the shape transformation. On the other hand, as shown in FIG. 18, when the shapes of the object and the default box Db are vertically long rectangles, before the shape transformation the score of the object detection result decreases as the amount of horizontal displacement between the object and the default box Db increases.
 これに対して、図19に示すように、形状変換後では、物体およびデフォルトボックスDbの形状が縦長矩形の場合、物体検出結果のスコアが改善されている。このように、情報処理装置は、本実施形態に係る教師データを使用して機械学習を行うことにより、物体の検出精度を向上させることができる。 On the other hand, as shown in FIG. 19, after shape conversion, when the shape of the object and default box Db is a vertically long rectangle, the score of the object detection result is improved. Thus, the information processing apparatus can improve the object detection accuracy by performing machine learning using the teacher data according to the present embodiment.
[6.効果]
 実施形態に係る教師データ生成方法は、コンピュータが実行する教師データ生成方法であって、画像から抽出される特徴マップFm上に配置されるデフォルトボックスDbおよび画像における物体Tgに付与されるラベルデータLdに対して形状変換を行い、形状変換後のデフォルトボックスDbとラベルデータLdとのマッチングによって、画像のグランドトゥルースGTにするデフォルトボックスDb´を決定して教師データを生成する。これにより、情報処理装置は、実施形態に係る教師データ生成方法によって生成された教師データを使用してCNNを機械学習することによって、物体の検出精度を向上させることができる。
[6. effect]
 The teacher data generation method according to the embodiment is a teacher data generation method executed by a computer, in which shape transformation is performed on the default box Db arranged on the feature map Fm extracted from an image and on the label data Ld given to the object Tg in the image, and the default box Db' to be used as the ground truth GT of the image is determined by matching the shape-transformed default box Db against the label data Ld, thereby generating teacher data. As a result, the information processing device can improve object detection accuracy by machine-learning the CNN using the teacher data generated by the teacher data generation method according to the embodiment.
 また、形状変換は、デフォルトボックスDbおよびラベルデータLdのアスペクト比の変更を含む。これにより、情報処理装置は、デフォルトボックスDbおよびラベルデータLdの形状を同じ大きさの正方形に近付けることができる。 Also, the shape conversion includes changing the aspect ratio of the default box Db and the label data Ld. As a result, the information processing device can approximate the shapes of the default box Db and the label data Ld to squares of the same size.
 また、アスペクト比の変更は、アスペクト比の逆変換を含む。これにより、情報処理装置は、デフォルトボックスDbおよびラベルデータLdの形状を同じ大きさの正方形に近付けることができる。 Also, changing the aspect ratio includes inverse conversion of the aspect ratio. As a result, the information processing device can approximate the shapes of the default box Db and the label data Ld to squares of the same size.
 また、デフォルトボックスDbおよびラベルデータLdの中心位置を変化させずに形状変換を行う。これにより、情報処理装置は、ラベルデータLdとラベルデータLd´とをマッチングすることで、ラベルデータLd´の配置が多少ずれていても、ラベルデータLd´によって物体の領域のほぼ全体を囲むことができる。 In addition, the shape transformation is performed without changing the center positions of the default box Db and the label data Ld. As a result, by matching the label data Ld against the label data Ld', the information processing device can surround almost the entire area of the object with the label data Ld' even if the placement of the label data Ld' is slightly off.
 また、畳み込みニューラルネットワークの階層毎に、教師データの生成を行う。これにより、情報処理装置は、様々な大きさの物体の検出精度を向上させることができる。 In addition, teacher data is generated for each layer of the convolutional neural network. Thereby, the information processing device can improve the detection accuracy of objects of various sizes.
 また、実施形態に係る教師データ生成プログラムは、画像から抽出される特徴マップFm上に配置されるデフォルトボックスDbおよび画像における物体Tgに付与されるラベルデータLdに対して形状変換を行う手順と、形状変換後のデフォルトボックスDbとラベルデータLdとのマッチングによって、画像のグランドトゥルースにするデフォルトボックスDbを決定して教師データを生成する手順とをコンピュータに実行させる。これにより、コンピュータは、実施形態に係る教師データ生成方法によって生成された教師データを使用してCNNを機械学習することによって、物体の検出精度を向上させることができる。 In addition, the teacher data generation program according to the embodiment causes a computer to execute a procedure of performing shape transformation on the default box Db arranged on the feature map Fm extracted from an image and on the label data Ld given to the object Tg in the image, and a procedure of determining the default box Db to be used as the ground truth of the image by matching the shape-transformed default box Db against the label data Ld and generating teacher data. As a result, the computer can improve object detection accuracy by machine-learning the CNN using the teacher data generated by the teacher data generation method according to the embodiment.
 また、実施形態に係る情報処理装置は、情報処理部を備える。情報処理部は、画像から抽出される特徴マップFm上に配置されるデフォルトボックスDbおよび画像における物体Tgに付与されるラベルデータLdに対して形状変換を行い、形状変換後のデフォルトボックスDbとラベルデータLdとのマッチングによって、画像のグランドトゥルースにするデフォルトボックスDbを決定して教師データを生成し、教師データを使用して畳み込みニューラルネットワークを学習し、畳み込みニューラルネットワークに入力される画像から物体Tgを検出する。これにより、情報処理装置は、物体の検出精度を向上させることができる。 In addition, the information processing device according to the embodiment includes an information processing unit. The information processing unit performs shape transformation on the default box Db arranged on the feature map Fm extracted from an image and on the label data Ld given to the object Tg in the image, determines the default box Db to be used as the ground truth of the image by matching the shape-transformed default box Db against the label data Ld to generate teacher data, trains the convolutional neural network using the teacher data, and detects the object Tg from an image input to the convolutional neural network. As a result, the information processing device can improve object detection accuracy.
 また、実施形態に係る情報処理方法は、コンピュータが実行する物体Tg検出方法であって、画像から抽出される特徴マップFm上に配置されるデフォルトボックスDbおよび画像における物体Tgに付与されるラベルデータLdに対して形状変換を行い、形状変換後のデフォルトボックスDbとラベルデータLdとのマッチングによって、画像のグランドトゥルースにするデフォルトボックスDbを決定して教師データを生成し、教師データを使用して畳み込みニューラルネットワークを学習し、畳み込みニューラルネットワークに入力される画像から物体Tgを検出する。これにより、コンピュータは、物体の検出精度を向上させることができる。 In addition, the information processing method according to the embodiment is a method executed by a computer for detecting an object Tg, in which shape transformation is performed on the default box Db arranged on the feature map Fm extracted from an image and on the label data Ld given to the object Tg in the image, the default box Db to be used as the ground truth of the image is determined by matching the shape-transformed default box Db against the label data Ld to generate teacher data, a convolutional neural network is trained using the teacher data, and the object Tg is detected from an image input to the convolutional neural network. As a result, the computer can improve object detection accuracy.
 また、実施形態に係る情報処理プログラムは、画像から抽出される特徴マップFm上に配置されるデフォルトボックスDbおよび画像における物体Tgに付与されるラベルデータLdに対して形状変換を行う手順と、形状変換後のデフォルトボックスDbとラベルデータLdとのマッチングによって、画像のグランドトゥルースにするデフォルトボックスDbを決定して教師データを生成する手順と、教師データを使用して畳み込みニューラルネットワークを学習する手順と、畳み込みニューラルネットワークに入力される画像から物体Tgを検出する手順とをコンピュータに実行させる。これにより、コンピュータは、物体の検出精度を向上させることができる。 In addition, the information processing program according to the embodiment causes a computer to execute a procedure of performing shape transformation on the default box Db arranged on the feature map Fm extracted from an image and on the label data Ld given to the object Tg in the image, a procedure of determining the default box Db to be used as the ground truth of the image by matching the shape-transformed default box Db against the label data Ld and generating teacher data, a procedure of training a convolutional neural network using the teacher data, and a procedure of detecting the object Tg from an image input to the convolutional neural network. As a result, the computer can improve object detection accuracy.
 なお、本明細書に記載された効果はあくまで例示であって限定されるものでは無く、また他の効果があってもよい。 It should be noted that the effects described in this specification are only examples and are not limited, and other effects may also occur.
 なお、本技術は以下のような構成も取ることができる。
(1)
 コンピュータが実行する教師データ生成方法であって、
 画像から抽出される特徴マップ上に配置されるデフォルトボックスおよび前記画像における物体に付与されるラベルデータに対して形状変換を行い、
 前記形状変換後の前記デフォルトボックスとラベルデータとのマッチングによって、前記画像のグランドトゥルースにするデフォルトボックスを決定して教師データを生成する
 ことを含む教師データ生成方法。
(2)
 前記形状変換は、
 前記デフォルトボックスおよび前記ラベルデータのアスペクト比の変更
 を含む(1)に記載の教師データ生成方法。
(3)
 前記アスペクト比の変更は、
 前記アスペクト比の逆変換
 を含む(2)に記載の教師データ生成方法。
(4)
 前記デフォルトボックスおよび前記ラベルデータの中心位置を変化させずに前記形状変換を行う
 ことを含む(1)から(3)のいずれか一つに記載の教師データ生成方法。
(5)
 畳み込みニューラルネットワークの階層毎に、前記教師データの生成を行う
 ことを含む(1)から(4)のいずれか一つに記載の教師データ生成方法。
(6)
 画像から抽出される特徴マップ上に配置されるデフォルトボックスおよび前記画像における物体に付与されるラベルデータに対して形状変換を行う手順と、
 前記形状変換後の前記デフォルトボックスとラベルデータとのマッチングによって、前記画像のグランドトゥルースにするデフォルトボックスを決定して教師データを生成する手順と
 をコンピュータに実行させる教師データ生成プログラム。
(7)
 画像から抽出される特徴マップ上に配置されるデフォルトボックスおよび前記画像における物体に付与されるラベルデータに対して形状変換を行い、
 前記形状変換後の前記デフォルトボックスとラベルデータとのマッチングによって、前記画像のグランドトゥルースにするデフォルトボックスを決定して教師データを生成し、
 前記教師データを使用して畳み込みニューラルネットワークを学習し、
 前記畳み込みニューラルネットワークに入力される画像から物体を検出する情報処理部
 を備える情報処理装置。
(8)
 コンピュータが実行する情報処理方法であって、
 画像から抽出される特徴マップ上に配置されるデフォルトボックスおよび前記画像における物体に付与されるラベルデータに対して形状変換を行い、
 前記形状変換後の前記デフォルトボックスとラベルデータとのマッチングによって、前記画像のグランドトゥルースにするデフォルトボックスを決定して教師データを生成し、
 前記教師データを使用して畳み込みニューラルネットワークを学習し、
 前記畳み込みニューラルネットワークに入力される画像から物体を検出する
 ことを含む情報処理方法。
(9)
 画像から抽出される特徴マップ上に配置されるデフォルトボックスおよび前記画像における物体に付与されるラベルデータに対して形状変換を行う手順と、
 前記形状変換後の前記デフォルトボックスとラベルデータとのマッチングによって、前記画像のグランドトゥルースにするデフォルトボックスを決定して教師データを生成する手順と、
 前記教師データを使用して畳み込みニューラルネットワークを学習する手順と、
 前記畳み込みニューラルネットワークに入力される画像から物体を検出する手順と
 をコンピュータに実行させる情報処理プログラム。
Note that the present technology can also take the following configuration.
(1)
A training data generation method executed by a computer,
performing shape transformation on a default box placed on a feature map extracted from an image and label data given to an object in the image;
A teacher data generation method, comprising: determining a default box to be ground truth of the image by matching the default box after the shape transformation and label data, and generating teacher data.
(2)
The shape transformation is
The teacher data generating method according to (1), including changing aspect ratios of the default box and the label data.
(3)
Changing the aspect ratio includes:
The teacher data generation method according to (2), including inverse transformation of the aspect ratio.
(4)
The teacher data generation method according to any one of (1) to (3), including performing the shape conversion without changing the center positions of the default box and the label data.
(5)
The teacher data generating method according to any one of (1) to (4), including generating the teacher data for each layer of the convolutional neural network.
(6)
a procedure of shape transformation for a default box placed on a feature map extracted from an image and label data given to an object in the image;
A training data generation program for causing a computer to execute a procedure of determining a default box to be the ground truth of the image by matching the default box after the shape transformation with the label data and generating training data.
(7)
performing shape transformation on a default box placed on a feature map extracted from an image and label data given to an object in the image;
determining a default box to be the ground truth of the image by matching the default box after the shape transformation and the label data to generate teacher data;
training a convolutional neural network using the training data;
An information processing apparatus comprising an information processing unit that detects an object from an image input to the convolutional neural network.
(8)
A computer-executed information processing method comprising:
performing shape transformation on a default box placed on a feature map extracted from an image and label data given to an object in the image;
determining a default box to be the ground truth of the image by matching the default box after the shape transformation and the label data to generate teacher data;
training a convolutional neural network using the training data;
An information processing method comprising detecting an object from an image input to the convolutional neural network.
(9)
a procedure of shape transformation for a default box placed on a feature map extracted from an image and label data given to an object in the image;
a step of determining a default box to be the ground truth of the image by matching the default box after the shape transformation and the label data to generate training data;
training a convolutional neural network using the training data;
An information processing program for causing a computer to execute a procedure for detecting an object from an image input to the convolutional neural network.
 Pa 入力画像
 Pb ラベルデータ付画像
 Fm 特徴マップ
 Db,Db´ デフォルトボックス
 Ld,Ld´ ラベルデータ
Pa: input image
Pb: image with label data
Fm: feature map
Db, Db': default box
Ld, Ld': label data

Claims (9)

  1.  コンピュータが実行する教師データ生成方法であって、
     画像から抽出される特徴マップ上に配置されるデフォルトボックスおよび前記画像における物体に付与されるラベルデータに対して形状変換を行い、
     前記形状変換後の前記デフォルトボックスとラベルデータとのマッチングによって、前記画像のグランドトゥルースにするデフォルトボックスを決定して教師データを生成する
     ことを含む教師データ生成方法。
    A training data generation method executed by a computer,
    performing shape transformation on a default box placed on a feature map extracted from an image and label data given to an object in the image;
    A teacher data generation method, comprising: determining a default box to be ground truth of the image by matching the default box after the shape transformation and label data, and generating teacher data.
  2.  前記形状変換は、
     前記デフォルトボックスおよび前記ラベルデータのアスペクト比の変更
     を含む請求項1に記載の教師データ生成方法。
    The shape transformation is
    2. The teaching data generation method according to claim 1, further comprising: changing aspect ratios of said default box and said label data.
  3.  前記アスペクト比の変更は、
     前記アスペクト比の逆変換
     を含む請求項2に記載の教師データ生成方法。
    Changing the aspect ratio includes:
    3. The teacher data generation method according to claim 2, further comprising inverse transformation of the aspect ratio.
  4.  前記デフォルトボックスおよび前記ラベルデータの中心位置を変化させずに前記形状変換を行う
     ことを含む請求項1に記載の教師データ生成方法。
    2. The teaching data generating method according to claim 1, further comprising performing said shape conversion without changing center positions of said default box and said label data.
  5.  畳み込みニューラルネットワークの階層毎に、前記教師データの生成を行う
     ことを含む請求項1に記載の教師データ生成方法。
    The teacher data generation method according to claim 1, comprising: generating the teacher data for each layer of the convolutional neural network.
  6.  画像から抽出される特徴マップ上に配置されるデフォルトボックスおよび前記画像における物体に付与されるラベルデータに対して形状変換を行う手順と、
     前記形状変換後の前記デフォルトボックスとラベルデータとのマッチングによって、前記画像のグランドトゥルースにするデフォルトボックスを決定して教師データを生成する手順と
     をコンピュータに実行させる教師データ生成プログラム。
    a procedure of shape transformation for a default box placed on a feature map extracted from an image and label data given to an object in the image;
    A training data generation program for causing a computer to execute a procedure of determining a default box to be ground truth of the image by matching the default box after the shape transformation with the label data and generating training data.
  7.  画像から抽出される特徴マップ上に配置されるデフォルトボックスおよび前記画像における物体に付与されるラベルデータに対して形状変換を行い、
     前記形状変換後の前記デフォルトボックスとラベルデータとのマッチングによって、前記画像のグランドトゥルースにするデフォルトボックスを決定して教師データを生成し、
     前記教師データを使用して畳み込みニューラルネットワークを学習し、
     前記畳み込みニューラルネットワークに入力される画像から物体を検出する情報処理部
     を備える情報処理装置。
    performing shape transformation on a default box placed on a feature map extracted from an image and label data given to an object in the image;
    determining a default box to be the ground truth of the image by matching the default box after the shape transformation and the label data to generate teacher data;
    training a convolutional neural network using the training data;
    An information processing apparatus comprising an information processing unit that detects an object from an image input to the convolutional neural network.
  8.  コンピュータが実行する情報処理方法であって、
     画像から抽出される特徴マップ上に配置されるデフォルトボックスおよび前記画像における物体に付与されるラベルデータに対して形状変換を行い、
     前記形状変換後の前記デフォルトボックスとラベルデータとのマッチングによって、前記画像のグランドトゥルースにするデフォルトボックスを決定して教師データを生成し、
     前記教師データを使用して畳み込みニューラルネットワークを学習し、
     前記畳み込みニューラルネットワークに入力される画像から物体を検出する
     ことを含む情報処理方法。
    A computer-executed information processing method comprising:
    performing shape transformation on a default box placed on a feature map extracted from an image and label data given to an object in the image;
    determining a default box to be the ground truth of the image by matching the default box after the shape transformation and the label data to generate teacher data;
    training a convolutional neural network using the training data;
    An information processing method comprising detecting an object from an image input to the convolutional neural network.
  9.  画像から抽出される特徴マップ上に配置されるデフォルトボックスおよび前記画像における物体に付与されるラベルデータに対して形状変換を行う手順と、
     前記形状変換後の前記デフォルトボックスとラベルデータとのマッチングによって、前記画像のグランドトゥルースにするデフォルトボックスを決定して教師データを生成する手順と、
     前記教師データを使用して畳み込みニューラルネットワークを学習する手順と、
     前記畳み込みニューラルネットワークに入力される画像から物体を検出する手順と
     をコンピュータに実行させる情報処理プログラム。
    a procedure of shape transformation for a default box placed on a feature map extracted from an image and label data given to an object in the image;
    a step of determining a default box to be the ground truth of the image by matching the default box after the shape transformation and the label data to generate training data;
    training a convolutional neural network using the training data;
    An information processing program for causing a computer to execute a procedure for detecting an object from an image input to the convolutional neural network.
PCT/JP2022/041036 2021-11-09 2022-11-02 Teaching data generation method, teaching data generation program, information processing device, information processing method and information processing program WO2023085190A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2023559595A JPWO2023085190A1 (en) 2021-11-09 2022-11-02

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021182769 2021-11-09
JP2021-182769 2021-11-09

Publications (1)

Publication Number Publication Date
WO2023085190A1 true WO2023085190A1 (en) 2023-05-19

Family

ID=86335919

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/041036 WO2023085190A1 (en) 2021-11-09 2022-11-02 Teaching data generation method, teaching data generation program, information processing device, information processing method and information processing program

Country Status (2)

Country Link
JP (1) JPWO2023085190A1 (en)
WO (1) WO2023085190A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027532A (en) * 2019-12-11 2020-04-17 上海眼控科技股份有限公司 System and method for identifying tax amount of insurance policy vehicle and ship for forced insurance
CN111163628A (en) * 2017-05-09 2020-05-15 蓝河技术有限公司 Automatic plant detection using image data
US20200258313A1 (en) * 2017-05-26 2020-08-13 Snap Inc. Neural network-based image stream modification
CN112446231A (en) * 2019-08-27 2021-03-05 丰图科技(深圳)有限公司 Pedestrian crossing detection method and device, computer equipment and storage medium


Also Published As

Publication number Publication date
JPWO2023085190A1 (en) 2023-05-19

Similar Documents

Publication Publication Date Title
US10754347B2 (en) Vehicle control device
US20190347492A1 (en) Vehicle control device
US11507110B2 (en) Vehicle remote assistance system, vehicle remote assistance server, and vehicle remote assistance method
WO2021241189A1 (en) Information processing device, information processing method, and program
US11556127B2 (en) Static obstacle map based perception system
US20240054793A1 (en) Information processing device, information processing method, and program
WO2019150918A1 (en) Information processing device, information processing method, program, and moving body
US20210362727A1 (en) Shared vehicle management device and management method for shared vehicle
JP2022098397A (en) Device and method for processing information, and program
JP2018180641A (en) Vehicle identification device
WO2023153083A1 (en) Information processing device, information processing method, information processing program, and moving device
WO2023085190A1 (en) Teaching data generation method, teaching data generation program, information processing device, information processing method and information processing program
US20230245423A1 (en) Information processing apparatus, information processing method, and program
WO2023085017A1 (en) Learning method, learning program, information processing device, information processing method, and information processing program
JP7491267B2 (en) Information processing server, processing method for information processing server, and program
WO2023149089A1 (en) Learning device, learning method, and learning program
WO2023162497A1 (en) Image-processing device, image-processing method, and image-processing program
WO2023090001A1 (en) Information processing device, information processing method, and program
WO2023063145A1 (en) Information processing device, information processing method, and information processing program
WO2024024471A1 (en) Information processing device, information processing method, and information processing system
WO2022075039A1 (en) Information processing device, information processing system, and information processing method
KR102388625B1 (en) Autonomous vehicle for field learning with artificial intelligence applied
WO2023145460A1 (en) Vibration detection system and vibration detection method
WO2023054090A1 (en) Recognition processing device, recognition processing method, and recognition processing system
WO2023074419A1 (en) Information processing device, information processing method, and information processing system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22892689

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023559595

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE