CN109977943B - Image target recognition method, system and storage medium based on YOLO - Google Patents
- Publication number: CN109977943B (application CN201910114621.5A)
- Authority
- CN
- China
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
Abstract
The invention relates to artificial intelligence technology and provides a YOLO-based image target recognition method, system and storage medium, wherein the method comprises the following steps: receiving an image to be detected; adjusting the size of the image to be detected according to a preset requirement to generate a first detection image; sending the first detection image to a neural network model for matching identification to generate a detection frame, classification identification information and a classification probability value corresponding to the classification identification information; judging whether the classification probability value is larger than a preset classification probability threshold; and if so, taking the detection frame and the classification identification information as the recognition result. The technical scheme effectively improves detection precision and reduces detection time. Compared with detection methods in the prior art, the method provided by the invention improves identification accuracy and increases operation speed.
Description
Technical Field
The present invention relates to the field of computer learning and image recognition, and more particularly, to a YOLO-based image target recognition method, system and storage medium.
Background
With the rapid development of artificial intelligence technology, deep learning is increasingly applied to the field of computer vision, especially image target detection.
In recent years, target detection algorithms have made great breakthroughs. The popular algorithms can be divided into two types. One type comprises two-stage algorithms based on region proposals, such as the R-CNN family (R-CNN, Fast R-CNN, Faster R-CNN), which first generate region proposals with a heuristic method (selective search) or a CNN network (RPN) and then classify and regress on those proposals. The other type comprises one-stage algorithms such as YOLO (You Only Look Once) and SSD, which use a single CNN network to directly predict the classes and locations of different targets. The first type of method is more accurate but slower; the second type is faster but less accurate. More and more target detection methods are implemented based on YOLO, and many deep networks are improved based on YOLO. YOLO treats object detection as a regression problem, completing the path from the original image input to the output of object locations and classes with a single end-to-end network.
The key idea of YOLO is to use the whole image as the input of the network and directly regress the position of the bounding box, and the category to which it belongs, at the output layer. On the basis of YOLO's high-speed operation, how to design a method capable of improving YOLO's accuracy is a problem urgently to be solved.
Disclosure of Invention
In order to solve at least one technical problem, the invention provides an image target identification method, an image target identification system and a storage medium based on YOLO.
In order to achieve the above object, the technical scheme of the present invention provides a YOLO-based image target recognition method, which includes:
Receiving an image to be detected;
The size of the image to be detected is adjusted according to a preset requirement, and a first detection image is generated;
the first detection image is sent to a neural network model for matching identification, and a detection frame, classification identification information and a classification probability value corresponding to the classification identification information are generated;
judging whether the classification probability value is larger than a preset classification probability threshold value or not;
and if the classification probability value is larger than the preset classification probability threshold, taking the detection frame and the classification identification information as the recognition result.
In this solution, before the receiving the image to be detected, the method further includes:
Training pictures to obtain a neural network model; the neural network model is trained by the following steps:
acquiring a training image dataset;
performing image preprocessing on the training image data set to obtain a preprocessed image set;
training the preprocessed image set to obtain the neural network model with the input interface and the output interface.
In this scheme, the step of generating the detection frame specifically includes:
generating an initial detection frame according to initial preset coordinate points;
Performing prediction of a dynamic detection frame, performing iterative prediction on the generated detection frame, and generating a latest detection frame;
calculating the coincidence ratio of the latest detection frame;
if the latest detection frame overlap ratio is greater than or equal to a preset overlap ratio threshold value, reserving the latest detection frame; if the latest detection frame overlap ratio is smaller than a preset overlap ratio threshold value, continuing to predict the dynamic detection frame;
and finally generating N candidate detection frames of the same class.
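The iterative loop described in the steps above can be sketched as follows. This is an illustration only, not the claimed implementation: `predict_frame` and `overlap_ratio` are hypothetical callables standing in for the network's box regression and the coincidence-ratio calculation, which the scheme does not name concretely.

```python
def refine_detection_frame(initial_frame, predict_frame, overlap_ratio,
                           overlap_threshold=0.5, max_iters=100):
    """Sketch of dynamic detection-frame prediction: iterate the
    prediction until the coincidence ratio reaches the preset
    threshold, then keep the latest frame."""
    frame = initial_frame
    for _ in range(max_iters):
        frame = predict_frame(frame)               # iterative prediction
        if overlap_ratio(frame) >= overlap_threshold:
            return frame                           # reserve the latest frame
    return frame                                   # fallback after max_iters
```

In the scheme, frames below the threshold simply trigger another prediction round; `max_iters` is an added safeguard so the sketch always terminates.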
In this scheme, the performing of dynamic detection frame prediction, i.e. iteratively predicting on the generated detection frame and generating the latest detection frame, specifically includes:
predicting 4 coordinates (t_x, t_y, t_w, t_h) for each detection frame; if the cell is offset from the upper left corner of the image by (c_x, c_y) and the prior detection frame has width and height p_w, p_h, the coordinates of the latest detection frame are:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
wherein b_x, b_y, b_w, b_h are respectively the four position values of the latest detection frame. The detection frame is a quadrilateral whose position can be determined through these 4 values.
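As an illustration only (not part of the claimed scheme), the decoding formulas above can be written as a short Python function; the sigmoid σ squashes the predicted centre offsets into the cell, and the prior width/height scale the exponentiated size predictions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw predictions (tx, ty, tw, th) into the latest
    detection frame (bx, by, bw, bh), given the cell offset
    (cx, cy) and the prior frame's width/height (pw, ph)."""
    bx = sigmoid(tx) + cx        # b_x = σ(t_x) + c_x
    by = sigmoid(ty) + cy        # b_y = σ(t_y) + c_y
    bw = pw * math.exp(tw)       # b_w = p_w · e^(t_w)
    bh = ph * math.exp(th)       # b_h = p_h · e^(t_h)
    return bx, by, bw, bh
```

With zero raw predictions the decoded frame sits at the cell centre offset by 0.5 and takes exactly the prior's size.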
Preferably, 53 layers of convolution operations are adopted in the neural network model, alternating between 3×3 and 1×1 convolution layers.
In this embodiment, the adjusted image size is the size specified by the neural network model.
The technical scheme of the invention also provides a YOLO-based image target recognition system, which comprises: the image target recognition system comprises a memory, a processor and an image pickup device, wherein the memory comprises a YOLO-based image target recognition method program, and the image target recognition method program based on the YOLO realizes the following steps when being executed by the processor:
Receiving an image to be detected;
The size of the image to be detected is adjusted according to a preset requirement, and a first detection image is generated;
the first detection image is sent to a neural network model for matching identification, and a detection frame, classification identification information and a classification probability value corresponding to the classification identification information are generated;
judging whether the classification probability value is larger than a preset classification probability threshold value or not;
and if the classification probability value is larger than the preset classification probability threshold, taking the detection frame and the classification identification information as the recognition result.
In this scheme, the step of generating the detection frame specifically includes:
generating an initial detection frame according to initial preset coordinate points;
Performing prediction of a dynamic detection frame, performing iterative prediction on the generated detection frame, and generating a latest detection frame;
calculating the coincidence ratio of the latest detection frame;
if the latest detection frame overlap ratio is greater than or equal to a preset overlap ratio threshold value, reserving the latest detection frame; if the latest detection frame overlap ratio is smaller than a preset overlap ratio threshold value, continuing to predict the dynamic detection frame;
And finally generating N similar detection frames of the same class.
In this scheme, the performing of dynamic detection frame prediction, i.e. iteratively predicting on the generated detection frame and generating the latest detection frame, specifically includes:
predicting 4 coordinates (t_x, t_y, t_w, t_h) for each detection frame; if the cell is offset from the upper left corner of the image by (c_x, c_y) and the prior detection frame has width and height p_w, p_h, the coordinates of the latest detection frame are:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
wherein b_x, b_y, b_w, b_h are respectively the four position values of the latest detection frame. The detection frame is a quadrilateral whose position can be determined through these 4 values.
In this solution, before the receiving the image to be detected, the method further includes:
training pictures to obtain a neural network model; the neural network model is a model with an input interface and an output interface, which is obtained by performing image training on pictures of different categories.
Preferably, 53 layers of convolution operations are adopted in the neural network model, alternating between 3×3 and 1×1 convolution layers.
In this embodiment, the adjusted image size is the size specified by the neural network model.
The third aspect of the present invention also provides a computer-readable storage medium having embodied therein a YOLO-based image target recognition method program which, when executed by a processor, implements the steps of a YOLO-based image target recognition method as described above.
The invention provides a YOLO-based image target recognition method, system and storage medium. The method judges the classification recognition probability and takes the recognition information as the recognition result only when the preset classification probability threshold is reached, thereby improving the accuracy of image recognition. The invention can also adjust the position of the detection frame in real time, effectively improving detection efficiency and precision, and reduces detection time by optimizing the detection calculation. Experiments and verification show that the method of the invention is superior to detection methods in the prior art: identification accuracy is improved and operation speed is increased.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flowchart of a method for recognizing an image object based on YOLO according to the present invention;
FIG. 2 shows a schematic diagram of convolution operation in the classification process of the present invention;
FIG. 3 shows a block diagram of a YOLO-based image target recognition system of the present invention;
fig. 4 shows a schematic diagram of an embodiment of the invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present application will be more clearly understood, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, without conflict, the embodiments of the present application and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
FIG. 1 is a flowchart of a method for recognizing an image object based on YOLO according to the present invention.
As shown in fig. 1, the technical scheme of the present invention provides a YOLO-based image target recognition method, which includes:
s102, receiving an image to be detected;
s104, adjusting the size of the image to be detected according to a preset requirement to generate a first detection image;
S106, the first detection image is sent to a neural network model for matching identification, and a detection frame, classification identification information and a classification probability value corresponding to the classification identification information are generated;
s108, judging whether the classification probability value is larger than a preset classification probability threshold value;
and S110, if the classification probability value is larger than the preset classification probability threshold, taking the detection frame and the classification identification information as the recognition result.
The adjusted size is the size specified by the neural network model. This size is generally selected to be smaller than the image to be detected, which ensures the speed of the operation processing so that class identification can be carried out rapidly. It will be appreciated by those skilled in the art that the size may be set according to actual needs; it is not limited to the above and is not intended to limit the scope of the present invention.
The first detection image is sent to the neural network model, generating a detection frame, classification identification information and a classification probability value corresponding to the classification identification information. A person skilled in the art can set the classification probability threshold according to actual needs. For example, with the threshold set to 90%, when detecting a picture containing a kitten, if the probability of identifying a kitten in the detection frame exceeds 90%, the kitten is circled by the detection frame and has been identified. If the classification probability value is smaller than the preset classification probability threshold, the method returns to step S106 for re-identification until the classification probability value is larger than the threshold. The neural network model performs multi-layer convolution operations on the image; the YOLO convolution operation is conventional in the field and belongs to the prior art, so it is not described in detail.
In this solution, before receiving the image to be detected in the step S102, the method further includes:
Training pictures to obtain a neural network model; the neural network model is trained by the following steps:
acquiring a training image dataset;
performing image preprocessing on the training image data set to obtain a preprocessed image set;
training the preprocessed image set to obtain the neural network model with the input interface and the output interface.
It should be noted that the training image data set has 1000 object categories and 1.2 million training images. The data set is preprocessed before training; the preprocessing comprises one or more of rotation, contrast enhancement, tilting and scaling. After preprocessing, the images have a certain distortion, and training on the distorted images can increase the accuracy of the final image recognition.
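As an illustrative sketch (the scheme does not prescribe a concrete implementation, and real pipelines would use an image library), two of the named preprocessing distortions, contrast enhancement and rotation, can be expressed on a greyscale pixel grid as:

```python
def enhance_contrast(img, factor=1.5, mid=128):
    """Scale each pixel's deviation from mid-grey by `factor`,
    clamping the result to the valid [0, 255] range."""
    return [[max(0, min(255, int(mid + (p - mid) * factor))) for p in row]
            for row in img]

def rotate_90(img):
    """Rotate a 2-D pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment(img):
    """Yield distorted variants of one training image (a subset of
    the rotation / contrast / tilt / scale operations named above)."""
    yield img                     # original
    yield enhance_contrast(img)   # contrast-enhanced variant
    yield rotate_90(img)          # rotated variant
```

The `factor` and `mid` values are illustrative defaults, not parameters taken from the patent.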
In this scheme, the step of generating the detection frame specifically includes:
generating an initial detection frame according to initial preset coordinate points;
Performing prediction of a dynamic detection frame, performing iterative prediction on the generated detection frame, and generating a latest detection frame;
calculating the coincidence ratio of the latest detection frame;
if the latest detection frame overlap ratio is greater than or equal to a preset overlap ratio threshold value, reserving the latest detection frame; if the latest detection frame overlap ratio is smaller than a preset overlap ratio threshold value, continuing to predict the dynamic detection frame;
and finally generating N candidate detection frames of the same class.
It should be noted that, the initial preset coordinate point may be a coordinate point of a preset detection frame, which may be automatically generated during training and recognition detection, or may be generated by a person skilled in the art according to actual needs.
In this scheme, the performing of dynamic detection frame prediction, i.e. iteratively predicting on the generated detection frame and generating the latest detection frame, specifically includes:
predicting 4 coordinates (t_x, t_y, t_w, t_h) for each detection frame; if the cell is offset from the upper left corner of the image by (c_x, c_y) and the prior detection frame has width and height p_w, p_h, the coordinates of the latest detection frame are:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
wherein b_x, b_y, b_w, b_h are respectively the four position values of the latest detection frame. The detection frame is a quadrilateral whose position can be determined through these 4 values.
The network predicts 4 coordinates (t_x, t_y, t_w, t_h) per detection frame. Given the offset (c_x, c_y) of the cell from the upper left corner of the image, the coordinates of the latest detection frame expressed by the above formulas can be derived.
It should be noted that each frame uses multiple labels to classify the classes that the predicted bounding box may contain. In the process of category identification, the application uses a binary cross-entropy loss to conduct category prediction. The reason for using binary cross-entropy loss is that the inventors found a softmax is not required for good performance; independent logistic classifiers are used instead, so this step does not require the softmax technique. The binary cross-entropy loss provides more assistance when the method of the application is migrated to more complex category identification areas. Binary cross-entropy loss is a common technique in the field which a person skilled in the art can implement as required, so the application does not elaborate.
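A minimal sketch of the binary cross-entropy loss used for multi-label category prediction, averaged over classes (an illustration, not the claimed implementation). Each class gets an independent logistic term, so labels are not mutually exclusive, unlike a softmax over classes:

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over independent class predictions.
    `probs` are per-class probabilities in [0, 1]; `labels` are the
    0/1 ground-truth indicators for the same classes."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)
```

A perfectly confident correct prediction gives a loss near zero; a 0.5 prediction for a positive class gives exactly log 2.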
Preferably, 53 layers of convolution operations are adopted in the neural network model, alternating between 3×3 and 1×1 convolution layers. The inventors found in limited practical tests that alternating the convolution layers can increase accuracy and effectively improve operation speed. Specifically, a 3×3 convolution operation is applied first, then a 1×1 convolution operation, and the two alternate in turn until all convolution layers have participated in the operation.
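The alternation of kernel sizes can be sketched as follows. Channel widths and any residual connections of the 53-layer network are not specified in this description, so only the 3×3/1×1 alternation is shown:

```python
def alternating_conv_plan(n_layers=53):
    """Build a layer plan where odd-numbered layers use 3x3 kernels
    and even-numbered layers use 1x1 kernels, alternating until all
    n_layers have been assigned."""
    plan = []
    for i in range(n_layers):
        kernel = 3 if i % 2 == 0 else 1   # 3x3 first, then 1x1, alternating
        plan.append((i + 1, f"{kernel}x{kernel}"))
    return plan
```

With the default of 53 layers, the plan starts 3×3, 1×1, 3×3, … and ends on a 3×3 layer.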
In this embodiment, the adjusted image size is the size specified by the neural network model.
In order to better explain the technical scheme of the present invention, the following describes the technical scheme of the present invention in detail.
After the detection frames are generated, the classification probability values in the detection frames are calculated, and the optimal N detection frames of the same class are screened out. It should be noted that the size of the detection frame is dynamically predicted; the dynamic prediction process is the scheme described above. The M classification probability values of each detection frame are screened using a probability threshold according to the following screening rules:
Calculate the classification probability values of each detection frame, arrange them in descending order, and select the highest-ranking classification. In this first round of screening, the M categories of each detection frame are compared and the "champion" category with the highest probability value is selected.
Compare the highest-ranking classification with a preset probability threshold; if it is larger than or equal to the threshold, the detection frame is reserved, and if it is smaller, the detection frame is deleted. In this second round of screening, the champion classification is compared with the probability threshold, and only detection frames whose values exceed the threshold qualify. For example, the probability threshold may be set to 0.24 (24%): after comparison, only detection frames whose classification probability value is larger than or equal to 0.24 are displayed on the picture. Those skilled in the art may set the probability threshold according to actual needs; the probability threshold described in the application does not limit its protection scope.
Then, calculate the coincidence degree of the N same-class detection frames and reserve the detection frame with the highest coincidence degree.
For example, suppose that after the screening step described above, three detection frames all have the classification "horse".
Sort the detection probabilities of the three detection frames in descending order.
Calculate the coincidence degree (IoU) pairwise; if the calculated IoU is more than 0.3, eliminate the detection frame with the lower probability.
The result is a unique detection frame classified as "horse".
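The screening by coincidence degree can be sketched as a greedy suppression, under the assumption (not stated in the source) that boxes are axis-aligned and given as (x1, y1, x2, y2) corners:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def suppress(frames, iou_threshold=0.3):
    """Greedy suppression as described above: sort same-class frames by
    probability descending, then drop any frame whose IoU with an
    already-kept frame exceeds the threshold. `frames` is a list of
    (probability, box) pairs."""
    kept = []
    for prob, box in sorted(frames, key=lambda f: f[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((prob, box))
    return kept
```

Applied to the "horse" example, overlapping lower-probability frames are eliminated and only the strongest frame per cluster of overlaps survives.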
FIG. 2 shows a schematic diagram of convolution operation in the classification process of the present invention.
As shown in fig. 2, the neural network model adopts 53-layer convolution operations, and the convolution operations of each layer are alternately performed for 3×3 and 1×1 convolution layers.
This feature extraction scheme achieves the highest measured floating-point operations per second. This means the neural network architecture can better utilize the machine's GPU, improving evaluation efficiency and thus speed. Compared with ResNets, which have many layers and are not very efficient, the convolution operation of the application is more efficient and accurate.
For example, each neural network is trained with the same settings and tested at 256×256 single-crop accuracy. The classifier using the feature extraction of the application performs comparably to the most advanced classifiers in the prior art, but with fewer floating-point operations and faster speed.
FIG. 3 shows a block diagram of a YOLO-based image object recognition system of the present invention.
As shown in fig. 3, the technical solution of the present invention further provides a YOLO-based image target recognition system 2, which includes: a memory 201, a processor 202 and an image pickup device 203, wherein the memory 201 includes a YOLO-based image target recognition method program which, when executed by the processor, implements the following steps:
Receiving an image to be detected;
The size of the image to be detected is adjusted according to a preset requirement, and a first detection image is generated;
the first detection image is sent to a neural network model for matching identification, and a detection frame, classification identification information and a classification probability value corresponding to the classification identification information are generated;
judging whether the classification probability value is larger than a preset classification probability threshold value or not;
and if the classification probability value is larger than the preset classification probability threshold, taking the detection frame and the classification identification information as the recognition result.
The adjusted size is the size specified by the neural network model. This size is generally selected to be smaller than the image to be detected, which ensures the speed of the operation processing so that class identification can be carried out rapidly. It will be appreciated by those skilled in the art that the size may be set according to actual needs; it is not limited to the above and is not intended to limit the scope of the present invention.
The first detection image is sent to the neural network model, generating a detection frame, classification identification information and a classification probability value corresponding to the classification identification information. A person skilled in the art can set the classification probability threshold according to actual needs. For example, with the threshold set to 90%, when detecting a picture containing a kitten, if the probability of identifying a kitten in the detection frame exceeds 90%, the kitten is circled by the detection frame and has been identified. If the classification probability value is smaller than the preset classification probability threshold, the system returns to the matching identification step for re-identification until the classification probability value is larger than the threshold. The neural network model performs multi-layer convolution operations on the image; the YOLO convolution operation is conventional in the field and belongs to the prior art, so it is not described in detail.
In this solution, before the receiving the image to be detected, the method further includes:
Training pictures to obtain a neural network model; the neural network model is trained by the following steps:
acquiring a training image dataset;
performing image preprocessing on the training image data set to obtain a preprocessed image set;
training the preprocessed image set to obtain the neural network model with the input interface and the output interface.
It should be noted that the training image data set has 1000 object categories and 1.2 million training images. The data set is preprocessed before training; the preprocessing comprises one or more of rotation, contrast enhancement, tilting and scaling. After preprocessing, the images have a certain distortion, and training on the distorted images can increase the accuracy of the final image recognition.
In this scheme, the step of generating the detection frame specifically includes:
generating an initial detection frame according to initial preset coordinate points;
Performing prediction of a dynamic detection frame, performing iterative prediction on the generated detection frame, and generating a latest detection frame;
calculating the coincidence ratio of the latest detection frame;
if the latest detection frame overlap ratio is greater than or equal to a preset overlap ratio threshold value, reserving the latest detection frame; if the latest detection frame overlap ratio is smaller than a preset overlap ratio threshold value, continuing to predict the dynamic detection frame;
And finally generating N similar detection frames of the same class.
It should be noted that, the initial preset coordinate point may be a coordinate point of a preset detection frame, which may be automatically generated during training and recognition detection, or may be generated by a person skilled in the art according to actual needs.
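The dynamic-detection-frame loop described in the steps above can be sketched as follows, assuming a caller-supplied predict_step callable stands in for the network's iterative prediction; the names iou and refine_box are illustrative:

```python
def iou(a, b):
    """Coincidence ratio (intersection-over-union) of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def refine_box(initial_box, predict_step, iou_threshold=0.5, max_iter=20):
    """Keep predicting a latest box from the current one; stop and retain it
    once its overlap with the previous box reaches the preset threshold."""
    box = initial_box
    for _ in range(max_iter):
        new_box = predict_step(box)
        if iou(box, new_box) >= iou_threshold:
            return new_box      # overlap high enough: keep the latest frame
        box = new_box           # otherwise continue the dynamic prediction
    return box

# Toy predictor that moves the box halfway toward (0, 0, 10, 10) each step.
target = (0, 0, 10, 10)
step = lambda b: tuple((bi + ti) / 2 for bi, ti in zip(b, target))
final = refine_box((0, 0, 40, 40), step)
print(iou(final, target))
```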
In this scheme, the performing prediction of the dynamic detection frame, performing iterative prediction on the generated detection frame, and generating the latest detection frame specifically includes:
Predicting 4 coordinates (tx, ty, tw, th) for each detection frame; if the cell is offset from the upper-left corner of the image by (cx, cy) and the prior detection frame has width pw and height ph, the coordinates of the latest detection frame are:
bx=σ(tx)+cx
by=σ(ty)+cy
bw=pw·e^tw
bh=ph·e^th
In the invention, dimension clusters may be used as anchor boxes to dynamically predict detection frames; a detection frame is also a bounding box. The network predicts 4 coordinates tx, ty, tw, th for each detection frame. If the cell is offset from the upper-left corner of the image by (cx, cy), the coordinates of the latest detection frame given by the above formulas can be derived, where bx, by, bw, bh are the four coordinate values of the latest detection frame. The detection frame is a quadrilateral whose position can be determined by these 4 values.
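The coordinate decoding can be sketched as follows; the width/height terms assume the standard YOLOv3 form bw = pw·e^tw and bh = ph·e^th, and the function name decode_box is illustrative:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode network outputs (tx, ty, tw, th) into box coordinates: the
    center offsets are squashed with a sigmoid and added to the cell offset
    (cx, cy); the width/height terms scale the prior box (pw, ph)."""
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh

bx, by, bw, bh = decode_box(0.0, 0.0, 0.0, 0.0, cx=6, cy=6, pw=3.0, ph=4.5)
print(bx, by, bw, bh)  # 6.5 6.5 3.0 4.5
```

Note the sigmoid keeps the predicted center inside its own grid cell, which is what ties each detection frame to one picture region.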
It should be noted that each box uses multiple labels to classify the classes that the predicted bounding box may contain. In the process of category identification, the application uses binary cross-entropy loss for class prediction. The reason for using binary cross-entropy loss is mainly that the inventors found that the softmax technique is not required for good performance; independent logistic classifiers are used instead, so this step does not need softmax. Binary cross-entropy loss is even more helpful when the method of the present application is migrated to more complex category-identification domains. Binary cross-entropy loss is a common technique in the field which a person skilled in the art can implement as required, so it is not described again here.
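A minimal sketch of per-class binary cross-entropy follows, illustrating why independent logistic outputs need no softmax (each class is scored on its own, so several classes can be positive at once); function names are illustrative:

```python
import math

def binary_cross_entropy(predicted, target, eps=1e-12):
    """Per-class binary cross-entropy: each class gets an independent logistic
    prediction in (0, 1), so classes are multi-label and no softmax over
    classes is required."""
    p = min(max(predicted, eps), 1.0 - eps)  # clamp for numerical safety
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def multilabel_loss(predictions, targets):
    """Sum of independent per-class losses for one detection box."""
    return sum(binary_cross_entropy(p, t) for p, t in zip(predictions, targets))

loss = multilabel_loss([0.9, 0.1, 0.8], [1.0, 0.0, 1.0])
print(round(loss, 4))  # 0.4339
```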
In this solution, before the receiving the image to be detected, the method further includes:
training pictures to obtain a neural network model; the neural network model is a model with an input interface and an output interface, which is obtained by performing image training on pictures of different categories.
Preferably, 53 convolution layers are adopted in the neural network model, alternating between 3×3 and 1×1 convolutions. The inventors found in practical tests that alternating the convolution layers in this way increases accuracy and effectively improves operation speed. Specifically, a 3×3 convolution operation is applied first, then a 1×1 convolution operation, and so on alternately until all convolution layers have participated in the operation.
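The alternating 3×3 / 1×1 ordering can be sketched schematically as follows; this lists kernel sizes only and is not a full network definition, and build_layer_spec is an illustrative name:

```python
def build_layer_spec(n_layers=53):
    """Schematic of the alternating layer pattern described above:
    even-indexed layers use 3x3 kernels, odd-indexed layers use 1x1,
    starting with a 3x3 convolution."""
    return [(3, 3) if i % 2 == 0 else (1, 1) for i in range(n_layers)]

spec = build_layer_spec()
print(spec[:4], len(spec))  # [(3, 3), (1, 1), (3, 3), (1, 1)] 53
```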
In this embodiment, the size is the size specified by the neural network model.
It should be noted that, in the neural network model, 53-layer convolution operation is adopted, and the convolution operation of each layer is alternately performed by 3×3 and 1×1 convolution layers.
This feature-extraction scheme achieves the highest measured floating-point operations per second, meaning the neural network architecture makes better use of the machine's GPU, improving evaluation efficiency and therefore speed. The convolution operation of the present application can be more efficient and accurate because it avoids the excessive depth and inefficiency of ResNets.
For example, each neural network was trained with the same settings and tested at 256×256 single-crop accuracy. The classifier using the feature extraction of the application performs on par with the most advanced classifiers in the prior art, but with fewer floating-point operations and higher speed.
The classification probability values in the detection frames are calculated, and the optimal N detection frames of the same class are screened out. The M classification probability values of each detection frame are screened using the probability threshold according to the following set of screening rules:
The classification probability values of each detection frame are calculated, arranged in descending order, and the highest-ranking classification is selected. This first screening round can be described as comparing the M categories of each detection frame and selecting the "champion" category with the highest probability value.
The highest-ranking classification is compared with a preset probability threshold; if it is greater than or equal to the preset probability threshold, the detection frame is retained, and if it is smaller, the detection frame is deleted. In this second screening round, the champion classification is compared with the probability threshold, and only detection frames exceeding it qualify. For example, the probability threshold may be set to 0.24 (24%). After the comparison, the detection frames that pass are displayed on the picture: a detection frame is displayed as long as its classification probability value is equal to or greater than the 0.24 probability threshold.
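The two screening rounds above can be sketched as follows; the function name, data layout, and sample values are illustrative:

```python
def screen_boxes(detections, probability_threshold=0.24):
    """Two-round screening: for each box, pick the top-ranked ("champion")
    class by probability, then keep the box only if that top probability
    reaches the preset threshold (e.g. 0.24)."""
    kept = []
    for box, class_probs in detections:
        best_class = max(class_probs, key=class_probs.get)    # round 1: ranking
        if class_probs[best_class] >= probability_threshold:  # round 2: threshold
            kept.append((box, best_class, class_probs[best_class]))
    return kept

dets = [((0, 0, 10, 10), {"cat": 0.70, "dog": 0.20}),
        ((5, 5, 15, 15), {"cat": 0.10, "dog": 0.15})]
print(screen_boxes(dets))  # only the first box survives, as class "cat"
```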
And calculating the coincidence degree of the N similar detection frames, and reserving the detection frame with the highest coincidence degree.
For example, after the screening steps described above, three detection frames all carry the classification "horse".
The detection probabilities of the three detection frames are sorted in descending order.
The coincidence degree (IoU) is calculated pairwise; if the calculated IoU value is greater than 0.3, the detection frame with the lower probability is eliminated.
The result is a unique detection frame classified as "horse".
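The elimination procedure above can be sketched as an IoU-based suppression; the names and sample boxes are illustrative:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def suppress(boxes, iou_threshold=0.3):
    """Sort same-class boxes by probability (descending), compare pairwise,
    and drop the lower-probability box whenever IoU exceeds the threshold."""
    boxes = sorted(boxes, key=lambda b: b[1], reverse=True)
    kept = []
    for box, prob in boxes:
        if all(iou(box, k) <= iou_threshold for k, _ in kept):
            kept.append((box, prob))
    return kept

horses = [((10, 10, 50, 50), 0.9), ((12, 12, 52, 52), 0.8), ((11, 9, 49, 51), 0.7)]
result = suppress(horses)
print(len(result))  # the three overlapping "horse" boxes collapse to one
```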
In order to better explain the technical scheme of the invention, the following detailed description is given by an embodiment. Fig. 4 shows a schematic diagram of an embodiment of the invention.
As shown in fig. 4, the convolution layers in the neural network model are numbered 0-52. The resized first detection image is then received; its size is 416×416 (the specific size can be set according to actual operation requirements and computing capability, and 416×416 is chosen for this embodiment), and it is a color photo. Layer 0 of the neural network model receives the 416×416, 3-channel (RGB) first color detection image and performs the convolution operation.
After the convolution operations of layers 0-51, a feature map with a size of 13×13 and 425 channels is obtained.
The 52nd layer performs a convolution operation on the feature map and finally outputs a one-dimensional prediction array comprising 13×13×5×85 numerical values. The multi-dimensional array or matrix is reduced to a one-dimensional array through a series of operations; this one-dimensional array is the prediction array.
In the 13×13×5×85 values, 13×13 represents the width × height of the feature map; there are 13×13 feature cells in total. YOLO equally divides the original picture (416×416) into 13×13 regions (cells), one picture region for each feature cell. The specific size may be set by a person skilled in the art according to actual operation requirements and computing capability.
The number 5 represents 5 detection boxes (bounding boxes) of different shapes: YOLO generates 5 differently shaped detection boxes in each picture region, using the center point of the region as the center point of the detection boxes to detect objects, so YOLO uses 13×13×5 detection boxes to detect one picture or image.
The number 85 can be split into 3 parts: 85 = 4 + 1 + 80.
4: each detection frame contains 4 coordinate values (x, y, width, height).
1: each detection frame has 1 confidence value (0-1) for the detected object, understood as the confidence probability that an object has been detected.
80: each detection frame has 80 classification detection probability values (0-1), understood as the probability that the object in the detection frame belongs to each respective classification.
In summary, a 416×416 picture is divided evenly into 13×13 picture regions, each picture region generates 5 detection frames, and each detection frame contains 85 values (4 coordinate values + 1 object confidence value + 80 classification detection values). The resulting one-dimensional prediction array (predictions) represents the objects detected in the picture and contains 13×13×5×85 numerical values, predictions[0] to predictions[13×13×5×85−1].
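As a minimal sketch, indexing into this flattened prediction array can be written as follows; row-major flattening is an assumption, and the helper name prediction_index is illustrative:

```python
S, B, V = 13, 5, 85   # grid size, boxes per cell, values per box (4 + 1 + 80)

def prediction_index(row, col, box, value):
    """Index into the flattened 13*13*5*85 prediction array for grid cell
    (row, col), detection box `box`, and value slot `value` (0-3 coordinates,
    4 objectness confidence, 5-84 class probabilities). Row-major layout
    is assumed here."""
    return ((row * S + col) * B + box) * V + value

total = S * S * B * V
print(total)                            # 71825 values in the array
print(prediction_index(12, 12, 4, 84))  # last element: total - 1
```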
The invention provides a YOLO-based image target identification method, system, and storage medium. The method can effectively improve detection precision and reduce detection time. Experiments and verification show that the method of the invention outperforms prior-art detection methods, improving identification accuracy and increasing operation speed.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in practice, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; can be located in one place or distributed to a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be implemented by hardware related to program instructions; the foregoing program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Alternatively, the above-described integrated units of the invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes: a removable storage device, ROM, RAM, a magnetic disk, an optical disk, or other media capable of storing program code.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A YOLO-based image target recognition method, comprising:
Receiving an image to be detected;
The size of the image to be detected is adjusted according to a preset requirement, and a first detection image is generated;
the first detection image is sent to a neural network model for matching identification, and a detection frame, classification identification information and a classification probability value corresponding to the classification identification information are generated;
Arranging the classification probability values in each detection frame from large to small, and selecting the classification with highest ranking; then comparing the classification probability value of the highest-ranking classification with a preset probability threshold value, and judging whether the classification probability value of the highest-ranking classification is larger than the preset probability threshold value;
if the classification probability value of the highest-ranking classification is smaller than the preset probability threshold, deleting the detection frame; if it is larger than the preset probability threshold, taking the detection frame and the classification identification information as the identification classification result;
the step of generating the detection frame specifically comprises the following steps:
generating an initial detection frame according to initial preset coordinate points;
Performing prediction of a dynamic detection frame, performing iterative prediction on the generated detection frame, and generating a latest detection frame;
calculating the coincidence ratio of the latest detection frame;
if the latest detection frame overlap ratio is greater than or equal to a preset overlap ratio threshold value, reserving the latest detection frame; if the latest detection frame overlap ratio is smaller than a preset overlap ratio threshold value, continuing to predict the dynamic detection frame;
finally generating N similar detection frames of the same class;
the step of carrying out the prediction of the dynamic detection frame, carrying out iterative prediction on the generated detection frame, and generating the latest detection frame specifically comprises the following steps:
predicting 4 coordinates (tx, ty, tw, th) for each detection frame; if the cell is offset from the upper-left corner of the image by (cx, cy) and the prior detection frame has width pw and height ph, the coordinates of the latest detection frame are:
bx=σ(tx)+cx
by=σ(ty)+cy
bw=pw·e^tw
bh=ph·e^th
wherein bx, by, bw, bh are respectively the four coordinate values of the latest detection frame.
2. The YOLO-based image object recognition method according to claim 1, further comprising, before the receiving the image to be detected:
Training pictures to obtain a neural network model; the neural network model is trained by the following steps:
acquiring a training image dataset;
performing image preprocessing on the training image data set to obtain a preprocessed image set;
training the preprocessed image set to obtain the neural network model with the input interface and the output interface.
3. The YOLO-based image object recognition method of claim 2, wherein 53-layer convolution operations are used in the neural network model, and the convolution operations of each layer are calculated alternately for 3 x 3 and 1 x 1 convolution layers.
4. The YOLO-based image object recognition method of claim 1, wherein the size is a size specified by a neural network model.
5. A YOLO-based image target recognition system, the system comprising: the image target recognition system comprises a memory, a processor and an image pickup device, wherein the memory comprises a YOLO-based image target recognition method program, and the image target recognition method program based on the YOLO realizes the following steps when being executed by the processor:
Receiving an image to be detected;
The size of the image to be detected is adjusted according to a preset requirement, and a first detection image is generated;
the first detection image is sent to a neural network model for matching identification, and a detection frame, classification identification information and a classification probability value corresponding to the classification identification information are generated;
Arranging the classification probability values in each detection frame from large to small, and selecting the classification with highest ranking; then comparing the classification probability value of the highest-ranking classification with a preset probability threshold value, and judging whether the classification probability value of the highest-ranking classification is larger than the preset probability threshold value;
if the classification probability value of the highest-ranking classification is smaller than the preset probability threshold, deleting the detection frame; if it is larger than the preset probability threshold, taking the detection frame and the classification identification information as the identification classification result;
the step of generating the detection frame specifically comprises the following steps:
generating an initial detection frame according to initial preset coordinate points;
Performing prediction of a dynamic detection frame, performing iterative prediction on the generated detection frame, and generating a latest detection frame;
calculating the coincidence ratio of the latest detection frame;
if the latest detection frame overlap ratio is greater than or equal to a preset overlap ratio threshold value, reserving the latest detection frame; if the latest detection frame overlap ratio is smaller than a preset overlap ratio threshold value, continuing to predict the dynamic detection frame;
finally generating N similar detection frames of the same class;
the step of carrying out the prediction of the dynamic detection frame, carrying out iterative prediction on the generated detection frame, and generating the latest detection frame specifically comprises: predicting 4 coordinates (tx, ty, tw, th) for each detection frame; if the cell is offset from the upper-left corner of the image by (cx, cy) and the prior detection frame has width pw and height ph, the coordinates of the latest detection frame are:
bx=σ(tx)+cx
by=σ(ty)+cy
bw=pw·e^tw
bh=ph·e^th
wherein bx, by, bw, bh are respectively the four coordinate values of the latest detection frame.
6. A computer-readable storage medium, characterized in that a YOLO-based image object recognition method program is included in the computer-readable storage medium, which, when executed by a processor, implements the steps of a YOLO-based image object recognition method according to any one of claims 1 to 4.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910114621.5A CN109977943B (en) | 2019-02-14 | 2019-02-14 | Image target recognition method, system and storage medium based on YOLO |
PCT/CN2019/118499 WO2020164282A1 (en) | 2019-02-14 | 2019-11-14 | Yolo-based image target recognition method and apparatus, electronic device, and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109977943A CN109977943A (en) | 2019-07-05 |
CN109977943B true CN109977943B (en) | 2024-05-07 |
Families Citing this family (130)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109977943B (en) * | 2019-02-14 | 2024-05-07 | 平安科技(深圳)有限公司 | Image target recognition method, system and storage medium based on YOLO |
CN110348304A (en) * | 2019-06-06 | 2019-10-18 | 武汉理工大学 | A kind of maritime affairs distress personnel search system being equipped on unmanned plane and target identification method |
CN110738125B (en) * | 2019-09-19 | 2023-08-01 | 平安科技(深圳)有限公司 | Method, device and storage medium for selecting detection frame by Mask R-CNN |
CN111223343B (en) * | 2020-03-07 | 2022-01-28 | 上海中科教育装备集团有限公司 | Artificial intelligence scoring experimental equipment and scoring method for lever balance experiment |
CN111582021A (en) * | 2020-03-26 | 2020-08-25 | 平安科技(深圳)有限公司 | Method and device for detecting text in scene image and computer equipment |
CN111695559B (en) * | 2020-04-28 | 2023-07-18 | 深圳市跨越新科技有限公司 | YoloV3 model-based waybill picture information coding method and system |
CN113705591A (en) * | 2020-05-20 | 2021-11-26 | 上海微创卜算子医疗科技有限公司 | Readable storage medium, and support specification identification method and device |
CN111626256B (en) * | 2020-06-03 | 2023-06-27 | 兰波(苏州)智能科技有限公司 | High-precision diatom detection and identification method and system based on scanning electron microscope image |
CN111738259A (en) * | 2020-06-29 | 2020-10-02 | 广东电网有限责任公司 | Tower state detection method and device |
CN111523621B (en) * | 2020-07-03 | 2020-10-20 | 腾讯科技(深圳)有限公司 | Image recognition method and device, computer equipment and storage medium |
CN111857350A (en) * | 2020-07-28 | 2020-10-30 | 海尔优家智能科技(北京)有限公司 | Method, device and equipment for rotating display equipment |
CN112101134B (en) * | 2020-08-24 | 2024-01-02 | 深圳市商汤科技有限公司 | Object detection method and device, electronic equipment and storage medium |
CN112036286A (en) * | 2020-08-25 | 2020-12-04 | 北京华正明天信息技术股份有限公司 | Method for achieving temperature sensing and intelligently analyzing and identifying flame based on yoloV3 algorithm |
CN111986255B (en) * | 2020-09-07 | 2024-04-09 | 凌云光技术股份有限公司 | Multi-scale anchor initializing method and device of image detection model |
CN112132018A (en) * | 2020-09-22 | 2020-12-25 | 平安国际智慧城市科技股份有限公司 | Traffic police recognition method, traffic police recognition device, traffic police recognition medium and electronic equipment |
CN112116582A (en) * | 2020-09-24 | 2020-12-22 | 深圳爱莫科技有限公司 | Cigarette detection and identification method under stock or display scene |
CN112036507B (en) * | 2020-09-25 | 2023-11-14 | 北京小米松果电子有限公司 | Training method and device of image recognition model, storage medium and electronic equipment |
CN112149748B (en) * | 2020-09-28 | 2024-05-21 | 商汤集团有限公司 | Image classification method and device, electronic equipment and storage medium |
CN112183358B (en) * | 2020-09-29 | 2024-04-23 | 新石器慧通(北京)科技有限公司 | Training method and device for target detection model |
CN112132088B (en) * | 2020-09-29 | 2024-01-12 | 动联(山东)电子科技有限公司 | Inspection point missing inspection identification method |
CN112200186B (en) * | 2020-10-15 | 2024-03-15 | 上海海事大学 | Vehicle logo identification method based on improved YOLO_V3 model |
CN112231497B (en) * | 2020-10-19 | 2024-04-09 | 腾讯科技(深圳)有限公司 | Information classification method and device, storage medium and electronic equipment |
CN112348778B (en) * | 2020-10-21 | 2023-10-27 | 深圳市优必选科技股份有限公司 | Object identification method, device, terminal equipment and storage medium |
CN112288003B (en) * | 2020-10-28 | 2023-07-25 | 北京奇艺世纪科技有限公司 | Neural network training and target detection method and device |
CN112381773B (en) * | 2020-11-05 | 2023-04-18 | 东风柳州汽车有限公司 | Key cross section data analysis method, device, equipment and storage medium |
CN112365465B (en) * | 2020-11-09 | 2024-02-06 | 浙江大华技术股份有限公司 | Synthetic image category determining method and device, storage medium and electronic device |
CN112287884B (en) * | 2020-11-19 | 2024-02-20 | 长江大学 | Examination abnormal behavior detection method and device and computer readable storage medium |
CN112348112B (en) * | 2020-11-24 | 2023-12-15 | 深圳市优必选科技股份有限公司 | Training method and training device for image recognition model and terminal equipment |
CN112364807B (en) * | 2020-11-24 | 2023-12-15 | 深圳市优必选科技股份有限公司 | Image recognition method, device, terminal equipment and computer readable storage medium |
CN112560586B (en) * | 2020-11-27 | 2024-05-10 | 国家电网有限公司大数据中心 | Method and device for obtaining structural data of pole and tower signboard and electronic equipment |
CN112634202A (en) * | 2020-12-04 | 2021-04-09 | 浙江省农业科学院 | Method, device and system for detecting behavior of polyculture fish shoal based on YOLOv3-Lite |
CN112508915A (en) * | 2020-12-11 | 2021-03-16 | 中信银行股份有限公司 | Target detection result optimization method and system |
CN112215308B (en) * | 2020-12-13 | 2021-03-30 | 之江实验室 | Single-order detection method and device for hoisted object, electronic equipment and storage medium |
CN112507896B (en) * | 2020-12-14 | 2023-11-07 | 大连大学 | Method for detecting cherry fruits by adopting improved YOLO-V4 model |
CN113723157B (en) * | 2020-12-15 | 2024-02-09 | 京东科技控股股份有限公司 | Crop disease identification method and device, electronic equipment and storage medium |
CN112613097A (en) * | 2020-12-15 | 2021-04-06 | 中铁二十四局集团江苏工程有限公司 | BIM rapid modeling method based on computer vision |
CN112507912A (en) * | 2020-12-15 | 2021-03-16 | 网易(杭州)网络有限公司 | Method and device for identifying illegal picture |
CN112633352B (en) * | 2020-12-18 | 2023-08-29 | 浙江大华技术股份有限公司 | Target detection method and device, electronic equipment and storage medium |
CN112634327A (en) * | 2020-12-21 | 2021-04-09 | 合肥讯图信息科技有限公司 | Tracking method based on YOLOv4 model |
CN112633159B (en) * | 2020-12-22 | 2024-04-12 | 北京迈格威科技有限公司 | Human-object interaction relation identification method, model training method and corresponding device |
CN112580523A (en) * | 2020-12-22 | 2021-03-30 | 平安国际智慧城市科技股份有限公司 | Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium |
CN112699925A (en) * | 2020-12-23 | 2021-04-23 | 国网安徽省电力有限公司检修分公司 | Transformer substation meter image classification method |
CN112633286B (en) * | 2020-12-25 | 2022-09-09 | 北京航星机器制造有限公司 | Intelligent security inspection system based on similarity rate and recognition probability of dangerous goods |
CN112541483B (en) * | 2020-12-25 | 2024-05-17 | 深圳市富浩鹏电子有限公司 | Dense face detection method combining YOLO and blocking-fusion strategy |
CN112580734B (en) * | 2020-12-25 | 2023-12-29 | 深圳市优必选科技股份有限公司 | Target detection model training method, system, terminal equipment and storage medium |
CN112597915B (en) * | 2020-12-26 | 2024-04-09 | 上海有个机器人有限公司 | Method, device, medium and robot for identifying indoor close-distance pedestrians |
CN112613570A (en) * | 2020-12-29 | 2021-04-06 | 深圳云天励飞技术股份有限公司 | Image detection method, image detection device, equipment and storage medium |
CN112784694A (en) * | 2020-12-31 | 2021-05-11 | 杭州电子科技大学 | EVP-YOLO-based indoor article detection method |
CN112560799B (en) * | 2021-01-05 | 2022-08-05 | 北京航空航天大学 | Unmanned aerial vehicle intelligent vehicle target detection method based on adaptive target area search and game and application |
CN112733741A (en) * | 2021-01-14 | 2021-04-30 | 苏州挚途科技有限公司 | Traffic signboard identification method and device and electronic equipment |
CN112818980A (en) * | 2021-01-15 | 2021-05-18 | 湖南千盟物联信息技术有限公司 | Steel ladle number detection and identification method based on Yolov3 algorithm |
CN112766170B (en) * | 2021-01-21 | 2024-04-16 | 广西财经学院 | Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image |
CN112906478B (en) * | 2021-01-22 | 2024-01-09 | 北京百度网讯科技有限公司 | Target object identification method, device, equipment and storage medium |
CN112906495B (en) * | 2021-01-27 | 2024-04-30 | 深圳安智杰科技有限公司 | Target detection method and device, electronic equipment and storage medium |
CN112800971A (en) * | 2021-01-29 | 2021-05-14 | 深圳市商汤科技有限公司 | Neural network training and point cloud data processing method, device, equipment and medium |
CN114821288A (en) * | 2021-01-29 | 2022-07-29 | 中强光电股份有限公司 | Image identification method and unmanned aerial vehicle system |
CN112911171B (en) * | 2021-02-04 | 2022-04-22 | 上海航天控制技术研究所 | Intelligent photoelectric information processing system and method based on accelerated processing |
CN112861711A (en) * | 2021-02-05 | 2021-05-28 | 深圳市安软科技股份有限公司 | Regional intrusion detection method and device, electronic equipment and storage medium |
CN112861716A (en) * | 2021-02-05 | 2021-05-28 | 深圳市安软科技股份有限公司 | Illegal article placement monitoring method, system, equipment and storage medium |
CN112906794A (en) * | 2021-02-22 | 2021-06-04 | 珠海格力电器股份有限公司 | Target detection method, device, storage medium and terminal |
CN113095133B (en) * | 2021-03-04 | 2023-12-29 | 北京迈格威科技有限公司 | Model training method, target detection method and corresponding devices |
CN112906621A (en) * | 2021-03-10 | 2021-06-04 | 北京华捷艾米科技有限公司 | Hand detection method, device, storage medium and equipment |
CN112966618B (en) * | 2021-03-11 | 2024-02-09 | 京东科技信息技术有限公司 | Dressing recognition method, apparatus, device and computer readable medium |
CN113011319B (en) * | 2021-03-16 | 2024-04-16 | 上海应用技术大学 | Multi-scale fire target identification method and system |
CN112966762B (en) * | 2021-03-16 | 2023-12-26 | 南京恩博科技有限公司 | Wild animal detection method and device, storage medium and electronic equipment |
CN112991304A (en) * | 2021-03-23 | 2021-06-18 | 武汉大学 | Molten pool sputtering detection method based on laser directional energy deposition monitoring system |
CN113033398B (en) * | 2021-03-25 | 2022-02-11 | 深圳市康冠商用科技有限公司 | Gesture recognition method and device, computer equipment and storage medium |
CN112965604A (en) * | 2021-03-29 | 2021-06-15 | 深圳市优必选科技股份有限公司 | Gesture recognition method and device, terminal equipment and computer readable storage medium |
CN112990334A (en) * | 2021-03-29 | 2021-06-18 | 西安电子科技大学 | Small sample SAR image target identification method based on improved prototype network |
CN113222889B (en) * | 2021-03-30 | 2024-03-12 | 大连智慧渔业科技有限公司 | Industrial aquaculture counting method and device for aquaculture under high-resolution image |
CN113052127A (en) * | 2021-04-09 | 2021-06-29 | 上海云从企业发展有限公司 | Behavior detection method, behavior detection system, computer equipment and machine readable medium |
CN113139597B (en) * | 2021-04-19 | 2022-11-04 | 中国人民解放军91054部队 | Statistical thought-based image distribution external detection method |
CN113158922A (en) * | 2021-04-26 | 2021-07-23 | 平安科技(深圳)有限公司 | Traffic flow statistical method, device and equipment based on YOLO neural network |
CN113128522B (en) * | 2021-05-11 | 2024-04-05 | 四川云从天府人工智能科技有限公司 | Target identification method, device, computer equipment and storage medium |
CN113240638B (en) * | 2021-05-12 | 2023-11-10 | 上海联影智能医疗科技有限公司 | Target detection method, device and medium based on deep learning |
CN113205067B (en) * | 2021-05-26 | 2024-04-09 | 北京京东乾石科技有限公司 | Method and device for monitoring operators, electronic equipment and storage medium |
WO2022252089A1 (en) * | 2021-05-31 | 2022-12-08 | 京东方科技集团股份有限公司 | Training method for object detection model, and object detection method and device |
CN113435260A (en) * | 2021-06-07 | 2021-09-24 | 上海商汤智能科技有限公司 | Image detection method, related training method, related device, equipment and medium |
CN113392833A (en) * | 2021-06-10 | 2021-09-14 | 沈阳派得林科技有限责任公司 | Method for identifying type number of industrial radiographic negative image |
CN113269188B (en) * | 2021-06-17 | 2023-03-14 | 华南农业大学 | Mark point and pixel coordinate detection method thereof |
CN113486746A (en) * | 2021-06-25 | 2021-10-08 | 海南电网有限责任公司三亚供电局 | Power cable external damage prevention method based on biological induction and video monitoring |
CN113536963B (en) * | 2021-06-25 | 2023-08-15 | 西安电子科技大学 | SAR image airplane target detection method based on lightweight YOLO network |
CN113377888B (en) * | 2021-06-25 | 2024-04-02 | 北京百度网讯科技有限公司 | Method for training object detection model and detection object |
CN113591566A (en) * | 2021-06-28 | 2021-11-02 | 北京百度网讯科技有限公司 | Training method and device of image recognition model, electronic equipment and storage medium |
CN113553948A (en) * | 2021-07-23 | 2021-10-26 | 中远海运科技(北京)有限公司 | Automatic recognition and counting method for tobacco insects and computer readable medium |
CN113486857B (en) * | 2021-08-03 | 2023-05-12 | 云南大学 | YOLOv4-based ascending safety detection method and system |
CN113723217A (en) * | 2021-08-09 | 2021-11-30 | 南京邮电大学 | Intelligent object detection method and system based on improved YOLO |
CN113705643B (en) * | 2021-08-17 | 2022-10-28 | 荣耀终端有限公司 | Target detection method and device and electronic equipment |
CN113657280A (en) * | 2021-08-18 | 2021-11-16 | 广东电网有限责任公司 | Power transmission line target defect detection warning method and system |
CN113948190A (en) * | 2021-09-02 | 2022-01-18 | 上海健康医学院 | Method and equipment for automatically identifying X-ray skull positive position film cephalogram measurement mark points |
CN113723406B (en) * | 2021-09-03 | 2023-07-18 | 乐普(北京)医疗器械股份有限公司 | Method and device for processing support positioning of coronary angiography image |
CN114119455B (en) * | 2021-09-03 | 2024-04-09 | 乐普(北京)医疗器械股份有限公司 | Method and device for positioning vascular stenosis part based on target detection network |
CN113743339B (en) * | 2021-09-09 | 2023-10-03 | 三峡大学 | Indoor falling detection method and system based on scene recognition |
CN113792656B (en) * | 2021-09-15 | 2023-07-18 | 山东大学 | Behavior detection and alarm system using mobile communication equipment in personnel movement |
CN114022705B (en) * | 2021-10-29 | 2023-08-04 | 电子科技大学 | Self-adaptive target detection method based on scene complexity pre-classification |
CN114022554B (en) * | 2021-11-03 | 2023-02-03 | 北华航天工业学院 | Massage robot acupoint detection and positioning method based on YOLO |
CN114120358B (en) * | 2021-11-11 | 2024-04-26 | 国网江苏省电力有限公司技能培训中心 | Super-pixel-guided deep learning-based personnel head-mounted safety helmet recognition method |
CN114255389A (en) * | 2021-11-15 | 2022-03-29 | 浙江时空道宇科技有限公司 | Target object detection method, device, equipment and storage medium |
CN113989939B (en) * | 2021-11-16 | 2024-05-14 | 河北工业大学 | Small target pedestrian detection system based on improved YOLO algorithm |
CN114373075A (en) * | 2021-12-31 | 2022-04-19 | 西安电子科技大学广州研究院 | Target component detection data set construction method, detection method, device and equipment |
US11756288B2 (en) * | 2022-01-05 | 2023-09-12 | Baidu Usa Llc | Image processing method and apparatus, electronic device and storage medium |
CN114565848B (en) * | 2022-02-25 | 2022-12-02 | 佛山读图科技有限公司 | Liquid medicine level detection method and system in complex scene |
CN114662594B (en) * | 2022-03-25 | 2022-10-04 | 浙江省通信产业服务有限公司 | Target feature recognition analysis system |
CN114742204A (en) * | 2022-04-08 | 2022-07-12 | 黑龙江惠达科技发展有限公司 | Method and device for detecting straw coverage rate |
CN114782778B (en) * | 2022-04-25 | 2023-01-06 | 广东工业大学 | Assembly state monitoring method and system based on machine vision technology |
CN114842315B (en) * | 2022-05-07 | 2024-02-02 | 无锡雪浪数制科技有限公司 | Looseness-prevention identification method and device for lightweight high-speed railway hub gasket |
CN114881763B (en) * | 2022-05-18 | 2023-05-26 | 中国工商银行股份有限公司 | Post-loan supervision method, device, equipment and medium for aquaculture |
CN115029209A (en) * | 2022-06-17 | 2022-09-09 | 杭州天杭空气质量检测有限公司 | Colony image acquisition processing device and processing method thereof |
CN114972891B (en) * | 2022-07-07 | 2024-05-03 | 智云数创(洛阳)数字科技有限公司 | Automatic identification method for CAD (computer aided design) component and BIM (building information modeling) method |
CN115082661B (en) * | 2022-07-11 | 2024-05-10 | 阿斯曼尔科技(上海)有限公司 | Sensor assembly difficulty reducing method |
CN115187982B (en) * | 2022-07-12 | 2023-05-23 | 河北华清环境科技集团股份有限公司 | Algae detection method and device and terminal equipment |
CN115909358B (en) * | 2022-07-27 | 2024-02-13 | 广州市玄武无线科技股份有限公司 | Commodity specification identification method, commodity specification identification device, terminal equipment and computer storage medium |
CN115346170B (en) * | 2022-08-11 | 2023-05-30 | 北京市燃气集团有限责任公司 | Intelligent monitoring method and device for gas facility area |
CN115346172B (en) * | 2022-08-16 | 2023-04-21 | 哈尔滨市科佳通用机电股份有限公司 | Method and system for detecting lost and broken hook lifting rod reset spring |
CN115297263B (en) * | 2022-08-24 | 2023-04-07 | 广州方图科技有限公司 | Automatic photographing control method and system suitable for cube shooting |
CN115690565B (en) * | 2022-09-28 | 2024-02-20 | 大连海洋大学 | Method for detecting cultivated takifugu rubripes target by fusing knowledge and improving YOLOv5 |
CN115546566A (en) * | 2022-11-24 | 2022-12-30 | 杭州心识宇宙科技有限公司 | Intelligent body interaction method, device, equipment and storage medium based on article identification |
CN116051985B (en) * | 2022-12-20 | 2023-06-23 | 中国科学院空天信息创新研究院 | Semi-supervised remote sensing target detection method based on multi-model mutual feedback learning |
CN115690570B (en) * | 2023-01-05 | 2023-03-28 | 中国水产科学研究院黄海水产研究所 | Fish shoal feeding intensity prediction method based on ST-GCN |
CN116452858B (en) * | 2023-03-24 | 2023-12-15 | 哈尔滨市科佳通用机电股份有限公司 | Rail wagon connecting pull rod round pin breaking fault identification method and system |
CN116403163B (en) * | 2023-04-20 | 2023-10-27 | 慧铁科技有限公司 | Method and device for identifying opening and closing states of handles of cut-off plug doors |
CN116342316A (en) * | 2023-05-31 | 2023-06-27 | 青岛希尔信息科技有限公司 | Accounting and project financial management system and method |
CN116681687A (en) * | 2023-06-20 | 2023-09-01 | 广东电网有限责任公司广州供电局 | Wire detection method and device based on computer vision and computer equipment |
CN116758547B (en) * | 2023-06-27 | 2024-03-12 | 北京中超伟业信息安全技术股份有限公司 | Paper medium carbonization method, system and storage medium |
CN117201834A (en) * | 2023-09-11 | 2023-12-08 | 南京天创电子技术有限公司 | Real-time double-spectrum fusion video stream display method and system based on target detection |
CN116916166B (en) * | 2023-09-12 | 2023-11-17 | 湖南湘银河传感科技有限公司 | Telemetry terminal based on AI image analysis |
CN116935232A (en) * | 2023-09-15 | 2023-10-24 | 青岛国测海遥信息技术有限公司 | Remote sensing image processing method and device for offshore wind power equipment, equipment and medium |
CN117671597A (en) * | 2023-12-25 | 2024-03-08 | 北京大学长沙计算与数字经济研究院 | Method for constructing mouse detection model and mouse detection method and device |
CN117523318B (en) * | 2023-12-26 | 2024-04-16 | 宁波微科光电股份有限公司 | Anti-light interference subway shielding door foreign matter detection method, device and medium |
CN117893895A (en) * | 2024-03-15 | 2024-04-16 | 山东省海洋资源与环境研究院(山东省海洋环境监测中心、山东省水产品质量检验中心) | Method, system, equipment and storage medium for identifying Portunus trituberculatus |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107247956A (en) * | 2016-10-09 | 2017-10-13 | 成都快眼科技有限公司 | Fast target detection method based on grid judgment |
CN107423760A (en) * | 2017-07-21 | 2017-12-01 | 西安电子科技大学 | Deep learning object detection method based on pre-segmentation and regression |
CN108154098A (en) * | 2017-12-20 | 2018-06-12 | 歌尔股份有限公司 | Target recognition method and device for a robot, and robot |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107527009B (en) * | 2017-07-11 | 2020-09-04 | 浙江汉凡软件科技有限公司 | Abandoned-object detection method based on YOLO target detection |
CN109117794A (en) * | 2018-08-16 | 2019-01-01 | 广东工业大学 | Moving target behavior tracking method, apparatus, device and readable storage medium |
CN109977943B (en) * | 2019-02-14 | 2024-05-07 | 平安科技(深圳)有限公司 | Image target recognition method, system and storage medium based on YOLO |
- 2019-02-14 CN CN201910114621.5A patent/CN109977943B/en active Active
- 2019-11-14 WO PCT/CN2019/118499 patent/WO2020164282A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN109977943A (en) | 2019-07-05 |
WO2020164282A1 (en) | 2020-08-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977943B (en) | Image target recognition method, system and storage medium based on YOLO | |
CN109918969B (en) | Face detection method and device, computer device and computer readable storage medium | |
EP3493101B1 (en) | Image recognition method, terminal, and nonvolatile storage medium | |
CN108470172B (en) | Text information identification method and device | |
CN111079674B (en) | Target detection method based on global and local information fusion | |
CN111814902A (en) | Target detection model training method, target identification method, device and medium | |
WO2018052586A1 (en) | Method and system for multi-scale cell image segmentation using multiple parallel convolutional neural networks | |
CN110991311A (en) | Target detection method based on dense connection deep network | |
Wang et al. | Fast and robust object detection using asymmetric totally corrective boosting | |
CN111368636B (en) | Object classification method, device, computer equipment and storage medium | |
CN110766017B (en) | Mobile terminal text recognition method and system based on deep learning | |
CN109993221B (en) | Image classification method and device | |
CN109934216B (en) | Image processing method, device and computer readable storage medium | |
CN112508094A (en) | Junk picture identification method, device and equipment | |
CN110008899B (en) | Method for extracting and classifying candidate targets of visible light remote sensing image | |
CN111444976A (en) | Target detection method and device, electronic equipment and readable storage medium | |
CN111724342A (en) | Method for detecting thyroid nodule in ultrasonic image | |
CN115239644B (en) | Concrete defect identification method, device, computer equipment and storage medium | |
CN114187311A (en) | Image semantic segmentation method, device, equipment and storage medium | |
CN111461145A (en) | Method for detecting target based on convolutional neural network | |
CN111696080A (en) | Face fraud detection method, system and storage medium based on static texture | |
CN111414910B (en) | Small target enhancement detection method and device based on double convolution neural network | |
CN111597875A (en) | Traffic sign identification method, device, equipment and storage medium | |
CN116152226A (en) | Method for detecting defects of image on inner side of commutator based on fusible feature pyramid | |
CN112926595B (en) | Training device of deep learning neural network model, target detection system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||