WO2020164282A1 - Yolo-based image target recognition method and apparatus, electronic device, and storage medium - Google Patents

Yolo-based image target recognition method and apparatus, electronic device, and storage medium Download PDF

Info

Publication number
WO2020164282A1
WO2020164282A1 · PCT/CN2019/118499 · CN2019118499W
Authority
WO
WIPO (PCT)
Prior art keywords
detection frame
image
classification
yolo
preset
Prior art date
Application number
PCT/CN2019/118499
Other languages
French (fr)
Chinese (zh)
Inventor
Zhao Feng (赵峰)
Wang Jianzong (王健宗)
Xiao Jing (肖京)
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2020164282A1 publication Critical patent/WO2020164282A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Definitions

  • This application relates to the fields of computer learning and image recognition, and more specifically to a YOLO-based image target recognition method, apparatus, electronic device, and storage medium.
  • YOLO treats object detection as a regression problem: a single end-to-end network maps the original input image directly to the output of object positions and categories.
  • The core idea of YOLO is to take the entire image as the network input and regress the positions of the bounding boxes, and the categories to which they belong, directly in the output layer. The inventors realized that, given YOLO's high-speed operation, designing a method that improves YOLO's accuracy is a problem that urgently needs to be solved.
  • To this end, this application proposes a YOLO-based image target recognition method, apparatus, electronic device, and storage medium.
  • The technical solution of the present application provides a YOLO-based image target recognition method, including:
  • using the detection frame and the classification identification information as the recognized classification result.
  • The technical solution of the present application also proposes a YOLO-based image target recognition apparatus, which includes: an input module that receives the image to be detected;
  • an adjustment module that adjusts the size of the image to be detected received by the input module according to preset requirements and generates the first detection image;
  • a matching recognition module that sends the first detection image generated by the adjustment module to the neural network model for matching recognition and generates a detection frame, classification identification information, and classification probability values corresponding to the classification identification information;
  • a judging module that judges whether the classification probability value is greater than a preset classification probability threshold and, if not, sends a signal to the matching recognition module, or, if so, sends a signal to the classification module;
  • a classification module that uses the detection frame and the classification identification information as the recognized classification result.
  • The technical solution of the present application also proposes an electronic device, including a memory, a processor, and a camera device.
  • The memory stores a YOLO-based image target recognition program.
  • When the YOLO-based image target recognition program is executed by the processor, the steps of the above-mentioned YOLO-based image target recognition method are realized.
  • A fourth aspect of the present application also provides a computer non-volatile readable storage medium, which includes a YOLO-based image target recognition program; when the program is executed by a processor, the steps of the above-mentioned YOLO-based image target recognition method are realized.
  • This application proposes a YOLO-based image target recognition method, apparatus, system, and storage medium.
  • The method judges the classification and recognition probability, and only uses the recognition information as the recognition result when the preset classification probability threshold is reached, which improves the accuracy of image recognition and the recognition experience.
  • The present application can also adjust the position of the detection frame in real time, which effectively improves detection efficiency and accuracy, and reduces detection time by optimizing the detection calculation. Experiments and verification show that the method of this application outperforms prior-art detection methods, mainly in improved recognition accuracy and increased computing speed.
  • Figure 1 is a flow chart of the YOLO-based image target recognition method of this application;
  • Figure 2 is a schematic diagram of the convolution operation in the classification process of this application;
  • Figure 3 is a block diagram of an electronic device of the present application;
  • Fig. 4 shows a schematic diagram of a specific embodiment of the present application.
  • Fig. 1 is a flow chart of an image target recognition method based on YOLO in this application.
  • the technical solution of the present application provides a YOLO-based image target recognition method, including:
  • S104: Adjust the size of the image to be detected according to a preset requirement to generate a first detection image;
  • S106: Send the first detection image to a neural network model for matching recognition, and generate a detection frame, classification identification information, and classification probability values corresponding to the classification identification information;
  • The size is the size specified by the neural network model.
  • The selected size will generally be smaller than the size of the image to be detected, which ensures the speed of calculation processing and allows fast class recognition.
  • Typically, 448*448 or 416*416 is selected.
  • The size in this step can be set according to actual needs; it is not limited to the above-mentioned sizes, which do not limit the scope of protection of this application.
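The resize step above can be sketched as follows. This is only an illustrative nearest-neighbour resize on a nested-list image, written in plain Python so it needs no imaging library; the function name and the use of a 600x800 stand-in image are assumptions, not from the patent.

```python
def resize_to_network_input(image, size=416):
    """Nearest-neighbour resize of a nested-list H x W image to size x size.

    Hypothetical helper illustrating step S104; real systems would use an
    imaging library and often letterboxing to preserve aspect ratio.
    """
    h, w = len(image), len(image[0])
    # Map each output pixel back to its nearest source pixel.
    return [[image[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]

# Stand-in "image to be detected": 600x800, one RGB tuple per pixel.
original = [[(0, 0, 0)] * 800 for _ in range(600)]
first_detection_image = resize_to_network_input(original, 416)
print(len(first_detection_image), len(first_detection_image[0]))  # 416 416
```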
  • The first detection image is sent to a neural network model to generate a detection frame, classification identification information, and classification probability values corresponding to the classification identification information.
  • For example, when the classification probability threshold is set to 90% and a picture containing a kitten is detected, if the probability of identifying the kitten in the detection frame exceeds 90%, a kitten has been circled in the detection frame and the cat in the picture has been identified.
  • If the classification probability value is less than the preset classification probability threshold, the process returns to step S106 for re-identification until the classification probability value is greater than the preset classification probability threshold.
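The S106-S108 decision loop can be sketched as below. `match_recognize` is a hypothetical stand-in for the neural network model (its rising confidence is fabricated for illustration only); the loop structure, i.e. re-identifying until the threshold is cleared, is what the text describes.

```python
def match_recognize(image, attempt):
    # Stand-in for S106: pretend recognition confidence improves per attempt.
    box = (10, 20, 100, 80)
    return box, "cat", min(0.5 + 0.25 * attempt, 0.99)

def recognize(image, threshold=0.90, max_attempts=10):
    for attempt in range(max_attempts):
        box, label, prob = match_recognize(image, attempt)
        if prob > threshold:       # S108: compare with the preset threshold
            return box, label      # S110: frame + label are the result
    return None                    # gave up (cap added so the sketch halts)

result = recognize(None)
print(result)  # ((10, 20, 100, 80), 'cat')
```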
  • The neural network model performs multi-layer convolution operations on the image.
  • The described YOLO convolution operation is a conventional operation in the field and belongs to the prior art, so it is not repeated in this application.
  • After step S106, that is, after the step of generating the detection frame, the classification identification information, and the classification probability values corresponding to the classification identification information, the method includes:
  • calculating the coincidence degree of the remaining detection frames of the same type and retaining the detection frame with the highest coincidence degree.
  • Before receiving the image to be detected in step S102, the method further includes:
  • training the neural network model through the following steps:
  • training the preprocessed image set to obtain a neural network model with an input interface and an output interface.
  • The step of obtaining the training image data set includes:
  • selecting from the picture library a set number of positive samples and negative samples of each tag in the total set of identification tags to form the training set and the validation set, where a positive sample of a label is a picture containing the object corresponding to the label, and a negative sample of a label is a picture that does not contain the object corresponding to the label.
  • The training set is the image data of the positive samples and negative samples.
  • The validation set is the label sequences of the positive samples and negative samples.
  • The output of the neural network model is the predicted label sequence of the samples in the training set.
  • Illustratively, the training image data set has 1,000 object categories and 1.2 million training images.
  • The preprocessing includes one or more of rotation, contrast enhancement, tilt, and scaling.
  • The preprocessing distorts the image to a certain extent.
  • Training on the distorted images can increase the accuracy of the final image recognition.
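Three of the preprocessing operations named above can be sketched with plain-Python array transforms. This is a minimal illustration on a tiny grayscale image, not the patent's pipeline; real systems would use an imaging library, and the contrast factor 1.5 is an assumed value.

```python
def rotate90(img):
    # Rotate a nested-list image 90 degrees (transpose, then reverse rows).
    return [list(row) for row in zip(*img)][::-1]

def enhance_contrast(img, factor=1.5):
    # Stretch pixel values away from the mean, clamped to [0, 255].
    flat = [p for row in img for p in row]
    mean = sum(flat) / len(flat)
    return [[max(0, min(255, round((p - mean) * factor + mean))) for p in row]
            for row in img]

def scale_half(img):
    # Downscale by dropping every other row and column.
    return [row[::2] for row in img[::2]]

img = [[100, 120], [80, 100]]  # tiny stand-in grayscale image
print(rotate90(img))           # [[120, 100], [100, 80]]
print(enhance_contrast(img))   # [[100, 130], [70, 100]]
print(scale_half(img))         # [[100]]
```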
  • The step of generating the detection frame is specifically as follows:
  • predicting the dynamic detection frame, performing iterative prediction on the generated detection frame, and generating the latest detection frame;
  • if the coincidence degree of the latest detection frame is greater than or equal to the preset coincidence degree threshold, keeping the latest detection frame; if it is less than the preset coincidence degree threshold, continuing to predict the dynamic detection frame;
  • The initial preset coordinate point is the coordinate point of the preset detection frame, which may be automatically generated during training and recognition detection, or may be set by a person skilled in the art according to actual needs.
  • The prediction of the dynamic detection frame, i.e. the iterative prediction that generates the latest detection frame, is specifically as follows:
  • The network predicts 4 coordinates (t_x, t_y, t_w, t_h) for each detection frame. With the cell offset from the upper left corner of the image by (c_x, c_y), and the prior detection frame width p_w and height p_h, the coordinates of the latest detection frame are: b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w·e^(t_w), b_h = p_h·e^(t_h), where σ is the logistic sigmoid.
  • b_x, b_y, b_w, and b_h are the four coordinate values of the latest detection frame. It should be noted that the detection frame is a quadrilateral, and the position of the quadrilateral detection frame can be determined by these 4 values.
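This box decoding can be written out directly. The sketch below assumes the standard YOLOv3-style formulas (sigmoid on the center offsets, exponential on the width/height against the prior p_w, p_h); the function name and the sample numbers are illustrative.

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw network outputs (t_x, t_y, t_w, t_h) into box coordinates,
    given the cell offset (c_x, c_y) and prior box size (p_w, p_h)."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx          # b_x = sigma(t_x) + c_x
    by = sigmoid(ty) + cy          # b_y = sigma(t_y) + c_y
    bw = pw * math.exp(tw)         # b_w = p_w * e^(t_w)
    bh = ph * math.exp(th)         # b_h = p_h * e^(t_h)
    return bx, by, bw, bh

# With all raw outputs at 0, sigmoid(0) = 0.5 and e^0 = 1, so the box sits
# half a cell past (c_x, c_y) with exactly the prior's width and height.
bx, by, bw, bh = decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, pw=2.0, ph=5.0)
print(bx, by, bw, bh)  # 3.5 4.5 2.0 5.0
```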
  • Each box uses multi-label classification to predict the classes that the bounding box may contain.
  • This application uses the binary cross-entropy loss technique for class prediction.
  • The main reason for using binary cross-entropy loss for category prediction is that the applicant found that the softmax technique is not required for good performance; independent logistic classifiers suffice, so this step does not need to use softmax.
  • The binary cross-entropy loss technique provides more help here.
  • Binary cross-entropy loss is a common technique in the field; those skilled in the art can implement it according to requirements, so this application does not describe it further.
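A minimal sketch of binary cross-entropy over independent per-class logistic outputs, which is the loss named above. Averaging over classes and the epsilon clamp are common conventions assumed here, not specified by the patent.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over independent class probabilities.

    y_true: 0/1 multi-label targets; y_pred: per-class logistic outputs.
    Each class is scored independently -- no softmax across classes.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

loss = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
print(round(loss, 4))  # ~0.1446: all three predictions are confidently right
```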
  • The convolution operation of each layer is calculated by alternating 3×3 and 1×1 convolution layers.
  • The applicant has found through a limited number of actual tests that alternating the above-mentioned convolution layers can increase accuracy and effectively increase operation speed.
  • The alternating calculation of the convolution layers is specifically as follows: first a 3×3 convolution operation is used, then a 1×1 convolution operation, and the operations alternate in turn until all the convolution layers have participated in the operation.
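The alternation can be expressed as a simple layer plan. This generates only the kernel-size sequence described above; channel counts, strides, and residual connections of the real network are deliberately omitted, so this is a structural illustration, not a network definition.

```python
def alternating_kernel_sizes(num_layers):
    """Kernel size per layer: 3x3 on even indices, 1x1 on odd indices,
    matching the 'first 3x3, then 1x1, alternating' rule in the text."""
    return [3 if i % 2 == 0 else 1 for i in range(num_layers)]

plan = alternating_kernel_sizes(6)
print(plan)  # [3, 1, 3, 1, 3, 1]
```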
  • the size is the size specified by the neural network model.
  • The classification probability values in the detection frames are calculated, and the optimal N detection frames of the same type are selected. It should be noted that the size of the detection frame is dynamically predicted, by the process described above.
  • The probability threshold is used to filter the M classification probability values of all detection frames, according to the following screening rules:
  • Calculate the classification probability value of each detection frame, arrange the values from largest to smallest, and select the highest-ranked category. This step can be regarded as the first round of screening.
  • The M categories of each detection frame are evaluated first, and the champion category with the highest probability value is selected.
  • The highest-ranked category is compared with a preset probability threshold. If it is greater than or equal to the preset probability threshold, the detection frame is retained; if it is less than the preset probability threshold, the detection frame is deleted. In this second round of screening, the champion classification is compared with the probability threshold, and a detection frame whose value is greater than the probability threshold is eligible to enter the final round.
  • Illustratively, the probability threshold can be set to 0.24 (24%). After comparison, the retained detection frames are displayed on the picture: any detection frame whose classification probability value is greater than or equal to the 0.24 threshold can be displayed.
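The two rounds of screening can be sketched as below: round one picks each frame's "champion" class, round two compares it with the 0.24 threshold used in the example. Data structures (a dict of class probabilities per box) are an assumed representation.

```python
def screen_boxes(boxes, threshold=0.24):
    """boxes: list of (box, {class_name: probability}) pairs.
    Returns (box, champion_class, probability) for surviving frames."""
    survivors = []
    for box, class_probs in boxes:
        # Round 1: rank the M class probabilities, take the champion.
        best_class = max(class_probs, key=class_probs.get)
        best_prob = class_probs[best_class]
        # Round 2: keep the frame only if the champion clears the threshold.
        if best_prob >= threshold:
            survivors.append((box, best_class, best_prob))
    return survivors

candidates = [
    ((0, 0, 10, 10), {"cat": 0.6, "dog": 0.3}),
    ((5, 5, 12, 12), {"cat": 0.2, "dog": 0.1}),  # champion 0.2 < 0.24: dropped
]
print(screen_boxes(candidates))  # [((0, 0, 10, 10), 'cat', 0.6)]
```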
  • The overlap degree calculation is performed on the N detection frames of the same type, and the detection frame with the highest overlap degree is retained.
  • The coincidence degree (IoU) is calculated in pairs; if the calculated value satisfies IoU > 0.3, the detection frame with the lower probability is eliminated.
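The pairwise IoU elimination is essentially non-maximum suppression, sketched here for one class with (x1, y1, x2, y2) corner boxes; the box format and the greedy keep-highest-probability-first order are standard conventions assumed for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def suppress(detections, thresh=0.3):
    """Greedy suppression: keep a box unless it overlaps an already-kept,
    higher-probability box of the same class with IoU > thresh."""
    kept = []
    for box, prob in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k) <= thresh for k, _ in kept):
            kept.append((box, prob))
    return kept

dets = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.7)]
# The second box overlaps the first with IoU ~0.68 > 0.3, so it is eliminated.
print(suppress(dets))  # [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.7)]
```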
  • Figure 2 shows a schematic diagram of the convolution operation in the classification process of this application.
  • The neural network model adopts 53 layers of convolution operations, and the convolution operation of each layer alternates 3×3 and 1×1 convolution layers.
  • This feature extraction method achieves the highest measured floating-point operations per second, which means the neural network structure can make better use of the machine's GPU, improving evaluation efficiency and thus increasing speed. Because ResNets have too many layers and are not efficient, the convolution operation described in this application achieves higher efficiency and higher accuracy.
  • Each neural network is trained with the same settings and tested with single-crop accuracy at 256×256.
  • The performance of a classifier using the feature extraction of the present application is comparable to the most advanced prior-art classifiers, but with fewer floating-point operations and faster speed.
  • FIG. 3 shows a block diagram of the application of the above-mentioned YOLO-based image target recognition method to an electronic device.
  • The technical solution of the present application also proposes an electronic device 2, which includes a memory 201, a processor 202, and a camera 203.
  • The memory 201 stores a YOLO-based image target recognition program.
  • When the YOLO-based image target recognition program is executed by the processor, the following steps are implemented:
  • using the detection frame and the classification identification information as the recognized classification result.
  • The first detection image is sent to a neural network model to generate a detection frame, classification identification information, and classification probability values corresponding to the classification identification information.
  • For example, when the classification probability threshold is set to 90% and a picture containing a kitten is detected, if the probability of identifying the kitten in the detection frame exceeds 90%, a kitten has been circled in the detection frame and the cat in the picture has been identified.
  • If the classification probability value is less than the preset classification probability threshold, the process returns to step S106 for re-identification until the classification probability value is greater than the preset classification probability threshold.
  • The neural network model performs multi-layer convolution operations on the image.
  • The described YOLO convolution operation is a conventional operation in the field and belongs to the prior art, so it is not repeated in this application.
  • Before receiving the image to be detected, the method further includes:
  • training the neural network model through the following steps:
  • training the preprocessed image set to obtain a neural network model with an input interface and an output interface.
  • The step of generating the detection frame is specifically as follows:
  • if the coincidence degree of the latest detection frame is greater than or equal to the preset coincidence degree threshold, keeping the latest detection frame; if it is less than the preset coincidence degree threshold, continuing to predict the dynamic detection frame;
  • The prediction of the dynamic detection frame, i.e. the iterative prediction that generates the latest detection frame, is specifically as follows:
  • Dimensional clustering can be used to obtain anchor frames for dynamically predicting the detection frame; the detection frame is also a bounding box.
  • The network predicts 4 coordinates t_x, t_y, t_w, and t_h for each detection frame. With the cell offset from the upper left corner of the image by (c_x, c_y), and the prior detection frame width p_w and height p_h, the coordinates of the latest detection frame are: b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w·e^(t_w), b_h = p_h·e^(t_h), where b_x, b_y, b_w, and b_h are the four coordinate values of the latest detection frame.
  • It should be noted that the detection frame is a quadrilateral, and the position of the quadrilateral detection frame can be determined by these 4 values.
  • Each box uses multi-label classification to predict the classes that the bounding box may contain.
  • This application uses the binary cross-entropy loss technique for class prediction.
  • The main reason for using binary cross-entropy loss for category prediction is that the applicant found that the softmax technique is not required for good performance; independent logistic classifiers suffice, so this step does not need to use softmax.
  • The binary cross-entropy loss technique provides more help here.
  • Binary cross-entropy loss is a common technique in the field; those skilled in the art can implement it according to requirements, so this application does not describe it further.
  • Before receiving the image to be detected, the method further includes:
  • performing image training to obtain a neural network model; the neural network model is a model with an input interface and an output interface, obtained by image training on different types of pictures.
  • The convolution operation of each layer is calculated by alternating 3×3 and 1×1 convolution layers.
  • The applicant has found through a limited number of actual tests that alternating the above-mentioned convolution layers can increase accuracy and effectively increase operation speed.
  • The alternating calculation of the convolution layers is specifically as follows: first a 3×3 convolution operation is used, then a 1×1 convolution operation, and the operations alternate in turn until all the convolution layers have participated in the operation.
  • the size is the size specified by the neural network model.
  • This feature extraction method achieves the highest measured floating-point operations per second, which means the neural network structure can make better use of the machine's GPU, improving evaluation efficiency and thus increasing speed. Because ResNets have too many layers and are not efficient, the convolution operation described in this application achieves higher efficiency and higher accuracy.
  • Each neural network is trained with the same settings and tested with single-crop accuracy at 256×256.
  • The performance of a classifier using the feature extraction of the present application is comparable to the most advanced prior-art classifiers, but with fewer floating-point operations and faster speed.
  • The probability threshold is used to filter the M classification probability values of all detection frames, according to the following screening rules:
  • Calculate the classification probability value of each detection frame, arrange the values from largest to smallest, and select the highest-ranked category. This step can be regarded as the first round of screening.
  • The M categories of each detection frame are evaluated first, and the champion category with the highest probability value is selected.
  • The highest-ranked category is compared with a preset probability threshold. If it is greater than or equal to the preset probability threshold, the detection frame is retained; if it is less than the preset probability threshold, the detection frame is deleted. In this second round of screening, the champion classification is compared with the probability threshold, and a detection frame whose value is greater than the probability threshold is eligible to enter the final round.
  • Illustratively, the probability threshold can be set to 0.24 (24%). After comparison, the detection frames that passed the preliminary rounds are displayed on the picture: any detection frame whose classification probability value is greater than or equal to the 0.24 threshold can be displayed.
  • The overlap degree calculation is performed on the N detection frames of the same type, and the detection frame with the highest overlap degree is retained.
  • The coincidence degree (IoU) is calculated in pairs; if the calculated value satisfies IoU > 0.3, the detection frame with the lower probability is eliminated.
  • This application also proposes a YOLO-based image target recognition apparatus, including: an input module that receives the image to be detected;
  • an adjustment module that adjusts the size of the image to be detected received by the input module according to preset requirements and generates the first detection image;
  • a matching recognition module that sends the first detection image generated by the adjustment module to the neural network model for matching recognition and generates a detection frame, classification identification information, and classification probability values corresponding to the classification identification information;
  • a judging module that judges whether the classification probability value is greater than a preset classification probability threshold and, if not, sends a signal to the matching recognition module, or, if so, sends a signal to the classification module;
  • a classification module that uses the detection frame and the classification identification information as the recognized classification result.
  • The apparatus further includes a training module that performs image training to obtain a neural network model, and the training module includes:
  • a data set acquisition unit that acquires the training image data set;
  • a preprocessing unit that performs image preprocessing on the training image data set to obtain a preprocessed image set;
  • a training unit that trains the preprocessed image set to obtain a neural network model with an input interface and an output interface.
  • The aforementioned data set acquisition unit includes:
  • a tag library, which stores the different tags and tag sequences corresponding to different objects;
  • a picture library, which stores the image data and label sequences of pictures;
  • a screening unit that selects from the picture library a set number of positive samples and negative samples of each tag in the total identification tag set to form a training set and a validation set, where a positive sample of a label is a picture containing the object corresponding to the label and a negative sample of a label is a picture that does not contain the object corresponding to the label; the training set is the image data of the positive and negative samples, the validation set is their label sequences, and the output of the neural network model is the predicted label sequence of the samples in the training set.
  • The above-mentioned matching recognition module includes:
  • an initial detection frame generating unit that generates the initial detection frame according to the initial preset coordinate points;
  • a prediction unit that predicts the dynamic detection frame, iteratively predicts the generated detection frame, and generates the latest detection frame;
  • a coincidence degree obtaining unit that calculates the coincidence degree of the latest detection frame;
  • a screening unit that, if the coincidence degree of the latest detection frame is greater than or equal to the preset coincidence degree threshold, keeps the latest detection frame, and, if it is less than the preset coincidence degree threshold, sends a signal to the prediction unit to continue predicting the dynamic detection frame;
  • a detection frame generation unit that generates N detection frames of the same category.
  • The above prediction unit includes:
  • a prediction sub-unit that predicts the 4 coordinates (t_x, t_y, t_w, t_h) of each detection frame;
  • an update sub-unit that, using the width p_w and height p_h of the detection frame predicted by the prediction sub-unit, updates the coordinates of the detection frame; the coordinates of the latest detection frame are: b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w·e^(t_w), b_h = p_h·e^(t_h),
  • where b_x, b_y, b_w, and b_h are the four coordinate values of the latest detection frame.
  • 53 layers of convolution operation are used in the neural network model, and the convolution operation of each layer is alternately calculated by 3 ⁇ 3 and 1 ⁇ 1 convolution layers.
  • The judgment module includes:
  • a classification probability obtaining unit that calculates the classification probability value of each detection frame;
  • a first screening unit that arranges the classification probability values of each detection frame from largest to smallest and selects the highest-ranked category;
  • a second screening unit that compares the highest-ranked category with a preset probability threshold and, if it is greater than or equal to the preset probability threshold, keeps the detection frame, or, if it is less than the preset probability threshold, deletes the detection frame.
  • The classification module includes a third screening unit, which calculates the coincidence degree of the retained detection frames of the same type, retains the detection frame with the highest coincidence degree, and uses that detection frame and its corresponding classification identification information as the recognized classification result.
  • Figure 4 shows a schematic diagram of an embodiment of the present application.
  • The convolutional layers in the neural network model are numbered 0-52 (53 layers in total). The model then receives the first detection image after size adjustment.
  • The size of the first detection image is 416*416.
  • The specific size can be set according to actual computing requirements and computing capabilities.
  • Here, 416*416 is selected for description, and the image is a color photo.
  • The 0th layer of the neural network model receives the 416*416, 3-channel (RGB) color first detection image and performs the convolution operation.
  • The 52nd layer performs a convolution operation on the feature picture, and the final output is a one-dimensional prediction array containing 13*13*5*85 values; the multi-dimensional array or matrix is reduced to a one-dimensional array through a series of operations.
  • This one-dimensional array is the prediction array.
  • The number 13*13 in the 13*13*5*85 values represents the width*height of the feature map; there are 13*13 feature units in total.
  • YOLO divides the original picture (416*416) evenly into 13*13 cells, and each feature unit corresponds to a picture area.
  • The specific size can be set by those skilled in the art according to actual computing requirements and computing capabilities.
  • The number 5 represents 5 bounding boxes with different shapes. YOLO generates 5 bounding boxes in each image area and uses the center of the area as the center of the detection frame to detect objects, so YOLO uses 13*13*5 detection frames to detect a picture or image.
  • Each detection frame contains 4 coordinate values (x, y, width, height).
  • Each detection frame has one confidence value for the detected object, the above-mentioned confidence (0-1), understood as the confidence probability of detecting the object.
  • Each detection frame has 80 classification detection probability values (0-1), giving the probability that the object in the detection frame belongs to each classification.
  • In summary, the above process divides a 416*416 picture into 13*13 picture areas.
  • Each picture area generates 5 detection frames, and each detection frame contains 85 values (4 coordinates, 1 confidence value, and 80 classification probability values).
  • The final one-dimensional prediction array (predictions) represents the detected objects in the picture; the array contains 13*13*5*85 values in total, predictions[0] through predictions[13*13*5*85-1].
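The indexing into the flat 13*13*5*85 array can be sketched as follows. The row-major cell/box/value layout is an assumption (the patent does not state the ordering); the 85 values per box (4 coordinates + 1 confidence + 80 class probabilities) follow the text above.

```python
S, B, V = 13, 5, 85  # grid size, boxes per cell, values per box

def box_slice(predictions, row, col, box):
    """Return the 85 values of one detection frame from the flat array,
    assuming row-major (cell, then box) layout."""
    offset = ((row * S + col) * B + box) * V
    return predictions[offset:offset + V]

# Stand-in flat prediction array: predictions[i] == i, so offsets are visible.
predictions = list(range(S * S * B * V))  # 13*13*5*85 = 71825 values
vals = box_slice(predictions, row=0, col=0, box=1)
print(len(predictions), len(vals), vals[0])  # 71825 85 85
```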
  • this application also proposes a computer-readable storage medium including a YOLO-based image target recognition program, which, when executed by a processor, implements the steps of the above-mentioned YOLO-based image target recognition method.
  • This application proposes a YOLO-based image target recognition method, apparatus, electronic device, and storage medium. The method can effectively improve detection accuracy and reduce detection time. Experiments and verification show that the method of this application outperforms prior-art detection methods, mainly in improved recognition accuracy and increased computing speed.
  • The disclosed device and method may be implemented in other ways.
  • The device embodiments described above are merely illustrative.
  • The division of the units is only a logical function division; in actual implementation there may be other divisions, for example: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • The coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
  • The units described above as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • The functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may serve as a unit individually, or two or more units may be integrated into one unit.
  • A unit can be implemented in the form of hardware, or in the form of hardware plus software functional units.
  • The foregoing program can be stored in a computer-readable storage medium.
  • When executed, the program performs the steps of the foregoing method embodiment; the foregoing storage medium includes media that can store program code, such as removable storage devices, read-only memory (ROM), random access memory (RAM), magnetic disks or optical disks.
  • If the above-mentioned integrated unit of this application is implemented in the form of a software function module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
  • The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include media that can store program code, such as removable storage devices, ROM, RAM, magnetic disks, or optical disks.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

An artificial intelligence technique, providing a YOLO-based image target recognition method, system, and storage medium, the method comprising: receiving an image to be detected (S102); on the basis of a preset requirement, adjusting the size of the image to be detected to generate a first detection image (S104); sending the first detection image to a neural network model to implement matching recognition, and generating a detection frame and class recognition information, and a class probability value corresponding to the class recognition information (S106); determining whether the class probability value is greater than a preset class probability value (S108); if so, then setting the detection frame and the class recognition information as the recognised class result (S110). The present method can effectively improve detection precision and reduce detection time.

Description

YOLO-based image target recognition method, device, electronic equipment and storage medium
This application claims priority to the patent application with application number 201910114621.5, filed on February 14, 2019, and entitled "A YOLO-based image target recognition method, system and storage medium".
Technical Field
This application relates to the field of computer learning and image recognition, and more specifically to a YOLO-based image target recognition method, device, electronic equipment and storage medium.
Background
With the rapid development of artificial intelligence technology, deep learning is increasingly applied in computer vision, especially in the field of image target detection.
In recent years, target detection algorithms have made great breakthroughs. The more popular algorithms can be divided into two categories. One is the R-CNN family based on region proposals (R-CNN, Fast R-CNN, Faster R-CNN); these are two-stage methods that first generate region proposals with a heuristic method (selective search) or a CNN network (RPN), and then perform classification and regression on the proposals. The other is one-stage algorithms such as YOLO (You Only Look Once) and SSD, which use a single CNN network to directly predict the categories and positions of different targets. The first category is more accurate but slower; the second is faster but less accurate. More and more target detection methods are implemented based on YOLO, and many deep networks are also improvements on YOLO. YOLO treats object detection as a regression problem: a single end-to-end network goes from the input of the original image to the output of object positions and categories.
The core idea of YOLO is to use the entire image as the input of the network and directly regress, at the output layer, the position of the bounding box and the category it belongs to. The inventor realized that, building on YOLO's high-speed operation, designing a method that can improve YOLO's accuracy is a problem that urgently needs to be solved.
Summary
In order to solve at least one of the above technical problems, this application proposes a YOLO-based image target recognition method, device, electronic equipment and storage medium.
In order to achieve the above objectives, the technical solution of the present application provides a YOLO-based image target recognition method, including:
receiving an image to be detected;
adjusting the size of the image to be detected according to preset requirements to generate a first detection image;
sending the first detection image to a neural network model for matching recognition, and generating a detection frame, classification identification information, and a classification probability value corresponding to the classification identification information;
judging whether the classification probability value is greater than a preset classification probability threshold;
if it is greater, using the detection frame and the classification identification information as the recognized classification result.
The technical solution of the present application also proposes a YOLO-based image target recognition device, including: an input module, which receives the image to be detected;
an adjustment module, which adjusts the size of the image to be detected received by the input module according to preset requirements and generates a first detection image;
a matching recognition module, which sends the first detection image generated by the adjustment module to the neural network model for matching recognition, and generates a detection frame, classification identification information, and a classification probability value corresponding to the classification identification information;
a judgment module, which judges whether the classification probability value is greater than a preset classification probability threshold, sending a signal to the matching recognition module if it is not greater, and to the classification module if it is greater;
a classification module, which uses the detection frame and the classification identification information as the recognized classification result.
The technical solution of the present application also proposes an electronic device, including a memory, a processor and a camera device, where the memory includes a YOLO-based image target recognition program which, when executed by the processor, implements the steps of the above-mentioned YOLO-based image target recognition method.
A fourth aspect of the present application also provides a computer non-volatile readable storage medium, which includes a YOLO-based image target recognition program; when the program is executed by a processor, the steps of the above-mentioned YOLO-based image target recognition method are realized.
This application proposes a YOLO-based image target recognition method, device, system and storage medium. The method judges the classification probability and only takes the recognition information as the recognition result when the preset classification probability threshold is reached, which improves the accuracy of image recognition and the recognition experience. The application can also adjust the position of the detection frame in real time, effectively improving detection efficiency and precision, and reduces detection time by optimizing the detection calculation. Experiments and verification show that the method of this application outperforms prior-art detection methods, mainly in higher recognition accuracy and faster computation.
Description of the Drawings
Figure 1 is a flow chart of a YOLO-based image target recognition method of this application;
Figure 2 is a schematic diagram of the convolution operation in the classification process of this application;
Figure 3 is a block diagram of an electronic device of the present application;
Figure 4 is a schematic diagram of a specific embodiment of the present application.
Detailed Description
In order that the above objectives, features and advantages of the application can be understood more clearly, the application is further described in detail below in conjunction with the accompanying drawings and specific implementations. It should be noted that, where there is no conflict, the embodiments of the application and the features in the embodiments can be combined with each other.
Figure 1 is a flow chart of a YOLO-based image target recognition method of this application.
As shown in Figure 1, the technical solution of the present application provides a YOLO-based image target recognition method, including:
S102, receiving an image to be detected;
S104, adjusting the size of the image to be detected according to preset requirements to generate a first detection image;
S106, sending the first detection image to a neural network model for matching recognition, and generating a detection frame, classification identification information, and a classification probability value corresponding to the classification identification information;
S108, judging whether the classification probability value is greater than a preset classification probability threshold;
S110, if it is greater, using the detection frame and the classification identification information as the recognized classification result.
It should be noted that the size is the size specified by the above neural network model. In the neural network model, the selected image size is generally smaller than the size of the image to be detected, which ensures the speed of processing and allows fast class recognition. Generally 448*448 or 416*416 is selected. Those skilled in the art should understand that the size in this step can be set according to actual needs; it is not limited to the sizes mentioned above, which do not limit the protection scope of this application.
The first detection image is sent to the neural network model, generating a detection frame, classification identification information, and a classification probability value corresponding to the classification identification information. Those skilled in the art can set the classification probability threshold according to actual needs. For example, if the classification probability threshold is set to 90%, then when detecting a picture containing a kitten, if the probability of identifying the kitten in the detection frame exceeds 90%, the kitten is circled in the detection frame and the cat in the picture has been recognized. If the classification probability value is less than the preset classification probability threshold, the method returns to step S106 for re-recognition until the classification probability value is greater than the preset classification probability threshold. The neural network model performs multi-layer convolution operations on the image. The YOLO convolution operation is a conventional operation in this field and belongs to the prior art, so it is not described here in detail.
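Steps S102–S110 can be sketched as follows. This is an illustrative outline only: `model` and `resize_fn` are hypothetical stand-ins for the trained YOLO network and the resizing routine, neither of which is reproduced here.

```python
def recognize(image, model, resize_fn, target_size=(416, 416), prob_threshold=0.9):
    """Outline of S102-S110; `model` returns (boxes, labels, probs)."""
    resized = resize_fn(image, target_size)      # S104: scale to the model's input size
    boxes, labels, probs = model(resized)        # S106: detection frames + classifications
    results = []
    for box, label, prob in zip(boxes, labels, probs):
        if prob > prob_threshold:                # S108: compare with preset threshold
            results.append((box, label))         # S110: keep as the recognized result
    return results
```

The 0.9 default mirrors the 90% example threshold above; in practice it is a preset chosen by the practitioner.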
Preferably, after step S106, i.e. after the step of generating the detection frame, the classification identification information and the corresponding classification probability value, the method includes:
calculating the classification probability values of each detection frame, sorting them from largest to smallest, and selecting the highest-ranked classification;
comparing the highest-ranked classification with a preset probability threshold; if it is greater than or equal to the preset probability threshold, keeping the detection frame; if it is less than the preset probability threshold, deleting the detection frame;
performing a coincidence-degree calculation on the retained detection frames of the same class, and keeping the detection frame with the highest coincidence degree.
In this solution, before receiving the image to be detected in step S102, the method further includes:
performing picture training to obtain a neural network model; the neural network model is trained through the following steps:
obtaining a training image data set;
performing image preprocessing on the training image data set to obtain a preprocessed image set;
training on the preprocessed image set to obtain a neural network model with an input interface and an output interface.
Preferably, the step of obtaining the training image data set includes:
establishing a label library, which stores the different labels and label orders corresponding to different objects;
building a picture library, which stores the image data and label sequences of pictures;
selecting, from the picture library, a set number of positive and negative samples for each label in the total identification label set to form a training set and a verification set, where a positive sample of a label is a picture containing the object corresponding to the label and a negative sample is a picture not containing that object; the training set is the image data of the positive and negative samples, the verification set is the label sequences of the positive and negative samples, and the output of the neural network model is the predicted label sequences of the samples in the training set.
It should be noted that the training image data set has 1,000 object categories and 1.2 million training images. Before training, the data set is preprocessed; preprocessing includes one or more of rotation, contrast enhancement, tilt, and scaling. After preprocessing the image is distorted to a certain extent, and training on the distorted images can increase the accuracy of the final image recognition.
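The distortion-based preprocessing can be illustrated with a toy pipeline. The transforms below are deliberately simple stand-ins operating on a nested-list "image"; a real pipeline would apply rotation, contrast enhancement, tilt and scaling with an image library.

```python
import random

def adjust_contrast(img, factor):
    """Toy contrast/brightness change: scale every pixel, capped at 255."""
    return [[min(255, int(p * factor)) for p in row] for row in img]

def flip_horizontal(img):
    """Toy geometric distortion standing in for rotation/tilt."""
    return [row[::-1] for row in img]

def augment(img, rng=None):
    """Apply one or more random distortions, as in the preprocessing step."""
    rng = rng or random.Random()
    transforms = [lambda i: adjust_contrast(i, rng.uniform(0.8, 1.2)),
                  flip_horizontal]
    out = img
    for t in rng.sample(transforms, k=rng.randint(1, len(transforms))):
        out = t(out)
    return out
```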
In this solution, the step of generating the detection frame is specifically:
generating an initial detection frame according to initial preset coordinate points;
performing dynamic detection frame prediction, iteratively predicting on the already generated detection frame to generate the latest detection frame;
calculating the coincidence degree of the latest detection frame;
if the coincidence degree of the latest detection frame is greater than or equal to a preset coincidence threshold, keeping the latest detection frame; if it is less than the preset coincidence threshold, continuing the dynamic detection frame prediction;
finally generating N detection frames of the same category.
It should be noted that the initial preset coordinate points are the coordinate points of the preset detection frame, which can be generated automatically during training and recognition, or generated by those skilled in the art according to actual needs.
In this solution, performing dynamic detection frame prediction and iteratively predicting on the already generated detection frame to generate the latest detection frame is specifically:
predicting the 4 coordinates of each detection frame, (t_x, t_y, t_w, t_h); if the cell is offset from the top-left corner of the image by (c_x, c_y), and the detection frame predicted in the previous step has width p_w and height p_h, then the coordinates of the latest detection frame are:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where b_x, b_y, b_w, b_h are the four coordinate values of the latest detection frame. It should be noted that the detection frame is a quadrilateral, and its position can be determined from these 4 values.
The network predicts the 4 coordinates (t_x, t_y, t_w, t_h) of each detection frame; given the cell's offset (c_x, c_y) from the top-left corner of the image, the coordinates of the latest detection frame expressed by the above formulas can be obtained.
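The four update equations translate directly into code. This is a minimal sketch under the assumption that σ is the logistic sigmoid, the standard reading of this decoding step.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, the σ in the formulas above."""
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map predicted offsets (t_x, t_y, t_w, t_h) to box coordinates,
    given the cell offset (c_x, c_y) and the prior size (p_w, p_h)."""
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```

With all-zero offsets the box stays centered on the cell with the prior's size: `decode_box(0, 0, 0, 0, 2, 3, 4, 5)` gives `(2.5, 3.5, 4.0, 5.0)`.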
It should be noted that each frame uses multi-label classification to predict the classes the bounding box may contain. In the class recognition process, this application uses the binary cross-entropy loss technique for class prediction. The main reason for using binary cross-entropy loss for category prediction is that the applicant found a softmax is not needed for good performance; independent logistic classifiers are used instead, so this step does not require the softmax technique. When the method of this application is migrated to more complex category recognition domains, the binary cross-entropy loss technique provides more help. Binary cross-entropy loss is a common technique in this field, and those skilled in the art can implement it as required, so it is not described here in detail.
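A minimal scalar version of this per-class binary cross-entropy (our illustration, not the patent's implementation): each of the 80 classes gets an independent logistic prediction, so one detection frame can carry several labels at once.

```python
import math

def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy for one class; y_pred is clamped to avoid log(0)."""
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def multilabel_loss(targets, preds):
    """Sum of independent per-class BCE terms for one detection frame."""
    return sum(bce(t, p) for t, p in zip(targets, preds))
```

Because each class term is independent, no normalization across classes is imposed, which is exactly why a softmax is unnecessary here.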
Preferably, the neural network model uses 53 convolutional layers, and the convolution operations alternate between 3×3 and 1×1 convolutional layers. The applicant found, through a limited number of actual tests, that alternating these convolutional layers can increase accuracy and also effectively increase operation speed. Specifically, a 3×3 convolution is applied first, then a 1×1 convolution, and so on alternately until all convolutional layers have taken part in the operation.
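The alternation rule can be written down as a simple schedule. This is illustrative only: it captures the 3×3-then-1×1 alternation across 53 layers, not the actual channel widths or shortcut connections of the network.

```python
def kernel_schedule(num_layers=53):
    """Kernel size of each convolutional layer: 3x3 first, then 1x1, alternating."""
    return [3 if i % 2 == 0 else 1 for i in range(num_layers)]
```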
In this solution, the size is the size specified by the neural network model.
In order to better illustrate the technical solution of the present application, it is explained in detail with an example below.
After the detection frames are generated, the classification probability values in the detection frames are calculated, and the optimal N detection frames of the same class are selected. It should be noted that the size of a detection frame is predicted dynamically, following the scheme described above. A probability threshold is used to filter the M classification probability values of all detection frames, with the following screening rules:
Calculate the classification probability values of each detection frame, sort them from largest to smallest, and select the highest-ranked classification. This step can be seen as the first round of screening: the M classifications within each detection frame are first compared internally, and the champion classification with the highest probability value is selected.
Compare the highest-ranked classification with the preset probability threshold; if it is greater than or equal to the preset probability threshold, keep the detection frame; if it is less, delete the detection frame. In this second round of screening, the champion classification is compared with the probability threshold, and only detection frames whose value exceeds the threshold qualify for the final round. For example, the probability threshold can be set to 0.24 (24%). After the comparison, the qualifying detection frames are displayed on the picture: any classification probability value greater than or equal to the 0.24 threshold is displayed. Those skilled in the art can set the probability threshold according to actual needs; the thresholds described here do not limit the protection scope of this application.
Perform a coincidence-degree calculation on the N detection frames of the same class, and keep the detection frame with the highest coincidence degree.
For example, after the above screening steps, three detection frames all detect the classification "horse".
Sort the three detections by their respective detection probabilities in descending order.
Calculate the coincidence degree (IoU) pairwise; if the calculated IoU > 0.3, eliminate the detection frame with the lower probability.
Finally, a unique detection frame classified as "horse" is obtained.
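The pairwise coincidence-degree elimination in the "horse" example can be sketched as follows, with boxes given as (x, y, width, height). This follows the worked example's rule of dropping the lower-probability frame when IoU > 0.3; the function names are ours.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def suppress(detections, threshold=0.3):
    """Keep the highest-probability frame among heavily overlapping ones."""
    kept = []
    for box, prob in sorted(detections, key=lambda d: -d[1]):
        if all(iou(box, k) <= threshold for k, _ in kept):
            kept.append((box, prob))
    return kept
```

Applied to three same-class frames, two overlapping frames collapse to the higher-probability one, leaving a single "horse" detection per location.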
Figure 2 shows a schematic diagram of the convolution operation in the classification process of this application.
As shown in Figure 2, the neural network model uses 53 convolutional layers, alternating between 3×3 and 1×1 convolutional layers.
This feature extraction method achieves the highest measured floating-point operations per second. This means the neural network structure can make better use of the machine's GPU, improving evaluation efficiency and therefore speed. Because ResNets have too many layers and are not efficient, the convolution operation of this application can achieve higher efficiency and higher accuracy.
For example, each neural network is trained with the same settings and tested with single-crop accuracy at 256×256. The performance of the classifier using the feature extraction of this application is comparable to the most advanced classifiers in the prior art, but with fewer floating-point operations and higher speed.
Figure 3 shows a block diagram of an electronic device to which the above YOLO-based image target recognition method of this application is applied.
As shown in Figure 3, the technical solution of the present application also proposes an electronic device 2, including: a memory 201, a processor 202, and a camera device 203, where the memory 201 includes a YOLO-based image target recognition program which, when executed by the processor, implements the following steps:
receiving an image to be detected;
adjusting the size of the image to be detected according to preset requirements to generate a first detection image;
sending the first detection image to a neural network model for matching recognition, and generating a detection frame, classification identification information, and a classification probability value corresponding to the classification identification information;
judging whether the classification probability value is greater than a preset classification probability threshold;
if it is greater, using the detection frame and the classification identification information as the recognized classification result.
The first detection image is sent to the neural network model, generating a detection frame, classification identification information, and a classification probability value corresponding to the classification identification information. Those skilled in the art can set the classification probability threshold according to actual needs. For example, if the classification probability threshold is set to 90%, then when detecting a picture containing a kitten, if the probability of identifying the kitten in the detection frame exceeds 90%, the kitten is circled in the detection frame and the cat in the picture has been recognized. If the classification probability value is less than the preset classification probability threshold, the method returns to step S106 for re-recognition until the classification probability value is greater than the preset classification probability threshold. The neural network model performs multi-layer convolution operations on the image. The YOLO convolution operation is a conventional operation in this field and belongs to the prior art, so it is not described here in detail.
In this solution, before receiving the image to be detected, the method further includes:
进行图片训练,得到神经网络模型;所述神经网络模型通过如下步骤进行训练:Perform image training to obtain a neural network model; the neural network model is trained through the following steps:
获取训练图像数据集;Obtain a training image data set;
将所述训练图像数据集进行图像预处理,得到预处理后的图像集;Image preprocessing the training image data set to obtain a preprocessed image set;
将所述预处理后的图像集进行训练,得到具备输入接口和输出接口的神经网络模型。The preprocessed image set is trained to obtain a neural network model with an input interface and an output interface.
In this solution, the step of generating the detection frame specifically includes:
generating an initial detection frame according to initial preset coordinate points;
performing dynamic detection frame prediction, iteratively predicting on the detection frames already generated to produce the latest detection frame;
calculating the coincidence degree of the latest detection frame;
if the coincidence degree of the latest detection frame is greater than or equal to a preset coincidence degree threshold, retaining the latest detection frame; if the coincidence degree of the latest detection frame is less than the preset coincidence degree threshold, continuing the dynamic detection frame prediction;
finally generating N detection frames of the same category.
In this solution, performing the dynamic detection frame prediction and iteratively predicting on the detection frames already generated to produce the latest detection frame specifically includes:
predicting the 4 coordinates (t_x, t_y, t_w, t_h) of each detection frame; if the cell is offset from the upper-left corner of the image by (c_x, c_y), and the detection frame predicted in the previous step has width p_w and height p_h, the coordinates of the latest detection frame are:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
In this application, dimension clustering can be used to obtain anchor boxes for dynamically predicting the detection frame; the detection frame is also referred to as a bounding box. The network predicts the 4 coordinates t_x, t_y, t_w, t_h of each detection frame. Given the offset (c_x, c_y) of the cell from the upper-left corner of the image, the coordinates of the latest detection frame expressed by the above formulas can be obtained, where b_x, b_y, b_w, and b_h are the four coordinate values of the latest detection frame. It should be noted that the detection frame is a quadrilateral, and its position can be determined by these 4 values.
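The coordinate equations above can be decoded numerically as in the following sketch, where σ is the logistic sigmoid and the cell offset (c_x, c_y) and prior dimensions (p_w, p_h) are assumed inputs, as in the description; the concrete values are illustrative only.

```python
import math

# Illustrative decoding of the box-coordinate equations above. `sigmoid` is
# the logistic function sigma; cell offsets (cx, cy) and prior box dimensions
# (pw, ph) are assumed inputs. Example values are placeholders.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    bx = sigmoid(tx) + cx          # b_x = sigma(t_x) + c_x
    by = sigmoid(ty) + cy          # b_y = sigma(t_y) + c_y
    bw = pw * math.exp(tw)         # b_w = p_w * e^(t_w)
    bh = ph * math.exp(th)         # b_h = p_h * e^(t_h)
    return bx, by, bw, bh

# With all-zero network outputs, the center lands at the cell corner plus
# sigmoid(0) = 0.5 and the box keeps the prior's width and height.
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, pw=2.0, ph=1.5))  # (3.5, 4.5, 2.0, 1.5)
```

The sigmoid keeps the predicted center inside its cell, while the exponential scales the prior box, which is why the x/y and w/h terms use different transforms.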
It should be noted that each box uses multi-label classification to predict the classes that the bounding box may contain. During class recognition, this application uses a binary cross-entropy loss for class prediction. The main reason for using binary cross-entropy loss for class prediction is that the applicant found that softmax is not required for good performance; independent logistic classifiers suffice, so softmax is not used in this step. When the method of this application is migrated to more complex class recognition domains, the binary cross-entropy loss is even more helpful. Binary cross-entropy loss is a common technique in the art, and those skilled in the art can implement it as needed, so it is not described in detail in this application.
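A minimal sketch of the independent-logistic-classifier idea discussed above follows. The logits and target labels are made-up example values; the averaging over classes is one common convention, not mandated by the description.

```python
import math

# Minimal sketch of multi-label class prediction with independent logistic
# classifiers and a binary cross-entropy loss, as discussed above. Each class
# gets its own sigmoid, so several classes can be "on" at once (unlike softmax).
# Example logits/targets are placeholders.
def bce_loss(logits, targets):
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))                       # per-class sigmoid
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)                                # mean over classes

loss = bce_loss([2.0, -1.0, 0.5], [1, 0, 1])
print(round(loss, 4))
```

Because each class is scored independently, this formulation transfers naturally to domains where labels overlap (e.g., "woman" and "person"), which is the advantage alluded to above.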
In this solution, before receiving the image to be detected, the method further includes:
performing picture training to obtain a neural network model; the neural network model is a model having an input interface and an output interface, obtained by image training on different categories of pictures.
Preferably, the neural network model uses 53 convolution layers, with 3×3 and 1×1 convolution layers computed alternately. The applicant found through a limited number of actual tests that alternating the convolution layers in this way can increase accuracy and also effectively increase operation speed. Specifically, the alternating computation first applies a 3×3 convolution operation, then a 1×1 convolution operation, and continues alternating in turn until all convolution layers have participated in the operation.
In this solution, the size is the size specified by the neural network model.
It should be noted that the neural network model uses 53 convolution layers, with the 3×3 and 1×1 convolution layers alternating.
This feature extraction approach achieves the highest measured floating-point operations per second, which means the network structure makes better use of the machine's GPU, improving evaluation efficiency and thus speed. Because ResNets have too many layers and are not efficient, the convolution operation of this application achieves higher efficiency and higher accuracy.
For example, each neural network is trained with the same settings and tested at a single-crop resolution of 256×256. The classifier using the feature extraction of this application performs comparably to the most advanced classifiers in the prior art, but with fewer floating-point operations and higher speed.
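The alternating 3×3 / 1×1 layer schedule described above can be sketched as a simple layer-specification list. This is only an illustration of the alternation pattern; the layer count parameter and the absence of channel widths, strides, and residual connections are simplifying assumptions, not the patented architecture.

```python
# Illustrative construction of an alternating 3x3 / 1x1 convolution-layer
# schedule, as described for the 53-layer feature extractor above. Only the
# kernel-size alternation is modeled; channels, strides, and shortcut
# connections are omitted as simplifying assumptions.
def build_layer_schedule(num_layers=53):
    """Return kernel sizes alternating 3x3, 1x1, 3x3, ... for num_layers layers."""
    return [(3, 3) if i % 2 == 0 else (1, 1) for i in range(num_layers)]

schedule = build_layer_schedule()
print(len(schedule), schedule[:4])  # 53 [(3, 3), (1, 1), (3, 3), (1, 1)]
```

The 1×1 layers act as cheap channel-mixing steps between the spatial 3×3 convolutions, which is one way to read the efficiency claim above.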
The classification probability values in the detection frames are calculated, and the optimal N detection frames of the same category are screened out. A probability threshold is used to screen the M classification probability values of all detection frames according to the following screening rules:
Calculate the classification probability value of each detection frame, arrange the classification probability values in descending order, and select the highest-ranked classification. This step can be regarded as the first round of screening: the M classifications within each detection frame are compared, and the "champion" classification with the highest probability value is selected.
Compare the highest-ranked classification with a preset probability threshold; if it is greater than or equal to the preset probability threshold, retain the detection frame; if it is less than the preset probability threshold, delete the detection frame. This can be regarded as the second round of screening: the champion classification is compared with the probability threshold, and only detection frames whose value is greater than the probability threshold are eligible for the final round. For example, the probability threshold can be set to 0.24 (24%). After the comparison, the detection frames that pass this preliminary round are displayed on the picture; any classification probability value greater than or equal to the 0.24 threshold is displayed.
The coincidence degree of the N detection frames of the same category is then calculated, and the detection frame with the highest coincidence degree is retained.
For example, suppose that after the above screening steps three detection frames all detect the classification "horse".
The detection probabilities of the three detections are sorted in descending order.
The coincidence degree (IoU) is calculated pairwise; if the calculated value satisfies IoU > 0.3, the detection frame with the lower probability is eliminated.
Finally, a unique detection frame classified as "horse" is obtained.
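The two-round screening rules above can be sketched as follows. The 0.24 probability threshold and 0.3 IoU threshold follow the text; the `(x1, y1, x2, y2)` box format and the per-box `(box, prob)` tuples are assumptions for the example.

```python
# Hedged sketch of the screening rules above: probability thresholding, then
# pairwise IoU suppression keeping the higher-probability box. Box format
# (x1, y1, x2, y2) is an assumption; 0.24 and 0.3 thresholds follow the text.
def iou(a, b):
    """Intersection-over-union (coincidence degree) of two boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def screen(boxes, prob_threshold=0.24, iou_threshold=0.3):
    # Round 1: drop boxes whose champion-class probability is below the threshold.
    kept = [b for b in boxes if b[1] >= prob_threshold]
    # Round 2: sort by probability and suppress lower-probability overlapping boxes.
    kept.sort(key=lambda b: b[1], reverse=True)
    final = []
    for box in kept:
        if all(iou(box[0], f[0]) <= iou_threshold for f in final):
            final.append(box)
    return final

boxes = [((0, 0, 10, 10), 0.9), ((1, 1, 10, 10), 0.8), ((50, 50, 60, 60), 0.1)]
print(screen(boxes))  # the overlapping 0.8 box and the 0.1 box are removed
```

In the example, the 0.8 box overlaps the 0.9 box with IoU 0.81 > 0.3 and is eliminated, and the 0.1 box fails the probability threshold, matching the "unique detection frame" outcome described above.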
In addition, this application also proposes a YOLO-based image target recognition apparatus, including: an input module that receives an image to be detected;
an adjustment module that adjusts the size of the image to be detected received by the input module according to preset requirements to generate a first detection image;
a matching recognition module that sends the first detection image generated by the adjustment module to a neural network model for matching recognition, generating a detection frame, classification identification information, and a classification probability value corresponding to the classification identification information;
a judgment module that judges whether the classification probability value is greater than a preset classification probability threshold and, if not, sends a signal to the matching recognition module, or, if so, sends a signal to the classification module;
a classification module that uses the detection frame and the classification identification information as the recognized classification result.
Preferably, the apparatus further includes a training module that performs picture training to obtain the neural network model, the training module including:
a data set obtaining unit that obtains a training image data set;
a preprocessing unit that performs image preprocessing on the training image data set to obtain a preprocessed image set;
a training unit that trains on the preprocessed image set to obtain a neural network model having an input interface and an output interface.
Further and preferably, the data set obtaining unit includes:
a label library storing different labels corresponding to different objects and the label order;
a picture library storing the image data and label sequences of pictures;
a screening unit that selects, from the picture library, a set number of positive samples and negative samples for each label in the total identification label set to form a training set and a validation set, wherein a positive sample of a label is a picture containing the object corresponding to the label, a negative sample of a label is a picture not containing the object corresponding to the label, the training set is the image data of the positive and negative samples, the validation set is the label sequences of the positive and negative samples, and the output of the neural network model is the predicted label sequences of the samples in the training set.
Preferably, the matching recognition module includes:
an initial detection frame generation unit that generates an initial detection frame according to initial preset coordinate points;
a prediction unit that performs dynamic detection frame prediction, iteratively predicting on the detection frames already generated to produce the latest detection frame;
a coincidence degree obtaining unit that calculates the coincidence degree of the latest detection frame;
a screening unit that retains the latest detection frame if its coincidence degree is greater than or equal to a preset coincidence degree threshold, or sends a signal to the prediction unit to continue the dynamic detection frame prediction if its coincidence degree is less than the preset coincidence degree threshold;
a detection frame generation unit that generates N detection frames of the same category.
Preferably, the prediction unit includes:
a prediction subunit that predicts the 4 coordinates (t_x, t_y, t_w, t_h) of each detection frame;
a judgment subunit that judges whether the cell is offset from the upper-left corner of the image by (c_x, c_y) and, if so, sends a signal to the update subunit;
an update subunit that updates the coordinates of the detection frame using the width p_w and height p_h of the detection frame predicted by the prediction subunit, the coordinates of the latest detection frame being:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where b_x, b_y, b_w, and b_h are the four coordinate values of the latest detection frame.
Preferably, the neural network model uses 53 convolution layers, with 3×3 and 1×1 convolution layers computed alternately.
Preferably, the judgment module includes:
a classification probability obtaining unit that calculates the classification probability value of each detection frame;
a first screening unit that arranges the classification probability values of each detection frame in descending order and selects the highest-ranked classification;
a second screening unit that compares the highest-ranked classification with a preset probability threshold and retains the detection frame if it is greater than or equal to the preset probability threshold, or deletes the detection frame if it is less than the preset probability threshold;
and the classification module includes a third screening unit that calculates the coincidence degree of the retained detection frames of the same category, retains the detection frame with the highest coincidence degree, and uses the detection frame with the highest coincidence degree and its corresponding classification identification information as the recognized classification result.
To better explain the technical solution of this application, an embodiment is described in detail below. FIG. 4 shows a schematic diagram of an embodiment of this application.
As shown in FIG. 4, the convolution layers in the neural network model are numbered 0 to 52. The size-adjusted first detection image is then received, where the size of the first detection image is 416×416; the specific size can be set according to actual computing requirements and computing capability, and 416×416 is selected for description in this embodiment; the image is a color photograph. Layer 0 of the neural network model receives the 416×416, 3-channel (RGB) color first detection image and performs the convolution operation.
After the convolution operations of layers 0 to 51, a 13×13, 425-channel feature map is obtained.
Layer 52 performs a convolution operation on the feature map and finally outputs a one-dimensional prediction array containing 13×13×5×85 values. The multi-dimensional array or matrix is reduced to a one-dimensional array through a series of operations; this one-dimensional array is the prediction array.
Among the 13×13×5×85 values, 13×13 represents the width × height of the feature map, giving a total of 13×13 feature units. YOLO evenly divides the original picture (416×416) into 13×13 regions (cells), one picture region per feature unit. The specific size can be set by those skilled in the art according to actual computing requirements and computing capability.
The number 5 represents 5 detection frames (bounding boxes) of different shapes. YOLO generates 5 differently shaped detection frames in each picture region, using the center point of the region as the center point of the detection frames to detect objects, so YOLO uses 13×13×5 detection frames in total to detect one picture or image.
The number 85 can be understood in 3 parts: 85 = 4 + 1 + 80.
4: each detection frame contains 4 coordinate values (x, y, width, height).
1: each detection frame has 1 object-confidence value, i.e., the above-mentioned confidence (0 to 1), understood as the confidence probability that an object has been detected.
80: each detection frame has 80 classification probability values (0 to 1), understood as the probabilities that the object in the detection frame belongs to each of the classifications.
In summary, the above process divides a 416×416 picture evenly into 13×13 picture regions; each picture region generates 5 detection frames, and each detection frame contains 85 values (4 coordinate values + 1 object-confidence value + 80 classification values). The resulting one-dimensional prediction array (predictions) represents the objects detected in the picture, and the array contains a total of 13×13×5×85 values, predictions[0] to predictions[13×13×5×85−1].
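The layout of the flattened prediction array described above can be made concrete with a small indexing sketch. Only the index arithmetic reflects the described layout (cell row, cell column, box, value); the array contents here are zero placeholders, and the row-major ordering of cells is an assumption.

```python
# Illustrative indexing into the flattened 13*13*5*85 prediction array
# described above. Row-major cell ordering is an assumption; the array
# contents are placeholders, only the offsets model the layout.
GRID, BOXES, VALUES = 13, 5, 85   # 13x13 cells, 5 boxes per cell, 85 values per box

def prediction_index(row, col, box, value):
    """Offset of one value inside the one-dimensional predictions array."""
    return ((row * GRID + col) * BOXES + box) * VALUES + value

predictions = [0.0] * (GRID * GRID * BOXES * VALUES)
print(len(predictions))              # 13*13*5*85 = 71825
print(prediction_index(0, 0, 0, 4))  # confidence value of the first box: index 4
```

Within each 85-value block, indices 0-3 are the box coordinates, index 4 is the object-confidence value, and indices 5-84 are the 80 classification probabilities, mirroring the 85 = 4 + 1 + 80 decomposition above.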
In addition, this application also proposes a computer-readable storage medium including a YOLO-based image target recognition program which, when executed by a processor, implements the steps of the above YOLO-based image target recognition method.
The specific implementation of the computer-readable storage medium of this application is substantially the same as the specific implementations of the above YOLO-based image target recognition method and electronic apparatus, and is not repeated here.
This application proposes a YOLO-based image target recognition method, apparatus, electronic device, and storage medium. The method can effectively improve detection accuracy and reduce detection time. Experiments and verification show that the method of this application is superior to prior-art detection methods, mainly in improved recognition accuracy and increased operation speed.
In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may all be integrated into one processing unit, or each unit may serve individually as one unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
Those of ordinary skill in the art can understand that all or part of the steps of the above method embodiments may be implemented by program instructions and related hardware; the foregoing program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the above method embodiments. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of this application is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of this application. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed in this application, and such changes or substitutions shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

1. A YOLO-based image target recognition method, comprising:
receiving an image to be detected;
adjusting the size of the image to be detected according to preset requirements to generate a first detection image;
sending the first detection image to a neural network model for matching recognition to generate a detection frame, classification identification information, and a classification probability value corresponding to the classification identification information;
judging whether the classification probability value is greater than a preset classification probability threshold;
if it is greater, using the detection frame and the classification identification information as the recognized classification result.
2. The YOLO-based image target recognition method according to claim 1, wherein before receiving the image to be detected, the method further comprises:
performing picture training to obtain the neural network model; the neural network model is trained through the following steps:
obtaining a training image data set;
performing image preprocessing on the training image data set to obtain a preprocessed image set;
training on the preprocessed image set to obtain a neural network model having an input interface and an output interface.
3. The YOLO-based image target recognition method according to claim 2, wherein the step of obtaining the training image data set comprises:
establishing a label library storing different labels corresponding to different objects and the label order;
constructing a picture library storing the image data and label sequences of pictures;
selecting, from the picture library, a set number of positive samples and negative samples for each label in the total identification label set to form a training set and a validation set, wherein a positive sample of a label is a picture containing the object corresponding to the label, a negative sample of a label is a picture not containing the object corresponding to the label, the training set is the image data of the positive and negative samples, the validation set is the label sequences of the positive and negative samples, and the output of the neural network model is the predicted label sequences of the samples in the training set.
4. The YOLO-based image target recognition method according to claim 1, wherein the step of generating the detection frame specifically comprises:
generating an initial detection frame according to initial preset coordinate points;
performing dynamic detection frame prediction, iteratively predicting on the detection frames already generated to produce the latest detection frame;
calculating the coincidence degree of the latest detection frame;
if the coincidence degree of the latest detection frame is greater than or equal to a preset coincidence degree threshold, retaining the latest detection frame; if the coincidence degree of the latest detection frame is less than the preset coincidence degree threshold, continuing the dynamic detection frame prediction;
finally generating N detection frames of the same category.
5. The YOLO-based image target recognition method according to claim 4, wherein performing the dynamic detection frame prediction and iteratively predicting on the detection frames already generated to produce the latest detection frame specifically comprises:
predicting the 4 coordinates (t_x, t_y, t_w, t_h) of each detection frame; if the cell is offset from the upper-left corner of the image by (c_x, c_y), and the detection frame predicted in the previous step has width p_w and height p_h, the coordinates of the latest detection frame are:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where b_x, b_y, b_w, and b_h are the four coordinate values of the latest detection frame.
6. The YOLO-based image target recognition method according to claim 2, wherein the neural network model uses 53 convolution layers, with 3×3 and 1×1 convolution layers computed alternately.
7. The YOLO-based image target recognition method according to claim 1, wherein the size is the size specified by the neural network model.
8. The YOLO-based image target recognition method according to claim 1, wherein after the step of generating the detection frame, the classification identification information, and the classification probability value corresponding to the classification identification information, the method comprises:
calculating the classification probability value of each detection frame, arranging the classification probability values in descending order, and selecting the highest-ranked classification;
comparing the highest-ranked classification with a preset probability threshold; if it is greater than or equal to the preset probability threshold, retaining the detection frame; if it is less than the preset probability threshold, deleting the detection frame;
calculating the coincidence degree of the retained detection frames of the same category, and retaining the detection frame with the highest coincidence degree.
  9. A YOLO-based image target recognition apparatus, comprising:
    an input module, which receives an image to be detected;
    an adjustment module, which adjusts the size of the image to be detected received by the input module according to preset requirements to generate a first detection image;
    a matching recognition module, which sends the first detection image generated by the adjustment module to a neural network model for matching recognition, generating a detection frame, classification identification information, and a classification probability value corresponding to the classification identification information;
    a judgment module, which judges whether the classification probability value is greater than a preset classification probability threshold and, if not, sends a signal to the matching recognition module or, if so, sends a signal to the classification module;
    a classification module, which takes the detection frame and the classification identification information as the recognized classification result.
  10. The YOLO-based image target recognition apparatus according to claim 9, further comprising a training module that performs image training to obtain the neural network model, the training module comprising:
    a data set acquisition unit, which acquires a training image data set;
    a preprocessing unit, which performs image preprocessing on the training image data set to obtain a preprocessed image set;
    a training unit, which trains on the preprocessed image set to obtain a neural network model with an input interface and an output interface.
  11. The YOLO-based image target recognition apparatus according to claim 10, wherein the data set acquisition unit comprises:
    a label library, which stores the different labels corresponding to different objects and the label order;
    a picture library, which stores the image data and label sequences of pictures;
    a screening unit, which selects from the picture library a set number of positive and negative samples for each label in the overall label set to form a training set and a validation set, wherein a positive sample of a label is a picture containing the object corresponding to that label, a negative sample of a label is a picture not containing that object, the training set is the image data of the positive and negative samples, the validation set is the label sequences of the positive and negative samples, and the output of the neural network model is the predicted label sequences of the samples in the training set.
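The screening unit's sampling can be sketched as below. The `(image_data, label_set)` pair format for the picture library, the function name, and the fixed seed are all illustrative assumptions, not part of the claims:

```python
import random

def build_train_set(picture_library, labels, n_per_label=2, seed=0):
    """For each label, pick a set number of positive samples (pictures
    containing the labelled object) and negative samples (pictures that
    do not), collecting image data (training set) and label sequences
    (validation set)."""
    rng = random.Random(seed)
    train, valid = [], []
    for label in labels:
        positives = [p for p in picture_library if label in p[1]]
        negatives = [p for p in picture_library if label not in p[1]]
        chosen = (rng.sample(positives, min(n_per_label, len(positives))) +
                  rng.sample(negatives, min(n_per_label, len(negatives))))
        for img, tags in chosen:
            train.append(img)            # training set: image data
            valid.append(sorted(tags))   # validation set: label sequence
    return train, valid
```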
  12. The YOLO-based image target recognition apparatus according to claim 9, wherein the matching recognition module comprises:
    an initial detection frame generation unit, which generates an initial detection frame from initial preset coordinate points;
    a prediction unit, which predicts a dynamic detection frame, iteratively predicting over the already generated detection frames to produce the latest detection frame;
    a coincidence degree acquisition unit, which calculates the coincidence degree of the latest detection frame;
    a screening unit, which retains the latest detection frame if its coincidence degree is greater than or equal to a preset coincidence threshold, and otherwise sends a signal to the prediction unit to continue predicting the dynamic detection frame;
    a detection frame generation unit, which generates N detection frames of the same category.
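The coincidence degree computed by the coincidence degree acquisition unit is conventionally an intersection-over-union between two boxes; a minimal sketch, assuming `(x_min, y_min, x_max, y_max)` box coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union ("coincidence degree") of two axis-aligned
    boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes score 1.0; disjoint boxes score 0.0, so the preset coincidence threshold falls in [0, 1].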
  13. The YOLO-based image target recognition apparatus according to claim 12, wherein the prediction unit comprises:
    a prediction subunit, which predicts the four coordinates (t_x, t_y, t_w, t_h) of each detection frame;
    a judgment subunit, which judges whether the cell deviates from the upper-left corner coordinates (c_x, c_y) of the image and, if so, sends a signal to the update subunit;
    an update subunit, which updates the coordinates of the detection frame using the width p_w and height p_h of the detection frame predicted by the prediction subunit, the coordinates of the latest detection frame being:
    b_x = σ(t_x) + c_x
    b_y = σ(t_y) + c_y
    b_w = p_w · e^(t_w)
    b_h = p_h · e^(t_h)
    where b_x, b_y, b_w, and b_h are the four coordinate values of the latest detection frame.
  14. The YOLO-based image target recognition apparatus according to claim 9, wherein the neural network model uses 53 convolutional layers, alternating 3×3 and 1×1 convolutions.
  15. The YOLO-based image target recognition apparatus according to claim 9, wherein the judgment module comprises:
    a classification probability acquisition unit, which calculates the classification probability value of each detection frame;
    a first screening unit, which sorts the classification probability values of the detection frames in descending order and selects the highest-ranked classification;
    a second screening unit, which compares the highest-ranked classification with a preset probability threshold and retains the detection frame if it is greater than or equal to the preset probability threshold, or deletes the detection frame if it is less than the preset probability threshold;
    and wherein the classification module comprises a third screening unit, which computes the coincidence degree of the retained detection frames of the same class, retains the detection frame with the highest coincidence degree, and takes that detection frame and its corresponding classification identification information as the recognized classification result.
  16. An electronic device, comprising a memory, a processor, and a camera device, wherein the memory includes a YOLO-based image target recognition program which, when executed by the processor, implements the following steps:
    receiving an image to be detected;
    adjusting the size of the image to be detected according to preset requirements to generate a first detection image;
    sending the first detection image to a neural network model for matching recognition, generating a detection frame, classification identification information, and a classification probability value corresponding to the classification identification information;
    judging whether the classification probability value is greater than a preset classification probability threshold;
    if it is greater, taking the detection frame and the classification identification information as the recognized classification result.
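The steps above can be sketched end to end as below. The resize and model callables are stand-ins for the claimed adjustment and matching-recognition steps, and the 416×416 default input size is an assumption, not stated in the claims:

```python
def recognize(image, model_fn, resize_fn, input_size=(416, 416), prob_threshold=0.5):
    """Resize the received image, run the network, and keep only
    detections whose classification probability exceeds the preset
    threshold; model_fn returns (box, label, probability) tuples."""
    first_image = resize_fn(image, input_size)   # adjust to the model's required size
    detections = model_fn(first_image)           # matching recognition
    return [(box, label)                         # classification result
            for box, label, prob in detections
            if prob > prob_threshold]            # preset classification probability threshold
```

With stub callables, a detection scoring 0.9 survives a 0.5 threshold while one scoring 0.4 is dropped.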
  17. The electronic device according to claim 16, wherein the step of generating the detection frame comprises:
    generating an initial detection frame from initial preset coordinate points;
    predicting a dynamic detection frame, performing iterative prediction on the already generated detection frame to generate the latest detection frame;
    calculating the coincidence degree of the latest detection frame;
    if the coincidence degree of the latest detection frame is greater than or equal to a preset coincidence threshold, retaining the latest detection frame; if it is less than the preset coincidence threshold, continuing to predict the dynamic detection frame;
    finally generating N detection frames of the same category.
  18. The electronic device according to claim 16, wherein the step of predicting the dynamic detection frame, performing iterative prediction on the already generated detection frame, and generating the latest detection frame comprises:
    predicting the four coordinates (t_x, t_y, t_w, t_h) of each detection frame; if the cell deviates from the upper-left corner coordinates (c_x, c_y) of the image, and the detection frame predicted in the previous step has width p_w and height p_h, the coordinates of the latest detection frame are:
    b_x = σ(t_x) + c_x
    b_y = σ(t_y) + c_y
    b_w = p_w · e^(t_w)
    b_h = p_h · e^(t_h)
    where b_x, b_y, b_w, and b_h are the four coordinate values of the latest detection frame.
  19. The electronic device according to claim 16, wherein the training of the neural network model comprises:
    acquiring a training image data set;
    performing image preprocessing on the training image data set to obtain a preprocessed image set;
    training on the preprocessed image set to obtain a neural network model with an input interface and an output interface.
  20. A computer non-volatile readable storage medium, wherein the computer non-volatile readable storage medium includes a YOLO-based image target recognition program which, when executed by a processor, implements the steps of the YOLO-based image target recognition method according to any one of claims 1 to 8.
PCT/CN2019/118499 2019-02-14 2019-11-14 Yolo-based image target recognition method and apparatus, electronic device, and storage medium WO2020164282A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910114621.5A CN109977943B (en) 2019-02-14 2019-02-14 Image target recognition method, system and storage medium based on YOLO
CN201910114621.5 2019-02-14

Publications (1)

Publication Number Publication Date
WO2020164282A1 true WO2020164282A1 (en) 2020-08-20

Family

ID=67076997

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118499 WO2020164282A1 (en) 2019-02-14 2019-11-14 Yolo-based image target recognition method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN109977943B (en)
WO (1) WO2020164282A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107527009A (en) * 2017-07-11 2017-12-29 浙江汉凡软件科技有限公司 Abandoned object detection method based on YOLO target detection
CN108154098A (en) * 2017-12-20 2018-06-12 歌尔股份有限公司 Robot target recognition method, device, and robot
CN109117794A (en) * 2018-08-16 2019-01-01 广东工业大学 Moving target behavior tracking method, apparatus, equipment, and readable storage medium
CN109977943A (en) * 2019-02-14 2019-07-05 平安科技(深圳)有限公司 YOLO-based image target recognition method, system, and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107247956B (en) * 2016-10-09 2020-03-27 成都快眼科技有限公司 Rapid target detection method based on grid judgment
CN107423760A (en) * 2017-07-21 2017-12-01 西安电子科技大学 Deep learning object detection method based on pre-segmentation and regression

Cited By (183)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101134A (en) * 2020-08-24 2020-12-18 深圳市商汤科技有限公司 Object detection method and device, electronic device and storage medium
CN112101134B (en) * 2020-08-24 2024-01-02 深圳市商汤科技有限公司 Object detection method and device, electronic equipment and storage medium
CN112036286A (en) * 2020-08-25 2020-12-04 北京华正明天信息技术股份有限公司 Method for temperature sensing and intelligent flame analysis and recognition based on the YOLOv3 algorithm
CN111986255A (en) * 2020-09-07 2020-11-24 北京凌云光技术集团有限责任公司 Multi-scale anchor initialization method and device of image detection model
CN111986255B (en) * 2020-09-07 2024-04-09 凌云光技术股份有限公司 Multi-scale anchor initializing method and device of image detection model
CN112036507A (en) * 2020-09-25 2020-12-04 北京小米松果电子有限公司 Training method and device of image recognition model, storage medium and electronic equipment
CN112036507B (en) * 2020-09-25 2023-11-14 北京小米松果电子有限公司 Training method and device of image recognition model, storage medium and electronic equipment
CN112149748B (en) * 2020-09-28 2024-05-21 商汤集团有限公司 Image classification method and device, electronic equipment and storage medium
CN112149748A (en) * 2020-09-28 2020-12-29 商汤集团有限公司 Image classification method and device, electronic equipment and storage medium
CN112183358B (en) * 2020-09-29 2024-04-23 新石器慧通(北京)科技有限公司 Training method and device for target detection model
CN112183358A (en) * 2020-09-29 2021-01-05 新石器慧拓(北京)科技有限公司 Training method and device for target detection model
CN112200186A (en) * 2020-10-15 2021-01-08 上海海事大学 Car logo identification method based on improved YOLO_V3 model
CN112200186B (en) * 2020-10-15 2024-03-15 上海海事大学 Vehicle logo identification method based on improved YOLO_V3 model
CN112231497A (en) * 2020-10-19 2021-01-15 腾讯科技(深圳)有限公司 Information classification method and device, storage medium and electronic equipment
CN112231497B (en) * 2020-10-19 2024-04-09 腾讯科技(深圳)有限公司 Information classification method and device, storage medium and electronic equipment
CN112348778A (en) * 2020-10-21 2021-02-09 深圳市优必选科技股份有限公司 Object identification method and device, terminal equipment and storage medium
CN112348778B (en) * 2020-10-21 2023-10-27 深圳市优必选科技股份有限公司 Object identification method, device, terminal equipment and storage medium
CN112288003A (en) * 2020-10-28 2021-01-29 北京奇艺世纪科技有限公司 Neural network training and target detection method and device
CN112365465B (en) * 2020-11-09 2024-02-06 浙江大华技术股份有限公司 Synthetic image category determining method and device, storage medium and electronic device
CN112330641A (en) * 2020-11-09 2021-02-05 迩言(上海)科技有限公司 Grain imperfect grain identification method and system based on deep learning
CN112365465A (en) * 2020-11-09 2021-02-12 浙江大华技术股份有限公司 Method and apparatus for determining type of synthesized image, storage medium, and electronic apparatus
CN112287884B (en) * 2020-11-19 2024-02-20 长江大学 Examination abnormal behavior detection method and device and computer readable storage medium
CN112287884A (en) * 2020-11-19 2021-01-29 长江大学 Examination abnormal behavior detection method and device and computer readable storage medium
CN112364807A (en) * 2020-11-24 2021-02-12 深圳市优必选科技股份有限公司 Image recognition method and device, terminal equipment and computer readable storage medium
CN112348112B (en) * 2020-11-24 2023-12-15 深圳市优必选科技股份有限公司 Training method and training device for image recognition model and terminal equipment
CN112348112A (en) * 2020-11-24 2021-02-09 深圳市优必选科技股份有限公司 Training method and device for image recognition model and terminal equipment
CN112364807B (en) * 2020-11-24 2023-12-15 深圳市优必选科技股份有限公司 Image recognition method, device, terminal equipment and computer readable storage medium
CN112560586A (en) * 2020-11-27 2021-03-26 国家电网有限公司大数据中心 Method and device for obtaining structured data of pole and tower signboard and electronic equipment
CN112560586B (en) * 2020-11-27 2024-05-10 国家电网有限公司大数据中心 Method and device for obtaining structural data of pole and tower signboard and electronic equipment
CN112489015A (en) * 2020-11-27 2021-03-12 广州高新兴机器人有限公司 Chemical fiber impurity floating identification method for mobile robot
CN112634202A (en) * 2020-12-04 2021-04-09 浙江省农业科学院 Method, device and system for detecting behavior of polyculture fish shoal based on YOLOv3-Lite
CN113723157A (en) * 2020-12-15 2021-11-30 京东数字科技控股股份有限公司 Crop disease identification method and device, electronic equipment and storage medium
CN112613097A (en) * 2020-12-15 2021-04-06 中铁二十四局集团江苏工程有限公司 BIM rapid modeling method based on computer vision
CN112507912B (en) * 2020-12-15 2024-06-11 杭州网易智企科技有限公司 Method and device for identifying illegal pictures
CN113723157B (en) * 2020-12-15 2024-02-09 京东科技控股股份有限公司 Crop disease identification method and device, electronic equipment and storage medium
CN112507912A (en) * 2020-12-15 2021-03-16 网易(杭州)网络有限公司 Method and device for identifying illegal picture
CN112633352A (en) * 2020-12-18 2021-04-09 浙江大华技术股份有限公司 Target detection method and device, electronic equipment and storage medium
CN112633352B (en) * 2020-12-18 2023-08-29 浙江大华技术股份有限公司 Target detection method and device, electronic equipment and storage medium
CN112634327A (en) * 2020-12-21 2021-04-09 合肥讯图信息科技有限公司 Tracking method based on YOLOv4 model
CN112633159B (en) * 2020-12-22 2024-04-12 北京迈格威科技有限公司 Human-object interaction relation identification method, model training method and corresponding device
CN112580523A (en) * 2020-12-22 2021-03-30 平安国际智慧城市科技股份有限公司 Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium
CN112633159A (en) * 2020-12-22 2021-04-09 北京迈格威科技有限公司 Human-object interaction relation recognition method, model training method and corresponding device
CN112699925A (en) * 2020-12-23 2021-04-23 国网安徽省电力有限公司检修分公司 Transformer substation meter image classification method
CN112529020A (en) * 2020-12-24 2021-03-19 携程旅游信息技术(上海)有限公司 Animal identification method, system, equipment and storage medium based on neural network
CN112529020B (en) * 2020-12-24 2024-05-24 携程旅游信息技术(上海)有限公司 Animal identification method, system, equipment and storage medium based on neural network
CN112580734A (en) * 2020-12-25 2021-03-30 深圳市优必选科技股份有限公司 Target detection model training method, system, terminal device and storage medium
CN112541483A (en) * 2020-12-25 2021-03-23 三峡大学 Dense face detection method combining YOLO and blocking-fusion strategy
CN112541483B (en) * 2020-12-25 2024-05-17 深圳市富浩鹏电子有限公司 Dense face detection method combining YOLO and blocking-fusion strategy
CN112580734B (en) * 2020-12-25 2023-12-29 深圳市优必选科技股份有限公司 Target detection model training method, system, terminal equipment and storage medium
CN112633286A (en) * 2020-12-25 2021-04-09 北京航星机器制造有限公司 Intelligent security inspection system based on similarity rate and recognition probability of dangerous goods
CN112597915A (en) * 2020-12-26 2021-04-02 上海有个机器人有限公司 Method, device, medium and robot for identifying indoor close-distance pedestrians
CN112597915B (en) * 2020-12-26 2024-04-09 上海有个机器人有限公司 Method, device, medium and robot for identifying indoor close-distance pedestrians
CN112734641A (en) * 2020-12-31 2021-04-30 百果园技术(新加坡)有限公司 Training method and device of target detection model, computer equipment and medium
CN112734641B (en) * 2020-12-31 2024-05-31 百果园技术(新加坡)有限公司 Training method and device for target detection model, computer equipment and medium
CN112784694A (en) * 2020-12-31 2021-05-11 杭州电子科技大学 EVP-YOLO-based indoor article detection method
CN112560799B (en) * 2021-01-05 2022-08-05 北京航空航天大学 Unmanned aerial vehicle intelligent vehicle target detection method based on adaptive target area search and game and application
CN112560799A (en) * 2021-01-05 2021-03-26 北京航空航天大学 Unmanned aerial vehicle intelligent vehicle target detection method based on adaptive target area search and game and application
CN112733741A (en) * 2021-01-14 2021-04-30 苏州挚途科技有限公司 Traffic signboard identification method and device and electronic equipment
CN112818980A (en) * 2021-01-15 2021-05-18 湖南千盟物联信息技术有限公司 Steel ladle number detection and identification method based on Yolov3 algorithm
CN112766170B (en) * 2021-01-21 2024-04-16 广西财经学院 Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN112766170A (en) * 2021-01-21 2021-05-07 广西财经学院 Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN112906478A (en) * 2021-01-22 2021-06-04 北京百度网讯科技有限公司 Target object identification method, device, equipment and storage medium
CN112906478B (en) * 2021-01-22 2024-01-09 北京百度网讯科技有限公司 Target object identification method, device, equipment and storage medium
CN112989924B (en) * 2021-01-26 2024-05-24 深圳市优必选科技股份有限公司 Target detection method, target detection device and terminal equipment
CN112989924A (en) * 2021-01-26 2021-06-18 深圳市优必选科技股份有限公司 Target detection method, target detection device and terminal equipment
CN112906495B (en) * 2021-01-27 2024-04-30 深圳安智杰科技有限公司 Target detection method and device, electronic equipment and storage medium
CN112906495A (en) * 2021-01-27 2021-06-04 深圳安智杰科技有限公司 Target detection method and device, electronic equipment and storage medium
CN114821288A (en) * 2021-01-29 2022-07-29 中强光电股份有限公司 Image identification method and unmanned aerial vehicle system
CN112800971A (en) * 2021-01-29 2021-05-14 深圳市商汤科技有限公司 Neural network training and point cloud data processing method, device, equipment and medium
CN112911171A (en) * 2021-02-04 2021-06-04 上海航天控制技术研究所 Intelligent photoelectric information processing system and method based on accelerated processing
CN112911171B (en) * 2021-02-04 2022-04-22 上海航天控制技术研究所 Intelligent photoelectric information processing system and method based on accelerated processing
CN112861716A (en) * 2021-02-05 2021-05-28 深圳市安软科技股份有限公司 Illegal article placement monitoring method, system, equipment and storage medium
CN112861711A (en) * 2021-02-05 2021-05-28 深圳市安软科技股份有限公司 Regional intrusion detection method and device, electronic equipment and storage medium
CN113762023B (en) * 2021-02-18 2024-05-24 北京京东振世信息技术有限公司 Object identification method and device based on article association relation
CN113762023A (en) * 2021-02-18 2021-12-07 北京京东振世信息技术有限公司 Object identification method and device based on article incidence relation
CN112906794A (en) * 2021-02-22 2021-06-04 珠海格力电器股份有限公司 Target detection method, device, storage medium and terminal
CN113095133A (en) * 2021-03-04 2021-07-09 北京迈格威科技有限公司 Model training method, target detection method and corresponding device
CN113095133B (en) * 2021-03-04 2023-12-29 北京迈格威科技有限公司 Model training method, target detection method and corresponding devices
CN112906621A (en) * 2021-03-10 2021-06-04 北京华捷艾米科技有限公司 Hand detection method, device, storage medium and equipment
CN112966618B (en) * 2021-03-11 2024-02-09 京东科技信息技术有限公司 Dressing recognition method, apparatus, device and computer readable medium
CN112966618A (en) * 2021-03-11 2021-06-15 京东数科海益信息科技有限公司 Dressing identification method, device, equipment and computer readable medium
CN113011319B (en) * 2021-03-16 2024-04-16 上海应用技术大学 Multi-scale fire target identification method and system
CN112966762A (en) * 2021-03-16 2021-06-15 南京恩博科技有限公司 Wild animal detection method and device, storage medium and electronic equipment
CN113011319A (en) * 2021-03-16 2021-06-22 上海应用技术大学 Multi-scale fire target identification method and system
CN112966762B (en) * 2021-03-16 2023-12-26 南京恩博科技有限公司 Wild animal detection method and device, storage medium and electronic equipment
CN112991304A (en) * 2021-03-23 2021-06-18 武汉大学 Molten pool sputtering detection method based on laser directional energy deposition monitoring system
CN112990334A (en) * 2021-03-29 2021-06-18 西安电子科技大学 Small sample SAR image target identification method based on improved prototype network
CN113222889A (en) * 2021-03-30 2021-08-06 大连智慧渔业科技有限公司 Industrial aquaculture counting method and device for aquatic aquaculture objects under high-resolution images
CN113222889B (en) * 2021-03-30 2024-03-12 大连智慧渔业科技有限公司 Industrial aquaculture counting method and device for aquaculture under high-resolution image
CN113052127A (en) * 2021-04-09 2021-06-29 上海云从企业发展有限公司 Behavior detection method, behavior detection system, computer equipment and machine readable medium
CN113139597B (en) * 2021-04-19 2022-11-04 中国人民解放军91054部队 Statistical thought-based image distribution external detection method
CN113139597A (en) * 2021-04-19 2021-07-20 中国人民解放军91054部队 Statistical thought-based image distribution external detection method
CN113128522A (en) * 2021-05-11 2021-07-16 四川云从天府人工智能科技有限公司 Target identification method and device, computer equipment and storage medium
CN113128522B (en) * 2021-05-11 2024-04-05 四川云从天府人工智能科技有限公司 Target identification method, device, computer equipment and storage medium
CN113240638B (en) * 2021-05-12 2023-11-10 上海联影智能医疗科技有限公司 Target detection method, device and medium based on deep learning
CN113240638A (en) * 2021-05-12 2021-08-10 上海联影智能医疗科技有限公司 Target detection method, device and medium based on deep learning
CN113205067A (en) * 2021-05-26 2021-08-03 北京京东乾石科技有限公司 Method and device for monitoring operator, electronic equipment and storage medium
CN113205067B (en) * 2021-05-26 2024-04-09 北京京东乾石科技有限公司 Method and device for monitoring operators, electronic equipment and storage medium
WO2022252089A1 (en) * 2021-05-31 2022-12-08 京东方科技集团股份有限公司 Training method for object detection model, and object detection method and device
CN113435260A (en) * 2021-06-07 2021-09-24 上海商汤智能科技有限公司 Image detection method, related training method, related device, equipment and medium
CN113392833A (en) * 2021-06-10 2021-09-14 沈阳派得林科技有限责任公司 Method for identifying type number of industrial radiographic negative image
CN113377888A (en) * 2021-06-25 2021-09-10 北京百度网讯科技有限公司 Training target detection model and method for detecting target
CN113536963B (en) * 2021-06-25 2023-08-15 西安电子科技大学 SAR image airplane target detection method based on lightweight YOLO network
CN113377888B (en) * 2021-06-25 2024-04-02 北京百度网讯科技有限公司 Method for training object detection model and detection object
CN113486746A (en) * 2021-06-25 2021-10-08 海南电网有限责任公司三亚供电局 Power cable external damage prevention method based on biological induction and video monitoring
CN113536963A (en) * 2021-06-25 2021-10-22 西安电子科技大学 SAR image airplane target detection method based on lightweight YOLO network
CN113591566A (en) * 2021-06-28 2021-11-02 北京百度网讯科技有限公司 Training method and device of image recognition model, electronic equipment and storage medium
CN113553948A (en) * 2021-07-23 2021-10-26 中远海运科技(北京)有限公司 Automatic recognition and counting method for tobacco insects and computer readable medium
CN113486857B (en) * 2021-08-03 2023-05-12 云南大学 YOLOv 4-based ascending safety detection method and system
CN113486857A (en) * 2021-08-03 2021-10-08 云南大学 Ascending safety detection method and system based on YOLOv4
CN113723217A (en) * 2021-08-09 2021-11-30 南京邮电大学 Object intelligent detection method and system based on yolo improvement
CN113948190A (en) * 2021-09-02 2022-01-18 上海健康医学院 Method and equipment for automatically identifying X-ray skull positive position film cephalogram measurement mark points
CN113723406A (en) * 2021-09-03 2021-11-30 乐普(北京)医疗器械股份有限公司 Processing method and device for positioning bracket of coronary angiography image
CN113723406B (en) * 2021-09-03 2023-07-18 乐普(北京)医疗器械股份有限公司 Method and device for processing support positioning of coronary angiography image
CN114119455B (en) * 2021-09-03 2024-04-09 乐普(北京)医疗器械股份有限公司 Method and device for positioning vascular stenosis part based on target detection network
CN114119455A (en) * 2021-09-03 2022-03-01 乐普(北京)医疗器械股份有限公司 Method and device for positioning blood vessel stenosis part based on target detection network
CN113743339A (en) * 2021-09-09 2021-12-03 三峡大学 Indoor fall detection method and system based on scene recognition
CN113743339B (en) * 2021-09-09 2023-10-03 三峡大学 Indoor falling detection method and system based on scene recognition
CN113870196A (en) * 2021-09-10 2021-12-31 苏州浪潮智能科技有限公司 Image processing method, device, equipment and medium based on anchor point cutting graph
CN113792656B (en) * 2021-09-15 2023-07-18 山东大学 Behavior detection and alarm system using mobile communication equipment in personnel movement
CN113792656A (en) * 2021-09-15 2021-12-14 山东大学 Behavior detection and alarm system for using mobile communication equipment in personnel movement
CN113807449A (en) * 2021-09-23 2021-12-17 合肥工业大学 Sedimentary rock category identification method and device, electronic equipment and storage medium
CN114022705B (en) * 2021-10-29 2023-08-04 电子科技大学 Self-adaptive target detection method based on scene complexity pre-classification
CN114022705A (en) * 2021-10-29 2022-02-08 电子科技大学 Adaptive target detection method based on scene complexity pre-classification
CN114022554A (en) * 2021-11-03 2022-02-08 北华航天工业学院 Massage robot acupuncture point detection and positioning method based on YOLO
CN114022554B (en) * 2021-11-03 2023-02-03 北华航天工业学院 Massage robot acupoint detection and positioning method based on YOLO
CN114022446A (en) * 2021-11-04 2022-02-08 广东工业大学 Leather flaw detection method and system based on improved YOLOv3
CN114120358B (en) * 2021-11-11 2024-04-26 国网江苏省电力有限公司技能培训中心 Super-pixel-guided deep learning-based personnel head-mounted safety helmet recognition method
CN114120358A (en) * 2021-11-11 2022-03-01 国网江苏省电力有限公司技能培训中心 Super-pixel-guided deep learning-based identification method for head-worn safety helmet of person
CN114255389A (en) * 2021-11-15 2022-03-29 浙江时空道宇科技有限公司 Target object detection method, device, equipment and storage medium
CN113989939A (en) * 2021-11-16 2022-01-28 河北工业大学 Small-target pedestrian detection system based on improved YOLO algorithm
CN113989939B (en) * 2021-11-16 2024-05-14 河北工业大学 Small target pedestrian detection system based on improved YOLO algorithm
CN114387219A (en) * 2021-12-17 2022-04-22 依未科技(北京)有限公司 Method, device, medium and equipment for detecting arteriovenous cross compression characteristics of eyeground
CN114373075A (en) * 2021-12-31 2022-04-19 西安电子科技大学广州研究院 Target component detection data set construction method, detection method, device and equipment
US11756288B2 (en) * 2022-01-05 2023-09-12 Baidu Usa Llc Image processing method and apparatus, electronic device and storage medium
US20220130139A1 (en) * 2022-01-05 2022-04-28 Baidu Usa Llc Image processing method and apparatus, electronic device and storage medium
CN114359222A (en) * 2022-01-05 2022-04-15 多伦科技股份有限公司 Method for detecting arbitrary polygon target, electronic device and storage medium
CN114565848A (en) * 2022-02-25 2022-05-31 佛山读图科技有限公司 Liquid medicine level detection method and system in complex scene
CN114565848B (en) * 2022-02-25 2022-12-02 佛山读图科技有限公司 Liquid medicine level detection method and system in complex scene
CN114662594A (en) * 2022-03-25 2022-06-24 浙江省通信产业服务有限公司 Target feature recognition analysis system
CN114662594B (en) * 2022-03-25 2022-10-04 浙江省通信产业服务有限公司 Target feature recognition analysis system
CN114742204A (en) * 2022-04-08 2022-07-12 黑龙江惠达科技发展有限公司 Method and device for detecting straw coverage rate
CN114782778B (en) * 2022-04-25 2023-01-06 广东工业大学 Assembly state monitoring method and system based on machine vision technology
CN114782778A (en) * 2022-04-25 2022-07-22 广东工业大学 Assembly state monitoring method and system based on machine vision technology
CN114842315B (en) * 2022-05-07 2024-02-02 无锡雪浪数制科技有限公司 Looseness-prevention identification method and device for lightweight high-speed railway hub gasket
CN114842315A (en) * 2022-05-07 2022-08-02 无锡雪浪数制科技有限公司 Anti-loosening identification method and device for lightweight high-speed rail hub gasket
CN114881763B (en) * 2022-05-18 2023-05-26 中国工商银行股份有限公司 Post-loan supervision method, device, equipment and medium for aquaculture
CN114881763A (en) * 2022-05-18 2022-08-09 中国工商银行股份有限公司 Method, device, equipment and medium for post-loan supervision of aquaculture
CN115029209A (en) * 2022-06-17 2022-09-09 杭州天杭空气质量检测有限公司 Colony image acquisition processing device and processing method thereof
CN114972891A (en) * 2022-07-07 2022-08-30 智云数创(洛阳)数字科技有限公司 CAD component automatic identification method and BIM modeling method
CN114972891B (en) * 2022-07-07 2024-05-03 智云数创(洛阳)数字科技有限公司 Automatic identification method for CAD (computer aided design) component and BIM (building information modeling) method
CN115082661B (en) * 2022-07-11 2024-05-10 阿斯曼尔科技(上海)有限公司 Sensor assembly difficulty reducing method
CN115082661A (en) * 2022-07-11 2022-09-20 阿斯曼尔科技(上海)有限公司 Method for reducing assembly difficulty of sensor
CN115187982A (en) * 2022-07-12 2022-10-14 河北华清环境科技集团股份有限公司 Algae detection method and device and terminal equipment
CN115909358B (en) * 2022-07-27 2024-02-13 广州市玄武无线科技股份有限公司 Commodity specification identification method, commodity specification identification device, terminal equipment and computer storage medium
CN115909358A (en) * 2022-07-27 2023-04-04 广州市玄武无线科技股份有限公司 Commodity specification identification method and device, terminal equipment and computer storage medium
CN115346170B (en) * 2022-08-11 2023-05-30 北京市燃气集团有限责任公司 Intelligent monitoring method and device for gas facility area
CN115346170A (en) * 2022-08-11 2022-11-15 北京市燃气集团有限责任公司 Intelligent monitoring method and device for gas facility area
CN115346172A (en) * 2022-08-16 2022-11-15 哈尔滨市科佳通用机电股份有限公司 Method and system for detecting loss and breakage of hook lifting rod return spring
CN115346172B (en) * 2022-08-16 2023-04-21 哈尔滨市科佳通用机电股份有限公司 Method and system for detecting lost and broken hook lifting rod reset spring
CN115297263A (en) * 2022-08-24 2022-11-04 广州方图科技有限公司 Automatic photographing control method and system suitable for cube shooting and cube shooting
CN115297263B (en) * 2022-08-24 2023-04-07 广州方图科技有限公司 Automatic photographing control method and system suitable for cube shooting and cube shooting
CN115690565A (en) * 2022-09-28 2023-02-03 大连海洋大学 Target detection method for cultivated fugu rubripes by fusing knowledge and improving YOLOv5
CN115690565B (en) * 2022-09-28 2024-02-20 大连海洋大学 Method for detecting cultivated takifugu rubripes target by fusing knowledge and improving YOLOv5
CN115546566A (en) * 2022-11-24 2022-12-30 杭州心识宇宙科技有限公司 Intelligent body interaction method, device, equipment and storage medium based on article identification
CN116051985B (en) * 2022-12-20 2023-06-23 中国科学院空天信息创新研究院 Semi-supervised remote sensing target detection method based on multi-model mutual feedback learning
CN116051985A (en) * 2022-12-20 2023-05-02 中国科学院空天信息创新研究院 Semi-supervised remote sensing target detection method based on multi-model mutual feedback learning
CN115690570A (en) * 2023-01-05 2023-02-03 中国水产科学研究院黄海水产研究所 Fish shoal feeding intensity prediction method based on ST-GCN
CN115690570B (en) * 2023-01-05 2023-03-28 中国水产科学研究院黄海水产研究所 Fish shoal feeding intensity prediction method based on ST-GCN
CN116452858A (en) * 2023-03-24 2023-07-18 哈尔滨市科佳通用机电股份有限公司 Rail wagon connecting pull rod round pin breaking fault identification method and system
CN116452858B (en) * 2023-03-24 2023-12-15 哈尔滨市科佳通用机电股份有限公司 Rail wagon connecting pull rod round pin breaking fault identification method and system
CN116403163B (en) * 2023-04-20 2023-10-27 慧铁科技有限公司 Method and device for identifying opening and closing states of handles of cut-off plug doors
CN116403163A (en) * 2023-04-20 2023-07-07 慧铁科技有限公司 Method and device for identifying opening and closing states of handles of cut-off plug doors
CN116681687A (en) * 2023-06-20 2023-09-01 广东电网有限责任公司广州供电局 Wire detection method and device based on computer vision and computer equipment
CN116758547B (en) * 2023-06-27 2024-03-12 北京中超伟业信息安全技术股份有限公司 Paper medium carbonization method, system and storage medium
CN116758547A (en) * 2023-06-27 2023-09-15 北京中超伟业信息安全技术股份有限公司 Paper medium carbonization method, system and storage medium
CN116916166A (en) * 2023-09-12 2023-10-20 湖南湘银河传感科技有限公司 Telemetry terminal based on AI image analysis
CN116916166B (en) * 2023-09-12 2023-11-17 湖南湘银河传感科技有限公司 Telemetry terminal based on AI image analysis
CN116935232A (en) * 2023-09-15 2023-10-24 青岛国测海遥信息技术有限公司 Remote sensing image processing method and device for offshore wind power equipment, equipment and medium
CN117671597A (en) * 2023-12-25 2024-03-08 北京大学长沙计算与数字经济研究院 Method for constructing mouse detection model and mouse detection method and device
CN117523318B (en) * 2023-12-26 2024-04-16 宁波微科光电股份有限公司 Anti-light interference subway shielding door foreign matter detection method, device and medium
CN117523318A (en) * 2023-12-26 2024-02-06 宁波微科光电股份有限公司 Anti-light interference subway shielding door foreign matter detection method, device and medium
CN117893895A (en) * 2024-03-15 2024-04-16 山东省海洋资源与环境研究院(山东省海洋环境监测中心、山东省水产品质量检验中心) Method, system, equipment and storage medium for identifying portunus trituberculatus

Also Published As

Publication number Publication date
CN109977943B (en) 2024-05-07
CN109977943A (en) 2019-07-05

Similar Documents

Publication Publication Date Title
WO2020164282A1 (en) Yolo-based image target recognition method and apparatus, electronic device, and storage medium
CN110533084B (en) Multi-scale target detection method based on self-attention mechanism
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
CN109978893B (en) Training method, device, equipment and storage medium of image semantic segmentation network
CN110941594B (en) Splitting method and device of video file, electronic equipment and storage medium
US8401292B2 (en) Identifying high saliency regions in digital images
CN110378297B (en) Remote sensing image target detection method and device based on deep learning and storage medium
US20170032247A1 (en) Media classification
CN111079674B (en) Target detection method based on global and local information fusion
EP3493101A1 (en) Image recognition method, terminal, and nonvolatile storage medium
KR20210110823A (en) Image recognition method, training method of recognition model, and related apparatus and devices
WO2017105655A1 (en) Methods for object localization and image classification
CN110991311A (en) Target detection method based on dense connection deep network
CN109871821A (en) Adaptive-network-based pedestrian re-identification method, device, equipment and storage medium
WO2021027157A1 (en) Vehicle insurance claim settlement identification method and apparatus based on picture identification, and computer device and storage medium
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN110069959A (en) Face detection method, device and user equipment
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN111724342A (en) Method for detecting thyroid nodule in ultrasonic image
CN112949578B (en) Vehicle lamp state identification method, device, equipment and storage medium
CN111274972A (en) Dish identification method and device based on metric learning
CN114139564B (en) Two-dimensional code detection method and device, terminal equipment and training method of detection network
CN111414930B (en) Deep learning model training method and device, electronic equipment and storage medium
CN112784494A (en) Training method of false positive recognition model, target recognition method and device
CN115512207A (en) Single-stage target detection method based on multipath feature fusion and high-order loss sensing sampling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19915076

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19915076

Country of ref document: EP

Kind code of ref document: A1