CN107194318B - Target detection assisted scene identification method - Google Patents


Info

Publication number
CN107194318B
CN107194318B (granted publication of application CN201710270013.4A)
Authority
CN
China
Prior art keywords
picture
network
training
recognized
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710270013.4A
Other languages
Chinese (zh)
Other versions
CN107194318A (en)
Inventor
王蕴红
孙宇航
赵文婷
陈训逊
刘庆杰
王博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201710270013.4A
Publication of CN107194318A
Application granted
Publication of CN107194318B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention provides a target detection-assisted scene recognition method, which comprises the following steps: acquiring a picture to be recognized, sampling the picture to obtain a preset number of samples of a preset size, and performing scene recognition on each sample with a convolutional neural network model to obtain at least two scenes corresponding to the picture to be recognized; acquiring region proposals for the picture to be recognized and a first feature map corresponding to the picture, and acquiring a classification score for each target in the picture according to the region proposals and the picture; and obtaining the scene corresponding to the picture to be recognized according to the at least two scenes and the classification score of each target. By assisting scene recognition with a target detection method that combines a Fast R-CNN network and a region proposal network, the invention improves the accuracy of scene recognition.

Description

Target detection assisted scene identification method
Technical Field
The invention relates to the technical field of computer vision, in particular to a scene recognition method assisted by target detection.
Background
Scene recognition is an important problem in the field of computer vision: a computer automatically judges which specific scene an image or photograph belongs to. It plays an important role in video surveillance, social-network user behavior mining, and similar applications.
However, owing to the complexity of scenes themselves and to factors such as illumination, occlusion, and scale change, existing scene recognition methods still cannot distinguish scene types well. For example, if a picture contains a crowd of people, it is difficult for a computer to decide whether the picture shows a shopping mall, a station, or a party, which reduces the accuracy of scene recognition.
Disclosure of Invention
The invention provides a target detection-assisted scene recognition method, which reduces the scene-discrimination errors of existing scene recognition methods.
The target detection-assisted scene recognition method provided by the invention comprises the following steps:
acquiring a picture to be recognized, sampling the picture to obtain a preset number of samples of a preset size, and performing scene recognition on each sample according to a convolutional neural network model to obtain at least two scenes corresponding to the picture to be recognized;
acquiring region proposals for the picture to be recognized and a first feature map corresponding to the picture, and acquiring a classification score for each target in the picture according to the region proposals and the picture; the region proposals are obtained by processing a second feature map with a region proposal network, and the first and second feature maps are obtained by convolution processing of the picture to be recognized in a Fast R-CNN network;
and obtaining the scene corresponding to the picture to be recognized according to the at least two scenes and the classification score of each target.
Optionally, before the scene recognition is performed on each sample according to the convolutional neural network model to obtain at least two scenes corresponding to the picture to be recognized, the method further includes:
acquiring a training picture and a label corresponding to the training picture, wherein the label indicates the scene corresponding to the training picture;
acquiring the network parameters of the convolutional neural network model corresponding to each scene according to the convolutional neural network model and the labels of the training pictures;
and the scene recognition on each sample according to the convolutional neural network model then includes:
performing scene recognition on each sample according to the network parameters of the convolutional neural network model corresponding to each scene to obtain at least two scenes corresponding to the picture to be recognized.
Optionally, the acquiring of the network parameters of the convolutional neural network model corresponding to each scene according to the convolutional neural network model and the labels of the training pictures includes:
performing segmentation sampling on the training pictures to obtain augmented training pictures;
performing preset processing on the augmented training pictures according to a first preset training parameter to obtain a preset number of third feature maps, the preset processing comprising convolution, pooling and normalization;
performing full-connection processing on the preset number of third feature maps several times to obtain the scene probabilities corresponding to the training pictures;
and adjusting the first preset training parameter according to the scene probabilities and the labels of the training pictures to obtain the network parameters of the convolutional neural network model corresponding to each scene.
Optionally, the acquiring of the region proposals for the picture to be recognized and the first feature map corresponding to the picture includes:
performing convolution processing on the picture to be recognized through the Fast R-CNN network to obtain shared convolutional layers;
extracting the second feature map from the shared convolutional layers, performing region-proposal processing on the second feature map through the region proposal network according to the network parameters of the region proposal network to obtain a region score for each candidate target region, and obtaining the region proposals according to the region scores;
and taking the shared convolutional layers with a preset number of additional convolutional layers superposed on them as the specific convolutional layers to obtain the first feature map, wherein the specific convolutional layers contain more convolutional layers than the shared convolutional layers.
Optionally, the obtaining of a classification score for each target in the picture to be recognized according to the region proposals and the picture includes:
performing region-marking processing on the first feature map according to the region proposals to obtain the region-marked first feature map;
pooling the region-marked first feature map through the Fast R-CNN network to obtain the pooled first feature map;
performing full-connection processing on the pooled first feature map;
and obtaining the classification score of each target in the picture to be recognized according to the network parameters of the Fast R-CNN network.
Optionally, before the acquiring of the region proposals for the picture to be recognized and the first feature map corresponding to the picture, the method further includes:
acquiring a training picture and the target region corresponding to the training picture, wherein the target region indicates the position of a complete target in the training picture;
and acquiring the network parameters of the Fast R-CNN network for each target and the network parameters of the region proposal network according to the Fast R-CNN network, the region proposal network and the target regions of the training pictures.
Optionally, the acquiring of the network parameters of the Fast R-CNN network for each target and the network parameters of the region proposal network according to the Fast R-CNN network, the region proposal network and the target regions of the training pictures includes:
performing convolution processing on the training picture through the Fast R-CNN network to obtain shared convolutional layers;
extracting the second feature map from the shared convolutional layers, performing region-proposal processing on the second feature map through the region proposal network according to a second preset training parameter to obtain the region scores, and obtaining the region proposals according to the region scores;
taking the shared convolutional layers with a preset number of additional convolutional layers superposed on them as the specific convolutional layers to obtain the first feature map, wherein the specific convolutional layers contain more convolutional layers than the shared convolutional layers;
performing region-marking processing on the first feature map according to the region proposals to obtain the region-marked first feature map;
pooling the region-marked first feature map through the Fast R-CNN network to obtain the pooled first feature map;
performing full-connection processing on the pooled first feature map, and obtaining the classification score of each target in the training picture according to a third preset training parameter;
and adjusting the second and third preset training parameters according to the region proposals, the classification score of each target in the training picture and the target regions, to obtain the network parameters of the region proposal network and the network parameters of the Fast R-CNN network for each target.
According to the target detection-assisted scene recognition method provided by the invention, scene recognition is performed on samples obtained by sampling each picture to be recognized, according to the convolutional neural network model, to obtain at least two scenes corresponding to the picture. Then the classification score of each target in the picture is acquired through the region proposals for the picture and the first feature map corresponding to it, completing the target detection process. Finally, the scene corresponding to the picture is obtained from the at least two candidate scenes and the classification score of each target. The invention assists scene recognition with a target detection method that combines the Fast R-CNN network and the region proposal network, which improves the accuracy of scene recognition. The region proposal network supplies the Fast R-CNN network with region proposals containing candidate target regions, which greatly shortens the time the Fast R-CNN network needs for target detection and so increases its detection rate.
Drawings
FIG. 1 is a first flowchart of the target detection-assisted scene recognition method provided by the present invention;
FIG. 2 is a second flowchart of the target detection-assisted scene recognition method provided by the present invention;
FIG. 3 is a third flowchart of the target detection-assisted scene recognition method provided by the present invention;
FIG. 4 is a fourth flowchart of the target detection-assisted scene recognition method provided by the present invention;
FIG. 5 is a schematic diagram of the training process of the AlexNet model in the target detection-assisted scene recognition method provided by the present invention;
FIG. 6 is a fifth flowchart of the target detection-assisted scene recognition method provided by the present invention;
FIG. 7 is a schematic diagram of the training process of the Fast R-CNN network and the region proposal network in the target detection-assisted scene recognition method provided by the present invention.
Detailed Description
FIG. 1 is a first flowchart, FIG. 2 a second flowchart, and FIG. 3 a third flowchart of the target detection-assisted scene recognition method provided by the present invention. As shown in FIG. 1, the method of this embodiment includes:
101, acquiring a picture to be recognized, sampling the picture to be recognized to obtain samples with preset quantity and preset size, and performing scene recognition on each sample according to a convolutional neural network model to obtain at least two scenes corresponding to the picture to be recognized.
Specifically, the size of the samples in this embodiment may be chosen according to the specific convolutional neural network selected, and their number may be adjusted according to the data conditions during training and learning of the model; neither is specifically limited in this embodiment. The convolutional neural network model may be AlexNet, VGG16, VGG19, ResNet, or the like, which this embodiment does not limit.
Furthermore, in this embodiment each picture to be recognized may be augmented several-fold by cropping and sampling to obtain the preset number of samples of the preset size, which prevents overfitting and thereby improves the accuracy of scene recognition for each picture. The preset number and preset size are not limited in this embodiment; the number of augmented samples only needs to satisfy the requirements for training the convolutional neural network model.
For example, for convenience of description, the AlexNet model is used in this embodiment. The picture to be recognized and its horizontally flipped copy may each be crop-sampled at the upper-left, upper-right, lower-left, lower-right and center positions, yielding 10 samples. All 10 samples are input into the AlexNet model for scene recognition, producing 10 scene recognition results. Each result can be represented by a matrix with one row and as many columns as there are scene categories; each entry is a decimal between 0 and 1, and the entries sum to 1. The matrices of the 10 samples are added and averaged to give the scene probability matrix of the picture to be recognized, and the scene with the highest probability is taken as the final recognized scene of the picture. The other pictures to be recognized undergo scene recognition in the same way to obtain their scene recognition results.
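For illustration only, the 10-crop averaging just described can be sketched in a few lines of Python with PyTorch and torchvision (neither is prescribed by the patent; the 224-pixel crop size, the 10 scene categories and the untrained stand-in AlexNet classifier below are assumptions):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

NUM_SCENES = 10  # assumed number of scene categories

# Stand-in for the trained scene-recognition CNN of step 101.
model = models.alexnet(num_classes=NUM_SCENES).eval()

# TenCrop yields the four corner crops and the centre crop of the image
# and of its horizontal flip: 10 samples in total, as described above.
preprocess = T.Compose([
    T.Resize(256),
    T.TenCrop(224),
    T.Lambda(lambda crops: torch.stack([T.ToTensor()(c) for c in crops])),
])

def recognize_scene(path: str) -> int:
    crops = preprocess(Image.open(path).convert("RGB"))  # (10, 3, 224, 224)
    with torch.no_grad():
        probs = torch.softmax(model(crops), dim=1)       # one row per sample
    return probs.mean(dim=0).argmax().item()             # average, then pick
```

Each row of `probs` corresponds to the one-row probability matrix described above; averaging the ten rows gives the scene probability matrix of the picture.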
Further, for scenes that are hard to tell apart, such as stations and rallies or parades, scene recognition by the convolutional neural network alone may fail: a station picture may be recognized as a rally or parade, and a rally-or-parade picture as a station. The picture to be recognized may therefore correspond to at least two candidate scenes.
102, obtaining region proposals for the picture to be recognized and a first feature map corresponding to the picture, and obtaining a classification score for each target in the picture according to the region proposals and the picture.
The region proposals are obtained by processing a second feature map with a Region Proposal Network, and the first and second feature maps are obtained by convolution processing of the picture to be recognized in a Fast Region-based Convolutional Network (Fast R-CNN).
Specifically, in this embodiment, the target detection is performed on the picture to be recognized, so that the target in the picture to be recognized can be recognized.
On one hand, as shown in FIG. 2, acquiring the region proposals for the picture to be recognized and the first feature map corresponding to the picture in this embodiment includes:
Step 201, performing convolution processing on the picture to be recognized through the Fast R-CNN network to obtain shared convolutional layers.
Step 202, extracting the second feature map from the shared convolutional layers, performing region-proposal processing on the second feature map through the region proposal network according to the network parameters of the region proposal network to obtain a region score for each candidate target region, and obtaining the region proposals according to the region scores.
Specifically, in this embodiment the picture to be recognized is input into the Fast R-CNN network for convolution processing to obtain the shared convolutional layers; one corresponding second feature map is extracted from them and input into the region proposal network for region-proposal processing, which yields the region scores. Region-proposal processing performs sliding-window processing on the feature map of the picture to be recognized followed by several convolution or full-connection operations, and a region score expresses whether a region contains a target and with what probability.
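As a sketch of such a region-scoring network, the module below follows the standard RPN design of a 3 × 3 sliding convolution with two sibling 1 × 1 convolutions; the channel widths and the count of 9 anchors per position are conventional assumptions rather than values fixed by this embodiment:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal region-proposal head: for each of num_anchors reference
    boxes at every sliding-window position, emit an objectness score
    (the "region score") and four box-regression offsets."""

    def __init__(self, in_channels: int = 512, num_anchors: int = 9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(512, num_anchors, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(512, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map: torch.Tensor):
        h = torch.relu(self.conv(feature_map))  # sliding-window processing
        return self.objectness(h), self.bbox_deltas(h)
```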
Step 203, taking the shared convolutional layers with a preset number of additional convolutional layers superposed on them as the specific convolutional layers to obtain the first feature map, wherein the specific convolutional layers contain more convolutional layers than the shared convolutional layers.
Specifically, to avoid losing other information, the Fast R-CNN network continues the convolution processing on top of the shared convolutional layers so that the information is more complete; that is, a preset number of convolutional layers are superposed on the shared convolutional layers to form the specific convolutional layers, from which the first feature map is obtained. The preset number of layers is not specifically limited in this embodiment. For example, the second feature map may be taken from the 5th layer of the shared convolutional layers and passed to the region proposal network, while the first feature map is taken from the 9th layer, which serves as the specific convolutional layer.
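A minimal sketch of this shared/specific split, with an illustrative toy backbone (the 9 uniform convolutional blocks and the 5th/9th layer split below are assumptions taken from the example, not a prescribed architecture):

```python
import torch
import torch.nn as nn

# Toy backbone: blocks 1-5 play the role of the shared convolutional
# layers, blocks 6-9 the additional "specific" layers.
blocks = [nn.Sequential(nn.Conv2d(3 if i == 0 else 64, 64, 3, padding=1),
                        nn.ReLU()) for i in range(9)]
backbone = nn.Sequential(*blocks)

x = torch.randn(1, 3, 224, 224)
second_feature_map = backbone[:5](x)                   # layer 5, to the RPN
first_feature_map = backbone[5:](second_feature_map)   # layer 9 output
```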
On the other hand, as shown in FIG. 3, obtaining the classification score of each target in the picture to be recognized according to the region proposals and the picture in this embodiment includes:
Step 301, performing region-marking processing on the first feature map according to the region proposals to obtain the region-marked first feature map.
Specifically, in this embodiment several regions with the highest region scores may be selected, and the corresponding positions in the picture to be recognized are passed to the Fast R-CNN network as region proposals. The Fast R-CNN network then performs region-marking processing on the first feature map according to the region proposals, so that it only needs to classify and score targets within the marked regions and can skip regions without targets, which saves target detection time for the picture to be recognized.
Step 302, pooling the region-marked first feature map through the Fast R-CNN network to obtain the pooled first feature map.
Further, since the marked regions of the first feature map differ in size, pooling is needed to standardize their sizes and obtain the pooled first feature map.
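For illustration, this pooling step can be reproduced with torchvision's `roi_pool`; the feature-map shape, the stride of 16 and the 7 × 7 output grid below are assumptions:

```python
import torch
from torchvision.ops import roi_pool

# One feature map and two marked regions of different sizes, given as
# (batch_index, x1, y1, x2, y2) in image coordinates; both regions are
# pooled to the same fixed 7x7 grid.
feature_map = torch.randn(1, 256, 38, 50)
regions = torch.tensor([[0, 10.0, 20.0, 200.0, 150.0],
                        [0, 50.0, 40.0, 120.0, 300.0]])
pooled = roi_pool(feature_map, regions, output_size=(7, 7),
                  spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```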
Step 303, performing full-connection processing on the pooled first feature map.
Specifically, the pooled first feature map still contains many feature channels; the full-connection processing reduces its dimensionality, which facilitates the classification and scoring operations.
Step 304, obtaining the classification score of each target in the picture to be recognized according to the network parameters of the Fast R-CNN network.
Specifically, the classification scores in this embodiment indicate not only whether there is a target in the picture to be recognized but also the probability of each kind of target. The network parameters of the Fast R-CNN network are the optimized parameters obtained by training the network, so the targets in the picture to be recognized can be detected accurately; whether a certain target appears in the picture can therefore be judged from its classification score.
It should be noted that step 101 may be performed before step 102, after step 102, or simultaneously with it; this embodiment does not limit the order of steps 101 and 102.
103, obtaining a scene corresponding to the picture to be recognized according to the at least two scenes corresponding to the picture to be recognized and the classification score of each target.
Specifically, step 101 narrows the range of candidate scenes by recognizing at least two scenes for the picture to be recognized, for example scenes with large crowds such as a station and a rally or parade. Step 102 provides the classification score of each target in the picture; banners, for instance, appear mostly in rally or parade scenes, so a banner can serve as a target. Combining the at least two candidate scenes with the classification scores of the targets thus identifies the scene of the picture: if the picture to be recognized contains a banner, its scene is a rally or parade; if it contains no banner, its scene is the station.
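The station-versus-parade decision above amounts to a simple fusion rule; the sketch below illustrates it in Python (the object-to-scene table and the 0.5 score threshold are illustrative assumptions, not values given by the embodiment):

```python
# Illustrative mapping from tell-tale objects to the scenes they imply.
DISCRIMINATIVE_OBJECTS = {"banner": "parade"}

def resolve_scene(candidate_scenes, object_scores, threshold=0.5):
    """candidate_scenes: scenes from step 101, most likely first.
    object_scores: {object name: classification score} from step 102."""
    for obj, scene in DISCRIMINATIVE_OBJECTS.items():
        if scene in candidate_scenes and object_scores.get(obj, 0.0) > threshold:
            return scene          # a detected banner resolves the ambiguity
    return candidate_scenes[0]    # otherwise keep the CNN's top choice

print(resolve_scene(["station", "parade"], {"banner": 0.92}))  # parade
print(resolve_scene(["station", "parade"], {"banner": 0.10}))  # station
```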
In the target detection-assisted scene recognition method provided by this embodiment, scene recognition is performed on samples obtained by sampling each picture to be recognized, according to the convolutional neural network model, to obtain at least two scenes corresponding to the picture. Then the classification score of each target in the picture is acquired through the region proposals for the picture and the first feature map corresponding to it, completing the target detection process. Finally, the scene of the picture is obtained from the at least two candidate scenes and the classification scores of the targets. In this embodiment, scene recognition is assisted by a target detection method that combines the Fast R-CNN network and the region proposal network, which improves its accuracy. The region proposal network supplies the Fast R-CNN network with region proposals containing candidate target regions, which greatly shortens the time the Fast R-CNN network needs for target detection and increases its detection rate.
FIG. 4 is a fourth flowchart of the target detection-assisted scene recognition method provided by the present invention. As shown in FIG. 4, before scene recognition is performed on each sample according to the convolutional neural network model to obtain at least two scenes corresponding to the picture to be recognized, the method of this embodiment further includes:
step 401, obtaining a training picture and a label corresponding to the training picture, where the label is used to indicate a scene corresponding to the training picture.
Specifically, the training pictures in this embodiment may be obtained from a database or collected manually, which this embodiment does not limit; it is only necessary that each scene has on the order of thousands of training pictures. Manually collected training pictures can be expanded by horizontal flipping and crop sampling. If the convolutional neural network model requires a fixed input size, the training pictures can be brought to a uniform size by crop sampling. Meanwhile, a label must be obtained for each training picture, the label being the scene category of that picture.
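One possible flip-and-crop expansion pipeline for this step, sketched with torchvision transforms (the 256-pixel resize and 227-pixel crop are assumptions chosen to match the AlexNet example later in this description):

```python
import torchvision.transforms as T

# Each pass over a training picture yields a randomly flipped and
# randomly cropped 227x227 variant, expanding the sample set.
augment = T.Compose([
    T.Resize(256),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomCrop(227),
    T.ToTensor(),
])
```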
Step 402, obtaining network parameters corresponding to the convolutional neural network model corresponding to each scene according to the convolutional neural network model and the label corresponding to the training picture.
Specifically, the training pictures and their corresponding labels are input into the convolutional neural network model, which then carries out the training and learning process. Because the training pictures and their labels are known, the convolutional neural network model can learn the various scenes by dynamically adjusting the first preset training parameter, and thereby learn to distinguish different scenes. The specific method is as follows:
Step 4021, performing segmentation sampling on the training picture to obtain augmented training pictures.
Specifically, the segmentation sampling process turns the training pictures of one scene category into several times as many uniformly sized training pictures of the same scene. This increases the number of training pictures with a simple operation and provides more references for the training and learning of the convolutional neural network model.
Step 4022, performing preset processing on the augmented training pictures according to a first preset training parameter to obtain a preset number of third feature maps, the preset processing comprising convolution, pooling and normalization.
Specifically, the parameters of the convolutional neural network model are randomly initialized as the first preset training parameter; after convolution, pooling and normalization of the augmented training pictures, a preset number of third feature maps that clearly express the scene content of the training pictures are obtained.
Step 4023, performing full-connection processing on the preset number of third feature maps several times to obtain the scene probabilities corresponding to the training pictures.
Specifically, the third feature maps after the preset processing contain many feature channels, and several rounds of full-connection processing yield the scene probabilities corresponding to the training pictures. Repeated full connection reduces the dimensionality of the third feature maps without losing too much information, which effectively preserves the classification performance.
Step 4024, adjusting the first preset training parameter according to the scene probabilities and the labels of the training pictures to obtain the network parameters of the convolutional neural network model corresponding to each scene.
Specifically, the first preset training parameter is adjusted dynamically by checking whether the scene probability is consistent with the label of the training picture; once they are consistent, the first preset training parameter at that moment can be taken as the network parameters of the convolutional neural network model corresponding to the scene.
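The adjust-until-consistent procedure of steps 4022 to 4024 corresponds to ordinary supervised training; a minimal PyTorch sketch follows (the loss function, learning rate, momentum and epoch count are illustrative assumptions, and `model` and `loader` stand for any scene classifier and any iterable of picture/label batches):

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3):
    # Cross-entropy measures the gap between the predicted scene
    # probabilities and the labels of the training pictures.
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for pictures, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(pictures), labels)
            loss.backward()    # how each parameter should be adjusted
            optimizer.step()   # the dynamic adjustment itself
    return model               # final weights = the network parameters
```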
Step 403, performing scene recognition on each sample according to the network parameters of the convolutional neural network model corresponding to each scene to obtain at least two scenes corresponding to the picture to be recognized.
Specifically, a convolutional neural network model built with the network parameters corresponding to each scene recognizes scenes better, and performing scene recognition on the samples of the picture to be recognized yields the at least two scenes corresponding to it.
FIG. 5 is a schematic diagram of the training process of the AlexNet model in the target detection-assisted scene recognition method provided by the present invention. In a specific embodiment, taking the AlexNet model as an example, the model must first be trained and learned, after which scene recognition is performed on the picture to be recognized, as shown in FIG. 5.
Firstly, 256 × 256 three-channel training pictures are input into the AlexNet model, which can read the labels of the training pictures; 227 × 227 three-channel crops are obtained after segmentation sampling and are subjected to convolution calculation and activation processing, where the activation applies an activation function such as the Rectified Linear Unit (ReLU). The mathematical expression of ReLU is f(x) = max(0, x), where x represents the input signal and f(x) the output signal: when the input signal is less than 0 the output is 0, and when the input signal is greater than or equal to 0 the output equals the input. Stochastic gradient descent (SGD) converges much faster with ReLU than with other functions (such as the sigmoid/tanh activations of traditional methods), and since ReLU is piecewise linear, an activation value requires only a threshold comparison instead of a large amount of complex computation, so the convolution process is optimized. In this way 96 feature maps of size 55 × 55 are obtained; pooling and normalization give 96 feature maps of size 27 × 27; convolution calculation and activation again give 256 feature maps of size 27 × 27; pooling and normalization give 256 feature maps of size 13 × 13; convolution and activation give 384 feature maps of size 13 × 13, and a further convolution and activation again gives 384 feature maps of size 13 × 13; a final convolution and activation gives 256 feature maps of size 13 × 13, and pooling gives 256 feature maps of size 6 × 6. Full-connection processing performed three times with activation then yields the scene result for the training picture, i.e. the scene probability corresponding to the training picture. The whole process is then adjusted according to the label of the training picture, that is, the first preset training parameter of the AlexNet model is changed, until the scene probability corresponding to the training picture is consistent with the label, the AlexNet model training converges to a certain interval, and the loss value is controlled within an acceptable range, at which point training stops. The first preset training parameter at that moment is taken as the network parameters of the AlexNet model corresponding to the scene. This completes the whole process of AlexNet model training and learning.
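The feature-map sizes listed above can be checked against the standard AlexNet layer hyper-parameters; the sketch below traces the shapes (normalization layers are omitted because they do not change shapes, and the exact kernel sizes are the published AlexNet values, which the patent does not restate):

```python
import torch
import torch.nn as nn

relu, pool = nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2)
conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4)
conv2 = nn.Conv2d(96, 256, kernel_size=5, padding=2)
conv3 = nn.Conv2d(256, 384, kernel_size=3, padding=1)
conv4 = nn.Conv2d(384, 384, kernel_size=3, padding=1)
conv5 = nn.Conv2d(384, 256, kernel_size=3, padding=1)

x = torch.randn(1, 3, 227, 227)   # one three-channel crop
x = pool(relu(conv1(x)))          # 96 @ 55x55  -> 96 @ 27x27
x = pool(relu(conv2(x)))          # 256 @ 27x27 -> 256 @ 13x13
x = relu(conv4(relu(conv3(x))))   # 384 @ 13x13, twice
x = pool(relu(conv5(x)))          # 256 @ 13x13 -> 256 @ 6x6
print(x.flatten(1).shape)         # (1, 9216), into three FC layers
```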
Then, when a picture to be recognized is input into the AlexNet model for scene recognition, the network parameters of the AlexNet model corresponding to each scene have already been determined by the training and learning process, so the scene corresponding to the picture to be recognized can be recognized.
FIG. 6 is a fifth flowchart of the target detection-assisted scene recognition method provided by the present invention. As shown in FIG. 6, before the acquiring of the region proposals for the picture to be recognized and the first feature map corresponding to the picture, the method of this embodiment further includes:
step 601, obtaining a training picture and a target region corresponding to the training picture, where the target region is used to indicate a position of a complete target in the training picture.
Specifically, the training pictures in this embodiment may be obtained from a database or collected manually, which this embodiment does not limit. Since the Fast R-CNN network does not restrict the size of its input pictures, the training pictures need not be segmentation-sampled. Meanwhile, a target region corresponding to each training picture must be obtained, the target region indicating the actual position of the complete target in the training picture.
Step 602, obtaining the network parameters of the Fast R-CNN network for each target and the network parameters of the region proposal network according to the Fast R-CNN network, the region proposal network and the target regions of the training pictures.
Specifically, the training pictures and their corresponding target regions are input into the Fast R-CNN network, and the Fast R-CNN network and the region proposal network then carry out the training and learning process. Because the training pictures and their target regions are known, the region proposal network can learn, by dynamically adjusting the second preset training parameter, whether the various targets appear in the training pictures, produce region proposals for the targets, and pass the proposals to the Fast R-CNN network, which reduces the time the Fast R-CNN network spends on target detection and improves efficiency. Meanwhile, the Fast R-CNN network can learn the various targets in the training pictures by dynamically adjusting the third preset training parameter, and thereby learn to distinguish the targets. The specific method is as follows:
Step 6021, performing convolution processing on the training picture through the Fast R-CNN network to obtain shared convolutional layers.
Specifically, in this embodiment the training picture is input into the Fast R-CNN network for convolution processing, which yields the shared convolutional layers.
Step 6022, extracting the second feature map from the shared convolutional layers, performing region-proposal processing on the second feature map through the region proposal network according to a second preset training parameter to obtain the region scores, and obtaining the region proposals according to the region scores.
Specifically, one corresponding second feature map is extracted from the shared convolutional layers and input into the region proposal network for region-proposal processing, which yields the region scores. Region-proposal processing performs sliding-window processing on the training picture's feature map followed by several convolution or full-connection operations; a region score is a value expressing whether a region contains a target and with what probability.
Step 6023, taking the shared convolutional layers with a preset number of additional convolutional layers superposed on them as the specific convolutional layers to obtain the first feature map, wherein the specific convolutional layers contain more convolutional layers than the shared convolutional layers.
Specifically, to avoid losing other information, the Fast R-CNN network continues the convolution processing on top of the shared convolutional layers so that the information is more complete; that is, a preset number of convolutional layers are superposed on the shared convolutional layers to form the specific convolutional layers, from which the first feature map is obtained. The preset number of layers is not specifically limited in this embodiment. For example, the second feature map may be taken from the 5th layer of the shared convolutional layers and passed to the region proposal network, while the first feature map is taken from the 9th layer, which serves as the specific convolutional layer.
Step 6024, performing region-marking processing on the first feature map according to the region proposals to obtain the region-marked first feature map.
Specifically, in this embodiment several regions with the highest region scores may be selected, and the corresponding positions in the training picture are passed to the Fast R-CNN network as region proposals. The Fast R-CNN network then performs region-marking processing on the first feature map according to the region proposals, so that it only needs to classify and score targets within the marked regions and can skip regions without targets, which saves target detection time for the training picture.
Step 6025, pooling the region-marked first feature map through the Fast R-CNN network to obtain the pooled first feature map.
Specifically, since the marked regions of the first feature map differ in size, pooling is needed to standardize their sizes and obtain the pooled first feature map.
Step 6026, performing full-connection processing on the pooled first feature map, and obtaining the classification score of each target in the training picture according to a third preset training parameter.
Specifically, the pooled first feature map still contains many feature channels; the full-connection processing reduces its dimensionality, which facilitates the classification and scoring operations.
Step 6027, adjusting the second and third preset training parameters according to the region proposals, the classification score of each target in the training picture and the target regions, to obtain the network parameters of the region proposal network and the network parameters of the Fast R-CNN network for each target.
Specifically, since the target regions are known, the third preset training parameter can be adjusted dynamically against them until target detection is accurate, and it is then taken as the network parameters of the Fast R-CNN network for each target. Meanwhile, the second preset training parameter can be adjusted dynamically according to the region proposals and the target regions so that the region proposals corresponding to the region scores become more accurate, which reduces the time the Fast R-CNN network spends on target detection in the training pictures; it is then taken as the network parameters of the region proposal network.
FIG. 7 is a schematic diagram of the training process of the Fast R-CNN network and the RPN in the target detection-assisted scene recognition method provided by the present invention. In a specific embodiment, the Fast R-CNN network and the Region Proposal Network (RPN) must first be trained and learned jointly on training pictures, after which the scene recognition process is applied to pictures to be recognized, as shown in FIG. 7.
Firstly, training pictures of arbitrary size are input into the Fast R-CNN network, together with the target regions corresponding to the training pictures. Convolution processing of the training picture through the Fast R-CNN network yields the shared convolutional layers. The RPN extracts a second feature map from the shared convolutional layers, performs sliding-window processing on it, and applies two convolution or full-connection operations: the pixels in each sliding window give rise to 9 regions of different sizes, which are compared with the target regions and given region scores according to the second preset training parameter; the 300 highest-scoring regions are selected as region proposals and passed to the Fast R-CNN network. Then several convolutional layers are superposed on the shared convolutional layers to form the specific convolutional layers, from which the first feature map is obtained. In this embodiment, region-marking processing may be applied to the first feature map according to the region proposals to obtain the region-marked first feature map, which the Fast R-CNN network then pools to obtain the pooled first feature map. Full-connection processing is performed on the pooled first feature map, and the classification score of each target in the training picture is obtained according to the third preset training parameter. At this point it is judged whether the detection results are consistent with the target regions. If they are inconsistent, the third preset training parameter is adjusted dynamically, and the second preset training parameter is adjusted at the same time according to the region proposals and the target regions, until the detections are consistent with the target regions, the Fast R-CNN network and the RPN converge to a certain interval, and the loss value is controlled within an acceptable range, at which point training stops. The second preset training parameter is taken as the network parameters of the RPN, and the third preset training parameter as the network parameters of the Fast R-CNN network. This completes the whole process of training and learning of the Fast R-CNN network and the RPN.
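The selection of the 300 highest-scoring regions can be sketched as follows (the 38 × 50 feature-map grid and the reading of the 9 regions as 3 scales × 3 aspect ratios are assumptions; the description above fixes only the counts 9 and 300):

```python
import torch

def top_proposals(boxes: torch.Tensor, scores: torch.Tensor, k: int = 300):
    """boxes: (N, 4) candidate regions; scores: (N,) region scores.
    Keeps the k highest-scoring regions as the region proposals."""
    keep = scores.topk(min(k, scores.numel())).indices
    return boxes[keep], scores[keep]

# e.g. a 38x50 grid of sliding-window positions with 9 regions each
anchors = torch.rand(38 * 50 * 9, 4)
scores = torch.rand(38 * 50 * 9)
proposals, _ = top_proposals(anchors, scores)
print(proposals.shape)  # torch.Size([300, 4])
```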
Then, when a picture to be recognized is input into the Fast R-CNN network for target detection, the network parameters of the Fast R-CNN network and of the RPN have already been determined by the training and learning process, so the targets in the picture can be identified, and the RPN's region proposals greatly reduce the detection time of the Fast R-CNN network.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A target detection-assisted scene recognition method, characterized by comprising the following steps:
acquiring a picture to be recognized, sampling the picture to obtain a preset number of samples of a preset size, and performing scene recognition on each sample according to a convolutional neural network model to obtain at least two scenes corresponding to the picture to be recognized;
acquiring region proposals for the picture to be recognized and a first feature map corresponding to the picture, and acquiring a classification score for each target in the picture according to the region proposals and the picture; the region proposals are obtained by processing a second feature map with a region proposal network, and the first and second feature maps are obtained by convolution processing of the picture to be recognized in a Fast R-CNN network;
obtaining the scene corresponding to the picture to be recognized according to the at least two scenes and the classification score of each target;
wherein the acquiring of the region proposals for the picture to be recognized and the first feature map corresponding to the picture includes:
performing convolution processing on the picture to be recognized through the Fast R-CNN network to obtain shared convolutional layers;
extracting the second feature map from the shared convolutional layers, performing region-proposal processing on the second feature map through the region proposal network according to the network parameters of the region proposal network to obtain a region score for each candidate target region, and obtaining the region proposals according to the region scores;
and taking the shared convolutional layers with a preset number of additional convolutional layers superposed on them as the specific convolutional layers to obtain the first feature map, wherein the specific convolutional layers contain more convolutional layers than the shared convolutional layers.
2. The method according to claim 1, characterized in that, before the scene recognition is performed on each sample according to the convolutional neural network model to obtain at least two scenes corresponding to the picture to be recognized, the method further comprises:
acquiring a training picture and a label corresponding to the training picture, wherein the label indicates the scene corresponding to the training picture;
acquiring the network parameters of the convolutional neural network model corresponding to each scene according to the convolutional neural network model and the labels of the training pictures;
and wherein the scene recognition on each sample according to the convolutional neural network model comprises:
performing scene recognition on each sample according to the network parameters of the convolutional neural network model corresponding to each scene to obtain at least two scenes corresponding to the picture to be recognized.
3. The method according to claim 2, characterized in that the acquiring of the network parameters of the convolutional neural network model corresponding to each scene according to the convolutional neural network model and the labels of the training pictures comprises:
performing segmentation sampling on the training pictures to obtain augmented training pictures;
performing preset processing on the augmented training pictures according to a first preset training parameter to obtain a preset number of third feature maps, the preset processing comprising convolution, pooling and normalization;
performing full-connection processing on the preset number of third feature maps several times to obtain the scene probabilities corresponding to the training pictures;
and adjusting the first preset training parameter according to the scene probabilities and the labels of the training pictures to obtain the network parameters of the convolutional neural network model corresponding to each scene.
4. The method according to claim 1, characterized in that the acquiring of a classification score for each target in the picture to be recognized according to the region proposals and the picture comprises:
performing region-marking processing on the first feature map according to the region proposals to obtain the region-marked first feature map;
pooling the region-marked first feature map through the Fast R-CNN network to obtain the pooled first feature map;
performing full-connection processing on the pooled first feature map;
and obtaining the classification score of each target in the picture to be recognized according to the network parameters of the Fast R-CNN network.
5. The method according to claim 4, characterized by further comprising, before the acquiring of the region proposals for the picture to be recognized and the first feature map corresponding to the picture:
acquiring a training picture and the target region corresponding to the training picture, wherein the target region indicates the position of a complete target in the training picture;
and acquiring the network parameters of the Fast R-CNN network for each target and the network parameters of the region proposal network according to the Fast R-CNN network, the region proposal network and the target regions of the training pictures.
6. The method according to claim 5, characterized in that the acquiring of the network parameters of the Fast R-CNN network for each target and the network parameters of the region proposal network according to the Fast R-CNN network, the region proposal network and the target regions of the training pictures comprises:
performing convolution processing on the training picture through the Fast R-CNN network to obtain shared convolutional layers;
extracting the second feature map from the shared convolutional layers, performing region-proposal processing on the second feature map through the region proposal network according to a second preset training parameter to obtain the region scores, and obtaining the region proposals according to the region scores;
taking the shared convolutional layers with a preset number of additional convolutional layers superposed on them as the specific convolutional layers to obtain the first feature map, wherein the specific convolutional layers contain more convolutional layers than the shared convolutional layers;
performing region-marking processing on the first feature map according to the region proposals to obtain the region-marked first feature map;
pooling the region-marked first feature map through the Fast R-CNN network to obtain the pooled first feature map;
performing full-connection processing on the pooled first feature map, and obtaining the classification score of each target in the training picture according to a third preset training parameter;
and adjusting the second and third preset training parameters according to the region proposals, the classification score of each target in the training picture and the target regions, to obtain the network parameters of the region proposal network and the network parameters of the Fast R-CNN network for each target.
CN201710270013.4A 2017-04-24 2017-04-24 Target detection assisted scene identification method Active CN107194318B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710270013.4A CN107194318B (en) 2017-04-24 2017-04-24 Target detection assisted scene identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710270013.4A CN107194318B (en) 2017-04-24 2017-04-24 Target detection assisted scene identification method

Publications (2)

Publication Number Publication Date
CN107194318A (en) 2017-09-22
CN107194318B (en) 2020-06-12

Family

ID=59872812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710270013.4A Active CN107194318B (en) 2017-04-24 2017-04-24 Target detection assisted scene identification method

Country Status (1)

Country Link
CN (1) CN107194318B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563357B (en) * 2017-09-29 2021-06-04 北京奇虎科技有限公司 Live-broadcast clothing dressing recommendation method and device based on scene segmentation and computing equipment
CN107610146B (en) * 2017-09-29 2021-02-23 北京奇虎科技有限公司 Image scene segmentation method and device, electronic equipment and computer storage medium
CN107622498B (en) * 2017-09-29 2021-06-04 北京奇虎科技有限公司 Image crossing processing method and device based on scene segmentation and computing equipment
CN107730514B (en) * 2017-09-29 2021-02-12 北京奇宝科技有限公司 Scene segmentation network training method and device, computing equipment and storage medium
CN107844977B (en) * 2017-10-09 2021-08-27 中国银联股份有限公司 Payment method and device
CN109688351B (en) 2017-10-13 2020-12-15 华为技术有限公司 Image signal processing method, device and equipment
CN107808138B * 2017-10-31 2021-03-30 电子科技大学 Communication signal identification method based on Faster R-CNN
CN107832795B (en) * 2017-11-14 2021-07-27 深圳码隆科技有限公司 Article identification method and system and electronic equipment
CN109784131B (en) * 2017-11-15 2023-08-22 深圳光启合众科技有限公司 Object detection method, device, storage medium and processor
CN109981695B (en) * 2017-12-27 2021-03-26 Oppo广东移动通信有限公司 Content pushing method, device and equipment
CN110012210B (en) * 2018-01-05 2020-09-22 Oppo广东移动通信有限公司 Photographing method and device, storage medium and electronic equipment
CN108734162B (en) * 2018-04-12 2021-02-09 上海扩博智能技术有限公司 Method, system, equipment and storage medium for identifying target in commodity image
CN108764235B (en) * 2018-05-23 2021-06-29 中国民用航空总局第二研究所 Target detection method, apparatus and medium
CN108681752B (en) * 2018-05-28 2023-08-15 电子科技大学 Image scene labeling method based on deep learning
CN108765033B (en) * 2018-06-08 2021-01-12 Oppo广东移动通信有限公司 Advertisement information pushing method and device, storage medium and electronic equipment
CN108960209B (en) * 2018-08-09 2023-07-21 腾讯科技(深圳)有限公司 Identity recognition method, identity recognition device and computer readable storage medium
CN109325491B (en) 2018-08-16 2023-01-03 腾讯科技(深圳)有限公司 Identification code identification method and device, computer equipment and storage medium
CN109086742A * 2018-08-27 2018-12-25 Oppo广东移动通信有限公司 Scene recognition method, scene recognition device and mobile terminal
TWI717655B (en) * 2018-11-09 2021-02-01 財團法人資訊工業策進會 Feature determination apparatus and method adapted to multiple object sizes
CN113569796A (en) * 2018-11-16 2021-10-29 北京市商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium
CN109727268A (en) * 2018-12-29 2019-05-07 西安天和防务技术股份有限公司 Method for tracking target, device, computer equipment and storage medium
CN111383246B (en) * 2018-12-29 2023-11-07 杭州海康威视数字技术股份有限公司 Scroll detection method, device and equipment
CN110390262B (en) * 2019-06-14 2023-06-30 平安科技(深圳)有限公司 Video analysis method, device, server and storage medium
CN111104942B (en) * 2019-12-09 2023-11-03 熵智科技(深圳)有限公司 Template matching network training method, recognition method and device
CN111062441A (en) * 2019-12-18 2020-04-24 武汉大学 Scene classification method and device based on self-supervision mechanism and regional suggestion network
CN112633064B (en) * 2020-11-19 2023-12-15 深圳银星智能集团股份有限公司 Scene recognition method and electronic equipment
CN113569734B (en) * 2021-07-28 2023-05-05 山东力聚机器人科技股份有限公司 Image recognition and classification method and device based on feature recalibration

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170098162A1 (en) * 2015-10-06 2017-04-06 Evolv Technologies, Inc. Framework for Augmented Machine Decision Making

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103778443A (en) * 2014-02-20 2014-05-07 公安部第三研究所 Method for scene analysis and description based on a topic model and a domain rule library
CN106504233A (en) * 2016-10-18 2017-03-15 国网山东省电力公司电力科学研究院 Faster R-CNN-based method and system for recognizing electric power components in UAV inspection images

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Faster R-CNN:Towards Real-Time Object Detection with Region Proposal Networks";Shaoqing Ren etc.;《arXiv》;20160106;论文第1,3节 *

Also Published As

Publication number Publication date
CN107194318A (en) 2017-09-22

Similar Documents

Publication Publication Date Title
CN107194318B (en) Target detection assisted scene identification method
US20200285896A1 (en) Method for person re-identification based on deep model with multi-loss fusion training strategy
CN108710865B (en) Driver abnormal behavior detection method based on neural network
CN108562589B (en) Method for detecting surface defects of magnetic circuit material
CN110427807B (en) Time sequence event action detection method
CN110610166B (en) Text region detection model training method and device, electronic equipment and storage medium
CN107833213B Weakly supervised object detection method based on a pseudo-ground-truth adaptive method
US10592726B2 (en) Manufacturing part identification using computer vision and machine learning
CN105590099B Multi-person activity recognition method based on an improved convolutional neural network
CN111178120B (en) Pest image detection method based on crop identification cascading technology
US11657513B2 (en) Method and system for generating a tri-map for image matting
CN105144239A (en) Image processing device, program, and image processing method
CN111061915B (en) Video character relation identification method
CN114241548A (en) Small target detection algorithm based on improved YOLOv5
US10678848B2 (en) Method and a system for recognition of data in one or more images
CN103810473A Human body target identification method based on a hidden Markov model
WO2023124278A1 (en) Image processing model training method and apparatus, and image classification method and apparatus
CN109615610B (en) Medical band-aid flaw detection method based on YOLO v2-tiny
CN108446688B (en) Face image gender judgment method and device, computer equipment and storage medium
CN111178405A (en) Similar object identification method fusing multiple neural networks
CN114140696A (en) Commodity identification system optimization method, commodity identification system optimization device, commodity identification equipment and storage medium
CN112396042A (en) Real-time updated target detection method and system, and computer-readable storage medium
CN111539390A (en) Small target image identification method, equipment and system based on Yolov3
CN111571567A (en) Robot translation skill training method and device, electronic equipment and storage medium
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant