CN107194318B - Target detection assisted scene identification method - Google Patents


Info

Publication number
CN107194318B
CN107194318B (granted publication of application CN201710270013.4A)
Authority
CN
China
Prior art keywords
picture
network
training
recognized
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710270013.4A
Other languages
Chinese (zh)
Other versions
CN107194318A (en)
Inventor
王蕴红
孙宇航
赵文婷
陈训逊
刘庆杰
王博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201710270013.4A
Publication of CN107194318A
Application granted
Publication of CN107194318B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention provides a target detection-assisted scene recognition method, which comprises the following steps: acquiring a picture to be recognized, sampling the picture to obtain a preset number of samples of a preset size, and performing scene recognition on each sample with a convolutional neural network model to obtain at least two scenes corresponding to the picture to be recognized; acquiring region proposals for the picture to be recognized and a first feature map corresponding to the picture, and acquiring a classification score for each target in the picture according to the region proposals and the picture; and obtaining the scene corresponding to the picture to be recognized according to the at least two scenes and the classification score of each target. By assisting scene recognition with a target detection method that combines a Fast R-CNN network and a region proposal network, the invention improves the accuracy of scene recognition.

Description

Target detection assisted scene identification method
Technical Field
The invention relates to the technical field of computer vision, in particular to a scene recognition method assisted by target detection.
Background
Scene recognition is an important problem in the field of computer vision: a computer automatically judges which specific scene an image or photograph belongs to. It plays an important role in video surveillance, social-network user behavior mining, and similar applications.
However, owing to the complexity of scenes themselves and to factors such as illumination, occlusion, and scale change, existing scene recognition methods still cannot distinguish scene types well. For example, if a picture contains a crowd of people, it is difficult for a computer to decide whether the picture shows a shopping mall, a station, or a party, which reduces the accuracy of scene recognition.
Disclosure of Invention
The invention provides a target detection-assisted scene recognition method, which reduces the scene-discrimination errors of existing scene recognition methods.
The target detection-assisted scene recognition method provided by the invention comprises the following steps:
acquiring a picture to be recognized, sampling the picture to obtain a preset number of samples of a preset size, and performing scene recognition on each sample according to a convolutional neural network model to obtain at least two scenes corresponding to the picture to be recognized;
acquiring region proposals for the picture to be recognized and a first feature map corresponding to the picture, and acquiring a classification score for each target in the picture according to the region proposals and the picture; the region proposals are obtained by processing a second feature map with a region proposal network, and the first and second feature maps are obtained by convolution processing of the picture to be recognized in a Fast R-CNN network;
and obtaining the scene corresponding to the picture to be recognized according to the at least two scenes and the classification score of each target.
Optionally, before the scene recognition is performed on each sample according to the convolutional neural network model to obtain at least two scenes corresponding to the picture to be recognized, the method further includes:
acquiring a training picture and a label corresponding to the training picture, wherein the label indicates the scene corresponding to the training picture;
acquiring the network parameters of the convolutional neural network model corresponding to each scene according to the convolutional neural network model and the labels of the training pictures;
and the scene recognition on each sample according to the convolutional neural network model then includes:
performing scene recognition on each sample according to the network parameters of the convolutional neural network model corresponding to each scene to obtain at least two scenes corresponding to the picture to be recognized.
Optionally, the acquiring of the network parameters of the convolutional neural network model corresponding to each scene according to the convolutional neural network model and the labels of the training pictures includes:
performing segmentation sampling on the training pictures to obtain augmented training pictures;
performing preset processing on the augmented training pictures according to a first preset training parameter to obtain a preset number of third feature maps, the preset processing comprising convolution, pooling and normalization;
performing full-connection processing on the preset number of third feature maps several times to obtain the scene probabilities corresponding to the training pictures;
and adjusting the first preset training parameter according to the scene probabilities and the labels of the training pictures to obtain the network parameters of the convolutional neural network model corresponding to each scene.
Optionally, the acquiring of the region proposals for the picture to be recognized and the first feature map corresponding to the picture includes:
performing convolution processing on the picture to be recognized through the Fast R-CNN network to obtain shared convolutional layers;
extracting the second feature map from the shared convolutional layers, performing region-proposal processing on the second feature map through the region proposal network according to the network parameters of the region proposal network to obtain a region score for each candidate target region, and obtaining the region proposals according to the region scores;
and taking the shared convolutional layers with a preset number of additional convolutional layers superposed on them as the specific convolutional layers to obtain the first feature map, wherein the specific convolutional layers contain more convolutional layers than the shared convolutional layers.
Optionally, the obtaining of a classification score for each target in the picture to be recognized according to the region proposals and the picture includes:
performing region-marking processing on the first feature map according to the region proposals to obtain the region-marked first feature map;
pooling the region-marked first feature map through the Fast R-CNN network to obtain the pooled first feature map;
performing full-connection processing on the pooled first feature map;
and obtaining the classification score of each target in the picture to be recognized according to the network parameters of the Fast R-CNN network.
Optionally, before the acquiring of the region proposals for the picture to be recognized and the first feature map corresponding to the picture, the method further includes:
acquiring a training picture and the target region corresponding to the training picture, wherein the target region indicates the position of a complete target in the training picture;
and acquiring the network parameters of the Fast R-CNN network for each target and the network parameters of the region proposal network according to the Fast R-CNN network, the region proposal network and the target regions of the training pictures.
Optionally, the acquiring of the network parameters of the Fast R-CNN network for each target and the network parameters of the region proposal network according to the Fast R-CNN network, the region proposal network and the target regions of the training pictures includes:
performing convolution processing on the training picture through the Fast R-CNN network to obtain shared convolutional layers;
extracting the second feature map from the shared convolutional layers, performing region-proposal processing on the second feature map through the region proposal network according to a second preset training parameter to obtain the region scores, and obtaining the region proposals according to the region scores;
taking the shared convolutional layers with a preset number of additional convolutional layers superposed on them as the specific convolutional layers to obtain the first feature map, wherein the specific convolutional layers contain more convolutional layers than the shared convolutional layers;
performing region-marking processing on the first feature map according to the region proposals to obtain the region-marked first feature map;
pooling the region-marked first feature map through the Fast R-CNN network to obtain the pooled first feature map;
performing full-connection processing on the pooled first feature map, and obtaining the classification score of each target in the training picture according to a third preset training parameter;
and adjusting the second and third preset training parameters according to the region proposals, the classification score of each target in the training picture and the target regions, to obtain the network parameters of the region proposal network and the network parameters of the Fast R-CNN network for each target.
According to the target detection-assisted scene recognition method provided by the invention, scene recognition is performed on samples obtained by sampling each picture to be recognized, according to the convolutional neural network model, to obtain at least two scenes corresponding to the picture. Then the classification score of each target in the picture is acquired through the region proposals for the picture and the first feature map corresponding to it, completing the target detection process. Finally, the scene corresponding to the picture is obtained from the at least two candidate scenes and the classification score of each target. The invention assists scene recognition with a target detection method that combines the Fast R-CNN network and the region proposal network, which improves the accuracy of scene recognition. The region proposal network supplies the Fast R-CNN network with region proposals containing candidate target regions, which greatly shortens the time the Fast R-CNN network needs for target detection and so increases its detection rate.
Drawings
FIG. 1 is a first flowchart of the target detection-assisted scene recognition method provided by the present invention;
FIG. 2 is a second flowchart of the target detection-assisted scene recognition method provided by the present invention;
FIG. 3 is a third flowchart of the target detection-assisted scene recognition method provided by the present invention;
FIG. 4 is a fourth flowchart of the target detection-assisted scene recognition method provided by the present invention;
FIG. 5 is a schematic diagram of the training process of the AlexNet model in the target detection-assisted scene recognition method provided by the present invention;
FIG. 6 is a fifth flowchart of the target detection-assisted scene recognition method provided by the present invention;
FIG. 7 is a schematic diagram of the training process of the Fast R-CNN network and the region proposal network in the target detection-assisted scene recognition method provided by the present invention.
Detailed Description
FIG. 1 is a first flowchart, FIG. 2 a second flowchart, and FIG. 3 a third flowchart of the target detection-assisted scene recognition method provided by the present invention. As shown in FIG. 1, the method of this embodiment includes:
101, acquiring a picture to be recognized, sampling the picture to be recognized to obtain samples with preset quantity and preset size, and performing scene recognition on each sample according to a convolutional neural network model to obtain at least two scenes corresponding to the picture to be recognized.
Specifically, the size of the samples in this embodiment may be chosen according to the specific convolutional neural network selected, and their number may be adjusted according to the data conditions during training and learning of the model; neither is specifically limited in this embodiment. The convolutional neural network model may be AlexNet, VGG16, VGG19, ResNet, or the like, which this embodiment does not limit.
Furthermore, in this embodiment each picture to be recognized may be augmented several-fold by cropping and sampling to obtain the preset number of samples of the preset size, which prevents overfitting and thereby improves the accuracy of scene recognition for each picture. The preset number and preset size are not limited in this embodiment; the number of augmented samples only needs to satisfy the requirements for training the convolutional neural network model.
For example, for convenience of description, the AlexNet model is used in this embodiment. The picture to be recognized and its horizontally flipped copy may each be crop-sampled at the upper-left, upper-right, lower-left, lower-right and center positions, yielding 10 samples. All 10 samples are input into the AlexNet model for scene recognition, producing 10 scene recognition results. Each result can be represented by a matrix with one row and as many columns as there are scene categories; each entry is a decimal between 0 and 1, and the entries sum to 1. The matrices of the 10 samples are added and averaged to give the scene probability matrix of the picture to be recognized, and the scene with the highest probability is taken as the final recognized scene of the picture. The other pictures to be recognized undergo scene recognition in the same way to obtain their scene recognition results.
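For illustration only, the 10-crop averaging just described can be sketched in a few lines of Python with PyTorch and torchvision (neither is prescribed by the patent; the 224-pixel crop size, the 10 scene categories and the untrained stand-in AlexNet classifier below are assumptions):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

NUM_SCENES = 10  # assumed number of scene categories

# Stand-in for the trained scene-recognition CNN of step 101.
model = models.alexnet(num_classes=NUM_SCENES).eval()

# TenCrop yields the four corner crops and the centre crop of the image
# and of its horizontal flip: 10 samples in total, as described above.
preprocess = T.Compose([
    T.Resize(256),
    T.TenCrop(224),
    T.Lambda(lambda crops: torch.stack([T.ToTensor()(c) for c in crops])),
])

def recognize_scene(path: str) -> int:
    crops = preprocess(Image.open(path).convert("RGB"))  # (10, 3, 224, 224)
    with torch.no_grad():
        probs = torch.softmax(model(crops), dim=1)       # one row per sample
    return probs.mean(dim=0).argmax().item()             # average, then pick
```

Each row of `probs` corresponds to the one-row probability matrix described above; averaging the ten rows gives the scene probability matrix of the picture.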
Further, for scenes that are hard to tell apart, such as stations and rallies or parades, scene recognition by the convolutional neural network alone may fail: a station picture may be recognized as a rally or parade, and a rally-or-parade picture as a station. The picture to be recognized may therefore correspond to at least two candidate scenes.
102, obtaining region proposals for the picture to be recognized and a first feature map corresponding to the picture, and obtaining a classification score for each target in the picture according to the region proposals and the picture.
The region proposals are obtained by processing a second feature map with a Region Proposal Network, and the first and second feature maps are obtained by convolution processing of the picture to be recognized in a Fast Region-based Convolutional Network (Fast R-CNN).
Specifically, in this embodiment, the target detection is performed on the picture to be recognized, so that the target in the picture to be recognized can be recognized.
On one hand, as shown in FIG. 2, acquiring the region proposals for the picture to be recognized and the first feature map corresponding to the picture in this embodiment includes:
Step 201, performing convolution processing on the picture to be recognized through the Fast R-CNN network to obtain shared convolutional layers.
Step 202, extracting the second feature map from the shared convolutional layers, performing region-proposal processing on the second feature map through the region proposal network according to the network parameters of the region proposal network to obtain a region score for each candidate target region, and obtaining the region proposals according to the region scores.
Specifically, in this embodiment the picture to be recognized is input into the Fast R-CNN network for convolution processing to obtain the shared convolutional layers; one corresponding second feature map is extracted from them and input into the region proposal network for region-proposal processing, which yields the region scores. Region-proposal processing performs sliding-window processing on the feature map of the picture to be recognized followed by several convolution or full-connection operations, and a region score expresses whether a region contains a target and with what probability.
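As a sketch of such a region-scoring network, the module below follows the standard RPN design of a 3 × 3 sliding convolution with two sibling 1 × 1 convolutions; the channel widths and the count of 9 anchors per position are conventional assumptions rather than values fixed by this embodiment:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal region-proposal head: for each of num_anchors reference
    boxes at every sliding-window position, emit an objectness score
    (the "region score") and four box-regression offsets."""

    def __init__(self, in_channels: int = 512, num_anchors: int = 9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(512, num_anchors, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(512, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map: torch.Tensor):
        h = torch.relu(self.conv(feature_map))  # sliding-window processing
        return self.objectness(h), self.bbox_deltas(h)
```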
Step 203, taking the shared convolutional layers with a preset number of additional convolutional layers superposed on them as the specific convolutional layers to obtain the first feature map, wherein the specific convolutional layers contain more convolutional layers than the shared convolutional layers.
Specifically, to avoid losing other information, the Fast R-CNN network continues the convolution processing on top of the shared convolutional layers so that the information is more complete; that is, a preset number of convolutional layers are superposed on the shared convolutional layers to form the specific convolutional layers, from which the first feature map is obtained. The preset number of layers is not specifically limited in this embodiment. For example, the second feature map may be taken from the 5th layer of the shared convolutional layers and passed to the region proposal network, while the first feature map is taken from the 9th layer, which serves as the specific convolutional layer.
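A minimal sketch of this shared/specific split, with an illustrative toy backbone (the 9 uniform convolutional blocks and the 5th/9th layer split below are assumptions taken from the example, not a prescribed architecture):

```python
import torch
import torch.nn as nn

# Toy backbone: blocks 1-5 play the role of the shared convolutional
# layers, blocks 6-9 the additional "specific" layers.
blocks = [nn.Sequential(nn.Conv2d(3 if i == 0 else 64, 64, 3, padding=1),
                        nn.ReLU()) for i in range(9)]
backbone = nn.Sequential(*blocks)

x = torch.randn(1, 3, 224, 224)
second_feature_map = backbone[:5](x)                   # layer 5, to the RPN
first_feature_map = backbone[5:](second_feature_map)   # layer 9 output
```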
On the other hand, as shown in FIG. 3, obtaining the classification score of each target in the picture to be recognized according to the region proposals and the picture in this embodiment includes:
Step 301, performing region-marking processing on the first feature map according to the region proposals to obtain the region-marked first feature map.
Specifically, in this embodiment several regions with the highest region scores may be selected, and the corresponding positions in the picture to be recognized are passed to the Fast R-CNN network as region proposals. The Fast R-CNN network then performs region-marking processing on the first feature map according to the region proposals, so that it only needs to classify and score targets within the marked regions and can skip regions without targets, which saves target detection time for the picture to be recognized.
Step 302, pooling the region-marked first feature map through the Fast R-CNN network to obtain the pooled first feature map.
Further, since the marked regions of the first feature map differ in size, pooling is needed to standardize their sizes and obtain the pooled first feature map.
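For illustration, this pooling step can be reproduced with torchvision's `roi_pool`; the feature-map shape, the stride of 16 and the 7 × 7 output grid below are assumptions:

```python
import torch
from torchvision.ops import roi_pool

# One feature map and two marked regions of different sizes, given as
# (batch_index, x1, y1, x2, y2) in image coordinates; both regions are
# pooled to the same fixed 7x7 grid.
feature_map = torch.randn(1, 256, 38, 50)
regions = torch.tensor([[0, 10.0, 20.0, 200.0, 150.0],
                        [0, 50.0, 40.0, 120.0, 300.0]])
pooled = roi_pool(feature_map, regions, output_size=(7, 7),
                  spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```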
Step 303, performing full-connection processing on the pooled first feature map.
Specifically, the pooled first feature map still contains many feature channels; the full-connection processing reduces its dimensionality, which facilitates the classification and scoring operations.
Step 304, obtaining the classification score of each target in the picture to be recognized according to the network parameters of the Fast R-CNN network.
Specifically, the classification scores in this embodiment indicate not only whether there is a target in the picture to be recognized but also the probability of each kind of target. The network parameters of the Fast R-CNN network are the optimized parameters obtained by training the network, so the targets in the picture to be recognized can be detected accurately; whether a certain target appears in the picture can therefore be judged from its classification score.
It should be noted that step 101 may be performed before step 102, after step 102, or simultaneously with it; this embodiment does not limit the order of steps 101 and 102.
103, obtaining a scene corresponding to the picture to be recognized according to the at least two scenes corresponding to the picture to be recognized and the classification score of each target.
Specifically, step 101 narrows the range of candidate scenes by recognizing at least two scenes for the picture to be recognized, for example scenes with large crowds such as a station and a rally or parade. Step 102 provides the classification score of each target in the picture; banners, for instance, appear mostly in rally or parade scenes, so a banner can serve as a target. Combining the at least two candidate scenes with the classification scores of the targets thus identifies the scene of the picture: if the picture to be recognized contains a banner, its scene is a rally or parade; if it contains no banner, its scene is the station.
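The station-versus-parade decision above amounts to a simple fusion rule; the sketch below illustrates it in Python (the object-to-scene table and the 0.5 score threshold are illustrative assumptions, not values given by the embodiment):

```python
# Illustrative mapping from tell-tale objects to the scenes they imply.
DISCRIMINATIVE_OBJECTS = {"banner": "parade"}

def resolve_scene(candidate_scenes, object_scores, threshold=0.5):
    """candidate_scenes: scenes from step 101, most likely first.
    object_scores: {object name: classification score} from step 102."""
    for obj, scene in DISCRIMINATIVE_OBJECTS.items():
        if scene in candidate_scenes and object_scores.get(obj, 0.0) > threshold:
            return scene          # a detected banner resolves the ambiguity
    return candidate_scenes[0]    # otherwise keep the CNN's top choice

print(resolve_scene(["station", "parade"], {"banner": 0.92}))  # parade
print(resolve_scene(["station", "parade"], {"banner": 0.10}))  # station
```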
In the target detection-assisted scene recognition method provided by this embodiment, scene recognition is performed on samples obtained by sampling each picture to be recognized, according to the convolutional neural network model, to obtain at least two scenes corresponding to the picture. Then the classification score of each target in the picture is acquired through the region proposals for the picture and the first feature map corresponding to it, completing the target detection process. Finally, the scene of the picture is obtained from the at least two candidate scenes and the classification scores of the targets. In this embodiment, scene recognition is assisted by a target detection method that combines the Fast R-CNN network and the region proposal network, which improves its accuracy. The region proposal network supplies the Fast R-CNN network with region proposals containing candidate target regions, which greatly shortens the time the Fast R-CNN network needs for target detection and increases its detection rate.
FIG. 4 is a fourth flowchart of the target detection-assisted scene recognition method provided by the present invention. As shown in FIG. 4, before scene recognition is performed on each sample according to the convolutional neural network model to obtain at least two scenes corresponding to the picture to be recognized, the method of this embodiment further includes:
step 401, obtaining a training picture and a label corresponding to the training picture, where the label is used to indicate a scene corresponding to the training picture.
Specifically, the training pictures in this embodiment may be obtained from a database or collected manually, which this embodiment does not limit; it is only necessary that each scene has on the order of thousands of training pictures. Manually collected training pictures can be expanded by horizontal flipping and crop sampling. If the convolutional neural network model requires a fixed input size, the training pictures can be brought to a uniform size by crop sampling. Meanwhile, a label must be obtained for each training picture, the label being the scene category of that picture.
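One possible flip-and-crop expansion pipeline for this step, sketched with torchvision transforms (the 256-pixel resize and 227-pixel crop are assumptions chosen to match the AlexNet example later in this description):

```python
import torchvision.transforms as T

# Each pass over a training picture yields a randomly flipped and
# randomly cropped 227x227 variant, expanding the sample set.
augment = T.Compose([
    T.Resize(256),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomCrop(227),
    T.ToTensor(),
])
```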
Step 402, obtaining network parameters corresponding to the convolutional neural network model corresponding to each scene according to the convolutional neural network model and the label corresponding to the training picture.
Specifically, the training pictures and their corresponding labels are input into the convolutional neural network model, which then carries out the training and learning process. Because the training pictures and their labels are known, the convolutional neural network model can learn the various scenes by dynamically adjusting the first preset training parameter, and thereby learn to distinguish different scenes. The specific method is as follows:
Step 4021, performing segmentation sampling on the training picture to obtain augmented training pictures.
Specifically, the segmentation sampling process turns the training pictures of one scene category into several times as many uniformly sized training pictures of the same scene. This increases the number of training pictures with a simple operation and provides more references for the training and learning of the convolutional neural network model.
Step 4022, performing preset processing on the augmented training pictures according to a first preset training parameter to obtain a preset number of third feature maps, the preset processing comprising convolution, pooling and normalization.
Specifically, the parameters of the convolutional neural network model are randomly initialized as the first preset training parameter; after convolution, pooling and normalization of the augmented training pictures, a preset number of third feature maps that clearly express the scene content of the training pictures are obtained.
Step 4023, performing full-connection processing on the preset number of third feature maps several times to obtain the scene probabilities corresponding to the training pictures.
Specifically, the third feature maps after the preset processing contain many feature channels, and several rounds of full-connection processing yield the scene probabilities corresponding to the training pictures. Repeated full connection reduces the dimensionality of the third feature maps without losing too much information, which effectively preserves the classification performance.
Step 4024, adjusting the first preset training parameter according to the scene probabilities and the labels of the training pictures to obtain the network parameters of the convolutional neural network model corresponding to each scene.
Specifically, the first preset training parameter is adjusted dynamically by checking whether the scene probability is consistent with the label of the training picture; once they are consistent, the first preset training parameter at that moment can be taken as the network parameters of the convolutional neural network model corresponding to the scene.
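The adjust-until-consistent procedure of steps 4022 to 4024 corresponds to ordinary supervised training; a minimal PyTorch sketch follows (the loss function, learning rate, momentum and epoch count are illustrative assumptions, and `model` and `loader` stand for any scene classifier and any iterable of picture/label batches):

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3):
    # Cross-entropy measures the gap between the predicted scene
    # probabilities and the labels of the training pictures.
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for pictures, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(pictures), labels)
            loss.backward()    # how each parameter should be adjusted
            optimizer.step()   # the dynamic adjustment itself
    return model               # final weights = the network parameters
```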
Step 403, performing scene recognition on each sample according to the network parameters of the convolutional neural network model corresponding to each scene to obtain at least two scenes corresponding to the picture to be recognized.
Specifically, a convolutional neural network model built with the network parameters corresponding to each scene recognizes scenes better, and performing scene recognition on the samples of the picture to be recognized yields the at least two scenes corresponding to it.
FIG. 5 is a schematic diagram of the training process of the AlexNet model in the target detection-assisted scene recognition method provided by the present invention. In a specific embodiment, taking the AlexNet model as an example, the model must first be trained and learned, after which scene recognition is performed on the picture to be recognized, as shown in FIG. 5.
Firstly, 256 × 256 three-channel training pictures are input into the AlexNet model, which can read the labels of the training pictures; 227 × 227 three-channel crops are obtained after segmentation sampling and are subjected to convolution calculation and activation processing, where the activation applies an activation function such as the Rectified Linear Unit (ReLU). The mathematical expression of ReLU is f(x) = max(0, x), where x represents the input signal and f(x) the output signal: when the input signal is less than 0 the output is 0, and when the input signal is greater than or equal to 0 the output equals the input. Stochastic gradient descent (SGD) converges much faster with ReLU than with other functions (such as the sigmoid/tanh activations of traditional methods), and since ReLU is piecewise linear, an activation value requires only a threshold comparison instead of a large amount of complex computation, so the convolution process is optimized. In this way 96 feature maps of size 55 × 55 are obtained; pooling and normalization give 96 feature maps of size 27 × 27; convolution calculation and activation again give 256 feature maps of size 27 × 27; pooling and normalization give 256 feature maps of size 13 × 13; convolution and activation give 384 feature maps of size 13 × 13, and a further convolution and activation again gives 384 feature maps of size 13 × 13; a final convolution and activation gives 256 feature maps of size 13 × 13, and pooling gives 256 feature maps of size 6 × 6. Full-connection processing performed three times with activation then yields the scene result for the training picture, i.e. the scene probability corresponding to the training picture. The whole process is then adjusted according to the label of the training picture, that is, the first preset training parameter of the AlexNet model is changed, until the scene probability corresponding to the training picture is consistent with the label, the AlexNet model training converges to a certain interval, and the loss value is controlled within an acceptable range, at which point training stops. The first preset training parameter at that moment is taken as the network parameters of the AlexNet model corresponding to the scene. This completes the whole process of AlexNet model training and learning.
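The feature-map sizes listed above can be checked against the standard AlexNet layer hyper-parameters; the sketch below traces the shapes (normalization layers are omitted because they do not change shapes, and the exact kernel sizes are the published AlexNet values, which the patent does not restate):

```python
import torch
import torch.nn as nn

relu, pool = nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2)
conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4)
conv2 = nn.Conv2d(96, 256, kernel_size=5, padding=2)
conv3 = nn.Conv2d(256, 384, kernel_size=3, padding=1)
conv4 = nn.Conv2d(384, 384, kernel_size=3, padding=1)
conv5 = nn.Conv2d(384, 256, kernel_size=3, padding=1)

x = torch.randn(1, 3, 227, 227)   # one three-channel crop
x = pool(relu(conv1(x)))          # 96 @ 55x55  -> 96 @ 27x27
x = pool(relu(conv2(x)))          # 256 @ 27x27 -> 256 @ 13x13
x = relu(conv4(relu(conv3(x))))   # 384 @ 13x13, twice
x = pool(relu(conv5(x)))          # 256 @ 13x13 -> 256 @ 6x6
print(x.flatten(1).shape)         # (1, 9216), into three FC layers
```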
Then, when a picture to be recognized is input into the AlexNet model for scene recognition, the network parameters of the AlexNet model corresponding to each scene have already been determined by the training and learning process, so the scene corresponding to the picture to be recognized can be recognized.
FIG. 6 is a fifth flowchart of the target detection-assisted scene recognition method provided by the present invention. As shown in FIG. 6, before the acquiring of the region proposals for the picture to be recognized and the first feature map corresponding to the picture, the method of this embodiment further includes:
step 601, obtaining a training picture and a target region corresponding to the training picture, where the target region is used to indicate a position of a complete target in the training picture.
Specifically, the training pictures in this embodiment may be obtained from a database or collected manually, which this embodiment does not limit. Since the Fast R-CNN network does not restrict the size of its input pictures, the training pictures need not be segmentation-sampled. Meanwhile, a target region corresponding to each training picture must be obtained, the target region indicating the actual position of the complete target in the training picture.
Step 602, obtaining the network parameters of the Fast R-CNN network for each target and the network parameters of the region proposal network according to the Fast R-CNN network, the region proposal network and the target regions of the training pictures.
Specifically, the training pictures and their corresponding target regions are input into the Fast R-CNN network, and the Fast R-CNN network and the region proposal network then carry out the training and learning process. Because the training pictures and their target regions are known, the region proposal network can learn, by dynamically adjusting the second preset training parameter, whether the various targets appear in the training pictures, produce region proposals for the targets, and pass the proposals to the Fast R-CNN network, which reduces the time the Fast R-CNN network spends on target detection and improves efficiency. Meanwhile, the Fast R-CNN network can learn the various targets in the training pictures by dynamically adjusting the third preset training parameter, and thereby learn to distinguish the targets. The specific method is as follows:
Step 6021, performing convolution processing on the training picture through the Fast R-CNN network to obtain shared convolutional layers.
Specifically, in this embodiment the training picture is input into the Fast R-CNN network for convolution processing, which yields the shared convolutional layers.
Step 6022, extracting the second feature map from the shared convolutional layers, performing region-proposal processing on the second feature map through the region proposal network according to a second preset training parameter to obtain the region scores, and obtaining the region proposals according to the region scores.
Specifically, one corresponding second feature map is extracted from the shared convolutional layers and input into the region proposal network for region-proposal processing, which yields the region scores. Region-proposal processing performs sliding-window processing on the training picture's feature map followed by several convolution or full-connection operations; a region score is a value expressing whether a region contains a target and with what probability.
Step 6023, taking the shared convolutional layers with a preset number of additional convolutional layers superposed on them as the specific convolutional layers to obtain the first feature map, wherein the specific convolutional layers contain more convolutional layers than the shared convolutional layers.
Specifically, to avoid losing other information, the Fast R-CNN network continues the convolution processing on top of the shared convolutional layers so that the information is more complete; that is, a preset number of convolutional layers are superposed on the shared convolutional layers to form the specific convolutional layers, from which the first feature map is obtained. The preset number of layers is not specifically limited in this embodiment. For example, the second feature map may be taken from the 5th layer of the shared convolutional layers and passed to the region proposal network, while the first feature map is taken from the 9th layer, which serves as the specific convolutional layer.
Step 6024, performing region-marking processing on the first feature map according to the region proposals to obtain the region-marked first feature map.
Specifically, in this embodiment several regions with the highest region scores may be selected, and the corresponding positions in the training picture are passed to the Fast R-CNN network as region proposals. The Fast R-CNN network then performs region-marking processing on the first feature map according to the region proposals, so that it only needs to classify and score targets within the marked regions and can skip regions without targets, which saves target detection time for the training picture.
Step 6025, pooling the region-marked first feature map through the Fast R-CNN network to obtain the pooled first feature map.
Specifically, since the marked regions of the first feature map differ in size, pooling is needed to standardize their sizes and obtain the pooled first feature map.
Step 6026, performing full-connection processing on the pooled first feature map, and obtaining the classification score of each target in the training picture according to a third preset training parameter.
Specifically, the pooled first feature map still contains many feature channels; the full-connection processing reduces its dimensionality, which facilitates the classification and scoring operations.
Step 6027, adjusting the second and third preset training parameters according to the region proposals, the classification score of each target in the training picture and the target regions, to obtain the network parameters of the region proposal network and the network parameters of the Fast R-CNN network for each target.
Specifically, since the target regions are known, the third preset training parameter can be adjusted dynamically against them until target detection is accurate, and it is then taken as the network parameters of the Fast R-CNN network for each target. Meanwhile, the second preset training parameter can be adjusted dynamically according to the region proposals and the target regions so that the region proposals corresponding to the region scores become more accurate, which reduces the time the Fast R-CNN network spends on target detection in the training pictures; it is then taken as the network parameters of the region proposal network.
FIG. 7 is a schematic diagram of the training process of the Fast R-CNN network and the RPN in the target detection-assisted scene recognition method provided by the present invention. In a specific embodiment, the Fast R-CNN network and the Region Proposal Network (RPN) must first be trained and learned jointly on training pictures, after which the scene recognition process is applied to pictures to be recognized, as shown in FIG. 7.
Firstly, training pictures of arbitrary size are input into the Fast R-CNN network, together with the target regions corresponding to the training pictures. Convolution processing of the training picture through the Fast R-CNN network yields the shared convolutional layers. The RPN extracts a second feature map from the shared convolutional layers, performs sliding-window processing on it, and applies two convolution or full-connection operations: the pixels in each sliding window give rise to 9 regions of different sizes, which are compared with the target regions and given region scores according to the second preset training parameter; the 300 highest-scoring regions are selected as region proposals and passed to the Fast R-CNN network. Then several convolutional layers are superposed on the shared convolutional layers to form the specific convolutional layers, from which the first feature map is obtained. In this embodiment, region-marking processing may be applied to the first feature map according to the region proposals to obtain the region-marked first feature map, which the Fast R-CNN network then pools to obtain the pooled first feature map. Full-connection processing is performed on the pooled first feature map, and the classification score of each target in the training picture is obtained according to the third preset training parameter. At this point it is judged whether the detection results are consistent with the target regions. If they are inconsistent, the third preset training parameter is adjusted dynamically, and the second preset training parameter is adjusted at the same time according to the region proposals and the target regions, until the detections are consistent with the target regions, the Fast R-CNN network and the RPN converge to a certain interval, and the loss value is controlled within an acceptable range, at which point training stops. The second preset training parameter is taken as the network parameters of the RPN, and the third preset training parameter as the network parameters of the Fast R-CNN network. This completes the whole process of training and learning of the Fast R-CNN network and the RPN.
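The selection of the 300 highest-scoring regions can be sketched as follows (the 38 × 50 feature-map grid and the reading of the 9 regions as 3 scales × 3 aspect ratios are assumptions; the description above fixes only the counts 9 and 300):

```python
import torch

def top_proposals(boxes: torch.Tensor, scores: torch.Tensor, k: int = 300):
    """boxes: (N, 4) candidate regions; scores: (N,) region scores.
    Keeps the k highest-scoring regions as the region proposals."""
    keep = scores.topk(min(k, scores.numel())).indices
    return boxes[keep], scores[keep]

# e.g. a 38x50 grid of sliding-window positions with 9 regions each
anchors = torch.rand(38 * 50 * 9, 4)
scores = torch.rand(38 * 50 * 9)
proposals, _ = top_proposals(anchors, scores)
print(proposals.shape)  # torch.Size([300, 4])
```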
Then, when a picture to be recognized is input into the Fast R-CNN network for target detection, the network parameters of the Fast R-CNN network and of the RPN have already been determined by the training and learning process, so the targets in the picture can be identified, and the RPN's region proposals greatly reduce the detection time of the Fast R-CNN network.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A target detection-assisted scene recognition method, characterized by comprising the following steps:
acquiring a picture to be recognized, sampling the picture to obtain a preset number of samples of a preset size, and performing scene recognition on each sample according to a convolutional neural network model to obtain at least two scenes corresponding to the picture to be recognized;
acquiring region proposals for the picture to be recognized and a first feature map corresponding to the picture, and acquiring a classification score for each target in the picture according to the region proposals and the picture; the region proposals are obtained by processing a second feature map with a region proposal network, and the first and second feature maps are obtained by convolution processing of the picture to be recognized in a Fast R-CNN network;
obtaining the scene corresponding to the picture to be recognized according to the at least two scenes and the classification score of each target;
wherein the acquiring of the region proposals for the picture to be recognized and the first feature map corresponding to the picture includes:
performing convolution processing on the picture to be recognized through the Fast R-CNN network to obtain shared convolutional layers;
extracting the second feature map from the shared convolutional layers, performing region-proposal processing on the second feature map through the region proposal network according to the network parameters of the region proposal network to obtain a region score for each candidate target region, and obtaining the region proposals according to the region scores;
and taking the shared convolutional layers with a preset number of additional convolutional layers superposed on them as the specific convolutional layers to obtain the first feature map, wherein the specific convolutional layers contain more convolutional layers than the shared convolutional layers.
2. The method according to claim 1, characterized in that, before the scene recognition is performed on each sample according to the convolutional neural network model to obtain at least two scenes corresponding to the picture to be recognized, the method further comprises:
acquiring a training picture and a label corresponding to the training picture, wherein the label indicates the scene corresponding to the training picture;
acquiring the network parameters of the convolutional neural network model corresponding to each scene according to the convolutional neural network model and the labels of the training pictures;
and wherein the scene recognition on each sample according to the convolutional neural network model comprises:
performing scene recognition on each sample according to the network parameters of the convolutional neural network model corresponding to each scene to obtain at least two scenes corresponding to the picture to be recognized.
3. The method according to claim 2, characterized in that the acquiring of the network parameters of the convolutional neural network model corresponding to each scene according to the convolutional neural network model and the labels of the training pictures comprises:
performing segmentation sampling on the training pictures to obtain augmented training pictures;
performing preset processing on the augmented training pictures according to a first preset training parameter to obtain a preset number of third feature maps, the preset processing comprising convolution, pooling and normalization;
performing full-connection processing on the preset number of third feature maps several times to obtain the scene probabilities corresponding to the training pictures;
and adjusting the first preset training parameter according to the scene probabilities and the labels of the training pictures to obtain the network parameters of the convolutional neural network model corresponding to each scene.
4. The method according to claim 1, characterized in that the acquiring of a classification score for each target in the picture to be recognized according to the region proposals and the picture comprises:
performing region-marking processing on the first feature map according to the region proposals to obtain the region-marked first feature map;
pooling the region-marked first feature map through the Fast R-CNN network to obtain the pooled first feature map;
performing full-connection processing on the pooled first feature map;
and obtaining the classification score of each target in the picture to be recognized according to the network parameters of the Fast R-CNN network.
5. The method according to claim 4, characterized by further comprising, before the acquiring of the region proposals for the picture to be recognized and the first feature map corresponding to the picture:
acquiring a training picture and the target region corresponding to the training picture, wherein the target region indicates the position of a complete target in the training picture;
and acquiring the network parameters of the Fast R-CNN network for each target and the network parameters of the region proposal network according to the Fast R-CNN network, the region proposal network and the target regions of the training pictures.
6. The method according to claim 5, characterized in that the acquiring of the network parameters of the Fast R-CNN network for each target and the network parameters of the region proposal network according to the Fast R-CNN network, the region proposal network and the target regions of the training pictures comprises:
performing convolution processing on the training picture through the Fast R-CNN network to obtain shared convolutional layers;
extracting the second feature map from the shared convolutional layers, performing region-proposal processing on the second feature map through the region proposal network according to a second preset training parameter to obtain the region scores, and obtaining the region proposals according to the region scores;
taking the shared convolutional layers with a preset number of additional convolutional layers superposed on them as the specific convolutional layers to obtain the first feature map, wherein the specific convolutional layers contain more convolutional layers than the shared convolutional layers;
performing region-marking processing on the first feature map according to the region proposals to obtain the region-marked first feature map;
pooling the region-marked first feature map through the Fast R-CNN network to obtain the pooled first feature map;
performing full-connection processing on the pooled first feature map, and obtaining the classification score of each target in the training picture according to a third preset training parameter;
and adjusting the second and third preset training parameters according to the region proposals, the classification score of each target in the training picture and the target regions, to obtain the network parameters of the region proposal network and the network parameters of the Fast R-CNN network for each target.
CN201710270013.4A 2017-04-24 2017-04-24 Target detection assisted scene identification method Active CN107194318B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710270013.4A CN107194318B (en) 2017-04-24 2017-04-24 Target detection assisted scene identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710270013.4A CN107194318B (en) 2017-04-24 2017-04-24 Target detection assisted scene identification method

Publications (2)

Publication Number Publication Date
CN107194318A (en) 2017-09-22
CN107194318B (en) 2020-06-12

Family

ID=59872812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710270013.4A Active CN107194318B (en) 2017-04-24 2017-04-24 Target detection assisted scene identification method

Country Status (1)

Country Link
CN (1) CN107194318B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563357B (en) * 2017-09-29 2021-06-04 北京奇虎科技有限公司 Live-broadcast clothing dressing recommendation method and device based on scene segmentation and computing equipment
CN107610146B (en) * 2017-09-29 2021-02-23 北京奇虎科技有限公司 Image scene segmentation method and device, electronic equipment and computer storage medium
CN107622498B (en) * 2017-09-29 2021-06-04 北京奇虎科技有限公司 Image crossing processing method and device based on scene segmentation and computing equipment
CN107730514B (en) * 2017-09-29 2021-02-12 北京奇宝科技有限公司 Scene segmentation network training method and device, computing equipment and storage medium
CN107844977B (en) * 2017-10-09 2021-08-27 中国银联股份有限公司 Payment method and device
CN109688351B (en) 2017-10-13 2020-12-15 华为技术有限公司 Image signal processing method, device and equipment
CN107808138B * 2017-10-31 2021-03-30 电子科技大学 Communication signal identification method based on Faster R-CNN
CN107832795B (en) * 2017-11-14 2021-07-27 深圳码隆科技有限公司 Article identification method and system and electronic equipment
CN109784131B (en) * 2017-11-15 2023-08-22 深圳光启合众科技有限公司 Object detection method, device, storage medium and processor
CN109981695B (en) * 2017-12-27 2021-03-26 Oppo广东移动通信有限公司 Content pushing method, device and equipment
CN110012210B (en) * 2018-01-05 2020-09-22 Oppo广东移动通信有限公司 Photographing method and device, storage medium and electronic equipment
CN108734162B (en) * 2018-04-12 2021-02-09 上海扩博智能技术有限公司 Method, system, equipment and storage medium for identifying target in commodity image
CN108764235B (en) * 2018-05-23 2021-06-29 中国民用航空总局第二研究所 Target detection method, apparatus and medium
CN108681752B (en) * 2018-05-28 2023-08-15 电子科技大学 Image scene labeling method based on deep learning
CN108765033B (en) * 2018-06-08 2021-01-12 Oppo广东移动通信有限公司 Advertisement information pushing method and device, storage medium and electronic equipment
CN108960209B (en) * 2018-08-09 2023-07-21 腾讯科技(深圳)有限公司 Identity recognition method, identity recognition device and computer readable storage medium
CN109325491B (en) 2018-08-16 2023-01-03 腾讯科技(深圳)有限公司 Identification code identification method and device, computer equipment and storage medium
CN109086742A * 2018-08-27 2018-12-25 Oppo广东移动通信有限公司 Scene recognition method, scene recognition device and mobile terminal
TWI717655B (en) * 2018-11-09 2021-02-01 財團法人資訊工業策進會 Feature determination apparatus and method adapted to multiple object sizes
CN113569796A (en) * 2018-11-16 2021-10-29 北京市商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium
CN109727268A (en) * 2018-12-29 2019-05-07 西安天和防务技术股份有限公司 Method for tracking target, device, computer equipment and storage medium
CN111383246B (en) * 2018-12-29 2023-11-07 杭州海康威视数字技术股份有限公司 Scroll detection method, device and equipment
CN110390262B (en) * 2019-06-14 2023-06-30 平安科技(深圳)有限公司 Video analysis method, device, server and storage medium
CN111104942B (en) * 2019-12-09 2023-11-03 熵智科技(深圳)有限公司 Template matching network training method, recognition method and device
CN111062441A (en) * 2019-12-18 2020-04-24 武汉大学 Scene classification method and device based on self-supervision mechanism and regional suggestion network
CN112633064B (en) * 2020-11-19 2023-12-15 深圳银星智能集团股份有限公司 Scene recognition method and electronic equipment
CN113569734B (en) * 2021-07-28 2023-05-05 山东力聚机器人科技股份有限公司 Image recognition and classification method and device based on feature recalibration

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170098162A1 (en) * 2015-10-06 2017-04-06 Evolv Technologies, Inc. Framework for Augmented Machine Decision Making

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103778443A (en) * 2014-02-20 2014-05-07 公安部第三研究所 Method for scene analysis and description based on a topic model and a domain rule library
CN106504233A (en) * 2016-10-18 2017-03-15 国网山东省电力公司电力科学研究院 Faster R-CNN-based method and system for recognizing electric power components in UAV inspection images

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Faster R-CNN:Towards Real-Time Object Detection with Region Proposal Networks";Shaoqing Ren etc.;《arXiv》;20160106;论文第1,3节 *

Also Published As

Publication number Publication date
CN107194318A (en) 2017-09-22

Similar Documents

Publication Publication Date Title
CN107194318B (en) Target detection assisted scene identification method
US20200285896A1 (en) Method for person re-identification based on deep model with multi-loss fusion training strategy
CN108710865B (en) Driver abnormal behavior detection method based on neural network
CN108562589B (en) Method for detecting surface defects of magnetic circuit material
CN110427807B (en) Time sequence event action detection method
CN110610166B (en) Text region detection model training method and device, electronic equipment and storage medium
CN107833213B Weakly supervised object detection method based on a pseudo-ground-truth adaptive method
US10592726B2 (en) Manufacturing part identification using computer vision and machine learning
CN105590099B Multi-person activity recognition method based on an improved convolutional neural network
CN111178120B (en) Pest image detection method based on crop identification cascading technology
US11657513B2 (en) Method and system for generating a tri-map for image matting
CN105144239A (en) Image processing device, program, and image processing method
CN111061915B (en) Video character relation identification method
CN114241548A (en) Small target detection algorithm based on improved YOLOv5
US10678848B2 (en) Method and a system for recognition of data in one or more images
CN103810473A Human body target identification method based on a hidden Markov model
WO2023124278A1 (en) Image processing model training method and apparatus, and image classification method and apparatus
CN109615610B (en) Medical band-aid flaw detection method based on YOLO v2-tiny
CN108446688B (en) Face image gender judgment method and device, computer equipment and storage medium
CN111178405A (en) Similar object identification method fusing multiple neural networks
CN114140696A (en) Commodity identification system optimization method, commodity identification system optimization device, commodity identification equipment and storage medium
CN112396042A (en) Real-time updated target detection method and system, and computer-readable storage medium
CN111539390A (en) Small target image identification method, equipment and system based on Yolov3
CN111571567A (en) Robot translation skill training method and device, electronic equipment and storage medium
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant