CN113065443A - Training method, recognition method, system, device and medium of image recognition model - Google Patents

Training method, recognition method, system, device and medium of image recognition model

Info

Publication number
CN113065443A
CN113065443A CN202110321375.8A
Authority
CN
China
Prior art keywords
output
layer
picture
feature map
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110321375.8A
Other languages
Chinese (zh)
Inventor
杨凯
罗超
胡泓
李巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Computer Technology Shanghai Co Ltd
Original Assignee
Ctrip Computer Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Computer Technology Shanghai Co Ltd filed Critical Ctrip Computer Technology Shanghai Co Ltd
Priority to CN202110321375.8A priority Critical patent/CN113065443A/en
Publication of CN113065443A publication Critical patent/CN113065443A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a training method, a recognition method, a system, a device and a medium for a picture recognition model. The training method obtains a plurality of pictures and labels them according to preset classification labels to obtain labeled pictures, so as to construct an initial training data set; the pictures comprise target pictures with an explicit scene and interference pictures without an explicit scene, and the preset classification labels comprise a plurality of scene labels corresponding to the scenes in the target pictures and a non-scene label corresponding to the interference pictures. A convolutional neural network is adopted as the base network, and branch convolutional layers are added between the layers of the convolutional neural network to construct the network structure of the picture recognition model; the branch convolutional layers fuse the multi-level feature maps in the base network. The labeled pictures are input into the network structure of the picture recognition model for training to generate the picture recognition model, thereby realizing automatic recognition of open-scene pictures.

Description

Training method, recognition method, system, device and medium of image recognition model
Technical Field
The invention relates to the technical field of information processing for OTA (Online Travel Agency) platforms, and in particular to a training method, a recognition method, a system, a device and a medium for a picture recognition model.
Background
With the development of communication technology and intelligent terminal devices, people can take many photos at will while travelling and upload them to a platform for sharing. Every day the platform's gallery gains a large number of pictures uploaded by users, merchants or scenic-spot officials, and a huge volume of pictures has accumulated. Because this large, disorderly collection cannot be reviewed and labeled manually, it is difficult to use. At present there is much research on classifying and recognizing pictures of specific scenes, and such methods are relatively mature, but methods for recognizing pictures containing specific content in open scenes have rarely been studied. Open-scene pictures are not restricted to the explicit scene and therefore contain interference: for example, if the explicit scene is natural landscapes, the open-scene data also includes a large number of pictures irrelevant to natural landscapes, such as pictures of plants, buildings, animals and people, cartoon pictures, posters, and the like.
Disclosure of Invention
The invention aims to overcome the defect in the prior art that open-scene pictures cannot be automatically recognized, and provides a training method, a recognition method, a system, a device and a medium for a picture recognition model.
The invention solves the technical problems through the following technical scheme:
the invention provides a training method of a picture recognition model, which comprises the following steps:
acquiring a plurality of pictures, labeling the pictures according to preset classification labels to obtain labeled pictures so as to construct an initial training data set; wherein the pictures comprise a target picture with an explicit scene and an interference picture without an explicit scene; the preset classification labels comprise a plurality of scene labels respectively corresponding to scenes in the target picture and a non-scene label corresponding to the interference picture;
adopting a convolutional neural network as a basic network, and adding a branch convolutional layer in the convolutional neural network layer to construct a network structure of a picture recognition model; the branch convolution layer is used for fusing the multi-level feature maps in the basic network;
inputting a plurality of label pictures into a network structure of the picture recognition model for training to generate the picture recognition model.
Preferably, the step of constructing the network structure of the image recognition model by using the convolutional neural network as a base network and adding a branch convolutional layer in the convolutional neural network layer includes:
the method comprises the following steps: a wide ResNet-50 (a convolutional neural network, hereinafter wide resnet50) is used as the base network, where the wide resnet50 comprises a first output layer, a second output layer, a third output layer, a fourth output layer, a fifth output layer and a fully connected layer;
adding a first branch convolution layer, a second branch convolution layer and a third branch convolution layer at the output end of the second output layer, the output end of the third output layer and the output end of the fourth output layer respectively to construct a network structure of the picture identification model;
the first branch convolutional layer is used for receiving the feature map output by the second output layer and transforming the size and channel number of that feature map to obtain the feature map output by the first branch convolutional layer;
the second branch convolutional layer is used for receiving a first fused feature map and transforming the size and the channel number of the first fused feature map to obtain a feature map output by the second branch convolutional layer, wherein the first fused feature map is obtained by fusing the feature map output by the first branch convolutional layer with the feature map output by the third output layer;
the third branch convolutional layer is used for receiving a second fused feature map and transforming the size and the channel number of the second fused feature map to obtain a feature map output by the third branch convolutional layer, wherein the second fused feature map is obtained by fusing the feature map output by the second branch convolutional layer with the feature map output by the fourth output layer;
the full connection layer is used for receiving a third fused feature map, wherein the third fused feature map is obtained by fusing the feature map output by the third branch convolution layer with the feature map output by the fifth output layer;
the feature map output by the first branch convolutional layer has the same size and channel number as the feature map output by the third output layer; the feature map output by the second branch convolutional layer has the same size and channel number as the feature map output by the fourth output layer; and the feature map output by the third branch convolutional layer has the same size and channel number as the feature map output by the fifth output layer.
Preferably, the step of inputting a plurality of labeled pictures into the network structure of the picture recognition model for training to generate the picture recognition model further includes:
obtaining a loss value of a network structure of the image recognition model by using the weighted cross entropy loss as a main loss function and the Ring loss as an auxiliary loss function;
optimizing parameters of the network structure of the picture recognition model by adopting a momentum random gradient descent algorithm based on the loss value of the network structure of the picture recognition model to obtain optimized parameters of the model;
and setting the learning rate by adopting a transfer learning method, and adjusting the optimization parameters of the model so as to enable the accuracy of the image recognition model to reach a preset threshold value.
The invention also provides a picture identification method, which comprises the following steps:
acquiring a picture to be identified;
inputting the picture to be identified into a picture identification model for identification so as to obtain a probability value of the label picture;
the picture recognition model is generated by using the training method of the picture recognition model as described above.
The invention also provides a training system of the picture recognition model, which comprises the following components:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a plurality of pictures and labeling the pictures according to preset classification labels to obtain label pictures so as to construct an initial training data set; wherein the pictures comprise a target picture with an explicit scene and an interference picture without an explicit scene; the preset classification labels comprise a plurality of scene labels respectively corresponding to scenes in the target picture and a non-scene label corresponding to the interference picture;
the building module is used for adopting a convolutional neural network as a basic network and adding a branch convolutional layer in the convolutional neural network layer so as to build a network structure of the image recognition model; the branch convolution layer is used for fusing the multi-level feature maps in the basic network;
and the generating module is used for inputting the plurality of label pictures into a network structure of the picture recognition model for training so as to generate the picture recognition model.
Preferably, the building module is configured to use the wide resnet50 as a basic network, and the wide resnet50 includes a first output layer, a second output layer, a third output layer, a fourth output layer, a fifth output layer, and a full connection layer;
adding a first branch convolution layer, a second branch convolution layer and a third branch convolution layer at the output end of the second output layer, the output end of the third output layer and the output end of the fourth output layer respectively to construct a network structure of the picture identification model;
the first branch convolutional layer is used for receiving the feature map output by the second output layer and transforming the size and channel number of that feature map to obtain the feature map output by the first branch convolutional layer;
the second branch convolutional layer is used for receiving a first fused feature map and transforming the size and the channel number of the first fused feature map to obtain a feature map output by the second branch convolutional layer, wherein the first fused feature map is obtained by fusing the feature map output by the first branch convolutional layer with the feature map output by the third output layer;
the third branch convolutional layer is used for receiving a second fused feature map and transforming the size and the channel number of the second fused feature map to obtain a feature map output by the third branch convolutional layer, wherein the second fused feature map is obtained by fusing the feature map output by the second branch convolutional layer with the feature map output by the fourth output layer;
the full connection layer is used for receiving a third fused feature map, wherein the third fused feature map is obtained by fusing the feature map output by the third branch convolution layer with the feature map output by the fifth output layer;
the feature map output by the first branch convolutional layer has the same size and channel number as the feature map output by the third output layer; the feature map output by the second branch convolutional layer has the same size and channel number as the feature map output by the fourth output layer; and the feature map output by the third branch convolutional layer has the same size and channel number as the feature map output by the fifth output layer.
Preferably, the generating module includes:
a loss value obtaining unit, configured to obtain a loss value of a network structure of the picture identification model by using the weighted cross entropy loss as a main loss function and Ring loss as an auxiliary loss function;
the optimization parameter obtaining unit is used for optimizing the parameters of the network structure of the picture recognition model by adopting a momentum random gradient descent algorithm based on the loss value of the network structure of the picture recognition model so as to obtain the optimization parameters of the model;
and the adjusting unit is used for setting the learning rate by adopting a transfer learning method and adjusting the optimization parameters of the model so as to enable the accuracy of the image recognition model to reach a preset threshold value.
The invention also provides a picture recognition system, which comprises:
the image acquisition module is used for acquiring an image to be identified;
the input module is used for inputting the picture to be identified into a picture identification model for identification so as to obtain the probability value of the label picture;
the picture recognition model is generated using a training system of recognition models as previously described.
The invention further provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the training method of the picture recognition model described above or the picture recognition method described above.
The invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the training method of the picture recognition model described above or of the picture recognition method described above.
The positive progress effects of the invention are as follows:
the invention discloses a training method, an identification method, a system, equipment and a medium of a picture identification model, wherein a convolutional neural network is used as a basic network, branch convolutional layers are added in the hierarchy of the basic network to construct a network structure of the picture identification model, the network structure of the picture identification model is trained to generate the picture identification model, a picture is identified by using the picture identification model, and the picture identification model can realize multi-stage feature fusion of pictures in a basic classification network by adding the branch convolutional layers, so that automatic identification of open scene pictures is realized, and the identification efficiency is further improved.
Drawings
Fig. 1 is a flowchart of a training method of a picture recognition model according to embodiment 1 of the present invention;
FIG. 2 is a flowchart of step S102 according to embodiment 1 of the present invention;
FIG. 3 is a flowchart of a network structure of a picture recognition model according to embodiment 1 of the present invention;
FIG. 4 is a flowchart of step S103 according to embodiment 1 of the present invention;
fig. 5 is a flowchart of a picture recognition method according to embodiment 2 of the present invention;
FIG. 6 is a block diagram of a training system for a picture recognition model according to embodiment 3 of the present invention;
fig. 7 is a block diagram of a generation module 3 according to embodiment 3 of the present invention;
fig. 8 is a schematic block diagram of a picture recognition system according to embodiment 4 of the present invention;
fig. 9 is a schematic structural diagram of an electronic device according to embodiment 5 of the present invention.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.
Embodiment 1
As shown in fig. 1, the embodiment discloses a training method of an image recognition model, which includes the following steps:
Step S101, obtaining a plurality of pictures and labeling them according to preset classification labels to obtain labeled pictures, so as to construct an initial training data set; wherein the pictures comprise a target picture with an explicit scene and an interference picture without an explicit scene; the preset classification labels comprise a plurality of scene labels respectively corresponding to scenes in the target picture and a non-scene label corresponding to the interference picture. For example, when the explicit scene is a natural-landform scene, the scene labels may be four categories such as grassland wetland, desert Gobi, Danxia landform and karst landform, and the non-scene label is "other". Specifically, natural-landform scene data and "other" picture data may be collected separately in various ways, including data gathered with crawler technology, related data accumulated by the platform in the past, and manually supplemented labeled data.
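As an illustration of step S101 (not part of the patent itself), the labeled data set can be indexed with a small Python helper. The class names and the one-directory-per-class layout below are assumptions made for the sketch; the patent only specifies N scene labels plus one "other" label.

```python
# Sketch of building the initial training-set index described in step S101.
# CLASS_NAMES and the directory layout are illustrative assumptions.
from pathlib import Path

# N = 4 scene labels plus the "other" (non-scene) label -> N + 1 classes
CLASS_NAMES = ["grassland_wetland", "desert_gobi", "danxia", "karst", "other"]
LABEL_TO_INDEX = {name: i for i, name in enumerate(CLASS_NAMES)}

def build_dataset_index(root: str) -> list[tuple[str, int]]:
    """Pair each picture path with its integer class label, assuming the
    pictures are stored in one sub-directory per class under `root`."""
    samples = []
    for class_name, label in LABEL_TO_INDEX.items():
        for path in sorted(Path(root, class_name).glob("*.jpg")):
            samples.append((str(path), label))
    return samples
```

The resulting (path, label) pairs can then feed any standard data loader during training.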
Step S102, a convolutional neural network is adopted as a basic network, and a branch convolutional layer is added in the convolutional neural network layer to construct a network structure of a picture recognition model; the branch convolution layer is used for fusing the multi-level feature maps in the basic network; in this embodiment, a network structure may be defined by using an open-source PyTorch (a machine learning library) deep learning framework.
In this embodiment, the branch convolution layer may transform the size and the channel of the feature picture, so as to fuse the multi-level feature maps in the base network.
Step S103, inputting a plurality of label pictures into a network structure of the picture recognition model for training so as to generate the picture recognition model.
In this embodiment, the picture recognition model may be deployed as a service interface.
As shown in fig. 2, in this embodiment, step S102 includes:
step S1021, adopting wide resnet50 as a basic network, wherein the wide resnet50 comprises a first output layer, a second output layer, a third output layer, a fourth output layer, a fifth output layer and a full connection layer;
step S1022, respectively adding a first branch convolution layer, a second branch convolution layer, and a third branch convolution layer at the output end of the second output layer, the output end of the third output layer, and the output end of the fourth output layer, so as to construct a network structure of the image recognition model;
as shown in fig. 3, in the present embodiment, a picture (Image) is input to a first output layer (Conv1), and a first output feature map is output through processing of the first output layer; inputting the first output profile to a second output layer (Conv2_ x), the second output layer processing the first output profile and outputting a second output profile; inputting the second output characteristic diagram into a first branch convolution layer (3 × 3Conv) and a third output layer (Conv3_ x), wherein the first branch convolution layer transforms the size and the number of channels of the second output characteristic diagram to output a first branch characteristic diagram, the third output layer processes the second output characteristic diagram to output a third output characteristic diagram, the size and the number of channels of the first branch characteristic diagram and the third output characteristic diagram are the same, and the first branch characteristic diagram and the third output characteristic diagram are fused to obtain a first fused characteristic diagram; inputting the first fused feature map and the third output feature map into a second branch convolutional layer (3 × 3Conv) and a fourth output layer (Conv4_ x), respectively, converting the first fused feature map by the second branch convolutional layer according to the size and the number of channels to output a second branch feature map, processing the third output feature map by the fourth output layer to obtain a fourth output feature map, wherein the size and the number of channels of the second branch feature map and the fourth output feature map are the same, and fusing the second branch feature map and the fourth output feature map to obtain a second fused feature map; inputting the second fused feature map and the fourth output feature map into a third branch convolutional layer (3 × 3Conv) and a fifth output layer (Conv5_ x), respectively, converting the size and the number of 
channels of the second fused feature map by the third branch convolutional layer to output a third branch feature map, processing the fourth output feature map by the fifth output layer to obtain a fifth output feature map, wherein the size and the number of channels of the third branch feature map and the fifth output feature map are the same, and fusing the third branch feature map and the fifth output feature map to obtain a third fused feature map; and inputting the third fusion feature map into a full connection layer (FC) after an Average pooling (Average pool) operation, and processing the full connection layer through a Softmax operation to obtain a probability value of the corresponding label picture of the picture.
Specifically, the size of the picture input to the first output layer may be limited to 224 × 224, and the number of output nodes of the fully connected layer is set to N + 1, i.e., N scene labels plus 1 "other" label. The first to fifth output layers are denoted Conv1, Conv2_x, Conv3_x, Conv4_x and Conv5_x, respectively, and the feature maps output by the second to fifth output layers are denoted F2_out, F3_out, F4_out and F5_out. A 3 × 3 convolution is used as each branch convolutional layer to adjust the size and channel number of the feature maps, so that feature maps from different levels can be fused; classification is then performed on the fused feature map Fout, according to the following formulas:
Fout = F5_out + f3×3(F4_out + f3×3(F3_out + f3×3(F2_out)))
Result = Soft_max(FC(Avg_pool(Fout)))
where f3×3 denotes a 3 × 3 convolution, Soft_max denotes the Softmax operation, FC denotes the fully connected layer, and Avg_pool denotes average pooling.
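The fusion formula above can be sketched in PyTorch, the framework this embodiment names. This is an illustration rather than the patent's implementation: the channel counts and spatial sizes assume a standard (wide) ResNet-50 backbone with a 224 × 224 input, the backbone itself is omitted, and random tensors stand in for the Conv2_x to Conv5_x outputs.

```python
# Minimal PyTorch sketch of Fout = F5 + f3x3(F4 + f3x3(F3 + f3x3(F2))).
# Channel counts match a (wide) ResNet-50; random tensors stand in for
# the backbone's stage outputs.
import torch
import torch.nn as nn

class BranchFusionHead(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # Each branch is a stride-2 3x3 convolution that matches the next
        # stage's spatial size and channel count, so the maps can be added.
        self.branch1 = nn.Conv2d(256, 512, 3, stride=2, padding=1)
        self.branch2 = nn.Conv2d(512, 1024, 3, stride=2, padding=1)
        self.branch3 = nn.Conv2d(1024, 2048, 3, stride=2, padding=1)
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, num_classes)  # N scene labels + 1 "other"

    def forward(self, f2, f3, f4, f5):
        fused = f3 + self.branch1(f2)            # first fused feature map
        fused = f4 + self.branch2(fused)         # second fused feature map
        fused = f5 + self.branch3(fused)         # third fused feature map
        return self.fc(torch.flatten(self.avgpool(fused), 1))

head = BranchFusionHead(num_classes=5)
# Stage-output shapes for a 224x224 input to (wide) ResNet-50:
f2 = torch.randn(1, 256, 56, 56)   # Conv2_x output
f3 = torch.randn(1, 512, 28, 28)   # Conv3_x output
f4 = torch.randn(1, 1024, 14, 14)  # Conv4_x output
f5 = torch.randn(1, 2048, 7, 7)    # Conv5_x output
logits = head(f2, f3, f4, f5)      # one logit per class
```

Applying Softmax to `logits` then yields the per-label probabilities, as in the Result formula.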
As shown in fig. 4, step S103 includes:
step S1031, using the weighted cross entropy loss as a main loss function and Ring loss as an auxiliary loss function to obtain a loss value of the network structure of the image recognition model;
in this embodiment, the model output is Y = {y1, y2, ..., y_{N+1}} and the class weights are W = {w1, w2, ..., w_{N+1}}, whose values are set according to the proportion of each class's samples in the training set. The weighted cross-entropy loss over the N scene labels and the "other" samples is denoted loss_ce:
loss_ce = -w_label · log( exp(y_label) / Σ_{j=1..N+1} exp(y_j) )
where label denotes the index of the picture's true category label and is an integer in [1, N+1]; w_label ∈ W is the weight corresponding to the picture's true category label; and y_label ∈ Y is the model output value corresponding to the picture's true category label.
The target modulus (target feature-vector norm) is R, which is initialized with the mean of the feature-vector norms after the first iteration; the Ring loss is denoted loss_rl:
loss_rl = (1 / (2m)) · Σ_{i=1..m} ( ‖F(x_i)‖₂ - R )²
where m is the batch size and F(x_i) is the feature vector of the i-th sample.
The final loss, loss_total, is the weighted sum of the two loss functions:
loss_total = loss_ce + λ · loss_rl
where λ is a weight factor, set to 0.01.
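As an illustration (assumed PyTorch code, not the patent's own implementation), the combined loss above can be written directly from the formulas: weighted cross-entropy as the main loss and Ring loss as the auxiliary loss, with R initialized from the mean feature norm as the text describes.

```python
# Sketch of loss_total = loss_ce + lambda * loss_rl from step S1031.
import torch
import torch.nn.functional as F

def total_loss(logits, features, targets, class_weights, R, lam=0.01):
    # Weighted cross-entropy over N scene labels + 1 "other" class.
    loss_ce = F.cross_entropy(logits, targets, weight=class_weights)
    # Ring loss: pull every feature-vector L2 norm toward the target norm R.
    norms = features.norm(p=2, dim=1)
    loss_rl = ((norms - R) ** 2).mean() / 2.0
    return loss_ce + lam * loss_rl

logits = torch.randn(8, 5)          # batch of 8, N + 1 = 5 classes
features = torch.randn(8, 2048)     # feature vectors before the FC layer
targets = torch.randint(0, 5, (8,))
class_weights = torch.tensor([1.0, 1.0, 1.0, 1.0, 0.5])  # assumed values
R = features.norm(p=2, dim=1).mean().detach()  # init R as in the text
loss = total_loss(logits, features, targets, class_weights, R)
```

In training, `R` would typically be a learnable scalar parameter rather than a fixed tensor, so the optimizer can adjust the target norm together with the weights.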
S1032, optimizing parameters of the network structure of the picture recognition model by adopting a momentum random gradient descent algorithm based on the loss value of the network structure of the picture recognition model to obtain optimized parameters of the model;
in this embodiment, the back-propagation of the loss uses a momentum-based stochastic gradient descent method with momentum factor momentum = 0.9.
Step S1033, setting a learning rate by using a transfer learning method, and adjusting an optimization parameter of the model, so that an accuracy of the image recognition model reaches a preset threshold.
In this embodiment, transfer learning is performed from an open-source model trained on the public scene-classification dataset Places365, loading the pre-trained weights of the base network except for the fully connected layer and the feature-fusion branches. The weight parameters of the added branch convolutional layers and the fully connected layer are trained with an initial learning rate of 0.01; the pre-trained weight parameters in conv2_x, conv3_x, conv4_x and conv5_x are fine-tuned, with initial learning rates of 0.001 for conv2_x and conv3_x and 0.002 for conv4_x and conv5_x; the parameters in the other layers are frozen and not updated. During training, the learning rates are halved every 5 iterations.
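The schedule above maps naturally onto PyTorch optimizer parameter groups. The sketch below is illustrative: the layer names and the tiny stand-in modules are assumptions, but the per-group learning rates, the momentum factor, and the halve-every-5-iterations rule follow the text.

```python
# Sketch of the fine-tuning schedule: momentum SGD with per-group learning
# rates and the rate halved every 5 iterations.  Module names are assumed.
import torch
import torch.nn as nn

model = nn.ModuleDict({
    "conv2_x": nn.Conv2d(3, 8, 3), "conv3_x": nn.Conv2d(8, 8, 3),
    "conv4_x": nn.Conv2d(8, 8, 3), "conv5_x": nn.Conv2d(8, 8, 3),
    "branches": nn.Conv2d(8, 8, 3), "fc": nn.Linear(8, 5),
})
param_groups = [
    # Newly added branch convolutions and the FC layer: lr = 0.01.
    {"params": list(model["branches"].parameters())
               + list(model["fc"].parameters()), "lr": 0.01},
    # Early pre-trained stages, fine-tuned gently: lr = 0.001.
    {"params": list(model["conv2_x"].parameters())
               + list(model["conv3_x"].parameters()), "lr": 0.001},
    # Later pre-trained stages: lr = 0.002.
    {"params": list(model["conv4_x"].parameters())
               + list(model["conv5_x"].parameters()), "lr": 0.002},
]
optimizer = torch.optim.SGD(param_groups, momentum=0.9)
# Halve every group's learning rate every 5 scheduler steps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
```

Frozen layers are simply left out of the parameter groups, so the optimizer never updates them.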
In this embodiment, the model is tested with data from the platform's travel-guide gallery, and its accuracy and recall are evaluated by checking the recognition results. For the misclassified cases, corresponding positive and negative samples are added to the training set, atypical samples that hinder model training are removed, the cross-entropy weights W are updated, and the model is retrained. Multiple rounds of data iteration are repeated until the model's accuracy meets production requirements, at which point training stops.
This embodiment discloses a training method for a picture recognition model. A convolutional neural network is adopted as the base network, and branch convolutional layers are added between the levels of the base network to construct the network structure of the picture recognition model; this network structure is trained to generate the picture recognition model. By adding the branch convolutional layers, the picture recognition model achieves multi-level feature fusion within a basic classification network, thereby realizing automatic recognition of open-scene pictures and improving recognition efficiency. In addition, during model training, the weighted cross-entropy loss is used as the main loss function and Ring loss as the auxiliary loss function to obtain the loss value of the network structure; the parameters of the network structure are optimized with a momentum-based stochastic gradient descent algorithm, and a transfer-learning method is used to set the learning rates, so that the accuracy of the model increases.
Embodiment 2
As shown in fig. 5, the present embodiment discloses a picture identification method, which includes the following steps:
step S201, obtaining a picture to be identified;
step S202, inputting the picture to be identified into a picture identification model for identification so as to obtain a probability value of the label picture;
in this embodiment, the image recognition model is generated by using the aforementioned training method of the image recognition model.
Embodiment 3
As shown in fig. 6, the present embodiment discloses a training system for a picture recognition model, where the training system includes:
the system comprises a first acquisition module 1, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a plurality of pictures, labeling the pictures according to preset classification labels to obtain label pictures so as to construct an initial training data set; wherein the pictures comprise a target picture with an explicit scene and an interference picture without an explicit scene; the preset classification labels comprise a plurality of scene labels respectively corresponding to scenes in the target picture and a non-scene label corresponding to the interference picture; for example, when the clear scene is a natural landform scene, the scene label may be 4 categories such as grassland wetland, desert gobi, danxia landform, karst landform, and the non-scene label is "other". Specifically, natural geomorphic scene data and "other" picture data may be collected separately in various ways, including data collected using crawler technology, related data accumulated by the platform in the past, and artificially supplemented labeled data.
The building module 2 is used for adopting a convolutional neural network as a basic network, and adding a branch convolutional layer in the convolutional neural network layer to build a network structure of the picture recognition model; the branch convolution layer is used for fusing the multi-level feature maps in the basic network; in this embodiment, the network structure may be defined by using an open-source PyTorch deep learning framework.
In this embodiment, the branch convolution layer may transform the size and the channel of the feature picture, so as to fuse the multi-level feature maps in the base network.
And the generating module 3 is configured to input the plurality of label pictures into a network structure of the picture recognition model for training to generate the picture recognition model.
In this embodiment, the picture recognition model may be deployed as a service interface.
In this embodiment, the building module 2 is configured to use a Wide ResNet-50 as the basic network, where the Wide ResNet-50 includes a first output layer, a second output layer, a third output layer, a fourth output layer, a fifth output layer and a full connection layer;
respectively adding a first branch convolution layer, a second branch convolution layer and a third branch convolution layer at the output end of the second output layer, the output end of the third output layer and the output end of the fourth output layer to construct a network structure of the picture identification model;
as shown in fig. 3, the picture (Image) is input to the first output layer (Conv1), which processes it and outputs the first output feature map; the first output feature map is input to the second output layer (Conv2_x), which processes it and outputs the second output feature map; the second output feature map is input to the first branch convolution layer (3×3 Conv) and the third output layer (Conv3_x): the first branch convolution layer transforms the size and the number of channels of the second output feature map to output the first branch feature map, and the third output layer processes the second output feature map to output the third output feature map; the first branch feature map and the third output feature map have the same size and number of channels, and are fused to obtain the first fused feature map. The first fused feature map and the third output feature map are input to the second branch convolution layer (3×3 Conv) and the fourth output layer (Conv4_x), respectively: the second branch convolution layer transforms the size and the number of channels of the first fused feature map to output the second branch feature map, and the fourth output layer processes the third output feature map to obtain the fourth output feature map; the second branch feature map and the fourth output feature map have the same size and number of channels, and are fused to obtain the second fused feature map. The second fused feature map and the fourth output feature map are input to the third branch convolution layer (3×3 Conv) and the fifth output layer (Conv5_x), respectively: the third branch convolution layer transforms the size and the number of channels of the second fused feature map to output the third branch feature map, and the fifth output layer processes the fourth output feature map to obtain the fifth output feature map; the third branch feature map and the fifth output feature map have the same size and number of channels, and are fused to obtain the third fused feature map. The third fused feature map is passed through an Average pooling (Average pool) operation and input to the full connection layer (FC), whose output is processed by a Softmax operation to obtain the probability values of the picture for the corresponding labels.
Specifically, the size of the picture input to the first output layer may be limited to 224×224, and the number of output nodes of the full connection layer is set to N+1, i.e., N scene labels plus 1 "other" label. The first output layer, the second output layer, the third output layer, the fourth output layer and the fifth output layer are recorded as the conv1, conv2_x, conv3_x, conv4_x and conv5_x layers respectively; the feature maps output by the second, third, fourth and fifth output layers are recorded as F2_out, F3_out, F4_out and F5_out. A 3×3 convolution is used as each branch convolution to adjust the size and the number of channels of the feature map, so that features of different levels can be conveniently fused, and classification is performed based on the fused feature map Fout, according to the following formulas:
Fout = F5_out + f3×3(F4_out + f3×3(F3_out + f3×3(F2_out)))

Result = Soft_max(FC(Avg_pool(Fout)))

wherein f3×3 represents a 3×3 convolution, Soft_max represents the Softmax operation, FC represents the full connection layer, and Avg_pool represents average pooling.
As shown in fig. 7, the generation module 3 includes:
a loss value obtaining unit 31, configured to obtain a loss value of the network structure of the picture recognition model by using the weighted cross entropy loss as a main loss function and Ring loss as an auxiliary loss function;
in this embodiment, the model output is Y ═ Y1,y2,...,yN+1W ═ W1,w2,...,wN+1And the value is based on the proportion of various sample numbers in the training set, and the cross entropy loss between the N scene labels and other samples is expressed as lossce
Figure BDA0002993011810000131
Wherein label represents the sequence number of the real category label of the picture, and the value range is [1, N +1 ]]An integer of (d); w is alabelE is W and is the weight corresponding to the real category label of the picture; y islabelAnd E is Y, and is the model output value corresponding to the picture real category label.
The target modulus length is R, and R is initialized with the mean value of the feature vector modulus lengths after the first iteration. Ring loss is expressed as lossrl:

lossrl = (1/(2m)) Σi=1..m (‖Fi‖2 − R)²

wherein m is the number of samples in a batch and Fi is the feature vector of the i-th sample.
The ultimate losstotal is a weighted sum of the two loss functions:

losstotal = lossce + λ · lossrl

wherein λ is a weight factor with a value of 0.01.
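The combined loss can be sketched as follows (note that PyTorch class indices are 0-based, unlike the 1-based label numbering of the embodiment; all tensors below are random stand-ins):

```python
import torch
import torch.nn.functional as F

def total_loss(logits, labels, class_weights, feats, R, lam=0.01):
    """loss_total = loss_ce + lambda * loss_rl: weighted cross entropy as the
    main loss, Ring loss (mean squared gap between feature-vector norms and
    the learnable target modulus R) as the auxiliary loss."""
    loss_ce = F.cross_entropy(logits, labels, weight=class_weights)
    loss_rl = ((feats.norm(dim=1) - R) ** 2).mean() / 2.0
    return loss_ce + lam * loss_rl

R = torch.nn.Parameter(torch.tensor(1.0))   # learnable target modulus length
logits = torch.rand(8, 5)                   # stand-in model outputs
labels = torch.randint(0, 5, (8,))          # stand-in 0-based category ids
weights = torch.ones(5)                     # per-class weights from sample proportions
feats = torch.rand(8, 2048)                 # stand-in feature vectors
loss = total_loss(logits, labels, weights, feats, R)
```

Because R is a `Parameter`, the gradient of the Ring-loss term flows into R as well, so the target modulus is learned jointly with the network weights.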
An optimized parameter obtaining unit 32, configured to optimize a parameter of the network structure of the picture recognition model by using a random gradient descent algorithm of momentum based on a loss value of the network structure of the picture recognition model to obtain an optimized parameter of the model;
in this embodiment, the back propagation of the loss adopts a momentum-based stochastic gradient descent method, with the momentum factor momentum = 0.9.
The adjusting unit 33 is configured to set a learning rate by using a transfer learning method, and adjust an optimization parameter of the model, so that the accuracy of the image recognition model reaches a preset threshold.
In this embodiment, transfer learning is performed based on an open-source model trained on the public scene classification dataset Places365, and the pre-training weights of the basic network, other than the full connection layer and the feature-fusion branches, are loaded. The weight parameters in the added branch convolution layers and the full connection layer of the basic network are trained with an initial learning rate of 0.01; the pre-training weight parameters in conv2_x, conv3_x, conv4_x and conv5_x are fine-tuned, with the initial learning rates of conv2_x and conv3_x set to 0.001 and those of conv4_x and conv5_x set to 0.002; the parameters in the other layers are frozen and not updated. During training, the learning rate is halved every 5 iterations.
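The per-layer learning-rate setup with momentum SGD and the halve-every-5-iterations schedule can be sketched as follows (the module names follow the conv2_x…conv5_x naming above; the dummy model exists only to make the sketch runnable on its own):

```python
import torch
import torch.nn as nn

def build_optimizer(model):
    """Momentum SGD with per-layer learning rates for transfer learning."""
    groups = [
        # newly added layers: branch convolutions and the full connection layer
        {"params": model.fc.parameters(), "lr": 0.01},
        {"params": (list(model.branch1.parameters())
                    + list(model.branch2.parameters())
                    + list(model.branch3.parameters())), "lr": 0.01},
        # fine-tuned pre-trained stages
        {"params": model.conv2_x.parameters(), "lr": 0.001},
        {"params": model.conv3_x.parameters(), "lr": 0.001},
        {"params": model.conv4_x.parameters(), "lr": 0.002},
        {"params": model.conv5_x.parameters(), "lr": 0.002},
        # conv1 is omitted on purpose: its parameters stay frozen.
    ]
    opt = torch.optim.SGD(groups, lr=0.01, momentum=0.9)
    # halve every group's learning rate every 5 iterations
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.5)
    return opt, sched

class Dummy(nn.Module):
    """Stand-in exposing the named submodules so the sketch runs on its own."""
    def __init__(self):
        super().__init__()
        for name in ("conv2_x", "conv3_x", "conv4_x", "conv5_x",
                     "branch1", "branch2", "branch3", "fc"):
            setattr(self, name, nn.Linear(4, 4))

opt, sched = build_optimizer(Dummy())
```

Freezing is achieved simply by leaving a layer's parameters out of the optimizer's parameter groups (optionally also setting `requires_grad = False` to save computation).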
In this embodiment, data in the image library is used for testing, and the accuracy and recall rate of the model are evaluated by checking the identification results. For the misclassified cases, corresponding positive and negative samples are supplemented to the training set, atypical samples that are not beneficial to model training are removed, the cross entropy weight W is updated, and the model is retrained. Multiple rounds of data iteration are repeated until the accuracy of the model meets the production requirement, at which point training stops.
The embodiment discloses a training system of a picture recognition model, which adopts a convolutional neural network as a basic network, adds branch convolutional layers in the hierarchy of the basic network, constructs a network structure of the picture recognition model, trains the network structure of the picture recognition model to generate the picture recognition model, and can realize multi-stage feature fusion of pictures in a basic classification network by adding the branch convolutional layers, so that automatic recognition of open scene pictures is realized, and the recognition efficiency is improved. In addition, during model training, weighted cross entropy loss is used as a main loss function, Ring loss is used as an auxiliary loss function, so that a loss value of a network structure of the image recognition model is obtained, parameters of the network structure of the image recognition model are optimized by adopting a momentum random gradient descent algorithm, a transfer learning method is adopted, and the learning rate is set, so that the accuracy of the model is increased.
Example 4
As shown in fig. 8, the present embodiment discloses a picture recognition system, which includes: the picture acquisition module 4 is used for acquiring a picture to be identified;
the input module 5 is used for inputting the picture to be identified into a picture identification model for identification so as to obtain a probability value of the label picture;
the picture recognition model is generated by using the training system of the picture recognition model.
Example 5
Fig. 9 is a schematic structural diagram of an electronic device according to embodiment 5 of the present invention. The electronic device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor implements the training method of the picture recognition model provided by embodiment 1 or the picture recognition method provided by embodiment 2 when executing the program. The electronic device 40 shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 9, the electronic device 40 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of electronic device 40 may include, but are not limited to: the at least one processor 41, the at least one memory 42, and a bus 43 connecting the various system components (including the memory 42 and the processor 41).
The bus 43 includes a data bus, an address bus, and a control bus.
The memory 42 may include volatile memory, such as random access memory (RAM) 421 and/or cache memory 422, and may further include read-only memory (ROM) 423.
Memory 42 may also include a program/utility 425 having a set (at least one) of program modules 424, such program modules 424 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The processor 41 executes various functional applications and data processing, such as a training method of a picture recognition model provided in embodiment 1 of the present invention or a picture recognition method provided in embodiment 2, by running a computer program stored in the memory 42.
The electronic device 40 may also communicate with one or more external devices 44 (e.g., a keyboard, a pointing device, etc.). Such communication may occur through an input/output (I/O) interface 45. Also, the electronic device 40 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) via a network adapter 46. As shown, the network adapter 46 communicates with the other modules of the electronic device 40 over the bus 43. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 40, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, etc.
It should be noted that although in the above detailed description several units/modules or sub-units/modules of the electronic device are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module according to embodiments of the invention. Conversely, the features and functions of one unit/module described above may be further divided into embodiments by a plurality of units/modules.
Example 6
The present embodiment provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor, implements the steps of the training method of the picture recognition model provided in embodiment 1 or the picture recognition method provided in embodiment 2.
More specific examples of the readable storage medium may include, but are not limited to: a portable disk, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation manner, the present invention can also be implemented in the form of a program product, which includes program code for causing a terminal device to execute the steps in the training method for implementing the picture recognition model provided in embodiment 1 or the picture recognition method provided in embodiment 2 when the program product runs on the terminal device.
Program code for carrying out the invention may be written in any combination of one or more programming languages, and may be executed entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims (10)

1. A training method of a picture recognition model is characterized by comprising the following steps:
acquiring a plurality of pictures, labeling the pictures according to preset classification labels to obtain labeled pictures so as to construct an initial training data set; wherein the pictures comprise a target picture with an explicit scene and an interference picture without an explicit scene; the preset classification labels comprise a plurality of scene labels respectively corresponding to scenes in the target picture and a non-scene label corresponding to the interference picture;
adopting a convolutional neural network as a basic network, and adding a branch convolutional layer in the convolutional neural network layer to construct a network structure of a picture recognition model; the branch convolution layer is used for fusing the multi-level feature maps in the basic network;
inputting a plurality of label pictures into a network structure of the picture recognition model for training to generate the picture recognition model.
2. The method for training the picture recognition model according to claim 1, wherein the step of constructing the network structure of the picture recognition model by using the convolutional neural network as a base network and adding branch convolutional layers in the convolutional neural network layer comprises:
the method comprises the following steps that a wide resnet50 is used as a basic network, and a wide resnet50 comprises a first output layer, a second output layer, a third output layer, a fourth output layer, a fifth output layer and a full connection layer;
adding a first branch convolution layer, a second branch convolution layer and a third branch convolution layer at the output end of the second output layer, the output end of the third output layer and the output end of the fourth output layer respectively to construct a network structure of the picture identification model;
the first branch convolutional layer is used for receiving the characteristic diagram output by the second output layer and converting the size and the channel number of the characteristic diagram output by the second output layer to obtain the characteristic diagram output by the first branch convolutional layer;
the second branch convolutional layer is used for receiving a first fused feature map and transforming the size and the channel number of the first fused feature map to obtain a feature map output by the second branch convolutional layer, wherein the first fused feature map is obtained by fusing the feature map output by the first branch convolutional layer with the feature map output by the third output layer;
the third branch convolutional layer is used for receiving a second fused feature map and transforming the size and the channel number of the second fused feature map to obtain a feature map output by the third branch convolutional layer, wherein the second fused feature map is obtained by fusing the feature map output by the second branch convolutional layer with the feature map output by the fourth output layer;
the full connection layer is used for receiving a third fused feature map, wherein the third fused feature map is obtained by fusing the feature map output by the third branch convolution layer with the feature map output by the fifth output layer;
the size and the channel number of the characteristic diagram output by the first branch convolution layer and the characteristic diagram output by the third output layer are the same; the size and the channel number of the characteristic diagram output by the second branch convolution layer and the characteristic diagram output by the fourth output layer are the same; the size and the channel number of the characteristic diagram output by the third branch convolution layer and the characteristic diagram output by the fifth output layer are the same.
3. The method for training the picture recognition model according to claim 1, wherein the step of inputting a plurality of the labeled picture data into the network structure of the picture recognition model for training to generate the picture recognition model further comprises:
obtaining a loss value of a network structure of the image recognition model by using the weighted cross entropy loss as a main loss function and the Ring loss as an auxiliary loss function;
optimizing parameters of the network structure of the picture recognition model by adopting a momentum random gradient descent algorithm based on the loss value of the network structure of the picture recognition model to obtain optimized parameters of the model;
and setting the learning rate by adopting a transfer learning method, and adjusting the optimization parameters of the model so as to enable the accuracy of the image recognition model to reach a preset threshold value.
4. A picture identification method is characterized by comprising the following steps:
acquiring a picture to be identified;
inputting the picture to be identified into a picture identification model for identification so as to obtain a probability value of the label picture;
the picture recognition model is generated using a training method of the picture recognition model according to any one of claims 1 to 3.
5. A training system for a picture recognition model, the training system comprising:
the first acquisition module is used for acquiring a plurality of pictures and labeling the pictures according to preset classification labels to obtain label pictures, so as to construct an initial training data set; wherein the pictures comprise a target picture with an explicit scene and an interference picture without an explicit scene; the preset classification labels comprise a plurality of scene labels respectively corresponding to scenes in the target picture and a non-scene label corresponding to the interference picture;
the building module is used for adopting a convolutional neural network as a basic network and adding a branch convolutional layer in the convolutional neural network layer so as to build a network structure of the image recognition model; the branch convolution layer is used for fusing the multi-level feature maps in the basic network;
and the generating module is used for inputting the plurality of label pictures into a network structure of the picture recognition model for training so as to generate the picture recognition model.
6. The system for training a picture recognition model according to claim 5, wherein the building module is configured to use a Wide ResNet-50 as a basic network, and the Wide ResNet-50 includes a first output layer, a second output layer, a third output layer, a fourth output layer, a fifth output layer and a full connection layer;
adding a first branch convolution layer, a second branch convolution layer and a third branch convolution layer at the output end of the second output layer, the output end of the third output layer and the output end of the fourth output layer respectively to construct a network structure of the picture identification model;
the first branch convolutional layer is used for receiving the characteristic diagram output by the second output layer and converting the size and the channel number of the characteristic diagram output by the second output layer to obtain the characteristic diagram output by the first branch convolutional layer;
the second branch convolutional layer is used for receiving a first fused feature map and transforming the size and the channel number of the first fused feature map to obtain a feature map output by the second branch convolutional layer, wherein the first fused feature map is obtained by fusing the feature map output by the first branch convolutional layer with the feature map output by the third output layer;
the third branch convolutional layer is used for receiving a second fused feature map and transforming the size and the channel number of the second fused feature map to obtain a feature map output by the third branch convolutional layer, wherein the second fused feature map is obtained by fusing the feature map output by the second branch convolutional layer with the feature map output by the fourth output layer;
the full connection layer is used for receiving a third fused feature map, wherein the third fused feature map is obtained by fusing the feature map output by the third branch convolution layer with the feature map output by the fifth output layer;
the size and the channel number of the characteristic diagram output by the first branch convolution layer and the characteristic diagram output by the third output layer are the same; the size and the channel number of the characteristic diagram output by the second branch convolution layer and the characteristic diagram output by the fourth output layer are the same; the size and the channel number of the characteristic diagram output by the third branch convolution layer and the characteristic diagram output by the fifth output layer are the same.
7. The system for training a picture recognition model according to claim 5, wherein the generating module comprises:
a loss value obtaining unit, configured to obtain a loss value of the network structure of the picture identification model by using weighted cross entropy loss as a main loss function and Ring loss as an auxiliary loss function;
the optimization parameter obtaining unit is used for optimizing the parameters of the network structure of the picture recognition model by adopting a momentum random gradient descent algorithm based on the loss value of the network structure of the picture recognition model so as to obtain the optimization parameters of the model;
and the adjusting unit is used for setting the learning rate by adopting a transfer learning method and adjusting the optimization parameters of the model so as to enable the accuracy of the image recognition model to reach a preset threshold value.
8. A picture recognition system, the picture recognition system comprising:
the image acquisition module is used for acquiring an image to be identified;
the input module is used for inputting the picture to be identified into a picture identification model for identification so as to obtain the probability value of the label picture;
the picture recognition model is generated using the training system of the picture recognition model according to any one of claims 5 to 7.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for training a picture recognition model according to any one of claims 1 to 3 or the method for picture recognition according to claim 4 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for training a picture recognition model according to any one of claims 1 to 3 or the method for picture recognition according to claim 4.
CN202110321375.8A 2021-03-25 2021-03-25 Training method, recognition method, system, device and medium of image recognition model Pending CN113065443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110321375.8A CN113065443A (en) 2021-03-25 2021-03-25 Training method, recognition method, system, device and medium of image recognition model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110321375.8A CN113065443A (en) 2021-03-25 2021-03-25 Training method, recognition method, system, device and medium of image recognition model

Publications (1)

Publication Number Publication Date
CN113065443A true CN113065443A (en) 2021-07-02

Family

ID=76563517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110321375.8A Pending CN113065443A (en) 2021-03-25 2021-03-25 Training method, recognition method, system, device and medium of image recognition model

Country Status (1)

Country Link
CN (1) CN113065443A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114461986A (en) * 2022-01-17 2022-05-10 北京快乐茄信息技术有限公司 Method for training identification model and method and device for image identification
CN116821699A (en) * 2023-08-31 2023-09-29 山东海量信息技术研究院 Perception model training method and device, electronic equipment and storage medium
CN116821699B (en) * 2023-08-31 2024-01-19 山东海量信息技术研究院 Perception model training method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111079532B (en) Video content description method based on text self-encoder
US11783227B2 (en) Method, apparatus, device and readable medium for transfer learning in machine learning
CN114241282B (en) Knowledge distillation-based edge equipment scene recognition method and device
CN109948029B (en) Neural network self-adaptive depth Hash image searching method
CN110366734B (en) Optimizing neural network architecture
CN113065013B (en) Image annotation model training and image annotation method, system, equipment and medium
CN110728317A (en) Training method and system of decision tree model, storage medium and prediction method
CN112765373B (en) Resource recommendation method and device, electronic equipment and storage medium
CN112445876A (en) Entity alignment method and system fusing structure, attribute and relationship information
CN113065443A (en) Training method, recognition method, system, device and medium of image recognition model
CN108229986B (en) Feature construction method in information click prediction, information delivery method and device
CN111916144B (en) Protein classification method based on self-attention neural network and coarsening algorithm
CN111709493A (en) Object classification method, training method, device, equipment and storage medium
US20220129747A1 (en) System and method for deep customized neural networks for time series forecasting
CN112001485B (en) Group convolution number searching method and device
CN111882157A (en) Demand prediction method and system based on deep space-time neural network and computer readable storage medium
CN110866564A (en) Season classification method, system, electronic device and medium for multiple semi-supervised images
CN110210540A (en) Across social media method for identifying ID and system based on attention mechanism
CN112633246A (en) Multi-scene recognition method, system, device and storage medium in open scene
CN113254649B (en) Training method of sensitive content recognition model, text recognition method and related device
CN113987236A (en) Unsupervised training method and unsupervised training device for visual retrieval model based on graph convolution network
US20220129790A1 (en) System and method for deep enriched neural networks for time series forecasting
CN112559877A (en) CTR (China railway) estimation method and system based on cross-platform heterogeneous data and behavior context
CN117458440A (en) Method and system for predicting generated power load based on association feature fusion
CN116992151A (en) Online course recommendation method based on double-tower graph convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination