CN112330651A - Logo detection method and system based on deep learning - Google Patents

Logo detection method and system based on deep learning

Info

Publication number
CN112330651A
CN112330651A
Authority
CN
China
Prior art keywords
logo
logo detection
detection model
deep learning
feature
Prior art date
Legal status
Pending
Application number
CN202011266939.4A
Other languages
Chinese (zh)
Inventor
侯素娟
孟晔
侯强
王静
Current Assignee
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date
Filing date
Publication date
Application filed by Shandong Normal University
Priority to CN202011266939.4A
Publication of CN112330651A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Abstract

The invention discloses a Logo detection method and system based on deep learning, comprising the following steps: constructing a Logo detection model based on an improved upsampling operation and loss function; predicting an upsampling kernel from the input feature map, and reassembling the target features according to the predicted kernel to obtain reassembled features; and training the Logo detection model on the reassembled features, then detecting the Logo image to be detected with the trained Logo detection model. Built on a deep learning detection model combined with the new upsampling operation and loss function, the method obtains a larger receptive field during feature reassembly, guides the reassembly process according to the input features, and keeps the whole process lightweight, achieving a better balance between speed and accuracy; problems such as divergence do not occur during training, so the regression process becomes more stable.

Description

Logo detection method and system based on deep learning
Technical Field
The invention relates to the technical field of image processing, in particular to a Logo detection method and system based on deep learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
With the rapid development of computer vision, target detection and recognition technology is widely applied in many areas such as multimedia, traffic and medical imaging. Owing to the rapid development of internet technology, a large amount of picture data is stored on the network, and the information contained in those pictures is very meaningful. For example, the Logo contained in a picture, as a relatively important mark of a brand, plays a very important role in commercial advertising. By accurately identifying an advertiser's product Logo and judging whether it belongs to an illegal product brand, false advertisements can be suppressed, keeping the online marketplace healthy; the time at which a product Logo appears in an advertisement can be monitored to reflect the value of the advertisement placed by the advertiser; and the technology can also help merchants optimize their marketing schemes and quickly recommend products that customers like.
Deep learning methods in the field of target detection fall mainly into three categories: two-stage, one-stage, and anchor-free. Two-stage methods first generate, by an algorithm, a series of candidate boxes serving as samples, and then perform sample classification and target regression through a convolutional neural network; common algorithms include R-CNN, Fast R-CNN, Faster R-CNN, etc., which achieve high accuracy but low speed. One-stage methods convert the target box localization problem directly into a regression problem, without generating candidate boxes; common algorithms include YOLO, SSD, etc., which are fast but less accurate than two-stage methods. Between these two approaches, the former is superior in detection accuracy and localization precision, the latter in speed.
More recently, anchor-free algorithms have also been proposed, realizing detection without anchor boxes and achieving a good balance between speed and accuracy. Image detection is relatively complex, and detecting Logo images is a particular challenge: varied scenes exist in real life, and Logo data is characterized by many categories, inter-class similarity, small targets, deformation, etc. In addition, deep learning places high demands on hardware such as GPUs, and is computationally expensive and time-consuming.
In summary, no effective solution yet balances the accuracy and speed of Logo detection.
Disclosure of Invention
In order to solve the above problems, the present invention provides a Logo detection method and system based on deep learning. Building on a deep learning detection model combined with a new upsampling operation and loss function, a larger receptive field can be obtained when features are reassembled, the reassembly process is guided by the input features, and the whole process is lightweight, achieving a good balance between speed and accuracy; problems such as divergence do not occur during training, so the regression process becomes more stable.
In order to achieve this purpose, the invention adopts the following technical solution:
In a first aspect, the present invention provides a Logo detection method based on deep learning, comprising:
constructing a Logo detection model based on an improved upsampling operation and loss function;
predicting an upsampling kernel from the input feature map, and reassembling the target features according to the predicted kernel to obtain reassembled features;
and training the Logo detection model on the reassembled features, and detecting the Logo image to be detected with the trained Logo detection model.
In a second aspect, the present invention provides a Logo detection system based on deep learning, comprising:
a model building module configured to build a Logo detection model based on the improved upsampling operation and loss function;
an upsampling module configured to predict an upsampling kernel from the input feature map and reassemble the target features according to the predicted kernel to obtain reassembled features;
and a detection module configured to train the Logo detection model on the reassembled features and detect the Logo image to be detected with the trained model.
In a third aspect, the present invention provides an electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the method of the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a new up-sampling operation on a model of pixel-by-pixel point prediction, an up-sampling kernel is predicted by utilizing an input characteristic diagram, and characteristic recombination is carried out based on the predicted up-sampling kernel; the whole operation is lighter, shallow features and deep features of the network are fully used, the shallow features pay more attention to some detailed information, and the method is suitable for regression positioning; while deep features focus more on semantic information and are suitable for classification.
The invention provides a new regression loss function, which fully considers the problems of the distance, the overlapping rate and the scale between a predicted target frame and a real target frame, does not have the problems of divergence and the like in the training process, ensures that the regression process is more stable, and improves the convergence speed and the regression precision.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain, not to limit, the invention.
Fig. 1 is a flowchart of a Logo detection method based on deep learning according to embodiment 1 of the present invention;
fig. 2 is a Logo detection model framework diagram based on deep learning according to embodiment 1 of the present invention;
fig. 3(a) - (c) are exemplary diagrams of cases in which the GIoU and IoU values are equal, provided in embodiment 1 of the present invention.
Detailed description of embodiments:
the invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example 1
As shown in fig. 1, the embodiment provides a Logo detection method based on deep learning, including:
S1: constructing a Logo detection model based on an improved upsampling operation and loss function;
S2: predicting an upsampling kernel from the input feature map, and reassembling the target features according to the predicted kernel to obtain reassembled features;
S3: training the Logo detection model on the reassembled features, and detecting the Logo image to be detected with the trained Logo detection model.
In this embodiment, crawler technology is used to crawl Logo image data of various kinds from multiple websites. Specifically, keywords such as clothing brands, food brands and electronics brands are entered, several brands are obtained for each category, and the Logo images of each brand are crawled from websites by crawler technology.
In this embodiment, the obtained Logo images are cleaned and labeled. Specifically, cleaning the collected Logo images requires manual inspection: Logo images of poor quality (e.g. particularly blurry or severely deformed), erroneous Logo images whose Logo does not match the class name, and Logo images with excessively high pixel counts are deleted.
The Logo in each image is then annotated with labeling software; any image of poor quality found at this stage is deleted directly, ensuring the diversity and integrity of the data.
In this embodiment, an anchor-free Logo detection framework is constructed, comprising a feature extraction module, a multi-scale fusion module and a loss function module, to realize a Logo detection model with good speed and accuracy. As shown in fig. 2, the Logo detection model framework is specifically as follows:
Feature extraction: this embodiment adopts a ResNet-50 classification network with its fully connected layer removed as the backbone network for feature extraction.
Multi-scale fusion: this embodiment adopts an improved feature pyramid network (FPN) to extract features of different scales from different layers of the network and then fuses them to realize multi-scale detection.
The feature pyramid FPN uses the p3 to p7 layers to address the multi-scale problem in Logo detection and is divided into a bottom-up process and a top-down process. The bottom-up process, i.e. the forward propagation of the network, obtains progressively smaller feature maps through a series of convolution operations; in the top-down process, feature maps carrying richer semantic, pixel-level information are passed down to the lower layers by upsampling. Features of different resolutions and different semantic strengths are thus fused in each layer's feature map, allowing Logos of different sizes to be detected and recognized; with simple network connections, the FPN greatly improves Logo detection performance.
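The top-down fusion step can be sketched as follows (an illustrative simplification on plain 2-D lists, not the patent's implementation; a real FPN operates on C-channel tensors and applies 1x1 lateral convolutions, which are omitted here):

```python
# Illustrative sketch: the FPN top-down step fuses an upsampled coarse
# (semantic) map with a lateral (detailed) map by element-wise addition.

def upsample_nearest_2x(fmap):
    """2x nearest-neighbor upsampling of a 2-D feature map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out

def fpn_fuse(top, lateral):
    """Top-down FPN fusion: upsample the coarser map, then add element-wise."""
    up = upsample_nearest_2x(top)
    return [[u + l for u, l in zip(ur, lr)] for ur, lr in zip(up, lateral)]

coarse = [[1.0, 2.0],
          [3.0, 4.0]]                    # e.g. a low-resolution, semantic map
lateral = [[0.1] * 4 for _ in range(4)]  # e.g. a lateral projection one level down
fused = fpn_fuse(coarse, lateral)        # 4 x 4 map mixing both signals
```

In the model described above this fusion happens at every pyramid level, so each level's map carries both resolution detail and semantics.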
In multi-scale fusion, the upsampling of features is critical. The most widely used feature upsampling operators at present are nearest-neighbor interpolation, bilinear interpolation and deconvolution. The first two use the spatial distance between pixels to guide the upsampling process, but they consider only sub-pixel neighborhoods and cannot capture the rich semantic information required by dense prediction tasks. Deconvolution applies the same kernel to the entire image regardless of the underlying content, limiting its ability to respond to local variations; moreover, with larger kernels, the number of parameters and the heavy computation make it difficult to cover an area beyond a small neighborhood, which limits performance.
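For contrast, a minimal sketch of bilinear interpolation shows how it weights neighbors purely by spatial distance, independent of content, which is the limitation noted above (function and variable names are illustrative):

```python
# Bilinear interpolation: the weight of each neighbor depends only on how
# far the sample point is from it, never on what the feature values mean.

def bilinear_sample(fmap, y, x):
    """Sample a 2-D map at fractional coordinates (y, x) bilinearly."""
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(fmap) - 1)      # clamp at the bottom/right borders
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bot = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bot * dy

f = [[0.0, 4.0],
     [8.0, 12.0]]
mid = bilinear_sample(f, 0.5, 0.5)   # plain average of the four neighbors
```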
The specific operation of the upsampling operator adopted in this embodiment is as follows: the upsampling kernel of each position is dot-multiplied with the pixels of the corresponding neighborhood in the input feature map, i.e. feature reassembly. Specifically, an upsampling kernel, different at each position, is predicted from the input feature map, and feature reassembly is then performed based on the predicted kernel.
In this embodiment, the improved FPN upsampling operation obtains a larger receptive field during reassembly; the information gathered within this large receptive field dynamically adapts to the content of each specific instance, the reassembly process is guided by the input features, and the whole process remains lightweight, preserving computational efficiency. The upsampling operation is divided into a kernel prediction module, which predicts the upsampling kernel, and a feature reassembly module, which completes the upsampling and finally produces the output feature map.
The improved FPN fully extracts and fuses the shallow and deep features of the Logo image. Shallow features attend more to detailed position information and suit regression localization; deep features attend more to semantic information and suit classification, etc. By combining the shallow and deep features of the network and outputting targets of different sizes from multiple branches simultaneously, the multi-scale prediction scheme preserves small targets better.
In this embodiment, given a feature map X of size C × H × W and an upsampling ratio σ (assuming σ is an integer), a new feature map X' of size C × σH × σW is generated. Any target position l' = (i', j') of the output X' has a corresponding source position l = (i, j) at the input X, where i = ⌊i'/σ⌋ and j = ⌊j'/σ⌋. N(X_l, k) denotes the k × k sub-region of X centered at position l, i.e. the neighborhood of X_l.
First, a reassembly kernel matching the content of each target position is predicted; second, the features are reassembled with the predicted kernel. In the first step, the kernel prediction module ψ predicts a kernel W_l' for each position l' from the neighborhood X_l; the procedure is expressed as follows:
W_l' = ψ(N(X_l, k_encoder))   (1)
X'_l' = φ(N(X_l, k_up), W_l')   (2)
where φ is the content-aware reassembly module that reassembles the neighborhood X_l with the kernel W_l'.
For each reassembly kernel W_l', the content-aware reassembly module reassembles the features within a local region through the function φ, a weighted-sum operator. For a target position l' and the corresponding square region N(X_l, k_up) centered at l = (i, j), the reassembly follows formula (3), where r = ⌊k_up/2⌋:
X'_l' = Σ_{n=-r..r} Σ_{m=-r..r} W_l'(n, m) · X_(i+n, j+m)   (3)
Within the reassembly kernel, each pixel of the region N(X_l, k_up) contributes to the pixel l' differently, according to the content of the feature rather than the spatial distance of the position; the reassembled feature map can therefore focus more on information from locally relevant points, so its semantics are stronger than those of the original feature map.
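The reassembly step can be sketched as follows (illustrative only: the kernel here is fixed by hand rather than predicted by the kernel prediction module, and border positions are clamped, which is one possible convention):

```python
# Content-aware feature reassembly: each output pixel is a weighted sum of a
# k x k input neighborhood, with weights from a softmax-normalized kernel.
import math

def softmax(vals):
    exps = [math.exp(v) for v in vals]
    s = sum(exps)
    return [e / s for e in exps]

def reassemble(X, i, j, kernel):
    """Weighted sum of the k x k neighborhood of X centered at (i, j)."""
    k = len(kernel)
    r = k // 2
    out = 0.0
    for n in range(-r, r + 1):
        for m in range(-r, r + 1):
            ii = min(max(i + n, 0), len(X) - 1)     # clamp at borders
            jj = min(max(j + m, 0), len(X[0]) - 1)
            out += kernel[n + r][m + r] * X[ii][jj]
    return out

X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
flat = softmax([0.0] * 9)                 # uniform kernel: a plain average
W = [flat[t * 3:(t + 1) * 3] for t in range(3)]
center = reassemble(X, 1, 1, W)           # average of all nine values
```

In the model, a different kernel W_l' is predicted for every output position, so the weights adapt to the local content instead of staying uniform as in this example.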
Loss function module: commonly used loss functions include IoU, GIoU, etc. The intersection-over-union IoU is used to determine positive and negative samples; it can also evaluate the distance between an output box and the ground-truth box, reflecting how well the predicted detection box matches the real one. The formula is as follows:
IoU = |A ∩ B| / |A ∪ B|
however, this loss function has the following disadvantages: if the two frames do not intersect, IoU is 0, and the distance between the two frames cannot be reflected; meanwhile, the loss is 1-IoU, the loss is 0, no gradient feedback exists, and the network cannot learn and train.
GIoU takes into account the non-overlapping region that IoU ignores: GIoU(A, B) ≤ IoU(A, B); when A and B coincide completely, GIoU = IoU = 1; as |A ∪ B| / |C| → 0, GIoU converges to -1, so -1 ≤ GIoU(A, B) ≤ 1. The formula is as follows:
GIoU = IoU - |C \ (A ∪ B)| / |C|
where C is the smallest enclosing box covering both A and B.
however, this loss function has the following disadvantages: the GIoU becomes completely IoU in the case of fig. 3(a) - (c).
In this embodiment, the DIoU loss function is adopted. It better matches the target box regression mechanism and fully accounts for the distance, overlap rate and scale between the predicted box and the ground-truth box; problems such as divergence do not occur during training, the regression process is more stable, and both convergence speed and regression accuracy are improved. The formula is as follows:
L_DIoU = 1 - IoU + ρ²(b, b^gt) / c²
where b and b^gt respectively denote the center points of the predicted box and the ground-truth box, ρ denotes the Euclidean distance between the two center points, and c denotes the diagonal length of the smallest enclosing region that can contain both the predicted and ground-truth boxes.
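The DIoU loss can be sketched as follows (an illustrative pure-Python version; the (x1, y1, x2, y2) box convention and the function name are assumptions, not the patent's implementation):

```python
# L_DIoU = 1 - IoU + rho^2(b, b_gt) / c^2
def diou_loss(pred, gt):
    ix = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    iy = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = ix * iy
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    iou_val = inter / union
    # squared distance between box centers (rho^2)
    bx = ((pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2)
    bg = ((gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2)
    rho2 = (bx[0] - bg[0]) ** 2 + (bx[1] - bg[1]) ** 2
    # squared diagonal of the smallest enclosing box (c^2)
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2
    return 1.0 - iou_val + rho2 / c2

perfect = diou_loss((0, 0, 2, 2), (0, 0, 2, 2))   # identical boxes: loss 0
shifted = diou_loss((0, 0, 2, 2), (1, 0, 3, 2))   # center offset is penalized
```

Unlike 1 - IoU, the distance term ρ²/c² is nonzero even for non-overlapping boxes, which is what keeps the gradient informative and the regression stable.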
In step S3, the training process is specifically as follows: an SGD optimizer is used with an initial learning rate of 0.005 and a batch_size of 4; the network is initialized with ImageNet pre-trained weights; the input picture is resized so that its short side is 800 and its long side is at most 1333; 53049 pictures are used for training and 9559 for testing. The classification loss loss_cls drops to about 0.1 and the regression loss loss_bbox to about 0.2, and the best training effect is reached when the total loss is about 0.9. The trained network model is then evaluated for accuracy on the test set and the results are visualized; the test output is the position of each target, marked by a rectangular box, together with its category score.
Example 2
This embodiment provides a Logo detection system based on deep learning, comprising:
a model building module configured to build a Logo detection model based on the improved upsampling operation and loss function;
an upsampling module configured to predict an upsampling kernel from the input feature map and reassemble the target features according to the predicted kernel to obtain reassembled features;
and a detection module configured to train the Logo detection model on the reassembled features and detect the Logo image to be detected with the trained model.
It should be noted that the above modules correspond to steps S1 to S3 in embodiment 1, and share the same examples and application scenarios as the corresponding steps, but are not limited to the disclosure of embodiment 1. It should also be noted that, as part of a system, the above modules may be implemented in a computer system, for example as a set of computer-executable instructions.
In further embodiments, there is also provided:
An electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, the computer instructions, when executed by the processor, performing the method of embodiment 1. For brevity, no further description is provided here.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method described in embodiment 1.
The method in embodiment 1 may be directly implemented by a hardware processor, or by a combination of hardware and software modules within the processor. The software modules may reside in RAM, flash memory, ROM, PROM, EPROM, registers, or any other storage medium well known in the art. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, details are not described again here.
Those of ordinary skill in the art will appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, this is not intended to limit the scope of the present invention; it should be understood by those skilled in the art that various modifications and variations can be made, without inventive effort, on the basis of the technical solution of the present invention.

Claims (10)

1. A Logo detection method based on deep learning, characterized by comprising the following steps:
constructing a Logo detection model based on an improved upsampling operation and loss function;
predicting an upsampling kernel from the input feature map, and reassembling the target features according to the predicted kernel to obtain reassembled features;
and training the Logo detection model on the reassembled features, and detecting the Logo image to be detected with the trained Logo detection model.
2. The Logo detection method based on deep learning as claimed in claim 1, wherein reassembling the target features is performing a dot product between the predicted upsampling kernel and the pixels of the corresponding neighborhood in the target features.
3. The Logo detection method based on deep learning as claimed in claim 1, wherein the Logo detection model adopts a ResNet-50 network with the fully connected layer removed as the backbone network for feature extraction.
4. The Logo detection method based on deep learning as claimed in claim 1, wherein the Logo detection model adopts an improved feature pyramid to extract target features of different scales.
5. The Logo detection method based on deep learning as claimed in claim 4, wherein in the Logo detection model, the shallow and deep features of the Logo image to be detected are extracted and fused using the improved feature pyramid and the upsampling operation; the shallow features are used for regression localization, and the deep features are used for classification.
6. The Logo detection method based on deep learning as claimed in claim 1, wherein the Logo detection model adopts a DIoU loss function.
7. The Logo detection method based on deep learning as claimed in claim 1, wherein crawler technology is adopted to crawl the Logo images to be detected, and the acquired Logo images are cleaned and their Logos labeled.
8. A Logo detection system based on deep learning, characterized by comprising:
a model building module configured to build a Logo detection model based on the improved upsampling operation and loss function;
an upsampling module configured to predict an upsampling kernel from the input feature map and reassemble the target features according to the predicted kernel to obtain reassembled features;
and a detection module configured to train the Logo detection model on the reassembled features and detect the Logo image to be detected with the trained model.
9. An electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the method of any of claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 7.
CN202011266939.4A 2020-11-13 2020-11-13 Logo detection method and system based on deep learning Pending CN112330651A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011266939.4A CN112330651A (en) 2020-11-13 2020-11-13 Logo detection method and system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011266939.4A CN112330651A (en) 2020-11-13 2020-11-13 Logo detection method and system based on deep learning

Publications (1)

Publication Number Publication Date
CN112330651A true CN112330651A (en) 2021-02-05

Family

ID=74318575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011266939.4A Pending CN112330651A (en) 2020-11-13 2020-11-13 Logo detection method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN112330651A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076972A (en) * 2021-03-04 2021-07-06 山东师范大学 Two-stage Logo image detection method and system based on deep learning
CN113449173A (en) * 2021-07-12 2021-09-28 雷飞仪 Information technology extraction system based on feature sampling
CN113902980A (en) * 2021-11-24 2022-01-07 河南大学 Remote sensing target detection method based on content perception
CN115546473A (en) * 2022-12-01 2022-12-30 珠海亿智电子科技有限公司 Target detection method, apparatus, device and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015148792A1 (en) * 2014-03-27 2015-10-01 Intermoor, Inc. Actively steerable gravity embedded anchor systems and methods for using the same
CN110674734A (en) * 2019-09-23 2020-01-10 青岛科技大学 Ship target detection method, system, equipment and medium
CN111291660A (en) * 2020-01-21 2020-06-16 天津大学 Anchor-free traffic sign identification method based on void convolution
CN111310026A (en) * 2020-01-17 2020-06-19 南京邮电大学 Artificial intelligence-based yellow-related terrorism monitoring method
CN111368703A (en) * 2020-02-29 2020-07-03 上海电力大学 Platform logo detection and identification method based on FPN
CN111798360A (en) * 2020-06-30 2020-10-20 百度在线网络技术(北京)有限公司 Watermark detection method, watermark detection device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIAQI WANG et al.: "CARAFE: Content-Aware ReAssembly of FEatures", 2019 IEEE/CVF International Conference on Computer Vision (ICCV) *
GE Mingjin et al.: "Anchor-free object detection technology for traffic scenes", Computer Engineering & Science *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076972A (en) * 2021-03-04 2021-07-06 山东师范大学 Two-stage Logo image detection method and system based on deep learning
CN113449173A (en) * 2021-07-12 2021-09-28 雷飞仪 Information technology extraction system based on feature sampling
CN113902980A (en) * 2021-11-24 2022-01-07 河南大学 Remote sensing target detection method based on content perception
CN113902980B (en) * 2021-11-24 2024-02-20 河南大学 Remote sensing target detection method based on content perception
CN115546473A (en) * 2022-12-01 2022-12-30 珠海亿智电子科技有限公司 Target detection method, apparatus, device and medium

Similar Documents

Publication Publication Date Title
CN112330651A (en) Logo detection method and system based on deep learning
CN111738110A (en) Remote sensing image vehicle target detection method based on multi-scale attention mechanism
US11314989B2 (en) Training a generative model and a discriminative model
CN111274981B (en) Target detection network construction method and device and target detection method
US11475681B2 (en) Image processing method, apparatus, electronic device and computer readable storage medium
CN113139543B (en) Training method of target object detection model, target object detection method and equipment
CN110826609B (en) Double-current feature fusion image identification method based on reinforcement learning
CN111723841A (en) Text detection method and device, electronic equipment and storage medium
CN113361432B (en) Video character end-to-end detection and identification method based on deep learning
CN112101386B (en) Text detection method, device, computer equipment and storage medium
CN114359245A (en) Method for detecting surface defects of products in industrial scene
CN114861842B (en) Few-sample target detection method and device and electronic equipment
CN111179270A (en) Image co-segmentation method and device based on attention mechanism
CN113591719A (en) Method and device for detecting text with any shape in natural scene and training method
CN115131797A (en) Scene text detection method based on feature enhancement pyramid network
CN116645592A (en) Crack detection method based on image processing and storage medium
CN111104941A (en) Image direction correcting method and device and electronic equipment
CN114627397A (en) Behavior recognition model construction method and behavior recognition method
CN111582057A (en) Face verification method based on local receptive field
CN111680691B (en) Text detection method, text detection device, electronic equipment and computer readable storage medium
CN114743045A (en) Small sample target detection method based on double-branch area suggestion network
CN114399657A (en) Vehicle detection model training method and device, vehicle detection method and electronic equipment
CN110598028B (en) Image classification method and device, storage medium and electronic equipment
CN113128408B (en) Article detection method, device, terminal and storage medium
CN117593890B (en) Detection method and device for road spilled objects, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210205