CN110956115B - Scene recognition method and device


Info

Publication number
CN110956115B
CN110956115B
Authority
CN
China
Prior art keywords
neural network
sample
scene
convolutional neural
scene image
Prior art date
Legal status
Active
Application number
CN201911172445.7A
Other languages
Chinese (zh)
Other versions
CN110956115A (en)
Inventor
陶民泽
Current Assignee
E Capital Transfer Co ltd
Original Assignee
E Capital Transfer Co ltd
Priority date
Filing date
Publication date
Application filed by E Capital Transfer Co ltd filed Critical E Capital Transfer Co ltd
Priority to CN201911172445.7A priority Critical patent/CN110956115B/en
Publication of CN110956115A publication Critical patent/CN110956115A/en
Application granted granted Critical
Publication of CN110956115B publication Critical patent/CN110956115B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention relates to a scene recognition method, which comprises the following steps: acquiring a scene image; extracting candidate regions from the scene image using a convolutional neural network; determining a target region based on the candidate regions; and identifying a logo or slogan in the scene from the target region; wherein the logo or slogan indicates the institution or company in which the scene is located. The method can accurately determine the institution, company or business hall where the scene is located.

Description

Scene recognition method and device
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to a scene recognition method and apparatus.
Background
Given a picture of a scene, a human or robot can determine whether the picture comes from a previously seen scene; this is the problem addressed by visual scene recognition. Visual scene recognition is widely applied in fields such as mobile robotics and autonomous driving.
In the prior art, recognition methods often operate on large-scale picture information: the corresponding scene images contain holistic features such as the decoration style and furnishings of a scene. However, the furnishings of such scenes (for example, business halls) often lack distinctive features and may even closely resemble one another, making it difficult to accurately judge where the scene is located. As a result, the effect of such image recognition methods is not ideal.
Furthermore, there is no explicit screening criterion for determining the target region from the candidate regions, and no published theory on what type of image features the target region should contain. How to accurately identify and distinguish scenes therefore remains a challenge.
Disclosure of Invention
The invention aims to provide a scene recognition method.
According to an aspect of the present invention, there is provided a scene recognition method, including: acquiring a scene image; extracting candidate regions from the scene image using a convolutional neural network; determining a target region based on the candidate regions; and identifying a logo or slogan in the scene from the target region; wherein the logo or slogan indicates the institution or company in which the scene is located.
Optionally, extracting the candidate regions from the scene image using the convolutional neural network comprises: generating at least one convolution feature map based on the scene image; and, for each convolution feature map, inputting the convolution feature map into the convolutional neural network to obtain the coordinates of at least one candidate region and the probability that a logo or slogan is present in each candidate region.
Optionally, determining the target region based on the candidate region includes: threshold filtering and/or non-maximum suppression screening are performed on each candidate region.
Optionally, determining the target region based on the candidate region includes: inputting the convolution feature map corresponding to the candidate region into a convolution neural network to obtain the coordinate correction quantity of the candidate region; the target area is determined based on the coordinate correction amount.
Optionally, identifying a logo or tagline of the scene from the target area comprises: and inputting the scene image corresponding to the target area into a convolutional neural network to identify a sign or a slogan.
Optionally, the method further comprises training the convolutional neural network, wherein training the convolutional neural network comprises: providing sample scene images comprising a logo or slogan; for each sample scene image: generating at least one sample convolution feature map based on the sample scene image; for each sample convolution feature map: dividing the sample convolution feature map into a plurality of grid cells; and, for each grid cell: predicting at least one sample marker box; determining the confidence of each sample marker box; and determining the probability of the sample marker box class.
Optionally, training the convolutional neural network further comprises: determining a loss function of the convolutional neural network; the loss function is solved to determine at least one parameter of the convolutional neural network.
Optionally, solving the loss function includes: solving the loss function using an adaptive moment estimation optimizer.
According to another aspect of the present invention, there is provided a scene recognition apparatus including a scene image acquisition unit, a target determination unit, and a target recognition unit, wherein: the scene image acquisition unit is configured to acquire a scene image; the target determination unit is configured to: extracting candidate areas from the scene image using a convolutional neural network; determining a target region based on the candidate region; the target recognition unit is configured to recognize a logo or slogan in the scene from the target area; wherein the logo or slogan indicates the institution or company in which the scene is located.
Optionally, the target determining unit is configured to: generating at least one convolution feature map based on the scene image; for each convolution feature map: the convolution feature map is input into a convolution neural network to obtain coordinates of at least one candidate region and probabilities of the presence of markers or slogans in each candidate region.
Optionally, the targeting unit is configured to perform threshold filtering and/or non-maximum suppression screening on each candidate region.
Optionally, the target determining unit is configured to input the convolution feature map corresponding to the candidate region into a convolution neural network to obtain a coordinate correction amount of the candidate region; the target area is determined based on the coordinate correction amount.
Optionally, the target recognition unit is configured to input the scene image corresponding to the target region into a convolutional neural network to recognize a logo or slogan.
Optionally, the target determination unit is further configured to train the convolutional neural network, wherein training the convolutional neural network comprises: providing sample scene images comprising a logo or slogan; for each sample scene image: generating at least one sample convolution feature map based on the sample scene image; for each sample convolution feature map: dividing the sample convolution feature map into a plurality of grid cells; and, for each grid cell: predicting at least one sample marker box; determining the confidence of each sample marker box; and determining the probability of the sample marker box class.
Optionally, in training the convolutional neural network, the target determination unit is further configured to: determining a loss function of the convolutional neural network; the loss function is solved to determine at least one parameter of the convolutional neural network.
Optionally, in training the convolutional neural network, the target determination unit is further configured to: the loss function is solved using an adaptive moment estimation optimizer.
According to the scene recognition method provided by the invention, candidate regions are extracted from the scene image and the target region is determined using a convolutional neural network. The parameters of the convolutional neural network are set so that it is more sensitive to the logo or slogan portions of the scene image, and by recognizing the logo or slogan, the institution, company or business hall where the scene is located can be accurately determined.
Drawings
Fig. 1 is a schematic flow chart of a scene recognition method according to a first embodiment of the present invention.
Fig. 2 is a block diagram showing a configuration of a scene recognition apparatus according to a second embodiment of the present invention.
Detailed Description
In the following description, specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that embodiments of the invention can be practiced without these specific details. In the present invention, specific numerical references such as "first element", "second device", etc. may be made. However, a specific numerical reference should not be construed as necessarily subject to its literal order, but rather as a "first element" distinct from a "second element".
The particular details presented herein are exemplary only and the particular details may vary and yet fall within the spirit and scope of the present invention. The term "coupled" is defined as either directly connected to an element or indirectly connected to an element via another element.
Preferred embodiments of methods, systems and apparatus suitable for implementing the present invention are described below with reference to the accompanying drawings. Although the embodiments are described with respect to a single combination of elements, it is to be understood that the invention includes all possible combinations of the disclosed elements. Thus, if one embodiment includes elements A, B and C, while a second embodiment includes elements B and D, the present invention should also be considered to include other remaining combinations of A, B, C or D, even if not explicitly disclosed.
The first embodiment of the present invention provides a scene recognition method, as shown in fig. 1, which includes steps S10-S12-S14-S16.
Step S10: a scene image is acquired.
In this step, video is captured by a camera installed in the scene to be recognized; the video is read and parsed, and frames are extracted from it at a fixed rate (for example, 5 frames per second) and arranged in sequence to form the scene images.
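As an illustration, the frame-sampling step above might look as follows. This is a minimal sketch assuming OpenCV is used for video decoding; the sampling rate, paths and file naming are illustrative assumptions and are not prescribed by the method.

```python
# Hypothetical sketch of step S10: sample frames from a surveillance video at a
# fixed rate (e.g., 5 frames per second) and write them out in sequence as the
# scene images. The video path, output directory and rate are assumptions.
import os
import cv2

def extract_scene_images(video_path: str, out_dir: str, frames_per_second: float = 5.0) -> int:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(native_fps / frames_per_second)), 1)
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            # Frames are written in order so that they form a sequence of scene images.
            cv2.imwrite(os.path.join(out_dir, f"scene_{saved:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```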
Step S12: candidate regions are extracted from the scene image using a convolutional neural network.
In this step, a convolutional neural network (whether or not its training is complete) is used to extract, from the scene image, candidate regions in which a logo or slogan of the scene may be present. However, the candidate regions may also include regions in which no logo or slogan is actually present.
It should be noted that the "logo or slogan" referred to in the present invention may indicate the name or identity of the institution or company where the scene is located. A logo includes not only a textual logo but also a graphical one, such as a company's trademark or design. The furnishings of different institutions or companies may look similar, but their logos or slogans should differ considerably. For the same institution or company, such logos or slogans should be the same or similar regardless of differences in time, weather or lighting, and regardless of how many people are in the scene. A logo or slogan in a scene can therefore be used to determine the institution, company or business hall.
Specifically, to extract candidate regions, a plurality of convolution feature maps are first generated based on the scene image. Each convolution feature map may correspond to one of the R, G and B pixel channels, or to a different convolution kernel. Different convolution feature maps can represent different feature dimensions of the same scene image. As an example, the same scene image is convolved in the R, G and B channels to generate convolution feature maps: the R-channel convolution feature map contains the convolution features of all R pixels, the G-channel convolution feature map contains the convolution features of all G pixels, and the B-channel convolution feature map is analogous.
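A minimal sketch of this per-channel convolution is given below, assuming NumPy and SciPy; the 3x3 kernel is an arbitrary example introduced for illustration, whereas in the method the kernels are learned by the network.

```python
# Illustrative sketch: generate one convolution feature map per colour channel of
# the same scene image. The example kernel is an assumption; learned kernels are
# used in the actual network.
import numpy as np
from scipy.signal import convolve2d

def per_channel_feature_maps(image_rgb: np.ndarray, kernel: np.ndarray):
    # image_rgb has shape (H, W, 3); one 2-D feature map is produced per channel.
    return [convolve2d(image_rgb[:, :, c], kernel, mode="same") for c in range(3)]

example_kernel = np.array([[0, -1, 0],
                           [-1, 4, -1],
                           [0, -1, 0]], dtype=float)  # simple edge-response kernel
```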
After generating the convolution feature map, the convolution feature map is input into a convolution neural network for each convolution feature map, and the convolution neural network may output coordinates of a plurality of candidate regions and probabilities of scene markers or slogans existing in each candidate region.
According to some embodiments of the invention, the convolutional neural network may comprise a detection network for determining the target area and a classification network for identifying the marker or taggant, wherein the detection network further comprises: the feature extraction sub-network, the region generation sub-network, the pooling layer and the frame regression sub-network. The detection network and the classification network should be trained independently of each other. The sub-networks in the detection network may be trained in a unified manner or may be trained independently of each other. The parameters of the convolutional neural network (including its various sub-networks) should be set so that it is more sensitive to the logo or tagline portions in the scene image.
Step S14: a target region is determined based on the candidate region.
In this step, the candidate regions are processed or screened to select a target region, that is, one or more candidate regions in which (as judged by the scene recognition device of the present invention) a scene logo or slogan actually exists.
According to some embodiments of the present invention, in order to determine the target region, threshold filtering and/or non-maximum suppression screening may be performed on each candidate region.
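A generic sketch of such screening is shown below, assuming candidate boxes with associated scores; the score and IoU thresholds are illustrative assumptions rather than values prescribed by the method.

```python
# Generic sketch of threshold filtering followed by non-maximum suppression (NMS).
# Boxes are rows of (x1, y1, x2, y2); thresholds are illustrative assumptions.
import numpy as np

def filter_candidates(boxes: np.ndarray, scores: np.ndarray,
                      score_thresh: float = 0.5, iou_thresh: float = 0.45):
    keep_mask = scores >= score_thresh              # threshold filtering
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort()[::-1]                  # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the current best box and all remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = order[1:][iou < iou_thresh]         # suppress strongly overlapping candidates
    return boxes[keep], scores[keep]
```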
According to further embodiments, the convolution feature map corresponding to the candidate region is first input to a convolution neural network (e.g., to a region generation sub-network) to obtain a coordinate correction amount for the candidate region, and then a final target region is determined based on the coordinate correction amount.
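As an illustration of how such a coordinate correction might be applied to a candidate box, the sketch below uses the common center/size delta parameterization; the patent does not specify the parameterization, so this choice is an assumption.

```python
# Sketch: refine a candidate box (cx, cy, w, h) with predicted corrections
# (dx, dy, dw, dh). The delta parameterization is a common convention assumed
# here; the text only states that a coordinate correction amount is predicted.
import math

def apply_correction(box, deltas):
    cx, cy, w, h = box
    dx, dy, dw, dh = deltas
    new_cx = cx + dx * w         # shift the center proportionally to the box size
    new_cy = cy + dy * h
    new_w = w * math.exp(dw)     # scale width/height with exponentiated deltas
    new_h = h * math.exp(dh)
    return (new_cx, new_cy, new_w, new_h)
```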
Step S16: a logo or slogan in the scene is identified from the target area.
As an example, in this step the scene image corresponding to the target region may be input into the convolutional neural network to identify the scene logo or slogan therein, where the logo or slogan indicates the name or identity of the institution or company in which the scene is located. The method can thus accurately identify and distinguish different scenes: even when different scenes look highly similar in their furnishings, the logo or slogan is one of the most recognizable features of a scene.
It is precisely this recognizability of a scene's logo or slogan that makes it effective for distinguishing different scenes. However, a logo or slogan often occupies only a small, easily overlooked portion of a scene image, and compared with scene furnishings or people it contains relatively few features (e.g., image gradients or variances). Traditional image recognition methods therefore suffer from difficult extraction and poor recognition when dealing with scene logos or slogans. To this end, the detection network introduced by the invention comprises a feature extraction sub-network, a region generation sub-network, a pooling layer and a bounding-box regression sub-network. The feature extraction sub-network extracts features of the scene image, the region generation sub-network outputs candidate regions, and the pooling layer compresses the input feature maps: on the one hand this reduces the feature maps and simplifies the computational complexity of the network; on the other hand it compresses the features, extracts the principal features and helps determine the target region.
In order to accurately extract the target area and identify the scene markers or slogans, the convolutional neural network needs to be trained in advance. In addition, training may be performed periodically to adjust various parameters of the convolutional neural network to improve the adaptability of the convolutional neural network to the scene environment.
According to some embodiments, training a convolutional neural network may proceed as follows.
First, a plurality of sample scene images including at least in part a logo or slogan are provided. At least one sample convolution feature map is generated for each sample scene image. Each sample convolution feature map is divided into a plurality of lattices of the same size.
Subsequently: 1) For each grid cell, at least one sample marker box is predicted, where the prediction can be made based on the features contained in the convolution feature map and the sample marker box is represented by coordinates. 2) The confidence of each sample marker box is determined: if a logo or slogan lies within the sample marker box, the confidence of the box is 1; if there is no overlap, the confidence is 0; and if the sample marker box only partially overlaps the logo or slogan, the confidence takes the ratio of the intersection to the union of the two regions, i.e. an intermediate value between 0 and 1. 3) The probability of each sample marker box class is determined, where the classification can be based on the magnitude of the confidence. For example, if 4 marker boxes are predicted in a grid cell with confidences 0, 0, 0 and 1, there are two sample marker box classes with probabilities 0.75 and 0.25, respectively.
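The confidence rule in item 2) can be illustrated with an intersection-over-union computation, as in the sketch below; the (x1, y1, x2, y2) box format is an assumption made for the example.

```python
# Sketch of the confidence assignment described above. The IoU is 1 when the
# marker box coincides with the logo/slogan box, 0 when they do not overlap, and
# an intermediate value for partial overlap. Box format (x1, y1, x2, y2) assumed.
def marker_box_confidence(marker_box, truth_box) -> float:
    ix1 = max(marker_box[0], truth_box[0])
    iy1 = max(marker_box[1], truth_box[1])
    ix2 = min(marker_box[2], truth_box[2])
    iy2 = min(marker_box[3], truth_box[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_m = (marker_box[2] - marker_box[0]) * (marker_box[3] - marker_box[1])
    area_t = (truth_box[2] - truth_box[0]) * (truth_box[3] - truth_box[1])
    union = area_m + area_t - inter
    return inter / union if union > 0 else 0.0
```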
Further, training the convolutional neural network further comprises: determining a loss function of the convolutional neural network and solving the loss function to determine at least one parameter of the convolutional neural network. The loss function can be solved in a number of ways; an adaptive moment estimation (Adam) optimizer is a preferred choice.
One specific example of training a convolutional neural network is provided below.
1. A sample convolution feature map is generated based on the sample scene image using a Darknet-53 model. From layer 0 up to layer 74 there are 53 convolutional layers; the remaining layers are residual (res) layers. The model uses a series of 3x3 and 1x1 convolution kernels, and the resulting convolutional layers integrate several well-performing mainstream network structures.
2. The input convolution feature map is divided evenly into SxS grid cells, and B marker boxes are predicted for each cell. The confidence of each marker box is determined separately, as described above. The marker box format is (x, y, w, h, confidence), where x and y are the offsets of the center of the target (logo or slogan) relative to the grid cell, w and h are its width and height, and all values are normalized. The confidence reflects whether the target is contained in the box and, if it is, how accurate the predicted position is. Then, the probability of each sample marker box class (the number of classes being C) is determined as described above.
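A minimal sketch of this encoding is shown below, assuming ground-truth boxes given in pixel coordinates; the image size and grid size S are parameters of the example rather than values fixed by the method.

```python
# Sketch of encoding a ground-truth box into the (x, y, w, h, confidence) format
# described above: the center offset is expressed relative to its grid cell and
# all values are normalized. Image size and grid size S are example parameters.
def encode_marker_box(box_xyxy, img_w, img_h, S, confidence):
    x1, y1, x2, y2 = box_xyxy
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    col = min(int(cx / img_w * S), S - 1)   # grid cell containing the box center
    row = min(int(cy / img_h * S), S - 1)
    x = cx / img_w * S - col                # center offset within the cell, in [0, 1)
    y = cy / img_h * S - row
    w = (x2 - x1) / img_w                   # width/height normalized by image size
    h = (y2 - y1) / img_h
    return row, col, (x, y, w, h, confidence)
```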
3. The convolution feature map is connected to a fully connected layer (which connects all the features and outputs values to the classifier) in the form of an S x S x (5B + C) tensor; for example, with S = 7, B = 2 and C = 2, the output tensor is 7 x 7 x 12. The loss function is calculated as follows:
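The formula itself did not survive text extraction of the original document. A standard YOLO-style loss consistent with the row-by-row description below would take roughly the following form; this is a reconstruction under that assumption, not the patent's original figure.

```latex
% Assumed reconstruction of the YOLO-style loss matching the variable definitions
% in the surrounding text (the lambda_noobj term is part of the standard form and
% is an assumption here).
\begin{aligned}
L ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2
   + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
  &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
```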
where x_i, y_i, w_i and h_i are respectively the x and y offsets and the width and height of the i-th sample marker box, C_i is its confidence, classes denotes the set of sample marker box classes, and p_i(c) is the probability of class c for the box. The indicator 1_ij^obj denotes whether the j-th marker box of the i-th grid cell contains a target; the coordinates of a target are predicted from the marker box having the largest intersection-over-union (IoU) with the ground-truth box in which the target lies. lambda_coord is the error weight for the coordinates (the subscript coord refers to coordinates). In the above formula, the first and second rows compute the coordinate error, the third and fourth rows compute the confidence (IoU) error, and the last row computes the classification error.
4. The loss function is solved with an Adam optimizer. Adam (Adaptive Moment Estimation) is essentially RMSprop with a momentum term; it dynamically adjusts the learning rate of each parameter using first- and second-moment estimates of the gradient. Its advantage is that, after bias correction, the learning rate of each iteration stays within a definite range, so that the parameter updates are stable. The moment-estimation equations are as follows:
m_t = μ·m_{t-1} + (1 - μ)·g_t
n_t = ν·n_{t-1} + (1 - ν)·g_t²
m̂_t = m_t / (1 - μ^t)
n̂_t = n_t / (1 - ν^t)
Δθ_t = -η·m̂_t / (√(n̂_t) + ε)
where m_t, n_t and g_t are respectively the exponential moving average of the gradient, the exponential moving average of the squared gradient, and the gradient itself; m̂_t and n̂_t are the bias-corrected first- and second-moment estimates; μ and ν are the exponential decay rates of the first- and second-moment estimates; and Δθ_t is the update applied over time step t, an effective, dynamically bounded step in parameter space (η denotes the learning rate and ε a small constant for numerical stability).
One specific example of training a convolutional neural network is provided above. It should be appreciated that the convolutional neural network may also be trained in other ways, however, it should be noted that the goal of training is to make the convolutional neural network more sensitive to the logo or tagline portions in the scene image.
It should be noted that the convolutional neural network includes a detection network for determining the target area and a classification network for identifying the logo or slogan. It should be understood that their specific structure may vary according to the theoretical model on which the neural network is based. For example, the detection network may employ other sub-network structures, without necessarily including feature extraction sub-networks, region generation sub-networks, and the like. Preferably, the detection network and the classification network are independently configured and independently trained.
According to further embodiments of the present invention, the proportion of the scene logo or slogan within the scene image may be taken into account to increase the recognition rate. In the case of a fixed camera position, and given that a sign or slogan is also typically fixed in place and not easily moved, the sample marker box size used in the training phase and the candidate region size used in the detection and recognition phase may be selected according to this proportion. The image portion occupied by the sign or slogan may be 1%-5% of the area of the scene image (this range may be adjusted according to the actual scene). As an example, if the scene image is 1600x1200 pixels, the sample marker box may be taken as 200x200 pixels (about 2% of the image area).
A second embodiment of the present invention provides a scene recognition apparatus, as shown in fig. 2, which includes a scene image acquisition unit 201, a target determination unit 203, and a target recognition unit 205.
The scene image acquisition unit 201 is configured to acquire a scene image. The target determination unit 203 is configured to extract candidate regions from the scene image using a convolutional neural network and determine a target region based on the candidate regions. The target recognition unit 205 is configured to recognize a logo or slogan of the scene from the target area, wherein the logo or slogan is capable of indicating the institution or company in which the scene is located.
Specifically, the target determination unit 203 may generate at least one convolution feature map based on the scene image. The target determination unit 203 inputs each convolution feature map into the convolutional neural network (specifically, into the feature extraction sub-network), and the convolutional neural network outputs the coordinates of at least one candidate region and the probability that a logo or slogan exists in each candidate region.
The target determination unit 203 may further perform threshold filtering or non-maximum suppression screening on each candidate region to filter out redundant or interfering candidates.
Further, the target determination unit 203 inputs the convolution feature map corresponding to the candidate region into the convolutional neural network (specifically, into the region generation sub-network) to obtain the coordinate correction amount of the candidate region, and determines the target region based on the coordinate correction amount. In this process, the bounding-box regression sub-network plays the major role.
The target recognition unit 205 is configured to input a scene image corresponding to a target region into a convolutional neural network (specifically, into a classification network) to recognize a logo or a slogan after the target region is obtained from the target determination unit 203. The target recognition unit 205 may recognize text, other logos or trademarks in a logo or slogan, and may further determine the institution or company in which the scene is located.
In the training process, the target determination unit 203 trains the convolutional neural network by: providing sample scene images comprising a logo or slogan; generating at least one sample convolution feature map for each sample scene image; and, for each sample convolution feature map, dividing the sample convolution feature map into a plurality of grid cells. For each grid cell, at least one sample marker box is predicted, the confidence of each sample marker box is determined, and the probability of the sample marker box class is also determined. The target determination unit 203 further determines a loss function of the convolutional neural network and solves the loss function to determine at least one parameter of the convolutional neural network, which may be done using an adaptive moment estimation optimizer, in which the learning rate of each parameter is dynamically adjusted using the first- and second-moment estimates of the gradient.
As a specific example, the scene recognition device may be implemented on the Darknet platform, where Darknet itself is implemented in C, while other services (such as KCF) may be implemented in Python, and the modules are connected via Darknet's Python interface. In order to expose the scene recognition function as an API or as a scene recognition device serving different users, a JavaScript-Java-Python-C workflow may be constructed, in which a Python-based server container is used to load the deep learning model. Django, a mature and powerful server container implemented in Python, is considered here, and the service stack is built accordingly. The specific sequential activity logic is as follows (a minimal Django-side sketch is given after this list):
1) The administrator starts the Django server, and the server is initialized.
2) When the Django server is initialized, a Python API interface of the Darknet is called, the Darknet service is started, and the model weight is loaded into the GPU.
3) The user sends a request from the client, and the front-end JavaScript sends the request to upload the image data and the control flow.
4) The web server Tomcat on the server side responds to the user request, parses the RESTful JSON request, decodes the Base64-encoded picture in it, and parses the parameters in the control flow.
5) After Tomcat has parsed the request, a request to invoke Darknet is sent to the server container Django.
6) After receiving the request, Django routes the cached data and invokes the Darknet model, passing the related control parameters to it.
7) The Darknet model performs the computation and returns the detection result to the Django server.
8) The Django server encapsulates the computation result and passes it to the Tomcat server.
9) The Tomcat server processes the Django server's response and returns the result to the user who requested the detection.
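The Django-side portion of this workflow (steps 6 to 8) might look roughly as follows. The view name, URL routing, request fields and the Darknet wrapper call are all illustrative assumptions, not an actual published interface.

```python
# Hypothetical sketch of steps 6)-8): a Django view that receives the Base64 image
# forwarded by Tomcat, calls a Darknet-based detector, and returns the result.
# The view name, request fields and the `detect` wrapper method are assumptions.
import base64
import json

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

# Assume the Darknet model and weights were loaded into the GPU at server start-up
# (step 2) and wrapped in an object exposing detect(image_bytes, **params).
darknet_model = None  # placeholder for the loaded model wrapper

@csrf_exempt
def detect_scene(request):
    if darknet_model is None:
        return JsonResponse({"error": "model not loaded"}, status=503)
    payload = json.loads(request.body)
    image_bytes = base64.b64decode(payload["image"])   # Base64 picture forwarded by Tomcat
    control_params = payload.get("params", {})         # parsed control-flow parameters
    detections = darknet_model.detect(image_bytes, **control_params)
    # Encapsulate the detection result and return it to the Tomcat server (step 8).
    return JsonResponse({"detections": detections})
```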
In some embodiments of the invention, at least a portion of the device or system may be implemented using a set of distributed computing devices connected by a communications network, or based on a "cloud". In such a system, multiple computing devices operate together to provide services through the use of shared resources.
The "cloud"-based implementation may provide one or more advantages, including: openness, flexibility and extensibility, centralized management, reliability, scalability, optimization of computing resources, the ability to aggregate and analyze information across multiple users, the ability to connect across multiple geographic areas, and the ability to use multiple mobile or data network operators for network connectivity.
According to some embodiments of the present invention, there is provided a machine storage medium having stored thereon a set of computer executable instructions which, when executed by a processor, implement the scene recognition method provided by the first embodiment described above.
According to still further embodiments of the present invention, there is provided a computer control apparatus which, when executing computer-executable instructions stored in a memory, will perform the steps of the scene recognition method provided in the first embodiment described above.
Those of skill would appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To demonstrate interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software will depend upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The above description is only for the preferred embodiments of the invention and is not intended to limit the scope of the invention. Numerous variations and modifications can be made by those skilled in the art without departing from the spirit of the invention and the appended claims.

Claims (18)

1. A scene recognition method, comprising:
acquiring a scene image;
extracting a candidate region from the scene image using a convolutional neural network, the candidate region having a size of 1% -5% of an area of the scene image;
determining a target region based on the candidate region; and
identifying a logo or slogan in the scene from the target area;
wherein the sign or slogan indicates the institution or company in which the scene is located;
wherein the convolutional neural network is trained as follows:
providing a sample scene image comprising the logo or slogan;
for each of the sample scene images:
generating at least one sample convolution feature map based on the sample scene image;
for each of the sample convolution feature maps:
dividing the sample convolution feature map into a plurality of grid cells;
for each of the grid cells:
predicting at least one sample marker box such that the size of the predicted sample marker box is 1%-5% of the area of the sample scene image;
determining the confidence of each sample marker box; and
determining the probability of the sample marker box class.
2. The method of claim 1, wherein extracting candidate regions from the scene image using a convolutional neural network comprises:
generating at least one convolution feature map based on the scene image;
for each of the convolution feature maps:
and inputting the convolution characteristic map into the convolution neural network to obtain the coordinates of at least one candidate region and the probability of the mark or the slogan in each candidate region.
3. The method of claim 1, determining a target region based on the candidate region comprising:
and carrying out threshold filtering and/or non-maximum suppression screening on each candidate region.
4. The method of claim 1, wherein determining a target region based on the candidate region comprises:
inputting the convolution feature map corresponding to the candidate region into the convolution neural network to obtain the coordinate correction quantity of the candidate region;
the target area is determined based on the coordinate correction amount.
5. The method of claim 1, wherein identifying a logo or slogan of the scene from the target area comprises:
and inputting the scene image corresponding to the target area into the convolutional neural network to identify the sign or the slogan.
6. The method of claim 1, wherein training the convolutional neural network further comprises:
determining a loss function of the convolutional neural network;
the loss function is solved to determine at least one parameter of the convolutional neural network.
7. The method of claim 6, wherein solving the loss function comprises:
the loss function is solved using an adaptive moment estimation optimizer.
8. The method of any one of claims 1 to 7, wherein the convolutional neural network comprises a detection network for determining the target region and a classification network for identifying the logo or slogan,
wherein the detection network comprises:
a feature extraction sub-network;
a region generation sub-network;
a pooling layer; and
a bounding-box regression sub-network.
9. A scene recognition device comprising a scene image acquisition unit, a target determination unit and a target recognition unit, wherein:
the scene image acquisition unit is configured to acquire a scene image;
the target determination unit is configured to:
extracting a candidate region from the scene image using a convolutional neural network, the candidate region having a size of 1% -5% of an area of the scene image;
determining a target region based on the candidate region;
the target recognition unit is configured to recognize a logo or slogan in the scene from the target area;
wherein the sign or slogan indicates the institution or company in which the scene is located;
the target determination unit is further configured to train the convolutional neural network, wherein training the convolutional neural network comprises:
providing a sample scene image comprising the logo or slogan;
for each of the sample scene images:
generating at least one sample convolution feature map based on the sample scene image;
for each of the sample convolution feature maps:
dividing the sample convolution feature map into a plurality of grid cells;
for each of the grid cells:
predicting at least one sample marker box such that the size of the predicted sample marker box is 1%-5% of the area of the sample scene image;
determining the confidence of each sample marker box; and
determining the probability of the sample marker box class.
10. The apparatus according to claim 9, wherein the targeting unit is configured to:
generating at least one convolution feature map based on the scene image;
for each of the convolution feature maps:
and inputting the convolution characteristic map into the convolution neural network to obtain the coordinates of at least one candidate region and the probability of the mark or the slogan in each candidate region.
11. The apparatus according to claim 9, wherein the targeting unit is configured to perform threshold filtering and/or non-maximum suppression screening on each of the candidate regions.
12. The apparatus according to claim 9, wherein the target determination unit is configured to input the convolution feature map corresponding to the candidate region into the convolution neural network to obtain a coordinate correction amount of the candidate region;
the target area is determined based on the coordinate correction amount.
13. The apparatus of claim 9, wherein the target recognition unit is configured to input the scene image corresponding to the target region into the convolutional neural network to recognize the logo or slogan.
14. The apparatus of claim 9, wherein in training the convolutional neural network, the targeting unit is further configured to:
determining a loss function of the convolutional neural network;
the loss function is solved to determine at least one parameter of the convolutional neural network.
15. The apparatus of claim 14, wherein in training the convolutional neural network, the targeting unit is further configured to:
the loss function is solved using an adaptive moment estimation optimizer.
16. The apparatus according to any one of claims 9 to 15, wherein the convolutional neural network comprises a detection network for determining the target region and a classification network for identifying the logo or slogan,
wherein the detection network comprises:
a feature extraction sub-network;
a region generation sub-network;
a pooling layer; and
a bounding-box regression sub-network.
17. A machine storage medium having stored thereon a set of computer executable instructions which, when executed by a processor, implement the method of any one of claims 1 to 7.
18. A computer control apparatus which, when implementing computer executable instructions stored in a memory, performs the steps of the method of any one of claims 1 to 7.
CN201911172445.7A 2019-11-26 2019-11-26 Scene recognition method and device Active CN110956115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911172445.7A CN110956115B (en) 2019-11-26 2019-11-26 Scene recognition method and device


Publications (2)

Publication Number Publication Date
CN110956115A CN110956115A (en) 2020-04-03
CN110956115B true CN110956115B (en) 2023-09-29

Family

ID=69978460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911172445.7A Active CN110956115B (en) 2019-11-26 2019-11-26 Scene recognition method and device

Country Status (1)

Country Link
CN (1) CN110956115B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507253B (en) * 2020-04-16 2023-06-30 腾讯科技(深圳)有限公司 Display article auditing method and device based on artificial intelligence
CN111461101B (en) * 2020-04-20 2023-05-19 上海东普信息科技有限公司 Method, device, equipment and storage medium for identifying work clothes mark


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109522938A (en) * 2018-10-26 2019-03-26 华南理工大学 The recognition methods of target in a kind of image based on deep learning
CN110163187A (en) * 2019-06-02 2019-08-23 东北石油大学 Remote road traffic sign detection recognition methods based on F-RCNN
CN110188705A (en) * 2019-06-02 2019-08-30 东北石油大学 A kind of remote road traffic sign detection recognition methods suitable for onboard system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张明; 桂凯. 基于深度学习的室内场景识别的研究 [Research on indoor scene recognition based on deep learning]. 现代计算机(专业版) [Modern Computer (Professional Edition)], 2018, (16), 全文 (full text). *
李家兴; 覃兴平; 刘达才. 基于卷积神经网络的交通标志检测 [Traffic sign detection based on convolutional neural networks]. 工业控制计算机 [Industrial Control Computer], 2018, (05), 全文 (full text). *

Also Published As

Publication number Publication date
CN110956115A (en) 2020-04-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant