CN110956115A - Scene recognition method and device - Google Patents

Scene recognition method and device

Info

Publication number
CN110956115A
Authority
CN
China
Prior art keywords
neural network
scene
convolutional neural
sample
slogan
Prior art date
Legal status
Granted
Application number
CN201911172445.7A
Other languages
Chinese (zh)
Other versions
CN110956115B (en)
Inventor
陶民泽
Current Assignee
E Capital Transfer Co ltd
Original Assignee
E Capital Transfer Co ltd
Priority date
Filing date
Publication date
Application filed by E Capital Transfer Co ltd
Priority to CN201911172445.7A
Publication of CN110956115A
Application granted
Publication of CN110956115B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a scene recognition method comprising the following steps: acquiring a scene image; extracting candidate regions from the scene image using a convolutional neural network; determining a target region based on the candidate regions; and identifying a logo or slogan in the scene from the target region, wherein the logo or slogan indicates the organization or company where the scene is located. The method can accurately determine the institution, company or business hall where the scene is located.

Description

Scene recognition method and device
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to a scene recognition method and apparatus.
Background
Given a picture of a scene, a human or robot can determine whether the picture is from a previously seen scene, which is the problem to be solved by visual scene recognition. Visual scene recognition is widely applied to the fields of mobile robots, automatic driving and the like.
In the prior art, recognition methods are usually aimed at large-scale picture information: the scene image captures overall characteristics of the scene such as its decoration style and furnishings. However, the furnishings of some scenes (such as business halls) have no distinctive features and may even closely resemble one another, so it is difficult to distinguish accurately where the scene is located, and such image recognition methods perform poorly.
Furthermore, there is no explicit screening criterion for determining a target region from the candidate regions: what type of image features the target region should contain, and hence how to accurately identify and distinguish scenes, remains a difficult problem.
Disclosure of Invention
The invention aims to provide a scene recognition method.
According to an aspect of the present invention, there is provided a scene recognition method, including: acquiring a scene image; extracting a candidate region from the scene image by using a convolutional neural network; determining a target region based on the candidate region; and identifying a logo or slogan in the scene from the target area; wherein the logo or slogan indicates the organization or company where the scene is located.
Optionally, extracting the candidate regions from the scene image using a convolutional neural network comprises: generating at least one convolution feature map based on the scene image; and, for each convolution feature map, inputting the convolution feature map into the convolutional neural network to obtain the coordinates of at least one candidate region and the probability that the logo or slogan is present in each candidate region.
Optionally, determining the target region based on the candidate regions comprises: performing threshold filtering and/or non-maximum suppression screening on each candidate region.
Optionally, determining the target region based on the candidate regions comprises: inputting the convolution feature map corresponding to the candidate region into the convolutional neural network to obtain a coordinate correction amount of the candidate region; and determining the target region based on the coordinate correction amount.
Optionally, identifying the logo or slogan of the scene from the target region comprises: inputting the scene image corresponding to the target region into the convolutional neural network to identify the logo or slogan.
Optionally, the method further comprises training the convolutional neural network, wherein training the convolutional neural network comprises: providing sample scene images including a logo or slogan; for each sample scene image, generating at least one sample convolution feature map based on the sample scene image; for each sample convolution feature map, dividing the sample convolution feature map into a plurality of grid cells; and, for each grid cell: predicting at least one sample marker box; determining the confidence of each sample marker box; and determining the probability of the sample marker box category.
Optionally, training the convolutional neural network further comprises: determining a loss function of the convolutional neural network; the loss function is solved to determine at least one parameter of the convolutional neural network.
Optionally, solving the loss function comprises: solving the loss function using an adaptive moment estimation optimizer.
According to another aspect of the present invention, there is provided a scene recognition apparatus including a scene image acquisition unit, an object determination unit, and an object recognition unit, wherein: the scene image acquisition unit is configured to acquire a scene image; the target determination unit is configured to: extracting a candidate region from the scene image by using a convolutional neural network; determining a target region based on the candidate region; the target identification unit is configured to identify a logo or slogan in the scene from the target area; wherein the logo or slogan indicates the organization or company where the scene is located.
Optionally, the target determination unit is configured to: generate at least one convolution feature map based on the scene image; and, for each convolution feature map, input the convolution feature map into the convolutional neural network to obtain the coordinates of at least one candidate region and the probability that the logo or slogan is present in each candidate region.
Optionally, the target determination unit is configured to perform threshold filtering and/or non-maxima suppression screening on each candidate region.
Optionally, the target determination unit is configured to input the convolution feature map corresponding to the candidate region into the convolution neural network to obtain a coordinate correction amount of the candidate region; the target area is determined based on the coordinate correction amount.
Optionally, the target recognition unit is configured to input the scene image corresponding to the target area into a convolutional neural network to recognize the logo or slogan.
Optionally, the target determination unit is further configured to train the convolutional neural network, wherein training the convolutional neural network comprises: providing sample scene images including a logo or slogan; for each sample scene image, generating at least one sample convolution feature map based on the sample scene image; for each sample convolution feature map, dividing the sample convolution feature map into a plurality of grid cells; and, for each grid cell: predicting at least one sample marker box; determining the confidence of each sample marker box; and determining the probability of the sample marker box category.
Optionally, in training the convolutional neural network, the target determination unit is further configured to: determining a loss function of the convolutional neural network; the loss function is solved to determine at least one parameter of the convolutional neural network.
Optionally, in training the convolutional neural network, the target determination unit is further configured to: an adaptive moment estimation optimizer is utilized to solve the loss function.
According to the scene recognition method provided by the invention, candidate regions are extracted from the scene image using a convolutional neural network and a target region is determined from them. Because the parameters of the convolutional neural network are set to be more sensitive to the logo or slogan portions of the scene image, the institution, company or business hall where the scene is located can be accurately determined by recognizing the logo or slogan.
Drawings
Fig. 1 is a schematic flow chart illustrating a scene recognition method according to a first embodiment of the present invention.
Fig. 2 is a block diagram illustrating a scene recognition apparatus according to a second embodiment of the present invention.
Detailed Description
In the following description specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that embodiments of the invention may be practiced without these specific details. In the present invention, specific numerical references such as "first element", "second device", and the like may be made. However, specific numerical references should not be construed as necessarily subject to their literal order, but rather construed as "first element" as opposed to "second element".
The specific details set forth herein are merely exemplary and may be varied while remaining within the spirit and scope of the invention. The term "coupled" is defined to mean either directly connected to a component or indirectly connected to the component via another component.
Preferred embodiments of methods, systems and devices suitable for implementing the present invention are described below with reference to the accompanying drawings. Although embodiments are described with respect to particular combinations of elements, it is to be understood that the invention includes all possible combinations of the disclosed elements. Thus, if one embodiment includes elements A, B and C, while a second embodiment includes elements B and D, the invention should also be considered to include the other remaining combinations of A, B, C and D, even if not explicitly disclosed.
The first embodiment of the present invention provides a scene recognition method which, as shown in Fig. 1, includes steps S10, S12, S14 and S16.
Step S10: a scene image is acquired.
In this step, a camera arranged in the scene to be recognized captures video; the video is read and parsed, and frames are extracted from it at a certain frequency (for example, 5 frames per second) and ordered to form the scene images.
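As an illustration only, the frame-sampling step might look like the minimal sketch below, assuming OpenCV is used to read the video; the 5-frames-per-second rate and the function name are illustrative, not prescribed by the method.

```python
# Minimal sketch of frame sampling with OpenCV (an assumed dependency).
# Frames are kept at roughly `target_fps` frames per second, in order.
import cv2

def sample_frames(video_path: str, target_fps: float = 5.0):
    """Read a surveillance video and return frames sampled at target_fps."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(native_fps / target_fps)), 1)
    frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)  # BGR image array, ordered by time
        index += 1
    cap.release()
    return frames
```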
Step S12: candidate regions are extracted from the scene image using a convolutional neural network.
In this step, candidate regions in which a logo or slogan of the scene may be present are extracted from the scene image using a convolutional neural network (whether or not its training is complete). Regions that contain no scene logo or slogan may nevertheless be included among the candidate regions.
It should be noted that the "logo or slogan" referred to in the present invention may indicate the name or identity of the organization or company where the scene is located. A logo includes not only a text logo but also an icon logo, such as a trademark or a company's visual design. The furnishings of different organizations or companies may resemble one another, but their logos or slogans should be distinct. For the same institution or company, such logos or slogans remain the same or similar regardless of how many people are in the scene and regardless of differences in time, weather or lighting. A logo or slogan in a scene can therefore be used to identify the institution, company, or business hall.
Specifically, to extract candidate regions, a plurality of convolution feature maps are first generated from the scene image. A convolution feature map may correspond to one of the R, G, B channels of the image, or to a particular convolution kernel that is applied; different convolution feature maps can represent different feature dimensions of the same scene image. As an example, the same scene image is convolved separately along the R, G and B channels to generate convolution feature maps, where the R-channel feature map contains the convolution features of all R pixels, the G-channel feature map contains the convolution features of all G pixels, and likewise for the B channel.
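For illustration, a minimal sketch of such per-channel convolution is given below, assuming NumPy and SciPy are available; the 3x3 kernel values are placeholders rather than trained weights.

```python
# Illustrative sketch: one 3x3 kernel applied separately to the R, G and B
# channels of a scene image produces one convolution feature map per channel.
import numpy as np
from scipy.signal import convolve2d

def per_channel_feature_maps(image: np.ndarray, kernel: np.ndarray):
    """image: H x W x 3 array (R, G, B); returns three 2-D feature maps."""
    return [convolve2d(image[:, :, c], kernel, mode="same") for c in range(3)]

kernel = np.array([[0, -1, 0],
                   [-1, 4, -1],
                   [0, -1, 0]], dtype=float)  # simple edge-like kernel, for illustration only
```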
After the convolution feature maps are generated, for each convolution feature map, the convolution feature map is input to a convolution neural network, and the convolution neural network may output the coordinates of a plurality of candidate regions and the probability of the scene markers or slogans existing in each candidate region.
According to some embodiments of the invention, the convolutional neural network may comprise a detection network for determining the target region and a classification network for identifying the landmark or slogan, wherein the detection network in turn comprises: the system comprises a feature extraction sub-network, a region generation sub-network, a pooling layer and a border regression sub-network. The detection network and the classification network should be trained independently of each other. The sub-networks in the detection network may be trained together or independently of each other. The parameters of the convolutional neural network (including its various sub-networks) should be set so that it is more sensitive to the logo or slogan portions in the scene image.
Step S14: a target region is determined based on the candidate regions.
In this step, by processing or screening the candidate regions, a target region can be selected from them, the target region being the one or more candidate regions in which (as judged by the scene recognition apparatus of the present invention) a scene logo or slogan actually exists.
According to some embodiments of the invention, to determine the target region, each candidate region may be filtered by a threshold and/or screened by non-maximum suppression.
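A minimal sketch of this screening is given below; the (x1, y1, x2, y2) box format and the 0.5/0.45 thresholds are assumptions for illustration, not values fixed by the method.

```python
# Sketch of the screening described above: drop candidate regions whose
# probability falls below a threshold, then apply non-maximum suppression so
# that heavily overlapping boxes keep only the highest-scoring one.
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def filter_candidates(boxes, scores, score_thr=0.5, iou_thr=0.45):
    kept = []
    order = sorted((i for i, s in enumerate(scores) if s >= score_thr),
                   key=lambda i: scores[i], reverse=True)
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in kept):
            kept.append(i)
    return kept  # indices of surviving candidate regions
```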
According to other embodiments, the convolution feature maps corresponding to the candidate regions are first input into a convolution neural network (e.g., into a region generation sub-network) to obtain coordinate correction quantities of the candidate regions, and a final target region is determined based on the coordinate correction quantities.
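As an illustration, applying such a coordinate correction to a candidate box might look like the sketch below; the (dx, dy, dw, dh) encoding follows the common bounding-box regression parameterization and is an assumption, since the exact encoding is not spelled out here.

```python
# Sketch of applying predicted coordinate corrections (dx, dy, dw, dh) to a
# candidate box given as (x1, y1, x2, y2) corner coordinates.
import math

def apply_correction(box, deltas):
    w, h = box[2] - box[0], box[3] - box[1]
    cx, cy = box[0] + 0.5 * w, box[1] + 0.5 * h
    cx, cy = cx + deltas[0] * w, cy + deltas[1] * h          # shift the center
    w, h = w * math.exp(deltas[2]), h * math.exp(deltas[3])  # rescale width/height
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)
```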
Step S16: a logo or slogan in the scene is identified from the target region.
As an example, in this step, the scene image corresponding to the target area may be input into a convolutional neural network to identify the scene mark or slogan therein. Where these logos or slogans indicate the name or identity of the organization or company in which the scene is located. Thus, the method can accurately identify and distinguish different scenes. Even if there is a high similarity of scene furnishings between different scenes, the logo or slogan therein is one of the most scene-recognizable features.
It is the logo or slogan of a scene that can be used to distinguish different scenes effectively. However, a logo or slogan often occupies only a small, even inconspicuous, portion of the scene image and contains fewer features (e.g., image gradients or variances) than the scene furnishings or the people in the scene. When a scene logo or slogan is extracted and recognized by traditional image recognition methods, it is therefore difficult to extract and the recognition rate is poor. To this end, the detection network introduced by the present invention includes a feature extraction sub-network, a region generation sub-network, a pooling layer, and a bounding box regression sub-network. The feature extraction sub-network extracts features of the scene image, and the region generation sub-network outputs the candidate regions. The pooling layer compresses the input feature map: on the one hand it shrinks the feature map and thereby reduces the computational complexity of the network, and on the other hand it compresses the features, extracting the main features used to determine the target region.
In order to accurately extract the target region and recognize the scene mark or slogan, the convolutional neural network needs to be trained in advance. In addition, training may be performed periodically to adjust various parameters of the convolutional neural network to improve the adaptability of the convolutional neural network to the scene environment.
According to some embodiments, training the convolutional neural network may proceed as follows.
First, a number of sample scene images are provided, at least some of which include a logo or slogan. At least one sample convolution feature map is generated for each sample scene image, and each sample convolution feature map is divided into a plurality of grid cells of the same size.
Then, for each grid cell: 1) at least one sample marker box is predicted, where the prediction can be made from the feature quantities contained in the convolution feature map and each sample marker box is represented by its coordinates; 2) the confidence of each sample marker box is determined: if a logo or slogan lies within the sample marker box the confidence is 1, if none is present the confidence is 0, and if the sample marker box partially overlaps the logo or slogan the confidence is the ratio of their intersection to their union, i.e. a value between 0 and 1; 3) the probability of each sample marker box category is determined, where the sample marker box categories can be grouped according to confidence. For example, if 4 marker boxes are predicted in a grid cell and their confidences are 0, 0, 0 and 1, there are two sample marker box categories, with probabilities of 0.75 and 0.25 respectively.
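The confidence and category-probability rules above can be illustrated with the following sketch; the (x1, y1, x2, y2) box format is an assumption, and `category_probabilities` simply reproduces the 0.75/0.25 example above.

```python
# Sketch of the confidence rule above: 1 when the logo/slogan lies entirely
# inside the sample marker box, 0 when no logo/slogan is present, otherwise
# the ratio of intersection to union of the two boxes.
from collections import Counter

def box_confidence(marker_box, truth_box):
    if truth_box is None:                       # no logo or slogan present
        return 0.0
    if (marker_box[0] <= truth_box[0] and marker_box[1] <= truth_box[1] and
            marker_box[2] >= truth_box[2] and marker_box[3] >= truth_box[3]):
        return 1.0                              # logo/slogan fully contained
    x1, y1 = max(marker_box[0], truth_box[0]), max(marker_box[1], truth_box[1])
    x2, y2 = min(marker_box[2], truth_box[2]), min(marker_box[3], truth_box[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_m = (marker_box[2] - marker_box[0]) * (marker_box[3] - marker_box[1])
    area_t = (truth_box[2] - truth_box[0]) * (truth_box[3] - truth_box[1])
    union = area_m + area_t - inter
    return inter / union if union else 0.0      # partial overlap: value in (0, 1)

def category_probabilities(confidences):
    """Group marker boxes by confidence, e.g. [0, 0, 0, 1] -> {0: 0.75, 1: 0.25}."""
    counts = Counter(confidences)
    return {c: n / len(confidences) for c, n in counts.items()}
```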
Further, training the convolutional neural network also comprises determining a loss function of the convolutional neural network and solving it to determine at least one parameter of the network. The loss function can be solved by a number of methods; an adaptive moment estimation (Adam) optimizer is a preferred approach.
One specific example of training a convolutional neural network is provided below.
First, sample convolution feature maps are generated from the sample scene images using a Darknet-53 model. From layer 0 to layer 74 there are 53 convolutional layers in total, the remainder being residual (res) layers. The model uses a series of 3 × 3 and 1 × 1 convolution kernels, and its convolutional layers are formed by selecting and integrating the better-performing structures from several mainstream network architectures.
Second, the input convolution feature map is divided evenly into S × S grid cells, and marker boxes are predicted separately in each cell, with B marker boxes per cell. The confidence of each marker box is determined as described above. The marker box format is (x, y, w, h, confidence), where the offset of the target (logo or slogan) center from the grid cell position and the width and height are normalized. The confidence reflects whether the target is contained and, if it is, how accurate the predicted position is. Subsequently, the probability of each sample marker box category (with C categories in total) is determined as described above.
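As an illustration of this normalization, a minimal sketch is given below; the function name and the pixel-coordinate inputs are assumptions for the example.

```python
# Sketch of the (x, y, w, h) normalization described above: the box center is
# expressed as an offset within its grid cell and the width/height are scaled
# by the image size, so every component lies in [0, 1]. S is the number of
# grid cells per side.
def encode_box(cx, cy, w, h, img_w, img_h, S):
    """cx, cy, w, h in pixels -> (grid_col, grid_row, x, y, w, h) normalized."""
    cell_w, cell_h = img_w / S, img_h / S
    col = min(int(cx // cell_w), S - 1)
    row = min(int(cy // cell_h), S - 1)
    x = cx / cell_w - col          # center offset inside the grid cell
    y = cy / cell_h - row
    return col, row, x, y, w / img_w, h / img_h
```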
Third, the convolution feature map is connected to a fully connected layer (the fully connected layer connects all the features and sends its output values to the classifier), and the output takes the form of an S × S × (B × 5 + C) tensor. The loss function is calculated as follows:
$$
\begin{aligned}
\text{Loss} =\;& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right]\\
&+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right]\\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2\\
&+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2\\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$
wherein $x_i$, $y_i$, $w_i$, $h_i$ are respectively the x- and y-axis offsets and the width and height of the marker box of the i-th grid cell, $C_i$ is its confidence, $classes$ denotes the set of sample marker box categories, and $p_i(c)$ is the probability of category $c$; hatted quantities denote the corresponding ground-truth values.
$\mathbb{1}_{ij}^{obj}$ indicates whether the j-th marker box of the i-th grid cell contains the target, where the coordinates of the target are predicted from the marker box having the largest intersection-over-union ratio (ratio of intersection to union) with the real marker box containing the target. $\lambda_{coord}$ denotes the error weight for the coordinates (the subscript coord refers to coordinates). In the formula above, the first and second lines compute the coordinate error, the third and fourth lines compute the confidence (intersection-over-union) error, and the last line computes the classification error.
Fourth, the loss function is solved using an Adam optimizer. Adam (adaptive moment estimation) is essentially RMSprop with a momentum term: it dynamically adjusts the learning rate of each parameter using first- and second-moment estimates of the gradient. Its advantage is that, after bias correction, the learning rate of each iteration lies within a definite range, so the parameters remain relatively stable. The moment-estimation formulas are as follows:
$$m_t = \mu\, m_{t-1} + (1-\mu)\, g_t$$
$$n_t = \nu\, n_{t-1} + (1-\nu)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\mu^t},\qquad \hat{n}_t = \frac{n_t}{1-\nu^t}$$
$$\Delta\theta_t = -\,\frac{\eta\,\hat{m}_t}{\sqrt{\hat{n}_t}+\epsilon}$$
wherein $m_t$, $n_t$ and $g_t$ are respectively the exponential moving average of the gradient, the exponential moving average of the squared gradient, and the gradient itself; $\hat{m}_t$ is the bias-corrected first-order moment and $\hat{n}_t$ the bias-corrected second-order moment; $\mu$ and $\nu$ are the exponential decay rates of the first- and second-order moment estimates, respectively; $\eta$ is the learning rate and $\epsilon$ a small constant that prevents division by zero; and $\Delta\theta_t$ denotes the effective update step computed in parameter space at time step t.
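For illustration, one Adam update step following these formulas might be written as the NumPy sketch below; the default values of eta, mu, nu and eps are common choices, not values fixed by the method.

```python
# Minimal NumPy sketch of one Adam update, following the moment-estimation
# formulas above.
import numpy as np

def adam_step(theta, grad, m, n, t, eta=1e-3, mu=0.9, nu=0.999, eps=1e-8):
    m = mu * m + (1 - mu) * grad                # first-moment moving average
    n = nu * n + (1 - nu) * grad ** 2           # second-moment moving average
    m_hat = m / (1 - mu ** t)                   # bias-corrected first moment
    n_hat = n / (1 - nu ** t)                   # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(n_hat) + eps)
    return theta, m, n
```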
One specific example of training a convolutional neural network is provided above. It should be understood that the convolutional neural network may also be trained in other ways, however, it should be noted that the goal of the training is to make the convolutional neural network more sensitive to the logo or slogan portions in the scene image.
It should be noted that the convolutional neural network includes a detection network for determining the target region and a classification network for identifying the logo or slogan. It will be appreciated that their specific structure may vary according to the theoretical model on which the neural network is based. For example, the detection network may employ other sub-network structures, and need not necessarily include a feature extraction sub-network, a region generation sub-network, or the like. Preferably, the detection network and the classification network are independently constructed and independently trained.
According to still other embodiments of the present invention, in order to increase the recognition rate, the proportion of the scene mark or the slogan in the scene image may be considered. As an example, in the case of a fixed camera position, since the logo or the slogan is usually in a fixed position and is not easily moved, the area ratio of the image portion where the logo or the slogan is located to the scene image may be 1% -5% (this ratio range may be adjusted according to the actual scene), and therefore, the sample mark frame size in the training stage and the candidate area size in the detection and recognition stage may be selected according to this ratio. As an example, if the scene image is 1600 × 1200 pixels, the sample marker box may be 200 × 200 pixels.
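As a quick arithmetic check of this sizing example (hypothetical numbers taken from the paragraph above):

```python
# A 200 x 200 marker box in a 1600 x 1200 scene image covers about 2.1% of
# the area, which lies inside the suggested 1%-5% range for logo/slogan regions.
box_area = 200 * 200            # 40,000 pixels
image_area = 1600 * 1200        # 1,920,000 pixels
ratio = box_area / image_area   # ~0.0208, i.e. ~2.1%
```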
A second embodiment of the present invention provides a scene recognition apparatus, as shown in fig. 2, which includes a scene image acquisition unit 201, an object determination unit 203, and an object recognition unit 205.
The scene image acquisition unit 201 is configured to acquire a scene image. The target determination unit 203 is configured to extract candidate regions from the scene image using a convolutional neural network and determine a target region based on the candidate regions. The object recognition unit 205 is configured to recognize a logo or slogan of the scene from the target area, wherein the logo or slogan is capable of indicating the organization or company where the scene is located.
Specifically, the target determination unit 203 may generate at least one convolution feature map based on the scene image. The target determination unit 203 inputs each convolution feature map into the convolutional neural network (specifically, into the feature extraction sub-network), which outputs the coordinates of at least one candidate region and the probability that a logo or slogan is present in each candidate region.
The target determination unit 203 may also perform threshold filtering or non-maximum suppression screening on each candidate region to filter out excessive interference terms.
Further, the target determination unit 203 inputs the convolution feature map corresponding to the candidate region into the convolutional neural network (specifically, into the region generation sub-network) to obtain a coordinate correction amount of the candidate region, and determines the target region based on the coordinate correction amount. In this process, the bounding box regression sub-network plays the major role.
The object identifying unit 205 is configured to, after obtaining the target area from the object determining unit 203, input the scene image corresponding to the target area into a convolutional neural network (specifically, into a classification network) to identify the logo or slogan. The object recognition unit 205 may recognize a character, other logo, or trademark in the logo or slogan, and may determine the organization or company where the scene is located.
In the training process, the target determination unit 203 trains the convolutional neural network as follows: providing sample scene images including a logo or slogan; for each sample scene image, generating at least one sample convolution feature map based on the sample scene image; and, for each sample convolution feature map, dividing it into a plurality of grid cells. For each grid cell, at least one sample marker box is predicted, the confidence of each sample marker box is determined, and the probability of the sample marker box category is determined. The target determination unit 203 further determines a loss function of the convolutional neural network and solves it to determine at least one parameter of the network; the solving process may use an adaptive moment estimation optimizer, in which the learning rate of each parameter is dynamically adjusted using first- and second-moment estimates of the gradient.
As a specific example, the scene recognition apparatus may be implemented on the Darknet platform: Darknet itself is written in C, other services such as KCF tracking may be implemented in Python, and the modules are connected through Darknet's Python interface. In order to expose the scene recognition function as an API (application programming interface), or to build it into a scene recognition service for different users, a JavaScript-Java-Python-C workflow can be constructed, in which a Python-based service container loads the deep learning model. Here Django, a mature and powerful service container implemented in Python, can be used, and the resulting service stack follows the sequential activity logic below (a minimal sketch of the Django-side handling is given after the list):
1) The administrator starts the Django server, and the server initializes.
2) During initialization, the Django server calls Darknet's Python API (application program interface), starts the Darknet service, and loads the model weights onto the GPU.
3) A user issues a request from a client; the front-end JavaScript sends the request, uploading the image data and the control flow.
4) The Tomcat web server on the server side responds to the user request, parses the RESTful JSON request, decodes the Base64-encoded picture it contains, and parses the parameters in the control flow.
5) After Tomcat parses the request, it sends a request to the service container Django to invoke Darknet.
6) After Django receives the request and routes the cached data, the Darknet model is invoked and the relevant control parameters are passed to it.
7) The Darknet model performs the computation and returns the detection result to the Django server.
8) The Django server packages the result and passes it to the Tomcat server.
9) The Tomcat server processes the Django server's response and returns it to the user who requested detection.
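As an illustration of steps 5) to 8), a minimal Django view might look like the sketch below. The `darknet` Python bindings (`load_net`, `load_meta`, `detect`), the configuration and weight file names, and the request field names ("image", "threshold") are assumptions for illustration, not interfaces fixed by this example.

```python
# Minimal sketch of a Django view that forwards a Base64-encoded picture to a
# Darknet model loaded at start-up and returns the detections as JSON.
import base64, json, tempfile
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
import darknet  # assumed Python bindings shipped with Darknet

NET = darknet.load_net(b"yolov3.cfg", b"scene_logo.weights", 0)   # loaded once at start-up
META = darknet.load_meta(b"scene_logo.data")

@csrf_exempt
def detect_logo(request):
    payload = json.loads(request.body)
    image_bytes = base64.b64decode(payload["image"])      # Base64 picture from Tomcat
    with tempfile.NamedTemporaryFile(suffix=".jpg") as f:
        f.write(image_bytes)
        f.flush()
        detections = darknet.detect(NET, META, f.name.encode(),
                                    thresh=float(payload.get("threshold", 0.5)))
    results = [{"label": label.decode(), "confidence": conf, "box": box}
               for label, conf, box in detections]
    return JsonResponse({"detections": results})
```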
In some embodiments of the invention, at least a portion of the device or system may be implemented using a distributed set of computing devices connected by a communications network, or based on a "cloud". In such systems, multiple computing devices operate together to provide services through the use of shared resources.
A "cloud" based implementation may provide one or more advantages, including: openness, flexibility and extensibility, centrally manageable, reliable, scalable, optimized for computing resources, having the ability to aggregate and analyze information across multiple users, connecting across multiple geographic areas, and the ability to use multiple mobile or data network operators for network connectivity.
According to some embodiments of the present invention, there is provided a machine-storage medium having stored thereon a collection of computer-executable instructions that, when executed by a processor, may implement the scene recognition method provided in the first embodiment described above.
According to further embodiments of the present invention, there is provided a computer-controlled apparatus which, when executing computer-executable instructions stored in a memory, will perform the steps of the scene recognition method provided in the first embodiment above.
Those of skill in the art would appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To demonstrate interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The above description is only for the preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Various modifications may be made by those skilled in the art without departing from the spirit of the invention and the appended claims.

Claims (20)

1. A method of scene recognition, comprising:
acquiring a scene image;
extracting candidate regions from the scene image using a convolutional neural network;
determining a target region based on the candidate region; and
identifying a logo or slogan in the scene from the target region;
wherein the logo or slogan indicates an organization or company where the scene is located.
2. The method of claim 1, wherein extracting candidate regions from the scene image using a convolutional neural network comprises:
generating at least one convolution feature map based on the scene image;
for each of the convolution feature maps:
inputting the convolution feature map into the convolutional neural network to obtain the coordinates of at least one candidate region and the probability that the logo or slogan is present in each candidate region.
3. The method of claim 1, determining a target region based on the candidate regions comprising:
and carrying out threshold filtering and/or non-maximum inhibition screening on each candidate region.
4. The method of claim 1, wherein determining a target region based on the candidate regions comprises:
inputting the convolution feature map corresponding to the candidate region into the convolutional neural network to obtain a coordinate correction amount of the candidate region;
and determining the target region based on the coordinate correction amount.
5. The method of claim 1, wherein identifying a logo or slogan of the scene from the target region comprises:
and inputting the scene image corresponding to the target area into the convolutional neural network to identify the mark or the slogan.
6. The method of claim 1, further comprising training the convolutional neural network, wherein training the convolutional neural network comprises:
providing a sample scene image including the logo or slogan;
for each of the sample scene images:
generating at least one sample convolution feature map based on the sample scene image; for each of the sample convolution feature maps:
dividing the sample convolution feature map into a plurality of grid cells; for each of the grid cells:
predicting at least one sample marker box;
determining a confidence of each sample marker box; and
determining the probability of the sample marker box category.
7. The method of claim 6, wherein training the convolutional neural network further comprises:
determining a loss function of the convolutional neural network;
solving the loss function to determine at least one parameter of the convolutional neural network.
8. The method of claim 7, wherein solving the loss function comprises:
solving the loss function using an adaptive moment estimation optimizer.
9. The method of any one of claims 1 to 8, wherein the convolutional neural network comprises a detection network for determining the target region and a classification network for identifying the logo or slogan,
wherein the detection network comprises:
a feature extraction subnetwork;
a region generation sub-network;
a pooling layer; and
the bounding box regresses into a subnetwork.
10. A scene recognition apparatus comprising a scene image acquisition unit, an object determination unit, and an object recognition unit, wherein:
the scene image acquisition unit is configured to acquire a scene image;
the target determination unit is configured to:
extracting candidate regions from the scene image using a convolutional neural network;
determining a target region based on the candidate region;
the object recognition unit is configured to recognize a logo or slogan in the scene from the target area;
wherein the logo or slogan indicates an organization or company where the scene is located.
11. The apparatus of claim 10, wherein the target determination unit is configured to:
generating at least one convolution feature map based on the scene image;
for each of the convolution feature maps:
inputting the convolution feature map into the convolutional neural network to obtain the coordinates of at least one candidate region and the probability that the logo or slogan is present in each candidate region.
12. The apparatus of claim 10, wherein the target determination unit is configured to perform threshold filtering and/or non-maxima suppression screening on each of the candidate regions.
13. The apparatus according to claim 10, wherein the target determination unit is configured to input the convolution feature map corresponding to the candidate region into the convolution neural network to obtain a coordinate correction amount of the candidate region;
and determining the target area based on the coordinate correction quantity.
14. The apparatus of claim 10, wherein the object recognition unit is configured to input the scene image corresponding to the object region into the convolutional neural network to recognize the logo or slogan.
15. The apparatus of claim 10, wherein the target determination unit is further configured to train the convolutional neural network, wherein training the convolutional neural network comprises:
providing a sample scene image including the logo or slogan;
for each of the sample scene images:
generating at least one sample convolution feature map based on the sample scene image; for each of the sample convolution feature maps:
dividing the sample convolution feature map into a plurality of grid cells; for each of the grid cells:
predicting at least one sample marker box;
determining a confidence of each sample marker box; and
determining the probability of the sample marker box category.
16. The apparatus of claim 15, wherein in training the convolutional neural network, the target determination unit is further configured to:
determining a loss function of the convolutional neural network;
solving the loss function to determine at least one parameter of the convolutional neural network.
17. The apparatus of claim 16, wherein in training the convolutional neural network, the target determination unit is further configured to:
solving the loss function using an adaptive moment estimation optimizer.
18. The apparatus of any one of claims 10 to 17, wherein the convolutional neural network comprises a detection network for determining the target region and a classification network for identifying the logo or slogan,
wherein the detection network comprises:
a feature extraction subnetwork;
a region generation sub-network;
a pooling layer; and
the bounding box regresses into a subnetwork.
19. A machine storage medium having stored thereon a collection of computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, implement the method of any one of claims 1 to 8.
20. A computer-controlled apparatus which, when executing computer-executable instructions stored in a memory, performs the steps of the method of any one of claims 1 to 8.
CN201911172445.7A 2019-11-26 2019-11-26 Scene recognition method and device Active CN110956115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911172445.7A CN110956115B (en) 2019-11-26 2019-11-26 Scene recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911172445.7A CN110956115B (en) 2019-11-26 2019-11-26 Scene recognition method and device

Publications (2)

Publication Number Publication Date
CN110956115A true CN110956115A (en) 2020-04-03
CN110956115B CN110956115B (en) 2023-09-29

Family

ID=69978460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911172445.7A Active CN110956115B (en) 2019-11-26 2019-11-26 Scene recognition method and device

Country Status (1)

Country Link
CN (1) CN110956115B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461101A (en) * 2020-04-20 2020-07-28 上海东普信息科技有限公司 Method, device and equipment for identifying work clothes mark and storage medium
CN111507253A (en) * 2020-04-16 2020-08-07 腾讯科技(深圳)有限公司 Method and device for auditing displayed articles based on artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522938A (en) * 2018-10-26 2019-03-26 华南理工大学 The recognition methods of target in a kind of image based on deep learning
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN110163187A (en) * 2019-06-02 2019-08-23 东北石油大学 Remote road traffic sign detection recognition methods based on F-RCNN
CN110188705A (en) * 2019-06-02 2019-08-30 东北石油大学 A kind of remote road traffic sign detection recognition methods suitable for onboard system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109522938A (en) * 2018-10-26 2019-03-26 华南理工大学 The recognition methods of target in a kind of image based on deep learning
CN110163187A (en) * 2019-06-02 2019-08-23 东北石油大学 Remote road traffic sign detection recognition methods based on F-RCNN
CN110188705A (en) * 2019-06-02 2019-08-30 东北石油大学 A kind of remote road traffic sign detection recognition methods suitable for onboard system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张明; 桂凯: "Research on indoor scene recognition based on deep learning" *
李家兴; 覃兴平; 刘达才: "Traffic sign detection based on convolutional neural networks" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507253A (en) * 2020-04-16 2020-08-07 腾讯科技(深圳)有限公司 Method and device for auditing displayed articles based on artificial intelligence
CN111461101A (en) * 2020-04-20 2020-07-28 上海东普信息科技有限公司 Method, device and equipment for identifying work clothes mark and storage medium
CN111461101B (en) * 2020-04-20 2023-05-19 上海东普信息科技有限公司 Method, device, equipment and storage medium for identifying work clothes mark

Also Published As

Publication number Publication date
CN110956115B (en) 2023-09-29

Similar Documents

Publication Publication Date Title
CN111784685B (en) Power transmission line defect image identification method based on cloud edge cooperative detection
JP6843086B2 (en) Image processing systems, methods for performing multi-label semantic edge detection in images, and non-temporary computer-readable storage media
CN108229509B (en) Method and device for identifying object class and electronic equipment
US10936911B2 (en) Logo detection
US10346720B2 (en) Rotation variant object detection in Deep Learning
CN112734775B (en) Image labeling, image semantic segmentation and model training methods and devices
CN111178183B (en) Face detection method and related device
CN110287960A (en) The detection recognition method of curve text in natural scene image
US8213679B2 (en) Method for moving targets tracking and number counting
CN103020606B (en) Pedestrian detection method based on spatio-temporal context information
CN108229418B (en) Human body key point detection method and apparatus, electronic device, storage medium, and program
JP6798854B2 (en) Target number estimation device, target number estimation method and program
CN102959946A (en) Augmenting image data based on related 3d point cloud data
CN108230354B (en) Target tracking method, network training method, device, electronic equipment and storage medium
CN110472599B (en) Object quantity determination method and device, storage medium and electronic equipment
CN111553247B (en) Video structuring system, method and medium based on improved backbone network
CN109977832B (en) Image processing method, device and storage medium
CN110298281A (en) Video structural method, apparatus, electronic equipment and storage medium
CN111160169A (en) Face detection method, device, equipment and computer readable storage medium
CN110956115A (en) Scene recognition method and device
CN108734200A (en) Human body target visible detection method and device based on BING features
CN113850838A (en) Ship voyage intention acquisition method and device, computer equipment and storage medium
CN115482523A (en) Small object target detection method and system of lightweight multi-scale attention mechanism
CN111783716A (en) Pedestrian detection method, system and device based on attitude information
CN113792623B (en) Security check CT target object identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant