CN110956115B - Scene recognition method and device


Info

Publication number
CN110956115B
CN110956115B
Authority
CN
China
Prior art keywords
neural network
sample
scene
convolutional neural
scene image
Prior art date
Legal status
Active
Application number
CN201911172445.7A
Other languages
Chinese (zh)
Other versions
CN110956115A (en)
Inventor
陶民泽
Current Assignee
E Capital Transfer Co ltd
Original Assignee
E Capital Transfer Co ltd
Priority date
Filing date
Publication date
Application filed by E Capital Transfer Co ltd filed Critical E Capital Transfer Co ltd
Priority to CN201911172445.7A priority Critical patent/CN110956115B/en
Publication of CN110956115A publication Critical patent/CN110956115A/en
Application granted granted Critical
Publication of CN110956115B publication Critical patent/CN110956115B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention relates to a scene recognition method, which comprises the following steps: acquiring a scene image; extracting candidate regions from the scene image using a convolutional neural network; determining a target region based on the candidate regions; and identifying a logo or slogan in the scene from the target region; wherein the logo or slogan indicates the institution or company in which the scene is located. The method can accurately determine the institution, company or business hall where the scene is located.

Description

Scene recognition method and device
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to a scene recognition method and apparatus.
Background
Given a picture of a scene, a human or robot can determine whether the picture comes from a previously seen scene; this is the problem addressed by visual scene recognition. Visual scene recognition is widely applied in fields such as mobile robotics and autonomous driving.
In the prior art, recognition methods often operate on large-scale picture information: the corresponding scene images contain holistic features such as the decoration style and furnishings of a scene. However, the furnishings of such scenes (for example, business halls) often lack distinctive features and may even closely resemble one another, making it difficult to accurately judge where the scene is located. As a result, the effect of such image recognition methods is not ideal.
Furthermore, there is no explicit screening criterion for determining the target region from the candidate regions, and no published theory on what type of image features the target region should contain. How to accurately identify and distinguish scenes therefore remains a challenge.
Disclosure of Invention
The invention aims to provide a scene recognition method.
According to an aspect of the present invention, there is provided a scene recognition method, including: acquiring a scene image; extracting candidate regions from the scene image using a convolutional neural network; determining a target region based on the candidate regions; and identifying a logo or slogan in the scene from the target region; wherein the logo or slogan indicates the institution or company in which the scene is located.
Optionally, extracting the candidate regions from the scene image using the convolutional neural network comprises: generating at least one convolution feature map based on the scene image; and, for each convolution feature map, inputting the convolution feature map into the convolutional neural network to obtain the coordinates of at least one candidate region and the probability that a logo or slogan is present in each candidate region.
Optionally, determining the target region based on the candidate region includes: threshold filtering and/or non-maximum suppression screening are performed on each candidate region.
Optionally, determining the target region based on the candidate region includes: inputting the convolution feature map corresponding to the candidate region into a convolution neural network to obtain the coordinate correction quantity of the candidate region; the target area is determined based on the coordinate correction amount.
Optionally, identifying a logo or tagline of the scene from the target area comprises: and inputting the scene image corresponding to the target area into a convolutional neural network to identify a sign or a slogan.
Optionally, the method further comprises training the convolutional neural network, wherein training the convolutional neural network comprises: providing sample scene images comprising a logo or slogan; for each sample scene image: generating at least one sample convolution feature map based on the sample scene image; for each sample convolution feature map: dividing the sample convolution feature map into a plurality of grid cells; and, for each grid cell: predicting at least one sample marker box; determining the confidence of each sample marker box; and determining the probability of the sample marker box class.
Optionally, training the convolutional neural network further comprises: determining a loss function of the convolutional neural network; the loss function is solved to determine at least one parameter of the convolutional neural network.
Optionally, solving the loss function includes: solving the loss function using an adaptive moment estimation optimizer.
According to another aspect of the present invention, there is provided a scene recognition apparatus including a scene image acquisition unit, a target determination unit, and a target recognition unit, wherein: the scene image acquisition unit is configured to acquire a scene image; the target determination unit is configured to: extracting candidate areas from the scene image using a convolutional neural network; determining a target region based on the candidate region; the target recognition unit is configured to recognize a logo or slogan in the scene from the target area; wherein the logo or slogan indicates the institution or company in which the scene is located.
Optionally, the target determining unit is configured to: generating at least one convolution feature map based on the scene image; for each convolution feature map: the convolution feature map is input into a convolution neural network to obtain coordinates of at least one candidate region and probabilities of the presence of markers or slogans in each candidate region.
Optionally, the targeting unit is configured to perform threshold filtering and/or non-maximum suppression screening on each candidate region.
Optionally, the target determining unit is configured to input the convolution feature map corresponding to the candidate region into a convolution neural network to obtain a coordinate correction amount of the candidate region; the target area is determined based on the coordinate correction amount.
Optionally, the target recognition unit is configured to input the scene image corresponding to the target region into a convolutional neural network to recognize a logo or slogan.
Optionally, the target determination unit is further configured to train the convolutional neural network, wherein training the convolutional neural network comprises: providing sample scene images comprising a logo or slogan; for each sample scene image: generating at least one sample convolution feature map based on the sample scene image; for each sample convolution feature map: dividing the sample convolution feature map into a plurality of grid cells; and, for each grid cell: predicting at least one sample marker box; determining the confidence of each sample marker box; and determining the probability of the sample marker box class.
Optionally, in training the convolutional neural network, the target determination unit is further configured to: determining a loss function of the convolutional neural network; the loss function is solved to determine at least one parameter of the convolutional neural network.
Optionally, in training the convolutional neural network, the target determination unit is further configured to: the loss function is solved using an adaptive moment estimation optimizer.
According to the scene recognition method provided by the invention, candidate regions are extracted from the scene image and the target region is determined using a convolutional neural network. The parameters of the convolutional neural network are set so that it is more sensitive to the logo or slogan portions of the scene image, and by recognizing the logo or slogan, the institution, company or business hall where the scene is located can be accurately determined.
Drawings
Fig. 1 is a schematic flow chart of a scene recognition method according to a first embodiment of the present invention.
Fig. 2 is a block diagram showing a configuration of a scene recognition apparatus according to a second embodiment of the present invention.
Detailed Description
In the following description, specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that embodiments of the invention can be practiced without these specific details. In the present invention, specific numerical references such as "first element", "second device", etc. may be made. However, a specific numerical reference should not be construed as necessarily subject to its literal order, but rather as a "first element" distinct from a "second element".
The particular details presented herein are exemplary only and the particular details may vary and yet fall within the spirit and scope of the present invention. The term "coupled" is defined as either directly connected to an element or indirectly connected to an element via another element.
Preferred embodiments of methods, systems and apparatus suitable for implementing the present invention are described below with reference to the accompanying drawings. Although the embodiments are described with respect to a single combination of elements, it is to be understood that the invention includes all possible combinations of the disclosed elements. Thus, if one embodiment includes elements A, B and C, while a second embodiment includes elements B and D, the present invention should also be considered to include other remaining combinations of A, B, C or D, even if not explicitly disclosed.
The first embodiment of the present invention provides a scene recognition method, as shown in fig. 1, which includes steps S10-S12-S14-S16.
Step S10: a scene image is acquired.
In this step, video is captured by a camera installed in the scene to be recognized; the video is read and parsed, and frames are extracted from it at a fixed rate (for example, 5 frames per second) and arranged in sequence to form the scene images.
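As an illustration, the frame-sampling step above might look as follows. This is a minimal sketch assuming OpenCV is used for video decoding; the sampling rate, paths and file naming are illustrative assumptions and are not prescribed by the method.

```python
# Hypothetical sketch of step S10: sample frames from a surveillance video at a
# fixed rate (e.g., 5 frames per second) and write them out in sequence as the
# scene images. The video path, output directory and rate are assumptions.
import os
import cv2

def extract_scene_images(video_path: str, out_dir: str, frames_per_second: float = 5.0) -> int:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(native_fps / frames_per_second)), 1)
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            # Frames are written in order so that they form a sequence of scene images.
            cv2.imwrite(os.path.join(out_dir, f"scene_{saved:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```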
Step S12: candidate regions are extracted from the scene image using a convolutional neural network.
In this step, a convolutional neural network (whether or not its training is complete) is used to extract, from the scene image, candidate regions in which a logo or slogan of the scene may be present. However, the candidate regions may also include regions in which no logo or slogan is actually present.
It should be noted that the "logo or slogan" referred to in the present invention may indicate the name or identity of the institution or company where the scene is located. A logo includes not only a textual logo but also a graphical one, such as a company's trademark or design. The furnishings of different institutions or companies may look similar, but their logos or slogans should differ considerably. For the same institution or company, such logos or slogans should be the same or similar regardless of differences in time, weather or lighting, and regardless of how many people are in the scene. A logo or slogan in a scene can therefore be used to determine the institution, company or business hall.
Specifically, to extract candidate regions, a plurality of convolution feature maps are first generated based on the scene image. Each convolution feature map may correspond to one of the R, G and B pixel channels, or to a different convolution kernel. Different convolution feature maps can represent different feature dimensions of the same scene image. As an example, the same scene image is convolved in the R, G and B channels to generate convolution feature maps: the R-channel convolution feature map contains the convolution features of all R pixels, the G-channel convolution feature map contains the convolution features of all G pixels, and the B-channel convolution feature map is analogous.
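A minimal sketch of this per-channel convolution is given below, assuming NumPy and SciPy; the 3x3 kernel is an arbitrary example introduced for illustration, whereas in the method the kernels are learned by the network.

```python
# Illustrative sketch: generate one convolution feature map per colour channel of
# the same scene image. The example kernel is an assumption; learned kernels are
# used in the actual network.
import numpy as np
from scipy.signal import convolve2d

def per_channel_feature_maps(image_rgb: np.ndarray, kernel: np.ndarray):
    # image_rgb has shape (H, W, 3); one 2-D feature map is produced per channel.
    return [convolve2d(image_rgb[:, :, c], kernel, mode="same") for c in range(3)]

example_kernel = np.array([[0, -1, 0],
                           [-1, 4, -1],
                           [0, -1, 0]], dtype=float)  # simple edge-response kernel
```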
After generating the convolution feature map, the convolution feature map is input into a convolution neural network for each convolution feature map, and the convolution neural network may output coordinates of a plurality of candidate regions and probabilities of scene markers or slogans existing in each candidate region.
According to some embodiments of the invention, the convolutional neural network may comprise a detection network for determining the target area and a classification network for identifying the marker or taggant, wherein the detection network further comprises: the feature extraction sub-network, the region generation sub-network, the pooling layer and the frame regression sub-network. The detection network and the classification network should be trained independently of each other. The sub-networks in the detection network may be trained in a unified manner or may be trained independently of each other. The parameters of the convolutional neural network (including its various sub-networks) should be set so that it is more sensitive to the logo or tagline portions in the scene image.
Step S14: a target region is determined based on the candidate region.
In this step, the candidate regions are processed or screened to select a target region, that is, one or more candidate regions in which (as judged by the scene recognition device of the present invention) a scene logo or slogan actually exists.
According to some embodiments of the present invention, in order to determine the target region, threshold filtering and/or non-maximum suppression screening may be performed on each candidate region.
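A generic sketch of such screening is shown below, assuming candidate boxes with associated scores; the score and IoU thresholds are illustrative assumptions rather than values prescribed by the method.

```python
# Generic sketch of threshold filtering followed by non-maximum suppression (NMS).
# Boxes are rows of (x1, y1, x2, y2); thresholds are illustrative assumptions.
import numpy as np

def filter_candidates(boxes: np.ndarray, scores: np.ndarray,
                      score_thresh: float = 0.5, iou_thresh: float = 0.45):
    keep_mask = scores >= score_thresh              # threshold filtering
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort()[::-1]                  # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the current best box and all remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = order[1:][iou < iou_thresh]         # suppress strongly overlapping candidates
    return boxes[keep], scores[keep]
```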
According to further embodiments, the convolution feature map corresponding to the candidate region is first input to a convolution neural network (e.g., to a region generation sub-network) to obtain a coordinate correction amount for the candidate region, and then a final target region is determined based on the coordinate correction amount.
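As an illustration of how such a coordinate correction might be applied to a candidate box, the sketch below uses the common center/size delta parameterization; the patent does not specify the parameterization, so this choice is an assumption.

```python
# Sketch: refine a candidate box (cx, cy, w, h) with predicted corrections
# (dx, dy, dw, dh). The delta parameterization is a common convention assumed
# here; the text only states that a coordinate correction amount is predicted.
import math

def apply_correction(box, deltas):
    cx, cy, w, h = box
    dx, dy, dw, dh = deltas
    new_cx = cx + dx * w         # shift the center proportionally to the box size
    new_cy = cy + dy * h
    new_w = w * math.exp(dw)     # scale width/height with exponentiated deltas
    new_h = h * math.exp(dh)
    return (new_cx, new_cy, new_w, new_h)
```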
Step S16: a logo or slogan in the scene is identified from the target area.
As an example, in this step the scene image corresponding to the target region may be input into the convolutional neural network to identify the scene logo or slogan therein, where the logo or slogan indicates the name or identity of the institution or company in which the scene is located. The method can thus accurately identify and distinguish different scenes: even when different scenes look highly similar in their furnishings, the logo or slogan is one of the most recognizable features of a scene.
It is precisely this recognizability of a scene's logo or slogan that makes it effective for distinguishing different scenes. However, a logo or slogan often occupies only a small, easily overlooked portion of a scene image, and compared with scene furnishings or people it contains relatively few features (e.g., image gradients or variances). Traditional image recognition methods therefore suffer from difficult extraction and poor recognition when dealing with scene logos or slogans. To this end, the detection network introduced by the invention comprises a feature extraction sub-network, a region generation sub-network, a pooling layer and a bounding-box regression sub-network. The feature extraction sub-network extracts features of the scene image, the region generation sub-network outputs candidate regions, and the pooling layer compresses the input feature maps: on the one hand this reduces the feature maps and simplifies the computational complexity of the network; on the other hand it compresses the features, extracts the principal features and helps determine the target region.
In order to accurately extract the target area and identify the scene markers or slogans, the convolutional neural network needs to be trained in advance. In addition, training may be performed periodically to adjust various parameters of the convolutional neural network to improve the adaptability of the convolutional neural network to the scene environment.
According to some embodiments, training a convolutional neural network may proceed as follows.
First, a plurality of sample scene images including at least in part a logo or slogan are provided. At least one sample convolution feature map is generated for each sample scene image. Each sample convolution feature map is divided into a plurality of lattices of the same size.
Subsequently: 1) For each grid cell, at least one sample marker box is predicted, where the prediction can be made based on the features contained in the convolution feature map and the sample marker box is represented by coordinates. 2) The confidence of each sample marker box is determined: if a logo or slogan lies within the sample marker box, the confidence of the box is 1; if there is no overlap, the confidence is 0; and if the sample marker box only partially overlaps the logo or slogan, the confidence takes the ratio of the intersection to the union of the two regions, i.e. an intermediate value between 0 and 1. 3) The probability of each sample marker box class is determined, where the classification can be based on the magnitude of the confidence. For example, if 4 marker boxes are predicted in a grid cell with confidences 0, 0, 0 and 1, there are two sample marker box classes with probabilities 0.75 and 0.25, respectively.
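The confidence rule in item 2) can be illustrated with an intersection-over-union computation, as in the sketch below; the (x1, y1, x2, y2) box format is an assumption made for the example.

```python
# Sketch of the confidence assignment described above. The IoU is 1 when the
# marker box coincides with the logo/slogan box, 0 when they do not overlap, and
# an intermediate value for partial overlap. Box format (x1, y1, x2, y2) assumed.
def marker_box_confidence(marker_box, truth_box) -> float:
    ix1 = max(marker_box[0], truth_box[0])
    iy1 = max(marker_box[1], truth_box[1])
    ix2 = min(marker_box[2], truth_box[2])
    iy2 = min(marker_box[3], truth_box[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_m = (marker_box[2] - marker_box[0]) * (marker_box[3] - marker_box[1])
    area_t = (truth_box[2] - truth_box[0]) * (truth_box[3] - truth_box[1])
    union = area_m + area_t - inter
    return inter / union if union > 0 else 0.0
```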
Further, training the convolutional neural network further comprises: determining a loss function of the convolutional neural network and solving the loss function to determine at least one parameter of the convolutional neural network. The loss function can be solved in a number of ways; an adaptive moment estimation (Adam) optimizer is a preferred choice.
One specific example of training a convolutional neural network is provided below.
1. A sample convolution feature map is generated based on the sample scene image using a Darknet-53 model. From layer 0 up to layer 74 there are 53 convolutional layers; the remaining layers are residual (res) layers. The model uses a series of 3x3 and 1x1 convolution kernels, and the resulting convolutional layers integrate several well-performing mainstream network structures.
2. The input convolution feature map is divided evenly into SxS grid cells, and B marker boxes are predicted for each cell. The confidence of each marker box is determined separately, as described above. The marker box format is (x, y, w, h, confidence), where x and y are the offsets of the center of the target (logo or slogan) relative to the grid cell, w and h are its width and height, and all values are normalized. The confidence reflects whether the target is contained in the box and, if it is, how accurate the predicted position is. Then, the probability of each sample marker box class (the number of classes being C) is determined as described above.
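A minimal sketch of this encoding is shown below, assuming ground-truth boxes given in pixel coordinates; the image size and grid size S are parameters of the example rather than values fixed by the method.

```python
# Sketch of encoding a ground-truth box into the (x, y, w, h, confidence) format
# described above: the center offset is expressed relative to its grid cell and
# all values are normalized. Image size and grid size S are example parameters.
def encode_marker_box(box_xyxy, img_w, img_h, S, confidence):
    x1, y1, x2, y2 = box_xyxy
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    col = min(int(cx / img_w * S), S - 1)   # grid cell containing the box center
    row = min(int(cy / img_h * S), S - 1)
    x = cx / img_w * S - col                # center offset within the cell, in [0, 1)
    y = cy / img_h * S - row
    w = (x2 - x1) / img_w                   # width/height normalized by image size
    h = (y2 - y1) / img_h
    return row, col, (x, y, w, h, confidence)
```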
3. The convolution feature map is connected to a fully connected layer (which connects all the features and outputs values to the classifier) in the form of an S x S x (5B + C) tensor; for example, with S = 7, B = 2 and C = 2, the output tensor is 7 x 7 x 12. The loss function is calculated as follows:
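The formula itself did not survive text extraction of the original document. A standard YOLO-style loss consistent with the row-by-row description below would take roughly the following form; this is a reconstruction under that assumption, not the patent's original figure.

```latex
% Assumed reconstruction of the YOLO-style loss matching the variable definitions
% in the surrounding text (the lambda_noobj term is part of the standard form and
% is an assumption here).
\begin{aligned}
L ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2
   + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
  &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
```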
where x_i, y_i, w_i and h_i are respectively the x and y offsets and the width and height of the i-th sample marker box, C_i is its confidence, classes denotes the set of sample marker box classes, and p_i(c) is the probability of class c for the box. The indicator 1_ij^obj denotes whether the j-th marker box of the i-th grid cell contains a target; the coordinates of a target are predicted from the marker box having the largest intersection-over-union (IoU) with the ground-truth box in which the target lies. lambda_coord is the error weight for the coordinates (the subscript coord refers to coordinates). In the above formula, the first and second rows compute the coordinate error, the third and fourth rows compute the confidence (IoU) error, and the last row computes the classification error.
4. The loss function is solved with an Adam optimizer. Adam (Adaptive Moment Estimation) is essentially RMSprop with a momentum term; it dynamically adjusts the learning rate of each parameter using first- and second-moment estimates of the gradient. Its advantage is that, after bias correction, the learning rate of each iteration stays within a definite range, so that the parameter updates are stable. The moment-estimation equations are as follows:
m_t = μ·m_{t-1} + (1 - μ)·g_t
n_t = ν·n_{t-1} + (1 - ν)·g_t²
m̂_t = m_t / (1 - μ^t)
n̂_t = n_t / (1 - ν^t)
Δθ_t = -η·m̂_t / (√(n̂_t) + ε)
where m_t, n_t and g_t are respectively the exponential moving average of the gradient, the exponential moving average of the squared gradient, and the gradient itself; m̂_t and n̂_t are the bias-corrected first- and second-moment estimates; μ and ν are the exponential decay rates of the first- and second-moment estimates; and Δθ_t is the update applied over time step t, an effective, dynamically bounded step in parameter space (η denotes the learning rate and ε a small constant for numerical stability).
One specific example of training a convolutional neural network is provided above. It should be appreciated that the convolutional neural network may also be trained in other ways, however, it should be noted that the goal of training is to make the convolutional neural network more sensitive to the logo or tagline portions in the scene image.
It should be noted that the convolutional neural network includes a detection network for determining the target area and a classification network for identifying the logo or slogan. It should be understood that their specific structure may vary according to the theoretical model on which the neural network is based. For example, the detection network may employ other sub-network structures, without necessarily including feature extraction sub-networks, region generation sub-networks, and the like. Preferably, the detection network and the classification network are independently configured and independently trained.
According to further embodiments of the present invention, the proportion of the scene logo or slogan within the scene image may be taken into account to increase the recognition rate. In the case of a fixed camera position, and given that a sign or slogan is also typically fixed in place and not easily moved, the sample marker box size used in the training phase and the candidate region size used in the detection and recognition phase may be selected according to this proportion. The image portion occupied by the sign or slogan may be 1%-5% of the area of the scene image (this range may be adjusted according to the actual scene). As an example, if the scene image is 1600x1200 pixels, the sample marker box may be taken as 200x200 pixels (about 2% of the image area).
A second embodiment of the present invention provides a scene recognition apparatus, as shown in fig. 2, which includes a scene image acquisition unit 201, a target determination unit 203, and a target recognition unit 205.
The scene image acquisition unit 201 is configured to acquire a scene image. The target determination unit 203 is configured to extract candidate regions from the scene image using a convolutional neural network and determine a target region based on the candidate regions. The target recognition unit 205 is configured to recognize a logo or slogan of the scene from the target area, wherein the logo or slogan is capable of indicating the institution or company in which the scene is located.
Specifically, the target determination unit 203 may generate at least one convolution feature map based on the scene image. The target determination unit 203 inputs each convolution feature map into the convolutional neural network (specifically, into the feature extraction sub-network), and the convolutional neural network outputs the coordinates of at least one candidate region and the probability that a logo or slogan exists in each candidate region.
The target determination unit 203 may further perform threshold filtering or non-maximum suppression screening on each candidate region to filter out redundant or interfering candidates.
Further, the target determination unit 203 inputs the convolution feature map corresponding to the candidate region into the convolutional neural network (specifically, into the region generation sub-network) to obtain the coordinate correction amount of the candidate region, and determines the target region based on the coordinate correction amount. In this process, the bounding-box regression sub-network plays the major role.
The target recognition unit 205 is configured to input a scene image corresponding to a target region into a convolutional neural network (specifically, into a classification network) to recognize a logo or a slogan after the target region is obtained from the target determination unit 203. The target recognition unit 205 may recognize text, other logos or trademarks in a logo or slogan, and may further determine the institution or company in which the scene is located.
In the training process, the target determination unit 203 trains the convolutional neural network by: providing sample scene images comprising a logo or slogan; generating at least one sample convolution feature map for each sample scene image; and, for each sample convolution feature map, dividing the sample convolution feature map into a plurality of grid cells. For each grid cell, at least one sample marker box is predicted, the confidence of each sample marker box is determined, and the probability of the sample marker box class is also determined. The target determination unit 203 further determines a loss function of the convolutional neural network and solves the loss function to determine at least one parameter of the convolutional neural network, which may be done using an adaptive moment estimation optimizer, in which the learning rate of each parameter is dynamically adjusted using the first- and second-moment estimates of the gradient.
As a specific example, the scene recognition device may be implemented on the Darknet platform, where Darknet itself is implemented in C, while other services (such as KCF) may be implemented in Python, and the modules are connected via Darknet's Python interface. In order to expose the scene recognition function as an API or as a scene recognition device serving different users, a JavaScript-Java-Python-C workflow may be constructed, in which a Python-based server container is used to load the deep learning model. Django, a mature and powerful server container implemented in Python, is considered here, and the service stack is built accordingly. The specific sequential activity logic is as follows (a minimal Django-side sketch is given after this list):
1) The administrator starts the Django server, and the server is initialized.
2) When the Django server is initialized, a Python API interface of the Darknet is called, the Darknet service is started, and the model weight is loaded into the GPU.
3) The user sends a request from the client, and the front-end JavaScript sends the request to upload the image data and the control flow.
4) The web server Tomcat on the server side responds to the user request, parses the RESTful JSON request, decodes the Base64-encoded picture in it, and parses the parameters in the control flow.
5) After Tomcat has parsed the request, a request to invoke Darknet is sent to the server container Django.
6) After receiving the request, Django routes the cached data and invokes the Darknet model, passing the related control parameters to it.
7) The Darknet model performs the computation and returns the detection result to the Django server.
8) The Django server encapsulates the computation result and passes it to the Tomcat server.
9) The Tomcat server processes the Django server's response and returns the result to the user who requested the detection.
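The Django-side portion of this workflow (steps 6 to 8) might look roughly as follows. The view name, URL routing, request fields and the Darknet wrapper call are all illustrative assumptions, not an actual published interface.

```python
# Hypothetical sketch of steps 6)-8): a Django view that receives the Base64 image
# forwarded by Tomcat, calls a Darknet-based detector, and returns the result.
# The view name, request fields and the `detect` wrapper method are assumptions.
import base64
import json

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

# Assume the Darknet model and weights were loaded into the GPU at server start-up
# (step 2) and wrapped in an object exposing detect(image_bytes, **params).
darknet_model = None  # placeholder for the loaded model wrapper

@csrf_exempt
def detect_scene(request):
    if darknet_model is None:
        return JsonResponse({"error": "model not loaded"}, status=503)
    payload = json.loads(request.body)
    image_bytes = base64.b64decode(payload["image"])   # Base64 picture forwarded by Tomcat
    control_params = payload.get("params", {})         # parsed control-flow parameters
    detections = darknet_model.detect(image_bytes, **control_params)
    # Encapsulate the detection result and return it to the Tomcat server (step 8).
    return JsonResponse({"detections": detections})
```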
In some embodiments of the invention, at least a portion of the device or system may be implemented using a set of distributed computing devices connected by a communications network, or based on a "cloud". In such a system, multiple computing devices operate together to provide services through the use of shared resources.
The "cloud"-based implementation may provide one or more advantages, including: openness, flexibility and extensibility, centralized management, reliability, scalability, optimization of computing resources, the ability to aggregate and analyze information across multiple users, the ability to connect across multiple geographic areas, and the ability to use multiple mobile or data network operators for network connectivity.
According to some embodiments of the present invention, there is provided a machine storage medium having stored thereon a set of computer executable instructions which, when executed by a processor, implement the scene recognition method provided by the first embodiment described above.
According to still further embodiments of the present invention, there is provided a computer control apparatus which, when executing computer-executable instructions stored in a memory, will perform the steps of the scene recognition method provided in the first embodiment described above.
Those of skill would appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To demonstrate interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software will depend upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The above description is only for the preferred embodiments of the invention and is not intended to limit the scope of the invention. Numerous variations and modifications can be made by those skilled in the art without departing from the spirit of the invention and the appended claims.

Claims (18)

1. A scene recognition method, comprising:
acquiring a scene image;
extracting a candidate region from the scene image using a convolutional neural network, the candidate region having a size of 1% -5% of an area of the scene image;
determining a target region based on the candidate region; and
identifying a logo or slogan in the scene from the target area;
wherein the sign or slogan indicates the institution or company in which the scene is located;
wherein the convolutional neural network is trained as follows:
providing a sample scene image comprising the logo or slogan;
for each of the sample scene images:
generating at least one sample convolution feature map based on the sample scene image;
for each of the sample convolution feature maps:
dividing the sample convolution feature map into a plurality of grid cells;
for each of the grid cells:
predicting at least one sample marker box such that the size of the predicted sample marker box is 1%-5% of the area of the sample scene image;
determining the confidence of each sample marker box; and
determining the probability of the sample marker box class.
2. The method of claim 1, wherein extracting candidate regions from the scene image using a convolutional neural network comprises:
generating at least one convolution feature map based on the scene image;
for each of the convolution feature maps:
and inputting the convolution characteristic map into the convolution neural network to obtain the coordinates of at least one candidate region and the probability of the mark or the slogan in each candidate region.
3. The method of claim 1, determining a target region based on the candidate region comprising:
and carrying out threshold filtering and/or non-maximum suppression screening on each candidate region.
4. The method of claim 1, wherein determining a target region based on the candidate region comprises:
inputting the convolution feature map corresponding to the candidate region into the convolution neural network to obtain the coordinate correction quantity of the candidate region;
the target area is determined based on the coordinate correction amount.
5. The method of claim 1, wherein identifying a logo or slogan of the scene from the target area comprises:
and inputting the scene image corresponding to the target area into the convolutional neural network to identify the sign or the slogan.
6. The method of claim 1, wherein training the convolutional neural network further comprises:
determining a loss function of the convolutional neural network;
the loss function is solved to determine at least one parameter of the convolutional neural network.
7. The method of claim 6, wherein solving the loss function comprises:
the loss function is solved using an adaptive moment estimation optimizer.
8. The method of any one of claims 1 to 7, wherein the convolutional neural network comprises a detection network for determining the target region and a classification network for identifying the logo or slogan,
wherein the detection network comprises:
a feature extraction sub-network;
a region generation sub-network;
a pooling layer; and
a bounding-box regression sub-network.
9. A scene recognition device comprising a scene image acquisition unit, a target determination unit and a target recognition unit, wherein:
the scene image acquisition unit is configured to acquire a scene image;
the target determination unit is configured to:
extracting a candidate region from the scene image using a convolutional neural network, the candidate region having a size of 1% -5% of an area of the scene image;
determining a target region based on the candidate region;
the target recognition unit is configured to recognize a logo or slogan in the scene from the target area;
wherein the sign or slogan indicates the institution or company in which the scene is located;
the target determination unit is further configured to train the convolutional neural network, wherein training the convolutional neural network comprises:
providing a sample scene image comprising the logo or slogan;
for each of the sample scene images:
generating at least one sample convolution feature map based on the sample scene image;
for each of the sample convolution feature maps:
dividing the sample convolution feature map into a plurality of grid cells;
for each of the grid cells:
predicting at least one sample marker box such that the size of the predicted sample marker box is 1%-5% of the area of the sample scene image;
determining the confidence of each sample marker box; and
determining the probability of the sample marker box class.
10. The apparatus according to claim 9, wherein the targeting unit is configured to:
generating at least one convolution feature map based on the scene image;
for each of the convolution feature maps:
and inputting the convolution characteristic map into the convolution neural network to obtain the coordinates of at least one candidate region and the probability of the mark or the slogan in each candidate region.
11. The apparatus according to claim 9, wherein the targeting unit is configured to perform threshold filtering and/or non-maximum suppression screening on each of the candidate regions.
12. The apparatus according to claim 9, wherein the target determination unit is configured to input the convolution feature map corresponding to the candidate region into the convolution neural network to obtain a coordinate correction amount of the candidate region;
the target area is determined based on the coordinate correction amount.
13. The apparatus of claim 9, wherein the target recognition unit is configured to input the scene image corresponding to the target region into the convolutional neural network to recognize the logo or slogan.
14. The apparatus of claim 9, wherein in training the convolutional neural network, the targeting unit is further configured to:
determining a loss function of the convolutional neural network;
the loss function is solved to determine at least one parameter of the convolutional neural network.
15. The apparatus of claim 14, wherein in training the convolutional neural network, the targeting unit is further configured to:
the loss function is solved using an adaptive moment estimation optimizer.
16. The apparatus according to any one of claims 9 to 15, wherein the convolutional neural network comprises a detection network for determining the target region and a classification network for identifying the logo or slogan,
wherein the detection network comprises:
a feature extraction sub-network;
a region generation sub-network;
a pooling layer; and
a bounding-box regression sub-network.
17. A machine storage medium having stored thereon a set of computer executable instructions which, when executed by a processor, implement the method of any one of claims 1 to 7.
18. A computer control apparatus which, when implementing computer executable instructions stored in a memory, performs the steps of the method of any one of claims 1 to 7.
CN201911172445.7A 2019-11-26 2019-11-26 Scene recognition method and device Active CN110956115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911172445.7A CN110956115B (en) 2019-11-26 2019-11-26 Scene recognition method and device


Publications (2)

Publication Number Publication Date
CN110956115A CN110956115A (en) 2020-04-03
CN110956115B true CN110956115B (en) 2023-09-29

Family

ID=69978460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911172445.7A Active CN110956115B (en) 2019-11-26 2019-11-26 Scene recognition method and device

Country Status (1)

Country Link
CN (1) CN110956115B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507253B (en) * 2020-04-16 2023-06-30 腾讯科技(深圳)有限公司 Display article auditing method and device based on artificial intelligence
CN111461101B (en) * 2020-04-20 2023-05-19 上海东普信息科技有限公司 Method, device, equipment and storage medium for identifying work clothes mark


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109522938A (en) * 2018-10-26 2019-03-26 华南理工大学 The recognition methods of target in a kind of image based on deep learning
CN110163187A (en) * 2019-06-02 2019-08-23 东北石油大学 Remote road traffic sign detection recognition methods based on F-RCNN
CN110188705A (en) * 2019-06-02 2019-08-30 东北石油大学 A kind of remote road traffic sign detection recognition methods suitable for onboard system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张明; 桂凯. 基于深度学习的室内场景识别的研究 [Research on indoor scene recognition based on deep learning]. 现代计算机(专业版) [Modern Computer (Professional Edition)], 2018, (16), 全文 (full text). *
李家兴; 覃兴平; 刘达才. 基于卷积神经网络的交通标志检测 [Traffic sign detection based on convolutional neural networks]. 工业控制计算机 [Industrial Control Computer], 2018, (05), 全文 (full text). *

Also Published As

Publication number Publication date
CN110956115A (en) 2020-04-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant