CN116266393A - Article identification method, apparatus, electronic device, and computer-readable storage medium


Info

Publication number
CN116266393A
Authority
CN
China
Prior art keywords
target
map
saliency map
image
article
Prior art date
Legal status
Pending
Application number
CN202111540033.1A
Other languages
Chinese (zh)
Inventor
罗中华
连自锋
熊君君
Current Assignee
SF Technology Co Ltd
Original Assignee
SF Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by SF Technology Co Ltd
Priority to CN202111540033.1A
Publication of CN116266393A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an article identification method, an article identification apparatus, an electronic device, and a computer-readable storage medium. The article identification method comprises the following steps: acquiring a target image of an article to be identified; performing global feature extraction on the target image to obtain a target global feature of the target image; acquiring a target representation feature map of the target image; acquiring a target saliency map of a local part of the target image based on the target global feature and the target representation feature map; and performing category identification on the target image based on the target saliency map to obtain the target article category of the article to be identified in the target image. The method and the apparatus can improve the classification accuracy of articles to a certain extent.

Description

Article identification method, apparatus, electronic device, and computer-readable storage medium
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to an article identification method, an apparatus, an electronic device, and a computer readable storage medium.
Background
With the rapid development of computer vision technology, in recent years, computer vision technology has been applied to more and more scenes. Among them, classification of items, such as merchandise on supermarket shelves, is one of the main applications of computer vision technology.
In the prior art, the type of an article is generally determined by directly detecting or classifying an image object based on an image of the article.
However, in some situations, for example when articles such as commodities come in a wide variety and the differences between articles are small, it is difficult to classify the articles accurately with a conventional image target detection or classification method. Therefore, when the existing image target detection or classification approach is used to classify articles, the classification accuracy is low.
Disclosure of Invention
The application provides an article identification method, an article identification device, electronic equipment and a computer readable storage medium, and aims to solve the problem that when an existing image target detection or classification mode is adopted to classify articles, the classification accuracy of the articles is low.
In a first aspect, the present application provides a method of article identification, the method comprising:
acquiring a target image of an object to be identified;
extracting global features of the target image to obtain target global features of the target image;
acquiring a target representation feature map of the target image;
acquiring a target saliency map of a local part of the target image based on the target global feature and the target representation feature map;
and performing category identification on the target image based on the target saliency map to obtain the target article category of the article to be identified in the target image.
In a second aspect, the present application provides an article identification device comprising:
the first acquisition unit is used for acquiring a target image of the object to be identified;
the global extraction unit is used for carrying out global feature extraction on the target image to obtain target global features of the target image;
a second acquisition unit configured to acquire a target representation feature map of the target image;
the local extraction unit is used for acquiring a target saliency map of a local part of the target image based on the target global feature and the target representation feature map;
and the identification unit is used for carrying out category identification on the target image based on the target saliency map to obtain the target object category in the target image.
In a third aspect, the present application also provides an electronic device comprising a processor and a memory, the memory having stored therein a computer program, the processor executing the steps of any one of the article identification methods provided herein when invoking the computer program in the memory.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program to be loaded by a processor for performing the steps of the article identification method.
In the present application, global feature extraction is first performed on the target image to obtain the target global feature of the target image; a target saliency map of a local part of the target image is then acquired based on the target global feature and the target representation feature map, and the target article category of the article to be identified in the target image is identified. On the one hand, through this coarse-to-fine feature extraction, finer local features can be further extracted, on the basis of the global feature of the target and in combination with the target representation feature map, for identifying the article category, which improves the classification accuracy of articles. On the other hand, because the target global feature is fused while attention gradually shifts to finer local parts, the target saliency map carries both global semantic information and low-level detail information, so that discriminative local features and complete global features can be fully extracted for identifying the article category, improving the classification accuracy of articles to a certain extent.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of a scenario of an item identification system provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for identifying an article according to an embodiment of the present application;
FIG. 3 is a schematic illustration of the relationship between a target image and an image of a target object area provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a network architecture of an article identification model provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a network architecture of a local feature extraction branch provided in an embodiment of the present application;
fig. 6 is a schematic diagram of the working principle of the HSG Module provided in the embodiment of the present application;
FIG. 7 is a schematic diagram of another network structure of an article identification model according to an embodiment of the present application;
FIG. 8 is a schematic illustration of a training process for an article identification model provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of an embodiment of an article identification device provided in an embodiment of the present application;
fig. 10 is a schematic structural diagram of an embodiment of an electronic device provided in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
In the description of the embodiments of the present application, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or an implicit indication of the number of features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more of the described features. In the description of the embodiments of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The following description is presented to enable any person skilled in the art to make and use the application. In the following description, details are set forth for purposes of explanation. It will be apparent to one of ordinary skill in the art that the present application may be practiced without these specific details. In other instances, well-known processes have not been described in detail in order to avoid unnecessarily obscuring descriptions of the embodiments of the present application. Thus, the present application is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed in the embodiments of the present application.
The execution body of the article identification method in this embodiment may be an article identification device provided in this embodiment, or different types of electronic devices such as a server device, a physical host, or a User Equipment (UE) integrated with the article identification device, where the article identification device may be implemented in a hardware or software manner, and the UE may specifically be a terminal device such as a smart phone, a tablet computer, a notebook computer, a palm computer, a desktop computer, or a personal digital assistant (Personal Digital Assistant, PDA).
The electronic device may be operated in a single operation mode, or may also be operated in a device cluster mode.
Referring to fig. 1, fig. 1 is a schematic view of a scenario of an article identification system provided in an embodiment of the present application. The article identification system may include an electronic device 100, where the electronic device 100 has an article identification device integrated therein. For example, the electronic device may acquire a target image of the item to be identified; extracting global features of the target image to obtain target global features of the target image; acquiring a target representation feature map of the target image; acquiring a target saliency map of the target image part based on the target global feature and the target representation feature map; and carrying out category identification on the target image based on the target saliency map to obtain the target object category of the object to be identified in the target image.
In addition, as shown in FIG. 1, the item identification system may also include a memory 200 for storing data, such as image data, video data.
It should be noted that, the schematic view of the scenario of the article identification system shown in fig. 1 is only an example, and the article identification system and scenario described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided in the embodiments of the present application, and those skilled in the art can know that, as the article identification system evolves and new business scenarios appear, the technical solutions provided in the embodiments of the present invention are applicable to similar technical problems.
Next, an article identifying method provided in the embodiment of the present application will be described, where an electronic device is used as an execution body, and in order to simplify and facilitate the description, the execution body will be omitted in the subsequent method embodiments.
Referring to fig. 2, fig. 2 is a schematic flow chart of an article identification method according to an embodiment of the present application. It should be noted that although a logical order is depicted in the flowchart, in some cases the steps depicted or described may be performed in a different order than presented herein. The article identification method comprises steps 201 to 205, wherein:
201. a target image of the item to be identified is acquired.
The item to be identified may be any item, for example, a commodity on a supermarket shelf.
The target image refers to an image used for identifying the category of the article to be identified. The target image may be an image obtained by directly photographing a certain article to be identified, or an image obtained by cropping a certain article to be identified out of an image containing a plurality of articles.
In step 201, there are various ways to acquire the target image, which illustratively include:
1) An image obtained by directly photographing a certain article to be identified is used as the target image. In practical applications, the electronic device may integrate a camera in hardware and capture a video frame or image of the article to be identified in real time through the camera as the target image. Alternatively, a camera may be arranged above the article placement area, a video frame or image of a certain article to be identified in the article placement area is captured in real time by that camera, and the electronic device, which is network-connected to the camera above the article placement area, acquires the captured video frame or image online from that camera as the target image. Alternatively, the electronic device may read an image of a certain article to be identified, captured by a camera (either the camera integrated in the electronic device or the camera above the article placement area), from a relevant storage medium storing such images, and use it as the target image.
2) An image of the target article area is acquired, and a certain article to be identified is cropped out of the image of the target article area. In this case, step 201 may specifically include the following steps 2011 to 2013:
2011. an image of a target item area is acquired.
The target item area refers to an area for placing items, in particular for placing items to be identified, for example shelves for placing goods in supermarkets.
In step 2011, various ways of capturing an image of the target object area are provided, including, for example:
(1) In practical application, the electronic device may integrate a camera in hardware, and obtain a video frame or image of the target object area through real-time shooting by the camera, so as to be used as an image of the target object area.
(2) The camera can be arranged above the target object area, the video frame or the image of the target object area can be obtained through real-time shooting of the camera above the target object area, and the electronic equipment is connected with the camera above the target object area in a network. And according to the network connection, acquiring video frames or images of the target object area, which are shot by the camera above the target object area, from the camera above the target object area on line to serve as images of the target object area.
(3) The electronic device may also read out the target object area image captured by the camera from a relevant storage medium storing the target object area image captured by the camera (including the camera integrated by the electronic device or the camera above the target object area) as an image of the target object area.
(4) And reading a video frame or an image of the target object area which is acquired in advance and stored in the electronic equipment as an image of the target object area.
2012. And performing classification detection on the image of the target object area to obtain each object detection area in the image of the target object area.
The article detection area refers to an area where an article exists in the image of the target article area.
In some embodiments, step 2012 may be implemented using the deep-learned item detection network. I.e. step 2012 may specifically comprise: inputting the image of the target object area into the trained object detection network to call the object detection network to perform classification detection processing on the image of the target object area, and determining whether each pixel point in the image of the target object area is a background pixel point or an object pixel point, thereby determining each object detection area in the image of the target object area.
The trained article detection network can be used to perform binary classification on the image (specifically, classifying each pixel point in the image of the target article area as either background or article), so as to detect the areas where articles exist in the image of the target article area.
For example, first, a preset article detection network is trained based on a training data set (including positive samples and negative samples, where the positive samples are images containing articles and the negative samples are images not containing articles), so that the trained article detection network learns the characteristics of articles, thereby obtaining a trained article detection network (which can be used to perform detection processing on an image to determine the areas where articles are present). The preset article detection network may be an open-source network model usable for detection tasks, such as an OverFeat network, a YOLOv1 network, a YOLOv2 network, a YOLOv3 network, an SSD network, a RetinaNet network, or a YOLOv5 network. Specifically, an open-source detection network with model parameters at their default values may be employed as the preset article detection network.
And then inputting the image of the target article area into a trained article detection network, and calling the trained article detection network to perform article area detection processing on the image of the target article area, and predicting the area of the article in the image of the target article area so as to obtain each article detection area in the image of the target article area.
2013. Acquiring a screenshot of each article in the target article area, according to the image of the target article area and each article detection area, to serve as a target image.
For example, as shown in fig. 3, the image of the target article area acquired in step 2011 is shown in fig. 3 (a); it is binary-classified in step 2012, and each article detection area in the image of the target article area is obtained, as shown by the dashed boxes in fig. 3 (b); then, a screenshot of each article in the target article area is acquired as a target image based on the image of the target article area and each article detection area, as shown in fig. 3 (c).
On the one hand, when the image of the same target article area contains a plurality of articles to be identified, the wide variety of articles such as commodities and the small differences between them make it difficult to classify the articles accurately with a conventional image target detection or classification method. On the other hand, when an article to be identified occupies a relatively small area in the image of the target article area, it is difficult to extract fine image features, and thus difficult to classify the article accurately. In steps 2011 to 2013, all kinds of articles are regarded as a single class, the articles and the background are first separated by binary classification, and then a screenshot of each individual article to be identified is cropped based on the classification result for subsequent classification and identification. This allows finer image features to be extracted from each article screenshot for classification and identification, thereby improving the identification accuracy of the articles to be identified.
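For illustration, the following is a minimal Python sketch of steps 2011 to 2013; the detector interface detect_articles and its (x1, y1, x2, y2) box format are assumptions introduced only for this example, not details from the original disclosure.

```python
from PIL import Image

def crop_article_screenshots(area_image_path, detect_articles):
    """Crop one screenshot per detected article from the target-area image."""
    area_image = Image.open(area_image_path).convert("RGB")
    # detect_articles is assumed to return a list of (x1, y1, x2, y2) pixel boxes,
    # one per detected article (the dashed boxes of fig. 3 (b)).
    boxes = detect_articles(area_image)
    # Each crop becomes one target image for the subsequent category identification.
    return [area_image.crop(box) for box in boxes]
```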
202. And carrying out global feature extraction on the target image to obtain target global features of the target image.
The target global feature refers to a global image-space feature of the target image.
In some embodiments, global feature extraction may be performed on a target image by a global extraction module in the trained article identification model provided in the embodiments of the present application, so as to obtain a target global feature of the target image.
In other embodiments, the global feature extraction may be performed on the target image to obtain the target global feature of the target image by pre-learning the obtained global feature extraction parameters (e.g., by extracting model parameters of the global extraction module from the object recognition model trained in step 806, which is described later, as the global feature extraction parameters).
For easy understanding, the network architecture of the article identification model in the embodiment of the present application is described first, and as shown in fig. 4, the article identification model includes a global extraction module, a local extraction module, and an identification module.
1. And a global extraction module.
The global extraction module is used for performing global feature extraction on the target image to obtain the target global feature of the target image. The global extraction module takes the target image as input, performs convolution operations through a plurality of convolution layers, and finally outputs the target global feature through a fully connected layer. Further, in order to make the network easier to train and to avoid the gradient explosion problem, the global extraction module in the embodiment of the present application adopts a residual network structure, for example a ResNet50 structure.
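As an illustration, the following is a minimal sketch of such a global extraction module, assuming a ResNet50-style backbone; the feature dimension and the way the head is attached are illustrative choices, not values fixed by the disclosure.

```python
import torch.nn as nn
from torchvision.models import resnet50

class GlobalExtractor(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = resnet50(weights=None)          # ResNet50-style residual backbone (assumed)
        # Keep the convolutional stages; drop the original average-pool/fc head.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, feat_dim)        # 2048 = ResNet50 final channel count

    def forward(self, x):
        # feature_map can double as the target representation feature map of step 203
        # when the backbone is shared, as described later in this embodiment.
        feature_map = self.backbone(x)                               # (B, 2048, H, W)
        global_feat = self.fc(self.pool(feature_map).flatten(1))     # target global feature, (B, feat_dim)
        return feature_map, global_feat
```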
2. And a local extraction module.
The local extraction module is used for acquiring a target saliency map of a local part of the target image according to the target global feature and the target representation feature map. The local extraction module takes the target global feature and the target representation feature map as input and computes their dot product to obtain the target saliency map of the local part of the target image.
In fig. 4, a case where the local extraction module is one local feature extraction branch is shown, and further, the local extraction module may include a plurality of local feature extraction branches.
In order to fuse low-level detail information with high-level semantic information, in the embodiment of the present application at least one High Salience-Guided Module (HSG Module) is fused into a local feature extraction branch. Specifically, as shown in fig. 5, which is a schematic diagram of the network structure of a local feature extraction branch provided in an embodiment of the present application, the local feature extraction branch includes a plurality of saliency layers, and each saliency layer includes an HSG Module and a convolution layer (e.g., a Block in fig. 5). Each saliency layer takes as input a high-level feature vector (denoted f_tk, 0 ≤ k ≤ M, where M is the number of local feature extraction branches), the feature map of the previous layer (denoted FeatureMap_(n-1), 0 ≤ n ≤ N, where N is the number of saliency layers in the local feature extraction branch), and the saliency map of the previous layer (denoted A_(n-1)). First, the HSG Module computes the dot product of the high-level feature vector f_tk and the feature map FeatureMap_(n-1) of the previous layer to obtain the saliency map A_n of the current layer; then the convolution layer performs convolution and related operations on the saliency map A_n of the current layer to obtain the feature map FeatureMap_n of the current layer. The saliency map output by the last saliency layer is the saliency map in which the local feature extraction branch attends to local information.
As shown in fig. 4, the saliency map input to the first saliency layer (i.e., its "previous layer" saliency map) is the result of the dot product of the high-level feature vector and the target representation feature map.
There are various ways in which each saliency layer determines a saliency map, including, illustratively:
(1) In some embodiments, each saliency layer may directly dot product the high-level feature vector with the feature map of the previous layer as the saliency map of the present layer.
(2) In other embodiments, as shown in fig. 6, each saliency layer first performs max pooling on the feature map of the previous layer to obtain a maximum feature map, and performs average pooling on the feature map of the previous layer to obtain an average feature map; then the high-level feature vector is dot-multiplied with the maximum feature map to obtain a maximum saliency map, and dot-multiplied with the average feature map to obtain an average saliency map; finally, the previous layer's saliency map, the maximum saliency map, and the average saliency map are summed with certain weights to obtain the saliency map of the current layer (a minimal code sketch of this computation is given after this list).
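The following is a minimal sketch of one such saliency-layer computation, assuming that the dot product is taken along the channel dimension, that the max/average poolings are 3x3 spatial poolings that preserve the map size, and that the three saliency maps are mixed with fixed example weights; none of these choices are fixed by the disclosure.

```python
import torch
import torch.nn.functional as F

def hsg_saliency(high_level_vec, prev_feature_map, prev_saliency, weights=(0.5, 0.25, 0.25)):
    """high_level_vec: (B, C); prev_feature_map: (B, C, H, W); prev_saliency: (B, H, W)."""
    # Size-preserving 3x3 poolings (kernel size is an illustrative assumption).
    max_map = F.max_pool2d(prev_feature_map, kernel_size=3, stride=1, padding=1)
    avg_map = F.avg_pool2d(prev_feature_map, kernel_size=3, stride=1, padding=1)
    # Channel-wise dot product of the high-level vector with each spatial location.
    max_saliency = torch.einsum("bc,bchw->bhw", high_level_vec, max_map)
    avg_saliency = torch.einsum("bc,bchw->bhw", high_level_vec, avg_map)
    w_prev, w_max, w_avg = weights   # example mixing weights, not values from the disclosure
    return w_prev * prev_saliency + w_max * max_saliency + w_avg * avg_saliency
```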
When the local extraction module includes a plurality of local feature extraction branches, the first local feature extraction branch uses the target global feature output by the global extraction module as its high-level feature vector; each of the remaining local feature extraction branches uses, as its high-level feature vector, the feature vector obtained by vectorizing the saliency map of the preceding local feature extraction branch. For example, fig. 7 shows the case where the local extraction module includes 2 local feature extraction branches; the high-level feature vector input to local feature extraction branch 1 is the target global feature output by the global extraction module, and the high-level feature vector of local feature extraction branch 2 is the feature vector obtained by vectorizing the saliency map of local feature extraction branch 1.
3. And an identification module.
The identification module is used for performing category identification on the target image according to the target saliency map to obtain the target article category in the target image. The identification module takes the target saliency map as input, makes a prediction based on it, and outputs the target article category. The identification module may be implemented as fully connected layers connected to the local extraction module; for example, the local extraction module is followed by two fully connected layers that serve as the identification module and predict and output the target article category in the target image.
In some embodiments, as shown in fig. 4, the identification module includes a first sub-identification module and a second sub-identification module. The first sub-recognition module takes the global feature of the image output by the global extraction module as input, performs classification prediction based on the global feature of the image output by the global extraction module, and takes the object type in the image as output. The second sub-recognition module takes the saliency map of the image part output by the local feature extraction branch as input, performs classification prediction based on the saliency map of the image part output by the local feature extraction branch, and takes the object class in the image as output.
In some embodiments, as shown in fig. 7, the identification modules include a first sub-identification module, a second sub-identification module, and a third sub-identification module. The first sub-recognition module takes the global feature of the image output by the global extraction module as input, performs classification prediction based on the global feature of the image output by the global extraction module, and takes the object type in the image as output. The second sub-recognition module takes the saliency map of the image part output by the first local feature extraction branch (namely the local feature extraction branch 1) as input, performs classification prediction based on the saliency map of the image part output by the first local feature extraction branch, and takes the class of the object in the image as output. The third sub-recognition module takes the saliency map of the image part output by the second local feature extraction branch (namely the local feature extraction branch 2) as input, performs classification prediction based on the saliency map of the image part output by the second local feature extraction branch, and takes the class of the object in the image as output.
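For illustration, a minimal sketch of one sub-recognition head is given below, assuming each head is a small fully connected classifier applied either to the target global feature or to a flattened saliency map; the hidden size is an illustrative choice.

```python
import torch.nn as nn

class SubRecognitionHead(nn.Module):
    """Two fully connected layers predicting the article category from one feature vector."""
    def __init__(self, in_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        # x: the target global feature (B, D) for the first head,
        # or a flattened saliency map (B, H*W) for the heads on the local branches.
        return self.classifier(x)
```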
203. And acquiring a target representation feature map of the target image.
The target representation feature map refers to an image-space feature map of the target image.
In some embodiments, the feature map obtained by the backbone network in the global extraction module (of the article identification model) through convolution and related operations may be used as the target representation feature map.
In other embodiments, a separate backbone network for representation feature extraction may be added to the article identification model, and the feature map obtained by that backbone network through convolution and related operations may be used as the target representation feature map.
204. And acquiring a target saliency map of the target image part based on the target global feature and the target representation feature map.
The saliency map is the result of dot product of the feature map of the image and the high-level feature vector.
The target saliency map refers to a saliency map of local information of the target image of interest determined according to the target global feature and the target representation feature map.
In step 204, there are various ways to obtain the target saliency map, which illustratively includes:
as shown in fig. 4, the target saliency map is extracted by a global extraction module and a local feature extraction module including 1 local feature extraction branch in the embodiment of the present application.
1) As shown in fig. 4, the target saliency map is extracted by a global extraction module and a local feature extraction module including 2 local feature extraction branches in the embodiment of the present application. In this case, step 204 may specifically include 2041A to 2042A:
2041A, determining a first maximum saliency map and a first average saliency map of the target image according to the target global feature and the target representation feature map.
Taking the case where the local feature extraction branch includes only one saliency layer as an example, each saliency layer obtains the saliency map of the current layer as the weighted sum, with certain weights, of the previous layer's saliency map, the maximum saliency map, and the average saliency map. The ways of obtaining the maximum saliency map and the average saliency map are introduced below.
1. And obtaining a maximum saliency map.
The maximum saliency map is a saliency map obtained by computing the dot product of the maximum feature map, obtained by applying the max pooling operation to the target representation feature map, and the target global feature.
The first maximum saliency map is a determined maximum saliency map from the target global features and the target representation feature map.
As shown in fig. 4, since there is only one saliency layer, the result of the dot product of the high-level feature vector and the target representation feature map is taken as the previous layer's saliency map, and the target representation feature map is taken as the previous layer's feature map. The maximum saliency map of this saliency layer is therefore the result of the dot product between the maximum feature map, obtained by applying the max pooling operation to the target representation feature map, and the target global feature. At this time, in step 2041A, the step of determining the first maximum saliency map of the target image according to the target global feature and the target representation feature map may specifically include: performing a max pooling operation on the target representation feature map to obtain the maximum feature map of the target representation feature map; and computing the dot product of the maximum feature map and the target global feature to obtain the first maximum saliency map.
For example, the HSG Module in the local feature extraction branch first performs the max pooling operation on the target representation feature map (i.e., the previous layer's feature map FeatureMap_(n-1)) to obtain the maximum feature map of the target representation feature map; the target global feature (i.e., the high-level feature vector f_tk) is then dot-multiplied with the maximum feature map to obtain the first maximum saliency map.
2. And obtaining an average saliency map.
The average saliency map is a saliency map obtained by computing the dot product of the average feature map, obtained by applying the average pooling operation to the target representation feature map, and the target global feature.
The first average saliency map is an average saliency map determined from the target global features and the target representation feature map.
As shown in fig. 4, since there is only one saliency layer, the result of the dot product of the high-level feature vector and the target representation feature map is taken as the previous layer's saliency map, and the target representation feature map is taken as the previous layer's feature map. The average saliency map of this saliency layer is therefore the result of the dot product between the average feature map, obtained by applying the average pooling operation to the target representation feature map, and the target global feature. At this time, in step 2041A, the step of determining the first average saliency map of the target image according to the target global feature and the target representation feature map may specifically include: performing an average pooling operation on the target representation feature map to obtain the average feature map of the target representation feature map; and computing the dot product of the average feature map and the target global feature to obtain the first average saliency map.
For example, the HSG Module in the local feature extraction branch first performs the average pooling operation on the target representation feature map (i.e., the previous layer's feature map FeatureMap_(n-1)) to obtain the average feature map of the target representation feature map; the target global feature (i.e., the high-level feature vector f_tk) is then dot-multiplied with the average feature map to obtain the first average saliency map.
2042A, determining the target saliency map from the first maximum saliency map and the first average saliency map.
Specifically, first, the result of the dot product of the target global feature and the target representation feature map is taken as the previous layer's saliency map. Then, the weighted sum, with certain weights, of the previous layer's saliency map, the first maximum saliency map, and the first average saliency map is taken as the target saliency map.
From the above, it can be seen that the first maximum saliency map and the first average saliency map of the target image are determined using the target global feature and the target representation feature map, and the target saliency map is then determined based on the first maximum saliency map and the first average saliency map. This coarse-to-fine feature extraction ensures that the target saliency map effectively fuses the global semantic information of the target global feature while retaining low-level detail information, thereby improving the classification accuracy of articles.
(II) as shown in FIG. 7, the target saliency map is extracted by a global extraction module and a local feature extraction module including 2 local feature extraction branches in the embodiment of the present application. In this case, step 204 may specifically include 2041B to 2044B:
2041B, determining a first maximum saliency map and a first average saliency map of the target image according to the target global feature and the target representation feature map.
The implementation of step 2041B is similar to that of step 2041A, and reference is made to the above description, and details are not repeated here.
2042B, obtaining a preliminary saliency map of the target image part according to the first maximum saliency map and the first average saliency map.
The preliminary saliency map is a saliency map determined according to the first maximum saliency map and the first average saliency map through the first local feature extraction branch, namely, the saliency map extracted by the first local feature extraction branch.
First, the result of the dot product of the target global feature and the target representation feature map is taken as the previous layer's saliency map. Then, the weighted sum, with certain weights, of the previous layer's saliency map, the first maximum saliency map, and the first average saliency map is taken as the preliminary saliency map.
2043B determining a second maximum saliency map and a second average saliency map of the target image portion from the preliminary saliency map and the target representation feature map.
The second maximum saliency map is a determined maximum saliency map from the preliminary saliency map and the target representative feature map.
The second mean saliency map is a mean saliency map determined from the preliminary saliency map and the target representative feature map.
The preliminary saliency map is a saliency map extracted by the first local feature extraction branch, so in step 2043B, first, the feature vector obtained by vectorizing the preliminary saliency map is used as a high-level feature vector of the second local feature extraction branch. Then, in a similar manner to step 2041A described above, a second maximum saliency map and a second average saliency map are determined.
For example, the HSG Module in the second local feature extraction branch first performs the max pooling operation on the target representation feature map (i.e., the previous layer's feature map FeatureMap_(n-1)) to obtain the maximum feature map of the target representation feature map; the feature vector obtained by vectorizing the preliminary saliency map (i.e., the high-level feature vector f_tk) is then dot-multiplied with the maximum feature map to obtain the second maximum saliency map.
Similarly, the HSG Module in the second local feature extraction branch performs the average pooling operation on the target representation feature map (i.e., the previous layer's feature map FeatureMap_(n-1)) to obtain the average feature map of the target representation feature map; the feature vector obtained by vectorizing the preliminary saliency map (i.e., the high-level feature vector f_tk) is then dot-multiplied with the average feature map to obtain the second average saliency map.
2044B, determining the target saliency map from at least two of the second maximum saliency map, the second average saliency map, and the preliminary saliency map.
There are various ways to determine the target saliency map in step 2044B, including, illustratively:
(1) The weighted sum, with certain weights, of the second maximum saliency map and the second average saliency map is taken as the target saliency map.
(2) The weighted sum, with certain weights, of the second maximum saliency map and the preliminary saliency map is taken as the target saliency map.
(3) The weighted sum, with certain weights, of the second average saliency map and the preliminary saliency map is taken as the target saliency map.
(4) The weighted sum, with certain weights, of the second maximum saliency map, the second average saliency map, and the preliminary saliency map is taken as the target saliency map.
The acquisition of the target saliency map has been described above using 1 and 2 local feature extraction branches as examples. Further, based on 2 or more local feature extraction branches, the target saliency map may be acquired by using the feature vector obtained by vectorizing the saliency map of a preceding branch as the high-level feature vector of the next branch, with reference to steps 2041B to 2044B above.
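As an illustration, a minimal sketch of chaining two local feature extraction branches (each with a single saliency layer) is given below, reusing the hsg_saliency sketch above. It assumes the target global feature has already been projected to the channel dimension of the representation feature map, and that vectorizing a saliency map is plain flattening followed by a linear projection introduced here only to keep shapes consistent.

```python
import torch
import torch.nn as nn

def two_branch_saliency(global_feat, rep_feature_map, to_vec: nn.Linear):
    """global_feat: (B, C), assumed already projected to the channel count C of
    rep_feature_map (B, C, H, W); to_vec: Linear(H*W, C) used to 'vectorize' a saliency map.
    hsg_saliency is the saliency-layer sketch given earlier in this description."""
    # Branch 1: the high-level vector is the target global feature (steps 2041B-2042B).
    init_saliency = torch.einsum("bc,bchw->bhw", global_feat, rep_feature_map)
    preliminary = hsg_saliency(global_feat, rep_feature_map, init_saliency)
    # Branch 2: the high-level vector is the vectorized preliminary saliency map (steps 2043B-2044B).
    branch2_vec = to_vec(preliminary.flatten(1))
    target_saliency = hsg_saliency(branch2_vec, rep_feature_map, preliminary)
    return target_saliency
```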
From the above, it can be seen that the first maximum saliency map and the first average saliency map of the target image are determined using the target global feature and the target representation feature map; the preliminary saliency map is determined based on the first maximum saliency map and the first average saliency map; and the second maximum saliency map and the second average saliency map of the local part of the target image are then determined from the preliminary saliency map and the target representation feature map, from which the target saliency map is determined. This layer-by-layer, coarse-to-fine feature extraction ensures that the target saliency map effectively fuses the global semantic information of the target global feature while retaining low-level detail information, thereby improving the classification accuracy of articles.
205. And carrying out category identification on the target image based on the target saliency map to obtain a target object category of the object to be identified in the target image.
The target object class refers to the class of the object to be identified in the target image obtained by identification according to the target saliency map.
And based on an identification module in the trained object identification model, carrying out category identification on the target image based on the target saliency map to obtain a target object category in the target image.
In the embodiment of the application, global feature extraction is performed on the target image to obtain the target global feature of the target image; a target saliency map of a local part of the target image is acquired based on the target global feature and the target representation feature map, and the target article category of the article to be identified in the target image is identified. On the one hand, through this coarse-to-fine feature extraction, finer local features can be further extracted, on the basis of the global feature of the target and in combination with the target representation feature map, for identifying the article category, which improves the classification accuracy of articles. On the other hand, because the target global feature is fused while attention gradually shifts to finer local parts, the target saliency map carries both global semantic information and low-level detail information, so that discriminative local features and complete global features can be fully extracted for identifying the article category, improving the classification accuracy of articles to a certain extent.
Further, after steps 201 to 205 have been performed with each article screenshot in the target article area as the target image, that is, after the target article category corresponding to each article screenshot has been determined, the articles in the target article area may be counted. At this time, the article identification method may further include: counting the articles in the target article area according to the target article category to obtain the number of articles of each category.
When articles such as commodities come in a wide variety and the differences between articles are small, it is difficult to classify them accurately with a conventional image target detection or classification method, so counting the various articles based on conventional target detection or classification yields a low statistical accuracy; such an approach is therefore difficult to apply to, for example, counting the number of goods of each type on a shelf. In the present application, article screenshots in the target article area are obtained from the image of the target article area, each screenshot is used as a target image to determine its target article category, and the articles in the target article area are then counted to obtain the number of articles of each category, so that the statistical accuracy can be improved.
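For illustration, a minimal sketch of this counting step is given below; classify_screenshot is a hypothetical helper standing in for the trained article identification model, and the category labels in the comment are invented.

```python
from collections import Counter

def count_articles(screenshots, classify_screenshot):
    """screenshots: article crops from the target article area;
    classify_screenshot: hypothetical helper wrapping the trained identification model."""
    counts = Counter(classify_screenshot(img) for img in screenshots)
    return dict(counts)   # e.g. {"category_a": 12, "category_b": 7} (labels invented for illustration)
```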
As shown in fig. 8, the training process of the article identification model specifically includes the following steps 801 to 806:
801. a sample image of a sample article is acquired.
Wherein the sample image is annotated with an actual item category of the sample item.
The sample item may be any item, such as a commodity on a supermarket shelf.
The sample image refers to an image for identifying the category of the sample article.
The sample image is obtained in a similar manner to that of the target image, and reference may be made to the above description, so that description is omitted herein for simplicity.
802. And acquiring a sample representation feature map and sample global features of the sample image based on a global extraction module in a preset recognition model.
The sample representation feature map refers to the image space features of the sample image.
The sample global feature refers to an image space feature of the sample image global.
The network structure of the preset recognition model is similar to that of the article identification model and is not described in detail here for simplicity. The sample image is input into the preset recognition model to invoke the global extraction module in the preset recognition model, and convolution, pooling, and related operations are performed on the sample image to obtain the sample representation feature map of the sample image; the feature vector obtained by vectorizing the feature map produced after the sample image passes through the plurality of convolution layers of the global extraction module is used as the sample global feature of the sample image.
803. And determining a sample local saliency map of the sample image according to the sample representation feature map and the sample global feature based on a local extraction module in the preset recognition model.
The sample local saliency map refers to a saliency map of local information of a sample image of interest determined according to a sample global feature and a sample representation feature map.
The manner of obtaining the local saliency map of the sample in step 803 is similar to that of obtaining the target saliency map in step 204, and reference may be made to the above description, so that the description is omitted here for simplicity.
804. And based on an identification module in the preset identification model, carrying out classification identification according to the sample local saliency map to obtain a first prediction category of the sample article.
The first prediction category refers to an article category of a sample article obtained by predicting according to a sample image through a preset recognition model.
805. And determining a training loss value of the preset recognition model according to the actual object class and the first prediction class.
There are various ways of setting the training loss value of the preset recognition model. Several ways of setting the training loss value, and of obtaining it in step 805, are given below as examples:
As shown in fig. 4, consider the case where the preset recognition model includes a global extraction module, a local extraction module composed of 1 local feature extraction branch, and a recognition module composed of a first sub-recognition module and a second sub-recognition module.
In this case, the sample local saliency map is determined in step 803 by the local feature extraction branch shown in fig. 4. In step 804, on the one hand, the first sub-recognition module shown in fig. 4 may perform classification prediction based on the sample global feature output by the global extraction module to determine a first article category of the sample article; on the other hand, the second sub-recognition module shown in fig. 4 may perform classification prediction based on the sample local saliency map output by the local feature extraction branch to determine a second article category of the sample article. In this case, the first prediction category specifically refers to the second article category of the sample article determined by the second sub-recognition module through classification prediction based on the sample local saliency map output by the local feature extraction branch.
Illustratively, the training loss value may be set as follows:
(1) The training loss value is the classification loss value L_ID2 of the second sub-recognition module. At this time, step 805 may specifically include step 8051A:
8051A. Determining the classification loss value L_ID2 of the second sub-recognition module, according to the actual article category, the second article category, and a classification loss function preset for the second sub-recognition module, as the training loss value of the preset recognition model.
For example, the classification loss function preset for the second sub-recognition module is shown in the following formula 1:
L_ID2 = -Σ_{i=1}^{N} q_i · log(p_i)    formula 1
In formula 1, L_ID2 is the classification loss value of the second sub-recognition module, N is the total number of samples, y represents the corresponding article category (taking values such as 0, 1, …, N-1), q_i indicates whether the i-th sample belongs to category y and takes the value 0 or 1, and p_i represents the predicted probability of the class of the i-th sample.
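As an illustration, a minimal sketch of this classification loss is given below; since the original equation appears only as a figure placeholder, the sketch follows the standard cross-entropy form implied by the q_i and p_i definitions above, and the eps term is added only for numerical stability.

```python
import torch

def classification_loss(pred_probs, one_hot_targets, eps=1e-12):
    """pred_probs, one_hot_targets: (N, num_classes); summed cross-entropy over the batch."""
    return -(one_hot_targets * torch.log(pred_probs + eps)).sum()
```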
(2) The training loss value comprises the classification loss value L_ID1 of the first sub-recognition module and the classification loss value L_ID2 of the second sub-recognition module. In this case, step 805 may specifically include steps 8051B to 8053B:
8051B. Determining the classification loss value L_ID1 of the first sub-recognition module according to the actual article category, the first article category, and a classification loss function preset for the first sub-recognition module.
The manner in which step 8051B determines the classification loss value L_ID1 of the first sub-recognition module is similar to the manner in which step 8051A above determines the classification loss value L_ID2 of the second sub-recognition module; reference may be made to the above description, and details are not repeated here.
8052B. Determining the classification loss value L_ID2 of the second sub-recognition module according to the actual article category, the second article category, and the classification loss function preset for the second sub-recognition module.
8053B. Taking the weighted sum, with a certain weight ratio, of the classification loss value L_ID1 of the first sub-recognition module and the classification loss value L_ID2 of the second sub-recognition module as the training loss value of the preset recognition model.
(3) The training loss value includes: the classification loss value L_ID1 of the first sub-recognition module, the classification loss value L_ID2 of the second sub-recognition module, and the ranking loss value L_rank1 between the global extraction module and the local feature extraction branch. At this time, step 805 may specifically include steps 8051C to 8054C:
8051C. Determining the classification loss value L_ID1 of the first sub-recognition module according to the actual article category, the first article category, and the classification loss function preset for the first sub-recognition module.
8052C. Determining the classification loss value L_ID2 of the second sub-recognition module according to the actual article category, the second article category, and the classification loss function preset for the second sub-recognition module.
8053C. Obtaining the ranking loss value L_rank1 between the global extraction module and the local feature extraction branch.
In step 8053C, the step of obtaining the ranking loss value L_rank1 between the global extraction module and the local feature extraction branch may specifically include: acquiring the predicted probability value of the first article category and the predicted probability value of the second article category; and determining the ranking loss value L_rank1 between the global extraction module and the local feature extraction branch according to the predicted probability value of the first article category, the predicted probability value of the second article category, and a ranking loss function preset between the global extraction module and the local feature extraction branch.
For example, the ranking loss function preset between the global extraction module and the local feature extraction branch is shown in the following formula 2:
L_rank1 = max{0, H_a - H_g + margin}    (formula 2)
Wherein, ha and Hg are defined as shown in the following formulas 3 and 4, respectively:
Figure BDA0003413731590000181
Figure BDA0003413731590000182
In formulas 2, 3 and 4, L_rank1 is the ranking loss value between the global extraction module and the local feature extraction branch, C is the number of item categories in the training set, p_i^a and p_i^g respectively denote the predicted probability value for category i from the second sub-recognition module (corresponding to the local feature extraction branch) and from the first sub-recognition module (corresponding to the global extraction module), and margin denotes the required difference between the entropy values obtained in the two stages; margin is set to a constant value before training (for example, margin = 0.05).
The ranking loss function shown in formula 2 compares the probability distributions predicted in the two stages: if the available information allows the item category to be predicted with high probability, the resulting entropy is small; otherwise it is large. In the most extreme case, the predicted probability is the same for every category and the entropy is at its maximum. In this embodiment of the present application, because the local extraction module incorporates more information, the corresponding second sub-recognition module predicts more accurately, so its entropy should be smaller than that of the first sub-recognition module corresponding to the global extraction module. Therefore, determining the ranking loss value through the ranking loss function of formula 2, and determining the training loss value of the preset recognition model according to the ranking loss value, effectively improves the attention paid to local features and improves, to a certain extent, the classification accuracy of the trained article recognition model.
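Purely as an illustrative sketch (PyTorch, the probability-tensor shapes, and the reduction to a batch mean are assumptions not given in this application), formulas 2 to 4 could be computed as follows:

```python
import torch

def ranking_loss(p_local, p_global, margin=0.05):
    # p_local, p_global: predicted probability distributions over the C categories,
    # shape (batch, C), from the local-branch (second) and global-branch (first)
    # sub-recognition modules respectively.
    eps = 1e-12
    h_a = -(p_local * torch.log(p_local + eps)).sum(dim=1)    # formula 3
    h_g = -(p_global * torch.log(p_global + eps)).sum(dim=1)  # formula 4
    # formula 2: penalize samples where the local branch is not at least
    # `margin` more confident (lower entropy) than the global branch.
    return torch.clamp(h_a - h_g + margin, min=0).mean()
```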
8054C, according to a certain weight ratio, weighting and summing the classification loss value L_ID1 of the first sub-recognition module, the classification loss value L_ID2 of the second sub-recognition module, and the ranking loss value L_rank1 between the global extraction module and the local feature extraction branch, and taking the weighted sum result as the training loss value of the preset recognition model.
(4) The training loss values include: the classification loss value L_ID1 of the first sub-recognition module, the classification loss value L_ID2 of the second sub-recognition module, the ranking loss value L_rank1 between the global extraction module and the local feature extraction branch, and the feature distance loss value L_triplet of the global extraction module. In this case, step 805 may specifically include steps 8051D to 8055D:
8051D, determining the classification loss value L_ID1 of the first sub-recognition module according to the actual article category, the first article category and the classification loss function preset for the first sub-recognition module.
8052D, determining the classification loss value L_ID2 of the second sub-recognition module according to the actual article category, the second article category and the classification loss function preset for the second sub-recognition module.
8053D, obtaining the ranking loss value L_rank1 between the global extraction module and the local feature extraction branch.
8054D, obtaining the feature distance loss value L_triplet of the global extraction module.
The step of "obtaining the feature distance loss value L_triplet of the global extraction module" in step 8054D may specifically include: acquiring the feature distance between positive sample pairs based on the positive sample features output by the global extraction module; acquiring the feature distance between negative sample pairs based on the negative sample features output by the global extraction module; and determining the feature distance loss value L_triplet of the global extraction module according to the feature distance between the positive sample pairs, the feature distance between the negative sample pairs, and the feature distance loss function preset for the global extraction module.
For example, the feature distance loss function preset for the global extraction module is shown in the following formula 5:
L_triplet = [d_p - d_n + α]_+    (formula 5)
In formula 5, L_triplet is the feature distance loss value of the global extraction module, d_p denotes the feature distance between positive sample pairs, d_n denotes the feature distance between negative sample pairs, and α is the margin setting used to pick out difficult (hard) sample pairs; α can be set to a specific constant value (e.g., α = 0.3).
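As an illustration only (the feature shapes, the Euclidean distance metric, and the use of PyTorch are assumptions, and the selection of hard sample pairs is not shown), formula 5 could be sketched as:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, alpha=0.3):
    # anchor, positive, negative: global features of shape (batch, dim); a positive
    # shares the anchor's actual article category, a negative does not.
    d_p = F.pairwise_distance(anchor, positive)   # feature distance between positive pairs
    d_n = F.pairwise_distance(anchor, negative)   # feature distance between negative pairs
    # formula 5: hinge on d_p - d_n + alpha, so only hard pairs contribute.
    return torch.clamp(d_p - d_n + alpha, min=0).mean()
```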
8055D, according to a certain weight ratio, weighting and summing the classification loss value L_ID1 of the first sub-recognition module, the classification loss value L_ID2 of the second sub-recognition module, the ranking loss value L_rank1 between the global extraction module and the local feature extraction branch, and the feature distance loss value L_triplet of the global extraction module, and taking the weighted sum result as the training loss value of the preset recognition model.
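The weighted combination used in steps such as 8053B, 8054C and 8055D could be written as the following helper; the individual weight values are hyperparameters that this application does not specify, so those shown in the comment are assumptions:

```python
def combine_losses(loss_terms, weights):
    # loss_terms and weights are dicts keyed by loss name, for example
    # {"id1": l_id1, "id2": l_id2, "rank1": l_rank1, "triplet": l_triplet}
    # with weights such as {"id1": 1.0, "id2": 1.0, "rank1": 1.0, "triplet": 1.0}.
    return sum(weights[name] * value for name, value in loss_terms.items())
```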
(II) As shown in FIG. 7, the preset recognition model includes a global extraction module, a local extraction module composed of two local feature extraction branches, and a recognition module composed of a first sub-recognition module, a second sub-recognition module and a third sub-recognition module.
At this time, in step 803, sample local saliency maps are determined by the local feature extraction branches shown in FIG. 7: a first sample local saliency map is determined by the first local feature extraction branch (i.e., local feature extraction branch 1), and a second sample local saliency map is determined by the second local feature extraction branch (i.e., local feature extraction branch 2). In step 804, in the first aspect, the first sub-recognition module shown in FIG. 7 may perform classification prediction based on the sample global features output by the global extraction module to determine the first article category of the sample article; in the second aspect, the second sub-recognition module shown in FIG. 7 may perform classification prediction based on the first sample local saliency map output by the first local feature extraction branch (i.e., local feature extraction branch 1) to determine the second article category of the sample article; in the third aspect, the third sub-recognition module shown in FIG. 7 may perform classification prediction based on the second sample local saliency map output by the second local feature extraction branch (i.e., local feature extraction branch 2) to determine the third article category of the sample article. In this case, the first prediction category specifically refers to the second article category of the sample article determined by the second sub-recognition module through classification prediction based on the first sample local saliency map output by the first local feature extraction branch, and/or the third article category of the sample article determined by the third sub-recognition module through classification prediction based on the second sample local saliency map output by the second local feature extraction branch.
Illustratively, the training loss value may be set as follows:
(1) The training loss value comprises the classification loss value L_ID1 of the first sub-recognition module, the classification loss value L_ID2 of the second sub-recognition module, and the classification loss value L_ID3 of the third sub-recognition module. At this time, step 805 may specifically include steps 8051E to 8054E:
8051E, determining the classification loss value L_ID1 of the first sub-recognition module according to the actual article category, the first article category and the classification loss function preset for the first sub-recognition module.
8052E, determining the classification loss value L_ID2 of the second sub-recognition module according to the actual article category, the second article category and the classification loss function preset for the second sub-recognition module.
8053E, determining the classification loss value L_ID3 of the third sub-recognition module according to the actual article category, the third article category and the classification loss function preset for the third sub-recognition module.
The manner in which step 8053E determines the classification loss value L_ID3 of the third sub-recognition module is similar to the manner in which step 8051A above determines the classification loss value L_ID2 of the second sub-recognition module; reference may be made to the above description, and details are not repeated here.
8054E, according to a certain weight ratio, weighting and summing the classification loss value L_ID1 of the first sub-recognition module, the classification loss value L_ID2 of the second sub-recognition module, and the classification loss value L_ID3 of the third sub-recognition module, and taking the weighted sum result as the training loss value of the preset recognition model.
(2) The training loss values include: the classification loss value L_ID1 of the first sub-recognition module, the classification loss value L_ID2 of the second sub-recognition module, the classification loss value L_ID3 of the third sub-recognition module, the ranking loss value L_rank1 between the global extraction module and the first local feature extraction branch, and the ranking loss value L_rank2 between the first local feature extraction branch and the second local feature extraction branch. At this time, step 805 may specifically include steps 8051F to 8056F:
8051F, determining the classification loss value L_ID1 of the first sub-recognition module according to the actual article category, the first article category and the classification loss function preset for the first sub-recognition module.
8052F, determining the classification loss value L_ID2 of the second sub-recognition module according to the actual article category, the second article category and the classification loss function preset for the second sub-recognition module.
8053F, determining the classification loss value L_ID3 of the third sub-recognition module according to the actual article category, the third article category and the classification loss function preset for the third sub-recognition module.
8054F, obtaining the ranking loss value L_rank1 between the global extraction module and the first local feature extraction branch.
8055F, obtaining the ranking loss value L_rank2 between the first local feature extraction branch and the second local feature extraction branch.
The manner in which step 8055F determines the ranking loss value L_rank2 is similar to the manner in which step 8053C above determines the ranking loss value L_rank1; reference may be made to the above description, and details are not repeated here.
8056F, according to a certain weight ratio, weighting and summing the classification loss value L_ID1 of the first sub-recognition module, the classification loss value L_ID2 of the second sub-recognition module, the classification loss value L_ID3 of the third sub-recognition module, the ranking loss value L_rank1 between the global extraction module and the first local feature extraction branch, and the ranking loss value L_rank2 between the first local feature extraction branch and the second local feature extraction branch, and taking the weighted sum result as the training loss value of the preset recognition model.
(3) The training loss values include: the classification loss value L_ID1 of the first sub-recognition module, the classification loss value L_ID2 of the second sub-recognition module, the classification loss value L_ID3 of the third sub-recognition module, the ranking loss value L_rank1 between the global extraction module and the first local feature extraction branch, the ranking loss value L_rank2 between the first local feature extraction branch and the second local feature extraction branch, and the feature distance loss value L_triplet of the global extraction module. In this case, step 805 may specifically include steps 8051G to 8057G:
8051G, determining the classification loss value L_ID1 of the first sub-recognition module according to the actual article category, the first article category and the classification loss function preset for the first sub-recognition module.
8052G, determining the classification loss value L_ID2 of the second sub-recognition module according to the actual article category, the second article category and the classification loss function preset for the second sub-recognition module.
8053G, determining the classification loss value L_ID3 of the third sub-recognition module according to the actual article category, the third article category and the classification loss function preset for the third sub-recognition module.
8054G, obtaining the ranking loss value L_rank1 between the global extraction module and the first local feature extraction branch.
8055G, obtaining the ranking loss value L_rank2 between the first local feature extraction branch and the second local feature extraction branch.
8056G, obtaining the feature distance loss value L_triplet of the global extraction module.
8057G, according to a certain weight ratio, weighting and summing the classification loss value L_ID1 of the first sub-recognition module, the classification loss value L_ID2 of the second sub-recognition module, the classification loss value L_ID3 of the third sub-recognition module, the ranking loss value L_rank1 between the global extraction module and the first local feature extraction branch, the ranking loss value L_rank2 between the first local feature extraction branch and the second local feature extraction branch, and the feature distance loss value L_triplet of the global extraction module, and taking the weighted sum result as the training loss value of the preset recognition model.
The above steps 8051A, 8051B to 8053B, 8051C to 8054C, 8051D to 8055D, 8051E to 8054E, 8051F to 8056F, and 8051G to 8057G each describe, with different emphasis, the classification loss value L_ID1, the classification loss value L_ID2, the classification loss value L_ID3, the ranking loss value L_rank1, the ranking loss value L_rank2, and the feature distance loss value L_triplet; the descriptions of the corresponding portions may be referred to each other and are not repeated here for simplicity.
As can be seen from the above manner of determining the training loss values in steps 8051C to 8054C, steps 8051D to 8055D, steps 8051F to 8056F, and steps 8051G to 8057G, before step 805 the method may further include: based on the identification module in the preset identification model, carrying out classification identification according to the global features of the sample to obtain a second prediction category of the sample article. Step 805 then specifically includes: acquiring a first prediction probability value of the first prediction category and a second prediction probability value of the second prediction category; determining a ranking loss value of the preset recognition model according to the first prediction probability value and the second prediction probability value; and determining the training loss value according to the ranking loss value.
In steps 8051C to 8054C and steps 8051D to 8055D, the first prediction category specifically refers to the second article category, and the second prediction category specifically refers to the first article category. In steps 8051F to 8056F and steps 8051G to 8057G, the first prediction category specifically refers to the second article category and the third article category, and the second prediction category specifically refers to the first article category.
806. And adjusting model parameters of the preset recognition model based on the training loss value until the preset recognition model meets the preset training stopping condition, and taking the preset recognition model as the article recognition model.
The preset training stopping condition can be set according to actual requirements. For example, training stops when the training loss value is smaller than a preset value; or when the training loss value is basically unchanged, that is, the difference between the training loss values of several adjacent training iterations is smaller than a preset value; or when the number of training iterations of the preset recognition model reaches the maximum number of iterations.
For example, as shown in FIG. 4 or FIG. 7, back-propagation may be performed based on the training loss value to adjust the model parameters of the global extraction module, the local extraction module, the first sub-recognition module, the second sub-recognition module and the third sub-recognition module in the preset recognition model, until the training stopping condition is met and the trained article recognition model is obtained.
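A minimal training-loop sketch (assuming PyTorch; the names model, model.training_loss, data_loader and the stopping thresholds are illustrative assumptions, not identifiers from this application) might look like:

```python
def train(model, data_loader, optimizer, max_epochs, loss_threshold):
    # Adjust the model parameters by back-propagating the training loss value
    # until a preset stopping condition is met (loss below a threshold, or the
    # maximum number of iterations reached).
    for epoch in range(max_epochs):
        for images, labels in data_loader:
            outputs = model(images)                      # global- and local-branch predictions
            loss = model.training_loss(outputs, labels)  # weighted sum of the loss terms
            optimizer.zero_grad()
            loss.backward()                              # back-propagation
            optimizer.step()
        if loss.item() < loss_threshold:                 # preset training stopping condition
            break
    return model
```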
On the one hand, the article recognition model obtained through training in steps 801 to 806 can fully learn the relation between article categories and image features, so that the article category in an image can be accurately identified based on the image; thus, the recognition module in the trained article recognition model may be employed to identify the target article category in step 205. On the other hand, since the feature extraction parameters for the local saliency map of the image are learned by the time the model converges, the feature extraction parameters in the trained article recognition model can be used for extracting the target saliency map in step 204 and extracting the target global features in step 203.
In order to better implement the method for identifying an article in the embodiment of the present application, based on the method for identifying an article, an article identifying device is further provided in the embodiment of the present application, as shown in fig. 9, which is a schematic structural diagram of an embodiment of the article identifying device in the embodiment of the present application, where the article identifying device 900 includes:
a first acquiring unit 901 for acquiring a target image of an article to be identified;
a global extracting unit 902, configured to perform global feature extraction on the target image, so as to obtain a target global feature of the target image;
A second obtaining unit 903, configured to obtain a target representation feature map of the target image;
a local extraction unit 904, configured to obtain a target saliency map of the target image local based on the target global feature and the target representation feature map;
and the identifying unit 905 is configured to perform category identification on the target image based on the target saliency map, so as to obtain a target object category in the target image.
In some embodiments of the present application, the local extraction unit 904 is specifically configured to:
determining a first maximum saliency map and a first average saliency map of the target image according to the target global feature and the target representation feature map;
and determining the target saliency map according to the first maximum saliency map and the first average saliency map.
In some embodiments of the present application, the local extraction unit 904 is specifically configured to:
obtaining a preliminary saliency map of the target image part according to the first maximum saliency map and the first average saliency map;
determining a second maximum saliency map and a second average saliency map of the target image part according to the preliminary saliency map and the target representation feature map;
The target saliency map is determined from at least two of the second maximum saliency map, the second average saliency map, and the preliminary saliency map.
In some embodiments of the present application, the local extraction unit 904 is specifically configured to:
performing maximum pooling operation on the target representation feature map to obtain a maximum feature map of the target representation feature map;
and carrying out dot product on the maximum feature map and the target global feature to obtain the first maximum saliency map.
In some embodiments of the present application, the local extraction unit 904 is specifically configured to:
carrying out average pooling operation on the target representation feature map to obtain an average feature map of the target representation feature map;
and carrying out dot product on the average feature map and the target global feature to obtain the first average saliency map.
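As a purely illustrative sketch (the tensor shapes, the pooling granularity, the channel-wise dot product, and the use of PyTorch are assumptions about details this application does not spell out), the first maximum saliency map and first average saliency map could be computed along these lines:

```python
import torch.nn.functional as F

def first_saliency_maps(feature_map, global_feature):
    # feature_map: target representation feature map, shape (B, C, H, W)
    # global_feature: target global feature, shape (B, C)
    max_feat = F.max_pool2d(feature_map, kernel_size=2)   # maximum feature map, (B, C, H/2, W/2)
    avg_feat = F.avg_pool2d(feature_map, kernel_size=2)   # average feature map, (B, C, H/2, W/2)
    g = global_feature[:, :, None, None]                  # broadcast to (B, C, 1, 1)
    # Dot product with the target global feature over the channel dimension
    # gives one saliency score per spatial location.
    max_saliency = (max_feat * g).sum(dim=1)              # first maximum saliency map
    avg_saliency = (avg_feat * g).sum(dim=1)              # first average saliency map
    return max_saliency, avg_saliency
```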
In some embodiments of the present application, the identifying unit 905 is specifically configured to:
and based on an identification module in the trained object identification model, carrying out category identification on the target image based on the target saliency map to obtain a target object category in the target image.
In some embodiments of the present application, the article identification device 900 further includes a training unit (not shown in the figures), and the training unit is specifically configured to:
Acquiring a sample image of a sample article, wherein the sample image is marked with an actual article category of the sample article;
based on a global extraction module in a preset recognition model, acquiring a sample representation feature map and sample global features of the sample image;
based on a local extraction module in the preset recognition model, determining a sample local saliency map of the sample image according to the sample representation feature map and the sample global feature;
based on an identification module in the preset identification model, carrying out classification identification according to the sample local saliency map to obtain a first prediction category of the sample article;
determining a training loss value of the preset recognition model according to the actual object class and the first prediction class;
and adjusting model parameters of the preset recognition model based on the training loss value until the preset recognition model meets the preset training stopping condition, and taking the preset recognition model as the article recognition model.
In some embodiments of the present application, the training unit is specifically configured to:
based on an identification module in the preset identification model, carrying out classification identification according to the global characteristics of the sample to obtain a second prediction category of the sample article;
The determining the training loss value of the preset recognition model according to the actual article category and the first prediction category comprises the following steps:
acquiring a first prediction probability value of the first prediction category and a second prediction probability value of the second prediction category;
determining a sorting loss value of the preset recognition model according to the first prediction probability value and the second prediction probability value;
and determining the training loss value according to the sorting loss value.
In some embodiments of the present application, the first obtaining unit 901 is specifically configured to:
acquiring an image of a target object area;
performing classification detection on the image of the target object area to obtain each object detection area in the image of the target object area;
and acquiring screenshot of each article in the target article area according to the image of the target article area and each article detection area to serve as the target image.
In some embodiments of the present application, the article identification device 900 further includes a statistics unit (not shown in the figure), and the statistics unit is specifically configured to:
and counting the objects in the object area according to the object type to obtain the number of each object.
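For illustration only (the detector output format, NumPy-style image indexing, and the helper names below are assumptions rather than part of this application), cropping each article detection area and counting articles per category could be sketched as:

```python
from collections import Counter

def crop_and_count(region_image, detections, classify):
    # region_image: image of the target article area (H x W x 3 array)
    # detections: list of (x1, y1, x2, y2) article detection areas in that image
    # classify: callable returning the target article category for one cropped target image
    categories = []
    for x1, y1, x2, y2 in detections:
        target_image = region_image[y1:y2, x1:x2]   # screenshot of one detected article
        categories.append(classify(target_image))
    return Counter(categories)                      # number of articles per category
```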
In the implementation, each unit may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit may be referred to the foregoing method embodiment, which is not described herein again.
Since the article identifying device can execute the steps in the article identifying method according to any embodiment of the present application, such as fig. 1 to 8, the beneficial effects that can be achieved by the article identifying method according to any embodiment of the present application, such as fig. 1 to 8, can be achieved, and detailed descriptions thereof will be omitted.
In addition, in order to better implement the method for identifying an article in the embodiment of the present application, on the basis of the method for identifying an article, the embodiment of the present application further provides an electronic device, referring to fig. 10, fig. 10 shows a schematic structural diagram of the electronic device in the embodiment of the present application, specifically, the electronic device provided in the embodiment of the present application includes a processor 1001, where the processor 1001 is configured to implement steps of the method for identifying an article in any embodiment when executing a computer program stored in a memory 1002, as shown in fig. 1 to 8; alternatively, the processor 1001 is configured to implement the functions of each unit in the corresponding embodiment as in fig. 9 when executing the computer program stored in the memory 1002.
By way of example, a computer program may be partitioned into one or more modules/units that are stored in the memory 1002 and executed by the processor 1001 to accomplish the embodiments of the present application. One or more of the modules/units may be a series of computer program instruction segments capable of performing particular functions to describe the execution of the computer program in a computer device.
Electronic devices may include, but are not limited to, a processor 1001, a memory 1002. It will be appreciated by those skilled in the art that the illustrations are merely examples of electronic devices, and are not limiting of electronic devices, and may include more or fewer components than shown, or may combine some components, or different components, e.g., electronic devices may also include input and output devices, network access devices, buses, etc., through which the processor 1001, memory 1002, input and output devices, network access devices, etc., are connected.
The processor 1001 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The processor is the control center of the electronic device and connects the various parts of the overall electronic device through various interfaces and lines.
The memory 1002 may be used to store computer programs and/or modules, and the processor 1001 implements various functions of the computer device by running or executing the computer programs and/or modules stored in the memory 1002 and invoking data stored in the memory 1002. The memory 1002 may mainly include a storage program area that may store an operating system and application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and a storage data area; the storage data area may store data (such as audio data, video data, etc.) created according to the use of the electronic device, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one disk storage device, a flash memory device, or another non-volatile solid-state storage device.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the article identifying apparatus, the electronic device and the corresponding units thereof described above may refer to the description of the article identifying method in any embodiment corresponding to fig. 1 to 8, and will not be repeated herein.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
For this reason, the embodiment of the present application provides a computer readable storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute steps in the method for identifying an article in any embodiment corresponding to fig. 1 to 8, and specific operations may refer to descriptions of the method for identifying an article in any embodiment corresponding to fig. 1 to 8, which are not repeated herein.
Wherein the computer-readable storage medium may comprise: read Only Memory (ROM), random access Memory (RAM, random Access Memory), magnetic or optical disk, and the like.
Since the instructions stored in the computer readable storage medium may perform the steps in the method for identifying an article according to any embodiment of the present application, as shown in fig. 1 to 8, the beneficial effects that can be achieved by the method for identifying an article according to any embodiment of the present application, as shown in fig. 1 to 8, are detailed in the foregoing description, and are not repeated herein.
The foregoing has described in detail the methods, apparatuses, electronic devices and computer readable storage medium for identifying articles provided in the embodiments of the present application, and specific examples have been applied to illustrate the principles and embodiments of the present application, where the foregoing examples are provided to assist in understanding the methods and core ideas of the present application; meanwhile, those skilled in the art will have variations in the specific embodiments and application scope in light of the ideas of the present application, and the present description should not be construed as limiting the present application in view of the above.

Claims (13)

1. A method of identifying an article, the method comprising:
acquiring a target image of an object to be identified;
extracting global features of the target image to obtain target global features of the target image;
acquiring a target representation feature map of the target image;
acquiring a target saliency map of the target image part based on the target global feature and the target representation feature map;
and carrying out category identification on the target image based on the target saliency map to obtain the target object category of the object to be identified in the target image.
2. The method of claim 1, wherein the obtaining a target saliency map of the target image part based on the target global feature and the target representation feature map comprises:
determining a first maximum saliency map and a first average saliency map of the target image according to the target global feature and the target representation feature map;
and determining the target saliency map according to the first maximum saliency map and the first average saliency map.
3. The method of claim 2, wherein the determining the target saliency map from the first maximum saliency map and the first average saliency map comprises:
obtaining a preliminary saliency map of the target image part according to the first maximum saliency map and the first average saliency map;
determining a second maximum saliency map and a second average saliency map of the target image part according to the preliminary saliency map and the target representation feature map;
the target saliency map is determined from at least two of the second maximum saliency map, the second average saliency map, and the preliminary saliency map.
4. The method of claim 2, wherein the determining a first maximum saliency map of the target image from the target global feature and the target representation feature map comprises:
performing maximum pooling operation on the target representation feature map to obtain a maximum feature map of the target representation feature map;
and carrying out dot product on the maximum feature map and the target global feature to obtain the first maximum saliency map.
5. The method of claim 2, wherein the determining a first average saliency map of the target image from the target global feature and the target representation feature map comprises:
carrying out average pooling operation on the target representation feature map to obtain an average feature map of the target representation feature map;
and carrying out dot product on the average feature map and the target global feature to obtain the first average saliency map.
6. The method of claim 1, wherein the carrying out category identification on the target image based on the target saliency map to obtain the target object category of the object to be identified in the target image comprises:
And based on an identification module in the trained object identification model, carrying out category identification on the target image based on the target saliency map to obtain a target object category in the target image.
7. The method of claim 6, further comprising:
acquiring a sample image of a sample article, wherein the sample image is marked with an actual article category of the sample article;
based on a global extraction module in a preset recognition model, acquiring a sample representation feature map and sample global features of the sample image;
based on a local extraction module in the preset recognition model, determining a sample local saliency map of the sample image according to the sample representation feature map and the sample global feature;
based on an identification module in the preset identification model, carrying out classification identification according to the sample local saliency map to obtain a first prediction category of the sample article;
determining a training loss value of the preset recognition model according to the actual object class and the first prediction class;
and adjusting model parameters of the preset recognition model based on the training loss value until the preset recognition model meets the preset training stopping condition, and taking the preset recognition model as the article recognition model.
8. The method of claim 7, further comprising:
based on an identification module in the preset identification model, carrying out classification identification according to the global characteristics of the sample to obtain a second prediction category of the sample article;
the determining the training loss value of the preset recognition model according to the actual article category and the first prediction category comprises the following steps:
acquiring a first prediction probability value of the first prediction category and a second prediction probability value of the second prediction category;
determining a sorting loss value of the preset recognition model according to the first prediction probability value and the second prediction probability value;
and determining the training loss value according to the sorting loss value.
9. The method of any one of claims 1-8, wherein the acquiring a target image of the item to be identified comprises:
acquiring an image of a target object area;
performing classification detection on the image of the target object area to obtain each object detection area in the image of the target object area;
and acquiring screenshot of each article in the target article area according to the image of the target article area and each article detection area to serve as the target image.
10. The method of claim 9, wherein the method further comprises:
and counting the objects in the object area according to the object type to obtain the number of each object.
11. An article identification device, characterized in that the article identification device comprises:
the first acquisition unit is used for acquiring a target image of the object to be identified;
the global extraction unit is used for carrying out global feature extraction on the target image to obtain target global features of the target image;
a second acquisition unit configured to acquire a target representation feature map of the target image;
the local extraction unit is used for acquiring a target saliency map of the target image part based on the target global feature and the target representation feature map;
and the identification unit is used for carrying out category identification on the target image based on the target saliency map to obtain the target object category in the target image.
12. An electronic device comprising a processor and a memory, the memory having stored therein a computer program, the processor executing the article identification method of any of claims 1 to 10 when the computer program in the memory is invoked by the processor.
13. A computer-readable storage medium, having stored thereon a computer program, the computer program being loaded by a processor to perform the steps of the article identification method of any of claims 1 to 10.
CN202111540033.1A 2021-12-15 2021-12-15 Article identification method, apparatus, electronic device, and computer-readable storage medium Pending CN116266393A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111540033.1A CN116266393A (en) 2021-12-15 2021-12-15 Article identification method, apparatus, electronic device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN116266393A true CN116266393A (en) 2023-06-20



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination