CN114693918A - Image identification method and device and computer readable storage medium - Google Patents

Image identification method and device and computer readable storage medium

Info

Publication number
CN114693918A
CN114693918A (application CN202210331824.1A)
Authority
CN
China
Prior art keywords
target
information
image
target object
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210331824.1A
Other languages
Chinese (zh)
Inventor
张夏杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202210331824.1A
Publication of CN114693918A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses an image identification method and device and a computer readable storage medium. The method comprises: performing multi-resolution down-sampling on an image to be identified so as to obtain first feature maps corresponding to at least two resolutions respectively; performing feature fusion on the first feature maps corresponding to the at least two resolutions respectively to obtain a second feature map; performing multi-scale target detection on the second feature map so as to obtain a preset number of candidate frames; screening the preset number of candidate frames to obtain target candidate frames, a target candidate frame representing a region of the second feature map that contains a target object; and identifying the target object corresponding to each target candidate frame so as to obtain an identification result. Target objects of different sizes in the image to be recognized can thus be recognized, which improves the richness of the recognition result and meets the requirements of practical application scenarios.

Description

Image identification method and device and computer readable storage medium
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to an image recognition method and apparatus, and a computer-readable storage medium.
Background
Image recognition technology refers to technology that processes, analyzes, and understands images in order to recognize various target objects. For example, by judging the type of trademark appearing in an input image, the recognition result can be used to interact with users in commodity activities and to indicate activity information and promotion information to users, which can improve the brand influence of commodities and increase commodity sales.
At present, target objects in images are generally identified by algorithms based on image matching; the recognition results obtained in this way are limited, so the requirements of practical application scenarios cannot be met.
Disclosure of Invention
Embodiments of the present invention are intended to provide an image recognition method and apparatus, and a computer-readable storage medium, which can improve the richness of image recognition results to meet the requirements of real-world scene applications.
The technical scheme of the invention is realized as follows:
the embodiment of the invention provides an image identification method, which comprises the following steps:
performing multi-resolution down-sampling on an image to be identified so as to obtain first feature maps corresponding to at least two resolutions respectively;
performing feature fusion on the first feature maps corresponding to the at least two resolutions respectively to obtain a second feature map;
performing multi-scale target detection on the second feature map so as to obtain a preset number of candidate frames;
screening the preset number of candidate frames to obtain target candidate frames; the target candidate box characterizes a region containing a target object in the second feature map;
and identifying the target object corresponding to the target candidate frame so as to obtain an identification result.
In the above technical solution, the identification result includes category information, location information, area ratio, and quantity information.
In the above technical solution, the performing multi-resolution down-sampling on the image to be recognized to obtain the first feature maps corresponding to at least two resolutions respectively includes:
determining a down-sampling proportional coefficient based on the side length of the image to be identified; the down-sampling proportional coefficient is a common divisor of the side length of the image to be identified;
and performing downsampling of at least two resolutions on the image to be identified according to the downsampling proportion coefficient to obtain the first feature maps corresponding to the at least two resolutions respectively.
In the foregoing technical solution, the performing multi-scale target detection on the second feature map to obtain a preset number of candidate frames includes:
detecting the sub-features of each decomposition layer in the second feature map through sliding windows until the target sub-features are detected by a preset number of sliding windows, so as to realize multi-scale target detection;
and mapping the target sub-features to the area of the image to be identified to generate corresponding candidate frames, so as to obtain the preset number of candidate frames.
In the above technical solution, the identifying the region corresponding to the target candidate frame to obtain an identification result includes:
identifying a target object corresponding to the target candidate frame to obtain position information of the target candidate frame;
and obtaining the area ratio information and the target position information in the recognition result according to the position information and the size information of the image to be recognized.
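As an illustrative sketch of how the area-ratio information can be derived from the position information of a target candidate frame and the size information of the image to be recognized, consider the following minimal Python example; the (x1, y1, x2, y2) corner convention and the function name are assumptions for illustration, not part of the disclosure.

```python
# Illustrative sketch only: derive the area-ratio information from a
# candidate frame's position information and the image size.
# The (x1, y1, x2, y2) corner convention is an assumption.
def area_ratio(box, image_width, image_height):
    x1, y1, x2, y2 = box
    box_area = max(0, x2 - x1) * max(0, y2 - y1)
    return box_area / (image_width * image_height)

# A 64 x 64 frame inside a 256 x 256 image occupies 1/16 of its area.
print(area_ratio((0, 0, 64, 64), 256, 256))  # 0.0625
```

The same position information also yields the target position information directly, e.g., as coordinates normalized by the image side lengths.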
In the above technical solution, the identifying the region corresponding to the target candidate frame to obtain an identification result includes:
identifying a target object corresponding to the target candidate box to obtain a prediction classification label of the target object;
and matching the predicted classification label with a preset classification label to obtain the target quantity information and the category information in the identification result.
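The matching of predicted classification labels against preset classification labels to obtain category and quantity information can be sketched as a simple counting step; the label names below are invented for illustration and the exact matching rule used by the patent is not specified.

```python
from collections import Counter

# Hypothetical label matching: keep only predictions whose label is in
# the preset label set, then derive category and quantity information.
def summarize(predicted_labels, preset_labels):
    matched = [lbl for lbl in predicted_labels if lbl in preset_labels]
    counts = Counter(matched)
    return {"categories": sorted(counts), "quantity": sum(counts.values())}

result = summarize(["brand_a", "brand_a", "brand_b", "unknown"],
                   {"brand_a", "brand_b"})
print(result)  # {'categories': ['brand_a', 'brand_b'], 'quantity': 3}
```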
In the above technical solution, the method further includes:
and when the quantity information indicates one target object, displaying the article activity information corresponding to that target object.
In the above technical solution, the method further includes:
when the quantity information indicates at least two target objects, receiving an activity information acquisition request, wherein the activity information acquisition request carries any one of the at least two target objects;
and responding to the activity information acquisition request, and displaying the corresponding article activity information according to that target object.
In the above technical solution, the method further includes:
receiving a commodity tracking request, wherein the commodity tracking request carries the identification result;
responding to the commodity tracking request, obtaining corresponding tracking information based on the identification result, and displaying prompt information according to the tracking information and the area ratio information in the identification result.
The embodiment of the invention provides an image identification device, which comprises a down-sampling unit, a fusion unit, a detection unit, a screening unit and an identification unit; wherein:
the down-sampling unit is used for carrying out multi-resolution down-sampling on the image to be identified so as to obtain first feature maps corresponding to at least two resolutions respectively;
the fusion unit is used for performing feature fusion on the first feature maps corresponding to the at least two resolutions respectively to obtain a second feature map;
the detection unit is used for carrying out multi-scale target detection on the second feature map so as to obtain a preset number of candidate frames;
the screening unit is used for screening the candidate frames with the preset number to obtain target candidate frames; the target candidate box characterizes a region containing a target object in the second feature map;
and the identification unit is used for identifying the target object corresponding to the target candidate frame so as to obtain an identification result.
In the above technical solution, the down-sampling unit is further configured to perform down-sampling on the image to be identified according to a preset resolution, so as to obtain the first feature maps corresponding to at least two resolutions respectively.
In the above technical solution, the downsampling unit is further configured to determine a down-sampling scale coefficient based on the side length of the image to be identified, the down-sampling scale coefficient being a common divisor of the side length of the image to be identified, and to perform down-sampling of at least two resolutions on the image to be identified according to the down-sampling scale coefficient, so as to obtain the first feature maps corresponding to the at least two resolutions respectively.
In the above technical solution, the detection unit is further configured to detect the sub-features of each decomposition layer in the second feature map through sliding windows until a preset number of sliding windows detect target sub-features, so as to implement multi-scale target detection; and mapping the target sub-features to the area of the image to be identified to generate corresponding candidate frames, so as to obtain the preset number of candidate frames.
In the above technical solution, the identifying unit is further configured to identify a target object corresponding to the target candidate frame to obtain position information of the target candidate frame; and obtaining the area ratio information and the position information in the identification result according to the position information and the size information of the image to be identified.
In the above technical solution, the identifying unit is further configured to identify a target object corresponding to the target candidate box to obtain a prediction classification tag of the target object;
and matching the predicted classification label with a preset classification label to obtain the quantity information and the category information in the identification result.
In the above technical solution, the apparatus further includes a display unit, configured to display the article activity information corresponding to one target object when the quantity information indicates that one target object is present.
In the above technical solution, the apparatus further comprises a receiving unit, wherein,
the receiving unit is configured to receive an activity information acquisition request when the quantity information indicates at least two target objects, where the activity information acquisition request carries any one of the at least two target objects;
and the display unit is used for responding to the activity information acquisition request and displaying the corresponding article activity information according to that target object.
In the above technical solution, the receiving unit is further configured to receive a commodity tracking request, where the commodity tracking request carries the identification result;
the display unit is further configured to respond to the commodity tracking request, obtain corresponding tracking information based on the identification result, and display prompt information according to the tracking information and the area ratio information in the identification result.
An embodiment of the present invention provides an image recognition apparatus, including:
a memory for storing executable instructions;
and the processor is used for realizing the image identification method according to the embodiment of the invention when executing the executable instructions stored in the memory.
Embodiments of the present invention provide a computer-readable storage medium, which stores executable instructions for causing a processor to implement an image recognition method according to an embodiment of the present invention when the processor executes the executable instructions.
The embodiment of the invention provides an image identification method and device and a computer readable storage medium. The method performs multi-resolution down-sampling on an image to be identified, performs feature fusion on the multi-resolution feature maps obtained by the down-sampling, extracts features of different resolutions from the fused feature map by a multi-scale method so as to judge regions that may contain a target object, generates candidate frames to frame-select those regions, screens the regions to determine the region with the highest probability of containing the target object together with its target candidate frame, and obtains an identification result by identifying the target candidate frame and the region within it.
In the embodiment of the invention, target objects with different sizes in the image to be recognized can be recognized through multi-resolution down-sampling and multi-scale methods, so that the richness of the recognition result is improved, and the requirements of practical application scenes are met.
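The disclosure does not specify the algorithm used to screen the preset number of candidate frames down to the target candidate frames; in practice, detection networks of this kind commonly use non-maximum suppression (NMS). The following sketch shows one plausible realization under that assumption; the box format (x1, y1, x2, y2, score), function names and IoU threshold are all illustrative choices.

```python
# Illustrative candidate-frame screening via non-maximum suppression.
# NMS is a common choice but is an assumption here, not the patent's text.
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2[, score]) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, iou_threshold=0.5):
    """Keep highest-scoring boxes, dropping those that overlap a kept box."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept

boxes = [(0, 0, 10, 10, 0.9), (1, 1, 10, 10, 0.8), (20, 20, 30, 30, 0.7)]
print(nms(boxes))  # keeps the 0.9 box and the non-overlapping 0.7 box
```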
Drawings
Fig. 1 is a first flowchart of an image recognition method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a preset detection network according to an embodiment of the present invention;
fig. 3 is a second flowchart of an image recognition method according to an embodiment of the present invention;
fig. 4 is a third flowchart of an image recognition method according to an embodiment of the present invention;
fig. 5 is a first schematic structural diagram of an image recognition apparatus according to an embodiment of the present invention;
fig. 6 is a second schematic structural diagram of an image recognition apparatus according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Fig. 1 is a first flowchart of an image recognition method according to an embodiment of the present invention, and as shown in fig. 1, an image recognition method according to an embodiment of the present invention includes:
s101, performing multi-resolution down-sampling on an image to be identified so as to obtain first feature maps corresponding to at least two resolutions respectively.
This embodiment is applicable to scenarios in which the image to be recognized is sampled so as to facilitate subsequent recognition.
In the embodiment of the invention, an image recognition device carries out multi-resolution down-sampling on an image to be recognized through a preset detection network so as to obtain first feature maps corresponding to at least two resolutions respectively; the first feature map represents feature information in the image to be recognized, and the feature information refers to features of the image to be recognized, such as brightness, edges, textures and colors, which can be distinguished from other types of images.
In the embodiment of the invention, after an image to be recognized is acquired by an image recognition device, the image to be recognized is input into a preset detection network, so that the preset detection network performs multi-resolution down-sampling on the image to be recognized; the multi-resolution down-sampling refers to down-sampling the images to be recognized according to different resolutions to obtain the images to be recognized with different resolutions, digitizing the images to be recognized with different resolutions to obtain the feature information in the images to be recognized with different resolutions, and thus obtaining the first feature maps corresponding to the different resolutions respectively.
In the embodiment of the invention, when the image to be identified is subjected to multi-resolution down-sampling, the image to be identified is adjusted to different resolutions, and the feature information in the image to be identified at each resolution is extracted; feature information of the same resolution (size) is combined to form a feature vector, i.e., the first feature map corresponding to that resolution. The first feature maps corresponding to different resolutions are arranged in a preset order, which may be from high resolution to low resolution or from low resolution to high resolution. For example, fig. 2 is a schematic structural diagram of a preset detection network according to an embodiment of the present invention; as shown in fig. 2, the preset detection network includes a Backbone, a Neck, and a Head. The Backbone network is used to extract information from the image to be identified; the Head is a detection head, mainly used to predict the category and position of a target; the Neck is arranged between the Backbone and the Head and adds network layers that collect feature maps from different stages, so as to make better use of the information extracted by the Backbone. In actual use, the Backbone performs the multi-resolution down-sampling of the image to be identified, thereby obtaining the first feature maps corresponding to at least two resolutions; when the resolution of the image to be recognized is 256 × 256 and 5 rounds of down-sampling are required, first feature maps with resolutions of 256 × 256, 128 × 128, 64 × 64, 32 × 32, 16 × 16 and 8 × 8 are obtained in sequence, arranged in order from high resolution to low resolution.
The Backbone is also used for outputting the first characteristic diagram corresponding to each of the acquired at least two resolutions to the Neck for collection.
It can be understood that, by performing multi-resolution down-sampling on an image to be recognized, first feature maps corresponding to at least two resolutions acquired according to different resolutions can be obtained, and because the resolutions are different, features corresponding to feature information in sizes of the first feature maps corresponding to the at least two resolutions are also different, so that the recognition effect and the recognition accuracy of subsequently extracting features to recognize a target object can be improved.
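Purely for illustration, the multi-resolution down-sampling above can be sketched as repeated 2x average pooling on a nested-list "image". An actual Backbone would use learned (e.g., strided-convolution) layers, so this sketch only demonstrates how the chain of feature-map resolutions 256, 128, 64, 32, 16, 8 arises; the function names are invented.

```python
# Sketch only: 2x average-pool down-sampling of a 2-D nested list.
def downsample2x(img):
    h, w = len(img), len(img[0])
    return [[(img[2 * i][2 * j] + img[2 * i][2 * j + 1] +
              img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w // 2)] for i in range(h // 2)]

# Build the multi-resolution pyramid of first feature maps.
def pyramid(img, levels):
    maps = [img]
    for _ in range(levels):
        img = downsample2x(img)
        maps.append(img)
    return maps

image = [[1.0] * 256 for _ in range(256)]   # stand-in 256 x 256 image
maps = pyramid(image, 5)                    # 5 rounds of down-sampling
print([len(m) for m in maps])  # [256, 128, 64, 32, 16, 8]
```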
In some embodiments of the present invention, S101 may further include S1011 and S1012, where S1011 and S1012 are as follows:
and S1011, determining a down-sampling proportional coefficient based on the side length of the image to be identified.
In some embodiments of the present invention, a down-sampling scaling factor is determined based on a side length of an image to be identified, where the down-sampling scaling factor is a common divisor of the side length of the image to be identified.
In some embodiments of the present invention, for example, an image (image to be identified) with a resolution of M × N is subjected to s-fold (down-sampling scaling factor) down-sampling, and an image (first feature map) with a resolution of (M/s) × (N/s) is obtained, where s is a common divisor of M and N; when the resolution of the image to be recognized is 256 × 256 and 5 times of downsampling of the image to be recognized is required, the downsampling scale factor may be determined to be 2.
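The constraint that the down-sampling scale factor s must be a common divisor of the side lengths, so that the result (M/s) × (N/s) stays integral, can be checked with a small sketch; the function names are illustrative.

```python
from math import gcd

# Valid scale factors are the divisors (>= 2) of gcd(M, N): each such s
# divides both side lengths, keeping (M/s) x (N/s) integral.
def valid_scale_factors(m, n):
    g = gcd(m, n)
    return [s for s in range(2, g + 1) if g % s == 0]

def downsampled_size(m, n, s):
    assert m % s == 0 and n % s == 0, "s must divide both side lengths"
    return (m // s, n // s)

print(valid_scale_factors(256, 256)[:3])  # [2, 4, 8]
print(downsampled_size(256, 256, 2))      # (128, 128)
```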
S1012, performing downsampling of at least two resolutions on the image to be identified according to the downsampling scale coefficient to obtain first feature maps corresponding to the at least two resolutions.
In some embodiments of the invention, the image to be identified is downsampled at least two resolutions according to a downsampling scaling factor; since each down-sampling will obtain a layer of first feature map, the down-sampling of at least two resolutions of the image to be recognized can obtain the first feature maps corresponding to the at least two resolutions.
In some embodiments of the present invention, when the image to be identified is down-sampled, each down-sampling reduces the image to be identified according to the down-sampling scale factor. As the size of the image to be recognized is reduced, its resolution is reduced as well; the corresponding first feature maps are obtained from the feature information in the image to be recognized at the different resolutions, and the resolution of each layer of first feature map is different and decreases layer by layer. Illustratively, after an image (the image to be recognized) with resolution M × N is subjected to s-fold down-sampling (s being the down-sampling scale factor), an image with resolution (M/s) × (N/s) is obtained, which constitutes the first-layer first feature map; subjecting this (M/s) × (N/s) image to a further s-fold down-sampling yields an image with resolution (M/s²) × (N/s²), which constitutes the second-layer first feature map.
It can be understood that, since the resolutions of the first feature maps of each layer are different, the sizes of the feature information included in the first feature maps of each layer are also different, so that when the first feature maps corresponding to the at least two resolutions are detected, the comprehensiveness of the obtained feature information can be improved, and the recognition effect can be improved.
And S102, performing feature fusion on the first feature maps corresponding to the at least two resolutions respectively to obtain a second feature map.
This embodiment is applicable to scenarios in which features are enhanced so as to facilitate subsequent recognition processing.
In the embodiment of the invention, the first feature maps corresponding to the at least two resolutions are subjected to feature fusion; that is, the feature information in the first feature maps corresponding to the at least two resolutions is added element by element, thereby completing the feature fusion and obtaining a second feature map. The second feature map contains all the feature information extracted from the first feature maps corresponding to the at least two resolutions, distributed according to the corresponding resolutions and levels.
In the embodiment of the present invention, among the first feature maps corresponding to the at least two resolutions, a higher-resolution first feature map contains more position information and detail information, but has undergone fewer convolutions, so it carries less semantic information. A low-resolution first feature map carries much semantic information but, owing to its low resolution, has poor perception of detail. Position information, detail information and semantic information all belong to feature information.
In the embodiment of the present invention, as shown in fig. 2, the Neck part indicates the process of adding the first feature maps corresponding to the at least two resolutions element by element to obtain the second feature map, where the arrows indicate the order of the element-by-element additions. When the features are fused, the deep information (semantic information) of the low-resolution first feature maps and the shallow information (position information and detail information) of the high-resolution first feature maps are extracted, and the deep information and the shallow information are added element by element to obtain multiple layers of sub-features, each layer being called a decomposition layer. Within a decomposition layer of the second feature map the sub-feature resolution is the same, and the sub-feature resolution decreases from layer to layer. Illustratively, the Neck part performs feature fusion on the first feature maps with resolutions of 64 × 64, 32 × 32, 16 × 16 and 8 × 8 to obtain a second feature map, in which the sub-feature resolution between layers decreases in sequence from 64 × 64 to 8 × 8.
It can be understood that the computation amount of the Neck can be reduced, and when the second feature map is subsequently processed to obtain the target object, more feature information can be obtained from the sub-features at multiple resolutions, thereby improving the accuracy of identifying the target object.
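As an illustrative sketch of the fusion step described above, the following code upsamples a low-resolution (deep) map with nearest-neighbour interpolation and adds it element by element to a high-resolution (shallow) map. The disclosure only specifies the element-wise addition; the nearest-neighbour upsampling and the function names are assumptions.

```python
# Sketch only: FPN-style fusion of a deep (low-resolution) map into a
# shallow (high-resolution) map. Upsampling method is an assumption.
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D nested list."""
    return [[v for v in row for _ in (0, 1)] for row in fmap for _ in (0, 1)]

def fuse(shallow, deep):
    """Element-wise addition of the upsampled deep map onto the shallow map."""
    up = upsample2x(deep)
    return [[shallow[i][j] + up[i][j] for j in range(len(shallow[0]))]
            for i in range(len(shallow))]

shallow = [[1, 1], [1, 1]]   # 2 x 2 high-resolution sub-features
deep = [[3]]                 # 1 x 1 low-resolution sub-features
print(fuse(shallow, deep))   # [[4, 4], [4, 4]]
```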
S103, carrying out multi-scale target detection on the second feature map, and thus obtaining a preset number of candidate frames.
This embodiment is applicable to scenarios in which a region of interest is selected from the second feature map.
In the embodiment of the invention, performing multi-scale target detection on the second feature map means performing target detection on the second feature map by a multi-scale method, so that feature maps of different resolutions are obtained by detection; the feature maps of different resolutions represent target objects of different sizes, and after a target object is detected, a generated candidate frame is used to frame the target object, thereby obtaining a preset number of candidate frames.
In the embodiment of the invention, the second feature map is composed of at least two decomposition layers, the at least two decomposition layers respectively comprise sub-features with different resolutions, and the sub-features with different resolutions represent different feature information of the image to be identified. The sub-feature resolution of each of the at least two decomposition layers is the same as the resolution of the corresponding first feature map.
In the embodiment of the present invention, a multi-scale method, also referred to as a multi-resolution method, is used to obtain sub-features of different resolutions in the second feature map and to determine the target object and its position through these sub-features; the approximate features of the target object can be obtained through the low-resolution sub-features, and the detail features of the target object can be obtained through the high-resolution sub-features. Illustratively, the decomposition layer with a resolution of 8 × 8 in the second feature map corresponds to the first feature map with a resolution of 8 × 8. When target detection is performed and the sub-features of this 8 × 8 decomposition layer are mapped to the image to be recognized, the region corresponding to each 8 × 8 sub-feature is relatively large, so detecting the 8 × 8 sub-features yields target objects with a relatively large area. The decomposition layer with a resolution of 64 × 64 corresponds to the first feature map with a resolution of 64 × 64; after its sub-features are mapped to the image to be identified, the region corresponding to each 64 × 64 sub-feature is smaller than the region corresponding to an 8 × 8 sub-feature, so detecting the 64 × 64 sub-features yields target objects with a small area.
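The size relationship described above, where a cell of a low-resolution decomposition layer maps back to a larger region of the image than a cell of a high-resolution layer, can be made concrete with a small sketch; it assumes a square image and integer strides, and the function name is invented.

```python
# Sketch only: side length of the image region that one cell of a
# decomposition layer maps back to (assuming a square image).
def cell_region_size(image_side, layer_resolution):
    return image_side // layer_resolution

image_side = 256
for res in (64, 32, 16, 8):
    print(res, "->", cell_region_size(image_side, res), "px per cell")
# 64 -> 4, 32 -> 8, 16 -> 16, 8 -> 32: the 8 x 8 layer sees the
# largest regions, hence detects the largest target objects.
```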
In the embodiment of the present invention, the generation of candidate frames is used to search for positions in the image to be identified that may contain the target object; these positions are also called regions of interest (ROI). In actual use, the sub-features in each decomposition layer of the second feature map are scanned with sliding windows, whose size and number are preset for each decomposition layer of the second feature map. During the scanning, the sub-features of each decomposition layer in the second feature map are detected so as to obtain the target sub-features in each decomposition layer; the semantic information of the region corresponding to each target sub-feature is captured, and the corresponding region is then frame-selected when the target sub-feature is mapped to the image to be recognized, so as to generate a candidate frame.
In the embodiment of the present invention, as shown in fig. 2, suppose the second feature map is obtained by feature fusion of the first feature maps with resolutions of 64 × 64, 32 × 32, 16 × 16, and 8 × 8, with 6 candidate frames preset at each position of the 8 × 8 decomposition layer of the second feature map and 2 candidate frames at each position of the remaining decomposition layers. When the sliding window captures the target sub-features in the 8 × 8 decomposition layer, the captured sub-features are scored during capture and 6 candidate frames are generated at each position according to the score; similarly, as the sliding window captures the target sub-features of the 64 × 64, 32 × 32, and 16 × 16 decomposition layers, 2 candidate frames are generated at each position, so the second feature map yields 8 × 8 × 6 + 16 × 16 × 2 + 32 × 32 × 2 + 64 × 64 × 2 = 11136 candidate frames. In fig. 2, 8 × 8 × 6, 16 × 16 × 2, 32 × 32 × 2, and 64 × 64 × 2 between the Neck part and the Head part are the numbers of candidate frames generated at each position of the decomposition layers with resolutions of 8 × 8, 16 × 16, 32 × 32, and 64 × 64 in the second feature map, respectively; the scoring criterion for generating the candidate frames is learned by the preset detection network during training. After multi-scale target detection on the second feature map is completed, the obtained preset number of candidate frames are delivered to the Boxes part.
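The candidate-frame arithmetic above can be checked with a short sketch (the per-layer counts come from the example in the text; the dictionary name is illustrative):

```python
# Per-position candidate-frame counts from the example: 6 frames per
# position on the 8x8 decomposition layer, 2 on each of the others.
anchors_per_position = {8: 6, 16: 2, 32: 2, 64: 2}

# Each layer of resolution res x res has res*res positions.
total = sum(res * res * n for res, n in anchors_per_position.items())
print(total)  # 8*8*6 + 16*16*2 + 32*32*2 + 64*64*2 = 11136
```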
It can be understood that obtaining sub-features with different resolutions from the second feature map through multi-scale target detection effectively simulates the size changes caused by varying distances to a target object in real life, thereby improving recognition accuracy.
In some embodiments of the present invention, the sub-features of each decomposition layer in the second feature map are detected through the sliding window until the target sub-features are detected by a preset number of sliding windows, so as to implement multi-scale target detection.
In some embodiments of the present invention, the sub-features of each decomposition layer in the second feature map are detected through a sliding window preset for each decomposition layer in the second feature map until each sliding window detects the corresponding target sub-feature in each decomposition layer, thereby completing the multi-scale target detection on the second feature map.
In some embodiments of the present invention, a corresponding candidate frame is generated according to the region of the target sub-feature mapped to the image to be recognized, so as to obtain a preset number of candidate frames.
In some embodiments of the present invention, the region of the target sub-feature mapped to the image to be recognized is a region that may include the target object, and the candidate frames having the same size as the sliding window are generated according to the region, so as to obtain a preset number of candidate frames.
It can be understood that by acquiring sub-features with different resolutions in the second feature map through multi-scale target detection, more feature information can be acquired, and the position of the target object can be determined according to the feature information.
S104, screening a preset number of candidate frames to obtain target candidate frames. The target candidate box characterizes a region of the second feature map containing the target object.
The method and the device are suitable for scenarios in which the preset number of candidate frames are screened so that the candidate frame with the highest probability of containing the target object is determined as the target candidate frame.
In the embodiment of the invention, a preset number of candidate frames are screened to obtain target candidate frames; the target candidate box is the candidate box with the highest probability of containing the target object, and therefore, the target candidate box represents the region containing the target object in the second feature map.
In the embodiment of the present invention, when the preset number of candidate frames are screened, since each position of each decomposition layer in the second feature map has a preset number of candidate frames, the candidate frames may overlap with each other; according to the overlap between candidate frames, the positioning accuracy of a candidate frame with respect to the target object position may be measured by the Intersection over Union (IoU).
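A minimal sketch of the IoU computation for two axis-aligned boxes (the corner format (x1, y1, x2, y2) is an assumption; the patent does not fix a box representation):

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # overlap area, 0 if disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7 -> 1/7
```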
In the embodiment of the present invention, as shown in fig. 2, the preset number of candidate frames are screened through Non-Maximum Suppression (NMS). The screening process is as follows: first, the preset number of candidate frames are ranked by score, where the score of each candidate frame is obtained when the candidate frame is generated in S103; second, the candidate frames with scores above a preset score are retained, the IoU between each pair of retained candidate frames is calculated, and the candidate frames whose IoU exceeds a threshold are removed. Through this screening process, the preset number of candidate frames are iteratively filtered until the target candidate frames are obtained, where the number of target candidate frames is at least one.
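The screening steps above can be sketched as a greedy NMS (the thresholds, helper names, and box format are illustrative assumptions, not values from the text):

```python
def _iou(a, b):
    """Intersection over Union of boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Greedy non-maximum suppression: discard low-scoring boxes, then
    iteratively keep the best remaining box and remove every box that
    overlaps it by more than the IoU threshold."""
    order = sorted((i for i, s in enumerate(scores) if s > score_threshold),
                   key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        order = [i for i in order
                 if _iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept
```

For example, `nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7])` returns `[0, 2]`: the second box overlaps the first too strongly and is suppressed, while the distant third box survives.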
It can be understood that target candidate frames are determined by screening the candidate frames, and each target candidate frame represents one target object. Since at least one target candidate frame can be obtained in the screening process, the embodiment of the present invention can identify a plurality of target objects in the image to be recognized, thereby expanding the range of application scenarios.
And S105, identifying the target object corresponding to the target candidate frame to obtain an identification result.
The method and the device are suitable for scenarios in which the recognition result is output through the target candidate frame.
In the embodiment of the invention, the target object framed by the target candidate frame and the corresponding target sub-feature in the second feature map are identified, and the target object and its recognition result are obtained from the target sub-feature.
In the embodiment of the invention, the feature information of the target object framed by the target candidate frame is extracted; this feature information is the target sub-feature corresponding to the target candidate frame in the second feature map. In actual use, the target sub-features can be distinguished by a classifier to obtain a predicted classification label for each target sub-feature, and each predicted classification label is matched against the preset classification labels; if the predicted classification label belongs to the preset classification labels, the content framed by the target candidate frame is a target object, and the predicted classification label is the category information of the target object. For example, if the target object to be identified is a trademark, the preset classification labels are "wine", "beverage", "condiment", and the like, and the classifier determines the predicted classification label of a target sub-feature to be "wine", then, because that predicted label belongs to the preset labels, the category information of the target object is: wine. If 3 predicted classification labels are obtained and 2 of them are successfully matched with the preset classification labels, the number of target objects is 2, and the category information of the 2 target objects may be the same or different.
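The label-matching step can be sketched as follows (the label strings come from the example above; the function name and label set are hypothetical):

```python
# Hypothetical illustration of matching predicted classification labels
# against the preset labels to derive category and quantity information.
PRESET_LABELS = {"wine", "beverage", "condiment"}

def match_labels(predicted_labels, preset=PRESET_LABELS):
    """Return (quantity, categories) for predictions matching a preset label."""
    matched = [p for p in predicted_labels if p in preset]
    return len(matched), matched

count, categories = match_labels(["wine", "wine", "shoe"])
print(count, categories)  # 2 ['wine', 'wine'] -- "shoe" is not a preset label
```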
In the embodiment of the invention, the position information of the target object can be obtained by identifying the coordinates of the 4 corners of the target candidate frame. Meanwhile, the side lengths of the target candidate frame, and hence its area, can be obtained from these corner coordinates; taking the quotient of the area of the target candidate frame and the area of the image to be recognized yields the area ratio of the target object, namely the area ratio in the recognition result.
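The area-ratio computation can be sketched as follows (the corner-coordinate box format is an assumption):

```python
def area_ratio(box, image_width, image_height):
    """Area of the target candidate frame, given by its corner coordinates
    (x1, y1, x2, y2), divided by the area of the image to be recognized."""
    box_area = (box[2] - box[0]) * (box[3] - box[1])
    return box_area / (image_width * image_height)

# A 64x64 frame inside a 256x256 image occupies 1/16 of it.
print(area_ratio((32, 32, 96, 96), 256, 256))  # 0.0625
```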
It can be understood that this enriches the image recognition results. It solves the problem that the position of the target object, the area ratio of the target object, and the number of target objects cannot be obtained when the target object is identified by an image matching technique, thereby meeting the requirements of real application scenarios.
In some embodiments of the invention, the recognition result includes category information, location information, area ratio, and quantity information.
In some embodiments of the present invention, the number of the target objects and the category information of the target objects are obtained by identifying the target candidate frame, and the area ratio of the target objects and the position information of the target objects are determined according to the size and the position information of the target candidate frame. The following recognition results were thus obtained: category information, location information, area ratio, and quantity information.
In some embodiments of the present invention, the target candidate boxes may be identified by a single Convolutional Neural Network (CNN) model that outputs the recognition result, or by a plurality of cascaded CNN models so that the different sub-results of the recognition result are output separately.
In some embodiments of the present invention, a target object corresponding to a target candidate frame is identified to obtain position information of the target candidate frame.
In some embodiments of the present invention, a target object corresponding to a target candidate frame is identified, and a coordinate system is constructed according to an image to be identified, so as to obtain coordinate information of the target candidate frame.
In some embodiments of the present invention, the area ratio information and the position information in the recognition result are obtained according to the position information and the size information of the image to be recognized.
In some embodiments of the present invention, the position information may be obtained according to the coordinate information of the target candidate frame. And calculating to obtain the area of the target candidate frame according to the coordinate information of the target candidate frame, and then obtaining the area of the image to be recognized according to the size information of the image to be recognized, thereby obtaining the area ratio information in the recognition result.
In some embodiments of the present invention, a target object corresponding to the target candidate box is identified to obtain a predicted classification tag of the target object.
In some embodiments of the present invention, the predicted classification label is matched with a preset classification label, so as to obtain the quantity information and the category information of the target object in the recognition result.
In some embodiments of the present invention, the predicted classification tag is matched against the preset classification tags. If the matching succeeds, that is, the predicted classification tag belongs to the preset classification tags, a target object is present in the target candidate frame; if the target object is a trademark, the predicted classification tag is the category information corresponding to the trademark. If the matching fails, that is, the predicted classification tag does not belong to the preset classification tags, it is determined that the target candidate frame does not contain a target object. The trademark quantity information is obtained according to how many of the output predicted classification tags match the preset classification tags.
In some embodiments of the invention, the image recognition method further comprises: when the quantity information indicates one target object, displaying the article activity information corresponding to that target object.
In some embodiments of the present invention, when the obtained number is one target object, that is, when there is only one target object in the image to be recognized, the article activity information related to the category information is displayed according to the category information corresponding to the one target object.
In some embodiments of the present invention, the article activity information may be displayed in a pop-up window form, the display terminal may be a mobile phone or a computer, and the article activity information is stored in the client database and may be called from the client database when displaying the article activity information.
In some embodiments of the invention, the image recognition method further comprises: when the quantity information indicates at least two target objects, receiving an activity information acquisition request, wherein the activity information acquisition request carries any one of the at least two target objects.
In some embodiments of the present invention, when the obtained quantity information is at least two target objects, that is, when there are at least two target objects in the image to be recognized, the client receives an activity information acquisition request, where the activity information acquisition request is initiated through interaction between the user and the client, that is, the activity information acquisition request is initiated through selection of any one target object of the at least two target objects by the user; therefore, the activity information acquisition request carries any one of the at least two target objects.
In some embodiments of the present invention, in response to the activity information acquisition request, the corresponding item activity information is displayed according to any one of the target objects.
In some embodiments of the present invention, item activity information related to any one of the target objects is displayed according to the location information and/or the category information of the target object, where the item activity information is a response result in response to the activity information acquisition request.
In some embodiments of the present invention, the activity information acquisition request is used to request the article activity information associated with any one of the at least two target objects. Illustratively, when at least two target objects exist in the image to be recognized, the client obtains the target object selected by the user from the activity information acquisition request, and displays the article activity information related to that target object in a pop-up window.
It can be understood that the activity information of the article required by the user can be displayed through the activity information acquisition request, so that more interactive operations are provided for the user, and more requirements of the user are met.
In some embodiments of the invention, the image recognition method further comprises: receiving a commodity tracking request, wherein the commodity tracking request carries an identification result;
in some embodiments of the present invention, a commodity tracking request is received, where the commodity tracking request carries the area ratio information of the target object corresponding to the commodity.
In some embodiments of the invention, the current picture is captured by the camera, the target objects in the current picture are identified, and the target object related to the commodity corresponding to the commodity tracking request is captured. During tracking, different prompt messages can be popped up as the area ratio of the target object changes, such as the category information of the target object and the distance between the target object and the user.
In some embodiments of the invention, in response to the commodity tracking request, corresponding tracking information is obtained based on the identification result, and prompt information is displayed according to the tracking information and the area ratio information in the identification result.
In some embodiments of the invention, corresponding tracking information is obtained according to the commodity determined to be related to the identification result, and prompt information is displayed according to the tracking information and the area ratio information to inform the user of the current distance to the commodity.
It can be understood that tracking of the article can be achieved through the recognition result, and the distance between the user and the article can be known according to the area ratio information of the target object, so that the method is suitable for more application scenes, and the application range is expanded.
Fig. 3 is a flowchart of a second image recognition method according to an embodiment of the present invention. As shown in fig. 3, the embodiment of the present invention provides an image recognition method, described here by taking the case where the target object is a trademark as an example; the method comprises the following steps:
S201, performing detection preprocessing on the image to be recognized to obtain a processed image to be recognized.
In some embodiments of the present invention, fig. 4 is a flowchart of an image recognition method provided in an embodiment of the present invention, and as shown in fig. 4, after an image to be recognized is acquired, detection preprocessing needs to be performed on the image to be recognized. The detection preprocessing is used for adjusting the size, the format and the like of the image to be recognized.
It can be understood that the image to be recognized is processed to meet the preset requirement through detection preprocessing, so that subsequent recognition of the image to be recognized is facilitated, and the recognition effect is ensured.
S202, down-sampling the processed image to be identified from 256 × 256 to 8 × 8 to obtain a multilayer feature map (the first feature maps corresponding to at least two resolutions respectively).
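The resolution schedule from 256 × 256 down to 8 × 8 can be sketched as repeated halving (a sketch of the layer sizes only; the actual features are produced by the detection network, and the halving factor is an assumption):

```python
# Sketch of the multilayer feature-map resolutions obtained by repeatedly
# halving the side length from the 256x256 input down to 8x8.
def pyramid_resolutions(start=256, stop=8):
    sizes = []
    side = start
    while side >= stop:
        sizes.append(side)
        side //= 2
    return sizes

print(pyramid_resolutions())  # [256, 128, 64, 32, 16, 8]
```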
In some embodiments of the invention, the feature map is downsampled by an anchor-based detection network (the preset detection network). As shown in fig. 4, the processed image to be recognized is input into the detection network for trademark recognition; when the detection network recognizes the trademark, the recognition effect is ensured by a data set (LogoDet) used by the detection network.
In some embodiments of the invention, the data set is obtained by capturing individual images offline or online to obtain image samples to be recognized; the trademarks in the image samples are labeled and then classified according to the trademarks, and finally the image samples are input into the detection network for training. When the training of the detection network is completed, the data set is generated in the detection network.
It can be understood that the training of the data set in the detection network is completed through the collected to-be-identified image sample containing the trademark, so that the pertinence of the data set to trademark identification can be ensured, and the identification accuracy of the detection network to the trademark is ensured.
And S203, adding the characteristics of the multilayer characteristic diagrams bit by bit to obtain a second characteristic diagram.
In some embodiments of the invention, the detection network performs feature fusion by adding features of the multi-layer feature map bit by bit.
It can be understood that the position information of the bottom layer and the semantic information of the high layer in the multi-layer feature map can be effectively combined through feature fusion.
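The "bit-by-bit" (element-wise) addition used for feature fusion can be sketched on plain nested lists (upsampling by nearest-neighbour repetition is an assumed choice here; real detection networks typically use learned or interpolated upsampling):

```python
# Minimal sketch of element-wise feature fusion: a lower-resolution map is
# upsampled by nearest-neighbour repetition to match the higher-resolution
# map, then the two maps are added element by element.
def upsample2x(feature):
    """Nearest-neighbour 2x upsampling of a 2-D list-of-lists feature map."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in (0, 1)]  # repeat each value twice
        out.append(wide)
        out.append(list(wide))                   # repeat each row twice
    return out

def fuse(high_res, low_res):
    """Element-wise addition of a high-res map and an upsampled low-res map."""
    up = upsample2x(low_res)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(high_res, up)]

print(fuse([[1, 2], [3, 4]], [[10]]))  # [[11, 12], [13, 14]]
```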
S204, detecting target objects of different sizes and setting anchors (candidate frames) through the multi-scale method, using the sub-features with different resolutions in the second feature map.
In some embodiments of the invention, the detection network handles target objects of different sizes with sub-features of different resolutions, for example: detecting a larger target object by extracting the sub-features with a resolution of 8 × 8 in the second feature map, and detecting a smaller target object by extracting the sub-features with a resolution of 64 × 64 in the second feature map.
It can be understood that this effectively simulates the size changes caused by the distance between the camera and the trademark when the client captures the image to be recognized in real life.
S205, obtaining the target anchor and the recognition result through NMS.
In some embodiments of the present invention, as shown in fig. 4, a target anchor (detection box) can be obtained by screening the anchors through NMS; in actual use, the required recognition result, such as the type of the trademark (category information) and the area ratio (area ratio information), can be obtained through the target anchor.
It can be understood that, in this way, not only the type of the trademark can be recognized, but also the area information of the trademark can be obtained as needed, and the use scene of the commodity recognition in the activity can be enlarged.
And S206, outputting the target scheme according to the recognition result.
In some embodiments of the present invention, as shown in fig. 4, after the recognition result is obtained, a corresponding target scheme may be generated according to the recognition result and output.
It can be understood that the target activity is customized and generated based on the recognition result, and the purposes of improving the brand influence of the commodity, increasing the stay time of the user and promoting the commodity sales can be achieved.
Fig. 5 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present invention. As shown in fig. 5, the embodiment of the present invention provides an image recognition apparatus 5, which includes a down-sampling unit 51, a fusion unit 52, a detection unit 53, a screening unit 54, and a recognition unit 55; wherein:
the down-sampling unit 51 is configured to perform multi-resolution down-sampling on an image to be identified, so as to obtain first feature maps corresponding to at least two resolutions respectively;
the fusion unit 52 is configured to perform feature fusion on the first feature maps corresponding to the at least two resolutions, so as to obtain a second feature map;
the detection unit 53 is configured to perform multi-scale target detection on the second feature map, so as to obtain a preset number of candidate frames;
the screening unit 54 is configured to screen the preset number of candidate frames to obtain target candidate frames; the target candidate box characterizes a region containing a target object in the second feature map;
the identifying unit 55 is configured to identify the target object corresponding to the target candidate frame, so as to obtain an identification result.
In some embodiments of the present invention, the down-sampling unit 51 is further configured to down-sample the image to be identified according to a preset resolution, so as to obtain the at least two layers of first feature maps.
In some embodiments of the present invention, the downsampling unit 51 is further configured to determine a downsampling scaling factor based on the side length of the image to be identified; the down-sampling proportional coefficient is a common divisor of the side length of the image to be identified; and performing downsampling of at least two resolutions on the image to be identified according to the downsampling proportion coefficient to obtain the first feature maps corresponding to the at least two resolutions respectively.
In some embodiments of the present invention, the detecting unit 53 is further configured to detect, through sliding windows, sub-features of each decomposition layer in the second feature map until a preset number of sliding windows detect target sub-features, so as to implement multi-scale target detection; and mapping the target sub-features to the area of the image to be identified to generate corresponding candidate frames, so as to obtain the preset number of candidate frames.
In some embodiments of the present invention, the identifying unit 55 is further configured to identify a target object corresponding to the target candidate frame, so as to obtain location information of the target candidate frame; and obtaining the area ratio information and the position information in the identification result according to the position information and the size information of the image to be identified.
In some embodiments of the present invention, the identifying unit 55 is further configured to identify a target object corresponding to the target candidate box, so as to obtain a predicted classification tag of the target object; and matching the predicted classification label with a preset classification label to obtain the quantity information and the category information in the identification result.
In some embodiments of the present invention, the apparatus further includes a display unit 56, configured to display the item activity information corresponding to one target object when the quantity information is the one target object.
In some embodiments of the invention, the apparatus further comprises a receiving unit 57, wherein,
the receiving unit 57 is configured to receive an activity information obtaining request when the quantity information is at least two target objects, where the activity information obtaining request carries any one of the at least two target objects;
the display unit 56 is configured to respond to the activity information obtaining request, and display corresponding item activity information according to the any one target object.
In some embodiments of the present invention, the receiving unit 57 is further configured to receive a product tracking request, where the product tracking request carries the identification result;
the display unit 56 is further configured to, in response to the product tracking request, obtain corresponding tracking information based on the identification result, and display prompt information according to the tracking information and the area ratio information in the identification result.
Fig. 6 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present invention. As shown in fig. 6, the image recognition device 6 according to an embodiment of the present invention includes: a processor 61, a memory 62, and a communication bus 64. The memory 62 communicates with the processor 61 through the communication bus 64 and stores one or more programs executable by the processor 61; when the one or more programs are executed, the processor 61 performs the image recognition method according to the embodiment of the invention. Specifically, the image recognition device 6 further includes a communication component 63 for data transmission, and at least one processor 61 is provided.
In the embodiment of the present invention, the various components in the image recognition device 6 are coupled together by the bus 64. It can be understood that communication among these components is achieved through the bus 64. In addition to a data bus, the bus 64 includes a power bus, a control bus, and a status signal bus; however, for clarity of illustration, the various buses are all labeled as the bus 64 in fig. 6.
Embodiments of the present invention provide a computer-readable storage medium, which stores executable instructions for causing a processor to implement an image recognition method according to an embodiment of the present invention when the processor executes the executable instructions.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (12)

1. An image recognition method, comprising:
performing multi-resolution down-sampling on an image to be identified so as to obtain first feature maps corresponding to at least two resolutions respectively;
performing feature fusion on the first feature maps corresponding to the at least two resolutions respectively to obtain a second feature map;
performing multi-scale target detection on the second feature map so as to obtain a preset number of candidate frames;
screening the preset number of candidate frames to obtain target candidate frames; the target candidate box characterizes a region containing a target object in the second feature map;
and identifying the target object corresponding to the target candidate frame so as to obtain an identification result.
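For illustration only, and not as a limitation of the claims, the five steps of claim 1 can be sketched as a short pipeline. Every function below is a hypothetical stand-in (average-pool down-sampling, upsample-and-average fusion), and the detection, screening, and identification steps are elided:

```python
# Minimal runnable sketch of steps 1-2 of claim 1; all implementations are
# illustrative stand-ins, not the patent's actual method.

def downsample(image, factor):
    # Average-pool an H x W grid (list of lists) by an integer factor.
    h, w = len(image), len(image[0])
    return [[sum(image[i * factor + di][j * factor + dj]
                 for di in range(factor) for dj in range(factor)) / factor ** 2
             for j in range(w // factor)]
            for i in range(h // factor)]

def fuse(maps):
    # Fuse first feature maps by upsampling each to the largest size and averaging.
    target = max(len(m) for m in maps)
    def up(m):
        s = target // len(m)
        return [[m[i // s][j // s] for j in range(target)] for i in range(target)]
    ups = [up(m) for m in maps]
    n = len(ups)
    return [[sum(u[i][j] for u in ups) / n for j in range(target)]
            for i in range(target)]

def recognize(image):
    first_maps = [downsample(image, f) for f in (2, 4)]  # step 1: multi-resolution down-sampling
    second_map = fuse(first_maps)                        # step 2: feature fusion
    return second_map                                    # steps 3-5 (detect, screen, identify) elided

image = [[float(i + j) for j in range(8)] for i in range(8)]
fused = recognize(image)
print(len(fused), len(fused[0]))  # fused second feature map is 4 x 4
```

An 8 x 8 input down-sampled at factors 2 and 4 yields 4 x 4 and 2 x 2 first feature maps, which fuse into a single 4 x 4 second feature map.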
2. The method of claim 1, wherein the recognition result comprises category information, position information, area ratio information, and quantity information.
3. The method according to claim 1, wherein the performing multi-resolution down-sampling on the image to be identified to obtain the first feature maps corresponding to at least two resolutions respectively comprises:
determining a down-sampling proportional coefficient based on the side length of the image to be identified; the down-sampling proportional coefficient is a common divisor of the side length of the image to be identified;
and performing downsampling of at least two resolutions on the image to be identified according to the downsampling scale factor to obtain the first feature maps corresponding to the at least two resolutions respectively.
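Claim 3's requirement that the scale factor be a common divisor of the image's side lengths guarantees integer dimensions for every down-sampled map. One hypothetical (non-limiting) way to enumerate such factors:

```python
# Illustrative reading of claim 3: choose down-sampling scale factors that
# divide both side lengths evenly, so each first feature map has integer size.
import math

def scale_factors(height, width, count=2):
    g = math.gcd(height, width)
    # Any divisor of gcd(H, W) greater than 1 divides both side lengths.
    divisors = [d for d in range(2, g + 1) if g % d == 0]
    return divisors[:count]

print(scale_factors(480, 640))  # [2, 4]
```

For a 480 x 640 image, gcd(480, 640) = 160, so 2 and 4 are the two smallest valid factors.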
4. The method according to claim 1, wherein the performing multi-scale target detection on the second feature map to obtain a preset number of candidate frames comprises:
detecting the sub-features of each decomposition layer in the second feature map through sliding windows until the target sub-features are detected by a preset number of sliding windows, thereby realizing multi-scale target detection;
and mapping the target sub-features to the area of the image to be identified to generate corresponding candidate frames, so as to obtain the preset number of candidate frames.
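A toy sliding-window detector in the spirit of claim 4, offered only as an illustrative sketch: windows whose mean activation exceeds a threshold stand in for "target sub-features", and each such window is mapped back to a candidate frame in the image via the down-sampling scale. The threshold, window size, and scale are assumptions.

```python
# Hypothetical sliding-window detection over a feature map, mapping hits
# back to candidate frames (x1, y1, x2, y2) in image coordinates.

def sliding_window_candidates(feature_map, win=2, stride=1, thresh=0.5, scale=4, limit=10):
    boxes = []
    h, w = len(feature_map), len(feature_map[0])
    for i in range(0, h - win + 1, stride):
        for j in range(0, w - win + 1, stride):
            mean = sum(feature_map[i + di][j + dj]
                       for di in range(win) for dj in range(win)) / win ** 2
            if mean > thresh:
                # Map window coordinates back to the image via the down-sampling scale.
                boxes.append((j * scale, i * scale, (j + win) * scale, (i + win) * scale))
            if len(boxes) == limit:  # stop at the preset number of candidate frames
                return boxes
    return boxes

fmap = [[0.0, 0.1, 0.2, 0.1],
        [0.1, 0.9, 0.8, 0.2],
        [0.2, 0.8, 0.9, 0.1],
        [0.1, 0.2, 0.1, 0.0]]
print(sliding_window_candidates(fmap))  # [(4, 4, 12, 12)]
```

Here only the central 2 x 2 window clears the threshold, yielding one candidate frame covering the corresponding image region.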
5. The method according to claim 2, wherein the identifying the target object corresponding to the target candidate frame to obtain the identification result comprises:
identifying a target object corresponding to the target candidate frame to obtain position information of the target candidate frame;
and obtaining the area ratio information and the position information in the identification result according to the position information and the size information of the image to be identified.
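The area-ratio item of claim 5 reduces to dividing the candidate frame's area by the image's area. A minimal sketch, assuming an (x1, y1, x2, y2) box convention:

```python
# Illustrative computation of the area-ratio information of the recognition
# result from a target candidate frame's position and the image size.

def area_ratio(box, image_width, image_height):
    x1, y1, x2, y2 = box
    return ((x2 - x1) * (y2 - y1)) / (image_width * image_height)

print(area_ratio((100, 50, 300, 250), 640, 480))  # 200*200 / (640*480) ≈ 0.13
```

A 200 x 200 frame in a 640 x 480 image occupies about 13% of the image area.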
6. The method according to claim 2, wherein the identifying the target object corresponding to the target candidate frame to obtain the identification result comprises:
identifying a target object corresponding to the target candidate frame to obtain a prediction classification label of the target object;
and matching the predicted classification label with a preset classification label to obtain the quantity information and the category information in the identification result.
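A non-limiting sketch of claim 6: each predicted classification label is matched against a preset label set, and the matches are aggregated into the category information and quantity information of the recognition result. All labels here are hypothetical examples.

```python
# Illustrative label matching: keep predictions found in the preset label
# set, then count per-category and total matches.
from collections import Counter

def match_labels(predicted, preset):
    matched = [p for p in predicted if p in preset]
    categories = Counter(matched)          # category information: label -> count
    return dict(categories), len(matched)  # (category info, quantity info)

preset = {"shoe", "bag", "watch"}
predicted = ["shoe", "shoe", "hat", "bag"]
print(match_labels(predicted, preset))  # ({'shoe': 2, 'bag': 1}, 3)
```

The unmatched label ("hat") is discarded, so the quantity information counts only objects whose predicted label exists in the preset set.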
7. The method of claim 2, 5 or 6, further comprising:
and when the quantity information indicates one target object, displaying the article activity information corresponding to the target object.
8. The method of claim 2, 5 or 6, further comprising:
when the quantity information indicates at least two target objects, receiving an activity information acquisition request, wherein the activity information acquisition request carries any one of the at least two target objects;
and in response to the activity information acquisition request, displaying the corresponding article activity information according to that target object.
9. The method of claim 2, 5 or 6, further comprising:
receiving a commodity tracking request, wherein the commodity tracking request carries the identification result;
responding to the commodity tracking request, obtaining corresponding tracking information based on the identification result, and displaying prompt information according to the tracking information and the area ratio information in the identification result.
10. An image recognition device, characterized by comprising a down-sampling unit, a fusion unit, a detection unit, a screening unit, and a recognition unit; wherein:
the down-sampling unit is used for carrying out multi-resolution down-sampling on the image to be identified so as to obtain first feature maps corresponding to at least two resolutions respectively;
the fusion unit is used for performing feature fusion on the first feature maps corresponding to the at least two resolutions respectively to obtain a second feature map;
the detection unit is used for carrying out multi-scale target detection on the second feature map so as to obtain a preset number of candidate frames;
the screening unit is used for screening the candidate frames with the preset number to obtain target candidate frames; the target candidate box characterizes a region containing a target object in the second feature map;
and the identification unit is used for identifying the target object corresponding to the target candidate frame so as to obtain an identification result.
11. An image recognition apparatus, comprising:
a memory for storing executable instructions;
a processor for implementing the method of any one of claims 1 to 9 when executing executable instructions stored in the memory.
12. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to implement the method of any one of claims 1 to 9.
CN202210331824.1A 2022-03-30 2022-03-30 Image identification method and device and computer readable storage medium Pending CN114693918A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210331824.1A CN114693918A (en) 2022-03-30 2022-03-30 Image identification method and device and computer readable storage medium


Publications (1)

Publication Number Publication Date
CN114693918A true CN114693918A (en) 2022-07-01

Family

ID=82141339



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination