CN113344055B - Image recognition method, device, electronic equipment and medium

Info

Publication number: CN113344055B (other versions: CN113344055A, Chinese)
Application number: CN202110596961.3A
Authority: CN (China)
Inventor: 安容巧
Assignee (original and current): Beijing Baidu Netcom Science and Technology Co Ltd
Legal status: Active (granted)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures


Abstract

The disclosure provides an image recognition method, apparatus, device, medium and program product, relating to the field of artificial intelligence, in particular to computer vision and deep learning, and applicable to image recognition scenarios. The image recognition method comprises the following steps: recognizing an image to be identified to obtain position information of a target object in the image to be identified and a first label for the target object; based on the position information, comparing the similarity between the region image in which the target object is located in the image to be identified and each of a plurality of reference images to obtain a comparison result, wherein each reference image comprises a reference object and a second label for the reference object; determining a target image from the plurality of reference images based on the comparison result; and determining a target label for the target object based on the first label and the second label of the target image.

Description

Image recognition method, device, electronic equipment and medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of computer vision and deep learning, and can be applied to an image recognition scene.
Background
The related art generally recognizes an image to be identified through an image recognition technique in order to identify objects in that image. However, in some scenes, when the image to be identified is acquired of a plurality of objects placed in a stack, the objects in the image are densely packed, so that it is difficult for related-art image recognition techniques to accurately identify them.
Disclosure of Invention
The present disclosure provides an image recognition method, apparatus, electronic device, storage medium, and program product.
According to an aspect of the present disclosure, there is provided an image recognition method including: identifying an image to be identified to obtain the position information of a target object in the image to be identified and a first label aiming at the target object; based on the position information, respectively comparing the similarity of the region image of the target object in the image to be identified and a plurality of reference images to obtain a comparison result, wherein each reference image comprises a reference object and a second label aiming at the reference object; determining a target image from the plurality of reference images based on the comparison result; a target label for the target object is determined based on the first label and a second label of the target image.
According to another aspect of the present disclosure, there is provided an image recognition apparatus including: the device comprises an identification module, a comparison module, a first determination module and a second determination module. The identification module is used for identifying the image to be identified to obtain the position information of the target object in the image to be identified and a first label aiming at the target object; the comparison module is used for comparing the similarity between the region image of the target object in the image to be identified and a plurality of reference images based on the position information to obtain a comparison result, wherein each reference image comprises a reference object and a second label aiming at the reference object; a first determining module configured to determine a target image from the plurality of reference images based on the comparison result; and the second determining module is used for determining a target label aiming at the target object based on the first label and the second label of the target image.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image recognition method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the above-described image recognition method.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the image recognition method described above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates a system architecture of an image recognition method and apparatus according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of an image recognition method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of an image recognition method according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of area calculation according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a schematic diagram of image recognition according to another embodiment of the present disclosure;
FIG. 6 schematically illustrates a schematic diagram of a quantity calculation according to an embodiment of the disclosure;
FIG. 7 schematically illustrates a block diagram of an image recognition device according to an embodiment of the present disclosure; and
FIG. 8 is a block diagram of an electronic device for performing image recognition according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression like "at least one of A, B and C" is used, it should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
The embodiment of the disclosure provides an image recognition method. The image recognition method comprises the following steps: recognizing the image to be identified to obtain the position information of the target object in the image to be identified and the first label for the target object; then, based on the position information, comparing the similarity between the region image in which the target object is located in the image to be identified and each of a plurality of reference images to obtain a comparison result, wherein each reference image comprises a reference object and a second label for the reference object; determining the target image from the plurality of reference images based on the comparison result; and then determining the target label for the target object based on the first label and the second label of the target image.
Fig. 1 schematically illustrates a system architecture of an image recognition method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied, intended to help those skilled in the art understand the technical content of the present disclosure; it does not mean that embodiments of the present disclosure cannot be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include clients 101, 102, 103, a network 104, and a server 105. The network 104 is the medium used to provide communication links between the clients 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 105 through the network 104 using clients 101, 102, 103 to receive or send messages, etc. Various communication client applications may be installed on clients 101, 102, 103, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, and the like (by way of example only).
The clients 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like. The clients 101, 102, 103 of the disclosed embodiments may, for example, run applications.
The server 105 may be a server providing various services, such as a background management server (by way of example only) that provides support for websites browsed by users using clients 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the client. In addition, the server 105 may also be a cloud server, i.e. the server 105 has cloud computing functionality.
It should be noted that the image recognition method provided by the embodiment of the present disclosure may be performed by the server 105. Accordingly, the image recognition apparatus provided by the embodiments of the present disclosure may be provided in the server 105. The image recognition method provided by the embodiments of the present disclosure may also be performed by a server or server cluster that is different from the server 105 and that is capable of communicating with the clients 101, 102, 103 and/or the server 105. Accordingly, the image recognition apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the clients 101, 102, 103 and/or the server 105.
In one example, the server 105 may obtain an image to be identified from a user of the client 101, 102, 103 over the network 104 and identify the image to be identified.
It should be understood that the number of clients, networks, and servers in fig. 1 is merely illustrative. There may be any number of clients, networks, and servers, as desired for implementation.
The embodiment of the present disclosure provides an image recognition method, and an image recognition method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2 to 6 in conjunction with the system architecture of fig. 1. The image recognition method of the embodiment of the present disclosure may be performed by the server 105 shown in fig. 1, for example.
Fig. 2 schematically illustrates a flowchart of an image recognition method according to an embodiment of the present disclosure.
As shown in fig. 2, the image recognition method 200 of the embodiment of the present disclosure may include, for example, operations S210 to S240.
In operation S210, the image to be identified is identified, and location information of the target object in the image to be identified and a first tag for the target object are obtained.
In operation S220, based on the position information, similarity comparison is performed on the region image in which the target object is located in the image to be identified and the plurality of reference images, respectively, to obtain a comparison result.
In operation S230, a target image is determined from among the plurality of reference images based on the comparison result.
In operation S240, a target tag for the target object is determined based on the first tag and the second tag of the target image.
Illustratively, the image to be identified is recognized, for example, by a target detection model, yielding at least one target object in the image to be identified, together with the location information of each target object in the image and the first tag of that target object. The first tag includes, for example, an object name, an object attribute, and the like of the target object. In an example, the target object may be an item of merchandise, and the first tag includes the item name, item attributes, and the like.
For example, a plurality of reference images are stored in advance in an image library, each reference image including a reference object and a second tag for that reference object. The second tag includes, for example, the object name, object attributes, and the like of the reference object, which may likewise be an item of merchandise.
After the first label of the target object in the image to be identified is obtained through the target detection model, the final label of the target object is determined through an image retrieval mode. For example, the region image in which the target object is located is compared with each reference image to obtain the similarity between the region image and each reference image, and the reference image with the higher similarity is taken as the target image, the target image carrying its second label. Next, the first label of the target object and the second label of the target image are jointly considered to determine the final target label for the target object, as sketched below.
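The following is a minimal, self-contained sketch of this detection-plus-retrieval flow (operations S210 to S240). It is illustrative only: the detection results and image features are assumed to be given, the cosine similarity and the simple fusion rule are stand-ins for whatever the actual system uses, and all names are assumptions rather than the patent's implementation.

```python
# Illustrative sketch of the detection + retrieval pipeline described above.
# Detections and features are passed in; all names/rules are assumptions.
from dataclasses import dataclass

@dataclass
class Reference:
    feature: list          # feature vector of the reference image
    second_label: str      # label annotated for the reference object

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def recognize(detections, references):
    """detections: list of (box, first_label, confidence, region_feature)."""
    results = []
    for box, first_label, confidence, region_feature in detections:
        # S220: similarity between the region image and every reference image
        scored = sorted(((cosine(region_feature, r.feature), r.second_label)
                         for r in references), reverse=True)
        best_sim, second_label = scored[0]          # S230, simplified to top-1
        # S240: simple fusion rule (assumption): trust whichever source is stronger
        target_label = first_label if confidence >= best_sim else second_label
        results.append((box, target_label))
    return results
```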
In the embodiment of the disclosure, the first label obtained by recognizing the image to be identified with the target detection model alone may be inaccurate. The embodiment of the disclosure therefore combines the target detection model with an image retrieval mode to determine the final target label of the target object, thereby improving label accuracy.
In addition, because a large number of training samples are required for training the target detection model, each training sample needs to be labeled manually, and therefore the labeling cost is high. When the training samples are newly added, the newly added training samples are labeled and the target detection model is retrained, so that the iteration cost of the model is high. Therefore, in order to improve the speed and accuracy of acquiring the object tag, the disclosed embodiments acquire the object tag by combining the target detection model and the image retrieval mode. The image retrieval mode needs to annotate the labels of the reference images in the image library in advance, and the quantity of the reference images in the image library is far smaller than that of the training samples, so that the annotation cost of the training samples is greatly reduced.
Fig. 3 schematically illustrates a schematic diagram of an image recognition method according to an embodiment of the present disclosure.
As shown in fig. 3, the image 310 to be recognized includes a plurality of objects, for example objects A to F. Target detection is performed on the image 310 to be identified using the target detection model, yielding the position information of a target object and a first label. Taking object B as the target object as an example, the first label includes, for example, the object name and object attributes of object B. If object B is a beverage, the object name is, for example, "a certain beverage" and the object attribute is, for example, "6 bottles, 250 ml per bottle".
The image library 320 stores, for example, a plurality of reference images, which are images of a plurality of objects. For example, for object A, images of object A are acquired from different angles and stored in the image library 320, and the second labels of the plurality of reference images acquired for object A are, for example, all the same. For example, the second labels of the reference images of object A each include the object name of object A, for example "a certain mineral water", and an object attribute of, for example, "12 bottles, 500 milliliters per bottle". Images of object B and object C are likewise acquired from different angles.
Illustratively, after the target object (object B) in the image to be recognized 310 is recognized, the region image 311 in which the target object (object B) is located is determined based on the positional information of the target object (object B) in the image to be recognized 310. A similarity comparison is performed between the region image 311 and each of the reference images in the image library 320 to obtain a comparison result. Then, based on the comparison result, the N reference images whose similarity with the region image ranks highest among the plurality of reference images are determined as N candidate images, where N is an integer greater than 1. Illustratively, the N reference images may be images of N different objects, or some of the N reference images may be images of the same object.
After determining the N candidate images, the target image may be further determined from the N candidate images based on the similarity between the region image 311 and each of the N candidate images. The determination of the target image from the N candidate images includes, for example, at least the following three ways.
In the first way, a candidate image among the N candidate images whose similarity with the region image 311 is greater than a first threshold is determined as the target image. The first threshold may be set relatively high, for example 0.7 or 0.8. In an example, when several of the N candidate images have a similarity with the region image 311 greater than the first threshold, the candidate image with the greatest similarity among them may be determined as the target image.
In a second way, when the similarity between each of the N candidate images and the region image is smaller than a second threshold, the candidate images whose similarity with the region image ranks in the top n are determined from the N candidate images, where n is smaller than N and the second threshold is smaller than or equal to the first threshold. A target image is then determined from the n candidate images based on the number of occurrences of the second label of each of the n candidate images, where the number of occurrences is the number of reference images in the image library carrying that second label.

That is, when the similarity between each of the N candidate images and the region image is low, the candidate images whose similarity ranks in the top 2 (taking n = 2 as an example) are selected from the N candidate images, each of the 2 candidate images having a second label. The 2 candidate images include, for example, a first candidate image and a second candidate image, and the first candidate image is selected as the target image when the number of occurrences of its second label in the image library 320 is greater than the number of occurrences of the second label of the second candidate image in the image library 320. That is, the number of reference images in the image library 320 having the same second label as the first candidate image is a first number, the number of reference images having the same second label as the second candidate image is a second number, and the first number is greater than the second number.
In the third way, when the similarity between each of the N candidate images and the region image 311 is smaller than a third threshold, the candidate image with the greatest similarity to the region image 311 is selected from the N candidate images as the target image, where the third threshold is smaller than the second threshold. That is, when the similarity between each of the N candidate images and the region image is too low, the candidate image with the greatest similarity is simply selected from the N candidate images as the target image.
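A compact sketch of these three selection rules follows. The concrete threshold values (0.7, 0.5, 0.3) and n = 2 are illustrative assumptions, not values stated in the text; the behaviour for similarities falling between the first and second thresholds is likewise an assumed fallback.

```python
# Sketch of the three ways of picking the target image from the N candidates.
# FIRST > SECOND > THIRD; all three values and n = 2 are assumptions.
FIRST, SECOND, THIRD = 0.7, 0.5, 0.3

def pick_target_image(candidates, label_counts):
    """candidates: (similarity, second_label) pairs, sorted highest similarity first;
    label_counts: dict mapping a second label to how many reference images carry it."""
    best = candidates[0]
    if best[0] > FIRST:                  # way 1: a clearly similar candidate exists
        return best
    if best[0] < THIRD:                  # way 3: all candidates are very dissimilar
        return best                      # fall back to the single most similar one
    if best[0] < SECOND:                 # way 2: moderately low similarity everywhere
        top_n = candidates[:2]           # n = 2, as in the example above
        # prefer the label that appears on more reference images in the library
        return max(top_n, key=lambda c: label_counts.get(c[1], 0))
    return best                          # between SECOND and FIRST: keep top-1 (assumption)
```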
After determining the target image, a target label for the target object may be determined based on the first label of the target object and the second label of the target image. Specific examples follow.
In an example, the first label p_multi-det has a confidence c_det, and the second label p_search has a similarity s_search (the similarity between the region image and the target image). When the first label p_multi-det and the second label p_search satisfy a first condition, one of p_multi-det and p_search is selected as the target label.
For example, if the target image is determined using the first or second way described above and the first label p_multi-det and the second label p_search are the same, it is determined that p_multi-det and p_search satisfy the first condition. In this case, p_multi-det (equivalently, p_search) is selected as the target label. In addition, the confidence score for the target object can be updated based on the confidence c_det and the similarity s_search, as shown in formula (1), where α = 0.4.
For example, if the target image is determined using the third way described above, it is determined that the first label p_multi-det and the second label p_search satisfy the first condition. When, for example, the confidence c_det is not smaller than the similarity s_search, the first label p_multi-det is taken as the target label; when c_det is smaller than s_search, the second label p_search is taken as the target label for the target object.
In another example, if the target image is determined using the first or second way described above and the first label p_multi-det and the second label p_search differ, it is determined that p_multi-det and p_search satisfy a second condition. In this case, the difference between the confidence c_det of p_multi-det and the similarity s_search of p_search is determined, and the target label is then selected from p_multi-det, p_search and the adjacent labels based on this difference, where the adjacent labels are the labels of other objects adjacent to the target object in the image to be identified.
For example, when the difference falls within a first range, p_search may be determined as the target label for the target object.

For example, when the difference falls within a second range, p_multi-det may be determined as the target label for the target object.

For example, when the difference falls within a third range, this indicates that the reliability of the second label p_search is low. In this case, the labels of the other objects adjacent to the target object are counted in the image to be identified, and the label with the largest number of occurrences among those labels is taken as the target label for the target object.
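A sketch of this label-fusion decision is shown below. The exact ranges on the difference correspond to formulas that are not reproduced in this text, so the cutoffs DELTA_LO and DELTA_HI, and the mapping of ranges to outcomes, are purely assumed for illustration.

```python
# Sketch of the fusion of the detection label and the retrieval label.
# DELTA_LO / DELTA_HI and the range-to-outcome mapping are assumptions.
from collections import Counter

DELTA_LO, DELTA_HI = -0.2, 0.2   # assumed cutoffs on (confidence - similarity)

def fuse_labels(p_det, conf_det, p_search, sim_search, neighbor_labels):
    if p_det == p_search:                         # first condition: labels agree
        return p_det
    diff = conf_det - sim_search                  # second condition: labels differ
    if diff < DELTA_LO:                           # retrieval clearly more reliable (assumed)
        return p_search
    if diff <= DELTA_HI:                          # comparable reliability (assumed)
        return p_det
    # diff > DELTA_HI: the retrieved label is unreliable; vote with adjacent labels
    if neighbor_labels:
        return Counter(neighbor_labels).most_common(1)[0][0]
    return p_det
```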
It will be appreciated that embodiments of the present disclosure first determine a target image from the candidate images based on image similarity, and then decide the final target label for the target object based on the first label of the target object and the second label of the target image. In this way, embodiments of the disclosure at least mitigate the case where the first label produced by the target detection model is inaccurate, thereby improving the label accuracy for the target object.
By way of example, the embodiment of the disclosure can identify the image to be identified through the target detection model, so as to obtain the position information and the first label of the target object in the image to be identified. The target detection model includes, for example, but is not limited to, the two-stage Faster_RCNN detection model.
The object detection model includes an RPN (Region Proposal Network) sub-network. By inputting the image features of the image to be identified into the RPN, object detection boxes, also known as proposals (suggestion boxes), are output, as described in detail below.
Typically, a number of initial detection boxes are preset. The initial detection boxes are associated with the scale parameter of the model, which is typically 8, and the anchor stride parameters of the model are typically [4, 8, 16, 32, 64, 128]; the anchor dimensions of the model are therefore [32², 64², 128², 256², 512²], and the anchor size is related to the size of the initial detection boxes. Since the image to be identified in the embodiment of the present disclosure contains a plurality of target objects and the distance between adjacent target objects is small (typically smaller than a preset distance), i.e. the image to be identified is an image of a dense scene, setting the scale parameter to 8 makes the anchors, and hence the initial detection boxes, large. If the initial detection boxes are large, objects of small size are missed during target detection on the image to be identified, and the target recall rate is low.
In view of this, embodiments of the present disclosure may generate a plurality of initial detection boxes based on a preset size, where the preset size is associated with the preset distance. That is, a smaller preset size yields smaller initial detection boxes, which avoids missing objects during recognition. For example, the scale parameter may be set to 4, in which case the anchor sizes of the model are, for example, [16², 32², 64², 128², 256²]. The anchor size is associated with the size of the initial detection boxes, so reducing the anchor size reduces the size of the initial detection boxes and makes them suitable for dense scenes.
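The relationship between the scale parameter and the anchor sizes can be illustrated as below. The code assumes five stride levels [4, 8, 16, 32, 64], chosen to match the five anchor sizes listed above; this stride list is an assumption for illustration only.

```python
# Sketch: anchor base sizes follow from stride * scale for each FPN level.
def anchor_sizes(strides, scale):
    return [stride * scale for stride in strides]

strides = [4, 8, 16, 32, 64]          # assumed feature-map strides of the FPN levels
print(anchor_sizes(strides, 8))       # [32, 64, 128, 256, 512] -> large anchors
print(anchor_sizes(strides, 4))       # [16, 32, 64, 128, 256]  -> smaller anchors for dense scenes
```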
After determining the plurality of initial detection frames, the plurality of initial detection frames may be updated to obtain a plurality of target detection frames. Then, based on the positions of the plurality of target detection frames in the image to be recognized, the position information of the target object is determined, and for example, the position of the target detection frame in the image to be recognized may be taken as the position information of the target object.
Illustratively, updating the plurality of initial detection boxes to obtain a plurality of target detection boxes includes: adjusting the plurality of initial detection boxes based on the image features of the image to be identified to obtain a plurality of adjusted detection boxes, each adjusted detection box having a corresponding label and a confidence for that label, the corresponding label representing the label of the object located in the adjusted detection box. Adjusting an initial detection box includes changing its shape, changing its size, moving it, deleting it, and the like. Each adjusted detection box includes, for example, a plurality of initial labels and a confidence corresponding to each initial label.
After obtaining the plurality of adjusted detection boxes, the target detection boxes may be obtained in one of two ways using a non-maximum suppression technique, for example MultiClassSoftNMS.
One way is to take any two adjusted detection boxes, among the plurality of adjusted detection boxes, whose highest-confidence initial labels differ and whose intersection-over-union (iou) is greater than a threshold, where the intersection-over-union is the ratio of the intersection to the union of the two detection boxes. The confidence of one of the two adjusted detection boxes is then set to 0. For example, the two adjusted detection boxes include detection box A and detection box B, the maximum confidence of detection box A is a and the maximum confidence of detection box B is b; when b is smaller than a and the iou of detection boxes A and B is greater than the threshold, the confidence of detection box B is set to 0, for example b may be set to 0, or the confidence corresponding to each of the plurality of initial labels of detection box B may be set to 0. However, because detection boxes in a dense scene are generally close to one another, setting confidences to 0 in this way (which amounts to direct deletion) removes too many detection boxes, so that targets are missed and the target recall rate is lower.
Alternatively, the confidence may not be set to 0 directly (i.e., the detection box is not deleted directly). Instead, for two adjacent detection boxes with different labels among the plurality of adjusted detection boxes (detection box A and detection box B), the confidence of at least one of the two is adjusted based on the overlap (intersection-over-union) between them.
For example, for the above detection boxes A and B, let the intersection-over-union of detection boxes A and B be iou and the confidence of detection box B be s_i. The confidence s_i of detection box B is then adjusted using Gaussian weighting as shown in formula (2); for example, the confidence of detection box B is reduced. Formula (2) uses a Gaussian exponential function in which μ is the expected value, μ = iou, and σ is the standard deviation.
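Since formula (2) itself is not reproduced in this text, the sketch below uses a standard Gaussian soft-NMS decay as a stand-in; the exponent form and σ = 0.5 are assumptions, not necessarily the patent's exact expression.

```python
# Sketch of Gaussian-weighted confidence decay for two overlapping boxes
# with different labels (soft-NMS style). Exponent form and sigma are assumed.
import math

def iou(a, b):
    """a, b: (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def decay_confidence(s_i, overlap, sigma=0.5):
    # higher overlap -> stronger reduction of the neighbouring box's confidence
    return s_i * math.exp(-(overlap ** 2) / sigma)

box_a, box_b = (0, 0, 10, 10), (2, 0, 12, 10)
print(decay_confidence(0.9, iou(box_a, box_b)))   # reduced confidence for box B
```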
After the confidences have been adjusted, a plurality of target detection boxes may be selected, based on a preset confidence condition, from the adjusted detection boxes obtained after the adjustment; for example, the adjusted detection boxes with the highest confidences are selected as the target detection boxes, and the positions of the target detection boxes are taken as the position information of the target objects.
In addition, when the image to be recognized is recognized using the two-stage Faster_RCNN detection model, the model parameters of the two-stage Faster_RCNN model may be smoothed based on an exponential moving average (Exponential Moving Average) method. An exponential moving average obtains the current value of a parameter as a weighted average of the newly updated value and the previously averaged value, placing higher weight on recent parameters. That is, the model parameters are iteratively averaged in this way to increase the robustness of the model.
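A minimal sketch of such parameter averaging is given below; the decay value 0.999 and the scalar parameters are assumptions for illustration.

```python
# Sketch of exponential-moving-average (EMA) smoothing of model parameters:
# the maintained "shadow" value weights recent parameters more heavily.
def ema_update(shadow, current, decay=0.999):
    """shadow, current: dicts mapping parameter name -> value (scalars here for brevity)."""
    return {name: decay * shadow[name] + (1.0 - decay) * current[name]
            for name in shadow}

shadow = {"w": 1.00}
for step_value in (1.10, 1.20, 1.15):        # parameter values after successive updates
    shadow = ema_update(shadow, {"w": step_value})
print(shadow)                                 # smoothed parameter used for evaluation
```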
In addition, when the image to be recognized is recognized using the two-stage Faster_RCNN detection model, the feature pyramid network (Feature Pyramid Networks, FPN) features of the image to be recognized are processed so as to reduce time consumption. For example, the feature pyramid includes 3 or 4 levels (channels), each level having a feature size of, for example, 256; the feature size is reduced from 256 to 64, while the input resolution is reduced from 800 to 640. By reducing the feature size and the input resolution, the computation time of the model can be reduced while maintaining the accuracy of the model.
In an embodiment of the present disclosure, after the target detection model recognizes the image to be recognized, a plurality of detection frames and a first label for each detection frame are output. The position of each detection frame in the image to be identified is the position information of the target object, and the first label of each detection frame is the first label of the target object. And then, obtaining a second label of each target object by utilizing an image retrieval mode, and obtaining the target label of the target object based on the first label and the second label.
For a plurality of target objects identified from the image to be identified, target type objects are determined from the plurality of target objects based on the target label of each target object. For example, when the plurality of target objects includes commodities from a plurality of merchants, the target type objects are the target objects (commodities) of a certain merchant. Alternatively, when the plurality of target objects includes objects of a plurality of types, for example beverages, mineral water, milk and the like, the target type objects are the objects of a certain type, for example the objects of the beverage type are taken as the target type objects.
Then, based on the position information of the target type object in the image to be recognized and the size information of the target type object, the area of the target area indicated by the target type object is determined.
Fig. 4 schematically illustrates a schematic diagram of area calculation according to an embodiment of the present disclosure.
As shown in fig. 4, the image 410 to be recognized is acquired of a plurality of objects 400 that are actually placed in a stack; the plurality of objects 400 includes, for example, 24 objects. The image 410 to be identified is recognized using the target detection model to obtain a plurality of detection boxes, shown with dotted lines, whose positions are the position information of the plurality of target objects A to L corresponding one-to-one to the detection boxes. The target objects A to L are, for example, all target type objects. An object of an embodiment of the present disclosure is to estimate the floor area of the plurality of stacked objects 400 based on the identified position information of the target type objects A to L and the actual size of each target type object.
Illustratively, a plurality of reference objects, for example the target type objects located at the bottom layer, are determined from the target type objects based on the position information of the target type objects in the image to be recognized. For example, the plurality of reference objects includes the target type objects E, F, G, H, K and L.
Then, a first reference object, a second reference object, and a third reference object are determined from the plurality of reference objects, the second reference object and the third reference object being on both sides of the first reference object. For example, a first reference object is located at the turning point, e.g., the first reference object is the target object H, a second reference object is e.g., the target object E, and a third reference object is e.g., the target object L.
The direction from the first reference object to the second reference object is determined as a first direction P, and the direction from the first reference object to the third reference object is determined as a second direction Q. An included angle between the first direction P and the second direction Q is then determined; if the included angle falls within a preset angle range, the area of the target region indicated by the target type objects is determined based on the position information of the target type objects in the image to be identified and the size information of the target type objects.
For example, the preset angle range is the interval [15°, 165°]. When the included angle between the first direction P and the second direction Q lies within [15°, 165°], the image 410 to be recognized is not a front-on image, and the area of the target region indicated by the target type objects A to L may then be determined based on the position information of the target type objects A to L and the size information of each target type object.
Specifically, the target type objects A to L correspond to a plurality of detection boxes, which form a detection box set D. The position information and confidence of the detection boxes in set D are expressed as formula (3), where M is the number of target type objects A to L and score is the confidence.

D = {xmin, ymin, xmax, ymax, sku_id, score}_i,  i ∈ [1, M]    (3)
For a detection box b_base in set D, its position is denoted b_base = {xmin_base, ymin_base, xmax_base, ymax_base}. Other detection boxes can be found by moving from detection box b_base a certain distance in the horizontal or vertical direction of the image 410 to be recognized.
For example, taking detection box b_base as a starting point and moving a distance w in the horizontal direction gives another detection box b_h, whose position is given by formula (4). When the center distance between detection box b_base and detection box b_h is smaller than a certain threshold and the intersection-over-union (iou) of b_base and b_h is greater than a certain threshold, detection boxes b_base and b_h are determined to be adjacent in the horizontal direction.

b_h = {xmin_base ± w, ymin_base, xmax_base ± w, ymax_base}    (4)
For example, taking detection box b_base as a starting point and moving a distance h in the vertical direction gives another detection box b_v, whose position is given by formula (5). When the center distance between detection box b_base and detection box b_v is smaller than a certain threshold and the intersection-over-union (iou) of b_base and b_v is greater than a certain threshold, detection boxes b_base and b_v are determined to be adjacent in the vertical direction.

b_v = {xmin_base, ymin_base ± h, xmax_base, ymax_base ± h}    (5)
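The adjacency test built from formulas (4) and (5) can be sketched as follows: shift the base box by w (or h) and check whether a real detection lies at that expected position. The center-distance and iou thresholds are assumptions, and treating b_h / b_v as the expected shifted position compared against actual detections is an interpretation of the text above.

```python
# Sketch of the horizontal/vertical adjacency test (formulas (4) and (5)).
def _iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def _center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def is_adjacent(b_base, b_other, w=0.0, h=0.0,
                center_thresh=10.0, iou_thresh=0.3):   # thresholds are assumptions
    """True if b_other lies where b_base shifted by (w, h) is expected to be."""
    expected = (b_base[0] + w, b_base[1] + h, b_base[2] + w, b_base[3] + h)
    (cx1, cy1), (cx2, cy2) = _center(expected), _center(b_other)
    center_dist = ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2) ** 0.5
    return center_dist < center_thresh and _iou(expected, b_other) > iou_thresh
```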
For all detection boxes in the set D, the detection box with the minimum xmin is determined as the left detection box box_left; box_left is, for example, the detection box corresponding to target type object A or the detection box corresponding to target type object E. Then, taking box_left as a starting point and moving downwards in the vertical direction to the lowest layer gives the lower-left detection box b_leftbot; b_leftbot is, for example, the detection box corresponding to target type object E.
Taking detection box b_leftbot as a starting point and moving to the right in the horizontal direction determines further detection boxes; detection box b_leftbot and these detection boxes form, for example, the set D_bot = {d_0, d_1, ..., d_k}, where each of d_0, d_1, ..., d_k belongs to the set D.
Taking each detection box in the set D_bot as a starting point and moving downwards in the vertical direction, the bottommost detection boxes are determined; adding the bottommost detection boxes to the set D_bot gives the set D'_bot.
Because the detection boxes have different sizes, small detection boxes are easily missed when searching by horizontal sliding. The step size of the horizontal movement can therefore be reduced and, based on D'_bot, the small detection boxes are recalled, giving the final set D''_bot.
Next, the lowest detection box in the set D''_bot serves as the detection box at the turning point, which corresponds to the first reference object (object H). The target type object (object E) corresponding to the leftmost bottom-layer detection box in the set D''_bot is taken as the second reference object. The target type object (object L) corresponding to the rightmost bottom-layer detection box in the set D''_bot is taken as the third reference object.
Then, when the included angle between the first direction P and the second direction Q falls within the preset angle range, the area of the target region indicated by the target type objects is determined based on the position information of the target type objects in the image to be identified and the size information of the target type objects.
For example, based on the size information of the target type objects E, F, G and H, the actual distance L_1 from target type object H to target type object E is calculated. Based on the size information of the target type objects H, K and L, the actual distance L_2 from target type object H to target type object L is calculated. The area of the target region indicated by the target type objects is L_1 * L_2, and the area L_1 * L_2 is the floor area occupied by the plurality of stacked objects 400.
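The footprint estimate can be sketched numerically as below; the per-object physical widths (0.3 m cartons) are invented example values used only to show the arithmetic.

```python
# Sketch of the footprint estimate: sum real-world object widths along the two
# bottom-row directions meeting at the turning point, then multiply.
def footprint_area(widths_side_1, widths_side_2):
    """widths_side_1: physical widths of the bottom-row objects from H towards E (direction P);
    widths_side_2: physical widths of the bottom-row objects from H towards L (direction Q)."""
    l1 = sum(widths_side_1)      # actual distance L_1 from object H to object E
    l2 = sum(widths_side_2)      # actual distance L_2 from object H to object L
    return l1 * l2

# e.g. four 0.3 m cartons along one side, three 0.3 m cartons along the other
print(footprint_area([0.3, 0.3, 0.3, 0.3], [0.3, 0.3, 0.3]))   # 1.08 square metres
```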
It can be appreciated that, after the target object in the image to be identified is identified, the embodiment of the disclosure can estimate the actual occupied area of the plurality of objects stacked and placed according to the position information of the target type object in the target object, so that the occupied area is not required to be measured manually, and the labor cost is reduced.
In an embodiment of the present disclosure, the image to be identified is, for example, an image acquired of a plurality of objects, and the target type objects include at least some of the plurality of objects. Because the plurality of objects are stacked, occlusion between the objects means that not all of them are fully visible in the image to be recognized. The target type objects obtained by recognizing the image are therefore typically the visible objects, which are only some of the plurality of objects.
For example, the classification model may be used to classify the image to be identified. If the category of the image to be recognized is the first category, determining the area of the target area indicated by the target type object based on the position information of the target type object in the image to be recognized and the size information of the target type object. If the category of the image to be recognized is the second category, the number of the plurality of objects may be determined based on the position information of the target type object in the image to be recognized and the image characteristics of the target type object. The specific procedure is described below.
Fig. 5 schematically illustrates a schematic diagram of image recognition according to another embodiment of the present disclosure.
As shown in fig. 5, an image 501 to be recognized is input into a target detection model 502, a top detection model 503, and a classification model 504, respectively, for parallel processing.
The target detection model 502 recognizes the image 501 to be identified to obtain a recognition result X_1, which includes the location information of a plurality of target objects and a first label. The top detection model 503 recognizes the image 501 to be identified to obtain a recognition result Y_1, which likewise includes the location information of a plurality of target objects and a first label.
Using the image retrieval mode 505, an image search is performed in the image library 506 based on the region image, yielding a second label for the target object, where the region image is the portion of the image 501 to be recognized occupied by the target object indicated by the recognition result X_1. A target label of the target object is then determined based on the first label and the second label. The position information of the target object and the target label are taken as the result X_2.
Likewise, using the image retrieval mode 505, an image search is performed in the image library 506 based on the region image, yielding a second label for the target object, where the region image is the portion of the image 501 to be recognized occupied by the target object indicated by the recognition result Y_1. A target label of the target object is then determined based on the first label and the second label. The position information of the target object and the target label are taken as the result Y_2.
The classification model 504 classifies the image 501 to be identified to obtain the category of the image 501. The judging device 507 then judges the category of the image 501 to be identified; if the category is the first category 508, the result X_2 is input to the area calculation module 510 for area calculation, giving an area calculation result 511 (the area calculation process is described above and is not repeated here). The first category 508 indicates, for example, that the image 501 to be identified was acquired in a first scene, which includes, for example, a supermarket scene, a shopping mall scene, and the like.
If the judging device 507 judges that the category of the image 501 to be identified is the second category 509, the result Y_2 is input to the number calculation module 512 for number calculation, giving a number calculation result 513. The second category 509 indicates, for example, that the image 501 to be identified was acquired in a second scene, which includes, for example, an individual store scene, a small convenience store scene, and the like.
The classification model may be trained, for example, in a semi-supervised manner. A small number of training images are first labeled manually. The classification model is trained with this small number of training images until it converges, giving a trained classification model M_0. The classification model M_0 is then used to classify a number of unknown images to obtain their labels; these labeled images are taken as additional training images, and the classification model M_0 is trained with them to obtain the final classification model M_1.
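The semi-supervised loop can be sketched as below. The trivial "majority label" model is only a stand-in so the sketch runs; the actual classifier, training routine and any confidence filtering of pseudo-labels are not specified by the text.

```python
# Sketch of the semi-supervised training loop: train M0 on a small labeled set,
# pseudo-label the unlabeled images with M0, then train M1 on the enlarged set.
def train(images_with_labels):
    """Stand-in for supervised training; returns a trivial 'model' (majority label)."""
    labels = [label for _, label in images_with_labels]
    majority = max(set(labels), key=labels.count)
    return lambda image: majority

def semi_supervised_train(labeled, unlabeled):
    m0 = train(labeled)                                  # step 1: small manually labeled set
    pseudo = [(img, m0(img)) for img in unlabeled]       # step 2: pseudo-label unknown images
    m1 = train(labeled + pseudo)                         # step 3: retrain on the enlarged set
    return m1
```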
It can be appreciated that the semi-supervised mode is adopted to train the classification model, and the method has the advantages of low manual labeling cost, strong model generalization and high robustness.
According to the embodiment of the disclosure, the categories of the images to be identified are identified, and area calculation or quantity calculation is performed on the images of different categories so as to meet different business scenes.
Fig. 6 schematically illustrates a schematic diagram of a number calculation according to an embodiment of the present disclosure.
As shown in fig. 6, the image 610 to be recognized is acquired of a plurality of objects 600 that are actually placed in a stack; the plurality of objects 600 includes, for example, objects A to I (nine objects). The image 610 to be identified is recognized using the top detection model, yielding a plurality of detection boxes, shown with dashed lines. The positions of the detection boxes are the position information of the plurality of target objects (A, D, F, H, I) corresponding one-to-one to the detection boxes. The target objects (A, D, F, H, I) are, for example, all target type objects.
The top detection model can detect whether each target type object has a top attribute. For example, the target type objects A, F and I indicated by detection boxes have the top attribute, meaning that the top of the target type object is visible in the image 610 to be recognized, while the target type objects D and H indicated by detection boxes do not have the top attribute. Next, the number of the plurality of objects is determined based on the position information of the target type objects in the image to be recognized and the image characteristics of the target type objects, where the image characteristic indicates whether a target type object has the top attribute.
For example, target type object A is at the first layer, target type object D is at the second layer, target type object F is at the third layer, target type object H is at the fourth layer, and target type object I is at the fifth layer. The target type object I at the highest layer has the top attribute, so it is determined that the fifth layer has 1 object. The target type object H at the fourth layer does not have the top attribute, so the number of objects at the fourth layer equals the number at the fifth layer, i.e. the fourth layer has 1 object. The target type object F at the third layer has the top attribute, so the number of objects at the third layer is 1 more than at the fourth layer, i.e. the third layer has 2 objects. The target type object D at the second layer does not have the top attribute, so the number of objects at the second layer equals the number at the third layer, i.e. the second layer has 2 objects. The target type object A at the first layer has the top attribute, so the number of objects at the first layer is 1 more than at the second layer, i.e. the first layer has 3 objects. From this, the number of the stacked objects 600 can be calculated to be 9.
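The layer-by-layer counting rule illustrated above can be sketched as follows; the function and input format are illustrative assumptions.

```python
# Sketch of the counting rule: walking from the top layer down, a layer whose
# visible object shows a top face holds one more object than the layer above it;
# otherwise it holds as many objects as the layer above.
def count_stacked(has_top_by_layer):
    """has_top_by_layer: top attribute of the visible object, ordered top layer first."""
    total, above = 0, 0
    for has_top in has_top_by_layer:
        layer_count = above + 1 if has_top else above
        total += layer_count
        above = layer_count
    return total

# Layers 5..1 from Fig. 6: I (top), H (no top), F (top), D (no top), A (top)
print(count_stacked([True, False, True, False, True]))   # 9
```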
It can be appreciated that, after the target object in the image to be identified is identified, the embodiment of the disclosure can estimate the number of the plurality of objects stacked and placed according to the position information of the target type object in the target object, without manually calculating the number of the objects, thereby reducing the labor cost.
Fig. 7 schematically illustrates a block diagram of an image recognition apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the image recognition apparatus 700 of the embodiment of the present disclosure includes, for example, a recognition module 710, a comparison module 720, a first determination module 730, and a second determination module 740.
The identifying module 710 may be configured to identify an image to be identified, to obtain location information of a target object in the image to be identified and a first tag for the target object. According to an embodiment of the present disclosure, the identification module 710 may perform, for example, operation S210 described above with reference to fig. 2, which is not described herein.
The comparing module 720 may be configured to compare, based on the location information, a similarity between an area image where the target object is located in the image to be identified and a plurality of reference images to obtain a comparison result, where each reference image includes a reference object and a second tag for the reference object. According to an embodiment of the present disclosure, the comparison module 720 may perform, for example, operation S220 described above with reference to fig. 2, which is not described herein.
The first determination module 730 may be configured to determine a target image from a plurality of reference images based on the comparison result. According to an embodiment of the present disclosure, the first determining module 730 may, for example, perform the operation S230 described above with reference to fig. 2, which is not described herein.
The second determination module 740 may be configured to determine a target label for the target object based on the first label and a second label of the target image. The second determining module 740 may, for example, perform operation S240 described above with reference to fig. 2 according to an embodiment of the present disclosure, which is not described herein.
According to an embodiment of the present disclosure, the first determining module 730 includes a first determination sub-module and a second determination sub-module. The first determination sub-module is configured to determine, as N candidate images, the N reference images whose similarity with the region image ranks highest among the plurality of reference images, where N is an integer greater than 1. The second determination sub-module is configured to determine the target image from the N candidate images based on the similarity between the region image and each of the N candidate images.
According to an embodiment of the present disclosure, the second determination sub-module includes at least one of a first determination unit, a second determination unit, and a first selection unit. The first determination unit is configured to determine, as the target image, a candidate image among the N candidate images whose similarity with the region image is greater than a first threshold. The second determination unit is configured to, in response to the similarity between each of the N candidate images and the region image being smaller than a second threshold, determine from the N candidate images the n candidate images whose similarity with the region image ranks in the top n, and determine the target image from these n candidate images based on the number of occurrences of the second label of each of the n candidate images, where n is smaller than N and the number of occurrences represents the number of reference images having that second label among the plurality of reference images. The first selection unit is configured to, in response to the similarity between each of the N candidate images and the region image being smaller than a third threshold, select the candidate image having the greatest similarity with the region image from the N candidate images as the target image. The second threshold is less than or equal to the first threshold, and the third threshold is less than the second threshold.
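A possible cooperation of the three units is sketched below in Python. The concrete threshold values, the branch order and the occurrence-count vote over the n retained candidates are assumptions; the disclosure only fixes that the second threshold is at most the first and that the third threshold is below the second:

from collections import Counter

def determine_target_image(candidates, sims, all_ref_labels,
                           t1=0.85, t2=0.7, t3=0.5, n=3):
    """candidates: list of (image, second_label) pairs for the N candidate images.
    sims: their similarities to the region image.
    all_ref_labels: second labels of the whole reference library, used to count
    how often each candidate's second label occurs (the "number of occurrences")."""
    best = max(range(len(candidates)), key=lambda i: sims[i])

    # First determination unit: a candidate exceeds the first threshold.
    if sims[best] > t1:
        return candidates[best]

    # First selection unit: every similarity is below the third threshold,
    # so fall back to the single most similar candidate.
    if max(sims) < t3:
        return candidates[best]

    # Second determination unit: every similarity is below the second threshold,
    # so keep the top-n candidates and vote by label frequency in the reference library.
    if max(sims) < t2:
        occurrences = Counter(all_ref_labels)
        top_n = sorted(range(len(candidates)), key=lambda i: sims[i], reverse=True)[:n]
        return candidates[max(top_n, key=lambda i: occurrences[candidates[i][1]])]

    # Otherwise default to the most similar candidate.
    return candidates[best]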
According to an embodiment of the present disclosure, the second determination module 740 includes at least one of a selection sub-module and a third determination sub-module. The selection sub-module is configured to select one of the first label and the second label as the target label in response to the first label and the second label satisfying a first condition. The third determination sub-module is configured to, in response to the first label and the second label satisfying a second condition, determine a difference between the confidence of the first label and the similarity of the second label, and determine the target label from among the first label, the second label and an adjacent label based on the difference, where the adjacent label is the label of another object adjacent to the target object in the image to be identified.
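A possible reading of these two sub-modules is sketched below in Python. The first and second conditions are not spelled out at this point, so the sketch assumes label agreement as the first condition and disagreement as the second; the difference threshold is a free parameter:

from collections import Counter

def determine_target_label(first_label, first_confidence,
                           second_label, second_similarity,
                           neighbor_labels, diff_threshold=0.1):
    # Assumed first condition: the two labels agree, so either one can be used.
    if first_label == second_label:
        return first_label

    # Assumed second condition: the labels disagree, so compare the detector
    # confidence of the first label with the retrieval similarity of the second.
    diff = first_confidence - second_similarity
    if diff > diff_threshold:
        return first_label
    if diff < -diff_threshold:
        return second_label

    # When the two scores are close, fall back to the label most common among
    # the adjacent objects in the image to be identified.
    if neighbor_labels:
        return Counter(neighbor_labels).most_common(1)[0][0]
    return first_label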
According to an embodiment of the present disclosure, the image to be identified includes a plurality of target objects, and the distance between adjacent target objects among the plurality of target objects is smaller than a preset distance. The identification module 710 includes a generation sub-module, an update sub-module and a fourth determination sub-module. The generation sub-module is configured to generate a plurality of initial detection frames based on a preset size, where the preset size is associated with the preset distance. The update sub-module is configured to update the plurality of initial detection frames to obtain a plurality of target detection frames. The fourth determination sub-module is configured to determine the position information of the target object based on the positions of the plurality of target detection frames in the image to be identified.
According to an embodiment of the present disclosure, the update sub-module includes a first adjusting unit, a second adjusting unit and a second selection unit. The first adjusting unit is configured to adjust the plurality of initial detection frames based on the image features of the image to be identified to obtain a plurality of adjusted detection frames, where each adjusted detection frame has a corresponding label and a confidence for that label. The second adjusting unit is configured to, for two adjacent detection frames with different labels among the plurality of adjusted detection frames, adjust the confidence of at least one of the two adjacent detection frames based on the degree of overlap between them. The second selection unit is configured to select a plurality of target detection frames, based on a preset confidence condition, from the plurality of adjusted detection frames obtained after the confidence adjustment.
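The update sub-module thus behaves like a label-aware non-maximum suppression. The Python sketch below, with assumed threshold values and a simple multiplicative penalty, lowers the confidence of the weaker of two overlapping frames that carry different labels and then keeps only the frames that satisfy the confidence condition:

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def select_target_frames(frames, iou_threshold=0.5, conf_threshold=0.3, penalty=0.5):
    """frames: list of dicts with keys 'box', 'label', 'conf' describing the
    adjusted detection frames. The threshold values and the penalty are assumptions."""
    frames = sorted(frames, key=lambda f: f['conf'], reverse=True)
    for i, strong in enumerate(frames):
        for weak in frames[i + 1:]:
            # For two overlapping frames with different labels, reduce the
            # confidence of the weaker frame according to their overlap.
            if strong['label'] != weak['label'] and iou(strong['box'], weak['box']) > iou_threshold:
                weak['conf'] *= penalty
    # Preset confidence condition: keep frames whose adjusted confidence is high enough.
    return [f for f in frames if f['conf'] >= conf_threshold]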
According to an embodiment of the present disclosure, the image to be identified includes a plurality of target objects, and the apparatus 700 may further include a third determination module and a fourth determination module. The third determination module is configured to determine a target type object from the plurality of target objects based on the target label of each target object. The fourth determination module is configured to determine the area of the target area indicated by the target type object based on the position information of the target type object in the image to be identified and the size information of the target type object.
According to an embodiment of the present disclosure, the fourth determination module includes fifth, sixth, seventh, eighth and ninth determination sub-modules. The fifth determination sub-module is configured to determine a plurality of reference objects from the target type objects based on the position information of the target type objects in the image to be identified. The sixth determination sub-module is configured to determine a first reference object, a second reference object and a third reference object from the plurality of reference objects, where the second reference object and the third reference object are located on both sides of the first reference object. The seventh determination sub-module is configured to determine the direction from the first reference object to the second reference object as a first direction. The eighth determination sub-module is configured to determine the direction from the first reference object to the third reference object as a second direction. The ninth determination sub-module is configured to, in response to the included angle between the first direction and the second direction falling within a preset included angle, determine the area of the target area indicated by the target type object based on the position information of the target type object in the image to be identified and the size information of the target type object.
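One way to realize the angle check and the subsequent area estimate is sketched below in Python. The preset included-angle range and the bounding-box-based area formula are assumptions chosen for illustration (the second and third reference objects lying on both sides of the first suggests an included angle close to 180 degrees, i.e. roughly collinear reference objects); they are not the computation fixed by the disclosure:

import math

def included_angle(p1, p2, p3):
    """Angle in degrees between the direction p1->p2 and the direction p1->p3."""
    v1 = (p2[0] - p1[0], p2[1] - p1[1])
    v2 = (p3[0] - p1[0], p3[1] - p1[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2) + 1e-8
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def target_area(positions, obj_size, p1, p2, p3, angle_range=(150.0, 180.0)):
    """positions: (x, y) centres of all target-type objects in the image.
    obj_size: (width, height) of one target-type object.
    p1, p2, p3: centres of the first, second and third reference objects."""
    if not (angle_range[0] <= included_angle(p1, p2, p3) <= angle_range[1]):
        return None  # the preset included angle condition is not met
    xs = [p[0] for p in positions]
    ys = [p[1] for p in positions]
    w, h = obj_size
    # Assumed estimate: the bounding box of the object centres, extended by one
    # object size, approximates the target area indicated by the target type object.
    return (max(xs) - min(xs) + w) * (max(ys) - min(ys) + h)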
According to an embodiment of the present disclosure, the apparatus 700 may further include, for example, a classification module configured to classify the image to be identified using a classification model. The fourth determination module is further configured to determine the area of the target area indicated by the target type object based on the position information of the target type object in the image to be identified and the size information of the target type object, in response to the category of the image to be identified being a first category.
According to an embodiment of the present disclosure, the image to be identified is an image obtained for a plurality of objects, and the target type object comprises at least part of the plurality of objects. The apparatus 700 may further include a fifth determination module configured to determine, in response to the category of the image to be identified being a second category, the number of the plurality of objects based on the position information of the target type object in the image to be identified and the image features of the target type object.
In the technical solution of the present disclosure, the acquisition, storage and application of the personal information of users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 is a block diagram of an electronic device used to perform image recognition according to an embodiment of the present disclosure.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic device 800 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, for example, an image recognition method. For example, in some embodiments, the image recognition method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When a computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the image recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the image recognition method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (18)

1. An image recognition method, comprising:
identifying an image to be identified to obtain the position information of a target object in the image to be identified and a first label aiming at the target object;
based on the position information, respectively comparing the similarity of the region image of the target object in the image to be identified and a plurality of reference images to obtain a comparison result, wherein each reference image comprises a reference object and a second label aiming at the reference object;
determining a target image from the plurality of reference images based on the comparison result; and
determining a target label for the target object based on the first label and a second label of the target image;
wherein the image to be identified comprises a plurality of target objects; the method further comprises the steps of:
determining a target type object from the plurality of target objects based on the target label of each target object; and
determining the area of a target area indicated by a target type object based on the position information of the target type object in the image to be identified and the size information of the target type object;
wherein the determining the area of the target area indicated by the target type object based on the position information of the target type object in the image to be identified and the size information of the target type object includes:
determining a plurality of reference objects from the target type objects based on the position information of the target type objects in the image to be identified;
determining a first reference object, a second reference object and a third reference object from the plurality of reference objects, wherein the second reference object and the third reference object are positioned on two sides of the first reference object;
determining a direction pointing from the first reference object to the second reference object as a first direction;
determining a direction from the first reference object to the third reference object as a second direction; and
determining the area of a target area indicated by the target type object based on the position information of the target type object in the image to be identified and the size information of the target type object, in response to the included angle between the first direction and the second direction belonging to a preset included angle.
2. The method of claim 1, wherein the determining a target image from the plurality of reference images based on the comparison result comprises:
determining, as N candidate images, the reference images whose similarity with the region image ranks in the top N among the plurality of reference images, wherein N is an integer greater than 1; and
determining a target image from the N candidate images based on a similarity between the region image and each of the N candidate images.
3. The method of claim 2, wherein the determining a target image from the N candidate images based on a similarity between the region image and each of the N candidate images comprises at least one of:
determining, as a target image, a candidate image among the N candidate images whose similarity with the region image is greater than a first threshold;
determining, from the N candidate images, n candidate images whose similarity with the region image ranks in the top n, in response to the similarity between each candidate image and the region image being smaller than a second threshold, and determining a target image from the n candidate images based on the number of occurrences of the second label of each of the n candidate images, wherein n is smaller than N, and the number of occurrences represents the number of reference images having the second label among the plurality of reference images; and
in response to the similarity between each of the N candidate images and the region image being less than a third threshold, selecting one candidate image from the N candidate images having the greatest similarity with the region image as a target image,
wherein the second threshold is less than or equal to the first threshold and the third threshold is less than the second threshold.
4. The method of claim 3, wherein the determining a target label for the target object based on the first label and a second label of the target image comprises at least one of:
selecting one of the first label and the second label as a target label in response to the first label and the second label satisfying a first condition; and
determining a difference value between the confidence of the first label and the similarity of the second label in response to the first label and the second label satisfying a second condition, and determining a target label from the first label, the second label and an adjacent label based on the difference value, wherein the adjacent label is a label of another object adjacent to the target object in the image to be identified.
5. The method according to any one of claims 1-4, wherein the image to be identified comprises a plurality of target objects, a distance between adjacent target objects of the plurality of target objects being less than a preset distance;
wherein identifying the image to be identified to obtain the position information of the target object in the image to be identified comprises:
generating a plurality of initial detection frames based on a preset size, wherein the preset size is associated with the preset distance; and
updating the plurality of initial detection frames to obtain a plurality of target detection frames; and
determining the position information of the target object based on the positions of the plurality of target detection frames in the image to be identified.
6. The method of claim 5, wherein updating the plurality of initial detection frames to obtain a plurality of target detection frames comprises:
adjusting the plurality of initial detection frames based on the image characteristics of the image to be identified to obtain a plurality of adjusted detection frames, wherein each adjusted detection frame has a corresponding label and a confidence level of the corresponding label;
adjusting the confidence of at least one of the two adjacent detection frames based on the overlap ratio between the two adjacent detection frames for two adjacent detection frames with different labels in the plurality of adjusted detection frames; and
based on a preset confidence condition, selecting a plurality of target detection frames from a plurality of adjusted detection frames obtained after confidence adjustment.
7. The method of claim 1, further comprising: classifying the image to be identified by using a classification model;
wherein the determining the area of the target area indicated by the target type object based on the position information of the target type object in the image to be identified and the size information of the target type object includes:
determining the area of a target area indicated by the target type object based on the position information of the target type object in the image to be identified and the size information of the target type object, in response to the category of the image to be identified being the first category.
8. The method of claim 7, wherein the image to be identified is an image obtained for a plurality of objects, the target type object comprising at least a portion of the plurality of objects;
the method further comprises the steps of: and determining the number of the plurality of objects based on the position information of the target type object in the image to be identified and the image characteristics of the target type object in response to the category of the image to be identified being the second category.
9. An image recognition apparatus comprising:
the identification module is used for identifying the image to be identified to obtain the position information of the target object in the image to be identified and a first label aiming at the target object;
the comparison module is used for comparing the similarity between the region image of the target object in the image to be identified and a plurality of reference images based on the position information to obtain a comparison result, wherein each reference image comprises a reference object and a second label aiming at the reference object;
a first determining module configured to determine a target image from the plurality of reference images based on the comparison result; and
a second determining module, configured to determine a target label for the target object based on the first label and a second label of the target image;
wherein the image to be identified comprises a plurality of target objects; the apparatus further comprises:
a third determining module, configured to determine a target type object from the plurality of target objects based on the target tag of each target object; and
a fourth determining module, configured to determine an area of a target area indicated by the target type object based on position information of the target type object in the image to be identified and size information of the target type object;
wherein the fourth determination module includes:
a fifth determining sub-module for determining a plurality of reference objects from the target type objects based on the position information of the target type objects in the image to be identified;
a sixth determining sub-module for determining a first reference object, a second reference object, and a third reference object from the plurality of reference objects, wherein the second reference object and the third reference object are located at both sides of the first reference object;
a seventh determining submodule for determining a direction pointed by the first reference object to the second reference object as a first direction;
an eighth determination submodule for determining a direction from the first reference object to the third reference object as a second direction; and
a ninth determining submodule, configured to determine an area of a target area indicated by the target type object based on position information of the target type object in the image to be identified and size information of the target type object, in response to an included angle between the first direction and the second direction belonging to a preset included angle.
10. The apparatus of claim 9, wherein the first determination module comprises:
a first determining sub-module, configured to determine, as N candidate images, reference images, of the plurality of reference images, having a similarity with the region image ranked in the top N, where N is an integer greater than 1; and
and a second determining sub-module, configured to determine a target image from the N candidate images based on a similarity between the region image and each of the N candidate images.
11. The apparatus of claim 10, wherein the second determination submodule comprises at least one of:
a first determining unit configured to determine, as a target image, a candidate image having a similarity with the region image greater than a first threshold value among the N candidate images;
a second determining unit configured to, in response to the similarity between each of the N candidate images and the region image being smaller than a second threshold, determine from among the N candidate images n candidate images whose similarity with the region image ranks in the top n, and determine a target image from among the n candidate images based on the number of occurrences of the second label of each of the n candidate images, where n is smaller than N, and the number of occurrences represents the number of reference images having the second label among the plurality of reference images; and
A first selection unit configured to select, as a target image, one candidate image having a maximum similarity with the region image from the N candidate images in response to a similarity between each of the N candidate images and the region image being smaller than a third threshold,
wherein the second threshold is less than or equal to the first threshold and the third threshold is less than the second threshold.
12. The apparatus of claim 11, wherein the second determination module comprises at least one of:
a selecting sub-module, configured to select one of the first label and the second label as a target label in response to the first label and the second label satisfying a first condition; and
a third determining sub-module, configured to determine a difference value between the confidence of the first label and the similarity of the second label in response to the first label and the second label satisfying a second condition, and to determine a target label from the first label, the second label and an adjacent label based on the difference value, wherein the adjacent label is a label of another object adjacent to the target object in the image to be identified.
13. The apparatus according to any one of claims 9-12, wherein the image to be identified comprises a plurality of target objects, a distance between adjacent ones of the plurality of target objects being less than a preset distance;
wherein, the identification module includes:
a generating sub-module for generating a plurality of initial detection frames based on a preset size, wherein the preset size is associated with the preset distance; and
the updating sub-module is used for updating the plurality of initial detection frames to obtain a plurality of target detection frames; and
a fourth determining sub-module, configured to determine the position information of the target object based on the positions of the plurality of target detection frames in the image to be identified.
14. The apparatus of claim 13, wherein the update sub-module comprises:
the first adjusting unit is used for adjusting the plurality of initial detection frames based on the image characteristics of the image to be identified to obtain a plurality of adjusted detection frames, wherein each adjusted detection frame has a corresponding label and the confidence of the corresponding label;
a second adjusting unit, configured to adjust, for two adjacent detection frames having different labels among the plurality of adjusted detection frames, a confidence level of at least one detection frame among the two adjacent detection frames based on a coincidence level between the two adjacent detection frames; and
a second selection unit, configured to select a plurality of target detection frames, based on a preset confidence condition, from the plurality of adjusted detection frames obtained after the confidence adjustment.
15. The apparatus of claim 9, further comprising: the classification module is used for classifying the images to be identified by using a classification model;
wherein the fourth determination module is further configured to:
determine the area of a target area indicated by the target type object based on the position information of the target type object in the image to be identified and the size information of the target type object, in response to the category of the image to be identified being the first category.
16. The apparatus of claim 15, wherein the image to be identified is an image obtained for a plurality of objects, the target type object comprising at least a portion of the plurality of objects;
the apparatus further comprises: a fifth determining module, configured to determine, in response to the category of the image to be identified being the second category, the number of the plurality of objects based on the position information of the target type object in the image to be identified and the image characteristics of the target type object.
17. An electronic device, comprising:
At least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.