CN112434715A - Target identification method and device based on artificial intelligence and storage medium - Google Patents

Target identification method and device based on artificial intelligence and storage medium

Info

Publication number
CN112434715A
CN112434715A
Authority
CN
China
Prior art keywords: target, extraction, loss, image, frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011435671.2A
Other languages
Chinese (zh)
Other versions
CN112434715B (en)
Inventor
李星宇
岳大威
王宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2020-12-10
Publication date: 2021-03-02
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011435671.2A
Publication of CN112434715A
Application granted
Publication of CN112434715B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/63 Scene text, e.g. street names

Abstract

The invention discloses a target identification method and device based on artificial intelligence, and a storage medium. The method comprises: acquiring an image to be identified; extracting comprehensive local position information and global context information of the image to be identified to obtain a feature map group corresponding to the image to be identified; performing target extraction based on the feature map group to obtain target extraction results, wherein each target in the target extraction results comprises four corner points and is framed by a quadrilateral extraction frame determined by the four corner points; and outputting a target identification result according to the target extraction result. The method can accurately regress the four corner points, and the four regressed corner points uniquely determine a quadrilateral detection frame. Compared with the rectangular detection frame of the related art, the quadrilateral detection frame fits the outer contour of the target in the image more closely, thereby reducing noise and improving the accuracy of target identification.

Description

Target identification method and device based on artificial intelligence and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method and an apparatus for identifying a target based on artificial intelligence, and a storage medium.
Background
In the related art, target detection can usually regress only two corner points of a target; the target is framed by the rectangular detection frame uniquely determined by those two corner points, yielding the target detection result. In actual image acquisition, however, the shooting angle changes the perspective of the captured target, so the rectangular detection frame does not fit the detected target well and tends to include redundancy and noise, which limits the accuracy of the target detection result.
Disclosure of Invention
In order to improve the accuracy of target detection, the embodiments of the present disclosure provide a target identification method and apparatus based on artificial intelligence, and a storage medium.
In one aspect, the present disclosure provides an artificial intelligence-based target identification method, including:
acquiring an image to be identified;
extracting the comprehensive local position information and the global context information of the image to be recognized to obtain a feature map group corresponding to the image to be recognized;
performing target extraction based on the feature map group to obtain target extraction results, wherein each target in the target extraction results correspondingly comprises four corner points, and the target is framed and selected by a quadrilateral extraction frame determined by the four corner points;
and outputting a target identification result according to the target extraction result.
In another aspect, the present disclosure provides an artificial intelligence-based target recognition apparatus, including:
the image to be recognized acquisition module is used for acquiring an image to be recognized;
the characteristic extraction module is used for extracting the comprehensive local position information and the global context information of the image to be identified to obtain a characteristic graph group corresponding to the image to be identified;
the target extraction module is used for extracting targets based on the feature map group to obtain target extraction results, each target in the target extraction results correspondingly comprises four corner points, and the target is framed and selected by a quadrilateral extraction frame determined by the four corner points;
and the target identification module is used for outputting a target identification result according to the target extraction result.
In another aspect, the present disclosure provides a computer-readable storage medium, where at least one instruction or at least one program is stored in the computer-readable storage medium, and the at least one instruction or the at least one program is loaded and executed by a processor to implement an artificial intelligence based object recognition method as described above.
In another aspect, the present disclosure provides an electronic device, comprising at least one processor, and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the at least one processor implements an artificial intelligence based object recognition method as described above by executing the instructions stored by the memory.
The disclosure provides a target identification method and device based on artificial intelligence, and a storage medium. The present disclosure can accurately regress four corner points, and the four regressed corner points uniquely determine a quadrilateral detection frame. Compared with the rectangular detection frame of the related art, the quadrilateral detection frame fits the outer contour of the target in the image more closely, thereby reducing noise and improving the accuracy of target identification.
Drawings
In order to more clearly illustrate the technical solutions and advantages of the embodiments of the present disclosure or of the related art, the drawings used in the description of the embodiments or the related art are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present disclosure; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a shop sign recognition result in the related art, provided by the present disclosure;
FIG. 2 is a schematic flow chart diagram of an artificial intelligence-based target identification method provided by the present disclosure;
FIG. 3 is a schematic diagram of an image in a shop identification scene in an embodiment of the disclosure;
FIG. 4 is a schematic diagram of the structure of the Hourglass network provided by the present disclosure;
FIG. 5 is a schematic diagram of a Hourglass network element provided by the present disclosure;
FIG. 6 is a flow chart of outputting a target recognition result according to the target extraction result provided by the present disclosure;
FIG. 7 is another flow chart provided by the present disclosure for outputting a target recognition result according to the above target extraction result;
FIG. 8 is a flowchart illustrating processing of the nested structure to obtain a processing result according to the present disclosure;
FIG. 9 is a schematic view of a nesting arrangement provided by the present disclosure;
FIG. 10 is another schematic illustration of a nesting arrangement provided by the present disclosure;
FIG. 11 is a flow chart of a method of training a neural network provided by the present disclosure;
FIG. 12 is a flow chart of obtaining a training sample set provided by the present disclosure;
FIG. 13 is a flow chart, provided by the present disclosure, of training the target extraction network according to the training sample set;
FIG. 14 is a block diagram of an artificial intelligence based object recognition apparatus provided by the present disclosure;
fig. 15 is a hardware structure diagram of an apparatus provided by the present disclosure for implementing the method provided by the embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used are interchangeable under appropriate circumstances, so that the embodiments of the invention described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such a process, method, system, article, or apparatus.
In order to make the purpose, technical solution and advantages of the embodiments of the present disclosure more clearly understood, the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings and the embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the embodiments of the disclosure and that no limitation to the embodiments of the disclosure is intended.
In the following, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present embodiment, "a plurality" means two or more unless otherwise specified.

In order to facilitate understanding of the above technical solutions and their technical effects in the embodiments of the present disclosure, the embodiments of the present disclosure first explain related terms:
artificial Intelligence (AI): the method is a theory, method, technology and application system for simulating, extending and expanding human intelligence by using a digital computer or a machine controlled by the digital computer, sensing the environment, acquiring knowledge and obtaining the best result by using the knowledge. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Deep learning: the method is a new research direction in the field of Machine Learning (ML), and is an intrinsic rule and a presentation level of Learning sample data, and information obtained in the Learning process is very helpful for explaining data such as characters, images and sound. The final aim of the method is to enable the machine to have the analysis and learning capability like a human, and to recognize data such as characters, images and sounds.
Target detection: the method is also called target extraction, is image segmentation based on target geometry and statistical characteristics, combines the segmentation and identification of a target into a whole, and has the accuracy and the real-time performance which are important capabilities of the whole system.
CenterNet: is a deep learning model for target detection that returns other attributes of the target (such as size, orientation, pose, etc.) based on a center point.
Hourglass: a convolutional neural network that uses multi-scale features to obtain global context information. The network structure is shaped like an hourglass, and the structure from top to bottom to top is repeatedly used for extracting the comprehensive local position information and the global context information. The Hourglass is widely applied to human posture estimation, wherein the human posture estimation firstly determines the accurate pixel position of the important key point of the human body in the image and then completes the action recognition through the analysis of the human posture and the limb joint. When the Hourglass is applied to human body posture estimation, the Hourglass can assist the human body posture estimation model to regress to obtain 17 key points, and the 17 key points are positioned in a human body.
A shop sign: that is, shop (shop) signboards, are one kind of POI (Point of Interest), and most of POIs belong to the category of shop signboards.
Target identification is a key research direction in the field of artificial intelligence. Automatically extracting targets from images through target identification saves labor and reduces cost, so the technique has broad application prospects and has developed rapidly. However, target identification in the related art suffers from high noise and limited accuracy. Taking artificial intelligence based target recognition in a shop sign scene as an example, please refer to fig. 1, which shows a schematic diagram of a shop sign recognition result in the related art. As can be seen from fig. 1, a rectangular shop sign appears as an oblique quadrilateral target in the image under the influence of the shooting angle. The detection frame obtained by the related art is rectangular (see the thick frame in fig. 1) and cannot fit the outer boundary of the oblique quadrilateral target; framing such a target with a rectangular detection frame produces obvious noise.
Therefore, the rectangular detection frame determined by regressing two corner points in the related art introduces noise, which reduces the accuracy of target identification.
An artificial intelligence based object recognition method of the present disclosure is described below. Fig. 2 is a schematic flow chart of an artificial intelligence based object recognition method provided by an embodiment of the present disclosure, which presents the method operation steps according to the embodiment or the flow chart; more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only order of execution; in practice, the system or server product may execute the steps sequentially or in parallel (e.g., in a parallel processor or multi-threaded environment) according to the embodiments or the methods shown in the figures. The method may include:
s101, obtaining an image to be identified.
In some possible embodiments, the image to be identified may be acquired by an electronic device. Alternatively, the electronic device may acquire the image to be recognized from another device, for example, the electronic device may acquire the image to be recognized from an image capturing device, a monitoring device, or the like. In some implementations, the image to be recognized may be an image frame in a video.
The image to be recognized in the embodiment of the present disclosure may be a two-dimensional image, and specifically, the image to be recognized may be an RGB three-channel color image (R: Red, G: Green, B: Blue), a grayscale image, or an RGBD four-channel color image including depth information. The color system of the image to be recognized is not limited in the embodiment of the disclosure.
And S102, extracting the comprehensive local position information and the global context information of the image to be recognized to obtain a feature map group corresponding to the image to be recognized.
In order to obtain a more accurate target extraction result, and so that four corner points can be regressed from the center point of the target, the comprehensive local position information and the global context information of the image are extracted; a more accurate target extraction result can then be obtained under the guidance of the global context information.
The comprehensive local position information can be used to extract the object and determine its outline in the image; the global context information can be used to express the environmental information of the object. For example, an object may be extracted from the image according to the comprehensive local position information; if the object is located on a shop front, it has a high probability of being a shop signboard, while if it is located on a vehicle, it has a high probability of being a license plate. Shops and cars can be regarded as a kind of global context information, from which license plates can be excluded if the goal of object recognition is to extract shop signs. By extracting comprehensive local position information and global context information, the embodiments of the present disclosure make it easy to filter noise in the target extraction step and improve the accuracy of target extraction. Fig. 3 shows a schematic view of an image in a shop sign recognition scene. As can be seen from fig. 3, various disturbances such as advertisements, banners and security marks appear in the image and are easily misrecognized as shop signs; by filtering with such global context information, noise such as advertisements, banners and security marks can be filtered out, thereby improving the recognition accuracy of the shop sign.
In one embodiment, the comprehensive local position information and the global context information of the image to be recognized can be extracted based on a Hourglass network. In the related art, the Hourglass network is mostly used for feature extraction in human posture estimation. The feature points in human posture estimation are numerous (usually 17) and all lie inside the target, whereas the feature points identified by the present disclosure (four corner points) are fewer and lie on the edge of the target, which is an obvious difference; nevertheless, Hourglass has the advantage of extracting both the comprehensive local position information and the global context information of the image.
Referring to fig. 4, a schematic diagram of a Hourglass network is shown. The structure is expandable: each Hourglass network unit is shaped like two hourglasses joined at their vertices, and a feature extraction network with several Hourglass network units can be formed by connecting the units end to end. The comprehensive local position information and the global context information are extracted by repeatedly applying "top-down" then "bottom-up" Hourglass network units. Referring to fig. 5, a schematic diagram of a Hourglass network unit with recursive residual modules as backbone modules is shown.
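For illustration only (not part of the original disclosure), the following Python sketch shows one way a single Hourglass network unit of this shape could be assembled in PyTorch; the class names, channel handling and upsampling mode are assumptions rather than details of the disclosed network.

```python
import torch.nn as nn

class Residual(nn.Module):
    """Recursive-residual backbone module of the Hourglass unit (cf. fig. 5)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)

class HourglassUnit(nn.Module):
    """One 'top-down then bottom-up' unit; depth is how many times the
    feature map is halved before being upsampled back."""
    def __init__(self, depth, channels):
        super().__init__()
        self.skip = Residual(channels)            # local position branch at current scale
        self.pool = nn.MaxPool2d(2)               # top-down: halve the resolution
        self.down = Residual(channels)
        self.inner = (HourglassUnit(depth - 1, channels)
                      if depth > 1 else Residual(channels))
        self.up_res = Residual(channels)
        self.up = nn.Upsample(scale_factor=2)     # bottom-up: restore the resolution

    def forward(self, x):
        low = self.up_res(self.inner(self.down(self.pool(x))))
        return self.skip(x) + self.up(low)        # fuse local details with global context
```

Several such units connected end to end would form the expandable feature extraction network of fig. 4.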
And S103, extracting targets based on the feature map group to obtain target extraction results, wherein each target in the target extraction results correspondingly comprises four corner points, and selecting the target according to a quadrilateral extraction frame determined by the four corner points.
The target extraction result corresponding to the image to be recognized in the present disclosure may include zero targets, one target or a plurality of targets. Any given target comprises four corner points and is uniquely framed by the quadrilateral extraction frame determined by those four corner points.
The method and the device first predict the center point of the target and then regress the four corner points from the center point, thereby obtaining the target extraction result; a decoding sketch follows.
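As a non-limiting sketch, the following code decodes such a target extraction result from a center-point heatmap and an assumed 8-channel corner-offset map; the tensor layout, the top-k value and the score threshold are illustrative assumptions, not taken from the disclosure.

```python
import torch

def decode_quadrilaterals(center_heatmap, corner_offsets, k=100, score_thresh=0.3):
    # center_heatmap: (H, W) center-point scores (after a sigmoid).
    # corner_offsets: (8, H, W) regressed (dx, dy) of the four corner points,
    # expressed relative to the predicted center point.
    h, w = center_heatmap.shape
    k = min(k, center_heatmap.numel())
    scores, idx = center_heatmap.flatten().topk(k)   # strongest center candidates
    keep = scores > score_thresh
    scores, idx = scores[keep], idx[keep]
    ys = torch.div(idx, w, rounding_mode="floor")
    xs = idx % w
    targets = []
    for score, cx, cy in zip(scores, xs, ys):
        corners = []
        for c in range(4):                           # regress each corner from the center
            dx = corner_offsets[2 * c, cy, cx]
            dy = corner_offsets[2 * c + 1, cy, cx]
            corners.append((float(cx + dx), float(cy + dy)))
        targets.append({"score": float(score), "corners": corners})
    return targets
```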
And S104, outputting a target identification result according to the target extraction result.
In a possible embodiment, the target recognition result may include a quadrilateral extraction frame, and may also include a rectangular extraction frame determined according to two farthest corner points of the quadrilateral extraction frame.
In one embodiment, the target extraction result may be filtered, and redundancy in the target recognition result is reduced by reducing the probability of repeatedly selecting the same target, please refer to fig. 6, which shows a flowchart of outputting the target recognition result according to the target extraction result, including:
s1041, calculating the overlapping degree of each quadrilateral extraction frame in the target extraction result.
The overlapping degree of any two quadrilateral extraction frames in the disclosure can be represented by the ratio of the intersection of any two quadrilateral extraction frames to the union of any two quadrilateral extraction frames.
S1043, filtering the target extraction result according to the overlapping degree to obtain a filtering result.
S1045, outputting the target recognition result according to the filtering result.
Specifically, in the present disclosure, an objective extraction frame may be determined among the quadrilateral extraction frames in the target extraction result, where the confidence of the objective extraction frame is greater than that of the other quadrilateral extraction frames; the overlapping degree between the objective extraction frame and each remaining quadrilateral extraction frame is then calculated. If the overlapping degree of a first extraction frame and the objective extraction frame is greater than a preset overlap threshold, the first extraction frame is deleted; if the overlapping degree is less than or equal to the overlap threshold, the first extraction frame is kept, where the first extraction frame is any quadrilateral extraction frame other than the objective extraction frame. Repeating this operation on the remaining quadrilateral extraction frames yields the filtering result, as sketched below. The specific value of the overlap threshold is not limited by the present disclosure and can be determined according to actual needs.
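A minimal sketch of this filtering, assuming the shapely library for polygon geometry; the detection dictionary layout and the default threshold are illustrative assumptions.

```python
from shapely.geometry import Polygon

def quad_overlap(corners_a, corners_b):
    # Overlap degree: intersection area over union area of two quadrilaterals.
    a, b = Polygon(corners_a), Polygon(corners_b)
    union = a.union(b).area
    return a.intersection(b).area / union if union > 0 else 0.0

def filter_quads(detections, overlap_thresh=0.5):
    # Greedy filtering: keep the highest-confidence frame, delete every frame
    # whose overlap with it exceeds the threshold, then repeat on the rest.
    remaining = sorted(detections, key=lambda d: d["score"], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining
                     if quad_overlap(best["corners"], d["corners"]) <= overlap_thresh]
    return kept
```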
In one embodiment, a nested structure in the target extraction result may be further processed, where the nested structure includes a first extraction box at an outer layer and at least one second extraction box located inside the first extraction box; the first extraction frame and the second extraction frame are both quadrilateral extraction frames, and the nested structure can be simplified by processing the nested structure, so that the redundancy in the target recognition result is further reduced.
Referring to fig. 7, another flowchart of outputting a target recognition result according to the target extraction result is shown, which includes:
s1042, if the target extraction result comprises at least one nested structure, processing the nested structure to obtain a processing result; the nested structure comprises a first extracting frame on the outer layer and at least one second extracting frame positioned inside the first extracting frame; the first extraction frame and the second extraction frame are both quadrilateral extraction frames.
And S1044, outputting the target recognition result according to the processing result.
Specifically, as shown in fig. 8, it shows a flowchart for processing the above-mentioned nested structure to obtain a processing result, including:
s10421, extracting the outer layer of the nested structure to obtain a first extraction frame;
s10422, extracting the inner layer of the nested structure to obtain at least one second extraction frame;
s10423, if the number of the second extraction frames is equal to 1, obtaining an intersection of the second extraction frame and the first extraction frame, and if a ratio of the intersection to the second extraction frame is greater than a preset first threshold, deleting a target extraction frame, where the target extraction frame is an extraction frame with a lower confidence level in the second extraction frame and the first extraction frame.
Referring to fig. 9, which shows a schematic diagram of a nested structure: the outer box in fig. 9 is the first extraction frame and the inner box is the second extraction frame. The intersection of the second extraction frame and the first extraction frame is divided by the area of the smaller frame (the second extraction frame); if the resulting ratio is greater than the preset first threshold, the extraction frame with the lower confidence is deleted. In the target extraction result of the present disclosure, each target has a corresponding confidence, which represents the confidence of the target's classification. For example, if the present disclosure is used to identify shop signs, the confidence characterizes how certain it is that the target is a shop sign. In fig. 9, if the confidence of the second extraction frame is lower, the second extraction frame is deleted and the first extraction frame is kept; if the confidence of the first extraction frame is lower, the first extraction frame is deleted and the second extraction frame is kept.
S10424, if the number of the second extraction frames is greater than 1, and the overlapping degree of each pair of adjacent second extraction frames is smaller than a preset second threshold, and the ratio of the union of the second extraction frames to the first extraction frame is greater than a preset third threshold, the first extraction frame is deleted to obtain the processing result.
Referring to fig. 10, which shows another schematic diagram of the nested structure: the outer box in fig. 10 is the first extraction frame, and the three inner boxes are second extraction frames. If the ratio of the area jointly covered by the three second extraction frames to the coverage area of the first extraction frame is greater than the preset third threshold, the first extraction frame is deleted.
The disclosure does not limit the specific values of the first threshold, the second threshold and the third threshold, which can be set according to actual needs. A sketch of this nested-structure processing follows.
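A minimal sketch of steps S10421 to S10424, again assuming shapely; the threshold values t1, t2 and t3, and the pairwise reading of "adjacent" second extraction frames, are assumptions.

```python
from shapely.geometry import Polygon
from shapely.ops import unary_union

def process_nested(first, seconds, t1=0.9, t2=0.3, t3=0.9):
    # first: the outer (first) extraction frame; seconds: the inner (second)
    # extraction frames; each is {"score": float, "corners": [(x, y)] * 4}.
    outer = Polygon(first["corners"])
    inners = [Polygon(s["corners"]) for s in seconds]
    if len(inners) == 1:
        # S10423: a single inner frame almost fully covered by the outer frame;
        # keep only the frame with the higher confidence.
        if inners[0].intersection(outer).area / inners[0].area > t1:
            return [first] if first["score"] >= seconds[0]["score"] else seconds
        return [first] + seconds
    # S10424: several inner frames that barely overlap one another but together
    # cover most of the outer frame make the outer frame redundant. (Pairwise
    # overlap is used here as a stand-in for "adjacent" frames.)
    low_overlap = all(
        inners[i].intersection(inners[j]).area / inners[i].union(inners[j]).area < t2
        for i in range(len(inners)) for j in range(i + 1, len(inners)))
    if low_overlap and unary_union(inners).area / outer.area > t3:
        return seconds
    return [first] + seconds
```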
In the present disclosure, the target extraction result may first be filtered and then subjected to nested-structure processing, or first subjected to nested-structure processing and then filtered; either way, the final target identification result is obtained after both filtering and nested-structure processing.
By transferring the feature extraction idea commonly used in human posture estimation to feature extraction for target recognition, the artificial intelligence based target recognition method of the present disclosure obtains a feature map group containing comprehensive local position information and global context information, and regresses the four corner points of each target from the extracted feature map group. The quadrilateral extraction frame in the output target recognition result therefore fits the target more closely, the noise of the quadrilateral extraction frame is reduced, the recognition result is more accurate, and the recall rate and accuracy of target recognition are improved.
Referring to table 1, which compares the performance of the embodiment of the present disclosure with target recognition algorithms of the related art when applied to shop sign recognition.
TABLE 1
Model                      Accuracy   Recall   F1
Faster-RCNN                0.68       0.79     0.73
YOLOv3                     0.70       0.73     0.71
CornerNet                  0.79       0.84     0.81
The disclosed embodiments  0.84       0.87     0.85
Faster-RCNN, YOLOv3 and CornerNet are algorithms usable for target recognition in the related art. By borrowing the feature extraction approach of human posture estimation to extract the comprehensive local position information and global context information of the image to be recognized, and fusing the regression idea of CenterNet to regress the four corner points of the target, the present disclosure significantly improves target recognition accuracy and recall in the shop sign scene. The present disclosure may also be applied to the recognition of other targets; its usage scenario is not limited.
As described in the foregoing embodiment, the target identification method based on artificial intelligence provided in the embodiment of the present disclosure may be implemented by using a neural network, where the neural network is a target extraction network and includes a feature extraction network and a target prediction network that are sequentially connected, where the feature extraction network is configured to extract comprehensive local location information and global context information of the image to be identified, so as to obtain a feature map group corresponding to the image to be identified; the target prediction network is used for extracting targets based on the feature map group to obtain target extraction results, each target in the target extraction results correspondingly comprises four corner points, and the target is selected by a quadrilateral extraction frame determined by the four corner points.
The following describes a process of training a neural network.
Referring to fig. 11, a method of training a neural network is shown, the method comprising:
s10, a training sample set is obtained, and for each sample image in the training sample set, each marking target in the sample image comprises four marking angular points.
In one embodiment, in order to improve the training effect and enhance the generalization capability of the neural network, the number of sample images in the training sample set can be increased by performing image enhancement on the sample images. As shown in fig. 12, the acquiring of the training sample set includes:
s11, a first sample set is obtained, and for each first sample image in the first sample set, each labeling target in the first sample image comprises four labeling angular points.
And S12, carrying out image enhancement on at least one first sample image in the first sample set to obtain at least one second sample image corresponding to the first sample image.
Image enhancement in this disclosure includes, but is not limited to, any one of random scaling, random cropping, color perturbation, and the like, and combinations thereof. In a possible embodiment, when random scaling is performed, a value may be randomly selected from a preset scaling sequence as the scaling factor, and the first sample image is scaled accordingly to obtain the corresponding second sample image (see the sketch below). Illustratively, the scaling sequence may be [0.6, 0.7, 0.8, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4]; the disclosure does not limit the specific values of the scaling sequence.
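A non-limiting sketch of the random scaling described above; OpenCV's resize is an assumed convenience, and the labeled corner points are scaled together with the image so the annotations stay valid.

```python
import random
import cv2  # assumed convenience; any image library would serve

SCALES = [0.6, 0.7, 0.8, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4]  # preset sequence

def random_scale(image, corner_labels):
    # Randomly pick one value from the preset scaling sequence and resize the
    # image; the four labeled corner points of every target scale with it.
    s = random.choice(SCALES)
    h, w = image.shape[:2]
    scaled = cv2.resize(image, (int(w * s), int(h * s)))
    scaled_labels = [[(x * s, y * s) for (x, y) in quad] for quad in corner_labels]
    return scaled, scaled_labels
```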
And S13, generating the training sample set according to the first sample image set and each second sample image corresponding to each first sample image.
S20, training a target extraction network according to the training sample set until the loss generated by the target extraction network is smaller than a loss threshold value; the target extraction network comprises a feature extraction network and a target prediction network which are connected in sequence.
In one possible embodiment, a training sample set containing 16000 sample images can be obtained through image enhancement, and the target extraction network is trained on this set with its input resolution set to 512 × 512, the initial learning rate set to 0.00005, and the learning rate reduced by a factor of 10 at [90, 120] (a configuration sketch follows). The specific training parameters may be determined according to the actual training conditions and are not limited by this disclosure. The trained target extraction network may also be verified and tested, for example on 400 images.
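An illustrative sketch of this training configuration in PyTorch; the choice of the Adam optimizer and the reading of [90, 120] as scheduler milestones are assumptions not stated in the disclosure.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, 3)  # stand-in for the target extraction network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # initial rate 0.00005
# Divide the learning rate by 10 at milestones [90, 120].
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[90, 120], gamma=0.1)
```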
The loss generated by the target extraction network is calculated based on step S20. If the loss is less than the loss threshold, the training of the target extraction network meets the requirement and the network may be applied; if the loss is greater than or equal to the loss threshold, the parameters of the feature extraction network and/or the target prediction network within the target extraction network, such as convolution parameters, may be adjusted by feedback until the loss is less than the loss threshold. The loss threshold may be a value set according to requirements, such as 0.1, which is not a specific limitation of the present disclosure.
In one embodiment, as shown in fig. 13, the training of the target extraction network according to the training sample set includes:
and S21, inputting the sample image into the feature extraction network to obtain a feature map group corresponding to the sample image.
And S22, inputting the characteristic graph group into the target prediction network to obtain a first thermodynamic graph group, wherein the first thermodynamic graph group comprises a central point thermodynamic diagram of each target and thermodynamic diagrams of four corner points corresponding to each target.
And S23, obtaining a second thermal map group according to the sample image, wherein the second thermal map group comprises a central point thermodynamic map of each labeled target and thermodynamic maps of the four corner points of each labeled target.
And S24, calculating a loss value according to the first thermal map group and the second thermal map group.
In one embodiment, said calculating a loss value based on said first set of thermal maps and said second set of thermal maps comprises:
s241, respectively calculating a first loss, a second loss, a third loss, a fourth loss, a fifth loss and a sixth loss according to the first thermal diagram group and the second thermal diagram group;
s242, determining the loss value according to the first loss, the second loss, the third loss, the fourth loss, the fifth loss and the sixth loss;
the first loss is used for describing a thermodynamic loss of a center point, the second loss is used for describing a thermodynamic loss of corner points, the third loss is used for describing a shape loss of a quadrilateral extraction frame, the fourth loss is used for describing a center point offset loss generated due to image enhancement, the fifth loss is used for describing a corner point offset loss generated due to image enhancement, and the sixth loss is used for describing a corner point pairing loss.
Specifically, the first loss is determined by a central point thermodynamic diagram in the first thermodynamic diagram group and a corresponding central point thermodynamic diagram in the second thermodynamic diagram group;
the second loss is determined by thermodynamic diagrams of four corner points in the first thermodynamic diagram group and thermodynamic diagrams of four corresponding corner points in the second thermodynamic diagram group;
the third loss is determined by thermodynamic diagrams of four corner points in the first thermodynamic diagram group and thermodynamic diagrams of four corner points corresponding to the second thermodynamic diagram group;
the fourth penalty is used to describe the penalty due to center point shift due to image enhancement. Illustratively, if the target center point in the initial sample image before image enhancement is (5,5), the center point of the sample image after image enhancement should be (2.5 ), but since the pixels are all integers, the center point is shifted to (2,2), thereby generating the center point offset. By the same token, the fifth loss describes the loss due to corner point offset due to image enhancement. If the sample image is not obtained by image enhancement, the fourth loss and the fifth loss may be 0, and the present disclosure introduces the fourth loss and the fifth loss, aiming to correct errors introduced in the recognition result due to image enhancement.
The sixth loss is determined by the corner point thermodynamic diagrams in the first and second thermodynamic diagram groups and describes the loss resulting from pairing the corner points. If pairing fails because corner points are missing from the first thermal map group, a corresponding corner point pairing loss is produced.
The loss value of the present disclosure can be obtained by a weighted sum of the first loss, the second loss, the third loss, the fourth loss, the fifth loss and the sixth loss (a sketch follows); the weights can be determined according to actual needs and are not specifically limited by the present disclosure.
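A minimal sketch of the weighted summation, with hypothetical key names for the six losses; the weight values are placeholders, as the disclosure leaves them to actual needs.

```python
LOSS_NAMES = ("center", "corner", "shape", "center_offset",
              "corner_offset", "pairing")  # hypothetical keys for the six losses

def combine_losses(losses, weights):
    # Weighted sum of the six losses; the weight values are placeholders,
    # since the disclosure leaves them to actual needs.
    return sum(weights[name] * losses[name] for name in LOSS_NAMES)

total = combine_losses(
    {name: 0.0 for name in LOSS_NAMES},   # per-batch loss values
    {name: 1.0 for name in LOSS_NAMES})   # illustrative unit weights
```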
And S25, if the loss value is larger than or equal to the loss threshold value, feeding back and adjusting parameters of the target extraction network.
The embodiment of the present disclosure further discloses an artificial intelligence-based target identification apparatus, as shown in fig. 14, the apparatus includes:
the image to be recognized acquisition module 10 is used for acquiring an image to be recognized;
a feature extraction module 20, configured to extract the comprehensive local location information and the global context information of the image to be recognized, so as to obtain a feature map group corresponding to the image to be recognized;
a target extraction module 30, configured to perform target extraction based on the feature map group to obtain a target extraction result, where each target in the target extraction result correspondingly includes four corner points, and the target is selected by using a quadrilateral extraction frame determined by the four corner points;
and the target identification module 40 is used for outputting a target identification result according to the target extraction result.
The artificial intelligence based target identification device disclosed in the embodiments of the present disclosure and the corresponding method embodiments are based on the same inventive concept. For details, please refer to the method embodiments, which are not repeated here.
Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the artificial intelligence based object recognition method.
The disclosed embodiments also provide a computer-readable storage medium, which may store a plurality of instructions. The above-described instructions may be adapted to be loaded by a processor and to perform an artificial intelligence based object recognition method as described above in embodiments of the present disclosure.
Further, fig. 15 shows a hardware structure diagram of a device for implementing the method provided by the embodiments of the present disclosure; the device may participate in constituting or containing the apparatus or system provided by the embodiments. As shown in fig. 15, the device 10 may include one or more processors 102 (shown as 102a, 102b, …, 102n; the processors 102 may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission device 106 for communication functions. The device may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. Those skilled in the art will understand that the structure shown in fig. 15 is only an illustration and does not limit the structure of the electronic device. For example, device 10 may include more or fewer components than shown in fig. 15, or have a different configuration.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the device 10 (or mobile device). As referred to in the embodiments of the present disclosure, the data processing circuitry may act as a kind of processor control (e.g., selection of a variable-resistance termination path connected to an interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the methods described above in the embodiments of the present disclosure, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, so as to implement the above-described artificial intelligence-based object recognition method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 104 may further include memory located remotely from processor 102, which may be connected to device 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of such networks may include wireless networks provided by the communication provider of the device 10. In one example, the transmission device 106 includes a network adapter (NIC) that can be connected to other network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the device 10 (or mobile device).
It should be noted that the precedence order of the embodiments of the present disclosure is merely for description and does not represent the merits of the embodiments. Specific embodiments of the disclosure have been described above; other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible and may be advantageous.
The embodiments in the disclosure are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the device and server embodiments, since they are substantially similar to the method embodiments, the description is simple, and the relevant points can be referred to the partial description of the method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present disclosure and is not to be construed as limiting the present disclosure, but rather as the following claims are intended to cover all modifications, equivalents, and improvements falling within the spirit and scope of the present disclosure.

Claims (10)

1. An artificial intelligence based target identification method, characterized in that the method comprises:
acquiring an image to be identified;
extracting the comprehensive local position information and the global context information of the image to be recognized to obtain a feature map group corresponding to the image to be recognized;
performing target extraction based on the feature map group to obtain target extraction results, wherein each target in the target extraction results correspondingly comprises four corner points, and the target is framed and selected by a quadrilateral extraction frame determined by the four corner points;
and outputting a target identification result according to the target extraction result.
2. The method of claim 1, wherein outputting a target recognition result according to the target extraction result comprises:
calculating the overlapping degree of each quadrilateral extraction frame in the target extraction result;
filtering the target extraction result according to the overlapping degree to obtain a filtering result;
and outputting the target recognition result according to the filtering result.
3. The method according to claim 1 or 2, wherein outputting a target recognition result according to the target extraction result comprises:
if the target extraction result comprises at least one nested structure, processing the nested structure to obtain a processing result; the nested structure comprises a first extraction box at the outer layer and at least one second extraction box positioned inside the first extraction box; the first extraction frame and the second extraction frame are both quadrilateral extraction frames;
and outputting the target recognition result according to the processing result.
4. The method of claim 3, wherein the processing the nested structure to obtain a processing result comprises:
extracting the outer layer of the nested structure to obtain a first extraction frame;
extracting the inner layer of the nested structure to obtain at least one second extraction frame;
if the number of the second extraction frames is equal to 1, obtaining an intersection of the second extraction frame and the first extraction frame, and if the ratio of the intersection to the second extraction frame is greater than a preset first threshold, deleting a target extraction frame, wherein the target extraction frame is the extraction frame with the lower confidence of the second extraction frame and the first extraction frame;
if the number of the second extraction frames is greater than 1, deleting the first extraction frame if the overlapping degree of each adjacent second extraction frame is smaller than a preset second threshold and the ratio of the union of each second extraction frame to the first extraction frame is greater than a preset third threshold, and obtaining the processing result.
5. The method of claim 1, further comprising:
acquiring a training sample set, wherein each labeling target in the sample image comprises four labeling angular points for each sample image in the training sample set;
training a target extraction network according to the training sample set until the loss generated by the target extraction network is less than a loss threshold value; the target extraction network comprises a feature extraction network and a target prediction network which are sequentially connected;
the feature extraction network is used for extracting comprehensive local position information and global context information of the image to be identified to obtain a feature map group corresponding to the image to be identified; the target prediction network is used for extracting targets based on the feature map group to obtain target extraction results, each target in the target extraction results correspondingly comprises four corner points, and the target is selected by a quadrilateral extraction frame determined by the four corner points.
6. The method of claim 5, wherein the obtaining a training sample set comprises:
acquiring a first sample set, wherein each labeling target in the first sample image comprises four labeling angular points for each first sample image in the first sample set;
performing image enhancement on at least one first sample image in the first sample set to obtain at least one second sample image corresponding to the first sample image;
and generating the training sample set according to the first sample image set and the second sample images corresponding to the first sample images.
7. The method of claim 6, wherein training an object extraction network from the set of training samples comprises:
inputting a sample image into the feature extraction network to obtain a feature map group corresponding to the sample image;
inputting the characteristic map group into the target prediction network to obtain a first thermodynamic map group, wherein the first thermodynamic map group comprises a central point thermodynamic diagram of each target and thermodynamic diagrams of four corner points corresponding to each target;
obtaining a second thermal map set according to the sample image, wherein the second thermal map set comprises a central point thermodynamic map of each labeling target and thermodynamic maps of four corner points of each labeling target;
calculating a loss value according to the first thermal map set and the second thermal map set;
and if the loss value is greater than or equal to the loss threshold value, feeding back and adjusting parameters of the target extraction network.
8. The method of claim 7, wherein said calculating a loss value from said first set of thermal maps and said second set of thermal maps comprises:
calculating a first loss, a second loss, a third loss, a fourth loss, a fifth loss and a sixth loss according to the first thermal map set and the second thermal map set respectively;
determining the loss value according to the first loss, the second loss, the third loss, the fourth loss, the fifth loss and the sixth loss;
wherein the first loss is used for describing a thermodynamic loss of a center point, the second loss is used for describing a thermodynamic loss of a corner point, the third loss is used for describing a shape loss of a quadrilateral extraction frame, the fourth loss is used for describing a center point offset loss generated due to image enhancement, the fifth loss is used for describing a corner point offset loss generated due to image enhancement, and the sixth loss is used for describing a corner point pairing loss.
9. An artificial intelligence based object recognition apparatus, the apparatus comprising:
the image to be recognized acquisition module is used for acquiring an image to be recognized;
the characteristic extraction module is used for extracting the comprehensive local position information and the global context information of the image to be identified to obtain a characteristic graph group corresponding to the image to be identified;
the target extraction module is used for extracting targets based on the feature map group to obtain target extraction results, each target in the target extraction results correspondingly comprises four corner points, and the target is framed and selected by a quadrilateral extraction frame determined by the four corner points;
and the target identification module is used for outputting a target identification result according to the target extraction result.
10. A computer-readable storage medium, wherein at least one instruction or at least one program is stored in the computer-readable storage medium, and the at least one instruction or the at least one program is loaded by a processor and executed to implement an artificial intelligence based object recognition method according to any one of claims 1 to 8.
CN202011435671.2A 2020-12-10 2020-12-10 Target identification method and device based on artificial intelligence and storage medium Active CN112434715B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011435671.2A CN112434715B (en) 2020-12-10 2020-12-10 Target identification method and device based on artificial intelligence and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011435671.2A CN112434715B (en) 2020-12-10 2020-12-10 Target identification method and device based on artificial intelligence and storage medium

Publications (2)

Publication Number Publication Date
CN112434715A true CN112434715A (en) 2021-03-02
CN112434715B CN112434715B (en) 2022-07-22

Family

ID=74690982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011435671.2A Active CN112434715B (en) 2020-12-10 2020-12-10 Target identification method and device based on artificial intelligence and storage medium

Country Status (1)

Country Link
CN (1) CN112434715B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4948955A (en) * 1988-12-22 1990-08-14 The Boeing Company Barcode location determination
US20080063263A1 (en) * 2006-09-08 2008-03-13 Li Zhang Method for outlining and aligning a face in face processing of an image
CN106778737A (en) * 2016-11-24 2017-05-31 北京文安智能技术股份有限公司 License plate correction method and device, and video acquisition device
CN109344899A (en) * 2018-09-30 2019-02-15 百度在线网络技术(北京)有限公司 Multi-target detection method, device and electronic equipment
CN109409366A (en) * 2018-10-30 2019-03-01 四川长虹电器股份有限公司 Distorted image correction method and device based on Corner Detection
CN109753937A (en) * 2019-01-09 2019-05-14 宽凳(北京)科技有限公司 Nested target recognition method and device
CN110503097A (en) * 2019-08-27 2019-11-26 腾讯科技(深圳)有限公司 Training method and device of image processing model, and storage medium
CN110781836A (en) * 2019-10-28 2020-02-11 深圳市赛为智能股份有限公司 Human body recognition method and device, computer equipment and storage medium
CN111091091A (en) * 2019-12-16 2020-05-01 北京迈格威科技有限公司 Method, device and equipment for extracting target object re-identification features and storage medium
CN111160242A (en) * 2019-12-27 2020-05-15 上海眼控科技股份有限公司 Image target detection method, system, electronic terminal and storage medium
CN111931764A (en) * 2020-06-30 2020-11-13 华为技术有限公司 Target detection method, target detection framework and related equipment
CN111931877A (en) * 2020-10-12 2020-11-13 腾讯科技(深圳)有限公司 Target detection method, device, equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XINGGANG WANG et al.: "Point Linking Network for Object Detection", arXiv:1706.03646v2 *
XINGYI ZHOU et al.: "Bottom-up Object Detection by Grouping Extreme and Center Points", arXiv:1901.08043v3 *
XINGYI ZHOU et al.: "Objects as Points", arXiv:1904.07850v2 *
FANG ZHUOLIN: "Research on Pedestrian Detection Technology for Road Traffic Environments Based on YOLOv3", China Master's Theses Full-text Database, Engineering Science and Technology II *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115170932A (en) * 2022-07-22 2022-10-11 广州市玄武无线科技股份有限公司 Store terminal identification method, device and system and computer readable storage medium
CN115311241A (en) * 2022-08-16 2022-11-08 天地(常州)自动化股份有限公司 Underground coal mine pedestrian detection method based on image fusion and feature enhancement
CN115311241B (en) * 2022-08-16 2024-04-23 天地(常州)自动化股份有限公司 Underground coal mine pedestrian detection method based on image fusion and feature enhancement
CN116309587A (en) * 2023-05-22 2023-06-23 杭州百子尖科技股份有限公司 Cloth flaw detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112434715B (en) 2022-07-22

Similar Documents

Publication Publication Date Title
CN112434715B (en) Target identification method and device based on artificial intelligence and storage medium
Berman et al. Air-light estimation using haze-lines
CN109035304B (en) Target tracking method, medium, computing device and apparatus
CN108460362B (en) System and method for detecting human body part
CN110781756A (en) Urban road extraction method and device based on remote sensing image
CN112614085A (en) Object detection method and device and terminal equipment
KR20200100806A (en) Analysis of captured images to determine test results
CN111178355A (en) Seal identification method and device and storage medium
CN110222641B (en) Method and apparatus for recognizing image
CN107766864B (en) Method and device for extracting features and method and device for object recognition
CN111626241A (en) Face detection method and device
CN111368698A (en) Subject recognition method, subject recognition device, electronic device, and medium
CN113283439B (en) Intelligent counting method, device and system based on image recognition
CN113112511B (en) Method and device for correcting test paper, storage medium and electronic equipment
CN109241892A (en) Instrument panel reading method, instrument panel reading device and electronic equipment
CN109523570B (en) Motion parameter calculation method and device
CN111652168A (en) Group detection method, device and equipment based on artificial intelligence and storage medium
CN111611993A (en) Method and device for identifying volume of food in refrigerator and computer storage medium
CN110874814A (en) Image processing method, image processing device and terminal equipment
CN115019055A (en) Image matching method and device, intelligent equipment and storage medium
CN114494355A (en) Trajectory analysis method and device based on artificial intelligence, terminal equipment and medium
CN114663917A (en) Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device
CN109934045B (en) Pedestrian detection method and device
WO2024001847A1 (en) 2d marker, and indoor positioning method and apparatus
CN112927291B (en) Pose determining method and device of three-dimensional object, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant