CN112991451B - Image recognition method, related device and computer program product - Google Patents


Info

Publication number
CN112991451B
Authority
CN
China
Prior art keywords
target
dimensional
target object
initial key
key points
Prior art date
Legal status
Active
Application number
CN202110322600.XA
Other languages
Chinese (zh)
Other versions
CN112991451A (en)
Inventor
邹智康
叶晓青
孙昊
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110322600.XA
Publication of CN112991451A
Application granted
Publication of CN112991451B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

The present disclosure provides an image recognition method, apparatus, electronic device, computer-readable storage medium, and computer program product, relating to artificial intelligence fields such as computer vision, autonomous driving, and deep learning. One embodiment of the method comprises the following steps: acquiring initial key points of a target object in a two-dimensional image; generating, for each initial key point, a feature vector set composed of feature vectors pointing from non-initial key points to that initial key point; filtering the feature vectors in the feature vector set of each initial key point with a random sample consensus (RANSAC) algorithm; determining target key points from the feature vectors remaining after filtering; generating a target three-dimensional bounding box from the target key points; and recognizing parameter information of the target object with the target three-dimensional bounding box. This implementation optimizes key points based on non-key points, and thereby provides a high-quality three-dimensional bounding box for the target object in a variety of scenarios.

Description

Image recognition method, related device and computer program product
Technical Field
The present disclosure relates to the field of image processing technology, in particular to artificial intelligence fields such as computer vision, autonomous driving, and deep learning, and more particularly to an image recognition method, apparatus, electronic device, computer-readable storage medium, and computer program product.
Background
High-quality reconstruction of three-dimensional scenes has been one of the main frontiers of computer vision and computer graphics research for many years. To better reconstruct an object in a three-dimensional scene, multi-degree-of-freedom information must be acquired, such as the object's three-dimensional position, its length, width and height, and its orientation angle.
In the prior art, such multi-degree-of-freedom information is generally obtained with monocular three-dimensional detection. Monocular three-dimensional detection relies mainly on prior information about the three-dimensional bounding box: a data set is traversed in advance to generate three-dimensional candidate boxes, the input picture is then processed by a neural network to obtain a three-dimensional offset, and the offset is combined with a candidate box to obtain the real three-dimensional bounding box of the object, completing the three-dimensional detection task.
Disclosure of Invention
Embodiments of the present disclosure provide an image recognition method, apparatus, electronic device, computer readable storage medium, and computer program product.
In a first aspect, an embodiment of the present disclosure provides an image recognition method, including: acquiring a plurality of initial key points of a target object in a two-dimensional image, and generating, for each initial key point, a feature vector set composed of feature vectors pointing from non-initial key points to that initial key point, where the non-initial key points are the points in the two-dimensional image other than the initial key points; filtering the feature vectors in the feature vector set of each initial key point with a random sample consensus (RANSAC) algorithm, and determining target key points from the feature vectors remaining after filtering; and generating a target three-dimensional bounding box from the target key points, and recognizing parameter information of the target object with the target three-dimensional bounding box.
In a second aspect, an embodiment of the present disclosure provides an image recognition apparatus, including: a feature vector set generating unit configured to acquire a plurality of initial key points of a target object in a two-dimensional image and to generate, for each initial key point, a feature vector set composed of feature vectors pointing from non-initial key points to that initial key point, where the non-initial key points are the points in the two-dimensional image other than the initial key points; a key point determining unit configured to filter the feature vectors in the feature vector set of each initial key point with a random sample consensus (RANSAC) algorithm and to determine target key points from the feature vectors remaining after filtering; and a parameter information recognition unit configured to generate a target three-dimensional bounding box from the target key points and to recognize parameter information of the target object with the target three-dimensional bounding box.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions, when executed, enabling the at least one processor to implement the image recognition method described in any implementation of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions that, when executed, enable a computer to implement the image recognition method described in any implementation of the first aspect.
In a fifth aspect, embodiments of the present disclosure provide a computer program product comprising a computer program which, when executed by a processor, implements the image recognition method described in any implementation of the first aspect.
With the image recognition method, apparatus, electronic device, computer-readable storage medium, and computer program product provided by the embodiments of the present disclosure, after a plurality of initial key points of a target object in a two-dimensional image are acquired and a feature vector set composed of feature vectors pointing from non-initial key points to each initial key point is generated, the feature vectors in the feature vector set of each initial key point are filtered with a random sample consensus (RANSAC) algorithm, target key points are determined from the feature vectors remaining after filtering, a target three-dimensional bounding box is generated from the target key points, and parameter information of the target object is recognized with the target three-dimensional bounding box.
Building on key point estimation, the method and apparatus further use the vectors pointing from non-initial key points in the two-dimensional image to the initial key points to optimize the initial key points, obtaining target key points that are more accurate than the initial ones. This alleviates the problems of invisible key points and inaccurate key point estimates caused by, for example, object occlusion, improves the quality of the three-dimensional bounding box determined for the target object, and thus improves three-dimensional detection accuracy.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings:
FIG. 1 is an exemplary system architecture to which the present disclosure may be applied;
FIG. 2 is a flowchart of an image recognition method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of another image recognition method according to an embodiment of the present disclosure;
FIGS. 4-1, 4-2 and 4-3 are schematic diagrams of the effects of the image recognition method in an application scenario according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of an image recognition apparatus according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device adapted to perform an image recognition method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness. It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other.
In addition, in the technical solution of the present disclosure, if the target object is a human body, the acquisition (for example, of two-dimensional images containing faces or other human body information), storage, and application of the user's personal information all comply with the relevant laws and regulations and do not violate public order and good morals.
FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of the image recognition methods, apparatus, electronic devices, and computer-readable storage media of the present disclosure may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various applications for implementing information communication between the terminal devices 101, 102, 103 and the server 105, such as an image recognition application, a three-dimensional reconstruction application, an instant messaging application, and the like, may be installed on the terminal devices.
The terminal devices 101, 102, 103 and the server 105 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablets, laptop and desktop computers, etc.; when the terminal devices 101, 102, 103 are software, they may be installed in the above-listed electronic devices, which may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not particularly limited herein. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server; when the server is software, the server may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not particularly limited herein.
The server 105 can provide various services through built-in applications. Taking as an example an image recognition application that extracts three-dimensional parameter information of a target object from a two-dimensional image, the server 105 can achieve the following effects when running this application: first, a plurality of initial key points of a target object in a two-dimensional image are acquired from the terminal devices 101, 102, 103 over the network 104, and a feature vector set is generated, composed of feature vectors pointing from non-initial key points (the points in the two-dimensional image other than the initial key points) to each initial key point; next, the server 105 filters the feature vectors in the feature vector set of each initial key point with a random sample consensus (RANSAC) algorithm and determines target key points from the feature vectors remaining after filtering; finally, the server 105 generates a target three-dimensional bounding box from the target key points and recognizes the parameter information of the target object with it.
It should be noted that the two-dimensional image and the initial key points may be acquired from the terminal devices 101, 102, 103 via the network 104, or may be stored in advance in the server 105 in various ways. When the server 105 detects that such data (for example, a two-dimensional image and initial key points retained from before processing began) is already stored locally, it may choose to retrieve the data directly, in which case the exemplary system architecture 100 need not include the terminal devices 101, 102, 103 and the network 104.
Since the image recognition method requires considerable computing resources and computing power, it is generally executed by the server 105, which has both, and the image recognition apparatus is accordingly generally disposed in the server 105. However, when the terminal devices 101, 102, 103 also have the required computing capability and resources, they may complete the operations otherwise performed by the server 105 through the image recognition application installed on them, and output the same result as the server 105. In particular, when multiple terminal devices with different computing capabilities exist at the same time and the image recognition application determines that its host terminal device has strong computing capability and ample spare resources, the terminal device may perform the computation itself, appropriately relieving the computing pressure on the server 105; the image recognition apparatus may then be provided in the terminal devices 101, 102, 103. In this case, the exemplary system architecture 100 may also omit the server 105 and the network 104.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to fig. 2, fig. 2 is a flowchart of an image recognition method according to an embodiment of the present disclosure; the flow 200 includes the following steps:
Step 201, acquiring a plurality of initial key points of a target object in a two-dimensional image, and generating a feature vector set composed of feature vectors pointing from non-initial key points to each initial key point.
In this embodiment, the execution body of the image recognition method (for example, the server 105 shown in fig. 1) acquires a plurality of initial key points of a target object in a two-dimensional image. To determine the initial key points, the pixels belonging to the target object in the two-dimensional image may first be identified; the points used to generate an initial three-dimensional bounding box of the target object are then taken as initial key points. An initial key point may be, for example, a vertex of the three-dimensional bounding box of the target object, a three-dimensional coordinate point corresponding to the contour of the target object in three-dimensional space, or the center point of the target object. After the initial key points are obtained, feature vectors pointing from non-initial key points to the initial key points are generated in the two-dimensional image, where the non-initial key points are the points in the two-dimensional image other than the initial key points. Finally, a feature vector set is formed from the feature vectors pointing from the non-initial key points to each initial key point.
The initial key points may be determined by a key point feature extraction network built on algorithms such as the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), or difference of Gaussians (DoG), or determined from the two-dimensional image by an existing recognition algorithm. In practice, the key point feature extraction network, recognition algorithm, and so on may be trained in advance on two-dimensional images with manually annotated initial key points, so that they learn to recognize the initial key points.
For example, suppose an initial key point is located at coordinate p in the two-dimensional image and a non-initial key point is located at coordinate x_i, where the range of i depends on the number of pixels in the two-dimensional image (for example, if the two-dimensional image contains 255 pixels, i ranges over [1, 255]). The feature vector V_i pointing from x_i to p can then be defined as V_i = p − x_i (in practice often normalized to the unit vector (p − x_i)/‖p − x_i‖).
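This definition can be sketched in code as follows (an illustrative NumPy sketch with hypothetical names; the `normalize` flag reflects a common practical choice, not a requirement of the disclosure):

```python
import numpy as np

def keypoint_vector_field(pixels, keypoint, normalize=True):
    """For each non-key-point pixel x_i, compute the vector V_i = p - x_i
    pointing at the initial key point p (optionally normalized to unit length)."""
    pixels = np.asarray(pixels, dtype=float)      # shape (N, 2)
    p = np.asarray(keypoint, dtype=float)         # shape (2,)
    vectors = p - pixels                          # V_i = p - x_i
    if normalize:
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        norms[norms == 0] = 1.0                   # guard against zero-length vectors
        vectors = vectors / norms
    return vectors

# Three pixels voting for a key point at (4, 3)
field = keypoint_vector_field([[0, 0], [4, 0], [0, 3]], (4, 3))
print(np.round(field, 2))
```

Each row of the output is a unit direction from one pixel toward the key point; these rows are the elements of the feature vector set for that key point.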
In some embodiments of the present disclosure, when generating the feature vectors pointing from non-initial key points to an initial key point, a suitable selection range of non-initial key points may first be determined in the two-dimensional image around the initial key point, and the feature vectors for that initial key point may be generated only from the non-initial key points within the range, improving the efficiency of computing the feature vectors.
It should be understood that the form of the initial key points may differ depending on how the initial three-dimensional bounding box of the target object is generated. For example, when the vertices of a cuboid initial three-dimensional bounding box of the target object are taken as initial key points: if the target object can be read completely from the two-dimensional image, the initial key points may be determined directly as the 8 vertices of the cuboid box; if the target object is blocked by an obstacle in the two-dimensional image and cannot be read completely, some vertices of the cuboid box may be determined from the readable part of the target object, and, using the coordinates of the target object's center point, the vertices corresponding to the blocked part may be solved from the known vertices and the center point coordinates.
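For the cuboid case described above, solving the blocked vertices can be sketched as follows (a minimal illustration under the assumption that each occluded vertex's opposite vertex is visible; the helper name is hypothetical):

```python
import numpy as np

def recover_occluded_vertices(visible, center):
    """Recover cuboid vertices hidden by an obstacle.
    Opposite vertices of a cuboid are symmetric about its center c,
    so each occluded vertex is the reflection 2*c - v of a visible one."""
    visible = np.asarray(visible, dtype=float)   # shape (K, 3)
    c = np.asarray(center, dtype=float)          # shape (3,)
    return 2.0 * c - visible

# A unit cube centered at (0.5, 0.5, 0.5): reflecting the visible vertex
# (0, 0, 0) through the center recovers the opposite vertex (1, 1, 1).
print(recover_occluded_vertices([[0, 0, 0]], (0.5, 0.5, 0.5)))
```

This only recovers vertices whose diagonal opposites are visible; fully symmetric occlusion would need additional constraints.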
It should be noted that the two-dimensional image and the initial key points may be obtained by the execution body directly from a local storage device, or from a non-local storage device (for example, the terminal devices 101, 102, 103 shown in fig. 1). The local storage device may be a data storage module disposed within the execution body, such as a server hard disk, in which case the two-dimensional image and the initial key points can be read quickly; the non-local storage device may be any other electronic device arranged to store data, such as a user terminal, in which case the execution body may acquire the needed two-dimensional image and initial key points by sending an acquisition command to that device.
Furthermore, to improve the recognition efficiency of the initial key points, the two-dimensional image may be cropped around the target object according to a preset size parameter, reducing the image size and thereby reducing the number of non-initial key points and speeding up the subsequent generation of feature vectors.
Step 202, filtering the feature vectors in the feature vector set of each initial key point with a random sample consensus (RANSAC) algorithm, and determining target key points from the feature vectors remaining after filtering.
In this embodiment, the RANSAC algorithm is applied to the feature vectors associated with each initial key point to remove abnormal feature vectors from those obtained for that key point. The filtered feature vectors for the same initial key point are then aggregated, and from the aggregated result the initial key point they point to is determined as a target key point.
The basic assumption of the random sample consensus (RANSAC) algorithm is that a sample contains both correct data (data that can be described by a model) and abnormal data (data far outside the normal range that does not fit the model); in other words, the data set contains noise, and the abnormal data may arise from erroneous measurements, assumptions, or calculations. RANSAC further assumes that, given a set of correct data, there is a way to compute model parameters that fit those data, and it can thereby remove the abnormal data contained in the set.
In practice, feature vectors pointing into the same preset area may also be treated as feature vectors for the same initial key point.
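As a hedged illustration of this filtering step (the patent prescribes no implementation; the function names, iteration count, and cosine threshold below are assumptions), a minimal RANSAC vote for one key point could hypothesize a location from two randomly chosen vectors and keep the hypothesis that the most vectors agree with:

```python
import numpy as np

rng = np.random.default_rng(0)

def intersect(x1, v1, x2, v2):
    """Intersection of the 2-D lines x1 + t*v1 and x2 + s*v2."""
    A = np.array([[v1[0], -v2[0]], [v1[1], -v2[1]]])
    if abs(np.linalg.det(A)) < 1e-9:
        return None                      # parallel directions carry no vote
    t, _ = np.linalg.solve(A, np.asarray(x2) - np.asarray(x1))
    return np.asarray(x1) + t * np.asarray(v1)

def ransac_keypoint(pixels, vectors, iters=50, cos_thresh=0.99):
    """Filter the vector set with RANSAC: hypothesize a key point from two
    random unit vectors, count the vectors that point at it (inliers),
    and keep the best-supported hypothesis."""
    pixels = np.asarray(pixels, float)
    vectors = np.asarray(vectors, float)
    best, best_inliers = None, -1
    for _ in range(iters):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        q = intersect(pixels[i], vectors[i], pixels[j], vectors[j])
        if q is None:
            continue
        d = q - pixels
        d /= np.maximum(np.linalg.norm(d, axis=1, keepdims=True), 1e-9)
        inliers = int(np.sum(np.sum(d * vectors, axis=1) > cos_thresh))
        if inliers > best_inliers:
            best, best_inliers = q, inliers
    return best

# Four pixels voting for (4, 3), plus one outlier vote
pix = np.array([[0, 0], [8, 0], [0, 6], [8, 6], [2, 2]], float)
target = np.array([4.0, 3.0])
vec = target - pix
vec[-1] = [0.0, -1.0]                    # outlier: points away from the target
vec /= np.linalg.norm(vec, axis=1, keepdims=True)
print(ransac_keypoint(pix, vec))         # approximately (4, 3)
```

The abnormal vector is outvoted: any hypothesis built from two correct vectors collects four inliers, while hypotheses contaminated by the outlier collect fewer.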
Step 203, generating a target three-dimensional bounding box from the target key points, and recognizing parameter information of the target object with the target three-dimensional bounding box.
In this embodiment, a target three-dimensional bounding box is generated from the target key points determined in step 202, and parameters of the target object are extracted with it. These parameters are typically the coordinates of the target object's center point, the three-dimensional coordinates of its contour points, the shooting rotation angle of the target object in the two-dimensional image, and so on. In application scenarios such as three-dimensional scene reconstruction, after the target three-dimensional bounding box is determined, the three-dimensional coordinates of the target object's contour points are usually solved from the spatial coordinates of the box to complete the reconstruction of the target object in the three-dimensional scene.
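As a small sketch of reading parameters off the resulting box (the vertex ordering and helper name are assumptions, not the patent's convention; real code must match its own corner layout):

```python
import numpy as np

def box_parameters(corners):
    """Derive the center and length/width/height from the eight vertices of a
    3-D bounding box. Assumes a hypothetical ordering in which corners 0-1,
    0-3 and 0-4 span the length, width and height edges respectively."""
    corners = np.asarray(corners, dtype=float)   # shape (8, 3)
    center = corners.mean(axis=0)                # centroid of the 8 vertices
    length = np.linalg.norm(corners[1] - corners[0])
    width = np.linalg.norm(corners[3] - corners[0])
    height = np.linalg.norm(corners[4] - corners[0])
    return center, (length, width, height)

# Axis-aligned 4 x 2 x 1 box with one corner at the origin
c = np.array([[0, 0, 0], [4, 0, 0], [4, 2, 0], [0, 2, 0],
              [0, 0, 1], [4, 0, 1], [4, 2, 1], [0, 2, 1]], float)
center, dims = box_parameters(c)
print(center, dims)   # center = (2, 1, 0.5), dims = (4, 2, 1)
```

For a rotated box, the orientation angle could additionally be read off the direction of the length edge, but that is outside this minimal sketch.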
The image recognition method provided by this embodiment of the present disclosure builds on key point estimation by further using the vectors pointing from non-initial key points in the two-dimensional image to the initial key points to optimize the initial key points, obtaining target key points that are more accurate than the initial ones. This alleviates the problems of invisible key points and inaccurate key point estimates caused by, for example, object occlusion, improves the quality of the three-dimensional bounding box determined for the target object, and improves three-dimensional detection accuracy.
Referring to fig. 3, fig. 3 is a flowchart of another image recognition method according to an embodiment of the present disclosure; the flow 300 includes the following steps:
step 301, determining a minimum rectangular frame which can completely surround the target object in the two-dimensional image, and magnifying the minimum rectangular frame by a preset multiple to obtain an identification surrounding frame.
In this embodiment, the minimal rectangular frame that completely encloses the target object in the two-dimensional image is determined, its length and width are obtained and enlarged by a preset multiple, and the recognition bounding box is determined keeping the center of the minimal rectangular frame as its center.
Alternatively, the larger of the minimal rectangular frame's length and width may be taken, enlarged by a preset multiple, and the recognition bounding box determined from the enlarged result.
Step 302, centering on the target object, extracting a target image containing the target object from the two-dimensional image based on the size of the recognition bounding box.
In this embodiment, based on the recognition bounding box determined in step 301, the target image, with the same size as the recognition bounding box, is extracted from the two-dimensional image with the center of the target object as the center of the box.
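Steps 301-302 can be sketched as follows (a minimal illustration assuming NumPy, a binary mask marking the target's pixels, and a hypothetical `scale` multiple; the patent does not fix these details):

```python
import numpy as np

def crop_around_object(image, mask, scale=1.5):
    """Find the minimal rectangle enclosing the target (mask > 0), enlarge it
    by `scale` about its own center, and crop the image to that box,
    clamped to the image bounds."""
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    cy, cx = (y0 + y1) / 2.0, (x0 + x1) / 2.0      # center of the minimal rectangle
    half_h = (y1 - y0 + 1) * scale / 2.0           # enlarged half-height
    half_w = (x1 - x0 + 1) * scale / 2.0           # enlarged half-width
    top = max(int(np.floor(cy - half_h)), 0)
    bottom = min(int(np.ceil(cy + half_h)), image.shape[0])
    left = max(int(np.floor(cx - half_w)), 0)
    right = min(int(np.ceil(cx + half_w)), image.shape[1])
    return image[top:bottom, left:right]

img = np.arange(100).reshape(10, 10)
msk = np.zeros((10, 10), int)
msk[4:6, 4:6] = 1                  # 2 x 2 object centered at (4.5, 4.5)
print(crop_around_object(img, msk, scale=2.0).shape)
```

The crop keeps the object centered while doubling the enclosing rectangle, which reduces the number of non-initial key points processed downstream.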
Step 303, processing the target image with a feature extraction network to determine a plurality of initial key points of the target object in the two-dimensional image.
In this embodiment, the target image determined in step 302 is processed by a feature extraction network built on algorithms such as the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), or difference of Gaussians (DoG) to generate the initial key points of the target object in the two-dimensional image.
The feature extraction network may be trained in advance on two-dimensional images with manually annotated initial key points as training samples, so that the trained network can process the target image and generate the initial key points of the target object in the two-dimensional image.
Step 304, acquiring a plurality of initial key points of the target object in the two-dimensional image, and generating a feature vector set composed of feature vectors pointing from non-initial key points to each initial key point.
Step 305, filtering the feature vectors in the feature vector set of each initial key point with a random sample consensus (RANSAC) algorithm, and determining target key points from the feature vectors remaining after filtering.
Step 306, generating a target three-dimensional bounding box from the target key points, and recognizing parameter information of the target object with the target three-dimensional bounding box.
Steps 304-306 above are identical to steps 201-203 shown in fig. 2; for the identical parts, refer to the corresponding parts of the previous embodiment, which are not repeated here.
On the basis of the embodiment shown in fig. 2, this embodiment additionally extracts the portion of the two-dimensional image that contains the target object and determines the initial key points from that extracted portion. Compared with the embodiment of fig. 2, this reduces the computation needed to determine the initial key points and so improves the response speed of the image recognition method.
In some optional implementations of this embodiment, before the target image is processed with the feature extraction network, the method further includes: determining the three-dimensional shape of the initial three-dimensional bounding box of the target object according to the type information of the target object; and configuring the parameters of the feature extraction network according to the number of key points required to generate that three-dimensional shape.
Specifically, to improve the efficiency of determining the initial key points while still guaranteeing their quality, an initial three-dimensional external frame that fits the target object more closely can be chosen according to the type of the target object. The number of required initial key points and/or target key points is then re-determined from the number of vertices of this closer-fitting frame, and the parameters of the feature extraction network are configured accordingly. The configured network can thus extract exactly the required number of initial key points and/or target key points, so that the target three-dimensional external frame of the target object is determined more accurately and adaptively, and the computing resources of the feature extraction network are used more reasonably.
For example, when the spatial shape of the target object is similar to a cone, the three-dimensional shape of its circumscribed frame may be determined to be a pyramid with a rectangular base, which requires only 5 vertices (the four base corners plus the apex) as initial key points and/or target key points, instead of the 8 vertices required by a cuboid circumscribed frame. Reducing the number of initial key points and/or target key points in this way saves computing resources of the feature extraction network.
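A minimal sketch of this configuration step follows; the category names, the shape table, and the extra centre key point are illustrative assumptions rather than the patent's actual mapping:

```python
from dataclasses import dataclass

# Vertex counts follow from the geometry of each frame shape; which
# object category maps to which shape is an assumption for illustration.
SHAPE_VERTICES = {
    "cuboid": 8,        # generic box-like objects, e.g. cars
    "rect_pyramid": 5,  # cone-like objects: 4 base corners + 1 apex
}

CATEGORY_TO_SHAPE = {"car": "cuboid", "traffic_cone": "rect_pyramid"}

@dataclass
class ExtractorConfig:
    num_keypoints: int  # frame vertices plus one centre point

def configure_extractor(category: str) -> ExtractorConfig:
    """Re-determine how many key points the feature extraction network
    must output, given the type information of the target object."""
    shape = CATEGORY_TO_SHAPE.get(category, "cuboid")
    return ExtractorConfig(num_keypoints=SHAPE_VERTICES[shape] + 1)
```

For a cone-like object the configured network then outputs 6 key points (5 vertices plus the centre) instead of the 9 needed for a cuboid frame, which is the resource saving described above.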
In some optional implementations of this embodiment, the method further includes: comparing the difference between the target three-dimensional external frame generated based on the target key points and the initial three-dimensional external frame generated based on the initial key points to generate external three-dimensional frame difference information; and performing parameter optimization on the feature extraction network according to the external three-dimensional frame difference information.
Specifically, after the initial key points of the target object in the two-dimensional image are obtained, an initial three-dimensional external frame of the target object can be constructed from the initial key points. After the target key points are determined, difference information between the target three-dimensional external frame constructed from the target key points and the initial three-dimensional external frame constructed from the initial key points is obtained, and parameter optimization is performed, according to this difference information, on the feature extraction network used to determine the initial key points, thereby improving the quality of the initial key points it extracts.
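As a hedged sketch, the frame difference information might be reduced to a scalar such as the mean absolute per-vertex deviation — an assumed choice of difference measure, not one specified by the patent — which could then serve as a loss term when optimizing the parameters of the feature extraction network:

```python
import numpy as np

def frame_difference(target_frame, initial_frame):
    """Difference information between the target 3-D frame (built from
    the refined target key points) and the initial frame (built from
    the network's raw initial key points). Both are (N, 3) vertex
    arrays in the same vertex order; the returned scalar can drive
    parameter optimization of the feature extraction network."""
    t = np.asarray(target_frame, dtype=float)
    i = np.asarray(initial_frame, dtype=float)
    return float(np.abs(t - i).mean())
```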
On the basis of any of the above embodiments, to implement three-dimensional scene reconstruction, three-dimensional detection needs to acquire the coordinates of the center of the target object in the camera coordinate system, the actual length, width and height of the object, and the orientation angle of the target object. The parameter information therefore includes at least one of the following: coordinate information of the target object, real size information of the target object, and orientation angle information of the target object. This makes it convenient to directly extract parameter information of high utilization value for the subsequent three-dimensional scene reconstruction work, improving the efficiency of that work.
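Recovering exactly these three kinds of parameter information from the 8 vertices of a cuboid circumscribed frame can be sketched as follows; the vertex ordering (bottom face counter-clockwise, top face directly above it) and the yaw convention (heading measured in the camera's x-z ground plane) are assumptions:

```python
import numpy as np

def box_parameters(vertices):
    """Recover centre coordinates, real size, and orientation angle from
    the 8 vertices of a cuboid circumscribed frame in camera coordinates.
    Assumed vertex order: 0-3 bottom face counter-clockwise, 4-7 the top
    face directly above vertices 0-3."""
    v = np.asarray(vertices, dtype=float)
    centre = v.mean(axis=0)                    # coordinate information
    length = np.linalg.norm(v[1] - v[0])       # real size information
    width = np.linalg.norm(v[3] - v[0])
    height = np.linalg.norm(v[4] - v[0])
    edge = v[1] - v[0]                         # orientation angle (yaw)
    yaw = float(np.arctan2(edge[2], edge[0]))  # heading in the x-z plane
    return centre, (length, width, height), yaw
```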
To deepen understanding, the present disclosure further provides a specific implementation scheme in combination with a specific application scenario. As shown in fig. 4-1, taking a two-dimensional image containing an automobile partially blocked by a roadblock as the target object as an example, the specific process is as follows:
First, the initial key points of the target object in the two-dimensional image are obtained, including the center point of the automobile and the 8 vertices of the automobile's cuboid three-dimensional external frame, and feature vectors are generated from the non-initial key points in the two-dimensional image (all points other than the center point and those 8 vertices) to each initial key point. Fig. 4-2 shows an initial key point A blocked by the roadblock and some of the feature vectors pointing to it (represented by arrow symbols).
Second, the feature vectors pointing to the same initial key point are filtered by using a random sampling consistency algorithm, and the target key points are determined according to the filtered feature vectors. As shown in fig. 4-3, the target key points include the center point I of the automobile and the 8 vertices A, B, C, D, E, F, G and H of its cuboid three-dimensional circumscribed frame.
Third, the parameter information of the target object is identified by using the three-dimensional external frame generated according to the target key points.
Based on the key point estimation technique, the method and the device further use the vectors of the non-initial key points pointing to the initial key points in the two-dimensional image to optimize the initial key points, obtaining target key points that are more accurate than the initial key points. This overcomes the problems of invisible key points and inaccurate key point estimation caused by object occlusion and similar factors, improves the quality of the determined three-dimensional external frame of the target object, and improves the three-dimensional detection precision.
With further reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of an image recognition apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the image recognition apparatus 500 of the present embodiment may include: a feature vector set generating unit 501, a key point determining unit 502, and a parameter information identifying unit 503. The feature vector set generating unit 501 is configured to acquire a plurality of initial key points of a target object in a two-dimensional image, and generate a feature vector set formed by feature vectors of non-initial key points pointing respectively to each of the initial key points, where the non-initial key points are points in the two-dimensional image that are different from the initial key points; the key point determining unit 502 is configured to filter the feature vectors in the feature vector set of each of the initial key points by using a random sampling consistency algorithm, and determine a target key point according to the feature vectors in the filtered feature vector set; and the parameter information identifying unit 503 is configured to generate a target three-dimensional circumscribed frame according to the target key points, and identify parameter information of the target object by using the target three-dimensional circumscribed frame.
In the present embodiment, in the image recognition apparatus 500: the specific processing of the feature vector set generating unit 501, the key point determining unit 502, and the parameter information identifying unit 503 and the technical effects thereof may refer to the relevant descriptions of steps 201 to 203 in the corresponding embodiment of fig. 2, and are not repeated herein.
In some optional implementations of the present embodiment, the image recognition apparatus 500 further includes: a recognition bounding box determining unit configured to determine a minimum rectangular box that can completely enclose the target object in the two-dimensional image, and enlarge the minimum rectangular box by a preset multiple to obtain a recognition bounding box; a target image extraction unit configured to extract a target image containing the target object from the two-dimensional image based on the size of the recognition bounding box, centered on the target object; and an initial key point generating unit configured to process the target image by using the feature extraction network and determine a plurality of initial key points of the target object in the two-dimensional image.
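The units above amount to a single crop step, which might look as follows; the enlargement factor of 1.2 and the clamping to the image border are illustrative assumptions (the patent specifies only a preset multiple):

```python
import numpy as np

def extract_target_image(image, min_box, scale=1.2):
    """Enlarge the minimal rectangle that fully encloses the target by a
    preset multiple about its centre, then crop the two-dimensional
    image to that recognition bounding box (clamped to the image)."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = min_box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    bw, bh = (x1 - x0) * scale, (y1 - y0) * scale
    nx0 = max(0, int(round(cx - bw / 2)))
    ny0 = max(0, int(round(cy - bh / 2)))
    nx1 = min(w, int(round(cx + bw / 2)))
    ny1 = min(h, int(round(cy + bh / 2)))
    return image[ny0:ny1, nx0:nx1]
```

The cropped result is what the feature extraction network processes, so the per-image workload scales with the target rather than the full frame.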
In some optional implementations of the present embodiment, the image recognition apparatus 500 further includes: a three-dimensional frame determining unit configured to determine the three-dimensional shape of an initial three-dimensional circumscribed frame of the target object according to the type information of the target object; and a parameter configuration unit configured to configure parameters of the feature extraction network according to the number of key points required to generate the three-dimensional shape.
In some optional implementations of the present embodiment, the image recognition apparatus 500 further includes: a difference information acquisition unit configured to compare differences between the target three-dimensional circumscribed frame generated based on each of the target key points and the initial three-dimensional circumscribed frame generated based on each of the initial key points, and generate circumscribed three-dimensional frame difference information; and the neural network optimization unit is configured to perform parameter optimization on the feature extraction network according to the external three-dimensional frame difference information.
In some optional implementations of the present embodiment, the parameter information in the image recognition apparatus 500 includes at least one of the following: coordinate information of the target object, real size information of the target object, and orientation angle information of the target object.
Based on the key point estimation technique, the image recognition device provided by this embodiment further optimizes the initial key points by using the vectors of the non-initial key points pointing to the initial key points in the two-dimensional image, obtaining target key points that are more accurate than the initial key points. This overcomes the problems of invisible key points and inaccurate key point estimation caused by object occlusion and similar factors, improves the quality of the determined three-dimensional external frame of the target object, and improves the three-dimensional detection precision.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processes described above, such as an image recognition method. For example, in some embodiments, the image recognition method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the image recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the image recognition method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server (also called a cloud computing server or cloud host), a host product in the cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability in traditional physical host and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
According to the technical solution of the embodiments of the present disclosure, based on the key point estimation technique, the vectors of the non-initial key points pointing to the initial key points in the two-dimensional image are further used to optimize the initial key points, obtaining target key points that are more accurate than the initial key points. This overcomes the problems of invisible key points and inaccurate key point estimation caused by object occlusion and similar factors, improves the quality of the determined three-dimensional circumscribed frame of the target object, and improves the three-dimensional detection precision.
It should be appreciated that steps may be reordered, added, or deleted in the various flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed herein are achieved; no limitation is imposed here.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (12)

1. An image recognition method, comprising:
acquiring a plurality of initial key points of a target object in a two-dimensional image, and generating a feature vector set formed by feature vectors of non-initial key points pointing respectively to each initial key point, wherein the non-initial key points are points in the two-dimensional image that are different from the initial key points, and each feature vector is expressed as the ratio of the difference between the coordinates of the non-initial key point and the coordinates of the initial key point to the two-norm of that difference;
respectively filtering the feature vectors in the feature vector set of each initial key point by using a random sampling consistency algorithm, collecting, among the filtered feature vectors, the feature vectors directed at the same initial key point, and determining, according to the collected result, that same initial key point pointed to by the feature vectors as a target key point;
and generating a target three-dimensional external frame according to the target key points, and identifying parameter information of the target object by utilizing the target three-dimensional external frame.
2. The method of claim 1, comprising, prior to said acquiring a plurality of initial keypoints of a target object in a two-dimensional image:
determining a minimum rectangular frame that can completely surround the target object in the two-dimensional image, and enlarging the minimum rectangular frame by a preset multiple to obtain a recognition bounding box;
extracting a target image containing the target object from the two-dimensional image based on the size of the recognition bounding box by taking the target object as a center;
and processing the target image by utilizing a characteristic extraction network, and determining a plurality of initial key points of a target object in the two-dimensional image.
3. The method of claim 2, further comprising, prior to the processing of the target image with the feature extraction network:
determining the three-dimensional shape of an initial three-dimensional external frame of the target object according to the type information of the target object;
and configuring parameters of the feature extraction network according to the number of key points required to generate the three-dimensional shape.
4. The method of claim 2, further comprising:
comparing differences between the target three-dimensional external frames generated based on the target key points and the initial three-dimensional external frames generated based on the initial key points to generate external three-dimensional frame difference information;
and carrying out parameter optimization on the feature extraction network according to the external three-dimensional frame difference information.
5. The method of any of claims 1-4, wherein the parameter information comprises at least one of:
coordinate information of the target object, real size information of the target object, and orientation angle information of the target object.
6. An image recognition apparatus comprising:
a feature vector set generating unit configured to acquire a plurality of initial key points of a target object in a two-dimensional image, and generate a feature vector set composed of feature vectors of non-initial key points pointing respectively to each of the initial key points, the non-initial key points being points in the two-dimensional image different from the initial key points, and each feature vector being represented as the ratio of the difference between the coordinates of the non-initial key point and the coordinates of the initial key point to the two-norm of that difference;
a key point determining unit configured to filter the feature vectors in the feature vector set of each initial key point by using a random sampling consistency algorithm, collect, among the filtered feature vectors, the feature vectors directed at the same initial key point, and determine, according to the collected result, that same initial key point pointed to by the feature vectors as a target key point;
and the parameter information identification unit is configured to generate a target three-dimensional external frame according to the target key points and identify the parameter information of the target object by utilizing the target three-dimensional external frame.
7. The apparatus of claim 6, further comprising:
the recognition bounding box determining unit is configured to determine a minimum rectangular box which can completely enclose a target object in the two-dimensional image, and enlarge the minimum rectangular box by a preset multiple to obtain a recognition bounding box;
a target image extraction unit configured to extract a target image containing the target object from the two-dimensional image based on a size of the recognition bounding box centering on the target object;
and an initial key point generating unit configured to process the target image by using a feature extraction network and determine a plurality of initial key points of the target object in the two-dimensional image.
8. The apparatus of claim 7, further comprising:
a three-dimensional frame determination unit configured to determine the three-dimensional shape of an initial three-dimensional circumscribed frame of the target object according to the type information of the target object;
and a parameter configuration unit configured to configure parameters of the feature extraction network according to the number of key points required to generate the three-dimensional shape.
9. The apparatus of claim 7, further comprising:
a difference information acquisition unit configured to compare differences between a target three-dimensional circumscribed frame generated based on each of the target key points and an initial three-dimensional circumscribed frame generated based on each of the initial key points, and generate circumscribed three-dimensional frame difference information;
and a neural network optimization unit configured to perform parameter optimization on the feature extraction network according to the circumscribed three-dimensional frame difference information.
10. The apparatus according to any of claims 6-9, wherein the parameter information comprises at least one of:
coordinate information of the target object, real size information of the target object, and orientation angle information of the target object.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image recognition method of any one of claims 1-5.
12. A non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the image recognition method of any one of claims 1-5.
CN202110322600.XA 2021-03-25 2021-03-25 Image recognition method, related device and computer program product Active CN112991451B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110322600.XA CN112991451B (en) 2021-03-25 2021-03-25 Image recognition method, related device and computer program product


Publications (2)

Publication Number Publication Date
CN112991451A CN112991451A (en) 2021-06-18
CN112991451B true CN112991451B (en) 2023-08-04

Family

ID=76333692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110322600.XA Active CN112991451B (en) 2021-03-25 2021-03-25 Image recognition method, related device and computer program product

Country Status (1)

Country Link
CN (1) CN112991451B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110991468A (en) * 2019-12-13 2020-04-10 深圳市商汤科技有限公司 Three-dimensional target detection and intelligent driving method, device and equipment
CN112489102A (en) * 2020-11-30 2021-03-12 北京百度网讯科技有限公司 Three-dimensional reconstruction method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214980B (en) * 2017-07-04 2023-06-23 阿波罗智能技术(北京)有限公司 Three-dimensional attitude estimation method, three-dimensional attitude estimation device, three-dimensional attitude estimation equipment and computer storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Augmented reality system based on object recognition; Jia Qiong; Wang Minmin; Microcomputer Applications (06); full text *

Also Published As

Publication number Publication date
CN112991451A (en) 2021-06-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant