CN114627169A - Image processing method and device, electronic equipment and storage medium - Google Patents

Image processing method and device, electronic equipment and storage medium

Info

Publication number
CN114627169A
Authority
CN
China
Prior art keywords
image
feature
neural network
picture sample
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210226669.7A
Other languages
Chinese (zh)
Other versions
CN114627169B (en)
Inventor
吴文龙 (Wu Wenlong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210226669.7A priority Critical patent/CN114627169B/en
Publication of CN114627169A publication Critical patent/CN114627169A/en
Application granted granted Critical
Publication of CN114627169B publication Critical patent/CN114627169B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T 7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an image processing method, device, equipment and storage medium, and relates to the field of artificial intelligence computer vision. The method comprises the following steps: performing feature detection on the first image to obtain a first feature point and a first position of the first feature point in the first image; connecting the first image with the second image to obtain a third image, and extracting feature points of the third image to obtain a first feature map; inputting the first position and the first feature map into a neural network model so as to obtain a second position corresponding to the first position in the second image according to the relative position relation between the feature points of the first feature map; and determining a second feature point matched with the first feature point in the second image according to the second position. According to the embodiment of the application, the relative position relation between the feature points is fused, so that the accuracy of feature point matching can be improved.

Description

Image processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence computer vision, and more particularly, to methods, apparatuses, devices, and storage media for image processing.
Background
The image registration algorithm has wide application scenarios in industrial Artificial Intelligence (AI) quality inspection. In current industrial AI quality inspection platforms, industrial components, especially 3C components, are usually small and precisely structured, so the cameras are designed to shoot from multiple angles aimed at defect-prone positions. Specifically, each camera clearly photographs a fixed region of interest (ROI), while the remaining regions are relatively blurred and left to other cameras. The image registration method aligns the ROI of the captured image with the standard image, so that the subsequent defect contrast learning module can effectively locate and identify defects in the ROI.
Image registration is the process of matching two images acquired under different conditions (different times, different imaging devices, different angles, illumination, etc.). The image registration process can be roughly divided into 4 parts: feature point detection and description, feature point matching, outlier filtering, and pose estimation. Feature point detection and feature description obtain the position and descriptor of each feature point in an image; feature point matching obtains the matching relationship between the feature points of the two images according to their positions and descriptors. Feature point matching is a core part of the image registration algorithm.
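As a rough illustration of this conventional pipeline, the following sketch uses OpenCV (assumed available as cv2); the ORB detector, the BFMatcher, and RANSAC-based homography estimation stand in for the detection, description, matching, outlier filtering, and pose estimation stages, and are illustrative choices rather than the specific algorithms of this application.

```python
# A minimal sketch of the conventional registration pipeline described above,
# using OpenCV (assumed); the concrete detector/matcher choices are illustrative only.
import cv2
import numpy as np

def register(img_a, img_b):
    orb = cv2.ORB_create()                            # feature point detection + description
    kps_a, desc_a = orb.detectAndCompute(img_a, None)
    kps_b, desc_b = orb.detectAndCompute(img_b, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc_a, desc_b)           # matching by descriptor distance

    pts_a = np.float32([kps_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kps_b[m.trainIdx].pt for m in matches])

    # outlier filtering + pose (homography) estimation with RANSAC
    H, inlier_mask = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 5.0)
    return H, inlier_mask
```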
The currently common feature point matching methods mainly determine the feature point matching relationship according to the distance between feature point descriptors. Since such a method only considers descriptors, errors can occur when the descriptors are not accurate enough; for example, a descriptor may be less accurate when the image is distorted or the descriptor dimension is low. Therefore, how to improve the accuracy of feature point matching is an urgent problem to be solved.
Disclosure of Invention
The embodiment of the application provides an image processing method, device, equipment and storage medium, which can be helpful for improving the accuracy of feature point matching.
In a first aspect, a method for image processing is provided, including:
performing feature detection on a first image to obtain a first feature point and a first position of the first feature point in the first image;
connecting the first image and the second image to obtain a third image;
extracting feature points of the third image to obtain a first feature map;
inputting the first position and the first feature map into a neural network model to obtain a second position corresponding to the first position in the second image according to the relative position relation between feature points of the first feature map;
and determining a second feature point matched with the first feature point in the second image according to the second position.
In a second aspect, a method of training a model is provided, comprising:
performing feature detection on a first picture sample to obtain a third feature point and a third position of the third feature point in the first picture sample;
performing homography transformation on the first picture sample to obtain a second picture sample and a fourth position corresponding to the third position in the second picture sample;
connecting the first picture sample with the second picture sample to obtain a third picture sample;
performing feature extraction on the third picture sample to obtain a second feature map;
inputting the third position and the second feature map into a neural network model to obtain position information corresponding to the third position in the second picture sample according to the relative position relation between feature points of the second feature map;
and training the neural network model according to the position information corresponding to the third position and the fourth position to obtain the trained neural network model.
In a third aspect, an apparatus for image processing is provided, including:
the detection unit is used for carrying out feature detection on a first image to obtain a first feature point and a first position of the first feature point in the first image;
the connecting unit is used for connecting the first image and the second image to obtain a third image;
the feature extraction unit is used for extracting feature points of the third image to obtain a first feature map;
the neural network model is used for inputting the first position and the first feature map so as to obtain a second position corresponding to the first position in the second image according to the relative position relation between feature points of the first feature map;
and the determining unit is used for determining a second feature point matched with the first feature point in the second image according to the second position.
In a fourth aspect, an apparatus for training a model is provided, comprising:
the detection unit is used for carrying out feature detection on the first picture sample to obtain a third feature point and a third position of the third feature point in the first picture sample;
the transformation unit is used for carrying out homography transformation on the first picture sample to obtain a second picture sample and a fourth position corresponding to the third position in the second picture sample;
the connection unit is used for connecting the first picture sample and the second picture sample to obtain a third picture sample;
the feature extraction unit is used for extracting features of the third picture sample to obtain a second feature map;
a training unit, configured to input the third position and the second feature map into a neural network model, so as to obtain, according to a relative position relationship between feature points of the second feature map, position information corresponding to the third position in the second picture sample; and
and training the neural network model according to the position information corresponding to the third position and the fourth position to obtain the trained neural network model.
In a fifth aspect, the present application provides an electronic device, comprising:
a processor adapted to implement computer instructions; and,
a memory storing computer instructions adapted to be loaded by the processor and to perform the method of the first aspect described above, or the method of the second aspect.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium storing computer instructions, which, when read and executed by a processor of a computer device, cause the computer device to perform the method of the first aspect or the method of the second aspect.
In a seventh aspect, the present application provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. The computer instructions are read by a processor of the computer device from a computer readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the method of the first aspect, or the method of the second aspect, as described above.
According to the embodiment of the application, the relative position relationship between the feature points in the feature map of the third image can be fused, so that the obtained feature point descriptor can represent the relative position relationship between the feature points in the third image, for example, different position points are pulled farther apart, and the same position points are pulled closer, so that the limitation that the traditional algorithm only depends on the feature point descriptor for matching is avoided, and the accuracy of feature point matching can be improved.
Drawings
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application;
fig. 2 is a schematic diagram of a specific application scenario related to an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a method of image processing according to an embodiment of the present application;
fig. 4 is a schematic diagram of a network architecture according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a neural network model provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a self-attention mechanism;
FIG. 7 is a schematic flow chart diagram of a method of training a model in accordance with an embodiment of the present application;
fig. 8 is a schematic block diagram of an apparatus for image processing according to an embodiment of the present application;
FIG. 9 is a schematic block diagram of an apparatus for training a model according to an embodiment of the present disclosure;
fig. 10 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be understood that in the embodiments of the present application, "B corresponding to A" means that B is associated with A. In one implementation, B may be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
In the description of the present application, "at least one" means one or more, and "a plurality" means two or more, unless otherwise specified. In addition, "and/or" describes an association relationship between associated objects, meaning that there may be three relationships; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of singular or plural items. For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.
It should be further understood that the descriptions of the first, second, etc. appearing in the embodiments of the present application are only for illustrating and differentiating the objects, and do not represent a particular limitation to the number of devices in the embodiments of the present application, and do not constitute any limitation to the embodiments of the present application.
It should also be appreciated that a particular feature, structure, or characteristic described in connection with an embodiment is included in at least one embodiment of the application. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The scheme provided by the application can relate to artificial intelligence technology.
Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
It should be understood that the artificial intelligence technology is a comprehensive subject, and relates to a wide range of fields, namely a hardware technology and a software technology. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The embodiments of the present application may relate to the Computer Vision (CV) technology in artificial intelligence. Computer vision is a science that studies how to make machines "see", that is, using cameras and computers instead of human eyes to identify and measure targets and perform further image processing, so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also includes common biometric technologies such as face recognition and fingerprint recognition.
The embodiments of the present application may also relate to Machine Learning (ML) in artificial intelligence. Machine learning is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how a computer simulates or realizes human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
In addition, the scheme provided by the embodiment of the application can also relate to an image registration technology.
Fig. 1 is a schematic view of an application scenario related to an embodiment of the present application.
As shown in fig. 1, the system comprises an acquisition device 101, a computing device 102 and a display device 103. The acquisition device 101 is used for acquiring images, such as images to be registered. The computing device 102 is used for processing the images acquired by the acquisition device 101, for example for image registration. The display device 103 is used for displaying the registered image processed by the computing device 102.
Illustratively, the computing device 102 may be a user device, such as a mobile phone, a tablet computer, a notebook computer, a palm computer, a Mobile Internet Device (MID), or another terminal device capable of installing a browser.
Illustratively, the computing device 102 may be a server. The server may be one or more. When the number of the servers is multiple, at least two servers exist for providing different services, and/or at least two servers exist for providing the same service, for example, the same service is provided in a load balancing manner, which is not limited in the embodiment of the present application. The server can be provided with a neural network model, and the server provides support for the training and application process of the neural network model. The server can also be provided with an image processing device for registering images, and the server provides support for the application process of the image processing device.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like. Servers may also become nodes of a blockchain.
Illustratively, when the computing device 102 has display functionality, the display device 103 may be a display in the computing device 102.
Illustratively, the display device 103 may be a device different from the computing device 102, and the display device 103 is connected to the computing device 102 via a network. The network may be a wireless or wired communication network such as an Intranet, the Internet, a Global System for Mobile Communications (GSM) network, Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
By way of example, application scenarios of the present application include, but are not limited to, industrial AI quality inspection. Fig. 2 shows a schematic diagram of the system architecture of an industrial AI quality inspection platform 200, which includes an image registration module 210 and a defect contrast learning module 220. For example, the image registration module 210 may apply an image registration algorithm to register the raw images of a component taken by multiple cameras (for example, the camera 1, the camera 2 …, and the camera 20) from different angles, so as to obtain a registration map; the defect contrast learning module 220 may perform defect contrast learning on the registration map and the standard image of the component to obtain a defect comparison result map, so as to determine whether the component has a defect.
Feature point matching is a core part of the image registration algorithm. Commonly used feature point matching algorithms include the BruteForce matching algorithm and the FLANN matching algorithm, where BruteForce is a brute-force search algorithm and FLANN is a nearest-neighbor algorithm. The BruteForce matching algorithm generally gives better matching results than the FLANN matching algorithm.
The working principle of the BruteForce matching algorithm, i.e., the brute-force search algorithm, is as follows: the feature point positions and descriptors of the two images are obtained based on a feature point detection algorithm; a feature point is selected from the first image, its descriptor distance to each feature point of the second image is tested in turn, and the feature point with the smallest distance is returned. A descriptor is a vector representing a point, and generally describes information such as the position and neighborhood of the point. Descriptors of different points should differ sufficiently, while descriptors of the same point are usually closer.
In general, the BruteForce algorithm performs cross-validation: the best match (i, j) is returned only if the i-th feature point in A is closest to the descriptor of the j-th feature point in B, and the j-th feature point in B is also closest to the i-th feature point in A (no other point in A is closer to j). That is, the two feature points must match each other.
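The mutual nearest-neighbour rule can be sketched directly over two descriptor matrices as follows; the array shapes and the L2 distance are assumptions for illustration, not a prescription of this application.

```python
# A sketch of the cross-check (mutual nearest neighbour) rule described above.
import numpy as np

def cross_check_match(desc_a, desc_b):
    """desc_a: (n, d) descriptors of image A; desc_b: (m, d) descriptors of image B."""
    # pairwise L2 distances between all descriptor pairs
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)  # (n, m)
    nn_a_to_b = dists.argmin(axis=1)   # for each i in A, closest j in B
    nn_b_to_a = dists.argmin(axis=0)   # for each j in B, closest i in A
    # keep (i, j) only if i and j are each other's nearest neighbour
    return [(i, j) for i, j in enumerate(nn_a_to_b) if nn_b_to_a[j] == i]
```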
In the feature point matching method based on the BruteForce matching algorithm, the matching accuracy is mainly limited by the accuracy of the feature point descriptors. Once the image distortion is severe, the descriptors generated from local image regions become rather inaccurate, which seriously affects the final feature point matching effect.
In view of this, embodiments of the present application provide an image processing method, an image processing apparatus, an image processing device, and a storage medium, which can improve a feature point matching effect.
Specifically, in the embodiments of the present application, feature detection may first be performed on a first image to obtain a first feature point and a first position of the first feature point in the first image; then, the first image and a second image are connected to obtain a third image, and feature point extraction is performed on the third image to obtain a first feature map; next, the first position and the first feature map are input into a neural network model to obtain a second position corresponding to the first position in the second image according to the relative position relationship between the feature points of the first feature map; finally, a second feature point matched with the first feature point in the second image may be determined according to the second position.
According to the embodiment of the application, the relative position relationship between the feature points in the feature map of the third image can be fused, so that the obtained feature point descriptor can represent the relative position relationship between the feature points in the third image, for example, different position points are pulled farther apart, and the same position points are pulled closer, so that the limitation that the traditional algorithm only depends on the feature point descriptor for matching is avoided, and the accuracy of feature point matching can be improved.
The technical solutions provided by the embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 3 shows a schematic flow diagram of a method 300 of image processing according to an embodiment of the present application. The method 300 of image processing may be performed by any electronic device with data processing capability; for example, the electronic device may be implemented as a server, or as the computing device 102 in fig. 1 or the image registration module 210 in fig. 2, which is not limited in this application. Illustratively, the method 300 may be used for image registration.
In some embodiments, a machine learning model may be included (e.g., deployed) in the computing device, and the machine learning model may be used to perform the method 300 of image processing. By way of example, the machine learning model may be, without limitation, a deep learning model, a neural network model, or another model. Fig. 4 is a schematic diagram of a network architecture provided in an embodiment of the present application, which includes a feature point detection unit 410, a connection unit 420, a feature extraction unit 430, and a neural network model 440. The steps in the method 300 will be described below in conjunction with the network architecture in fig. 4.
It should be understood that fig. 4 illustrates an example of a machine learning model for image processing, which is merely to assist those skilled in the art to understand and implement the embodiments of the present application, and does not limit the scope of the embodiments of the present application. Equivalent alterations and modifications may be made by those skilled in the art based on the examples given herein, and such alterations and modifications are intended to be within the scope of the embodiments of the present application.
As shown in fig. 3, the method 300 of image processing may include steps 310 to 350.
And 310, performing feature detection on the first image to obtain a first feature point and a first position of the first feature point in the first image.
Illustratively, as shown in fig. 4, the first image may be a picture of a component taken by a camera, and may include, for example, a first ROI of the picture of the component. Illustratively, the size of the first image may be represented as (3, h, w), where 3 represents the three RGB (red, green, blue) channels of the first image, and h and w represent the height and width of the first image, respectively.
For example, the feature point detection unit 410 may perform feature detection on the first image to obtain a first feature point and a first position of the first feature point in the first image. As a specific example, the first feature point may include a corner point or an edge point of the component, which is not limited. As a specific example, the first position may be a coordinate (x1, y1), where 0 ≤ x1 ≤ h and 0 ≤ y1 ≤ w.
For example, the feature point detection unit 410 may be a conventional or deep-learning feature point detection network, which is not limited. As an example, the first feature points may include n feature points, whose positions form an (n, 2) array, where n is a positive integer.
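A minimal sketch of step 310 follows, assuming an OpenCV corner detector stands in for the feature point detection unit 410; the detector choice and its parameters are illustrative only.

```python
# A sketch of step 310, assuming a simple corner detector as the feature point
# detection unit 410; the detector and parameters are illustrative assumptions.
import cv2
import numpy as np

def detect_feature_points(first_image, max_points=500):
    """first_image: (h, w, 3) BGR array; returns an (n, 2) array of (x1, y1) positions."""
    gray = cv2.cvtColor(first_image, cv2.COLOR_BGR2GRAY)
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=max_points,
                                      qualityLevel=0.01, minDistance=5)
    # shape (n, 1, 2) -> (n, 2): n feature points, each a 2-D position
    return corners.reshape(-1, 2)
```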
And 320, connecting the first image and the second image to obtain a third image.
For example, referring to fig. 4, the first image and the second image may be connected (connected) by the connection unit 420, so as to obtain a third image. Here, connection may also be replaced by splicing, merging, and the like, all of which mean the same or similar. Illustratively, the first image and the second image are the same size, e.g., both are (3, h, w), and the size of the third image obtained by combining the first image and the second image in columns may be (3, h, 2 w).
For example, the second image may be an image that needs to be registered with the first image; it may also be a picture of the component taken by a camera and may include, for example, a second ROI of the picture of the component. The component in the first image is the same as the component in the second image, except that the first image and the second image are two images acquired under different conditions (e.g., different times, different imaging devices, different angles, and/or different lighting conditions). For example, in the application scenario shown in fig. 2, the first image may be a picture of a component taken by the camera 1, and the second image may be a picture of the same component taken by the camera 2.
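The connection of step 320 can be sketched as follows, assuming the images are tensors of identical size and that the "connection" is implemented as concatenation along the width (column) dimension.

```python
# A sketch of step 320: concatenating the first and second images by columns.
import torch

def connect_images(first_image: torch.Tensor, second_image: torch.Tensor) -> torch.Tensor:
    """first_image, second_image: (3, h, w) tensors -> third image of size (3, h, 2w)."""
    assert first_image.shape == second_image.shape
    return torch.cat([first_image, second_image], dim=-1)   # concatenate along the width
```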
And 330, extracting the feature points of the third image to obtain a first feature map.
For example, referring to fig. 4, feature point extraction may be performed on the third image by the feature extraction unit 430 to obtain the first feature map. The first feature map may include a two-dimensional feature of each feature point, the two-dimensional feature characterizing position information of the feature point in the third image. By way of example, the feature extraction unit 430 may be a Convolutional Neural Network (CNN), for example a conventional feature extraction backbone such as resnet50, which is not limited. Optionally, after feature point extraction is performed on the third image, upsampling may be performed to obtain the first feature map with the same size as the input third image, for example (d, h, 2w), where d is the number of channels of the first feature map.
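A possible sketch of the feature extraction unit 430 is given below, assuming a torchvision resnet50 backbone followed by a 1×1 projection and bilinear upsampling; the channel count d and the projection layer are assumptions, not the patent's exact design.

```python
# A sketch of step 330, assuming a resnet50 backbone (torchvision >= 0.13 API)
# as the feature extraction unit 430; d and the 1x1 projection are illustrative.
import torch
import torch.nn as nn
import torchvision

class FeatureExtractor(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        # keep layers up to the last convolutional stage (drop avgpool / fc)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, d, kernel_size=1)          # map to d channels

    def forward(self, third_image: torch.Tensor) -> torch.Tensor:
        """third_image: (B, 3, h, 2w) -> first feature map: (B, d, h, 2w)."""
        feat = self.proj(self.backbone(third_image))
        # upsample back to the input resolution, as described above
        return nn.functional.interpolate(feat, size=third_image.shape[-2:],
                                         mode="bilinear", align_corners=False)
```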
And 340, inputting the first position and the first feature map into a neural network model, so as to obtain a second position corresponding to the first position in the second image according to the relative position relationship between the feature points of the first feature map. As a specific example, the second position may be a coordinate (x2, y2), where 0 ≤ x2 ≤ h and 0 ≤ y2 ≤ w.
Illustratively, referring to fig. 4, the first location and the first feature map may be input to a neural network model 440, and the second location may be output by the neural network model 440. Illustratively, as shown in fig. 4, the neural network model 440 may include an encoder 441 and a decoder 442.
In some alternative embodiments, the first feature map may be input into the encoder 441 to encode the first feature map and obtain full map information of the first feature map; the full map information and the first position are then input into the decoder 442, so that the first position is decoded according to the full map information to obtain the second position. The full map information includes the relative position relationship between the feature points in the first feature map (i.e., the position correlation of each feature point within the third image). Since the feature points in the full map information are fused with the position correlation of the whole third image, the feature point descriptor obtained in this way can characterize the relative position relationship between the feature points in the third image, for example, pulling different position points (i.e., non-corresponding points in the first image and the second image) farther apart and pulling the same position points (i.e., corresponding points in the first image and the second image) closer together.
As a possible implementation, referring to fig. 5, the encoder 441 may include a first position encoding unit 502 and a first multi-head attention mechanism 501. The first multi-head attention mechanism 501 is used to perform an attention operation. Referring to fig. 5, the first feature map may be input into the first position encoding unit 502 to obtain position information of the feature points of the first feature map. Optionally, after the position information of the feature points of the first feature map is obtained, the position information and the feature points of the first feature map may be represented as vectors to obtain a first representation vector. As an implementation of inputting the position information and the feature points of the first feature map into the first multi-head attention mechanism 501, the first representation vector may be input into the first multi-head attention mechanism 501 to obtain the full map information. Illustratively, the full map information may be represented as an encoding matrix output by the first multi-head attention mechanism 501.
It should be noted that, when an attention operation is performed on the first feature map, the first feature map needs to be sequenced, for example, the two-dimensional image is stretched into a one-dimensional sequence, which may cause the position information of each point in the first feature map to be lost. As a specific example, the position coordinate (xi, yi) of a feature point in the two-dimensional image can be represented as (yi × h + xi) after stretching into a one-dimensional sequence, thereby losing the position information (xi, yi). To avoid losing the position information of the points in the first feature map, before the first feature map is input into the first multi-head attention mechanism, position encoding may be performed on the first feature map by the first position encoding unit to obtain the relative distances between the feature points.
For example, the first position encoding unit may be a sinusoidal encoder, which is not limited in this application. Specifically, the sinusoidal encoder may perform position encoding according to the following formulas:

PE(pos, 2i) = sin( pos / 10000^(2i / d_model) )

PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )

where pos represents the serial number of a token obtained after splitting the input image into blocks (tokens); for example, when there are 250 tokens after tokenization, pos is 250 for the 250th token. i corresponds to the embedding size: if the embedding size of the token embedding layer is 4, i indexes the i-th element of the embedding vector; for example, if a token is embedded as [0.1, 0.15, 0.12, 0.03], the element at i = 0 is 0.1 and the element at i = 1 is 0.15. Since 2i and 2i + 1 are used to distinguish the even and odd columns of the embedding matrix, the maximum value of i is embedding_size / 2; for example, when the embedding size is 4, the maximum value of i is 2. d_model represents the embedding size.
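The sinusoidal position encoding above can be sketched as follows; an even d_model is assumed, and the token ordering is illustrative.

```python
# A sketch of the sinusoidal position encoding defined by the formulas above;
# seq_len corresponds to the number of tokens and d_model to the embedding size.
import torch

def sinusoidal_position_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # assumes an even d_model
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even indices 2i
    div = torch.pow(10000.0, i / d_model)                            # 10000^(2i/d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)      # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(pos / div)      # PE(pos, 2i+1)
    return pe
```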
As one possible implementation, with continued reference to fig. 5, the decoder 442 may include a second position encoding unit 504 and a second multi-head attention mechanism 503. The second multi-head attention mechanism 503 is used to perform an attention operation. Referring to fig. 5, the first position may be input into the second position encoding unit 504 to obtain position information of the first position. Optionally, after the position information of the first position is obtained, the position information may be represented as a vector to obtain a second representation vector. As an implementation of inputting the position information of the first position and the full map information into the second multi-head attention mechanism 503, the second representation vector and the full map information output by the first multi-head attention mechanism may be input into the second multi-head attention mechanism 503 to decode the position information of the first position based on the full map information. Since the full map information can draw the same point closer, the second position corresponding to the first position in the second image can be obtained; that is, the first position and the second position may correspond to the same point in the third image.
Illustratively, the multi-head attention mechanism may consist of a plurality of self-attention mechanisms (self-attention). Fig. 6 shows a schematic diagram of the self-attention mechanism, where the matrices Q, K, and V are the query, key, and value, respectively, Softmax denotes the normalized exponential function, and MatMul denotes matrix multiplication. The output of the self-attention mechanism represents the similarity between the elements of Q and K, and may be expressed, for example, as follows:
Attention(Q, K, V) = softmax( QK^T / sqrt(d_k) ) V

where d_k represents the number of columns of the Q and K matrices, i.e., the vector dimension, T represents the matrix transpose, and softmax( QK^T / sqrt(d_k) ) V represents the matrix multiplication of the Softmax() output with V.
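A direct sketch of this scaled dot-product attention is given below; the tensor shapes are illustrative.

```python
# A sketch of the scaled dot-product attention in the formula above.
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # QK^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)              # row-wise softmax
    return weights @ V                                    # weighted sum of the values
```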
For example, in the first multi-head attention mechanism, the inputs Q, K, and V are all the same, and may be, for example, the position information of the feature points of the first feature map together with the vector representation of those feature points. The first multi-head attention mechanism can split the first feature map into a plurality of parts (for example, 4 or 8 parts along the channel dimension) to perform the self-attention operation, so that the correlation of each point in the first feature map with every other point can be calculated, and richer feature information of the first feature map can be captured.
For example, after the self-attention operation, the feature points in the first feature map are fused with the position correlation of the whole third image, so that the obtained feature point descriptor can characterize the relative position relationship between the feature points, for example, pulling the same points (i.e., corresponding points in the first image and the second image) closer together and pulling different points (non-corresponding points in the first image and the second image) farther apart.
In the second multi-head attention mechanism, the input Q is, for example, the vector representation of the first position, and the inputs K and V are, for example, the encoded information output by the first multi-head attention mechanism, i.e., the above-mentioned full map information. Decoding the input Q through the second multi-head attention mechanism yields the content of the corresponding query. For example, when a first position in the first image is input, the full map information can draw the same point closer, so that the second position (i.e., the same point) in the second image corresponding to the first position can be obtained.
In some alternative embodiments, the neural network model may include a Transformer structure, which is not limited in this application.
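A minimal sketch of how the encoder 441 and decoder 442 could be assembled from standard Transformer components is given below; the flattening of the first feature map into a token sequence, the layer sizes, the position-query embedding, and the output head are all assumptions for illustration rather than the patent's exact architecture.

```python
# A minimal sketch of a neural network model 440 built from standard Transformer
# components; layer sizes, token flattening and the output head are assumptions only.
import torch
import torch.nn as nn

class PositionQueryModel(nn.Module):
    def __init__(self, d=256, nhead=8, num_layers=4):
        super().__init__()
        self.transformer = nn.Transformer(d_model=d, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.pos_embed = nn.Linear(2, d)    # embed an (x, y) position as a query token
        self.head = nn.Linear(d, 2)         # map decoder output back to a 2-D position

    def forward(self, feature_map, first_positions):
        """feature_map: (B, d, h, 2w); first_positions: (B, n, 2) -> predicted (B, n, 2)."""
        tokens = feature_map.flatten(2).transpose(1, 2)      # (B, h*2w, d) token sequence
        # (sinusoidal position encoding of the tokens would be added here)
        queries = self.pos_embed(first_positions)            # (B, n, d)
        decoded = self.transformer(src=tokens, tgt=queries)  # encoder self-attn + decoder cross-attn
        return self.head(decoded)                             # second positions in the second image
```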
And 350, determining a second feature point matched with the first feature point in the second image according to the second position.
According to the embodiment of the application, the relative position relationship between the feature points in the feature map of the third image can be fused, so that the obtained feature point descriptor can represent the relative position relationship between the feature points in the third image, for example, different position points are pulled farther apart, and the same position points are pulled closer, so that the limitation that the traditional algorithm only depends on the feature point descriptor for matching is avoided, and the accuracy of feature point matching can be improved.
In some alternative embodiments, before step 340, a model may be trained to obtain the neural network model. The training data of the neural network model includes the feature points and feature point position information of a first picture sample. For example, the first picture sample may be subjected to homography transformation to obtain a second picture sample, and the neural network model may then be trained according to the first picture sample and the second picture sample to obtain the trained neural network model.
Illustratively, the homography transformation may be a mapping of one plane to another. The homography matrix may be a 3 × 3 matrix with 8 degrees of freedom, which may be determined by 4 point pairs, which is not limited in this application.
For example, referring to fig. 4, the first picture sample may be transformed by the homography transformation unit 450 to obtain the feature points and feature point position information of the second picture sample. The feature points in the first picture sample match the feature points in the second picture sample, and the feature point position information in the first picture sample likewise corresponds to the feature point position information in the second picture sample. That is to say, in the embodiments of the present application, the neural network model may be trained according to the feature points and feature point position information of the first picture sample and the position information of the matching feature points in the second picture sample, so as to obtain the trained neural network model.
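A possible sketch of this training-sample generation is given below, assuming OpenCV; the random corner-jitter scheme used to build the homography is an illustrative choice.

```python
# A sketch of generating a training pair by homography transformation; the jitter
# scheme and parameter values are illustrative assumptions.
import cv2
import numpy as np

def make_training_pair(first_sample, third_positions, jitter=30):
    """first_sample: (h, w, 3) image; third_positions: (n, 2) feature point positions."""
    h, w = first_sample.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + np.random.uniform(-jitter, jitter, src.shape).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)                  # 3x3 homography from 4 point pairs
    second_sample = cv2.warpPerspective(first_sample, H, (w, h))
    # fourth positions: the third positions mapped through the same homography
    fourth_positions = cv2.perspectiveTransform(
        third_positions.reshape(-1, 1, 2).astype(np.float32), H).reshape(-1, 2)
    return second_sample, fourth_positions
```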
Fig. 7 shows a schematic flow chart of a method 400 for training a model according to an embodiment of the present application. The method 400 for training a model may be performed by any electronic device with data processing capability; for example, the electronic device may be implemented as a server, or as the computing device 102 in fig. 1, which is not limited in this application. As shown in fig. 7, the method 400 includes steps 410 to 460.
In some embodiments, a machine learning model may be included (e.g., deployed) in the computing device and may be used to perform the method 400 of training a model. By way of example, the machine learning model may be, without limitation, a deep learning model, a neural network model, or another model. Illustratively, the machine learning model may be the network architecture in fig. 4; when this network architecture is used to train the model, it further includes the homography transformation unit 450 and the loss unit 460. The steps in the method 400 will be described below in conjunction with the network architecture in fig. 4.
And 410, performing feature detection on the first picture sample to obtain a third feature point and a third position of the third feature point in the first picture sample.
And 420, performing homography transformation on the first picture sample to obtain a second picture sample and a fourth position corresponding to the third position in the second picture sample.
For example, the first picture sample may be transformed by the homography transformation unit 450 in fig. 4 to obtain the second picture sample and the fourth position. For example, the homography can be referred to the description above, and the description is omitted.
430, connecting the first picture sample and the second picture sample to obtain a third picture sample.
And 440, performing feature extraction on the third picture sample to obtain a second feature map.
Specifically, the operations on the first picture sample and the second picture sample in steps 410 to 440 may refer to the operations on the first image and the second image in steps 310 to 330 in fig. 3, and are not described herein again.
And 450, inputting the third position and the second feature map into a neural network model to obtain position information corresponding to the third position in the second picture sample according to the relative position relationship between the feature points of the second feature map.
For example, the neural network model may be the neural network model in fig. 4, and the process of obtaining the location information corresponding to the third location in step 450 is similar to the process of obtaining the second location in fig. 3, and reference may be made to the description in step 340 in fig. 3, which is not repeated here.
And 460, training the neural network model according to the position information corresponding to the third position and the fourth position to obtain the trained neural network model.
Here, the fourth position may be input to the neural network model as a true value corresponding to the third position, so as to implement training of the model parameters.
For example, referring to fig. 4, the loss calculating unit 460 may calculate the loss between the position information output by the decoder 442 (i.e., the position of the fourth feature point in the second picture sample that matches the third feature point) and the fourth position, and the parameters of the neural network model may then be updated according to the loss, for example, by a back-propagation gradient algorithm.
In some alternative embodiments, when an encoder and a decoder are included in the neural network model, the dimension of the query result output by the decoder is related to the input dimensions of the encoder and the decoder, and may be, for example, (n, d). Since the dimension of the fourth position, i.e., of the input ground-truth value, is (n, 2), the output of the decoder may be mapped from (n, d) to (n, 2) through a Multi-Layer Perceptron (MLP) network; the loss unit 460 can then calculate the L2 loss and train the whole network according to the L2 loss. The MLP comprises an input layer, an output layer, and several intermediate hidden layers, with full connections between adjacent layers.
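A minimal sketch of such an MLP head and L2 loss is given below; the hidden size and the use of mean squared error as the L2 loss are assumptions.

```python
# A sketch of the output head and L2 loss described above; the hidden size is illustrative.
import torch
import torch.nn as nn

class PositionHead(nn.Module):
    """Maps the decoder output from (n, d) to (n, 2) predicted positions."""
    def __init__(self, d=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, decoder_out):           # (n, d) -> (n, 2)
        return self.mlp(decoder_out)

def l2_loss(pred_positions, fourth_positions):
    """pred_positions, fourth_positions: (n, 2); mean squared (L2) position error."""
    return nn.functional.mse_loss(pred_positions, fourth_positions)
```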
The present application is not limited to the details of the above embodiments, and various simple modifications can be made to the technical solutions of the present application within its technical concept; these simple modifications all fall within the protection scope of the present application. For example, the various features described in the foregoing detailed description may be combined in any suitable manner without contradiction; to avoid unnecessary repetition, the possible combinations are not described separately in this application. For another example, the various embodiments of the present application may be combined with each other arbitrarily, and such combinations should also be regarded as the disclosure of the present application as long as they do not depart from the concept of the present application.
It should also be understood that, in the various method embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply any order of execution, and the order of execution of the processes should be determined by their functions and inherent logic, and should not constitute any limitation to the implementation processes of the embodiments of the present application. It is to be understood that the numerical designations are interchangeable under appropriate circumstances such that the embodiments of the application described are capable of operation in sequences other than those illustrated or described herein.
Method embodiments of the present application are described above in detail, and apparatus embodiments of the present application are described below in conjunction with fig. 8-10.
Fig. 8 is a schematic block diagram of an apparatus 600 for image processing according to an embodiment of the present application. As shown in fig. 8, the apparatus 600 for image processing may include a detection unit 610, a connection unit 620, a feature extraction unit 630, a neural network model 640, and a determination unit 650.
A detecting unit 610, configured to perform feature detection on a first image to obtain a first feature point and a first position of the first feature point in the first image;
a connecting unit 620, configured to connect the first image and the second image to obtain a third image;
a feature extraction unit 630, configured to perform feature point extraction on the third image to obtain a first feature map;
a neural network model 640, configured to input the first position and the first feature map, so as to obtain a second position in the second image, where the second position corresponds to the first position, according to a relative position relationship between feature points of the first feature map;
a determining unit 650, configured to determine, according to the second position, a second feature point in the second image, where the second feature point matches the first feature point.
Optionally, the neural network model 640 includes an encoder and a decoder, and the neural network model 640 is specifically configured to:
inputting the first feature map into the encoder to encode the first feature map to obtain full map information of the first feature map, wherein the full map information includes a relative position relationship between feature points of the first feature map;
and inputting the full map information and the first position into the decoder so as to decode the first position according to the full map information to obtain the second position.
Optionally, the encoder includes a first position encoding unit and a first multi-headed attention mechanism.
Optionally, the first position encoding unit is specifically configured to: inputting the first feature map and outputting position information of feature points of the first feature map;
the first multi-head attention mechanism is specifically configured to: and inputting the position information of the feature points and the feature points of the first feature map, and outputting the full map information.
Optionally, the decoder includes a second position encoding unit and a second multi-headed attention mechanism.
Optionally, the second position encoding unit is specifically configured to: inputting the first position and outputting position information of the first position;
the second multi-headed attention mechanism is specifically configured to: inputting the position information of the first position and outputting the second position.
Optionally, the neural network model includes a Transformer structure.
Optionally, the training data of the neural network model includes feature points and feature point position information of the first picture sample.
Optionally, the apparatus 600 further comprises a training unit, configured to:
performing homography transformation on the first picture sample to obtain a second picture sample;
and training the neural network model according to the first picture sample and the second picture sample to obtain the trained neural network model.
Optionally, the first image includes a first region of interest ROI of the picture of the component, and the second image includes a second ROI of the picture of the component.
It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the apparatus 600 for image processing in this embodiment may correspond to a corresponding main body that executes the method 300 in this embodiment, and the foregoing and other operations and/or functions of each module in the apparatus 600 are respectively for implementing each method in the foregoing or a corresponding flow in each method, and are not described herein again for brevity.
Fig. 9 is a schematic block diagram illustrating an apparatus 700 for training a model according to an embodiment of the present application. As shown in fig. 9, the apparatus 700 includes a detection unit 710, a transformation unit 720, a connection unit 730, a feature extraction unit 740, and a training unit 750.
A detecting unit 710, configured to perform feature detection on a first picture sample to obtain a third feature point and a third position of the third feature point in the first picture sample;
a transforming unit 720, configured to perform homography transformation on the first picture sample to obtain a second picture sample and a fourth position corresponding to the third position in the second picture sample;
a connecting unit 730, configured to connect the first picture sample and the second picture sample to obtain a third picture sample;
a feature extraction unit 740, configured to perform feature extraction on the third picture sample to obtain a second feature map;
a training unit 750, configured to input the third position and the second feature map into a neural network model, so as to obtain position information corresponding to the third position in the second picture sample according to a relative position relationship between feature points of the second feature map; and
train the neural network model according to the position information corresponding to the third position and the fourth position, to obtain the trained neural network model.
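The flow implemented by the units 710-750 can be illustrated by the following end-to-end training-step sketch, assuming PyTorch and OpenCV; the corner detector, the convolutional backbone, the simple regressor standing in for the Transformer-based model, and the mean-squared-error loss are illustrative assumptions rather than the exact components of this application.

```python
# A minimal training-step sketch for the flow above, assuming PyTorch and OpenCV.
# The detector, backbone, regressor and loss are illustrative stand-ins.
import cv2
import numpy as np
import torch
import torch.nn as nn

backbone = nn.Sequential(                       # feature extraction unit (stand-in CNN)
    nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 256, 3, stride=8, padding=1))
matcher = nn.Linear(256 + 2, 2)                 # stand-in for the Transformer-based model
optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(matcher.parameters()), lr=1e-4)

def training_step(first_sample):
    h, w = first_sample.shape[:2]
    # 1. Feature detection: third feature points and their third positions.
    gray = cv2.cvtColor(first_sample, cv2.COLOR_BGR2GRAY)
    third_pos = cv2.goodFeaturesToTrack(gray, 32, 0.01, 10).reshape(-1, 2)
    # 2. Homography transformation: second picture sample and fourth positions.
    H = np.float32([[1, 0.02, 5], [0.01, 1, -3], [0, 0, 1]])  # assumed fixed warp for brevity
    second_sample = cv2.warpPerspective(first_sample, H, (w, h))
    fourth_pos = cv2.perspectiveTransform(third_pos.reshape(-1, 1, 2), H).reshape(-1, 2)
    # 3. Connect the two picture samples along the channel dimension: third picture sample.
    third_sample = np.concatenate([first_sample, second_sample], axis=2)
    x = torch.from_numpy(third_sample).float().permute(2, 0, 1).unsqueeze(0) / 255.0
    # 4. Feature extraction: second feature map, pooled here to one descriptor per pair.
    feat = backbone(x).mean(dim=(2, 3))                       # (1, 256)
    # 5. Predict, for each third position, its position in the second picture sample,
    #    and train against the fourth positions given by the homography.
    q = torch.from_numpy(third_pos).float()                   # (N, 2)
    pred = matcher(torch.cat([feat.expand(q.shape[0], -1), q], dim=1))
    loss = nn.functional.mse_loss(pred, torch.from_numpy(fourth_pos).float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(training_step(np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)))
```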
It is to be understood that apparatus embodiments and method embodiments may correspond to one another, and similar descriptions may refer to the method embodiments. To avoid repetition, details are not repeated here. Specifically, the apparatus 700 for training the model in this embodiment may correspond to the subject that executes the method 400 in this embodiment, and the foregoing and other operations and/or functions of the modules in the apparatus 700 are intended to implement the corresponding flows of the methods described above; for brevity, they are not described here again.
The apparatus and system of embodiments of the present application are described above in connection with the drawings from the perspective of functional modules. It should be understood that the functional modules may be implemented by hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, or other storage medium known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps in the above method embodiments in combination with hardware thereof.
Fig. 10 is a schematic block diagram of an electronic device 800 provided in an embodiment of the present application.
As shown in fig. 10, the electronic device 800 may include:
a memory 810 and a processor 820, the memory 810 being configured to store a computer program and to transfer the program code to the processor 820. In other words, the processor 820 may call and execute a computer program from the memory 810 to implement the method in the embodiment of the present application.
For example, the processor 820 may be configured to perform the steps of the method 300 or 400 according to instructions in the computer program.
In some embodiments of the present application, the processor 820 may include, but is not limited to:
general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like.
In some embodiments of the present application, the memory 810 includes, but is not limited to:
volatile memory and/or non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In some embodiments of the present application, the computer program may be partitioned into one or more modules, which are stored in the memory 810 and executed by the processor 820 to perform the methods provided herein. The one or more modules may be a series of computer program instruction segments capable of performing particular functions, the instruction segments describing the execution of the computer program in the electronic device 800.
Optionally, the electronic device 800 may further include:
a transceiver 830, the transceiver 830 being connectable to the processor 820 or the memory 810.
The processor 820 may control the transceiver 830 to communicate with other devices, and specifically, may transmit information or data to the other devices or receive information or data transmitted by the other devices. The transceiver 830 may include a transmitter and a receiver. The transceiver 830 may further include one or more antennas.
It should be understood that the various components in the electronic device 800 are connected by a bus system that includes a power bus, a control bus, and a status signal bus in addition to a data bus.
According to an aspect of the present application, there is provided a communication device comprising a processor and a memory, the memory being configured to store a computer program, the processor being configured to call and execute the computer program stored in the memory, so that the communication device performs the method of the above method embodiments.
According to an aspect of the present application, there is provided a computer storage medium having a computer program stored thereon, which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. In other words, the present application also provides a computer program product containing instructions, which when executed by a computer, cause the computer to execute the method of the above method embodiments.
According to another aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of the above-described method embodiment.
In other words, when implemented in software, the above methods may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Video Disc (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the module is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that can be easily conceived by a person skilled in the art within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. A method of image processing, comprising:
performing feature detection on a first image to obtain a first feature point and a first position of the first feature point in the first image;
connecting the first image and the second image to obtain a third image;
extracting feature points of the third image to obtain a first feature map;
inputting the first position and the first feature map into a neural network model so as to obtain a second position corresponding to the first position in the second image according to the relative position relation between feature points of the first feature map;
and determining a second feature point matched with the first feature point in the second image according to the second position.
2. The method according to claim 1, wherein the inputting the first position and the first feature map into a neural network model to obtain a second position corresponding to the first position in the second image according to a relative position relationship between feature points of the first feature map comprises:
inputting the first feature map into an encoder of the neural network model to encode the first feature map to obtain full map information of the first feature map, wherein the full map information comprises relative position relations between feature points of the first feature map;
and inputting the full-image information and the first position into a decoder of the neural network model so as to decode the first position according to the full-image information to obtain the second position.
3. The method of claim 2, wherein the encoder comprises a first position encoding unit and a first multi-head attention mechanism.
4. The method of claim 3, wherein inputting the first feature map into an encoder of the neural network model to encode the first feature map to obtain full map information of the first feature map comprises:
inputting the first feature map into the first position encoding unit to obtain position information of feature points of the first feature map;
and inputting the position information of the feature points and the feature points of the first feature map into the first multi-head attention mechanism to obtain the full map information.
5. The method of any of claims 2-4, wherein the decoder comprises a second position encoding unit and a second multi-head attention mechanism.
6. The method of claim 5, wherein inputting the full map information and the first location into a decoder of the neural network model to decode the first location according to the full map information to obtain the second location comprises:
inputting the first position into the second position coding unit to obtain position information of the first position;
and inputting the position information of the first position into the second multi-head attention mechanism to obtain the second position.
7. The method of any one of claims 1-4, wherein the neural network model comprises a Transformer structure.
8. The method according to any one of claims 1-4, wherein the training data of the neural network model comprises feature points and feature point position information of the first picture sample.
9. The method of claim 8, wherein prior to inputting the first location and the first feature map into a neural network model, further comprising:
performing homography transformation on the first picture sample to obtain a second picture sample;
and training the neural network model according to the first picture sample and the second picture sample to obtain the trained neural network model.
10. The method according to any one of claims 1 to 4, wherein the first image comprises a first region of interest, ROI, of a picture of a component and the second image comprises a second ROI of the picture of the component.
11. A method of training a neural network model, comprising:
performing feature detection on a first picture sample to obtain a third feature point and a third position of the third feature point in the first picture sample;
performing homography transformation on the first picture sample to obtain a second picture sample and a fourth position corresponding to the third position in the second picture sample;
connecting the first picture sample with the second picture sample to obtain a third picture sample;
performing feature extraction on the third picture sample to obtain a second feature map;
inputting the third position and the second feature map into a neural network model to obtain position information corresponding to the third position in the second picture sample according to the relative position relation between feature points of the second feature map;
and training the neural network model according to the position information corresponding to the third position and the fourth position to obtain the trained neural network model.
12. An apparatus for image processing, comprising:
the detection unit is used for carrying out feature detection on a first image to obtain a first feature point and a first position of the first feature point in the first image;
the connecting unit is used for connecting the first image and the second image to obtain a third image;
the feature extraction unit is used for extracting feature points of the third image to obtain a first feature map;
the neural network model is used for inputting the first position and the first feature map so as to obtain a second position corresponding to the first position in the second image according to the relative position relation between feature points of the first feature map;
and the determining unit is used for determining a second feature point matched with the first feature point in the second image according to the second position.
13. An apparatus for training a neural network model, comprising:
the detection unit is used for carrying out feature detection on the first picture sample to obtain a third feature point and a third position of the third feature point in the first picture sample;
the transformation unit is used for carrying out homography transformation on the first picture sample to obtain a second picture sample and a fourth position corresponding to the third position in the second picture sample;
the connection unit is used for connecting the first picture sample and the second picture sample to obtain a third picture sample;
the feature extraction unit is used for extracting features of the third picture sample to obtain a second feature map;
a training unit, configured to input the third position and the second feature map into a neural network model, so as to obtain, according to a relative position relationship between feature points of the second feature map, position information corresponding to the third position in the second picture sample; and
and training the neural network model according to the position information corresponding to the third position and the fourth position to obtain the trained neural network model.
14. An electronic device comprising a processor and a memory, the memory having stored therein instructions that, when executed by the processor, cause the processor to perform the method of any of claims 1-11.
15. A computer storage medium for storing a computer program comprising instructions for performing the method of any one of claims 1-11.
16. A computer program product, comprising computer program code which, when run by an electronic device, causes the electronic device to perform the method of any of claims 1-11.
CN202210226669.7A 2022-03-09 2022-03-09 Image processing method, device, electronic equipment and storage medium Active CN114627169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210226669.7A CN114627169B (en) 2022-03-09 2022-03-09 Image processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210226669.7A CN114627169B (en) 2022-03-09 2022-03-09 Image processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114627169A true CN114627169A (en) 2022-06-14
CN114627169B CN114627169B (en) 2024-09-10

Family

ID=81899606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210226669.7A Active CN114627169B (en) 2022-03-09 2022-03-09 Image processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114627169B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170337682A1 (en) * 2016-05-18 2017-11-23 Siemens Healthcare Gmbh Method and System for Image Registration Using an Intelligent Artificial Agent
CN111798498A (en) * 2020-07-16 2020-10-20 上海商汤智能科技有限公司 Image processing method and device, electronic equipment and storage medium
CN112837356A (en) * 2021-02-06 2021-05-25 湖南大学 WGAN-based unsupervised multi-view three-dimensional point cloud joint registration method
CN113139990A (en) * 2021-05-08 2021-07-20 电子科技大学 Depth grid stream robust image alignment method based on content perception
CN113158831A (en) * 2021-03-30 2021-07-23 北京爱笔科技有限公司 Method and device for detecting movement of camera equipment, computer equipment and storage medium
CN113763442A (en) * 2021-09-07 2021-12-07 南昌航空大学 Deformable medical image registration method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170337682A1 (en) * 2016-05-18 2017-11-23 Siemens Healthcare Gmbh Method and System for Image Registration Using an Intelligent Artificial Agent
CN111798498A (en) * 2020-07-16 2020-10-20 上海商汤智能科技有限公司 Image processing method and device, electronic equipment and storage medium
CN112837356A (en) * 2021-02-06 2021-05-25 湖南大学 WGAN-based unsupervised multi-view three-dimensional point cloud joint registration method
CN113158831A (en) * 2021-03-30 2021-07-23 北京爱笔科技有限公司 Method and device for detecting movement of camera equipment, computer equipment and storage medium
CN113139990A (en) * 2021-05-08 2021-07-20 电子科技大学 Depth grid stream robust image alignment method based on content perception
CN113763442A (en) * 2021-09-07 2021-12-07 南昌航空大学 Deformable medical image registration method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GUHA BALAKRISHNAN et al.: "VoxelMorph: A Learning Framework for Deformable Medical Image Registration", arXiv:1809.05231v3 [cs.CV], 1 September 2019 (2019-09-01), pages 1-16 *

Also Published As

Publication number Publication date
CN114627169B (en) 2024-09-10

Similar Documents

Publication Publication Date Title
CN112766244B (en) Target object detection method and device, computer equipment and storage medium
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
CN111797882B (en) Image classification method and device
CN113537254B (en) Image feature extraction method and device, electronic equipment and readable storage medium
CN111695673B (en) Method for training neural network predictor, image processing method and device
CN111368943A (en) Method and device for identifying object in image, storage medium and electronic device
US20220392201A1 (en) Image feature matching method and related apparatus, device and storage medium
CN114612987A (en) Expression recognition method and device
CN112668532A (en) Crowd counting method based on multi-stage mixed attention network
CN117079355A (en) Object image fake identifying method and device and electronic equipment
CN116597260A (en) Image processing method, electronic device, storage medium, and computer program product
CN114677350A (en) Connection point extraction method and device, computer equipment and storage medium
Huang et al. Dynamic sign language recognition based on CBAM with autoencoder time series neural network
JP2024515873A (en) Posture Parser
CN113159053A (en) Image recognition method and device and computing equipment
CN114627169B (en) Image processing method, device, electronic equipment and storage medium
CN114038067B (en) Coal mine personnel behavior detection method, equipment and storage medium
CN116206175A (en) Pre-training method, determining method, device and product of scene analysis model
CN117011416A (en) Image processing method, device, equipment, medium and program product
CN117036658A (en) Image processing method and related equipment
CN114281933A (en) Text processing method and device, computer equipment and storage medium
CN114118203A (en) Image feature extraction and matching method and device and electronic equipment
Yang et al. Robust feature mining transformer for occluded person re-identification
CN114639132A (en) Feature extraction model processing method, device and equipment in face recognition scene
Zhang et al. Estimation of 3D human pose using prior knowledge

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant