CN115170893B - Training method of common-view gear classification network, image sorting method and related equipment - Google Patents

Training method of common-view gear classification network, image sorting method and related equipment

Info

Publication number
CN115170893B
CN115170893B (application number CN202211038622.4A)
Authority
CN
China
Prior art keywords
image
query
common
local
view
Prior art date
Legal status
Active
Application number
CN202211038622.4A
Other languages
Chinese (zh)
Other versions
CN115170893A (en)
Inventor
Cao Pengrui (曹鹏蕊)
Current Assignee
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date
Filing date
Publication date
Application filed by Honor Device Co Ltd
Priority claimed from CN202211038622.4A
Publication of CN115170893A
Application granted
Publication of CN115170893B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 10/764 — Image or video recognition or understanding using pattern recognition or machine learning; using classification, e.g. of video objects
    • G06V 10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/46 — Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; salient regional features
    • G06V 10/757 — Matching configurations of points or features
    • G06V 10/761 — Proximity, similarity or dissimilarity measures
    • G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training method of a common-view gear classification network, an image sorting method and related equipment, and relates to the field of image processing. The training method of the common-view gear classification network comprises the following steps: acquiring a query training image and an image database; extracting local features of the query training image and a comparison image; determining effective image blocks and effective common-view image blocks in the query training image and the comparison image according to the local features; determining a true value gear between the query training image and the comparison image according to the effective image blocks and the effective common-view image blocks; determining a predicted gear between the query training image and the comparison image by using the common-view gear classification network; and training the common-view gear classification network based on the true value gear and the predicted gear to obtain the trained common-view gear classification network. Based on the technical scheme of the application, the accuracy of the retrieved image result can be improved.

Description

Training method of common-view gear classification network, image sorting method and related equipment
Technical Field
The application relates to the field of image processing, in particular to a training method of a common-view gear classification network, an image sorting method and related equipment.
Background
With the rapid development of hardware technology and visual algorithms, Augmented Reality (AR) technology is increasingly applied in everyday life. As an important part of AR technology, the visual repositioning technology (VPS) has become a research hotspot in this field in recent years.
The VPS is a technology for completing a positioning task through machine vision; for example, during live-action navigation, an image acquired in real time is matched against a map stored in the cloud to obtain an accurate position. In the matching process, the VPS places very high accuracy requirements on the images, but in the related art the retrieved image results are either not processed at all, or are processed by algorithms whose logic is too simple and which take too long, so the retrieved image results are inaccurate or even wrong, which affects the positioning accuracy. Therefore, how to sort the image results retrieved during matching and improve the accuracy of the retrieval result has become a problem that needs to be solved urgently.
Disclosure of Invention
The application provides a training method of a common-view gear classification network, an image sorting method and related equipment, which improve the accuracy of retrieval results by combining the global features and the local features of a query image and the comparison images in an image database.
In a first aspect, a method for training a common-view gear classification network is provided, including:
acquiring a query training image and an image database, wherein the image database comprises a multi-frame contrast image;
extracting local features of the query training image and the comparison image, wherein the local features comprise local feature points and local feature point descriptors;
determining effective image blocks and effective common-view image blocks in the query training image and the comparison image according to the local features, wherein the effective image blocks are image blocks comprising the local feature points, and the effective common-view image blocks are image blocks comprising local feature points in local feature point pairs matched between the query training image and the comparison image;
determining a true value gear between the query training image and the comparison image according to the effective image block and the effective common-view image block;
determining a predicted gear between the query training image and the comparison image by using the common-view gear classification network;
and training the common-view gear classification network based on the true gear and the predicted gear to obtain the trained common-view gear classification network.
The application provides a training method of a common-view gear classification network, which can firstly utilize a local feature extraction network to extract local features of a query training image and a comparison image in an image database, and can divide the query training image and the comparison image into a plurality of image blocks because partial areas of the image comprise local feature points in the local features, and determine whether each image block is an effective image block by determining whether each image block contains the local feature points. Then, the local feature points of the query training image and the comparison image are matched to determine that the image block containing the local feature points in the matched local feature point pair is an effective common-view image block. Then, the determined effective image blocks and effective common-view image blocks of the query training image and the comparison image can be used for determining a common-view score between the query training image and the comparison image, and determining a true value gear corresponding to the common-view score. And finally, simultaneously inputting the query training image and the comparison image into a common-view gear classification network, determining the predicted gears of the query training image and the comparison image by using the common-view gear classification network, and training the common-view gear classification network by combining the true gear and the predicted gear. Therefore, the common-view gear classification network trained by the training method provided by the embodiment of the application can be directly used for determining the gear between two frames of images when being applied subsequently, so as to accurately predict the similarity of the two frames of images through local features.
In a possible implementation manner of the first aspect, the extracting local features of the query training image and the comparison image includes:
extracting local features of the query training image and the comparison image by using a local feature extraction network;
the local feature extraction network comprises an encoding network, a key point detection network and a descriptor decoding network, wherein the encoding network is used for extracting basic features, the key point detection network is used for extracting the local feature points based on the basic features, and the descriptor decoding network is used for extracting the local feature point descriptors based on the basic features.
In this implementation, the local feature is a feature of a local region of the image. The extraction of the local feature points included in the local features can make a large amount of image information sparse, so that information compression can be better performed, and the extraction of the local feature point descriptor describes information around the local feature points, so as to generate feature vectors for distinguishing from other regions.
In a possible implementation manner of the first aspect, determining valid image blocks in the query training image and the comparison image according to the local features includes:
dividing the query training image and the comparison image into a plurality of image blocks with consistent sizes;
and determining the image block comprising the local feature point as the effective image block.
In this implementation, useless information in the image can be filtered out by determining valid image blocks, and only image blocks including local feature points capable of representing local information are included.
In a possible implementation manner of the first aspect, determining an effective common-view block in the query training image and the comparison image according to the local feature includes:
matching the local characteristic points in the query training image and the comparison image, and determining local characteristic point pairs on the matching;
and determining the image block comprising the local feature point in the matched local feature point pair as the effective common-view image block.
The effective common-view image block refers to an effective image block comprising local feature points in the matched local feature point pair.
In this implementation, although some effective image blocks in the query training image include local feature points, the local feature points do not have a matching relationship with the comparison image, which indicates that the contents of the effective image blocks are different from those of the comparison image. Therefore, it is necessary to screen out the valid image blocks including the local feature points having no matching relationship, and only the valid image blocks including the local feature points having matching relationship are retained. Thus, it can also be said that the valid common-view image blocks indicate image blocks including the same content.
In a possible implementation manner of the first aspect, determining a true value gear between the query training image and the comparison image according to the effective image block and the effective common-view image block includes:
determining a first common view ratio according to the effective common view image blocks and the effective image blocks in the query training image;
determining a second common-view ratio according to the effective common-view image block and the effective image block in the comparison image;
determining the smaller value of the first common-view ratio and the second common-view ratio as a common-view score between the query training image and the comparison image;
and determining a corresponding gear between the query training image and the comparison image according to the common-view score and by using a mapping relation between a preset common-view gear threshold and the gear, and using the gear as the true gear.
In this implementation, determining the first common-view ratio is equivalent to determining the proportion, within the query training image, of the content that the query training image and the comparison image have in common, and determining the second common-view ratio is equivalent to determining the proportion of that common content within the comparison image; the smaller of the two values is taken as the more accurate measure of the shared content.
In a second aspect, an image sorting method is provided, which is applied to an electronic device, and includes:
displaying a first interface, the first interface comprising a first control;
detecting a first operation on the first control;
responding to the first operation, and acquiring a query image;
determining global feature similarity between the query image and each frame of contrast image in an image database, wherein the image database comprises multi-frame contrast images;
acquiring an M-frame contrast image with high global feature similarity with the query image from the image database as an M-frame candidate image;
extracting local features of each frame of candidate images in the query image and the M frames of candidate images;
determining gears according to the local features of the query image and the M frames of candidate images by using a trained common-view gear classification network obtained by the training method of the common-view gear classification network in the first aspect or any possible implementation manner of the first aspect;
determining the previous N frames of candidate images as N frames of target images from M frames of candidate images according to the sequence of the gears from large to small, wherein N is more than or equal to 1 and is less than M, and both N and M are integers;
and displaying N frames of the target image.
The application provides an image sorting method, which comprises the steps of firstly sorting based on the global similarity degree with a query image, screening out the previous M frames of images which are similar in global situation to serve as candidate images, then determining the gear sizes of the M frames of candidate images and the query image according to local features by utilizing a trained common-view gear classification network, namely determining the similarity degree by combining the local features, and finally screening out the images of N before gear sorting to serve as target images. The N frames of target images are selected by combining the image local features after the image retrieval global feature results, so that the results which only depend on the global feature retrieval can be optimized, and the determined target images have higher similarity with the query images and are more accurate. When the method is subsequently applied to an image retrieval module in VPS positioning, the VPS positioning effect can be improved.
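As a rough illustration (not part of the claims), the two-stage ranking described above could be sketched as follows in Python, assuming L2-normalized global feature vectors and a trained gear predictor passed in as a callable whose smaller gear index means a higher degree of co-visibility; all names are hypothetical:

```python
import numpy as np

def rank_images(query_global, db_globals, query_local, db_locals,
                predict_gear, M=20, N=5):
    # Stage 1: global-feature similarity (cosine similarity on normalized vectors).
    sims = db_globals @ query_global            # shape: (num_db_images,)
    candidate_idx = np.argsort(-sims)[:M]       # top-M most similar overall

    # Stage 2: re-rank the M candidates by the co-view gear predicted from
    # local features; gear 1 is assumed to denote the highest co-visibility.
    gears = [predict_gear(query_local, db_locals[i]) for i in candidate_idx]
    order = np.argsort(gears)                   # most similar candidates first
    return candidate_idx[order][:N]             # indices of the N target images
```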
In a possible implementation manner of the second aspect, acquiring an M-frame contrast image with high global feature similarity to the query image from the image database as an M-frame candidate image includes:
sequencing the multiple frames of the contrast images according to the sequence of the global feature similarity from large to small, and determining the previous M frames of the contrast images as the M frames of candidate images;
or, taking the M frames of contrast images with the global feature similarity larger than a preset global feature similarity threshold as the M frames of candidate images.
In this implementation manner, the greater the global feature similarity, the more similar the overall characteristics of the contrast image and the query image in the image database, so that the contrast image of M frames before the global feature similarity ranking can be screened out from the image database as a candidate image, so that in the following, an image similar to the query image can be further screened out from the candidate image according to other conditions.
In a possible implementation manner of the second aspect, extracting local features of the query image and M frames of the candidate images includes:
and extracting local features of the query image and the M frames of candidate images by using a local feature extraction network.
In this implementation, the local feature extraction network is the same as that in the training method.
In a possible implementation manner of the second aspect, the trained common view gear classification network includes: a decoding network, an attention mechanism network and a classification network;
determining gears by using the trained common-view gear classification network according to the local features of the query image and the M frames of candidate images, wherein the step comprises the following steps:
inputting the local features of the query image into a decoding network to obtain a first feature map;
inputting the first feature map into the attention mechanism network to obtain an attention feature vector;
performing point multiplication on the attention feature vector and local features of the M frames of candidate images to obtain a second feature map;
and inputting the first characteristic diagram and the second characteristic diagram into the classification network to obtain the gear.
In the implementation mode, the trained common-view gear classification network is used for determining the gears, and the image similarity condition can be rapidly classified according to local features.
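For illustration only, the decoding / attention / classification structure described in this implementation could be sketched roughly as below in PyTorch; the layer types, feature dimensions, pooling, and the four-gear output head are assumptions rather than the patented architecture:

```python
import torch
import torch.nn as nn

class CoViewGearClassifier(nn.Module):
    """Sketch of the decoding / attention / classification structure (dimensions assumed)."""
    def __init__(self, desc_dim=256, num_gears=4):
        super().__init__()
        self.decoder = nn.Sequential(                  # decoding network -> first feature map
            nn.Linear(desc_dim, desc_dim), nn.ReLU(),
            nn.Linear(desc_dim, desc_dim))
        self.attention = nn.Sequential(                # attention mechanism network -> attention vector
            nn.Linear(desc_dim, desc_dim), nn.Tanh(),
            nn.Linear(desc_dim, desc_dim), nn.Sigmoid())
        self.classifier = nn.Sequential(               # classification network -> gear logits
            nn.Linear(2 * desc_dim, desc_dim), nn.ReLU(),
            nn.Linear(desc_dim, num_gears))

    def forward(self, query_desc, cand_desc):
        # query_desc: (Kq, desc_dim) local features of the query image
        # cand_desc:  (Kc, desc_dim) local features of one candidate image
        first = self.decoder(query_desc).mean(dim=0)        # first feature map, pooled to a vector
        attn = self.attention(first)                        # attention feature vector
        second = (attn * cand_desc).mean(dim=0)             # point-multiplied candidate features -> second feature map
        return self.classifier(torch.cat([first, second]))  # one logit per gear
```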
In a possible implementation manner of the second aspect, the first interface is a photographing interface, and the first control is a control for instructing to photograph.
In a third aspect, a training apparatus for a common view gear classification network is provided, which includes means for performing the method of the first aspect or any one of the first aspect.
In a fourth aspect, an electronic device is provided, comprising means for performing the method of the second aspect or any possible implementation manner of the second aspect.
In a fifth aspect, an electronic device is provided, including one or more processors and a memory;
the memory is coupled to the one or more processors for storing computer program code comprising computer instructions which are invoked by the one or more processors to cause the electronic device to perform the method of any of the first aspect or the first aspect, and/or the method of any of the second aspect or the second aspect.
In a sixth aspect, a chip system is provided, which is applied to an electronic device, and includes one or more processors for invoking computer instructions to cause the electronic device to perform the method of any one of the first aspect or the second aspect.
In a seventh aspect, a computer-readable storage medium is provided, which stores computer program code, which, when executed by an electronic device, causes the electronic device to perform the method of any one of the first or second aspects.
In an eighth aspect, there is provided a computer program product comprising: computer program code which, when run by an electronic device, causes the electronic device to perform any of the methods of the first or second aspects.
Drawings
FIG. 1 is a schematic diagram of a scenario in which an embodiment of the present application is applicable;
FIG. 2 is a schematic diagram of another scenario in which an embodiment of the present application is applicable;
fig. 3 is a schematic structural diagram of a local feature extraction network according to an embodiment of the present application;
FIG. 4 is a training diagram of a common-view gear classification network according to an embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating a method for training a common-view gear classification network according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a process of determining valid image blocks according to an embodiment of the present application;
FIG. 7 is a schematic diagram of the determined valid image block of FIG. 6;
FIG. 8 is a schematic diagram of determining local feature points on a match according to an embodiment of the present application;
FIG. 9 is a schematic diagram of FIG. 8 after the local feature point pairs with matching errors are culled;
FIG. 10 is a schematic illustration of another determined valid image block;
FIG. 11 is a schematic flow chart for determining a co-view score according to an embodiment of the present disclosure;
FIG. 12 is a flowchart illustrating an image sorting method according to an embodiment of the present application;
FIG. 13 is a schematic application diagram of a common-view gear classification network according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a hardware system of an electronic device according to an embodiment of the present disclosure;
fig. 15 is a schematic structural diagram of a software system of an electronic device according to an embodiment of the present application;
FIG. 16 is a schematic structural diagram of a training apparatus according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of an image sorting apparatus according to an embodiment of the present application;
fig. 18 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
In the description of the embodiments of the present application, "/" indicates "or"; for example, A/B may indicate A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, in the description of the embodiments of the present application, "a plurality" means two or more.
In the following, the terms "first", "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
First, some terms in the embodiments of the present application are explained so as to be easily understood by those skilled in the art.
1. Augmented reality refers to a virtual environment in which the real world and a virtual world are combined by a computer to create human-computer interaction, and is a generic term for various technologies such as Augmented Reality (AR), Virtual Reality (VR), and Mixed Reality (MR).
2. A visual repositioning system (VPS) is a service for performing positioning using image information, and the service matches a 3D map with image information acquired by an electronic device based on the 3D map, thereby achieving the purpose of identifying orientation information of the electronic device with high precision.
3. Global features are features extracted from an image as a whole and are generally used to express the overall characteristics of the image. Global features include any one or a combination of more of the following: pre-trained deep learning features, GIST features, Local Binary Pattern (LBP) features, color distribution features, and the like.
The pre-trained deep learning features are image features computed with a pre-trained convolutional neural network model, where the pre-trained model is a model trained on images in a data set. One such data set, the ImageNet project, is a large visual database for visual object recognition research, containing tens of millions of images in 1000 categories. After an image is input into the pre-trained convolutional neural network model, the features from any convolutional layer of the model can be used as the global features of the image. GIST features are an abstract scene representation that naturally captures the notion of different scene categories, such as cities or mountains. Color distribution features are image features represented by color statistical distribution vectors, such as color histograms, color moments, and color coherence vectors; a color histogram describes the distribution of the image's colors over different intervals of a color space, such as the Red Green Blue (RGB) color space; color moments include first-order moments, second-order moments, third-order moments, and so on.
4. A local feature is a feature extracted by using a partial region of an image as the input, and is generally used to express a local characteristic of the image. Local features include any one or a combination of more of the following: Scale Invariant Feature Transform (SIFT) features, Speeded Up Robust Features (SURF), affine SIFT (ASIFT) features, principal component analysis based SIFT (PCA-SIFT) features, Histogram of Oriented Gradients (HOG) features, Oriented FAST and Rotated BRIEF (ORB) features, and the like, where FAST refers to Features from Accelerated Segment Test and BRIEF refers to Binary Robust Independent Elementary Features. A local feature is generally extracted centered on a pixel in the image, which is also called a point of interest, a keypoint, or a feature point. The descriptor extracted in the local neighborhood of a feature point is generally locally scale-invariant; that is, if the content in that local neighborhood undergoes various kinds of processing, such as scaling, rotation, added noise, occlusion, or illumination change, the descriptor extracted from the processed neighborhood is the same as or similar to the descriptor extracted before processing. Therefore, if there are a large number of local feature matches, i.e. a large number of matched feature points, the two images can generally be considered highly similar. A plurality of local features can generally be extracted from one image.
5. Similarity refers to the degree of similarity between two hash values, features, or vectors. The similarity may be the distance between two hash values, features, or vectors, or a value obtained by normalizing that distance. The distance here may be a Euclidean distance, a cosine distance, an edit distance, a Hamming distance, or the like. The smaller the distance between two hash values, features, or vectors, the greater the similarity.
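As a small illustrative example (assuming plain NumPy vectors, not any specific feature type in the patent), a cosine-based similarity can be computed like this:

```python
import numpy as np

def cosine_similarity(u, v):
    # Normalize both feature vectors, then take their dot product;
    # larger values mean the two vectors are more similar.
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return float(np.dot(u, v))
```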
6. Back propagation algorithm
The neural network can adopt a back propagation (BP) algorithm to correct the values of the parameters in the initial neural network model during training, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, an input signal is propagated forward until the output produces an error loss, and the parameters in the initial neural network model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a backward pass dominated by the error loss, and aims to obtain the parameters of the optimal neural network model, such as the weight matrix.
The foregoing is a brief introduction to the terms used in the embodiments of the present application; they will not be described in detail again below.
With the rapid development of hardware technology and visual algorithm, augmented reality technology is also increasingly applied to people's lives. The visual repositioning technique, as an important part of the AR technique, has also become a research hotspot in this field in recent years.
The VPS is a technology for completing a positioning task through machine vision; for example, during live-action navigation, an image acquired in real time is matched against a map stored in the cloud to obtain an accurate position. In the matching process, the VPS places very high accuracy requirements on the images, but in the related art the retrieved image results are either not processed at all, or are processed by algorithms whose logic is too simple and which take too long, so the retrieved image results are inaccurate or even wrong, which affects the positioning accuracy.
Therefore, how to sort the image results retrieved during matching and improve the accuracy of the retrieval result becomes a problem that needs to be solved urgently.
In view of this, the embodiment of the present application provides an image ranking method that compares both the global features and the local features of a query image and the comparison images in an image database. The global features reflect the overall similarity of the images, while the local features reflect their local similarity, so combining the two for comparison comprehensively improves the accuracy of the retrieved image results.
Two application scenarios provided in the embodiment of the present application are illustrated in conjunction with fig. 1 and fig. 2.
The application scene one: navigation positioning
Fig. 1 is a schematic view of a scenario to which the embodiment of the present application is applied. As shown in fig. 1, when a user enters an unfamiliar scene area, the electronic device may provide live-action navigation to the user through AR technology, so that the user can accurately perform route identification. In the process of live-action navigation of the electronic equipment, a user can instruct the electronic equipment to acquire a current shot image, then the electronic equipment searches the live-action gallery according to the acquired current shot image, and at the moment, the electronic equipment can identify a current scene corresponding to the current shot image so as to realize positioning of the electronic equipment according to the current scene.
Application scenario two: searching picture by picture
Fig. 2 is a schematic view of another scenario to which the embodiment of the present application is applicable. As shown in fig. 2 (a), a plurality of applications are displayed on a display interface of the electronic device. When the user performs a click operation on the "search for a picture" application, in response to the click operation, the electronic device opens the application and displays a search interface as shown in (b) of fig. 2. The search interface includes a "camera" icon 11, indicating that the user can take a picture of the query image and find similar images of the query image. After the user performs one click operation on the "camera" icon 11 on the search interface, the electronic device displays a photographing interface as shown in (c) of fig. 2 in response to the one click operation by the user.
As shown in (c) of fig. 2, the electronic device displays a photographing interface including a preview image and further including a photographing key 12. After the user aims at the target to be photographed (for example, the router shown in the figure), in response to the click operation of the user on the photographing key 12, the electronic device may photograph the router, perform a search in the stored image database, and display the searched image containing the image similar to the target to be photographed on the interface shown in (d) in fig. 2 for the user to view.
By combining the two scenes, when the electronic device identifies the corresponding current scene according to the current shot image to perform self-positioning or search similar images, if the retrieval results are not sorted or the retrieval results are simply sorted by using a related algorithm, the obtained positioning or retrieved image results are very inaccurate, and the user experience is influenced. In contrast, if the electronic device performs processing by using the image sorting method provided by the embodiment of the present application, the global feature and the local feature of the query image and the comparison image in the image database can be combined to perform comparison, so that the accuracy of the retrieved image result is improved.
The above two examples of the application scenarios are not intended to limit the application scenarios of the present application.
Because the local features of the query image and the comparison images in the image database need to be analyzed and compared, the local features of the images need to be extracted through a trained local feature extraction network, and a trained common-view gear classification network is also needed to analyze and compare the local features of the images.
Fig. 3 is a schematic structural diagram of a local feature extraction network provided in the embodiment of the present application, and fig. 4 is a schematic training diagram of a common-view gear classification network provided in the embodiment of the present application. Fig. 5 is a flowchart illustrating a method for training a common-view gear classification network according to an embodiment of the present disclosure.
The following first describes in detail a training process of the common-view gear classification network provided in the embodiment of the present application with reference to fig. 3 to 5.
Fig. 5 is a flowchart illustrating a training method of a common-view gear classification network according to an embodiment of the present application.
As shown in fig. 5, the method 100 may include the following S110 to S180.
And S110, acquiring a query training image and an image database.
Wherein the query training image is used to indicate the query image used for training. The image database includes multiple frames of comparison images. It should be understood that the query training image and the comparison images included in the image database are both pre-captured images. Of course, images acquired in other manners may also be used, and the embodiment of the present application does not limit this.
Optionally, the query training image may be preprocessed.
In the embodiment of the present application, the preprocessing may include at least one of scaling, flipping, clipping, and the like. That is, at least one of scaling, flipping, clipping, and the like can be performed on the query training image. At this time, taking the comparison image as a reference, the length of the longest edge of the query training image can be kept consistent with the length of the longest edge of the comparison image; after the two longest edges are consistent, it is convenient to subsequently extract the local features of the query training image and the comparison image.
It should be understood that scaling refers to enlarging or reducing the size of the query training image, and flipping refers to flipping the image with one edge of the query training image as a symmetry axis; cropping refers to removing a partial region of the query training image. Of course, the pretreatment may also include other treatment steps, which may be added or modified as needed, and the embodiment of the present application does not limit this.
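A minimal sketch of the scaling step, assuming OpenCV-style arrays (height × width × channels) and that only the longest edge needs to match the comparison images; the function name and parameters are illustrative:

```python
import cv2

def scale_to_longest_edge(query_img, target_longest_edge):
    # Resize so that the longest edge of the query training image matches the
    # longest edge of the comparison images, preserving the aspect ratio.
    h, w = query_img.shape[:2]
    scale = target_longest_edge / max(h, w)
    new_size = (int(round(w * scale)), int(round(h * scale)))  # (width, height) for cv2.resize
    return cv2.resize(query_img, new_size)
```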
And S120, extracting local features of each frame of contrast image in the query training image and the image database.
The local feature comprises a local feature point and a local feature point descriptor.
It should be understood that a local feature is a feature of a local region of an image. The extraction of the local feature points included in the local features can make a large amount of image information sparse, so that information compression can be better performed, and the extraction of the local feature point descriptor describes information around the local feature points, so as to generate feature vectors for distinguishing from other regions.
Optionally, the local feature extraction network may be used to extract local features from the query training image and each frame of the contrast image.
Illustratively, as shown in fig. 3, the local feature extraction network may include an encoding network, a keypoint detection network, and a descriptor decoding network. The encoding network is used for extracting the basic features of the image; the keypoint detection network is used for extracting keypoints of the image, also called local feature points, based on the extracted basic features; and the descriptor decoding network is used for extracting keypoint descriptors of the image, also called local feature point descriptors, based on the basic features.
In the embodiment of the application, the local feature extraction network is trained in advance, and after the training, the key point detection network in the local feature extraction network has better local feature point detection performance. Based on the key point detection network with better performance, the local feature points capable of expressing the local features of the image can be detected more accurately.
In the embodiment of the present application, the local feature extraction network may be any one of the following combinations: a SIFT extractor with a nearest-neighbour (NN) matching network, a SuperPoint (SP) network with a SuperGlue (SG) matching network, or an ASLFeat (local features of accurate shape and localization) network with an NN matching network. Alternatively, the local feature extraction network may also be formed based on other network structures, and may be specifically set and modified as needed, which is not limited in this embodiment of the application.
Of course, in the above process, other methods may also be used to extract the local features of each frame of the contrast image in the query training image and the image database, which is not limited in this embodiment of the present application.
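For orientation, a much-simplified, SuperPoint-style sketch of the encoding / keypoint-detection / descriptor-decoding structure of fig. 3 might look like the following in PyTorch; the grayscale input, channel counts, strides, and the 65-way cell-score head are assumptions, not the network actually used in the embodiment:

```python
import torch
import torch.nn as nn

class LocalFeatureNet(nn.Module):
    """Encoding network + keypoint detection head + descriptor decoding head (sketch)."""
    def __init__(self, desc_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(                       # encoding network: basic features at 1/8 resolution
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU())
        self.keypoint_head = nn.Conv2d(128, 65, 1)          # per-cell keypoint scores + "no keypoint" bin
        self.descriptor_head = nn.Conv2d(128, desc_dim, 1)  # dense local feature point descriptors

    def forward(self, image):
        feats = self.encoder(image)                 # basic features
        scores = self.keypoint_head(feats)          # local feature point score map
        descriptors = self.descriptor_head(feats)   # local feature point descriptors
        return scores, descriptors
```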
And S130, determining effective image blocks respectively included in the query training image and each frame of contrast image according to the local features.
Wherein an active image block refers to an image block comprising local feature points. The size and the dimension of the image block may be determined as needed, which is not limited in any way by the embodiment of the present application.
Alternatively, as shown in fig. 6, the above S130 may include the following S131 and S132.
And S131, dividing the query training image and each frame of comparison image into a plurality of image blocks with consistent sizes.
The shape of the image block is not limited, and may be a square, a rectangle, a circle, or other shapes, and the size of the image block is also not limited, and may be set and modified as needed.
And S132, determining whether each image block contains local feature points; if so, the image block is an effective image block, and if not, the image block is not an effective image block.
Fig. 7 shows an exemplary schematic diagram of determining valid image blocks.
As shown in FIG. 7, the query training image and the comparison image may be divided into a plurality of image blocks of the same size, for example, each image block having a size of n × n.
As shown in fig. 7 (a), after the query training image is divided into a plurality of image blocks, the row 1, column 3 image block includes a local feature point a1, the row 1, column 4 image block includes a local feature point a2, the row 2, column 2 image block includes a local feature point a3, the row 2, column 3 image block includes a local feature point a4, the row 3, column 3 image block includes a local feature point a5, the row 4, column 3 image block includes a local feature point a6, and the row 5, column 3 image block includes a local feature point a7. Therefore, it can be determined that the row 1, column 3 image block, the row 1, column 4 image block, the row 2, column 2 image block, the row 2, column 3 image block, the row 3, column 3 image block, the row 4, column 3 image block, and the row 5, column 3 image block are all valid image blocks in the query training image.
As shown in (b) in fig. 7, similarly, after the comparison image is divided into a plurality of image blocks, the image block in the 3rd row and 4th column includes the local feature point b1, the image block in the 3rd row and 5th column includes the local feature point b2, the image block in the 4th row and 3rd column includes the local feature point b3, the image block in the 4th row and 4th column includes the local feature point b4, the image block in the 5th row and 4th column includes the local feature point b5, and the image block in the 5th row and 2nd column includes the local feature point b6. Therefore, it can be determined that the 3rd row and 4th column image block, the 3rd row and 5th column image block, the 4th row and 3rd column image block, the 4th row and 4th column image block, the 5th row and 2nd column image block, and the 5th row and 4th column image block are all valid image blocks in the comparison image.
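Steps S131 and S132 can be sketched as follows, assuming local feature points are given as (x, y) pixel coordinates and the image blocks are square with side length n; the names are illustrative:

```python
import numpy as np

def valid_image_blocks(keypoints, image_h, image_w, n):
    # Mark an image block as valid if it contains at least one local feature point.
    rows, cols = int(np.ceil(image_h / n)), int(np.ceil(image_w / n))
    valid = np.zeros((rows, cols), dtype=bool)
    for x, y in keypoints:
        valid[int(y // n), int(x // n)] = True
    return valid   # valid[r, c] == True -> block at row r, column c is an effective image block
```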
S140, matching local characteristic points in the local characteristics of the query training image and each frame of contrast image, and determining matched local characteristic point pairs.
Firstly, extracting local feature points and local feature point descriptors of a query training image by using a local feature extraction network aiming at the query training image; aiming at the comparison image, the local feature points and the local feature point descriptors of the comparison image can be extracted by using a local feature extraction network; and then, matching according to the local characteristic points of the query training image and the local characteristic points of the comparison image, and determining the matched local characteristic point pairs. The matched local feature point pairs comprise local feature points of a query training image and local feature points of a comparison image, and the two local feature points are used for indicating the same target.
Illustratively, fig. 8 shows a schematic diagram of local feature point pairs on a match.
As shown in fig. 8 (a), the left image is a query training image, and a1 to a7 are local feature points determined from the query training image. The right image is a frame of contrast image arbitrarily selected from an image database, and b1 to b6 are local feature points determined from the contrast image.
As shown in (b) in fig. 8, after matching, a1 matches b1, and a1 and b1 are a pair of matched local feature points; a2 matches b2, and a2 and b2 are a pair of matched local feature points; a3 matches b3, and a3 and b3 are a pair of matched local feature points; a4 matches b4, and a4 and b4 are a pair of matched local feature points; a5 matches b5, and a5 and b5 are a pair of matched local feature points; a7 matches b6, and a7 and b6 are a pair of matched local feature points. In contrast, a6 has no matching object. The dotted lines in fig. 8 are only used to indicate the matching relationship; they do not appear in actual processing and have no other meaning.
It should be understood that, since the local features of the comparison image in the image database may need to be called many times in the training process, the local features of the comparison image may be extracted for the first time and then stored in correspondence with the comparison image, that is, the local features are updated to the image database. When the local features of a certain contrast image need to be acquired later, the local features of the contrast image stored in the image database can be directly called for use, so that the workload is reduced, and the processing efficiency is improved. And for other query training images, only local feature extraction needs to be carried out on the other query training images.
Optionally, in order to improve the accuracy of the subsequent result, after the matching is performed, error correction may be performed first to remove local feature point pairs with matching errors.
In the embodiment of the present application, local feature points with matching errors may be removed by using an epipolar geometry method. Of course, other methods may also be used for elimination, and the embodiment of the present application does not limit this.
It should be understood that the epipolar geometry method estimates the pose change of the camera between two consecutive frames from multiple groups of known two-dimensional pixel coordinates matched between the two frames. After the pose change of the camera is estimated, matched local feature points that do not conform to this pose change indicate a matching error, and these wrongly matched local feature points can then be removed.
Fig. 9 is a schematic diagram of fig. 8 after the local feature point pairs with matching errors are removed.
As shown in fig. 8 and 9, in the matching process, since a7 and b6 are matched incorrectly, the local feature points a7 and b6 may be eliminated, and only other local feature point pairs which are matched correctly are retained.
It should also be understood that the local feature point pairs with matching errors are removed, the matching relation of the local feature points is simply canceled, and the local feature points are not removed from the image.
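A rough sketch of the matching and mismatch-removal step, using nearest-neighbour descriptor matching and a RANSAC fundamental-matrix fit as one possible realization of the epipolar-geometry check; the OpenCV calls are an assumed implementation choice, not prescribed by the embodiment:

```python
import cv2
import numpy as np

def match_and_filter(desc_q, desc_c, kpts_q, kpts_c):
    # Nearest-neighbour matching of local feature point descriptors with cross-check.
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(desc_q.astype(np.float32), desc_c.astype(np.float32))

    pts_q = np.float32([kpts_q[m.queryIdx] for m in matches])
    pts_c = np.float32([kpts_c[m.trainIdx] for m in matches])

    # Epipolar check: keep only matches consistent with a RANSAC-fitted fundamental matrix.
    F, inlier_mask = cv2.findFundamentalMat(pts_q, pts_c, cv2.FM_RANSAC, 3.0, 0.99)
    if inlier_mask is None:           # too few matches for a reliable epipolar check
        return pts_q, pts_c
    inliers = inlier_mask.ravel().astype(bool)
    return pts_q[inliers], pts_c[inliers]   # matched local feature point pairs
```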
S150, determining effective common-view image blocks respectively included in the query training image and each frame of contrast image based on the matched local feature point pairs in the query training image and each frame of contrast image.
The effective common-view image block refers to an effective image block comprising local feature points in the matched local feature point pair. Although some effective image blocks in the query training image include local feature points, the local feature points have no matching relationship with the comparison image, which indicates that the contents of the effective image blocks are different from those of the comparison image. Therefore, it is necessary to screen out the valid image blocks including the local feature points having no matching relationship, and only the valid image blocks including the local feature points having matching relationship are retained. Thus, it can also be said that the valid common-view image blocks indicate image blocks including the same content.
Example 1, as shown in fig. 9 (a), for a query training image, after the query training image is divided into a plurality of image blocks, although the image block in row 4, column 3 includes a local feature point a6, and the image block in row 5, column 3 includes a local feature point a7, the local feature point a6 does not have a corresponding matched local feature point in a comparison image, and the local feature point a7 also has no matching relationship, which indicates that the comparison image does not include the same content as the two local feature points. Therefore, it can be determined that the 4 th row and 3 rd column image block and the 5 th row and 3 rd column image block do not belong to the valid common-view image block. In addition, the other row 1, column 3, row 1, column 4, row 2, column 2, row 2, column 3 and row 3, column 3 are all valid common-view blocks in the query training image.
Referring to fig. 9 (b), after the comparison image is divided into a plurality of image blocks, although the image block in row 5 and column 2 includes the local feature point b6, there is no corresponding matched local feature point in the query training image, which means that the query training image does not include the same content as the local feature point. Therefore, it can be determined that the 5 th row and 2 nd column block including the local feature point b6 does not belong to the valid common-view block. In addition, the 3 rd row and 4 th column image block, the 3 rd row and 5 th column image block, the 4 th row and 3 rd column image block, the 4 th row and 4 th column image block and the 5 th row and 4 th column image block all belong to effective common-view image blocks in the comparison image.
Similarly, with reference to the above, it is possible to determine valid co-view image blocks in each frame of contrast image, which include the same content as the query training image.
Example 2, as shown in fig. 10, for the left query training image and the right comparison image, the points marked in the figure are determined local feature points, and a circle containing the local feature points is a valid image block.
Then, matching the local characteristic points of the query training image and the comparison image, and determining the matched local characteristic point pairs. Therefore, the effective image blocks including the local feature points with the matching relation can be screened out from the effective image blocks to be the effective common-view image blocks.
And S160, determining the common-view score of the query training image and each frame of contrast image according to the effective image blocks and the effective common-view image blocks between the query training image and each frame of contrast image.
The common view score is used for indicating the content similarity degree between the two frames of images determined by combining the local features, and the higher the common view score is, the more similar the common view score is.
As shown in fig. 11, S160 may include the following S161 to S164.
S161, determining a first common view ratio according to the effective common view image block and the effective image block in the query training image.
And S162, determining a second common view ratio according to the effective common view image block and the effective image block in the contrast image.
And S163, determining the smaller value of the first common visual proportion and the second common visual proportion as the common visual score between the query training image and the comparison image.
And S164, repeatedly executing the steps S161 to S163, and determining the corresponding common-view score between the query training image and each frame of contrast image.
It should be understood that determining the first common-view ratio is equivalent to determining the proportion, within the query training image, of the content shared by the query training image and the comparison image, and determining the second common-view ratio is equivalent to determining the proportion of that shared content within the comparison image; the smaller of the two values is taken as the more accurate measure of the shared content.
Exemplarily, as shown in fig. 7 and fig. 9, according to the effective common-view image blocks and the effective image blocks in the query training image, the first common-view ratio is determined to be five sevenths; according to the effective common-view image blocks and the effective image blocks in the comparison image, the second common-view ratio is determined to be five sixths. The first common-view ratio is smaller than the second common-view ratio, so the common-view score between the query training image and the comparison image can be determined to be five sevenths.
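Combining S161 to S163 with the block grids from the earlier sketch, the common-view score could be computed as follows (a sketch with illustrative variable names, assuming each image has at least one effective image block):

```python
def co_view_score(valid_q, coview_q, valid_c, coview_c):
    # valid_*  : boolean grids of effective image blocks (query / comparison image)
    # coview_* : boolean grids of effective common-view image blocks
    ratio_q = coview_q.sum() / valid_q.sum()   # first common-view ratio (query side)
    ratio_c = coview_c.sum() / valid_c.sum()   # second common-view ratio (comparison side)
    return min(ratio_q, ratio_c)               # common-view score = smaller of the two ratios
```

With the example above, the query side gives 5/7 and the comparison side gives 5/6, so the returned score is 5/7.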
S170, according to the common vision score, determining a corresponding gear between the query training image and each frame of contrast image by using a mapping relation between a preset common vision gear threshold and the gear, and using the gear as a true gear.
In the embodiment of the present application, the number and the size of the common gear threshold may be set as needed, and the embodiment of the present application does not limit this. For example, three common view gear thresholds, score _ th1, score _ th2, and score _ th3, respectively, may be set.
Accordingly, the divided gears are:
gear 4: 0 ≤ score < score_th1;
gear 3: score_th1 ≤ score < score_th2;
gear 2: score_th2 ≤ score < score_th3;
gear 1: score_th3 ≤ score.
It should be understood that the more common view gear thresholds that are set, the finer the division, and the correspondingly more gears.
For example, if the set gear thresholds are score_th1 = 0.2, score_th2 = 0.6, and score_th3 = 0.8, the common-view gear threshold ranges corresponding to the above gears are:
gear 4: 0 ≤ score < 0.2;
gear 3: 0.2 ≤ score < 0.6;
gear 2: 0.6 ≤ score < 0.8;
gear 1: 0.8 ≤ score.
Here, score is the determined common-view score; the common-view gear threshold range to which the common-view score belongs determines the corresponding gear. It should be understood that gear 1 represents the highest degree of similarity, and gear 4 represents the lowest degree of similarity.
In combination with the common-view score obtained in S160, since the common-view score between the query training image and the comparison image is five-sevenths, about 0.71, which falls within the common-view gear threshold range of 0.6 to 0.8, the true value gear corresponding to the query training image and the comparison image can be determined to be gear 2.
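A minimal sketch of the mapping in S170 is given below, assuming the three example thresholds above; the function name and the tuple of thresholds are illustrative assumptions, not part of the embodiment.

```python
def score_to_gear(score, thresholds=(0.2, 0.6, 0.8)):
    """Map a common-view score to a true value gear using the example
    thresholds score_th1 = 0.2, score_th2 = 0.6, score_th3 = 0.8.

    Gear 1 is the most similar, gear 4 the least similar.
    """
    score_th1, score_th2, score_th3 = thresholds
    if score < score_th1:
        return 4
    if score < score_th2:
        return 3
    if score < score_th3:
        return 2
    return 1

print(score_to_gear(5 / 7))  # 5/7 ~= 0.71 falls within [0.6, 0.8), so gear 2
```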
Optionally, before determining the gear, the comparison images may be further sorted according to the common view scores corresponding to all the comparison images from large to small.
S180, training the common-view gear classification network based on the query training image, each frame of contrast image, and the corresponding true value gear between the query training image and each frame of contrast image, to obtain the trained common-view gear classification network.
For example, the query training image and any frame of contrast image are input into the common-view gear classification network, and the gear between the query training image and that frame of contrast image is determined by using the network and taken as the predicted gear. Then, the predicted gear is compared with the true value gear between the query training image and the contrast image, and the parameters of the common-view gear classification network are iterated through a back propagation algorithm until the predicted gear of the network converges towards the true value gear, so as to obtain the trained common-view gear classification network.
The above description is given by taking a frame of contrast image as an example, and in the actual training process, the query training image and the multi-frame contrast image may be input into the common-view gear classification network for training.
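As a hedged illustration of S180, the following PyTorch-style sketch shows one training step for a (query training image, contrast image) pair; the cross-entropy loss, the covis_net(query_feats, contrast_feats) interface, and the 0-based gear labels are assumptions of this sketch, since the embodiment only specifies iterating the network parameters by back propagation until the predicted gear approaches the true value gear.

```python
import torch.nn.functional as F

def train_step(covis_net, optimizer, query_feats, contrast_feats, true_gear):
    """One hedged training step for the common-view gear classification network.

    true_gear holds 0-based class indices (gear 1 -> 0, ..., gear 4 -> 3).
    Cross-entropy is an assumed loss; the embodiment only requires that the
    predicted gear converge towards the true value gear via back propagation.
    """
    optimizer.zero_grad()
    logits = covis_net(query_feats, contrast_feats)  # shape: (batch, 4 gears)
    loss = F.cross_entropy(logits, true_gear)
    loss.backward()                                  # back propagation
    optimizer.step()
    return loss.item()
```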
In the embodiment of the present application, the local features of a query training image and of the comparison images in an image database may be extracted by using a local feature extraction network. Since the local feature points in the local features fall within partial regions of an image, the query training image and the comparison image may be divided into a plurality of image blocks, and whether each image block is an effective image block is determined by determining whether the image block contains a local feature point. Then, the local feature points of the query training image and of the comparison image are matched, and the image blocks containing the local feature points of the matched local feature point pairs are determined to be effective common-view image blocks. The determined effective image blocks and effective common-view image blocks of the query training image and the comparison image can then be used to determine a common-view score between the two images, and the true value gear corresponding to that score. Finally, the query training image and the comparison image are input into the common-view gear classification network together, the predicted gear between them is determined by using the network, and the network is trained by combining the true value gear and the predicted gear. Therefore, when subsequently applied, the common-view gear classification network trained by the training method provided by the embodiment of the present application can be directly used to determine the gear between two frames of images, so as to accurately predict the similarity of the two frames of images through their local features.
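As a further illustrative sketch (not the embodiment's reference implementation), the determination of effective image blocks and effective common-view image blocks summarized above could be expressed as follows; the 4x4 grid size, the (x, y) coordinate convention, and the helper names are assumptions.

```python
def effective_blocks(keypoints, image_size, grid=(4, 4)):
    """Effective image blocks: grid blocks containing at least one local
    feature point.  keypoints is a list of (x, y) pixel coordinates; the
    4x4 grid is an illustrative assumption."""
    width, height = image_size
    rows, cols = grid
    blocks = set()
    for x, y in keypoints:
        col = min(int(x * cols / width), cols - 1)
        row = min(int(y * rows / height), rows - 1)
        blocks.add((row, col))
    return blocks

def effective_covis_blocks(matched_pairs, image_size, side=0, grid=(4, 4)):
    """Effective common-view image blocks of one image: blocks that contain a
    local feature point belonging to a matched local feature point pair.
    side=0 selects the query-image point of each pair, side=1 the contrast-image point."""
    points = [pair[side] for pair in matched_pairs]
    return effective_blocks(points, image_size, grid)
```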
The common-view gear classification network trained through the processes shown in fig. 3 to fig. 11 may be applied to the image sorting method provided by the present application. Next, the image sorting method provided by the embodiment of the present application is described in detail with reference to fig. 12.
Fig. 12 is a schematic flowchart of an image sorting method according to an embodiment of the present application. As shown in fig. 12, the method 200 may include the following S210 to S270.
And S210, acquiring a query image.
It should be understood that the query image may be an image captured by a user using an electronic device, an image stored in the electronic device, or an image acquired or received by the electronic device from a cloud; of course, it may also be obtained in other manners, which is not limited in this embodiment of the present application. The query image is the image input by the user for retrieving similar images, and the contrast images are the images among which images similar to the query image are to be found.
S220, determining the global feature similarity between the query image and each frame of contrast image in the image database.
Wherein the image database comprises a plurality of frames of contrast images.
The multi-frame contrast images included in the image database may include pre-captured images as well as previously entered query images by the user. The image database may be stored on the electronic device or in the cloud. When the image database is stored in the cloud, the image database can be updated in real time according to the updating operation of the cloud server. At this time, the electronic device may obtain the image database in the cloud in response to the query instruction of the user.
When determining the global feature similarity between the query image and each frame of contrast image in the image database, for example, a pre-trained convolutional neural network model, also called a global feature extraction network, may be used to extract the global features of the query image and of each frame of contrast image in the image database, and then the degree of similarity between the two global features is determined.
The global features of each frame of contrast image in the image database can be predetermined and stored, and when the image sorting method provided by the embodiment of the application is used for processing, the global features of each frame of contrast image are only required to be called to calculate the similarity degree. Therefore, for different query images, the global features of each frame of contrast image in the image database can be multiplexed, so that the calculation amount can be reduced, and the processing efficiency can be improved.
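For illustration, a short sketch of how the cached global features might be reused in S220 is given below; the cosine similarity measure and the global_net callable are assumptions of this sketch, since the embodiment does not fix the similarity measure or the extraction interface.

```python
import numpy as np

def build_gallery_features(contrast_images, global_net):
    """Pre-extract and cache one global feature per contrast image so that
    the same gallery features can be reused for every later query image.
    global_net is assumed to map an image to a 1-D feature vector."""
    feats = np.stack([global_net(img) for img in contrast_images])
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def global_similarities(query_image, gallery_feats, global_net):
    """Cosine similarity (an assumed measure) between the query image's
    global feature and every cached contrast-image feature."""
    q = global_net(query_image)
    q = q / np.linalg.norm(q)
    return gallery_feats @ q      # one similarity score per contrast image
```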
And S230, determining M candidate images from the multi-frame contrast images according to the global feature similarity.
Optionally, as an embodiment, the top M frames of comparison images may be picked out from the image database as M frames of candidate images in an order from large to small of the global feature similarity. M is an integer greater than 1, and M is less than the total number of compared images in the image database.
Optionally, as another embodiment, a global feature similarity threshold may be preset, and the M frames of contrast images with the corresponding global feature similarity greater than the global feature similarity threshold are determined as M frames of candidate images.
It should be understood that the above description is only two examples, and the M frame candidate images may also be determined according to other manners, which is not limited in any way by the embodiments of the present application.
It should be understood that the greater the global feature similarity is, the more similar the overall characteristics of the contrast image and the query image are. Therefore, the top M frames of contrast images in the global feature similarity ranking can be screened out from the image database as M frames of candidate images, so that images similar to the query image can subsequently be further screened out from the candidate images according to other conditions. Screening candidate images according to the global feature similarity is equivalent to a preliminary screening of similar images; through this preliminary screening, a large number of images that have low similarity to, or are dissimilar from, the query image can be filtered out, thereby improving the matching speed and accuracy for the query image.
For example, M may be 100; of course, M may also take other values, which is not limited in this embodiment of the application. When the first M frames of contrast images are selected from the image database as candidate images in descending order of global feature similarity, the size of M may be set as needed. When the contrast images whose global feature similarity is greater than the global feature similarity threshold are determined as candidate images, the size of M is obtained by counting the number of determined candidate images.
For example, when M is preset to be 100, all the contrast images in the image database and the query image may be input into the global feature extraction network to calculate and rank the global feature similarity between each frame of contrast image and the query image, and the first 100 frames of contrast images in the ranking are output as candidate images.
Illustratively, a global feature similarity threshold is predetermined, a contrast image corresponding to a global feature similarity greater than the global feature similarity threshold is determined as a candidate image, and if the counted candidate image is 80 frames, the size of M at this time is 80.
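Both candidate-selection strategies of S230 can be sketched as follows; the parameter names and the use of NumPy index arrays are assumptions for illustration.

```python
import numpy as np

def select_candidates(similarities, m=100, sim_threshold=None):
    """Return the indices of the M frames of candidate images (S230).

    With sim_threshold=None, the top-M most similar contrast images are taken;
    otherwise every contrast image whose global feature similarity exceeds the
    threshold is taken, and M is simply the number of images that pass.
    """
    if sim_threshold is None:
        order = np.argsort(-similarities)    # descending global feature similarity
        return order[:m]
    passed = np.flatnonzero(similarities > sim_threshold)
    return passed[np.argsort(-similarities[passed])]
```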
And S240, extracting the local features of the query image by using the local feature extraction network.
The local features of the query image comprise local feature points and local feature point descriptors.
And S250, determining the gear corresponding to the query image and the candidate image by utilizing the trained common-view gear classification network according to the local features between the query image and the M frames of candidate images.
Since the M frames of candidate images are screened from the contrast images in the image database, and the local features of each frame of contrast image have been determined and stored during the training process, the relevant local features can be directly retrieved according to the correspondence for processing in application. Alternatively, when the local features are not stored in advance, the local feature extraction network may be used to extract the local features of the candidate images; the local features of the candidate images may also include local feature points and local feature point descriptors.
In the embodiment of the present application, fig. 13 shows an application diagram of a common-view gear classification network.
As shown in fig. 13, the common view gear classification network may include: a decoding network, an attention mechanism network, and a classification network.
The decoding network is used for fusing local feature points of the query image and local feature point descriptors to generate a first feature map; the attention mechanism network is used for generating an attention feature vector according to the first feature map; then, the attention feature vector is point-multiplied with the local feature points and the local feature point descriptors in the candidate image to generate a second feature map. The classification network is used for classifying according to the first characteristic diagram and the second characteristic diagram and determining the corresponding gear.
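A hedged PyTorch sketch of this structure is given below; the feature dimensions, the fusion by concatenation, and the layer choices are assumptions of the sketch, not details fixed by fig. 13.

```python
import torch
import torch.nn as nn

class CoVisGearNet(nn.Module):
    """Illustrative sketch of the common-view gear classification network of
    fig. 13; dimensions and layer choices are assumptions."""

    def __init__(self, feat_dim=256, num_gears=4):
        super().__init__()
        self.decoder = nn.Linear(2 * feat_dim, feat_dim)       # decoding network
        self.attention = nn.Sequential(                        # attention mechanism network
            nn.Linear(feat_dim, feat_dim), nn.Sigmoid())
        self.classifier = nn.Linear(2 * feat_dim, num_gears)   # classification network

    def forward(self, query_pts, query_desc, cand_pts, cand_desc):
        # Decoding network: fuse the query image's local feature points and descriptors
        first_map = self.decoder(torch.cat([query_pts, query_desc], dim=-1))
        # Attention mechanism network: attention feature vector from the first feature map
        attn = self.attention(first_map)
        # Point-multiply the attention vector with the candidate image's fused local features
        cand_feat = self.decoder(torch.cat([cand_pts, cand_desc], dim=-1))
        second_map = attn * cand_feat
        # Classification network: classify from the first and second feature maps into a gear
        return self.classifier(torch.cat([first_map, second_map], dim=-1))
```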
S260, sorting according to the sequence of gears from large to small, and determining the first N frames of candidate images as target images, wherein N is more than or equal to 1 and less than M, and both N and M are integers.
For example, after the global similarity between the query image and each frame of contrast image in the image database is determined, the comparison images are sorted from large to small according to the global similarity, and the first 100 frames of contrast images can be screened out as candidate images, which indicates that the overall situations of the 100 frames of candidate images and the query image are relatively similar.
Then, the local features of the query image are extracted by using the local feature extraction network, the corresponding gears between the query image and the 100 frames of candidate images are determined by using the trained common-view gear classification network, the candidate images are sorted according to the sequence of gears from large to small, and the first 30 frames of candidate images are determined from the 100 frames of candidate images as 30 frames of target images; the gear ranking indicates that, on top of being globally similar to the query image, these 30 frames of candidate images are also locally similar to it.
The size of N may be set and modified as needed, and this is not limited in any way by the embodiments of the present application.
And S270, displaying the N frames of target images.
The N frames of target images may be displayed according to the sorting order of the gear positions, and certainly, may also be displayed in other manners, or may also display a part of the target images in the N frames of target images, which is not limited in this embodiment of the present application.
In the image sorting method provided by the embodiment of the application, the contrast images are first ranked according to the degree of global similarity to the query image, and the top M globally similar frames are screened out as candidate images; then, the trained common-view gear classification network is used to determine the gears between the M frames of candidate images and the query image according to local features, that is, the degree of similarity is determined by combining the local features; finally, the top N images in the gear ranking are screened out as target images. Because the N frames of target images are selected by combining local image features on top of the global-feature retrieval result, the result obtained by relying only on global feature retrieval can be optimized, so that the determined target images are more similar to the query image and more accurate. When the method is subsequently applied to the image retrieval module in VPS positioning, the VPS positioning effect can be improved.
The image sorting method and the related device in the embodiment of the present application will be described below with reference to the accompanying drawings.
Fig. 14 shows a hardware system of an electronic device suitable for use in the present application.
The electronic device 100 may be a mobile phone, a smart screen, a tablet computer, a wearable electronic device, an in-vehicle electronic device, an Augmented Reality (AR) device, a Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), a projector, and the like, and the embodiment of the present application does not limit the specific type of the electronic device 100.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a key 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identity Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
The configuration shown in fig. 14 is not intended to specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than those shown in FIG. 14, or electronic device 100 may include a combination of some of the components shown in FIG. 14, or electronic device 100 may include sub-components of some of the components shown in FIG. 14. The components shown in fig. 14 may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units. For example, the processor 110 may include at least one of the following processing units: an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and a neural Network Processor (NPU). The different processing units may be independent devices or integrated devices. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
The connection relationship between the modules shown in fig. 14 is only for illustrative purposes and does not limit the connection relationship between the modules of the electronic apparatus 100. Alternatively, the modules of the electronic device 100 may also adopt a combination of the connection manners in the above embodiments.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The electronic device 100 may implement display functionality through the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 may be used to display images or video.
The electronic device 100 may implement a photographing function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and the like.
The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can optimize the algorithm of the noise, brightness and color of the image, and can also optimize the parameters of exposure, color temperature and the like of the shooting scene. In some embodiments, the ISP may be provided in camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into a standard Red Green Blue (RGB), YUV, or the like format image signal. In some embodiments, electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the electronic device 100 selects a frequency bin, the digital signal processor is used to perform fourier transform or the like on the frequency bin energy.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, and MPEG4.
The gyro sensor 180B may be used to determine the motion attitude of the electronic device 100. In some embodiments, the angular velocity of electronic device 100 about three axes (i.e., the x-axis, y-axis, and z-axis) may be determined by gyroscope sensor 180B. The gyro sensor 180B may be used for photographing anti-shake. For example, when the shutter is pressed, the gyro sensor 180B detects a shake angle of the electronic device 100, calculates a distance to be compensated for by the lens module according to the shake angle, and allows the lens to counteract the shake of the electronic device 100 by a reverse movement, thereby achieving anti-shake. The gyro sensor 180B can also be used in scenes such as navigation and motion sensing games.
For example, in the embodiment of the present application, the gyroscope sensor 180B may be used for acquiring shake information, and the shake information may be used for representing a pose change of the electronic device during shooting.
Acceleration sensor 180E may detect the magnitude of acceleration of electronic device 100 in various directions, typically the x-axis, y-axis, and z-axis. The magnitude and direction of gravity can be detected when the electronic device 100 is stationary. The acceleration sensor 180E may also be used to recognize the attitude of the electronic device 100 as an input parameter for applications such as horizontal and vertical screen switching and pedometers.
The distance sensor 180F is used to measure a distance. The electronic device 100 may measure the distance by infrared or laser. In some embodiments, for example in a shooting scene, the electronic device 100 may utilize the range sensor 180F to range for fast focus.
The ambient light sensor 180L is used to sense the ambient light level. Electronic device 100 may adaptively adjust the brightness of display screen 194 based on the perceived ambient light level. The ambient light sensor 180L may also be used to automatically adjust the white balance when taking a picture. The ambient light sensor 180L may also cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in a pocket to prevent accidental touches.
The fingerprint sensor 180H is used to collect a fingerprint. The electronic device 100 can utilize the collected fingerprint characteristics to implement functions such as unlocking, accessing an application lock, taking a picture, and answering an incoming call.
The touch sensor 180K is also referred to as a touch device. The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also referred to as a touch screen. The touch sensor 180K is used to detect a touch operation applied thereto or in the vicinity thereof. The touch sensor 180K may pass the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided via the display screen 194. In other embodiments, the touch sensor 180K may be disposed on a surface of the electronic device 100 at a different location than the display screen 194.
The hardware system of the electronic device 100 is described above in detail, and the software system of the electronic device 100 is described below. Fig. 15 is a schematic diagram of a software system of an electronic device according to an embodiment of the present application.
As shown in fig. 15, the system architecture may include an application layer 210, an application framework layer 220, a hardware abstraction layer 230, a driver layer 240, and a hardware layer 250.
The application layer 210 may include camera applications, gallery, calendar, telephony, maps, navigation, WLAN, bluetooth, music, video, short messaging, etc. applications.
The application framework layer 220 provides an Application Programming Interface (API) and a programming framework for an application program of the application layer; the application framework layer may include some predefined functions.
For example, the application framework layer 220 may include a camera access interface; camera management and camera devices may be included in the camera access interface. Wherein camera management may be used to provide an access interface to manage the camera; the camera device may be used to provide an interface for accessing the camera.
The hardware abstraction layer 230 is used to abstract the hardware. For example, the hardware abstraction layer may include a camera abstraction layer and other hardware device abstraction layers; the camera hardware abstraction layer may invoke algorithms in a camera algorithm library.
For example, software algorithms for image processing may be included in the camera algorithm library.
The driver layer 240 is used to provide drivers for different hardware devices. For example, the driver layer may include camera device drivers; a digital signal processor driver, a graphics processor driver, or a central processor driver.
The hardware layer 250 may include camera devices as well as other hardware devices.
For example, the hardware layer 250 includes a camera device, a digital signal processor, a graphics processor, or a central processor; illustratively, an image signal processor may be included in the camera device, and the image signal processor may be used for image processing.
For example, the training method of the common-view gear classification network provided by the embodiment of the present application may be executed in a digital signal processor, a graphics processor, or a central processing unit; or in other computing hardware of the electronic device.
Fig. 16 is a schematic diagram of a training device 300 for implementing the training method according to an embodiment of the present application.
As shown in fig. 16, the training apparatus 300 includes an acquisition unit 310 and a processing unit 320.
The obtaining unit 310 is configured to obtain a query training image and an image database, where the image database includes multiple frames of contrast images.
The processing unit 320 is configured to extract local features of the query training image and the comparison image, where the local features include local feature points and local feature point descriptors;
determining effective image blocks and effective common-view image blocks in the query training image and the comparison image according to the local features, wherein the effective image blocks are image blocks comprising the local feature points, and the effective common-view image blocks are image blocks comprising local feature points in local feature point pairs matched between the query training image and the comparison image;
determining a true value gear between the query training image and the comparison image according to the effective image block and the effective common-view image block;
determining a predicted gear between the query training image and the comparison image by using the common-view gear classification network;
and training the common-view gear classification network based on the true gear and the predicted gear to obtain the trained common-view gear classification network.
Optionally, as an embodiment, the processing unit 320 is further configured to:
extracting local features of the query training image and the comparison image by using a local feature extraction network;
the local feature extraction network comprises an encoding network, a key point detection network and a descriptor decoding network, wherein the encoding network is used for extracting basic features, the key point detection network is used for extracting the local feature points based on the basic features, and the descriptor decoding network is used for extracting the local feature point descriptors based on the basic features.
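A short sketch consistent with this three-part structure is given below; the convolutional channel sizes and head designs are assumptions for illustration, not the embodiment's specific network.

```python
import torch
import torch.nn as nn

class LocalFeatureNet(nn.Module):
    """Hedged sketch of the local feature extraction network: a shared
    encoding network for basic features, a key point detection head and a
    descriptor decoding head.  Channel sizes are illustrative assumptions."""

    def __init__(self, desc_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(                        # encoding network (basic features)
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        self.keypoint_head = nn.Conv2d(128, 1, 1)            # key point detection network
        self.descriptor_head = nn.Conv2d(128, desc_dim, 1)   # descriptor decoding network

    def forward(self, image):
        basic = self.encoder(image)
        keypoint_map = torch.sigmoid(self.keypoint_head(basic))  # local feature points
        descriptors = self.descriptor_head(basic)                # local feature point descriptors
        return keypoint_map, descriptors
```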
Optionally, as an embodiment, the processing unit 320 is further configured to:
dividing the query training image and the comparison image into a plurality of image blocks with consistent sizes;
and determining the image block comprising the local feature points as the effective image block.
Optionally, as an embodiment, the processing unit 320 is further configured to:
matching the local characteristic points in the query training image and the comparison image, and determining local characteristic point pairs on the matching;
and determining the image block comprising the local feature points in the matched local feature point pair as the effective common-view image block.
Optionally, as an embodiment, the processing unit 320 is further configured to:
determining a first common view ratio according to the effective common view image blocks and the effective image blocks in the query training image;
determining a second common-view ratio according to the effective common-view image block and the effective image block in the comparison image;
determining the smaller value of the first common-view ratio and the second common-view ratio as a common-view score between the query training image and the comparison image;
and determining a corresponding gear between the query training image and the comparison image according to the common-view score and by using a mapping relation between a preset common-view gear threshold and the gear, and using the gear as the true gear.
It should be noted that the training device 300 is embodied in the form of a functional unit. The term "unit" herein may be implemented by software and/or hardware, and is not particularly limited thereto.
Fig. 17 is a schematic diagram of an image sorting apparatus 400 for implementing the above-described image sorting method according to an embodiment of the present application.
As shown in fig. 17, the image sorting apparatus 400 includes a display unit 410 and a processing unit 420.
The display unit 410 is configured to display a first interface, where the first interface includes a first control.
Processing unit 420 is configured to detect a first operation on the first control; in response to the first operation, a query image is obtained.
The processing unit 420 is further configured to determine a global feature similarity between the query image and each frame of contrast images in an image database, where the image database includes multiple frames of contrast images;
acquiring an M-frame contrast image with high global feature similarity with the query image from the image database as an M-frame candidate image;
extracting local features of each frame of candidate images in the query image and the M frames of candidate images;
determining gears by using the trained common-view gear classification network obtained by the training method of the common-view gear classification network according to the local characteristics of the query image and the candidate images of the M frames;
determining the previous N frames of candidate images as N frames of target images from M frames of candidate images according to the sequence of the gears from large to small, wherein N is more than or equal to 1 and is less than M, and both N and M are integers;
and displaying N frames of the target image.
Optionally, as an embodiment, the processing unit 420 is further configured to:
sequencing the multiple frames of contrast images according to the sequence of the global feature similarity from large to small, and determining the previous M frames of contrast images as the M frames of candidate images;
or, taking the M frames of contrast images with the global feature similarity larger than a preset global feature similarity threshold as the M frames of candidate images.
Optionally, as an embodiment, the processing unit 420 is further configured to:
and extracting local features of the query image and the M frames of candidate images by using a local feature extraction network.
Optionally, as an embodiment, the trained common-view gear classification network includes: a decoding network, an attention mechanism network, and a classification network.
The processing unit 420 is further configured to:
inputting the local features of the query image into a decoding network to obtain a first feature map;
inputting the first feature map into the attention mechanism network to obtain an attention feature vector;
performing point multiplication on the attention feature vector and local features of the M frames of candidate images to obtain a second feature map;
and inputting the first characteristic diagram and the second characteristic diagram into the classification network to obtain the gear.
Optionally, as an embodiment, the first interface is a photographing interface, and the first control is a control for instructing to take a photograph.
The image sorting apparatus 400 is embodied as a functional unit. The term "unit" herein may be implemented by software and/or hardware, and is not particularly limited thereto.
For example, a "unit" may be a software program, a hardware circuit, or a combination of both that implement the above-described functions. The hardware circuitry may include an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (e.g., a shared processor, a dedicated processor, or a group of processors) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functionality.
Accordingly, the units of the respective examples described in the embodiments of the present application can be realized in electronic hardware, or a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores computer instructions; the computer readable storage medium, when run on the image sorting apparatus 400, causes the image sorting apparatus 400 to perform the image sorting method shown previously.
The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or can comprise one or more data storage devices, such as a server, a data center, etc., that can be integrated with the medium. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium, or a semiconductor medium (e.g., solid State Disk (SSD)), among others.
Embodiments of the present application also provide a computer program product containing computer instructions, which when run on the image sorting apparatus 400, enables the image sorting apparatus 400 to execute the aforementioned image sorting method.
Fig. 18 is a schematic structural diagram of a chip according to an embodiment of the present application. The chip shown in fig. 18 may be a general-purpose processor or may be a dedicated processor. The chip includes a processor 501. The processor 501 is configured to support the image sorting apparatus 400 to execute the foregoing technical solutions.
Optionally, the chip further includes a transceiver 502, where the transceiver 502 is configured to receive control of the processor 501, and is configured to support the image sorting apparatus 400 to execute the foregoing technical solution.
Optionally, the chip shown in fig. 18 may further include: a storage medium 503.
It should be noted that the chip shown in fig. 18 can be implemented by using the following circuits or devices: one or more Field Programmable Gate Arrays (FPGAs), programmable Logic Devices (PLDs), controllers, state machines, gate logic, discrete hardware components, any other suitable circuitry, or any combination of circuitry capable of performing the various functions described throughout this application.
The electronic device, the image sorting apparatus 400, the computer storage medium, the computer program product, and the chip provided in the embodiments of the present application are all configured to execute the method provided above, and therefore, reference may be made to the beneficial effects corresponding to the method provided above for the beneficial effects that can be achieved by the electronic device, the image sorting apparatus 400, the computer storage medium, the computer program product, and the chip provided in the embodiments of the present application, and details are not repeated here.
It should be understood that the above description is only intended to help those skilled in the art better understand the embodiments of the present application, and is not intended to limit the scope of the embodiments of the present application. It will be apparent to those skilled in the art that various equivalent modifications or variations are possible in light of the above examples; for example, some steps of the methods described above may not be necessary, some steps may be newly added, or any two or more of the above embodiments may be combined. Such modifications, variations, or combinations are also within the scope of the embodiments of the present application.
It should also be understood that the foregoing descriptions of the embodiments of the present application focus on differences between the various embodiments, and that the same or similar parts that are not mentioned may be referred to one another, and thus, for brevity, will not be described again.
It should also be understood that the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic thereof, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should also be understood that, in this embodiment of the present application, "preset" and "predefined" may be implemented by saving a corresponding code, table, or other means that can be used to indicate relevant information in advance in a device (for example, including an electronic device), and this application is not limited to the specific implementation manner thereof.
It should also be understood that the manner, the case, the category, and the division of the embodiments in the present application are only for convenience of description, and should not constitute a particular limitation, and features in various manners, categories, cases, and embodiments may be combined without contradiction.
It should also be understood that, provided no specific statement is made and no logical conflict exists, the terminology and/or the descriptions of the various embodiments herein are consistent and may be mutually referenced, and the technical features of the various embodiments may be combined to form new embodiments based on their inherent logical relationships.
Finally, it should be noted that: the above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A training method of a common-view gear classification network, characterized by comprising the following steps:
acquiring a query training image and an image database, wherein the image database comprises a multi-frame contrast image;
extracting local features of the query training image and the comparison image, wherein the local features comprise local feature points and local feature point descriptors;
determining effective image blocks and effective common-view image blocks in the query training image and the comparison image according to the local features, wherein the effective image blocks are image blocks comprising the local feature points, and the effective common-view image blocks are image blocks comprising local feature points in local feature point pairs matched between the query training image and the comparison image;
determining a first common view ratio according to the effective common view image block and the effective image block in the query training image;
determining a second common-view ratio according to the effective common-view image block and the effective image block in the comparison image;
determining the smaller of the first common-view ratio and the second common-view ratio as a common-view score between the query training image and the comparison image;
according to the common-view score, determining a corresponding gear between the query training image and the comparison image by utilizing a mapping relation between a preset common-view gear threshold and the gear, and taking the gear as a true gear;
determining a predicted gear between the query training image and the comparison image using a common view gear classification network;
and training the common-view gear classification network based on the true gear and the predicted gear to obtain the trained common-view gear classification network.
2. The training method of claim 1, wherein extracting local features of the query training image and the comparison image comprises:
extracting local features of the query training image and the comparison image by using a local feature extraction network;
the local feature extraction network comprises an encoding network, a key point detection network and a descriptor decoding network, wherein the encoding network is used for extracting basic features, the key point detection network is used for extracting the local feature points based on the basic features, and the descriptor decoding network is used for extracting the local feature point descriptors based on the basic features.
3. The training method according to claim 1 or 2, wherein determining effective image blocks in the query training image and the comparison image according to the local features comprises:
dividing the query training image and the comparison image into a plurality of image blocks with consistent sizes;
and determining the image block comprising the local feature points as the effective image block.
4. The training method of claim 3, wherein determining effective common-view image blocks in the query training image and the comparison image based on the local features comprises:
matching the local characteristic points in the query training image and the comparison image, and determining local characteristic point pairs on the matching;
and determining an image block comprising the local feature points in the matched local feature point pair as the effective common-view image block.
5. An image sorting method applied to an electronic device includes:
displaying a first interface, the first interface comprising a first control;
detecting a first operation on the first control;
responding to the first operation, and acquiring a query image;
determining global feature similarity between the query image and each frame of contrast image in an image database, wherein the image database comprises multi-frame contrast images;
acquiring an M-frame contrast image with high global feature similarity with the query image from the image database as an M-frame candidate image;
extracting local features of each frame of candidate images in the query image and the M frames of candidate images;
determining gears according to the local features of the query image and the M frames of candidate images by using a trained common-view gear classification network obtained by the training method of the common-view gear classification network according to any one of claims 1 to 4;
determining the previous N frames of candidate images as N frames of target images from M frames of candidate images according to the sequence of gears from large to small, wherein N is more than or equal to 1 and less than M, and both N and M are integers;
and displaying N frames of the target image.
6. The image sorting method according to claim 5, wherein acquiring, from the image database, an M-frame contrast image having a high global feature similarity to the query image as an M-frame candidate image comprises:
sequencing the multiple frames of contrast images according to the sequence of the global feature similarity from large to small, and determining the previous M frames of contrast images as the M frames of candidate images;
or, taking the M frames of contrast images with the global feature similarity larger than a preset global feature similarity threshold as the M frames of candidate images.
7. The image ranking method according to claim 6, wherein extracting local features of the query image and the M frames of the candidate images comprises:
and extracting local features of the query image and the M frames of candidate images by using a local feature extraction network.
8. The image ranking method of claim 7, wherein the trained common view gear classification network comprises: a decoding network, an attention mechanism network and a classification network;
determining gears by using the trained common-view gear classification network according to the local features of the query image and the M frames of candidate images, wherein the step comprises the following steps:
inputting the local features of the query image into a decoding network to obtain a first feature map;
inputting the first feature map into the attention mechanism network to obtain an attention feature vector;
performing point multiplication on the attention feature vector and local features of the M frames of candidate images to obtain a second feature map;
and inputting the first characteristic diagram and the second characteristic diagram into the classification network to obtain the gear.
9. The image sorting method according to claim 5, wherein the first interface is a photographing interface, and the first control is a control for instructing photographing.
10. An electronic device, comprising:
one or more processors and memory;
the memory coupled with the one or more processors for storing computer program code comprising computer instructions which are invoked by the one or more processors to cause the electronic device to perform the method of any of claims 1 to 4, and/or the method of any of claims 5 to 9.
11. A chip system, wherein the chip system is applied to an electronic device, and the chip system comprises one or more processors for invoking computer instructions to cause the electronic device to execute the method of any one of claims 1 to 4 and/or the method of any one of claims 5 to 9.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the method of any of claims 1 to 4 and/or the method of any of claims 5 to 9.
CN202211038622.4A 2022-08-29 2022-08-29 Training method of common-view gear classification network, image sorting method and related equipment Active CN115170893B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211038622.4A CN115170893B (en) 2022-08-29 2022-08-29 Training method of common-view gear classification network, image sorting method and related equipment

Publications (2)

Publication Number Publication Date
CN115170893A CN115170893A (en) 2022-10-11
CN115170893B true CN115170893B (en) 2023-01-31

Family

ID=83480426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211038622.4A Active CN115170893B (en) 2022-08-29 2022-08-29 Training method of common-view gear classification network, image sorting method and related equipment

Country Status (1)

Country Link
CN (1) CN115170893B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115437601B (en) * 2022-11-02 2024-04-19 荣耀终端有限公司 Image ordering method, electronic device, program product and medium
CN117131213A (en) * 2023-04-10 2023-11-28 荣耀终端有限公司 Image processing method and related equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902192A (en) * 2019-01-15 2019-06-18 华南师范大学 Remote sensing image retrieval method, system, equipment and the medium returned based on unsupervised depth
CN111782879A (en) * 2020-07-06 2020-10-16 Oppo(重庆)智能科技有限公司 Model training method and device
CN112241704A (en) * 2020-10-16 2021-01-19 百度(中国)有限公司 Method and device for judging portrait infringement, electronic equipment and storage medium
CN113361542A (en) * 2021-06-02 2021-09-07 合肥工业大学 Local feature extraction method based on deep learning
CN113850367A (en) * 2021-08-31 2021-12-28 荣耀终端有限公司 Network model training method, image processing method and related equipment thereof
CN114140524A (en) * 2021-11-10 2022-03-04 国网辽宁省电力有限公司电力科学研究院 Closed loop detection system and method for multi-scale feature fusion

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309835B (en) * 2019-06-27 2021-10-15 中国人民解放军战略支援部队信息工程大学 Image local feature extraction method and device
CN110705638A (en) * 2019-09-30 2020-01-17 安徽兆尹信息科技股份有限公司 Credit rating prediction classification method using deep network learning fuzzy information feature technology
CN112052868A (en) * 2020-06-15 2020-12-08 上海集成电路研发中心有限公司 Model training method, image similarity measuring method, terminal and storage medium
CN114140664A (en) * 2021-12-02 2022-03-04 中国建设银行股份有限公司 Training method of image processing model, and image similarity determining method and device
CN114510990A (en) * 2021-12-23 2022-05-17 浙江大华技术股份有限公司 Image processing method, image processing apparatus, electronic apparatus, and storage medium
CN114863229A (en) * 2022-03-28 2022-08-05 北京百度网讯科技有限公司 Image classification method and training method and device of image classification model
CN114880505A (en) * 2022-04-27 2022-08-09 北京百度网讯科技有限公司 Image retrieval method, device and computer program product

Also Published As

Publication number Publication date
CN115170893A (en) 2022-10-11

Similar Documents

Publication Publication Date Title
CN115170893B (en) Training method of common-view gear classification network, image sorting method and related equipment
CN111079576B (en) Living body detection method, living body detection device, living body detection equipment and storage medium
US8750573B2 (en) Hand gesture detection
US8792722B2 (en) Hand gesture detection
Liu et al. Real-time robust vision-based hand gesture recognition using stereo images
CN111476306A (en) Object detection method, device, equipment and storage medium based on artificial intelligence
CN111914812B (en) Image processing model training method, device, equipment and storage medium
CN111444744A (en) Living body detection method, living body detection device, and storage medium
US20120127276A1 (en) Image retrieval system and method and computer product thereof
CN111626163B (en) Human face living body detection method and device and computer equipment
CN116324878A (en) Segmentation for image effects
CN111476783A (en) Image processing method, device and equipment based on artificial intelligence and storage medium
CN111598065B (en) Depth image acquisition method, living body identification method, apparatus, circuit, and medium
CN116048244B (en) Gaze point estimation method and related equipment
CN111738365B (en) Image classification model training method and device, computer equipment and storage medium
CN110909817B (en) Distributed clustering method and system, processor, electronic device and storage medium
CN112818979B (en) Text recognition method, device, equipment and storage medium
CN113395583B (en) Watermark detection method, watermark detection device, computer equipment and storage medium
CN110728167A (en) Text detection method and device and computer readable storage medium
CN111814811A (en) Image information extraction method, training method and device, medium and electronic equipment
EP4303815A1 (en) Image processing method, electronic device, storage medium, and program product
CN115393755A (en) Visual target tracking method, device, equipment and storage medium
CN114399622A (en) Image processing method and related device
CN114758124A (en) Occlusion detection method, device, equipment and computer readable medium for target object
CN113196279B (en) Facial attribute identification method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant