CN114863138B - Image processing method, device, storage medium and equipment - Google Patents

Image processing method, device, storage medium and equipment

Info

Publication number
CN114863138B
Authority
CN
China
Prior art keywords
image
feature
sample image
processed
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210798669.4A
Other languages
Chinese (zh)
Other versions
CN114863138A (en)
Inventor
郭卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202210798669.4A
Publication of CN114863138A
Application granted
Publication of CN114863138B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Abstract

The embodiments of the application provide an image processing method, apparatus, storage medium and device, which can be applied to fields such as cloud technology, artificial intelligence, blockchain, the Internet of Vehicles, intelligent transportation and smart home. The method comprises the following steps: acquiring an image to be processed, and performing feature extraction processing on the image to be processed by using a target global feature extraction network to obtain global features of the image to be processed; acquiring a saliency image of the image to be processed, and performing feature extraction processing on the saliency image by using a target saliency feature extraction network to obtain saliency features of the image to be processed; and determining fusion features of the image to be processed based on the global features and the saliency features of the image to be processed. With this method, the accuracy of the extracted image features can be effectively improved.

Description

Image processing method, image processing apparatus, storage medium, and device
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an image processing method, an image processing apparatus, a storage medium, and a device.
Background
At present, various systems, software products and websites provide massive numbers of images for users to choose from, and an image retrieval function is provided to make it easier for users to find images. In existing image retrieval methods, image features are obtained by analyzing an image, and similar images are then searched for according to those image features and a retrieval algorithm. However, the accuracy of the image features extracted by existing feature extraction methods is low, and when a retrieval task is executed with low-accuracy image features, the retrieval results are inaccurate.
Disclosure of Invention
The embodiments of the application provide an image processing method, an image processing apparatus, a storage medium, and a device, which can effectively improve the accuracy of extracted image features.
In one aspect, an embodiment of the present application provides an image processing method, where the method includes:
acquiring an image to be processed, and performing feature extraction processing on the image to be processed by using a target global feature extraction network to obtain global features of the image to be processed;
acquiring a saliency image of the image to be processed, and performing feature extraction processing on the saliency image of the image to be processed by using a target saliency feature extraction network to obtain saliency features of the image to be processed;
determining fusion characteristics of the image to be processed based on the global characteristics and the salient characteristics of the image to be processed;
the target significant feature extraction network is obtained by training an initial significant feature extraction network by combining the target global feature extraction network and a triple training sample, wherein the triple training sample comprises an anchor sample image, and a positive sample image and a negative sample image corresponding to the anchor sample image; when the initial salient feature extraction network is trained, the target global feature extraction network is used for extracting global features of the anchor sample image, the positive sample image and the negative sample image respectively, and the initial salient feature extraction network is used for extracting salient features of the anchor sample image, the positive sample image and the negative sample image respectively; the target salient feature extraction network is obtained by adjusting network parameters of the initial salient feature extraction network based on a first loss parameter and a second loss parameter, wherein the first loss parameter is determined based on respective salient features of the anchor sample image, the positive sample image and the negative sample image, and the second loss parameter is determined based on respective global features and salient features of the anchor sample image, the positive sample image and the negative sample image.
In one aspect, an embodiment of the present application provides an image processing apparatus, where the apparatus includes:
the acquisition unit is used for acquiring an image to be processed;
the processing unit is used for performing feature extraction processing on the image to be processed by using a target global feature extraction network to obtain global features of the image to be processed;
the acquisition unit is further used for acquiring a saliency image of the image to be processed;
the processing unit is further configured to perform feature extraction processing on the saliency image of the to-be-processed image by using a target saliency feature extraction network to obtain saliency features of the to-be-processed image;
the processing unit is further used for determining fusion characteristics of the image to be processed based on the global characteristics and the salient characteristics of the image to be processed;
the target significant feature extraction network is obtained by training an initial significant feature extraction network by combining the target global feature extraction network and a triple training sample, wherein the triple training sample comprises an anchor sample image, and a positive sample image and a negative sample image corresponding to the anchor sample image; when the initial salient feature extraction network is trained, the target global feature extraction network is used for extracting global features of the anchor sample image, the positive sample image and the negative sample image respectively, and the initial salient feature extraction network is used for extracting salient features of the anchor sample image, the positive sample image and the negative sample image respectively; the target significant feature extraction network is obtained by adjusting network parameters of the initial significant feature extraction network based on a first loss parameter and a second loss parameter, wherein the first loss parameter is determined based on the significant features of the anchor sample image, the positive sample image and the negative sample image, and the second loss parameter is determined based on the global features and the significant features of the anchor sample image, the positive sample image and the negative sample image.
In one aspect, the present application provides a computer device, where the computer device includes a processor, a communication interface, and a memory, where the processor, the communication interface, and the memory are connected to each other, where the memory stores a computer program, and the processor is configured to invoke the computer program to execute the image processing method according to any one of the foregoing possible implementation manners.
In one aspect, the present application provides a computer-readable storage medium, where a computer program is stored, and when executed by a processor, the computer program implements the image processing method of any one of the possible implementations.
In one aspect, the present application further provides a computer program product, where the computer program product includes a computer program or a computer instruction, and the computer program or the computer instruction is executed by a processor to implement the steps of the image processing method provided in the present application.
In an aspect, the present application further provides a computer program, where the computer program includes computer instructions, the computer instructions are stored in a computer-readable storage medium, a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the image processing method provided in the present application.
In the embodiment of the application, the initial saliency feature extraction network can be used for extracting the saliency features of the anchor sample image, the positive sample image and the negative sample image, the first loss parameter can be determined based on the saliency features of the anchor sample image, the positive sample image and the negative sample image, the second loss parameter can be determined based on the global feature and the saliency feature of the anchor sample image, the positive sample image and the negative sample image, and finally the network parameter adjustment can be performed on the initial saliency feature extraction network based on the first loss parameter and the second loss parameter to obtain the target saliency feature extraction network. The first loss parameter can enable the initial salient feature extraction network to learn the representation of the salient region in the image by measuring the similarity among the salient features of the anchor sample image, the positive sample image and the negative sample image, so that the accuracy of the salient features extracted by the target salient feature extraction network is improved; the second loss parameter can enable the initial saliency feature extraction network to perform compatibility learning on the basis of saliency learning by measuring the similarity among the fusion features (which can be determined based on the global feature and the saliency feature) of the anchor sample image, the positive sample image and the negative sample image, so that the learning of the upgrade features (namely the fusion features) is realized, the saliency features extracted by the target saliency feature extraction network can be effectively integrated with the global feature, and the accuracy of the saliency features extracted by the target saliency feature extraction network is further improved. Meanwhile, the saliency features of the image can be introduced into the obtained fusion features through effective integration, and even if the saliency region becomes small or image attack and the like occur, the position to be represented (namely the saliency region) can be accurately positioned through the fusion features, so that the whole image can be more effectively represented.
The target global feature extraction network is used for carrying out feature extraction on the image to be processed to obtain the global features of the image to be processed, the target saliency feature extraction network is used for carrying out feature extraction on the saliency image of the image to be processed to obtain the saliency features of the image to be processed, the different-level features of the image to be processed can be mined to the greatest extent, effective integration can be realized based on the global features and the saliency features of the image to be processed, the fusion features of the image to be processed can be obtained, the defects of the global features can be compensated, and the accuracy of the extracted image features can be effectively improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic system architecture diagram of an image processing system according to an embodiment of the present disclosure;
fig. 2 is a first flowchart illustrating an image processing method according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a saliency region mask image provided in an embodiment of the present application;
fig. 4 is a first schematic flowchart of an image deduplication retrieval method according to an embodiment of the present application;
fig. 5 is a second schematic flowchart of an image deduplication retrieval method according to an embodiment of the present application;
fig. 6 is a third schematic flowchart of an image deduplication retrieval method according to an embodiment of the present application;
fig. 7 is a fourth schematic flowchart of an image deduplication retrieval method according to an embodiment of the present application;
fig. 8 is a fifth schematic flowchart of an image deduplication retrieval method according to an embodiment of the present application;
fig. 9 is a schematic flowchart of a target salient feature extraction network acquisition method according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a similar image provided by an embodiment of the present application;
fig. 11 is a second schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It is to be understood that the described embodiments are only some of the embodiments of the present application, and not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
The embodiments of the application provide an image processing method which can effectively improve the accuracy of extracted image features and which can be applied to fields or scenarios such as cloud technology, artificial intelligence, blockchain, the Internet of Vehicles, intelligent transportation and smart home. In an embodiment, the image processing method specifically relates to machine learning techniques within artificial intelligence. Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstrations.
Referring to fig. 1, fig. 1 is a schematic system structure diagram of an image processing system according to an embodiment of the present disclosure, as shown in fig. 1, the system includes a computer device 10 and a database 11, and the computer device 10 and the database 11 may be connected in a wired or wireless manner.
The computer device 10 may include one or more of a terminal and a server. That is, the data processing method proposed in the embodiment of the present application may be executed by a terminal, may be executed by a server, or may be executed by both a terminal and a server capable of communicating with each other. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and an artificial intelligence platform. The terminal device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, and the like.
The database 11 may be a local database of the computer device 10, a cloud database accessible by the computer device 10, or a local database of another computer device. The database 11 may be used to store triplet training samples and images to be processed.
The computer device 10 may carry a target global feature extraction network, and may perform optimization updating on the initial salient feature extraction network through the target global feature extraction network to obtain a target salient feature extraction network. The interaction process with the database 11 is as follows:
the computer device 10 obtains the triple training samples from the database 11, and trains the initial saliency feature extraction network by combining the target global feature extraction network and the triple training samples to obtain the target saliency feature extraction network. When the initial saliency feature extraction network is trained, the target global feature extraction network is used for extracting global features of an anchor sample image, a positive sample image and a negative sample image in a triple training sample, the initial saliency feature extraction network is used for extracting saliency features of the anchor sample image, the positive sample image and the negative sample image, so that the computer device 10 can determine a first loss parameter based on the saliency features of the anchor sample image, the positive sample image and the negative sample image, determine a second loss parameter based on the global features and the saliency features of the anchor sample image, the positive sample image and the negative sample image, and finally perform network parameter adjustment on the initial saliency feature extraction network based on the first loss parameter and the second loss parameter to obtain the target saliency feature extraction network.
The computer device 10 may obtain the image to be processed from the database 11, the computer device 10 may perform feature extraction processing on the image to be processed by using the target global feature extraction network to obtain global features of the image to be processed, perform feature extraction processing on the saliency image of the image to be processed by using the target saliency feature extraction network to obtain saliency features of the image to be processed, and then perform effective integration based on the global features and the saliency features of the image to be processed to obtain fusion features of the image to be processed.
In an embodiment, a requesting terminal may send the image to be processed to the computer device 10. After the computer device 10 processes the image to be processed by using the target global feature extraction network and the target saliency feature extraction network to obtain the fusion features of the image to be processed, it may execute an image deduplication retrieval task by using these fusion features, recall images similar to the image to be processed, and return the recalled similar images to the requesting terminal.
When judging whether two images are duplicates or similar, human eyes pay more attention to the salient region of an image, that is, the region of the image that a user is interested in. When the area of the salient region is small, the extracted global features mostly characterize the whole image rather than focusing on the salient region. By effectively integrating the global features and the saliency features, the characterization of the salient region can be added to the resulting fusion features, so that even if the salient region is small or the image has undergone an image attack (such as a color, chroma or luminance transformation, or cropping), the salient region that needs to be characterized can still be accurately located. The whole image can therefore be characterized more accurately, the accuracy of the extracted image features can be effectively improved, and the problem of missed recalls in the image deduplication retrieval task caused by poor image characterization when the salient region occupies only a small part of the image is alleviated.
It is to be understood that the system architecture diagram described in the embodiment of the present application is for more clearly illustrating the technical solution of the embodiment of the present application, and does not constitute a limitation to the technical solution provided in the embodiment of the present application, and as a person having ordinary skill in the art knows that along with the evolution of the system architecture and the appearance of a new service scenario, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.
Referring to fig. 2, fig. 2 is a first flowchart illustrating an image processing method according to an embodiment of the present disclosure. The method may be applied to the computer device 10 in the image processing system described above, and includes:
s201, obtaining an image to be processed, and performing feature extraction processing on the image to be processed by using a target global feature extraction network to obtain global features of the image to be processed.
In the embodiment of the present application, the image to be processed may be any image. The target global feature extraction network is used to extract a characterization of the whole image (the full picture). Therefore, the application performs feature extraction processing on the image to be processed by using the target global feature extraction network to obtain the global features of the image to be processed, and these global features are the global characterization of the image to be processed.
In an embodiment, the target global feature extraction network is obtained by training based on a triplet training sample and an initial global feature extraction network, wherein the triplet training sample includes an anchor sample image, and a positive sample image and a negative sample image corresponding to the anchor sample image. The anchor sample image and the positive sample image corresponding to the anchor sample image satisfy a similarity relation, and the anchor sample image and the negative sample image corresponding to the anchor sample image satisfy a dissimilarity relation. The initial global feature extraction network is used for extracting global features of an anchor sample image, a positive sample image and a negative sample image in a triple training sample, determining a reference measurement loss parameter and a first average quantization loss parameter based on the global features of the anchor sample image, the positive sample image and the negative sample image, performing network parameter adjustment on network parameters of the initial global feature extraction network by using the reference measurement loss parameter and the first average quantization loss parameter to obtain an adjusted global feature extraction network, and determining a target global feature extraction network based on the adjusted global feature extraction network.
S202, obtaining a saliency image of the image to be processed, and performing feature extraction processing on the saliency image of the image to be processed by using a target saliency feature extraction network to obtain saliency features of the image to be processed.
In the embodiment of the present application, the saliency image of the to-be-processed image includes a saliency region in the to-be-processed image, and the saliency region refers to a region in one image, which is usually more interesting to human eyes, such as a human, an animal, and an object in the image. For example, referring to the content indicated by 31 in fig. 3, it can be seen that the human eye focuses primarily on the salient areas of the animal, sofa, etc. in the figure, rather than the surrounding green environment or white background. The target salient feature extraction network is used for extracting the global representation of the salient image of the image to be processed, and the effective information in the salient image of the image to be processed is the salient region in the image to be processed, so that the global representation of the salient image of the image to be processed can be determined as the representation of the salient region in the image to be processed, and the salient region in the image to be processed is subjected to detailed feature description. The method and the device can utilize the target saliency feature extraction network to perform feature extraction processing on the saliency image of the image to be processed to obtain the saliency feature of the image to be processed, wherein the saliency feature is the representation of a saliency region in the image to be processed.
In one embodiment, the process of acquiring the saliency image of the image to be processed includes: normalizing the pixel value of each pixel point in the image to be processed so that, after normalization, every pixel value lies between 0 and 1. Further, the pixel values of the first type of pixel points and the second type of pixel points in the normalized image can be adjusted to a first pixel value and a second pixel value respectively, so as to obtain the salient-region mask image of the image to be processed. The first type of pixel points are pixel points whose normalized pixel value is smaller than a set pixel value (which may be set manually, for example 0.5), and the second type of pixel points are pixel points whose normalized pixel value is greater than or equal to the set pixel value. Specifically, the pixel value of the first type of pixel points may be adjusted to the first pixel value (the first pixel value being 0), and the pixel value of the second type of pixel points may be adjusted to the second pixel value (the second pixel value being 1), thereby obtaining the salient-region mask image of the image to be processed. For example, referring to fig. 3, the content indicated by 31 in fig. 3 shows the original images, and the content indicated by 32 shows the corresponding salient-region mask images (where the pixel value of the white region is 1 and the pixel value of the black region is 0). The salient-region mask image of the image to be processed can then be multiplied by the image to be processed to obtain the saliency image of the image to be processed. Because the pixel value of the background region in the mask image is 0 and the pixel value of the salient region is 1, after the multiplication the pixel values of the background region of the image to be processed are all set to 0 (black) while the salient region remains unchanged, so the effective information in the saliency image of the image to be processed is exactly the salient region of the image to be processed. In an image deduplication retrieval task, whether the salient regions are consistent is also critical; for example, for animals such as the giraffe in fig. 3, if the animal in the image cannot be effectively characterized, a full-image characterization will mainly describe the background environment, which easily causes images related to the animal to be missed during recall. A minimal sketch of this mask-and-multiply procedure is given below.
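The following is a minimal NumPy sketch of the mask-and-multiply procedure described above. The function name, the single-channel score_map input (the per-pixel values that are normalized and thresholded) and the default threshold of 0.5 are illustrative assumptions rather than part of the patent.

```python
import numpy as np

def build_saliency_image(image: np.ndarray, score_map: np.ndarray,
                         set_pixel_value: float = 0.5) -> np.ndarray:
    """Return the saliency image of `image` given a single-channel score map."""
    # Normalize so every value lies between 0 and 1.
    s = score_map.astype(np.float32)
    s = (s - s.min()) / (s.max() - s.min() + 1e-8)

    # First-type pixels (below the set pixel value) become 0,
    # second-type pixels become 1: this is the salient-region mask image.
    mask = (s >= set_pixel_value).astype(np.float32)

    # Multiplying the mask by the image blacks out the background (value 0)
    # and leaves the salient region unchanged.
    return image.astype(np.float32) * mask[..., np.newaxis]
```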
In a possible embodiment, an image Saliency recognition Network, for example, a Pyramid Feature Attention Network for Saliency detection, may be obtained, the image to be processed is input into the image Saliency recognition Network for processing, an evaluation score of Saliency of the image to be processed is obtained, and if the evaluation score is greater than a preset score threshold, the step of obtaining the Saliency image in the image to be processed is performed. By the embodiment, whether the image to be processed has the saliency region can be verified, so that the effective information in the saliency image of the image to be processed is ensured to be the saliency region in the image to be processed.
The target significant feature extraction network is obtained by training an initial significant feature extraction network in combination with a target global feature extraction network and a triple training sample, wherein the triple training sample comprises an anchor sample image, and a positive sample image and a negative sample image corresponding to the anchor sample image; when the initial saliency feature extraction network is trained, the target global feature extraction network is used for extracting global features of the anchor sample image, the positive sample image and the negative sample image, and the initial saliency feature extraction network is used for extracting saliency features of the anchor sample image, the positive sample image and the negative sample image; the target significant feature extraction network is obtained by adjusting network parameters of the initial significant feature extraction network based on a first loss parameter and a second loss parameter, wherein the first loss parameter is determined based on the significant features of the anchor sample image, the positive sample image and the negative sample image, and the second loss parameter is determined based on the global features and the significant features of the anchor sample image, the positive sample image and the negative sample image. The detailed flow of the target salient feature extraction network acquisition is explained by the following steps S901 to S906.
And S203, determining the fusion characteristics of the image to be processed based on the global characteristics and the salient characteristics of the image to be processed.
In an embodiment, the global features and the saliency features of the image to be processed may be fused to obtain the fusion features of the image to be processed. The fusion may specifically be achieved by splicing (concatenating) the global features and the saliency features of the image to be processed. With this embodiment, the saliency features of the image to be processed are introduced into the fusion features, so that even if the salient region becomes small or the image suffers an image attack, the position that needs to be characterized (namely the salient region) can still be accurately located through the fusion features, the image to be processed is characterized more effectively, and the accuracy of the extracted image features is improved.
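As a concrete illustration of the splicing step, the sketch below simply concatenates the two feature vectors; the use of PyTorch and the function name are assumptions of this sketch.

```python
import torch

def fuse_features(global_feat: torch.Tensor, salient_feat: torch.Tensor) -> torch.Tensor:
    # Fusion by splicing (concatenation) along the feature dimension;
    # both inputs are (batch, dim) tensors produced by the two networks.
    return torch.cat([global_feat, salient_feat], dim=-1)
```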
The extracted fusion features can be used in a common image deduplication retrieval task. Referring to fig. 4, fig. 4 is a schematic flowchart of an image deduplication retrieval method provided by an embodiment of the present application. The method includes: inputting a stored image (any image that needs to be saved into the image library) into the target global feature extraction network and the target saliency feature extraction network for feature extraction processing to obtain the saliency features and global features of the stored image, then fusing the saliency features and global features of the stored image to obtain the feature data (namely the fusion features) of the stored image, and finally saving the feature data of the stored image into a feature database. At the same time, a feature index is built for the stored image; the feature index includes an index key and an index value, the index key is generated from the image identifier of the stored image (used to uniquely identify it), and the index value is generated from the feature data of the stored image. In an implementation, as shown in fig. 5, the image to be processed may be input into the target global feature extraction network and the target saliency feature extraction network for feature extraction processing to obtain its saliency features and global features, which are then fused to obtain the fusion features of the image to be processed, and the feature database, which contains the feature data of every image in the image library, may be queried based on these fusion features. Specifically, the similarity between the fusion features of the image to be processed and the feature data of each image in the image library can be calculated, and the feature data whose similarity is greater than a similarity threshold is taken as the matching feature data; alternatively, the feature data can be sorted by similarity from large to small and a preset number of top-ranked entries taken as the matching feature data. If matching feature data exists in the feature database, the corresponding images in the image library are determined to be similar images of the image to be processed. Specifically, the feature index whose index value is the matching feature data can be determined, the similar images can be recalled according to the index key in that feature index, and finally the similar images of the image to be processed are returned as the recall result of the image deduplication retrieval task.
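A sketch of the query step against the feature database is given below; the cosine-similarity measure, the array layout and all names are illustrative assumptions, since the patent only requires computing some similarity between fusion features and stored feature data.

```python
import numpy as np

def query_similar_images(fused_query: np.ndarray, feature_db: np.ndarray,
                         image_ids: list, sim_threshold: float = 0.9,
                         top_k: int = None):
    """Return (image_id, similarity) pairs whose stored feature data match the query."""
    # Normalize so the dot product is a cosine similarity.
    q = fused_query / (np.linalg.norm(fused_query) + 1e-8)
    db = feature_db / (np.linalg.norm(feature_db, axis=1, keepdims=True) + 1e-8)
    sims = db @ q                                   # similarity to every stored image

    if top_k is not None:                           # take the top-ranked entries ...
        hits = np.argsort(-sims)[:top_k]
    else:                                           # ... or everything above the threshold
        hits = np.where(sims > sim_threshold)[0]
    return [(image_ids[i], float(sims[i])) for i in hits]
```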
With the above method, the higher-accuracy fusion features improve the accuracy of the recall results of the image deduplication retrieval task. At the same time, thanks to the feature fusion, the image deduplication retrieval system only needs to retrieve with the single feature characterization (namely the fusion features) generated for the image to be processed; compared with multi-feature retrieval, where similarities between multiple features must be computed, this reduces the complexity of the image deduplication retrieval system.
As shown in fig. 6, the original deduplication retrieval system uses the target global feature extraction network to extract the global features of stored images and saves them into an old feature database; in addition, an old feature index is established for each stored image, which includes an index key and an index value, where the index key is generated from the image identifier of the stored image and the index value is generated from its global features. Accordingly, the original deduplication retrieval system uses the target global feature extraction network to extract the global features of the image to be processed, calculates the similarity between these global features and the global features of the stored images, and takes the images whose similarity is greater than the similarity threshold, or a preset number of top-ranked images in the similarity ordering, as similar images of the image to be processed. When the number of images in the image library is extremely large, updating all global features in the old feature database to fusion features would require interrupting the deduplication retrieval service, so the application provides a feature-compatibility upgrade method that is imperceptible to users. As shown in fig. 7, the method includes: 1) the old feature database of already-extracted global features is not refreshed and is only kept serving retrieval; 2) for the newly stored images added every day, the fusion features are extracted with the target global feature extraction network and the target saliency feature extraction network and added into a new feature database, and a new feature index is established for them, which includes an index key generated from the image identifier of the stored image and an index value generated from its fusion features; in addition, the global features split out of the fusion features are added into the old feature index, whose index key is generated from the image identifier of the stored image and whose index value is generated from its global features; 3) for an image deduplication retrieval task, both the new and the old feature indexes are retrieved, which improves the recall effect. For example, as shown in fig. 8:
for an image to be processed, its fusion features are determined by using the target global feature extraction network and the target saliency feature extraction network; the similarity between these fusion features and the fusion features in each new feature index is calculated, and the new feature indexes with higher similarity are recalled; the global features of the image to be processed are then split out of its fusion features, the similarity between these global features and the global features in each old feature index is calculated, and the old feature indexes with higher similarity are recalled; finally, the images corresponding to the index keys in the recalled new feature indexes and the images corresponding to the index keys in the recalled old feature indexes can be taken together as similar images of the image to be processed to obtain the recall result; 4) when the old feature database reaches its retirement time (which can be set manually), the old feature index is no longer used, the global features no longer need to be split out for images entering the library, and retrieval directly uses the new feature index. With the feature-compatibility upgrade scheme provided by the application, compatibility with the historical stock (namely the old feature database) is preserved, so that the historical index (namely the old feature index) and the historical stock can quickly support the new feature application mode without being refreshed. In addition, because the new features (namely the fusion features) provide a better characterization, retrieving with both indexes (the new and the old feature index) gives the new feature index the ability to supplement recalls for similar images that the old features (namely the global features) cannot recall, so the retrieval effect is improved during the transition stage, and after the historical stock is retired the retrieval requirements can be met purely with the new features. Meanwhile, the method keeps the retrieval service running and upgrades it smoothly, achieving an update that is imperceptible to users.
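The dual-index retrieval of step 3) can be pictured with the sketch below. The search interface exposed by the two indexes, the assumption that the global feature occupies the leading global_dim entries of the fusion feature, and all names are illustrative, not part of the patent.

```python
def compatible_retrieval(fused_query, new_index, old_index, global_dim, top_k=50):
    """Recall against both the new (fusion-feature) and old (global-feature) indexes."""
    # 1) Query the new feature index with the full fusion feature.
    new_hits = new_index.search(fused_query, top_k)           # [(image_id, sim), ...]

    # 2) Split the global feature out of the fusion feature and query the old index.
    global_query = fused_query[:global_dim]
    old_hits = old_index.search(global_query, top_k)

    # 3) Merge both recall sets, keeping the best similarity per image id.
    merged = {}
    for image_id, sim in list(new_hits) + list(old_hits):
        merged[image_id] = max(sim, merged.get(image_id, float("-inf")))
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```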
By adopting the embodiment of the application, the computer equipment can utilize the target global feature extraction network to perform feature extraction processing on the image to be processed to obtain the global feature of the image to be processed, utilize the target saliency feature extraction network to perform feature extraction processing on the saliency image of the image to be processed to obtain the saliency feature of the image to be processed, and then effectively integrate the global feature and the saliency feature of the image to be processed to obtain the fusion feature of the image to be processed, so that the fusion feature can represent the global image of the image to be processed and can also represent the saliency region in the image to be processed, thereby supplementing the deficiency of the global feature and effectively improving the accuracy of the extracted image feature.
Referring to fig. 9, fig. 9 is a flowchart illustrating an obtaining method of a target salient feature extraction network according to an embodiment of the present application, where the obtaining method of the target salient feature extraction network can be executed by the computer device 10 in the image processing system. The method comprises the following steps:
s901, obtaining a triple training sample, wherein the triple training sample comprises an anchor sample image, and a positive sample image and a negative sample image corresponding to the anchor sample image.
Because it is difficult to find suitable triplet training samples in massive image data, in a feasible implementation two images can be taken from the massive image data, and if the two images are similar enough they are used as a positive sample pair, so that a positive sample pair contains a pair of similar images. As shown in fig. 10, any two of the four images indicated by 101, 102, 103 and 104 in fig. 10 may be taken as a positive sample pair. Some of these images were generated through image attacks, where an image attack may include a color change, chromaticity change, cropping, rotation, luminance change, adding a filter, and so on; for example, the images indicated by 102 and 103 were obtained by rotating and cropping, respectively, the image indicated by 101.
Suppose M positive sample pairs are available. For any one of the M positive sample pairs (referred to below as the target sample pair), triplet sample mining may be performed on it to obtain the triplet training samples corresponding to that pair. The specific process is as follows: the two images in the target sample pair are taken as the anchor sample image and the positive sample image respectively; the similarity between the anchor sample image and each image in the other positive sample pairs is calculated; the images are sorted by similarity from small to large, and the first Q (for example, 20) images are taken as negative sample images, each of which forms a triplet training sample together with the anchor sample image and the positive sample image of the target sample pair. Here Q is a positive integer and M is a positive integer. From this description of triplet sample mining on the target sample pair, it can be seen that if each of the M positive sample pairs is processed in a similar manner, Q × M triplet training samples are finally obtained. It should be noted that M needs to be set to a relatively large value, such as 1024. A sketch of this mining procedure is given below.
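The sketch below follows the mining procedure literally as described above (candidates sorted by ascending similarity to the anchor, first Q kept as negatives); the sim_fn helper and the in-memory representation of the pairs are assumptions of this sketch.

```python
def mine_triplets(positive_pairs, sim_fn, Q=20):
    """Build up to Q * M triplet training samples from M positive pairs."""
    triplets = []
    for i, (anchor, positive) in enumerate(positive_pairs):
        # Candidate negatives are the images of every other positive pair.
        candidates = [img for j, pair in enumerate(positive_pairs) if j != i
                      for img in pair]
        # Sort by similarity to the anchor, from small to large, and keep the first Q.
        candidates.sort(key=lambda img: sim_fn(anchor, img))
        for negative in candidates[:Q]:
            triplets.append((anchor, positive, negative))
    return triplets
```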
S902, extracting the global features of the anchor sample image, the positive sample image and the negative sample image by using a target global feature extraction network.
In one embodiment, the target global feature extraction network may be obtained by training the initial global feature extraction network with the triplet training samples. Specifically, the anchor sample image, the positive sample image and the negative sample image contained in a triplet training sample are input into the initial global feature extraction network for feature extraction processing, so as to obtain the global features of the anchor sample image, of the positive sample image and of the negative sample image. Then, the global features of the anchor sample image, the positive sample image and the negative sample image are substituted for X_a, X_p and X_n respectively in the expression of the triplet metric loss function shown in formula (1) below, and the reference metric loss parameter is obtained by calculation. The reference metric loss parameter is used to make the initial global feature extraction network perform feature metric learning on the global features, where the learning task of feature metric learning is to ensure that the global features of the anchor sample image and the positive sample image are close enough (i.e., have a relatively high similarity) and that the global features of the anchor sample image and the negative sample image are far enough apart (i.e., have a relatively low similarity).
L_tri = max(||X_a - X_p|| - ||X_a - X_n|| + α, 0)    (1)
where L_tri denotes the triplet metric loss parameter, X_a denotes the characterization of the anchor sample image, X_p denotes the characterization of the positive sample image, X_n denotes the characterization of the negative sample image, ||X_a - X_p|| denotes the distance between the characterizations of the anchor sample image and the positive sample image, and ||X_a - X_n|| denotes the distance between the characterizations of the anchor sample image and the negative sample image; the larger the distance, the smaller the similarity, and the smaller the distance, the greater the similarity. α denotes a similarity difference threshold (which may be set manually, e.g., 50). The triplet metric loss function is used to ensure that the difference between the similarity of the characterizations of the anchor sample image and the positive sample image and the similarity of the characterizations of the anchor sample image and the negative sample image is greater than the similarity difference threshold.
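A PyTorch sketch of the triplet metric loss of formula (1) follows; the choice of the L2 norm as the distance and the batch-mean reduction are assumptions of the sketch, and alpha defaults to the example value of 50 given above.

```python
import torch

def triplet_metric_loss(x_a: torch.Tensor, x_p: torch.Tensor, x_n: torch.Tensor,
                        alpha: float = 50.0) -> torch.Tensor:
    """L_tri = max(||X_a - X_p|| - ||X_a - X_n|| + alpha, 0), formula (1)."""
    d_ap = torch.norm(x_a - x_p, dim=-1)     # distance between anchor and positive
    d_an = torch.norm(x_a - x_n, dim=-1)     # distance between anchor and negative
    return torch.clamp(d_ap - d_an + alpha, min=0.0).mean()
```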
Further, the quantized code value corresponding to the global features of the anchor sample image, the quantized code value corresponding to the global features of the positive sample image, and the quantized code value corresponding to the global features of the negative sample image are determined. That is, each feature value in the global features output by the initial global feature extraction network is mapped to -1 or 1, which can be realized by formula (2) below.
[Formula (2) is rendered as an image in the original document; it maps each feature value Q_i to a quantized code value B_i of -1 or 1.]
where Q_i denotes the i-th feature value, and B_i denotes the quantized code value obtained after the i-th feature value is mapped.
Further, the quantized code values corresponding to the global features of the anchor sample image and the corresponding feature values in the global features of the anchor sample image may be substituted for B_i and Q_i in the expression of the quantization loss function shown in formula (3) below, and the first quantization loss parameter is obtained by calculation. Likewise, the quantized code values corresponding to the global features of the positive sample image and the corresponding feature values in the global features of the positive sample image may be substituted for B_i and Q_i in formula (3), giving the second quantization loss parameter; and the quantized code values corresponding to the global features of the negative sample image and the corresponding feature values in the global features of the negative sample image may be substituted for B_i and Q_i in formula (3), giving the third quantization loss parameter. The first average quantization loss parameter is then obtained as the average of the first, second and third quantization loss parameters. The quantization loss function is used to bring the feature values closer to their corresponding quantized code values.
[Formula (3) is rendered as an image in the original document; it defines the quantization loss over the N_bits feature dimensions, penalizing the gap between each feature value Q_i and its quantized code value B_i.]
where L_coding denotes the quantization loss parameter and N_bits denotes the feature dimension.
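Because formula (2) and formula (3) appear only as images in the original text, the sketch below is a best-effort rendering of what they describe: formula (2) maps every feature value to -1 or 1, and formula (3) penalizes the gap between each feature value and its quantized code value, averaged over the N_bits feature dimensions. The exact functional form (the sign convention at 0 and the use of an absolute rather than squared gap) is an assumption.

```python
import torch

def quantize_codes(q: torch.Tensor) -> torch.Tensor:
    # Formula (2): map every feature value Q_i to a quantized code B_i of -1 or 1
    # (values >= 0 are mapped to 1 here; the handling of exactly 0 is assumed).
    return torch.where(q >= 0, torch.ones_like(q), -torch.ones_like(q))

def quantization_loss(q: torch.Tensor) -> torch.Tensor:
    # In the spirit of formula (3): bring each feature value closer to its code,
    # averaged over the N_bits feature dimensions and the batch.
    b = quantize_codes(q)
    n_bits = q.shape[-1]
    return ((q - b).abs().sum(dim=-1) / n_bits).mean()
```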
Finally, the weight of the triplet metric loss function, the weight of the quantization loss function, the reference metric loss parameter and the first average quantization loss parameter are substituted for w_1, w_2, L_tri and L_coding respectively in the expression of the target loss function shown in formula (4) below, and the target loss parameter L_total is obtained by calculation.
L_total = w_1 · L_tri + w_2 · L_coding    (4)
where w_1 denotes the weight of the triplet metric loss function and w_2 denotes the weight of the quantization loss function.
Finally, the network parameters in the initial global feature extraction network can be adjusted in the backward direction based on the target loss parameter (this can be realized by means of a stochastic gradient algorithm) to obtain a trained global feature extraction network, so that in the subsequent training process the target global feature extraction network used for extracting global features can be determined based on the trained global feature extraction network. Specifically, when the target loss function converges or the number of adjustments reaches the preset number of training iterations, the trained global feature extraction network may be determined to be the target global feature extraction network.
Note that the weight of the triplet metric loss function is greater than the weight of the quantization loss function; for example, w_1 is 1 and w_2 is 0.01. Once the triplet metric loss has fixed the sign of the coding (whether a feature value is greater than or less than 0 determines whether its quantized code value is positive or negative), the quantized code value is determined, so the quantization loss function converges easily; moreover, the quantization capability is less important than the feature metric capability. It must therefore be ensured that the triplet metric loss dominates the target loss function, so that the effect of feature metric learning is not impaired.
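Putting the pieces together, the target loss of formula (4) for one triplet batch could look like the sketch below (reusing the helper functions from the earlier sketches); the averaging of the three per-image quantization losses follows the description of the first average quantization loss parameter, and the default weights are the example values w_1 = 1 and w_2 = 0.01.

```python
import torch

def global_network_target_loss(f_a: torch.Tensor, f_p: torch.Tensor, f_n: torch.Tensor,
                               w1: float = 1.0, w2: float = 0.01,
                               alpha: float = 50.0) -> torch.Tensor:
    """L_total = w1 * L_tri + w2 * L_coding, formula (4), for one triplet batch."""
    # Reference metric loss over the three global features (formula (1)).
    l_tri = triplet_metric_loss(f_a, f_p, f_n, alpha=alpha)
    # First average quantization loss: mean of the anchor, positive and negative losses.
    l_coding = (quantization_loss(f_a) + quantization_loss(f_p)
                + quantization_loss(f_n)) / 3.0
    # w1 > w2 keeps the triplet metric term dominant, as noted above.
    return w1 * l_tri + w2 * l_coding
```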
In one embodiment, the initial global feature extraction network may include a first feature extraction model and a first feature extraction module, and the first feature extraction model may be a ResNet-101 network; please refer to table 1 below, which records the network structure of the ResNet-101 network:
TABLE 1
[Table 1 is rendered as an image in the original document; it lists the layer-by-layer structure of the ResNet-101 network described below.]
As shown in table 1 above, the ResNet-101 network includes 5 convolutional layers: the first convolutional layer uses 64 convolution kernels with a stride of 2 and a size of 7 × 7; the second convolutional layer uses max pooling of size 3 × 3 with a stride of 2, followed by 3 residual blocks, where each residual block mainly uses convolution kernels of size 1 × 1 and 3 × 3; the third convolutional layer uses 4 residual blocks; the fourth convolutional layer uses 23 residual blocks; and the fifth convolutional layer uses 3 residual blocks, so ResNet-101 uses 33 residual blocks in total.
The network structure of the first feature extraction module may be as shown in table 2 below:
TABLE 2
[Table 2 is rendered as an image in the original document; it lists the network structure of the first feature extraction module, including hash quantization layer 1 and a pooling layer.]
In one embodiment, the ResNet-101 network shown in table 1 can be a ResNet-101 network pre-trained on the ImageNet data set (a large general-purpose object recognition data set). The hash quantization layer 1 and the pooling layer in the first feature extraction module may be initialized with a Gaussian distribution with a variance of 0.01 and a mean of 0.
The embodiment of the application only exemplarily illustrates the network structure of the first feature extraction module included in the initial global feature extraction network and does not limit its specific structure; for example, the deep residual network mentioned above may also be ResNet-18, ResNet-50, or the like.
In an embodiment, after the target global feature extraction network is obtained, the anchor sample image, the positive sample image, and the negative sample image included in the triplet training sample may be input into the target global feature extraction network for feature extraction processing, so as to obtain a first global feature of the anchor sample image, a second global feature of the positive sample image, and a third global feature of the negative sample image. Due to the effect of the quantization loss function training, the first global feature, the second global feature and the third global feature output by the hash quantization layer 1 can be made to be vectors tending to-1 or 1.
And S903, extracting the respective saliency characteristics of the anchor sample image, the positive sample image and the negative sample image by using the initial saliency characteristic extraction network.
In an embodiment, the initial salient feature extraction network may include a second feature extraction model and a second feature extraction module. The second feature extraction model may also adopt the ResNet-101 network shown in table 1. The network structure of the second feature extraction module may be as shown in table 3 below:
TABLE 3
[Table 3 is reproduced as an image in the original publication; it lists the structure of the second feature extraction module, which comprises a pooling layer and hash quantization layer 2.]
In one embodiment, the hash quantization layer 2 and the pooling layer in the second feature extraction module may be initialized with a Gaussian distribution with a variance of 0.01 and a mean of 0.
In a possible embodiment, the saliency images of the anchor sample image, the positive sample image and the negative sample image may be obtained, and the obtaining process may refer to the above process of obtaining the saliency image of the image to be processed. And further inputting the respective saliency images of the anchor sample image, the positive sample image and the negative sample image into an initial saliency feature extraction network for feature extraction processing to obtain the respective saliency features of the anchor sample image, the positive sample image and the negative sample image.
The embodiment of the present application only exemplarily illustrates the network structure of the second feature extraction module included in the initial salient feature extraction network and does not limit its specific structure; for example, the deep residual network mentioned above may also be ResNet-18, ResNet-50, or the like.
It should be noted that the outputs of hash quantization layer 1 and hash quantization layer 2 are both vectors tending to -1 or 1, so the fused feature is also a vector tending to -1 or 1; in this way the feature storage space is compressed while the efficiency of image duplicate retrieval is improved.
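As a hedged illustration of why features tending to -1 or 1 compress storage and speed up retrieval, the fused feature can be binarized and compared by Hamming distance on packed bits; this packing scheme is an assumption, not part of the patent text:

```python
import numpy as np

def to_code(feature: np.ndarray) -> np.ndarray:
    # -1/1-like values -> 0/1 bits -> packed bytes (e.g. 128 dims -> 16 bytes)
    bits = (feature > 0).astype(np.uint8)
    return np.packbits(bits)

def hamming(code_a: np.ndarray, code_b: np.ndarray) -> int:
    # number of differing bits between two packed codes
    return int(np.unpackbits(code_a ^ code_b).sum())
```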
And S904, determining a first loss parameter based on the respective significance characteristics of the anchor sample image, the positive sample image and the negative sample image.
In one embodiment, a first similarity between the salient features of the anchor sample image and the positive sample image may be determined; that is, the salient feature of the anchor sample image and the salient feature of the positive sample image are substituted for X_a and X_p in ||X_a − X_p||, and the first similarity is obtained by calculation. A second similarity between the salient features of the anchor sample image and the negative sample image may be determined; that is, the salient feature of the anchor sample image and the salient feature of the negative sample image are substituted for X_a and X_n in ||X_a − X_n||, and the second similarity is obtained by calculation. A triplet metric loss parameter is then determined based on the difference between the first similarity and the second similarity. Specifically, the first similarity and the second similarity are substituted for ||X_a − X_p|| and ||X_a − X_n|| in the expression of the triplet metric loss function shown in formula (1), and the triplet metric loss parameter is obtained by calculation; here "first similarity − second similarity" is the difference between the first similarity and the second similarity. The triplet metric loss parameter is used for the initial salient feature extraction network to perform feature metric learning on the salient features. Feature metric learning means that the learning task is to ensure that the salient features of the anchor sample image and the positive sample image are close enough (i.e., the similarity is high) and that the salient features of the anchor sample image and the negative sample image are far enough apart (i.e., the similarity is low).
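A minimal sketch of the triplet metric loss parameter computed from the salient features is given below; the hinge form max(||X_a − X_p|| − ||X_a − X_n|| + α, 0) is assumed here, since formula (1) is given earlier in the specification and is not reproduced in this passage:

```python
import torch

def triplet_metric_loss(x_a: torch.Tensor, x_p: torch.Tensor, x_n: torch.Tensor,
                        alpha: float) -> torch.Tensor:
    d_ap = torch.norm(x_a - x_p, dim=-1)   # "first similarity"  ||X_a - X_p||
    d_an = torch.norm(x_a - x_n, dim=-1)   # "second similarity" ||X_a - X_n||
    # hinge on the difference between the two similarities plus the threshold alpha
    return torch.clamp(d_ap - d_an + alpha, min=0).mean()
```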
Further, target encoding values corresponding to the respective salient features of the anchor sample image, the positive sample image and the negative sample image may be determined; that is, each feature value in the salient features output by the initial salient feature extraction network is mapped to -1 or 1, which may be implemented by formula (2) above.
A quantization loss parameter is then determined based on the respective target encoding values and the respective salient features of the anchor sample image, the positive sample image and the negative sample image. Specifically, the target encoding value corresponding to the salient feature of the anchor sample image and the corresponding feature values in that salient feature may be substituted for B_i and Q_i in the expression of the quantization loss function shown in formula (3), and a fourth quantization loss parameter is obtained by calculation. The target encoding value corresponding to the salient feature of the positive sample image and the corresponding feature values in that salient feature may be substituted for B_i and Q_i in formula (3), yielding a fifth quantization loss parameter. The target encoding value corresponding to the salient feature of the negative sample image and the corresponding feature values in that salient feature may be substituted for B_i and Q_i in formula (3), yielding a sixth quantization loss parameter. A second average quantization loss parameter is obtained as the mean of the fourth, fifth and sixth quantization loss parameters, and this second average quantization loss parameter is used as the quantization loss parameter. Finally, the weight of the triplet metric loss function, the weight of the quantization loss function, the triplet metric loss parameter and the quantization loss parameter are substituted for w_1, w_2, L_tri and L_coding in the expression of the target loss function shown in formula (4), and the first loss parameter is obtained by calculation. The quantization loss parameter is used for the initial salient feature extraction network to perform quantization learning, so that the salient features output by the resulting target salient feature extraction network are vectors tending to -1 or 1.
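The quantization loss parameter and the assembly of the first loss parameter can be sketched as follows; the squared-error form of formula (3) and the averaging over the N_bits dimensions are assumptions, since formula (3) is not reproduced in this passage:

```python
import torch

def quantization_loss(feature: torch.Tensor) -> torch.Tensor:
    # B_i: each feature value mapped to -1 or 1; Q_i: the feature value itself
    target_code = torch.where(feature > 0,
                              torch.ones_like(feature), -torch.ones_like(feature))
    return ((feature - target_code) ** 2).mean(dim=-1)   # averaged over N_bits

def first_loss(l_tri: torch.Tensor, q_anchor: torch.Tensor, q_pos: torch.Tensor,
               q_neg: torch.Tensor, w1: float = 1.0, w2: float = 0.01) -> torch.Tensor:
    # second average quantization loss parameter = mean of the 4th, 5th, 6th losses
    l_coding = (q_anchor + q_pos + q_neg) / 3.0
    return w1 * l_tri + w2 * l_coding.mean()
```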
And S905, determining a second loss parameter based on the global features and the significance features of the anchor sample image, the positive sample image and the negative sample image.
In an embodiment, a first fused feature may be determined based on the first global feature and the salient features of the anchor sample image, a second fused feature may be determined based on the second global feature and the salient features of the positive sample image, and a third fused feature may be determined based on the third global feature and the salient features of the negative sample image. Specifically, the first global feature and the salient feature of the anchor sample image are spliced to obtain a first fusion feature, the second global feature and the salient feature of the positive sample image are spliced to obtain a second fusion feature, and the third global feature and the salient feature of the negative sample image are spliced to obtain a third fusion feature.
In another embodiment, the salient features of the anchor sample image, the positive sample image and the negative sample image may be input into an initial feature compression network for feature compression processing to obtain the compressed salient features of the anchor sample image, the positive sample image and the negative sample image. The first fusion feature may then be determined based on the first global feature and the compressed salient feature of the anchor sample image, the second fusion feature based on the second global feature and the compressed salient feature of the positive sample image, and the third fusion feature based on the third global feature and the compressed salient feature of the negative sample image. Specifically, the first global feature is spliced with the compressed salient feature of the anchor sample image to obtain the first fusion feature, the second global feature is spliced with the compressed salient feature of the positive sample image to obtain the second fusion feature, and the third global feature is spliced with the compressed salient feature of the negative sample image to obtain the third fusion feature. The network structure of the initial feature compression network may be as shown in table 4 below:
TABLE 4
[Table 4 is reproduced as an image in the original publication; it lists the structure of the initial feature compression network, i.e. the feature compression layer applied to the salient features.]
In one embodiment, the initial feature compression network may be initialized with a Gaussian distribution with a variance of 0.01 and a mean of 0. Alternatively, the initial feature compression network may also be a multi-layer perceptron or the like.
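A minimal sketch of the feature compression and splicing ("fusion") step is given below; representing the initial feature compression network as a single 64-to-32-dimensional linear layer is an assumption consistent with the dimensions mentioned later in this passage:

```python
import torch
import torch.nn as nn

# Assumed form of the initial feature compression network: 64 -> 32 dimensions,
# initialized from a zero-mean Gaussian with variance 0.01 as described above.
compress = nn.Linear(64, 32)
nn.init.normal_(compress.weight, mean=0.0, std=0.01 ** 0.5)
nn.init.zeros_(compress.bias)

def fuse(global_feat: torch.Tensor, salient_feat: torch.Tensor,
         use_compression: bool = True) -> torch.Tensor:
    if use_compression:
        salient_feat = compress(salient_feat)          # compressed salient feature
    # splicing: 64 + 32 = 96 dims (with compression) or 64 + 64 = 128 dims (without)
    return torch.cat([global_feat, salient_feat], dim=-1)
```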
Further, a third similarity between the first fusion feature and the second fusion feature may be determined; that is, the first fusion feature and the second fusion feature are substituted for X_a and X_p in ||X_a − X_p||, and the third similarity is obtained by calculation. A fourth similarity between the first fusion feature and the third fusion feature may be determined; that is, the first fusion feature and the third fusion feature are substituted for X_a and X_n in ||X_a − X_n||, and the fourth similarity is obtained by calculation. The difference between the third similarity and the fourth similarity is then obtained as "third similarity − fourth similarity". Finally, the second loss parameter is determined based on this difference and a similarity difference threshold; that is, the third similarity, the fourth similarity and the similarity difference threshold are substituted for ||X_a − X_p||, ||X_a − X_n|| and α in the expression of the triplet metric loss function shown in formula (1), and the second loss parameter is obtained by calculation. The second loss parameter is used for the initial salient feature extraction network to perform feature metric learning on the fusion features. Feature metric learning means that the learning task is to ensure that the fusion features of the anchor sample image and the positive sample image are close enough (i.e., the similarity is high) and that the fusion features of the anchor sample image and the negative sample image are far enough apart (i.e., the similarity is low).
A conventional similarity difference threshold α is a fixed value; that is, the same similarity difference threshold is used for all triplet training samples, and when the triplet metric loss parameter in the first loss parameter is calculated in this way, the similarity difference threshold α is also a fixed value, for example 50. When the second loss parameter is calculated, however, the similarity difference threshold α corresponding to each triplet training sample is determined adaptively. Specifically, the similarity difference threshold is determined based on the similarity between the first global feature and the second global feature and the similarity between the first global feature and the third global feature: the first global feature and the second global feature are substituted for X_a and X_p in ||X_a − X_p|| to obtain the similarity between the first global feature and the second global feature, and the first global feature and the third global feature are substituted for X_a and X_n in ||X_a − X_n|| to obtain the similarity between the first global feature and the third global feature. A similarity difference value is obtained from the difference between these two similarities, and the similarity difference threshold α corresponding to the triplet training sample is obtained from the dimension expansion coefficient indicated by the splicing process and the similarity difference value. If the splicing process splices the features output by hash quantization layer 1 with the features output by hash quantization layer 2, the dimension expansion coefficient indicated by the splicing process = (feature dimension output by hash quantization layer 1 + feature dimension output by hash quantization layer 2) / feature dimension output by hash quantization layer 1, i.e. (64 + 64)/64 = 2. If the splicing process splices the features output by hash quantization layer 1 with the features output by the feature compression layer, the dimension expansion coefficient indicated by the splicing process = (feature dimension output by hash quantization layer 1 + feature dimension output by the feature compression layer) / feature dimension output by hash quantization layer 1, i.e. (64 + 32)/64 = 1.5. In this way the similarity difference threshold and the feature dimension are in a monotonically increasing relationship (for example, a proportional relationship): when the feature dimension becomes larger, the similarity difference threshold becomes correspondingly larger, and the discriminative information obtained from the saliency region by the whole network (including the target salient feature extraction network and the target global feature extraction network) is expanded accordingly (to 1.5 times in the latter case), which can strengthen the effect of feature metric learning.
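The adaptive similarity difference threshold can be sketched as follows; how the dimension expansion coefficient and the global-feature similarity difference are combined (here, a simple product of the coefficient and the absolute difference) is an assumption, since the translated text does not spell it out:

```python
import torch

def adaptive_alpha(g_a: torch.Tensor, g_p: torch.Tensor, g_n: torch.Tensor,
                   dim_hash1: int = 64, dim_branch2: int = 32) -> torch.Tensor:
    # dimension expansion coefficient, e.g. (64 + 32) / 64 = 1.5 or (64 + 64) / 64 = 2
    expansion = (dim_hash1 + dim_branch2) / dim_hash1
    sim_ap = torch.norm(g_a - g_p, dim=-1)   # similarity of first and second global features
    sim_an = torch.norm(g_a - g_n, dim=-1)   # similarity of first and third global features
    # per-triplet threshold; the sign/order of the difference is an assumption
    return expansion * torch.abs(sim_ap - sim_an)
```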
And S906, adjusting network parameters of the initial saliency feature extraction network based on the first loss parameter and the second loss parameter to obtain a target saliency feature extraction network.
In an embodiment, the sum of the first loss parameter and the second loss parameter may be calculated to obtain a fusion loss parameter, and the fusion loss parameter may be used to reversely adjust the network parameters of the initial salient feature extraction network (which may be implemented by means of a stochastic gradient algorithm). Alternatively, the network parameters of the initial salient feature extraction network may first be reversely adjusted with the first loss parameter, and then reversely adjusted again with the second loss parameter on the basis of that adjustment. After the trained salient feature extraction network is obtained, the target salient feature extraction network can be determined based on the trained salient feature extraction network. Optionally, when the fusion loss parameter is smaller than a loss threshold or the number of training iterations reaches a preset value, the trained salient feature extraction network may be determined as the target salient feature extraction network.
The triplet metric loss parameter in the first loss parameter enables the initial salient feature extraction network to learn a representation of the saliency region in an image by measuring whether the similarity between the salient features of a similar sample pair (the anchor sample image and the positive sample image) is high enough and whether the similarity between the salient features of a dissimilar sample pair (the anchor sample image and the negative sample image) is low enough, which improves the accuracy of the salient features extracted by the target salient feature extraction network; in addition, the quantization loss parameter in the first loss parameter makes the salient features extracted by the target salient feature extraction network vectors tending to -1 or 1 (i.e., the effect of quantization learning). The second loss parameter enables the initial salient feature extraction network to perform compatibility learning on the basis of saliency learning by measuring whether the similarity between the fusion features of similar sample pairs is high enough and whether the similarity between the fusion features of dissimilar sample pairs is low enough, thereby realizing learning of the upgraded features (i.e., the fusion features), so that the salient features extracted by the target salient feature extraction network can be effectively integrated with the global features, further improving the accuracy of the salient features output by the target salient feature extraction network. Meanwhile, through this effective integration the salient features of the image are introduced into the resulting fusion features, and even if the saliency region becomes small or the image is attacked, the region to be represented (i.e., the saliency region) can still be located accurately through the fusion features, so that the whole image can be represented more effectively.
In an embodiment, the network parameters of the initial feature compression network may also be adjusted based on one or more of the first loss parameter and the second loss parameter to obtain a trained feature compression network, and the target feature compression network may be determined based on the trained feature compression network. The network parameters of the initial feature compression network may be reversely adjusted using the first loss parameter, using the second loss parameter, or using the sum of the first loss parameter and the second loss parameter.
In a possible embodiment, the determining the fusion feature of the image to be processed based on the global feature and the salient feature of the image to be processed includes: and performing feature compression processing on the salient features of the image to be processed by using the target feature compression network to obtain the compressed salient features of the image to be processed, and performing fusion processing on the global features of the image to be processed and the compressed salient features to obtain the fusion features of the image to be processed. The salient features can be refined through the target feature compression network.
In a feasible embodiment, the Q × M triplet training samples may be divided into a plurality of training batch sets, each containing at least one triplet training sample. For each training batch set, the first loss parameter and the second loss parameter corresponding to each triplet training sample in the current batch set are obtained by means of the initial salient feature extraction network and the target global feature extraction network, and the network parameters of the initial salient feature extraction network are reversely adjusted according to the mean of the sums of the first and second loss parameters over all triplet training samples in the current batch set, yielding a trained salient feature extraction network. After one round of training is completed, the next training batch set is taken and the network parameters of the trained salient feature extraction network continue to be adjusted, until a training stop condition is met, for example when a specified number of training iterations is reached, or when the mean of the sums of the first and second loss parameters is smaller than a preset loss threshold. At this point, the trained salient feature extraction network may be used as the target salient feature extraction network. Optionally, a learning rate of 0.0005 may be used for the first 10 training batch sets, and thereafter the learning rate may be adjusted to 0.1 times its previous value every 10 training batch sets.
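The learning-rate schedule described above can be sketched as follows; the choice of an SGD optimizer and the dummy model are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 64)   # stands in for the initial salient feature extraction network
optimizer = torch.optim.SGD(model.parameters(), lr=0.0005)   # 0.0005 for the first 10 batch sets

def adjust_learning_rate(batch_set_index: int) -> None:
    # batch_set_index counts completed training batch sets, starting from 0;
    # after the first 10, multiply the learning rate by 0.1 every further 10 sets
    if batch_set_index >= 10 and batch_set_index % 10 == 0:
        for group in optimizer.param_groups:
            group["lr"] *= 0.1
```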
In summary, referring to fig. 11, fig. 11 is a schematic flowchart illustrating an image processing method according to an embodiment of the present application. The method includes: the network parameters of the initial global feature extraction network can be adjusted using the target loss parameter to obtain a trained global feature extraction network, and the target global feature extraction network is determined based on the trained global feature extraction network. The target global feature extraction network extracts the global features of the anchor sample image, the positive sample image and the negative sample image in the triplet training sample, the initial salient feature extraction network extracts their respective salient features, and the salient features are input into the initial feature compression network to obtain the compressed salient features. The global features of the anchor sample image, the positive sample image and the negative sample image can be spliced with the corresponding compressed salient features to obtain their respective fusion features. The triplet metric loss and the quantization loss can be determined from the respective salient features of the anchor sample image, the positive sample image and the negative sample image, so that the initial salient feature extraction network performs feature metric learning and quantization learning on the salient features; feature metric learning can further be performed on the fusion features through the respective fusion features of the anchor sample image, the positive sample image and the negative sample image, so that the initial salient feature extraction network performs compatibility learning on the basis of saliency learning, which can improve the accuracy of the salient features extracted by the target salient feature extraction network.
By adopting the embodiment of the application, the initial saliency characteristic extraction network can extract the saliency characteristic, the target global characteristic extraction network can extract the global characteristic, and the fusion characteristic can be obtained by splicing the saliency characteristic and the global characteristic, so that the initial saliency characteristic extraction network can perform compatibility training by using the fusion characteristic on the basis of saliency learning, the obtained target saliency characteristic extraction network can extract the saliency characteristic with higher accuracy, and the accuracy of the fusion characteristic can be improved.
It is understood that in the specific implementation of the present application, related data such as images to be processed are involved, when the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of the related data need to comply with relevant laws and regulations and standards of relevant countries and regions.
The method of the embodiments of the present application is described in detail above, and in order to better implement the method of the embodiments of the present application, the following provides a device of the embodiments of the present application. Referring to fig. 12, fig. 12 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present disclosure, where the image processing apparatus 120 may include:
an obtaining unit 1201, configured to obtain an image to be processed;
a processing unit 1202, configured to perform feature extraction processing on the image to be processed by using a target global feature extraction network, so as to obtain a global feature of the image to be processed;
the obtaining unit 1201 is further configured to obtain a saliency image of the image to be processed;
the processing unit 1202 is further configured to perform feature extraction processing on the saliency image of the to-be-processed image by using a target saliency feature extraction network to obtain saliency features of the to-be-processed image;
the processing unit 1202 is further configured to determine a fusion feature of the image to be processed based on the global feature and the salient feature of the image to be processed;
the target significant feature extraction network is obtained by training an initial significant feature extraction network by combining the target global feature extraction network and a triple training sample, wherein the triple training sample comprises an anchor sample image, and a positive sample image and a negative sample image corresponding to the anchor sample image; when the initial salient feature extraction network is trained, the target global feature extraction network is used for extracting global features of the anchor sample image, the positive sample image and the negative sample image respectively, and the initial salient feature extraction network is used for extracting salient features of the anchor sample image, the positive sample image and the negative sample image respectively; the target salient feature extraction network is obtained by adjusting network parameters of the initial salient feature extraction network based on a first loss parameter and a second loss parameter, wherein the first loss parameter is determined based on respective salient features of the anchor sample image, the positive sample image and the negative sample image, and the second loss parameter is determined based on respective global features and salient features of the anchor sample image, the positive sample image and the negative sample image.
In an embodiment, the obtaining unit 1201 is specifically configured to: acquiring the triple training samples;
the processing unit 1202 is specifically configured to: inputting the anchor sample image, the positive sample image and the negative sample image included in the triple training sample into the target global feature extraction network for feature extraction processing to obtain a first global feature of the anchor sample image, a second global feature of the positive sample image and a third global feature of the negative sample image;
the processing unit 1202 is specifically configured to: obtaining a saliency image of each of the anchor sample image, the positive sample image, and the negative sample image;
the processing unit 1202 is specifically configured to: inputting the respective saliency images of the anchor sample image, the positive sample image and the negative sample image into the initial saliency feature extraction network for feature extraction processing to obtain respective saliency features of the anchor sample image, the positive sample image and the negative sample image; determining a first fused feature based on the first global feature and the salient feature of the anchor sample image, determining a second fused feature based on the second global feature and the salient feature of the positive sample image, and determining a third fused feature based on the third global feature and the salient feature of the negative sample image; determining the first loss parameter based on the respective saliency features of the anchor sample image, the positive sample image, and the negative sample image, determining the second loss parameter based on the first fusion feature, the second fusion feature, and the third fusion feature; adjusting the network parameters of the initial significant feature extraction network based on the first loss parameters and the second loss parameters to obtain a trained significant feature extraction network; wherein the target salient feature extraction network is determined based on the trained salient feature extraction network.
In an embodiment, the processing unit 1202 is specifically configured to: inputting the respective salient features of the anchor sample image, the positive sample image and the negative sample image into an initial feature compression network for feature compression processing to obtain the respective compressed salient features of the anchor sample image, the positive sample image and the negative sample image; determining a first fused feature based on the first global feature and the compressed processed saliency feature of the anchor sample image, determining a second fused feature based on the second global feature and the compressed processed saliency feature of the positive sample image, and determining a third fused feature based on the third global feature and the compressed processed saliency feature of the negative sample image.
In an embodiment, the processing unit 1202 is specifically configured to: performing feature compression processing on the saliency features of the image to be processed by using a target feature compression network to obtain the compressed saliency features of the image to be processed; the target feature compression network is obtained by adjusting network parameters of the initial feature compression network based on one or more of the first loss parameter and the second loss parameter; and performing fusion processing on the global features of the image to be processed and the compressed significance features to obtain fusion features of the image to be processed.
In an embodiment, the processing unit 1202 is specifically configured to: determining a first similarity between salient features of the anchor sample image and the positive sample image, determining a second similarity between salient features of the anchor sample image and the negative sample image; determining a triplet metric loss parameter based on a difference between the first similarity and the second similarity; determining target coding values corresponding to respective saliency features of the anchor sample image, the positive sample image and the negative sample image, and determining a quantization loss parameter based on the respective target coding values and the respective saliency features of the anchor sample image, the positive sample image and the negative sample image; determining the first loss parameter based on the triplet metric loss parameter and the quantization loss parameter.
In an embodiment, the processing unit 1202 is specifically configured to: determining a third similarity between the first and second fused features, determining a fourth similarity between the first and third fused features; determining a similarity difference threshold based on a similarity between the first global feature and the second global feature and a similarity between the first global feature and the third global feature; determining the second loss parameter based on a difference between the third similarity and the fourth similarity and the similarity difference threshold.
In an embodiment, the processing unit 1202 is specifically configured to: normalizing the pixel value of each pixel point in the image to be processed; respectively adjusting the pixel values of first-class pixel points and second-class pixel points in the normalized image to be processed into a first pixel value and a second pixel value to obtain a significant region mask image of the image to be processed; the first type of pixel points refer to pixel points of which the pixel values are smaller than a set pixel value after normalization processing, and the second type of pixel points refer to pixel points of which the pixel values are larger than or equal to the set pixel value after normalization processing; and multiplying the saliency region mask image and the image to be processed to obtain a saliency image of the image to be processed.
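As an illustrative sketch of the saliency image construction performed by the processing unit 1202, the following Python/NumPy function normalizes the pixel values, binarizes them against the set pixel value to form the saliency region mask image, and multiplies the mask with the image to be processed; the threshold of 0.5 and the mask values 0 and 1 are assumptions:

```python
import numpy as np

def saliency_image(image: np.ndarray, set_pixel_value: float = 0.5,
                   first_value: float = 0.0, second_value: float = 1.0) -> np.ndarray:
    # normalize the pixel values of the image to be processed to [0, 1]
    norm = (image - image.min()) / (image.max() - image.min() + 1e-8)
    # first-class pixels (below the set value) -> first value; others -> second value
    mask = np.where(norm < set_pixel_value, first_value, second_value)
    # multiply the saliency region mask image with the image to be processed
    return image * mask
```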
In an embodiment, the processing unit 1202 is specifically configured to: querying a feature database based on the fusion features of the image to be processed, wherein the feature database comprises feature data of each image in an image database; if the matched feature data matched with the fusion feature of the image to be processed exists in the feature database, determining the image corresponding to the matched feature data in the image database as a similar image of the image to be processed.
It can be understood that the functions of the functional units of the image processing apparatus described in the embodiment of the present application may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description of the foregoing method embodiment, which is not described herein again.
By adopting the embodiment of the application, the computer equipment can utilize the target global feature extraction network to perform feature extraction processing on the image to be processed to obtain the global feature of the image to be processed, utilize the target saliency feature extraction network to perform feature extraction processing on the saliency image of the image to be processed to obtain the saliency feature of the image to be processed, and then effectively integrate the global feature and the saliency feature of the image to be processed to obtain the fusion feature of the image to be processed, so that the fusion feature can represent the global image of the image to be processed and can also represent the saliency region in the image to be processed, thereby effectively improving the accuracy of the extracted image feature.
As shown in fig. 13, fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application, and an internal structure of the computer device 130 is shown in fig. 13, and includes: one or more processors 1301, memory 1302, communication interface 1303. The processor 1301, the memory 1302, and the communication interface 1303 may be connected through a bus 1304 or in other manners, and in the embodiment of the present application, the processor, the memory 1302, and the communication interface 1303 are connected through the bus 1304 as an example.
The processor 1301 (or Central Processing Unit, CPU) is the computing core and control core of the computer device 130 and can parse various instructions in the computer device 130 and process various data of the computer device 130. For example, the CPU may parse a power on/off instruction sent by the user to the computer device 130 and control the computer device 130 to perform power on/off operations; as another example, the CPU may transfer various types of interactive data between the internal structures of the computer device 130, and so on. The communication interface 1303 may optionally include a standard wired interface or a wireless interface (e.g., Wi-Fi, a mobile communication interface, etc.), and may be controlled by the processor 1301 to transmit and receive data. The memory 1302 is a memory device in the computer device 130 for storing computer programs and data. It is understood that the memory 1302 may include the built-in memory of the computer device 130 and, of course, the expansion memory supported by the computer device 130. The memory 1302 provides storage space that stores the operating system of the computer device 130, which may include, but is not limited to: a Windows system, a Linux system, an Android system, an iOS system, etc., which are not limited in this application. The processor 1301 performs the following operations by executing the computer program stored in the memory 1302:
acquiring an image to be processed, and performing feature extraction processing on the image to be processed by using a target global feature extraction network to obtain global features of the image to be processed;
acquiring a saliency image of the image to be processed, and performing feature extraction processing on the saliency image of the image to be processed by using a target saliency feature extraction network to obtain saliency features of the image to be processed;
determining fusion characteristics of the image to be processed based on the global characteristics and the salient characteristics of the image to be processed;
the target significant feature extraction network is obtained by training an initial significant feature extraction network by combining the target global feature extraction network and a triple training sample, wherein the triple training sample comprises an anchor sample image, and a positive sample image and a negative sample image corresponding to the anchor sample image; when the initial salient feature extraction network is trained, the target global feature extraction network is used for extracting global features of the anchor sample image, the positive sample image and the negative sample image respectively, and the initial salient feature extraction network is used for extracting salient features of the anchor sample image, the positive sample image and the negative sample image respectively; the target salient feature extraction network is obtained by adjusting network parameters of the initial salient feature extraction network based on a first loss parameter and a second loss parameter, wherein the first loss parameter is determined based on respective salient features of the anchor sample image, the positive sample image and the negative sample image, and the second loss parameter is determined based on respective global features and salient features of the anchor sample image, the positive sample image and the negative sample image.
In an embodiment, the processor 1301 is specifically configured to: acquiring the triple training samples, inputting the anchor sample images, the positive sample images and the negative sample images included in the triple training samples into the target global feature extraction network for feature extraction processing, and obtaining first global features of the anchor sample images, second global features of the positive sample images and third global features of the negative sample images; obtaining respective saliency images of the anchor sample image, the positive sample image and the negative sample image, inputting the respective saliency images of the anchor sample image, the positive sample image and the negative sample image into the initial saliency feature extraction network for feature extraction processing, and obtaining respective saliency features of the anchor sample image, the positive sample image and the negative sample image; determining a first fused feature based on the first global feature and the salient feature of the anchor sample image, determining a second fused feature based on the second global feature and the salient feature of the positive sample image, and determining a third fused feature based on the third global feature and the salient feature of the negative sample image; determining the first loss parameter based on the respective saliency features of the anchor sample image, the positive sample image, and the negative sample image, determining the second loss parameter based on the first fusion feature, the second fusion feature, and the third fusion feature; adjusting the network parameters of the initial significant feature extraction network based on the first loss parameters and the second loss parameters to obtain a trained significant feature extraction network; wherein the target salient feature extraction network is determined based on the trained salient feature extraction network.
In an embodiment, the processor 1301 is specifically configured to: inputting the respective significance characteristics of the anchor sample image, the positive sample image and the negative sample image into an initial characteristic compression network for characteristic compression processing to obtain the respective significance characteristics of the anchor sample image, the positive sample image and the negative sample image after the compression processing; determining a first fused feature based on the first global feature and the compressed processed saliency feature of the anchor sample image, determining a second fused feature based on the second global feature and the compressed processed saliency feature of the positive sample image, and determining a third fused feature based on the third global feature and the compressed processed saliency feature of the negative sample image.
In an embodiment, the processor 1301 is specifically configured to: performing feature compression processing on the saliency features of the image to be processed by using a target feature compression network to obtain the compressed saliency features of the image to be processed; the target feature compression network is obtained by adjusting network parameters of the initial feature compression network based on one or more of the first loss parameter and the second loss parameter; and performing fusion processing on the global features of the image to be processed and the compressed significance features to obtain fusion features of the image to be processed.
In an embodiment, the processor 1301 is specifically configured to: determining a first similarity between salient features of the anchor sample image and the positive sample image, determining a second similarity between salient features of the anchor sample image and the negative sample image; determining a triplet metric loss parameter based on a difference between the first similarity and the second similarity; determining target coding values corresponding to respective saliency features of the anchor sample image, the positive sample image and the negative sample image, and determining a quantization loss parameter based on the respective target coding values and the respective saliency features of the anchor sample image, the positive sample image and the negative sample image; determining the first loss parameter based on the triplet measure loss parameter and the quantization loss parameter.
In an embodiment, the processor 1301 is specifically configured to: determining a third similarity between the first fused feature and the second fused feature, determining a fourth similarity between the first fused feature and the third fused feature; determining a similarity difference threshold based on a similarity between the first global feature and the second global feature and a similarity between the first global feature and the third global feature; determining the second loss parameter based on a difference between the third similarity and the fourth similarity and the similarity difference threshold.
In an embodiment, the processor 1301 is specifically configured to: normalizing the pixel value of each pixel point in the image to be processed; respectively adjusting the pixel values of first-class pixel points and second-class pixel points in the normalized image to be processed into a first pixel value and a second pixel value to obtain a significant region mask image of the image to be processed; the first type of pixel points refer to pixel points of which the pixel values are smaller than a set pixel value after normalization processing, and the second type of pixel points refer to pixel points of which the pixel values are larger than or equal to the set pixel value after normalization processing; and multiplying the saliency region mask image and the image to be processed to obtain a saliency image of the image to be processed.
In an embodiment, the processor 1301 is specifically configured to: querying a feature database based on the fusion features of the image to be processed, wherein the feature database comprises feature data of each image in an image database; if the matched feature data matched with the fusion feature of the image to be processed exists in the feature database, determining the image corresponding to the matched feature data in the image database as a similar image of the image to be processed.
In a specific implementation, the processor 1301, the memory 1302, and the communication interface 1303 described in this embodiment of the present application may execute an implementation manner described in an image processing method provided in this embodiment of the present application, and may also execute an implementation manner described in an image processing apparatus provided in this embodiment of the present application, which is not described herein again.
By adopting the embodiment of the application, the computer equipment can utilize the target global feature extraction network to perform feature extraction processing on the image to be processed to obtain the global feature of the image to be processed, utilize the target saliency feature extraction network to perform feature extraction processing on the saliency image of the image to be processed to obtain the saliency feature of the image to be processed, and then effectively integrate the global feature and the saliency feature of the image to be processed to obtain the fusion feature of the image to be processed, so that the fusion feature can not only represent the global image of the image to be processed, but also represent the saliency region in the image to be processed, and the accuracy of the extracted image feature can be effectively improved.
An embodiment of the present application further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer device, the computer device is caused to execute the image processing method according to any one of the foregoing possible implementation manners. For specific implementation, reference may be made to the foregoing description, which is not repeated herein.
The embodiments of the present application further provide a computer program product, where the computer program product includes a computer program or computer instructions, and when the computer program or the computer instructions are executed by a processor, the steps of the image processing method provided by the embodiments of the present application are implemented. For specific implementation, reference may be made to the foregoing description, which is not repeated herein.
The embodiment of the present application further provides a computer program, where the computer program includes computer instructions, the computer instructions are stored in a computer-readable storage medium, a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the image processing method provided in the embodiment of the present application. For a specific implementation, reference may be made to the foregoing description, which is not repeated herein.
It should be noted that, for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the order of acts described, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above disclosure is only a few examples of the present application, and certainly should not be taken as limiting the scope of the present application, which is therefore intended to cover all modifications that are within the scope of the present application and which are equivalent to the claims.

Claims (11)

1. An image processing method, characterized in that the method comprises:
acquiring an image to be processed, and performing feature extraction processing on the image to be processed by using a target global feature extraction network to obtain global features of the image to be processed;
acquiring a saliency image of the image to be processed, and performing feature extraction processing on the saliency image of the image to be processed by using a target saliency feature extraction network to obtain saliency features of the image to be processed;
determining fusion characteristics of the image to be processed based on the global characteristics and the salient characteristics of the image to be processed;
the target significant feature extraction network is obtained by training an initial significant feature extraction network by combining the target global feature extraction network and a triple training sample, wherein the triple training sample comprises an anchor sample image, and a positive sample image and a negative sample image corresponding to the anchor sample image; when the initial salient feature extraction network is trained, the target global feature extraction network is used for extracting global features of the anchor sample image, the positive sample image and the negative sample image respectively, and the initial salient feature extraction network is used for extracting salient features of the anchor sample image, the positive sample image and the negative sample image respectively; the target salient feature extraction network is obtained by adjusting network parameters of the initial salient feature extraction network based on a first loss parameter and a second loss parameter, wherein the first loss parameter is determined based on respective salient features of the anchor sample image, the positive sample image and the negative sample image, and the second loss parameter is determined based on respective global features and salient features of the anchor sample image, the positive sample image and the negative sample image.
2. The method of claim 1, further comprising:
acquiring the triple training samples, inputting the anchor sample images, the positive sample images and the negative sample images included in the triple training samples into the target global feature extraction network for feature extraction processing, and obtaining first global features of the anchor sample images, second global features of the positive sample images and third global features of the negative sample images;
obtaining respective saliency images of the anchor sample image, the positive sample image and the negative sample image, inputting the respective saliency images of the anchor sample image, the positive sample image and the negative sample image into the initial saliency feature extraction network for feature extraction processing, and obtaining respective saliency features of the anchor sample image, the positive sample image and the negative sample image;
determining a first fused feature based on the first global feature and the salient feature of the anchor sample image, determining a second fused feature based on the second global feature and the salient feature of the positive sample image, and determining a third fused feature based on the third global feature and the salient feature of the negative sample image;
determining the first loss parameter based on the respective saliency features of the anchor sample image, the positive sample image, and the negative sample image, determining the second loss parameter based on the first fusion feature, the second fusion feature, and the third fusion feature;
adjusting the network parameters of the initial significant feature extraction network based on the first loss parameters and the second loss parameters to obtain a trained significant feature extraction network; wherein the target salient feature extraction network is determined based on the trained salient feature extraction network.
3. The method of claim 2, wherein determining a first fused feature based on the first global feature and a salient feature of the anchor sample image, determining a second fused feature based on the second global feature and a salient feature of the positive sample image, and determining a third fused feature based on the third global feature and a salient feature of the negative sample image comprises:
inputting the respective salient features of the anchor sample image, the positive sample image and the negative sample image into an initial feature compression network for feature compression processing to obtain the respective compressed salient features of the anchor sample image, the positive sample image and the negative sample image;
determining a first fused feature based on the first global feature and the compressed processed saliency feature of the anchor sample image, determining a second fused feature based on the second global feature and the compressed processed saliency feature of the positive sample image, and determining a third fused feature based on the third global feature and the compressed processed saliency feature of the negative sample image.
4. The method according to claim 3, wherein the determining the fusion feature of the image to be processed based on the global feature and the salient feature of the image to be processed comprises:
performing feature compression processing on the saliency features of the image to be processed by using a target feature compression network to obtain the compressed saliency features of the image to be processed; the target feature compression network is obtained by adjusting network parameters of the initial feature compression network based on one or more of the first loss parameters and the second loss parameters;
and performing fusion processing on the global features of the image to be processed and the compressed significance features to obtain fusion features of the image to be processed.
5. The method according to any one of claims 2-4, wherein determining the first loss parameter based on the respective saliency features of the anchor sample image, the positive sample image, and the negative sample image comprises:
determining a first similarity between salient features of the anchor sample image and the positive sample image, determining a second similarity between salient features of the anchor sample image and the negative sample image;
determining a triplet metric loss parameter based on a difference between the first similarity and the second similarity;
determining target coding values corresponding to respective saliency features of the anchor sample image, the positive sample image and the negative sample image, and determining a quantization loss parameter based on the respective target coding values and the respective saliency features of the anchor sample image, the positive sample image and the negative sample image;
determining the first loss parameter based on the triplet metric loss parameter and the quantization loss parameter.
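One possible concrete form of the first loss parameter in claim 5, written as a sketch; cosine similarity, the fixed margin, the loss weight and the sign-based target coding values are assumptions that the claim leaves open.

    import torch
    import torch.nn.functional as F

    def first_loss(s_a, s_p, s_n, margin=0.2, quant_weight=0.1):
        sim_ap = F.cosine_similarity(s_a, s_p)   # first similarity (anchor vs. positive)
        sim_an = F.cosine_similarity(s_a, s_n)   # second similarity (anchor vs. negative)
        # Triplet metric loss parameter: penalise a small gap between the two similarities.
        triplet = F.relu(sim_an - sim_ap + margin).mean()
        # Quantization loss parameter: distance between each saliency feature and its
        # target coding value (here assumed to be the sign of the feature).
        feats = torch.cat([s_a, s_p, s_n], dim=0)
        codes = torch.sign(feats).detach()
        quantization = F.mse_loss(feats, codes)
        return triplet + quant_weight * quantization   # first loss parameter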
6. The method according to any of claims 2-4, wherein said determining the second loss parameter based on the first fused feature, the second fused feature, and the third fused feature comprises:
determining a third similarity between the first fused feature and the second fused feature, determining a fourth similarity between the first fused feature and the third fused feature;
determining a similarity difference threshold based on a similarity between the first global feature and the second global feature and a similarity between the first global feature and the third global feature;
determining the second loss parameter based on a difference between the third similarity and the fourth similarity and the similarity difference threshold.
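A sketch of the second loss parameter in claim 6, assuming cosine similarity and taking the similarity difference threshold as the gap between the two global-feature similarities; this is one plausible reading, since the claim does not fix the exact formula.

    import torch
    import torch.nn.functional as F

    def second_loss(f_a, f_p, f_n, g_a, g_p, g_n):
        sim_ap = F.cosine_similarity(f_a, f_p)   # third similarity (fused, anchor vs. positive)
        sim_an = F.cosine_similarity(f_a, f_n)   # fourth similarity (fused, anchor vs. negative)
        with torch.no_grad():
            # Similarity difference threshold derived from the global features.
            threshold = F.cosine_similarity(g_a, g_p) - F.cosine_similarity(g_a, g_n)
        # Penalise triplets whose fused-feature similarity gap falls below the threshold.
        return F.relu(threshold - (sim_ap - sim_an)).mean()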
7. The method according to any one of claims 1-4, wherein acquiring the saliency image of the image to be processed comprises:
normalizing the pixel value of each pixel point in the image to be processed;
respectively adjusting the pixel values of first-type pixel points and second-type pixel points in the normalized image to be processed to a first pixel value and a second pixel value, to obtain a salient region mask image of the image to be processed; the first type of pixel points are pixel points whose pixel values after normalization processing are smaller than a set pixel value, and the second type of pixel points are pixel points whose pixel values after normalization processing are greater than or equal to the set pixel value;
and multiplying the salient region mask image by the image to be processed to obtain the saliency image of the image to be processed.
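Following the claim wording literally, the saliency image can be built as below; this is a sketch assuming NumPy, a single-channel image with values in [0, 255], 0.5 as the set pixel value, and 0 and 1 as the first and second pixel values, none of which the claim fixes.

    import numpy as np

    def build_saliency_image(image, set_value=0.5, first_value=0.0, second_value=1.0):
        normalized = image.astype(np.float32) / 255.0   # normalization of the pixel values
        # Salient region mask image: first-type pixels -> first value, second-type -> second value.
        mask = np.where(normalized < set_value, first_value, second_value)
        return image * mask                             # element-wise multiplication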
8. The method of claim 1, further comprising:
querying a feature database based on the fusion features of the image to be processed, wherein the feature database comprises feature data of each image in an image database;
and if feature data matching the fusion features of the image to be processed exists in the feature database, determining the image corresponding to the matched feature data in the image database as a similar image of the image to be processed.
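The retrieval step in claim 8 can be sketched as follows, under the assumptions that the feature database is an in-memory matrix of fusion features and that "matched" means cosine similarity above a hypothetical threshold; the claim specifies neither.

    import numpy as np

    def find_similar_images(query_feature, feature_db, image_ids, threshold=0.85):
        q = query_feature / (np.linalg.norm(query_feature) + 1e-8)
        db = feature_db / (np.linalg.norm(feature_db, axis=1, keepdims=True) + 1e-8)
        sims = db @ q                        # cosine similarity to every stored feature
        matched = np.where(sims >= threshold)[0]
        # Images whose feature data match the fusion features of the image to be processed.
        return [image_ids[i] for i in matched]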
9. An image processing apparatus, characterized in that the apparatus comprises:
the acquisition unit is used for acquiring an image to be processed;
the processing unit is used for performing feature extraction processing on the image to be processed by utilizing a target global feature extraction network to obtain global features of the image to be processed;
the acquisition unit is further used for acquiring a saliency image of the image to be processed;
the processing unit is further configured to perform feature extraction processing on the saliency image of the image to be processed by using a target salient feature extraction network to obtain saliency features of the image to be processed;
the processing unit is further used for determining fusion features of the image to be processed based on the global features and the salient features of the image to be processed;
the target salient feature extraction network is obtained by training an initial salient feature extraction network in combination with the target global feature extraction network and a triplet training sample, wherein the triplet training sample comprises an anchor sample image, and a positive sample image and a negative sample image corresponding to the anchor sample image; when the initial salient feature extraction network is trained, the target global feature extraction network is used for extracting global features of the anchor sample image, the positive sample image and the negative sample image respectively, and the initial salient feature extraction network is used for extracting salient features of the anchor sample image, the positive sample image and the negative sample image respectively; the target salient feature extraction network is obtained by adjusting network parameters of the initial salient feature extraction network based on a first loss parameter and a second loss parameter, wherein the first loss parameter is determined based on respective salient features of the anchor sample image, the positive sample image and the negative sample image, and the second loss parameter is determined based on respective global features and salient features of the anchor sample image, the positive sample image and the negative sample image.
10. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, implements the image processing method according to any one of claims 1 to 8.
11. A computer device, comprising a memory, a communication interface and a processor that are interconnected with one another, wherein the memory stores a computer program and the processor invokes the computer program to implement the image processing method according to any one of claims 1 to 8.
CN202210798669.4A 2022-07-08 2022-07-08 Image processing method, device, storage medium and equipment Active CN114863138B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210798669.4A CN114863138B (en) 2022-07-08 2022-07-08 Image processing method, device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN114863138A (en) 2022-08-05
CN114863138B (en) 2022-09-06

Family

ID=82626845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210798669.4A Active CN114863138B (en) 2022-07-08 2022-07-08 Image processing method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN114863138B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9229956B2 (en) * 2011-01-10 2016-01-05 Microsoft Technology Licensing, Llc Image retrieval using discriminative visual features
CN106296638A (en) * 2015-06-04 2017-01-04 欧姆龙株式会社 Significance information acquisition device and significance information acquisition method
US11308345B2 (en) * 2019-05-31 2022-04-19 Apple Inc. Saliency of an object for image processing operations
US11195046B2 (en) * 2019-06-14 2021-12-07 Huawei Technologies Co., Ltd. Method and system for image search and cropping

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104966286A (en) * 2015-06-04 2015-10-07 电子科技大学 3D video saliency detection method
CN106980641A (en) * 2017-02-09 2017-07-25 上海交通大学 The quick picture retrieval system of unsupervised Hash and method based on convolutional neural networks
CN109801256A (en) * 2018-12-15 2019-05-24 华南理工大学 A kind of image aesthetic quality appraisal procedure based on area-of-interest and global characteristics
CN109934177A (en) * 2019-03-15 2019-06-25 艾特城信息科技有限公司 Pedestrian recognition methods, system and computer readable storage medium again
CN110163110A (en) * 2019-04-23 2019-08-23 中电科大数据研究院有限公司 A kind of pedestrian's recognition methods again merged based on transfer learning and depth characteristic
CN110363204A (en) * 2019-06-24 2019-10-22 杭州电子科技大学 A kind of object expression method based on multitask feature learning
CN110427832A (en) * 2019-07-09 2019-11-08 华南理工大学 A kind of small data set finger vein identification method neural network based
CN110688976A (en) * 2019-10-09 2020-01-14 创新奇智(北京)科技有限公司 Store comparison method based on image identification
CN110728263A (en) * 2019-10-24 2020-01-24 中国石油大学(华东) Pedestrian re-identification method based on strong discrimination feature learning of distance selection
CN111209808A (en) * 2019-12-25 2020-05-29 北京航空航天大学杭州创新研究院 Unmanned aerial vehicle image semantic segmentation and identification method based on hierarchical processing
CN111444826A (en) * 2020-03-25 2020-07-24 腾讯科技(深圳)有限公司 Video detection method and device, storage medium and computer equipment
CN111723220A (en) * 2020-06-18 2020-09-29 中南大学 Image retrieval method and device based on attention mechanism and Hash and storage medium
CN113723159A (en) * 2021-02-26 2021-11-30 腾讯科技(深圳)有限公司 Scene recognition model training method, scene recognition method and model training device
CN113574566A (en) * 2021-05-14 2021-10-29 北京大学深圳研究生院 Method, device, equipment, medium and product for optimizing target detection network construction
CN113177539A (en) * 2021-06-30 2021-07-27 之江实验室 Method for feature extraction and pedestrian re-identification of blocked pedestrians
CN113343020A (en) * 2021-08-06 2021-09-03 腾讯科技(深圳)有限公司 Image processing method and device based on artificial intelligence and electronic equipment
CN113889228A (en) * 2021-09-22 2022-01-04 武汉理工大学 Semantic enhanced Hash medical image retrieval method based on mixed attention
CN114547365A (en) * 2022-02-22 2022-05-27 青岛海信网络科技股份有限公司 Image retrieval method and device
CN114638304A (en) * 2022-03-18 2022-06-17 北京奇艺世纪科技有限公司 Training method of image recognition model, image recognition method and device

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Person Re-identification Based on Feature Fusion and Triplet Loss Function; Jun Xiang et al.; 2018 24th International Conference on Pattern Recognition (ICPR); 2018-11-29; pp. 3477-3482 *
Simple Triplet Loss Based on Intra/Inter-class Metric Learning; Zuheng Ming et al.; 2017 IEEE International Conference on Computer Vision Workshops (ICCVW); 2018-01-22; pp. 1656-1664 *
Research on multi-scale feature person re-identification based on pose embedding; Lei Hao; China Masters' Theses Full-text Database, Information Science and Technology; 2022-01-15 (No. 1); p. I138-1616 *
Person re-identification method based on collaborative fusion of salient multi-scale features; Dong Yachao et al.; Computer Engineering; 2021-06-30; Vol. 47, No. 6; pp. 234-244, 252 *
Deep multi-scale attention hashing network for large-scale image retrieval; Feng Hao et al.; Journal of South China University of Technology (Natural Science Edition); 2022-04-30; Vol. 50, No. 4; pp. 35-45 *

Also Published As

Publication number Publication date
CN114863138A (en) 2022-08-05

Similar Documents

Publication Publication Date Title
CN111950653B (en) Video processing method and device, storage medium and electronic equipment
CN111814620B (en) Face image quality evaluation model establishment method, optimization method, medium and device
CN109360028B (en) Method and device for pushing information
US20190087683A1 (en) Method and apparatus for outputting information
CN110929806B (en) Picture processing method and device based on artificial intelligence and electronic equipment
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN113537254A (en) Image feature extraction method and device, electronic equipment and readable storage medium
CN112668608A (en) Image identification method and device, electronic equipment and storage medium
CN114880513A (en) Target retrieval method and related device
CN113033507B (en) Scene recognition method and device, computer equipment and storage medium
CN110162769B (en) Text theme output method and device, storage medium and electronic device
CN115131634A (en) Image recognition method, device, equipment, storage medium and computer program product
CN112069412B (en) Information recommendation method, device, computer equipment and storage medium
CN114329004A (en) Digital fingerprint generation method, digital fingerprint generation device, data push method, data push device and storage medium
CN114863138B (en) Image processing method, device, storage medium and equipment
CN112183303A (en) Transformer equipment image classification method and device, computer equipment and medium
CN116206239A (en) Video feature extraction network training method and device, electronic equipment and storage medium
CN112667864B (en) Graph alignment method and device, electronic equipment and storage medium
Jayageetha et al. Medical image quality assessment using CSO based deep neural network
CN115272768A (en) Content identification method, device, equipment, storage medium and computer program product
JP2024511103A (en) Method and apparatus for evaluating the quality of an image or video based on approximate values, method and apparatus for training a first model, electronic equipment, storage medium, and computer program
CN116721315B (en) Living body detection model training method, living body detection model training device, medium and electronic equipment
CN117011521A (en) Training method and related device for image segmentation model
CN116861015A (en) Image processing method, apparatus, device, storage medium, and program product
CN116977654A (en) Image processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40072617

Country of ref document: HK