CN113486914B - Method, device and storage medium for training neural network for image feature extraction - Google Patents

Method, device and storage medium for training neural network for image feature extraction

Info

Publication number
CN113486914B
CN113486914B (application CN202011127280.4A)
Authority
CN
China
Prior art keywords
image
embedding vector
training
vector
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011127280.4A
Other languages
Chinese (zh)
Other versions
CN113486914A (en)
Inventor
朱安杰
胡风
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011127280.4A priority Critical patent/CN113486914B/en
Publication of CN113486914A publication Critical patent/CN113486914A/en
Application granted granted Critical
Publication of CN113486914B publication Critical patent/CN113486914B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

A method for training a neural network, comprising: acquiring a training image set, wherein the training image set comprises a first training image set and a second training image set; calculating a first image embedding vector for each training image in the first training image set and a second image embedding vector for each training image in the second training image set using the neural network; establishing a plurality of triplets for the training image set based on a distance relationship between the first image embedding vector and the second image embedding vector, each of the plurality of triplets including an anchor point image, a positive example image, and a negative example image; and training the neural network with a predetermined ternary loss function based on each of the plurality of triplets to determine a plurality of parameters of the neural network. The method effectively selects triplets for training the neural network to reduce training time and improve accuracy of the embeddings generated by the trained neural network.

Description

Method, device and storage medium for training neural network for image feature extraction
Technical Field
The invention relates to the technical field of image feature extraction, and in particular to a method, an apparatus, and a storage medium for training a convolutional neural network for image feature extraction.
Background
In recent years, image classification and feature extraction methods based on deep neural networks have achieved substantial results. Datasets such as ImageNet and Open Images Dataset have greatly advanced image classification based on deep neural networks. For real-world images, common animals (people, cats, dogs, elephants, etc.), articles (airplanes, cups, tables, chairs, etc.), and scenes (stages, indoor, outdoor, etc.) can be correctly classified, and image feature extraction is handled as a task downstream of the classification task: the neural network is trained on the classification task, and the output of a specific layer of the trained network is taken as the vectorized feature representation of the image. At present, however, there is no annotated image classification dataset for cartoon scenes. For real-world scenes, classification datasets are annotated manually, the number of images exceeds ten million, and the annotation cost is enormous. In addition, the same object is depicted in very different styles in different cartoons and is strongly influenced by the artist's subjective choices.
Disclosure of Invention
The present disclosure provides a convolutional neural network training method, apparatus, and storage medium for image feature extraction that may alleviate, mitigate, or even eliminate one or more of the above-mentioned problems.
According to one aspect of the present invention, there is provided a method for training a neural network, comprising: acquiring a training image set, wherein the training image set comprises a first training image set and a second training image set; calculating a first image embedding vector for each training image in the first training image set and a second image embedding vector for each training image in the second training image set using the neural network; establishing a plurality of triplets for the training image set based on a distance relationship between the first image embedding vector and the second image embedding vector, each of the plurality of triplets including an anchor point image, a positive example image, and a negative example image; and training the neural network with a predetermined ternary loss function based on each of the plurality of triplets to determine a plurality of parameters of the neural network.
In some embodiments, establishing the plurality of triplets for the training image set based on the distance relationship between the first image embedding vectors and the second image embedding vectors includes: calculating the distance between each pair of first image embedding vectors and selecting a predetermined number of first image embedding vector pairs (vn1, vn2) having the smallest distances; selecting, for one of the predetermined number of first image embedding vector pairs (vn1, vn2), the second image embedding vector nn1 in the second image embedding vector set that is closest to the first image embedding vector vn1 and the second image embedding vector nn2 that is closest to the first image embedding vector vn2; and establishing the plurality of triplets for the training image set based on the first image embedding vector pair (vn1, vn2) and the second image embedding vectors nn1 and nn2 associated with the first image embedding vector pair (vn1, vn2).
In some embodiments, establishing the plurality of triplets for the training image set based on the first image embedding vector pair (vn1, vn2) and the second image embedding vectors nn1 and nn2 associated with the first image embedding vector pair (vn1, vn2) includes: in response to the distance d1 between the first image embedding vector vn1 and the second image embedding vector nn1 being greater than the distance d2 between the first image embedding vector vn2 and the second image embedding vector nn2, establishing a triplet (vn2, vn1, nn2), wherein vn2 is the anchor image of the triplet, vn1 is the positive example image of the triplet, and nn2 is the negative example image of the triplet; and in response to the distance d1 between the first image embedding vector vn1 and the second image embedding vector nn1 being smaller than the distance d2 between the first image embedding vector vn2 and the second image embedding vector nn2, establishing a triplet (vn1, vn2, nn1), wherein vn1 is the anchor image of the triplet, vn2 is the positive example image of the triplet, and nn1 is the negative example image of the triplet.
In some embodiments, acquiring the training image set, the training image set comprising a first training image set and a second training image set, includes: acquiring training videos, wherein the training videos comprise a first training video and at least one second training video; and extracting key frames from the first training video and the at least one second training video to obtain the training image set, wherein the training image set comprises the first training image set and the second training image set.
In some embodiments, calculating the distance between each pair of first image embedding vectors comprises: calculating the Euclidean distance or the cosine distance between each pair of first image embedding vectors.
In some embodiments, calculating the second image embedding vector nn1 in the second image embedding vector set that is closest to the first image embedding vector vn1 and the second image embedding vector nn2 that is closest to the first image embedding vector vn2 comprises: calculating, under the Euclidean distance, the second image embedding vector nn1 in the second image embedding vector set that is closest to the first image embedding vector vn1 and the second image embedding vector nn2 that is closest to the first image embedding vector vn2; or calculating the same under the cosine distance.
According to another aspect of the present invention, there is provided an image feature extraction method including: acquiring at least one image; respectively extracting features of each image in at least one image by using a trained neural network to obtain an image embedding vector of each image; and outputting an image embedding vector; wherein the neural network is trained based on the steps of: acquiring a training image set, wherein the training image set comprises a first training image set and a second training image set; calculating a first image embedding vector for each training image in the first training image set and a second image embedding vector for each training image in the second training image set using the neural network; establishing a plurality of triplets for the training image set based on a distance relationship between the first image embedding vector and the second image embedding vector, each of the plurality of triplets including an anchor point image, a positive example image, and a negative example image; and training the neural network with a predetermined ternary loss function based on each of the plurality of triplets to determine a plurality of parameters of the neural network.
In some embodiments, establishing the plurality of triplets for the training image set based on the distance relationship between the first image embedding vectors and the second image embedding vectors includes: calculating the distance between each pair of first image embedding vectors and selecting a predetermined number of first image embedding vector pairs (vn1, vn2) having the smallest distances; selecting, for one of the predetermined number of first image embedding vector pairs (vn1, vn2), the second image embedding vector nn1 in the second image embedding vector set that is closest to the first image embedding vector vn1 and the second image embedding vector nn2 that is closest to the first image embedding vector vn2; and establishing the plurality of triplets for the training image set based on the first image embedding vector pair (vn1, vn2) and the second image embedding vectors nn1 and nn2 associated with the first image embedding vector pair (vn1, vn2).
In some embodiments, establishing the plurality of triplets for the training image set based on the first image embedding vector pair (vn1, vn2) and the second image embedding vectors nn1 and nn2 associated with the first image embedding vector pair (vn1, vn2) includes: in response to the distance d1 between the first image embedding vector vn1 and the second image embedding vector nn1 being greater than the distance d2 between the first image embedding vector vn2 and the second image embedding vector nn2, establishing a triplet (vn2, vn1, nn2), wherein vn2 is the anchor image of the triplet, vn1 is the positive example image of the triplet, and nn2 is the negative example image of the triplet; and in response to the distance d1 between the first image embedding vector vn1 and the second image embedding vector nn1 being smaller than the distance d2 between the first image embedding vector vn2 and the second image embedding vector nn2, establishing a triplet (vn1, vn2, nn1), wherein vn1 is the anchor image of the triplet, vn2 is the positive example image of the triplet, and nn1 is the negative example image of the triplet.
In some embodiments, performing feature extraction on each image in the at least one image using the neural network to obtain the image embedding vector of each image includes: extracting, for each image, the output of the last layer or an intermediate layer of the neural network; and pooling the features of the last layer or intermediate layer to obtain the vectorized embedding vector of each image.
According to another aspect of the present invention, there is provided an apparatus for training a neural network, comprising: an acquisition module configured to acquire a training image set including a first training image set and a second training image set; an image vectorization module configured to calculate a first image embedded vector for each training image in the first training image set and a second image embedded vector for each training image in the second training image set using a neural network; a training set creation module configured to create a plurality of triples for the training image set based on a distance relationship between the first image embedding vector and the second image embedding vector, each of the plurality of triples including an anchor image, a positive example image, and a negative example image; and a training module configured to train the neural network with a predetermined ternary loss function based on each of the plurality of triplets to determine a plurality of parameters of the neural network.
According to still another aspect of the present invention, there is provided an image feature extraction apparatus comprising: an acquisition module configured to acquire at least one image; the feature extraction module is configured to extract features of each image in at least one image by using a neural network respectively to obtain an image embedding vector of each image; and an output module configured to output the image embedding vector; wherein the neural network is trained based on the steps of: acquiring a training image set, wherein the training image set comprises a first training image set and a second training image set; calculating a first image embedding vector for each training image in the first training image set and a second image embedding vector for each training image in the second training image set using the neural network; establishing a plurality of triplets for the training image set based on a distance relationship between the first image embedding vector and the second image embedding vector, each of the plurality of triplets including an anchor point image, a positive example image, and a negative example image; and training the neural network with a predetermined ternary loss function based on each of the plurality of triplets to determine a plurality of parameters of the neural network.
According to another aspect of the present invention, there is provided a computing device comprising: a processor; and a memory having instructions stored thereon that, when executed on the processor, cause the processor to perform any of the above methods.
According to another aspect of the present invention, there is provided a computer readable storage medium having stored thereon computer readable instructions which, when executed, implement any one of the above methods.
The embodiments of the invention provide a neural network training method, apparatus, and storage medium for image feature extraction. In particular, the method can be used to solve the problems of classification and feature extraction for images with little texture, such as cartoon images. It can be used to train classification, segmentation, and detection models for cartoon images, and addresses the poor feature extraction quality in the cartoon image domain. Embedding vectors can be efficiently generated for objects of a particular object type (e.g., images, in particular cartoon images), such that the distance between the embedding vectors of all images of one object type (e.g., cartoon images drawn by the same artist) is small, while the distance between the embedding vectors of images of different object types is large. In this way, the (anchor, positive example, negative example) triplet dataset of cartoon images for ternary loss training is acquired quickly and efficiently. By virtue of an ImageNet pre-trained model, convergence is fast and the neural network model trains well. Training the neural network with the ternary loss gives it the ability to extract cartoon image features, and the training process is efficient and fast. The quality of the resulting vectorized representation of cartoon images is better than that of traditional methods such as SIFT and RANSAC. Triplets for training the neural network can be selected effectively, reducing training time and improving the accuracy of the embeddings generated by the trained neural network.
Drawings
Further details, features and advantages of the invention are disclosed in the following description of exemplary embodiments with reference to the drawings. The drawings are only for purposes of illustrating embodiments and are not to be construed as limiting the disclosure. And the same reference numbers will be used throughout the drawings to refer to the same or like elements. In the drawings:
FIG. 1 depicts a schematic diagram of the working principle of the ternary loss;
FIGS. 2a-c depict schematic diagrams of feature extraction in the related art;
FIG. 3a schematically depicts different representations of the same character in a cartoon;
FIG. 4 depicts a block diagram of an example computing system, according to one embodiment of the invention;
FIG. 5 depicts an example workflow diagram according to one embodiment of the invention;
FIG. 6 schematically illustrates a flow chart of a method of training a neural network, according to one embodiment of the invention;
FIG. 7 schematically illustrates a flow chart of a method of image feature extraction according to one embodiment of the invention;
FIG. 8 schematically illustrates a block diagram of an apparatus for training a neural network, in accordance with one embodiment of the present invention; and
FIG. 9 illustrates a schematic block diagram of a computing system capable of implementing a method for training a neural network, according to some embodiments of the invention.
Detailed Description
Several embodiments of the present invention will be described in greater detail below with reference to the accompanying drawings so as to enable those skilled in the art to understand and implement the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. These examples are intended to illustrate, but not to limit, the present invention.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, steps and/or sections, these elements, steps and/or sections should not be limited by these terms. These terms are only used to distinguish one element, step or section from another element, step or section. Thus, a first element, step or section discussed below could be termed a second element, step or section without departing from the teachings of the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and/or the present specification and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
It should be noted that the features of the embodiments may be used in any combination without conflict.
Before describing embodiments of the present invention in detail, some related concepts will be explained first:
1. Ternary loss (triplet loss): the ternary loss function is calculated from a triplet of three parameters (whereas the commonly used softmax loss function is calculated from two parameters, the predicted label and the ground-truth label). A triplet consists of three parts: an anchor, a positive example, and a negative example, each represented by an embedding vector. The anchor is a reference image, the positive example is an image similar to the anchor, and the negative example is an image dissimilar to the anchor. The goal of the ternary loss is to make the distance between the anchor and the positive example smaller than the distance between the anchor and the negative example. Fig. 1 depicts a schematic diagram 100 of the working principle of the ternary loss. Circle 101 denotes the anchor, circle 102 the negative example, and circle 103 the positive example. The training process indicated by arrow 104 pulls the anchor 101 closer to the positive example 103 and pushes the anchor 101 away from the negative example 102.
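The following is a minimal sketch of the ternary (triplet) loss described above, written in Python with NumPy; the margin value and the use of squared Euclidean distance are illustrative assumptions, since the text does not fix them.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Ternary (triplet) loss: L = max(d(a, p) - d(a, n) + margin, 0)."""
    d_ap = np.sum((anchor - positive) ** 2)   # squared distance anchor-positive
    d_an = np.sum((anchor - negative) ** 2)   # squared distance anchor-negative
    return max(d_ap - d_an + margin, 0.0)

# A well-separated triplet incurs zero loss:
a = np.array([0.0, 0.0])   # anchor embedding
p = np.array([0.1, 0.0])   # positive example, close to the anchor
n = np.array([1.0, 1.0])   # negative example, far from the anchor
print(triplet_loss(a, p, n))  # 0.0
```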
Taking a low-texture image (specifically, a cartoon image) as an example, an anchor image, a positive example image, and a negative example image can be chosen as follows: an image with the same artist and style as the anchor image serves as the positive example image, and an image with a different artist and style serves as the negative example image. A triplet of cartoon images is thus constructed.
Fig. 2a-c depict schematic diagrams of feature extraction in the related art. Current deep neural networks trained on ImageNet make predictions primarily based on the texture of the image. As shown in figs. 2a-c, if a cat shape is drawn with the outlines of fig. 2b but filled with the elephant-skin texture of fig. 2a, the network recognizes the resulting image of fig. 2c as an elephant. Specifically, the recognition results of the neural network are as follows. Texture image in fig. 2a: 81.4% probability elephant; 10.3% probability indri; 8.2% probability black swan. Content image in fig. 2b: 71.1% probability tabby cat; 17.3% probability grey fox; 3.3% probability Siamese cat. Texture-content conflict image in fig. 2c: 63.9% probability elephant; 26.4% probability indri; 9.6% probability black swan. It follows that current neural networks rely primarily on texture for their judgments, which makes them unsuitable for images with little texture (e.g., cartoon images).
At present, there is no annotated image classification dataset for cartoon scenes. For real-world scenes, classification datasets are annotated manually, the number of images exceeds ten million, and the annotation cost is enormous. Moreover, the same object is depicted in very different styles in different cartoons and is strongly influenced by the artist's subjective choices. Fig. 3a schematically depicts different depictions of the same character in cartoons, taking dogs as an example: the same kind of character is drawn differently in different works. These differences add significant difficulty to building a classification dataset for cartoon scenes. Compared with real-world images, cartoon images have the following characteristics: clear edges; simple texture, material, and grain; flat colors with few shadows; and abstract, simplified objects. Deep neural classification networks developed for real-world scenes are therefore not well suited to cartoon scenes: the obtained features are of poor quality and perform poorly in retrieval and in video and image understanding. In addition, since classification datasets, deep classification networks, and feature extraction methods for the cartoon domain are still lacking, a network that can recognize and classify the scenes, characters, animals, and articles in cartoon images is needed to extract and vectorize features for understanding and retrieving cartoon content.
Although cartoon styles share common traits such as clear edges, simple materials, flat colors, and few shadows, the styles of cartoons by different artists still differ considerably; classified by style, they include realistic, line-drawing, beautiful-character, cute, 3D, and other styles, so cartoon works of different styles need to be collected. The ImageNet pre-trained neural network is fine-tuned (finetune) to acquire feature extraction capability for these different cartoon styles, so that the cartoon feature extraction network can extract, for cartoon images of different styles, high-quality features that are compact within a class and well separated between classes and that can be used for classification and retrieval.
Obtaining a large amount of manually annotated triplet training data from cartoon videos is extremely costly. With the training method described herein, which automatically mines triplets using an ImageNet pre-trained model and trains with the ternary loss, the feature embedding model can be trained without manually annotated samples. The invention provides an unsupervised, triplet-loss-based training method for cartoon image embedding, and the extracted features can be used for tasks such as cartoon image understanding, classification, and retrieval.
FIG. 4 depicts a block diagram of an example computing system 500, according to one embodiment of the invention. The system 500 includes a user computing device 502, a server computing system 530, and a training computing system 550 communicatively coupled by a network 580.
The user computing device 502 may be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smart phone or tablet), a game console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
The user computing device 502 includes one or more processors 512 and memory 514. The one or more processors 512 may be any suitable processing device (e.g., processor core, microprocessor, ASIC, FPGA, controller, microcontroller, etc.), and may be one processor or multiple processors operatively connected. Memory 514 may include one or more non-transitory computer-readable storage media such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, and the like, as well as combinations thereof. Memory 514 may store data 516 and instructions 518 executed by processor 512 to cause user computing device 502 to perform operations.
The user computing device 502 may store or include one or more image feature extraction models 520. For example, the image feature extraction model 520 may be or may otherwise include various machine learning models, such as a neural network (e.g., a deep neural network) or other multi-layer nonlinear model. The neural network may include a recurrent neural network (e.g., a long and short term memory recurrent neural network), a feed-forward neural network, a convolutional neural network, or other form of neural network. Alternatively or additionally, the image feature extraction model 520 may include other forms of machine learning models.
In some implementations, one or more image feature extraction models 520 may be received from the server computing system 530 over the network 580, stored in the user computing device memory 514, and then used or otherwise implemented by the one or more processors 512. In some implementations, the user computing device 502 can implement multiple parallel instances of the image feature extraction model 520 (e.g., to perform multiple parallel instances of image feature extraction).
Additionally or alternatively, one or more image feature extraction models 540 may be included in the server computing system 530 in communication with the user computing device 502 according to a client-server relationship, or otherwise stored and implemented by the server computing system 530. For example, the image feature extraction model 540 may be implemented by the server computing system 530 as part of a web service (e.g., an image feature search service). Accordingly, one or more image feature extraction models 520 may be stored and implemented at the user computing device 502 and/or one or more image feature extraction models 540 may be stored and implemented at the server computing system 530.
The user computing device 502 may also include one or more user input components 522 that receive user input. For example, the user input component 522 may be a touch-sensitive component (e.g., a touch-sensitive display screen or touchpad) that is sensitive to the touch of a user input object (e.g., a finger or stylus). The touch-sensitive component may be used to implement a virtual keyboard. Other example user input components include a microphone, a conventional keyboard, a conventional mouse, a camera, or other components by which a user may provide user input.
The server computing system 530 includes one or more processors 532 and memory 534. The one or more processors 532 may be any suitable processing device (e.g., processor core, microprocessor, ASIC, FPGA, controller, microcontroller, etc.), and may be one processor or multiple processors operatively connected. The memory 534 may include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, disks, and the like, as well as combinations thereof. The memory 534 may store data 536 and instructions 538 for execution by the processor 532 to cause the server computing system 530 to perform operations.
In some implementations, the server computing system 530 includes or is otherwise implemented by one or more server computing devices. Where the server computing system 530 includes multiple server computing devices, such server computing devices can operate in accordance with a sequential computing architecture, a parallel computing architecture, or some combination thereof.
As described above, the server computing system 530 may store or otherwise include one or more machine-learned image feature extraction models 540. For example, the image feature extraction model 540 may be or otherwise include various machine learning models, such as a neural network (e.g., a deep neural network) or other multi-layer nonlinear model. The neural network may include a recurrent neural network (e.g., a long and short term memory recurrent neural network), a feed-forward neural network, a convolutional neural network, or other form of neural network. Alternatively or additionally, the image feature extraction model 540 may include other forms of machine learning models.
The server computing system 530 may train the image feature extraction model 540 via interaction with a training computing system 550 communicatively coupled via a network 580. The training computing system 550 may be separate from the server computing system 530 or may be part of the server computing system 530.
The training computing system 550 includes one or more processors 552 and memory 554. The one or more processors 552 may be any suitable processing device (e.g., a processor core, microprocessor, ASIC, FPGA, controller, microcontroller, etc.), and may be one processor or multiple processors operatively connected. The memory 554 may include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, and the like, as well as combinations thereof. Memory 554 may store data 556 and instructions 558 for execution by processor 552 to cause training computing system 550 to perform operations. In some implementations, the training computing system 550 includes or is otherwise implemented by one or more server computing devices.
Training computing system 550 may include a model trainer 560 that trains the machine-learned models 520/540 using various training or learning techniques, such as, for example, backpropagation of errors. Model trainer 560 may perform a variety of generalization techniques (e.g., weight decay, dropout, etc.) to enhance the generalization ability of the trained model.
Specifically, model trainer 560 may train image feature extraction models 520/540 based on a set of training data 562.
Model trainer 560 includes computer logic for providing the desired functionality. Model trainer 560 may be implemented in hardware, firmware, and/or software controlling a general-purpose processor. For example, in some implementations, model trainer 560 includes program files stored on a storage device, loaded into memory, and executed by one or more processors. In other implementations, model trainer 560 includes one or more sets of computer-executable instructions stored in a tangible computer-readable storage medium such as RAM, a hard disk, or optical or magnetic media.
Network 580 may be any type of communication network, such as a local area network (e.g., an intranet), a wide area network (e.g., the internet), or some combination thereof, and may include any number of wired or wireless links. In general, communications over network 580 may occur via any type of wired and/or wireless connection using a variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), coding or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
FIG. 4 illustrates one example computing system that may be used to implement the present disclosure. Other computing systems may also be used. For example, in some implementations, the user computing device 502 may include a model trainer 560 and a training data set 562. In such an embodiment, the image feature extraction model 520 may be trained and used locally at the user computing device 502. In some of such implementations, the user computing device 502 may implement a model trainer 560 to personalize the image feature extraction model 520 based on user-specific data.
FIG. 5 depicts an example workflow diagram according to one embodiment of the invention. In particular, fig. 5 shows an image feature extraction model 602 configured to provide an image embedding vector 606. Specifically, the image feature extraction model 602 may receive the input image 604 and, in response, provide an image embedding vector 606, the image embedding vector 606 encoding information describing an image depicted in the input image 604.
In some implementations, the image embedding vector 606 may be obtained from the last layer of the image feature extraction model 602 or otherwise provided at the last layer of the image feature extraction model 602. In other implementations, the image embedding vector 606 may be obtained from or otherwise provided at an intermediate layer of the image feature extraction model 602. For example, the middle layer may be the near-last layer but not the last layer of the image feature extraction model 602.
In some implementations, the image feature extraction model 602 may be or may otherwise include various machine learning models, such as a neural network (e.g., deep neural network) or other multi-layer nonlinear model. The neural network may include a recurrent neural network (e.g., a long and short term memory recurrent neural network), a feed-forward neural network, a convolutional neural network, or other form of neural network. Alternatively or additionally, the image feature extraction model 602 may include other forms of machine learning models.
In accordance with an aspect of the disclosure, in some implementations, the image embedding vector 606 may be or include an embedding vector containing a plurality of values for a plurality of embedding dimensions (e.g., 16 embedding dimensions), respectively.
Image embedding vector 606 may be used for a number of different purposes including, for example, determining a similarity measure between two cartoon images. Taking cartoon images as a specific example, the similarity between the embeddings of cartoon images can indicate the similarity between those cartoon images. Likewise, dissimilarity between embeddings may indicate dissimilarity between cartoon images. In one example, the similarity or association between cartoon images may be determined by calculating the Euclidean distance (e.g., dimension-wise) between their respective embedding vectors provided by image feature extraction model 602. This may be useful, for example, for performing a cartoon image similarity search in other applications. In addition, the image embedding vector 606 can also be used for cartoon image classification and scene recognition; shot detection for cartoon videos can be realized by calculating the distance between the vectorized features of cartoon images, so that a cartoon video can be segmented into shots; and the vectorized features of cartoon images support searching for images by image, and hence searching for videos by image.
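As a hedged illustration of these uses, the sketch below computes Euclidean and cosine distances between embedding vectors and applies a simple threshold over consecutive key-frame embeddings to flag shot boundaries; the threshold value and function names are illustrative assumptions rather than part of the described method.

```python
import numpy as np

def euclidean_distance(v1, v2):
    """Dimension-wise Euclidean distance between two embedding vectors."""
    return float(np.linalg.norm(v1 - v2))

def cosine_distance(v1, v2):
    """1 - cosine similarity between two embedding vectors."""
    return 1.0 - float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def detect_shot_cuts(frame_embeddings, threshold=1.0):
    """Flag a shot boundary wherever consecutive key-frame embeddings are far apart."""
    cuts = []
    for i in range(1, len(frame_embeddings)):
        if euclidean_distance(frame_embeddings[i - 1], frame_embeddings[i]) > threshold:
            cuts.append(i)   # a cut occurs between frame i-1 and frame i
    return cuts
```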
Fig. 6 schematically illustrates a flow chart of a method of training a neural network, according to one embodiment of the invention. The method comprises step S701 of acquiring a training image set comprising a first training image set and a second training image set. The input to the convolutional neural network in one batch consists of two parts: a batch of key-frame images batch1 taken from the same video, and a batch of key-frame images batch2 randomly selected from the key frames of all the other videos. For example, given a video library of 100 videos, 1 video is first selected, and the remaining 99 videos are treated as "all videos". Then, for example, a predetermined number of key frames is determined from the first video based on inter-frame distances, and the same predetermined number of key frames is also selected from the remaining 99 videos. In one embodiment, acquiring the training image set further comprises: acquiring training videos, wherein the training videos comprise a first training video and at least one second training video; and extracting key frames from the first training video and the at least one second training video to obtain the training image set, wherein the training image set comprises the first training image set and the second training image set.
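A minimal sketch of this batch construction is shown below, assuming key frames have already been extracted per video; the random sampling of key frames and the dictionary-based interface are illustrative assumptions (the method described above selects key frames based on inter-frame distances).

```python
import random

def build_training_batches(video_keyframes, batch_size=32):
    """Build batch1 (frames from one video) and batch2 (frames from the other videos).

    `video_keyframes` maps a video id to its list of extracted key-frame images.
    """
    video_ids = list(video_keyframes.keys())
    selected = random.choice(video_ids)                     # the "first training video"
    others = [vid for vid in video_ids if vid != selected]  # the remaining videos

    batch1 = random.sample(video_keyframes[selected],
                           min(batch_size, len(video_keyframes[selected])))
    pool = [frame for vid in others for frame in video_keyframes[vid]]
    batch2 = random.sample(pool, min(batch_size, len(pool)))
    return batch1, batch2
```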
In step S702, a first image embedding vector for each training image in the first training image set and a second image embedding vector for each training image in the second training image set are calculated using a neural network. The convolutional neural network calculates a first image embedding vector and a second image embedding vector for the batch. In some implementations, the image embedding vector may be obtained from or otherwise provided at the last layer of the image feature extraction model. In other embodiments, the image embedding vector may be obtained from or otherwise provided at an intermediate layer of the image feature extraction model. For example, the intermediate layer may be the near-last layer but not the last layer of the image feature extraction model. In some implementations, the image feature extraction model may be or may otherwise include various machine learning models, such as a neural network (e.g., deep neural network) or other multi-layer nonlinear model. The neural network may include a recurrent neural network (e.g., a long and short term memory recurrent neural network), a feed-forward neural network, a convolutional neural network, or other form of neural network. Alternatively or additionally, the image feature extraction model may include other forms of machine learning models.
In step S703, a plurality of triplets for the training image set are established based on the distance relationship between the first image embedding vectors and the second image embedding vectors, each of the plurality of triplets including an anchor image, a positive example image, and a negative example image. In one embodiment, establishing the plurality of triplets for the training image set based on the distance relationship between the first image embedding vectors and the second image embedding vectors includes: calculating the distance between each pair of first image embedding vectors and selecting a predetermined number (e.g., the first N, N being a positive integer) of first image embedding vector pairs (vn1, vn2) having the smallest distances; selecting, for one of the predetermined number of first image embedding vector pairs (vn1, vn2), the second image embedding vector nn1 in the second image embedding vector set that is closest to the first image embedding vector vn1 and the second image embedding vector nn2 that is closest to the first image embedding vector vn2; and establishing the plurality of triplets for the training image set based on the first image embedding vector pair (vn1, vn2) and the second image embedding vectors nn1 and nn2 associated with the first image embedding vector pair (vn1, vn2). In another embodiment, establishing the plurality of triplets for the training image set based on the first image embedding vector pair (vn1, vn2) and the second image embedding vectors nn1 and nn2 associated with the first image embedding vector pair (vn1, vn2) includes: in response to the distance d1 between the first image embedding vector vn1 and the second image embedding vector nn1 being greater than the distance d2 between the first image embedding vector vn2 and the second image embedding vector nn2, establishing a triplet (vn2, vn1, nn2), wherein vn2 is the anchor image of the triplet, vn1 is the positive example image of the triplet, and nn2 is the negative example image of the triplet; and in response to the distance d1 between the first image embedding vector vn1 and the second image embedding vector nn1 being smaller than the distance d2 between the first image embedding vector vn2 and the second image embedding vector nn2, establishing a triplet (vn1, vn2, nn1), wherein vn1 is the anchor image of the triplet, vn2 is the positive example image of the triplet, and nn1 is the negative example image of the triplet.
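The following is a minimal sketch of the triplet-mining step S703 operating on NumPy arrays of embeddings; the Euclidean metric, the `num_pairs` value, and the index-tuple return format are illustrative assumptions.

```python
import numpy as np

def mine_triplets(emb1, emb2, num_pairs=10):
    """Mine (anchor, positive, negative) triplets from two embedding batches.

    emb1: (M, D) embeddings of key frames from the same video (first set).
    emb2: (K, D) embeddings of key frames from other videos (second set).
    Returns a list of ((set_id, index), ...) triplets referring back to the batches.
    """
    # Pairwise distances within the first set; keep the num_pairs closest pairs.
    dist1 = np.linalg.norm(emb1[:, None, :] - emb1[None, :, :], axis=-1)
    rows, cols = np.triu_indices(len(emb1), k=1)
    order = np.argsort(dist1[rows, cols])[:num_pairs]
    closest_pairs = [(rows[k], cols[k]) for k in order]

    triplets = []
    for i, j in closest_pairs:            # vn1 = emb1[i], vn2 = emb1[j]
        d_i = np.linalg.norm(emb2 - emb1[i], axis=1)   # distances from vn1 to the second set
        d_j = np.linalg.norm(emb2 - emb1[j], axis=1)   # distances from vn2 to the second set
        nn1, nn2 = int(np.argmin(d_i)), int(np.argmin(d_j))
        d1, d2 = d_i[nn1], d_j[nn2]
        if d1 > d2:
            # vn2 is the anchor, vn1 the positive example, nn2 the negative example.
            triplets.append((("batch1", j), ("batch1", i), ("batch2", nn2)))
        else:
            # vn1 is the anchor, vn2 the positive example, nn1 the negative example.
            triplets.append((("batch1", i), ("batch1", j), ("batch2", nn1)))
    return triplets
```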
In one embodiment, calculating the distance between each pair of first image embedding vectors comprises: calculating the Euclidean distance or the cosine distance between each pair of first image embedding vectors.
In one embodiment, calculating the second image embedding vector nn1 in the second image embedding vector set that is closest to the first image embedding vector vn1 and the second image embedding vector nn2 that is closest to the first image embedding vector vn2 comprises: calculating, under the Euclidean distance, the second image embedding vector nn1 in the second image embedding vector set that is closest to the first image embedding vector vn1 and the second image embedding vector nn2 that is closest to the first image embedding vector vn2; or calculating the same under the cosine distance.
In step S704, the neural network is trained with a predetermined ternary loss function based on each of the plurality of triplets to determine a plurality of parameters of the neural network. The embodiments of the invention thus provide a neural network training method for image feature extraction. In particular, the method can be used to solve the problems of classification and feature extraction for images with little texture, such as cartoon images. It can be used to train classification, segmentation, and detection models for cartoon images, and addresses the poor feature extraction quality in the cartoon image domain. Embedding vectors can be efficiently generated for objects of a particular object type (e.g., images, in particular cartoon images), such that the distance between the embedding vectors of all images of one object type (e.g., cartoon images drawn by the same artist) is small, while the distance between the embedding vectors of images of different object types is large. In this way, the (anchor, positive example, negative example) triplet dataset of cartoon images for ternary loss training is acquired quickly and efficiently. By virtue of an ImageNet pre-trained model, convergence is fast and the neural network model trains well. Training the neural network with the ternary loss gives it the ability to extract cartoon image features, and the training process is efficient and fast. The quality of the resulting vectorized representation of cartoon images is better than that of traditional methods such as SIFT and RANSAC. Triplets for training the neural network can be selected effectively, reducing training time and improving the accuracy of the embeddings generated by the trained neural network.
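A hedged sketch of one optimization step of S704 is shown below, written with PyTorch; the framework choice, the margin value, and the use of Euclidean distances are illustrative assumptions, since the text only specifies a predetermined ternary loss function.

```python
import torch
import torch.nn.functional as F

def triplet_training_step(model, optimizer, anchors, positives, negatives, margin=0.2):
    """One training step: minimize the ternary loss over a batch of mined triplets.

    anchors/positives/negatives are image tensors of shape (B, C, H, W);
    model maps images to (B, D) embedding vectors.
    """
    model.train()
    optimizer.zero_grad()
    emb_a = model(anchors)     # embeddings of the anchor images
    emb_p = model(positives)   # embeddings of the positive example images
    emb_n = model(negatives)   # embeddings of the negative example images

    d_ap = F.pairwise_distance(emb_a, emb_p)    # anchor-positive distances
    d_an = F.pairwise_distance(emb_a, emb_n)    # anchor-negative distances
    loss = F.relu(d_ap - d_an + margin).mean()  # ternary (triplet) loss with margin

    loss.backward()
    optimizer.step()
    return loss.item()
```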
Fig. 7 schematically shows a flow chart of an image feature extraction method according to one embodiment of the invention. In step S801, at least one image is acquired. In particular, an image with little texture, such as a cartoon image, may be acquired; the image may be selected as a key frame of a video. In step S802, feature extraction is performed on each of the at least one image using the trained neural network to obtain an image embedding vector of each image. The neural network is trained based on the following steps. First, a training image set is acquired, the training image set comprising a first training image set and a second training image set. The input to the convolutional neural network in one batch consists of two parts: a batch of key-frame images batch1 taken from the same video, and a batch of key-frame images batch2 randomly selected from the key frames of all the other videos. For example, given a video library of 100 videos, 1 video is first selected, and the remaining 99 videos are treated as "all videos". Then, for example, a predetermined number of key frames is determined from the first video based on inter-frame distances, and the same predetermined number of key frames is also selected from the remaining 99 videos. In one embodiment, acquiring the training image set further comprises: acquiring training videos, wherein the training videos comprise a first training video and at least one second training video; and extracting key frames from the first training video and the at least one second training video to obtain the training image set, wherein the training image set comprises the first training image set and the second training image set.
Then, a first image embedded vector for each training image in the first training image set and a second image embedded vector for each training image in the second training image set are calculated using the neural network. The convolutional neural network calculates a first image embedding vector and a second image embedding vector for the batch. In some implementations, the image embedding vector may be obtained from or otherwise provided at the last layer of the image feature extraction model. In other embodiments, the image embedding vector may be obtained from or otherwise provided at an intermediate layer of the image feature extraction model. For example, the intermediate layer may be the near-last layer but not the last layer of the image feature extraction model. In some implementations, the image feature extraction model may be or may otherwise include various machine learning models, such as a neural network (e.g., deep neural network) or other multi-layer nonlinear model. The neural network may include a recurrent neural network (e.g., a long and short term memory recurrent neural network), a feed-forward neural network, a convolutional neural network, or other form of neural network. Alternatively or additionally, the image feature extraction model may include other forms of machine learning models.
Next, a plurality of triplets for the training image set are established based on the distance relationship between the first image embedding vectors and the second image embedding vectors, each of the plurality of triplets including an anchor image, a positive example image, and a negative example image. In one embodiment, establishing the plurality of triplets for the training image set based on the distance relationship between the first image embedding vectors and the second image embedding vectors includes: calculating the distance between each pair of first image embedding vectors and selecting a predetermined number (e.g., the first N, N being a positive integer) of first image embedding vector pairs (vn1, vn2) having the smallest distances; selecting, for one of the predetermined number of first image embedding vector pairs (vn1, vn2), the second image embedding vector nn1 in the second image embedding vector set that is closest to the first image embedding vector vn1 and the second image embedding vector nn2 that is closest to the first image embedding vector vn2; and establishing the plurality of triplets for the training image set based on the first image embedding vector pair (vn1, vn2) and the second image embedding vectors nn1 and nn2 associated with the first image embedding vector pair (vn1, vn2). In another embodiment, establishing the plurality of triplets for the training image set based on the first image embedding vector pair (vn1, vn2) and the second image embedding vectors nn1 and nn2 associated with the first image embedding vector pair (vn1, vn2) includes: in response to the distance d1 between the first image embedding vector vn1 and the second image embedding vector nn1 being greater than the distance d2 between the first image embedding vector vn2 and the second image embedding vector nn2, establishing a triplet (vn2, vn1, nn2), wherein vn2 is the anchor image of the triplet, vn1 is the positive example image of the triplet, and nn2 is the negative example image of the triplet; and in response to the distance d1 between the first image embedding vector vn1 and the second image embedding vector nn1 being smaller than the distance d2 between the first image embedding vector vn2 and the second image embedding vector nn2, establishing a triplet (vn1, vn2, nn1), wherein vn1 is the anchor image of the triplet, vn2 is the positive example image of the triplet, and nn1 is the negative example image of the triplet.
Finally, the neural network is trained with a predetermined ternary loss function based on each of the plurality of triplets to determine a plurality of parameters of the neural network.
In step S803, the image embedding vector is output. After the cartoon feature extraction neural network has been trained, features can be obtained from the network either as the output of the last layer or as the output of an intermediate layer. The image embedding vector is obtained by pooling the intermediate-layer output. The pooling method may be any of a variety of methods such as max pooling, average pooling, RMAC, GeM, etc. In one embodiment, PCA may also be used to reduce the feature dimension, and an autoencoder structure may also be used to output the final features.
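The sketch below illustrates, in PyTorch, the average, max, and generalized-mean (GeM) pooling options mentioned above for turning an intermediate convolutional feature map into an embedding vector; the parameter value p=3 and the function names are illustrative assumptions.

```python
import torch

def average_pool(feature_map):
    """Average pooling of a (B, C, H, W) feature map to a (B, C) embedding."""
    return feature_map.mean(dim=(-2, -1))

def max_pool(feature_map):
    """Max pooling of a (B, C, H, W) feature map to a (B, C) embedding."""
    return feature_map.amax(dim=(-2, -1))

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling: p=1 is average pooling, large p approaches max pooling."""
    clamped = feature_map.clamp(min=eps)
    return clamped.pow(p).mean(dim=(-2, -1)).pow(1.0 / p)
```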
The image embedding vectors may be used for a number of different purposes including, for example, determining a similarity measure between two cartoon images (for instance, cartoon images taken from different videos). Taking cartoon images as a specific example, the similarity between the embeddings of cartoon images can indicate the similarity between those cartoon images, and dissimilarity between embeddings may indicate dissimilarity between cartoon images. In one example, the similarity or association between cartoon images may be determined by calculating the Euclidean distance (e.g., dimension-wise) between their respective embedding vectors provided by the image feature extraction model. This may be useful, for example, for performing a cartoon image similarity search in other applications. In addition, the image embedding vectors can also be used for cartoon image classification and scene recognition; shot detection for cartoon videos can be realized by calculating the distance between the vectorized features of cartoon images, so that a cartoon video can be segmented into shots; and the vectorized features of cartoon images support searching for images by image, and hence searching for videos by image.
Fig. 8 schematically illustrates a block diagram of a training apparatus 900 for a neural network, according to one embodiment of the invention. The apparatus 900 for training a neural network includes: an acquisition module 901, an image vectorization module 902, a training set establishment module 903, and a training module 904. The acquisition module 901 is configured to acquire a training image set comprising a first training image set and a second training image set. The image vectorization module 902 is configured to calculate, using a neural network, a first image embedding vector for each training image in the first training image set and a second image embedding vector for each training image in the second training image set. The training set establishment module 903 is configured to establish a plurality of triplets for the training image set based on a distance relationship between the first image embedding vectors and the second image embedding vectors, each of the plurality of triplets including an anchor image, a positive example image, and a negative example image. The training module 904 is configured to train the neural network with a predetermined ternary loss function based on each of the plurality of triplets to determine a plurality of parameters of the neural network.

By this embodiment of the invention, a neural network training apparatus for image feature extraction is provided. Specifically, it can be used for training classification, segmentation and detection models for cartoon images, and solves the problem of poor feature extraction quality in the cartoon image field. Embedding vectors of objects of a particular object type (e.g., images, particularly animation images) can be generated efficiently, such that the distance between the embedding vectors of all images of a particular object type (e.g., animation images drawn by the same artist) is small, while the distance between the embedding vectors of images of different object types is large. Thus, the (anchor, positive example, negative example) triplet data set of cartoon-class images for ternary loss training is acquired quickly and efficiently. By starting from an ImageNet pre-trained model, convergence is fast and the training effect of the neural network model is good. Training the neural network with the ternary loss method gives it the ability to extract cartoon image features, and the training process is efficient and fast. The quality of the resulting vectorized representation of cartoon images is better than that of traditional methods such as SIFT and RANSAC. Triplets for training the neural network can be selected effectively, reducing training time and improving the accuracy of the embeddings generated by the trained neural network.
Fig. 9 illustrates a schematic block diagram of a computing system 1000 capable of implementing a method for training a neural network, according to some embodiments of the invention. In some embodiments, the computing system 1000 is representative of the user computing device 502, the server computing system 530, and the training computing system 550 in the application scenario of FIG. 4.
Computing system 1000 may include a variety of different types of devices, such as server computers, client devices, systems-on-chip, and/or any other suitable computing device or computing system.
The computing system 1000 may include at least one processor 1002, memory 1004, communication interface(s) 1006, a display device 1008, other input/output (I/O) devices 1010, and one or more mass storage devices 1012, which are capable of communicating with each other, such as through a system bus 1014 or other suitable means of connection.
The processor 1002 may be a single processing unit or multiple processing units, all of which may include a single or multiple computing units or multiple cores. The processor 1002 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. The processor 1002 may be configured to, among other capabilities, obtain and execute computer-readable instructions stored in the memory 1004, mass storage 1012, or other computer-readable medium, such as program code of the operating system 1016, program code of the application programs 1018, program code of other programs 1020, etc., to implement the methods provided by embodiments of the present invention.
Memory 1004 and mass storage device 1012 are examples of computer storage media for storing instructions that are executed by processor 1002 to implement the various functions as previously described. For example, the memory 1004 may generally include both volatile memory and nonvolatile memory (e.g., RAM, ROM, etc.). In addition, mass storage device 1012 may generally include hard drives, solid state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CD, DVD), storage arrays, network attached storage, storage area networks, and the like. Memory 1004 and mass storage device 1012 may both be referred to herein collectively as memory or a computer storage medium, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that may be executed by processor 1002 as a particular machine configured to implement the operations and functions described in the examples herein.
A number of program modules may be stored on the mass storage device 1012. These programs include an operating system 1016, one or more application programs 1018, other programs 1020, and program data 1022, which can be loaded into the memory 1004 for execution. Examples of such application programs or program modules may include, for example, computer program logic (e.g., computer program code or instructions) for implementing the methods provided herein. Moreover, the program modules may be distributed in different physical locations to perform the corresponding functions. For example, the methods described as being performed by the user computing device 502, the server computing system 530, and the training computing system 550 in the application scenario of fig. 4 may be distributed across multiple computing devices.
The present invention also provides a computer readable storage medium having stored thereon computer readable instructions which, when executed, implement the above-described method.
Although illustrated in Fig. 9 as being stored in the memory 1004 of the computing system 1000, the modules 1016, 1018, 1020, and 1022, or portions thereof, may be implemented using any form of computer readable media accessible by the computing system 1000. As used herein, "computer-readable medium" includes at least two types of computer-readable media, namely computer storage media and communication media.
Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computing system.
In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism. Computer storage media as defined herein do not include communication media.
The computing system 1000 may also include one or more communication interfaces 1006 for exchanging data with other devices, such as via a network, direct connection, or the like. The communication interface 1006 may facilitate communications within a variety of network and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and so forth. The communication interface 1006 may also provide communication with external storage devices (not shown) such as in a storage array, network attached storage, storage area network, or the like.
In some examples, a display device 1008 may be included for displaying information and images. Other I/O devices 1010 may be devices that receive various inputs from a user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so on.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those different embodiments or examples, provided that they do not contradict each other.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. Additional implementations are included within the scope of the preferred embodiments of the present invention, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art of the embodiments of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented as software functional modules and sold or used as a stand-alone product.
The invention mainly solves the following technical problems: traditional feature extraction methods such as SIFT and RANSAC cannot represent cartoon pictures well; the feature extraction capability of current deep neural networks trained on ImageNet is not well suited to cartoon pictures; and (anchor, positive example, negative example) triplet data sets of cartoon pictures are lacking, being difficult to collect and costly to produce.
Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (10)

1. A method for training a neural network, comprising:
acquiring a training image set, wherein the training image set comprises a first training image set and a second training image set;
calculating a first image embedding vector for each training image in the first training image set and a second image embedding vector for each training image in the second training image set using the neural network;
establishing a plurality of triples for the training image set based on a distance relationship between the first image embedding vector and the second image embedding vector, each of the plurality of triples including an anchor image, a positive example image, and a negative example image; and
training the neural network with a predetermined ternary loss function based on each of the plurality of triplets to determine a plurality of parameters of the neural network;
wherein establishing a plurality of triples for the training image set based on a distance relationship between the first image embedding vector and the second image embedding vector comprises:
calculating a distance between each two of the first image-embedded vector pairs, selecting a minimum predetermined number of first image-embedded vector pairs (vn 1, vn 2) of the distances;
Selecting, for one of the predetermined number of first image-embedding vector pairs (vn 1, vn 2), a second image-embedding vector nn1 of the second set of image-embedding vectors that is closest to the first image-embedding vector vn1 and a second image-embedding vector nn2 that is closest to the first image-embedding vector vn 2;
creating a plurality of triples for the training image set based on the first image-embedding vector pair (vn 1, vn 2), a second image-embedding vector nn1 associated with the first image-embedding vector pair (vn 1, vn 2), and a second image-embedding vector nn2, and
wherein the creating a plurality of triples for the training image set based on the first image-embedding vector pair (vn 1, vn 2), a second image-embedding vector nn1 associated with the first image-embedding vector pair (vn 1, vn 2), and a second image-embedding vector nn2 comprises:
establishing a triplet (vn 2, vn1, nn 2) in response to the distance d1 of the first image embedding vector vn1 from the second image embedding vector nn1 being greater than the distance d2 of the first image embedding vector vn2 from the second image embedding vector nn2, wherein vn2 is the anchor image of the triplet, vn1 is the positive image of the triplet, nn2 is the negative image of the triplet; and
And establishing a triplet (vn 1, vn2, nn 1) in response to the distance d1 between the first image embedding vector vn1 and the second image embedding vector nn1 being smaller than the distance d2 between the first image embedding vector vn2 and the second image embedding vector nn2, wherein vn1 is used as an anchor point image of the triplet, vn2 is used as a positive example image of the triplet, and nn1 is used as a negative example image of the triplet.
2. The method of claim 1, wherein the acquiring a training image set comprising a first training image set and a second training image set comprises:
acquiring training videos, wherein the training videos comprise a first training video and at least one second training video;
and extracting key frames aiming at the first training video and the at least one second training video to obtain an acquired training image set, wherein the training image set comprises a first training image set and a second training image set.
3. The method of claim 1, wherein the calculating a distance between each two of the first image-embedding vector pairs comprises:
and calculating the Euclidean distance or cosine distance between every two first image embedded vector pairs in the first image embedded vectors.
4. The method of claim 1, wherein the selecting the second image-embedding vector nn1 of the second set of image-embedding vectors that is closest to the first image-embedding vector vn1 and the second image-embedding vector nn2 that is closest to the first image-embedding vector vn2 comprises:
and calculating a second image embedded vector nn1 which is closest to the first image embedded vector vn1 in a Euclidean manner and a second image embedded vector nn2 which is closest to the first image embedded vector vn2 in the second image embedded vector set, or calculating a second image embedded vector nn1 which is closest to the first image embedded vector vn1 in the second image embedded vector set and a second image embedded vector nn2 which is closest to the first image embedded vector vn2 in a cosine manner.
5. An image feature extraction method, comprising:
acquiring at least one image;
respectively extracting the characteristics of each image in the at least one image by using a trained neural network to obtain an image embedded vector of each image; and
outputting the image embedded vector;
wherein the neural network is trained based on the steps of:
acquiring a training image set, wherein the training image set comprises a first training image set and a second training image set;
Calculating a first image embedding vector for each training image in the first training image set and a second image embedding vector for each training image in the second training image set using the neural network;
establishing a plurality of triples for the training image set based on a distance relationship between the first image embedding vector and the second image embedding vector, each of the plurality of triples including an anchor image, a positive example image, and a negative example image; and
training the neural network with a predetermined ternary loss function based on each of the plurality of triplets to determine a plurality of parameters of the neural network; wherein establishing a plurality of triples for the training image set based on a distance relationship between the first image embedding vector and the second image embedding vector comprises:
calculating a distance between each two of the first image-embedded vector pairs, selecting a minimum predetermined number of first image-embedded vector pairs (vn 1, vn 2) of the distances;
selecting, for one of the predetermined number of first image-embedding vector pairs (vn 1, vn 2), a second image-embedding vector nn1 of the second set of image-embedding vectors that is closest to the first image-embedding vector vn1 and a second image-embedding vector nn2 that is closest to the first image-embedding vector vn 2;
Creating a plurality of triples for the training image set based on the first image-embedding vector pair (vn 1, vn 2), a second image-embedding vector nn1 associated with the first image-embedding vector pair (vn 1, vn 2), and a second image-embedding vector nn2, and
wherein the creating a plurality of triples for the training image set based on the first image-embedding vector pair (vn 1, vn 2), a second image-embedding vector nn1 associated with the first image-embedding vector pair (vn 1, vn 2), and a second image-embedding vector nn2 comprises:
establishing a triplet (vn 2, vn1, nn 2) in response to the distance d1 of the first image embedding vector vn1 from the second image embedding vector nn1 being greater than the distance d2 of the first image embedding vector vn2 from the second image embedding vector nn2, wherein vn2 is the anchor image of the triplet, vn1 is the positive image of the triplet, nn2 is the negative image of the triplet; and
and establishing a triplet (vn 1, vn2, nn 1) in response to the distance d1 between the first image embedding vector vn1 and the second image embedding vector nn1 being smaller than the distance d2 between the first image embedding vector vn2 and the second image embedding vector nn2, wherein vn1 is used as an anchor point image of the triplet, vn2 is used as a positive example image of the triplet, and nn1 is used as a negative example image of the triplet.
6. The method of claim 5, wherein the performing feature extraction on each of the at least one image using the trained neural network, respectively, to obtain an image embedding vector for each image comprises:
extracting, for each image, an output of a last or middle layer of the neural network;
pooling the characteristics of the last layer or middle layer of the neural network to obtain the vectorized embedded vector of each image.
7. An apparatus for training a neural network, comprising:
an acquisition module configured to acquire a training image set including a first training image set and a second training image set;
an image vectorization module configured to calculate a first image embedding vector for each training image in the first training image set and a second image embedding vector for each training image in the second training image set using the neural network;
a training set creation module configured to create a plurality of triples for the training image set based on a distance relationship between the first image embedding vector and the second image embedding vector, each of the plurality of triples including an anchor image, a positive example image, and a negative example image; and
A training module configured to train the neural network with a predetermined ternary loss function based on each of the plurality of triplets to determine a plurality of parameters of the neural network;
wherein establishing a plurality of triples for the training image set based on a distance relationship between the first image embedding vector and the second image embedding vector comprises:
calculating a distance between each two of the first image-embedded vector pairs, selecting a minimum predetermined number of first image-embedded vector pairs (vn 1, vn 2) of the distances;
selecting, for one of the predetermined number of first image-embedding vector pairs (vn 1, vn 2), a second image-embedding vector nn1 of the second set of image-embedding vectors that is closest to the first image-embedding vector vn1 and a second image-embedding vector nn2 that is closest to the first image-embedding vector vn 2;
creating a plurality of triples for the training image set based on the first image-embedding vector pair (vn 1, vn 2), a second image-embedding vector nn1 associated with the first image-embedding vector pair (vn 1, vn 2), and a second image-embedding vector nn2, and
Wherein the creating a plurality of triples for the training image set based on the first image-embedding vector pair (vn 1, vn 2), a second image-embedding vector nn1 associated with the first image-embedding vector pair (vn 1, vn 2), and a second image-embedding vector nn2 comprises:
establishing a triplet (vn 2, vn1, nn 2) in response to the distance d1 of the first image embedding vector vn1 from the second image embedding vector nn1 being greater than the distance d2 of the first image embedding vector vn2 from the second image embedding vector nn2, wherein vn2 is the anchor image of the triplet, vn1 is the positive image of the triplet, nn2 is the negative image of the triplet; and
and establishing a triplet (vn 1, vn2, nn 1) in response to the distance d1 between the first image embedding vector vn1 and the second image embedding vector nn1 being smaller than the distance d2 between the first image embedding vector vn2 and the second image embedding vector nn2, wherein vn1 is used as an anchor point image of the triplet, vn2 is used as a positive example image of the triplet, and nn1 is used as a negative example image of the triplet.
8. An image feature extraction apparatus comprising:
an acquisition module configured to acquire at least one image;
The feature extraction module is configured to extract features of each image in the at least one image by using a neural network respectively to obtain an image embedding vector of each image; and
an output module configured to output the image embedding vector;
wherein the neural network is trained based on the steps of:
acquiring a training image set, wherein the training image set comprises a first training image set and a second training image set; calculating a first image embedding vector for each training image in the first training image set and a second image embedding vector for each training image in the second training image set using the neural network;
establishing a plurality of triples for the training image set based on a distance relationship between the first image embedding vector and the second image embedding vector, each of the plurality of triples including an anchor image, a positive example image, and a negative example image; and
training the neural network with a predetermined ternary loss function based on each of the plurality of triplets to determine a plurality of parameters of the neural network;
wherein establishing a plurality of triples for the training image set based on a distance relationship between the first image embedding vector and the second image embedding vector comprises:
Calculating a distance between each two of the first image-embedded vector pairs, selecting a minimum predetermined number of first image-embedded vector pairs (vn 1, vn 2) of the distances;
selecting, for one of the predetermined number of first image-embedding vector pairs (vn 1, vn 2), a second image-embedding vector nn1 of the second set of image-embedding vectors that is closest to the first image-embedding vector vn1 and a second image-embedding vector nn2 that is closest to the first image-embedding vector vn 2;
creating a plurality of triples for the training image set based on the first image-embedding vector pair (vn 1, vn 2), a second image-embedding vector nn1 associated with the first image-embedding vector pair (vn 1, vn 2), and a second image-embedding vector nn2, and
wherein the creating a plurality of triples for the training image set based on the first image-embedding vector pair (vn 1, vn 2), a second image-embedding vector nn1 associated with the first image-embedding vector pair (vn 1, vn 2), and a second image-embedding vector nn2 comprises:
establishing a triplet (vn 2, vn1, nn 2) in response to the distance d1 of the first image embedding vector vn1 from the second image embedding vector nn1 being greater than the distance d2 of the first image embedding vector vn2 from the second image embedding vector nn2, wherein vn2 is the anchor image of the triplet, vn1 is the positive image of the triplet, nn2 is the negative image of the triplet; and
And establishing a triplet (vn 1, vn2, nn 1) in response to the distance d1 between the first image embedding vector vn1 and the second image embedding vector nn1 being smaller than the distance d2 between the first image embedding vector vn2 and the second image embedding vector nn2, wherein vn1 is used as an anchor point image of the triplet, vn2 is used as a positive example image of the triplet, and nn1 is used as a negative example image of the triplet.
9. A computing device comprising a memory and a processor, the memory configured to store thereon computer-executable instructions that, when executed on the processor, perform the method of any of claims 1-6.
10. A computer readable storage medium having stored thereon computer executable instructions which when executed on a processor perform the method of any of claims 1-6.
CN202011127280.4A 2020-10-20 2020-10-20 Method, device and storage medium for training neural network for image feature extraction Active CN113486914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011127280.4A CN113486914B (en) 2020-10-20 2020-10-20 Method, device and storage medium for training neural network for image feature extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011127280.4A CN113486914B (en) 2020-10-20 2020-10-20 Method, device and storage medium for training neural network for image feature extraction

Publications (2)

Publication Number Publication Date
CN113486914A CN113486914A (en) 2021-10-08
CN113486914B true CN113486914B (en) 2024-03-05

Family

ID=77932612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011127280.4A Active CN113486914B (en) 2020-10-20 2020-10-20 Method, device and storage medium for training neural network for image feature extraction

Country Status (1)

Country Link
CN (1) CN113486914B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975959A (en) * 2016-06-14 2016-09-28 广州视源电子科技股份有限公司 Face characteristic extraction modeling method based on neural network, face identification method, face characteristic extraction modeling device and face identification device
CN111368989A (en) * 2018-12-25 2020-07-03 同方威视技术股份有限公司 Neural network model training method, device, equipment and readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9836641B2 (en) * 2014-12-17 2017-12-05 Google Inc. Generating numeric embeddings of images
US10515295B2 (en) * 2017-10-27 2019-12-24 Adobe Inc. Font recognition using triplet loss neural network training
CN112313043B (en) * 2018-06-15 2024-04-02 谷歌有限责任公司 Self-supervising robot object interactions
US10878297B2 (en) * 2018-08-29 2020-12-29 International Business Machines Corporation System and method for a visual recognition and/or detection of a potentially unbounded set of categories with limited examples per category and restricted query scope

Also Published As

Publication number Publication date
CN113486914A (en) 2021-10-08

Similar Documents

Publication Publication Date Title
Anantrasirichai et al. Artificial intelligence in the creative industries: a review
CN112232425B (en) Image processing method, device, storage medium and electronic equipment
Guan et al. Keypoint-based keyframe selection
Farinella et al. Representing scenes for real-time context classification on mobile devices
CN111062871A (en) Image processing method and device, computer equipment and readable storage medium
WO2022121485A1 (en) Image multi-tag classification method and apparatus, computer device, and storage medium
US9626585B2 (en) Composition modeling for photo retrieval through geometric image segmentation
CN111325271B (en) Image classification method and device
CN110688524B (en) Video retrieval method and device, electronic equipment and storage medium
CN111209897B (en) Video processing method, device and storage medium
CN110381392B (en) Video abstract extraction method, system, device and storage medium thereof
US20220172476A1 (en) Video similarity detection method, apparatus, and device
US11790950B2 (en) Film-making using style transfer
US11804043B2 (en) Detecting objects in a video using attention models
Singh et al. Depth based enlarged temporal dimension of 3D deep convolutional network for activity recognition
CN104778272B (en) A kind of picture position method of estimation excavated based on region with space encoding
CN112819509B (en) Method, system, electronic device and storage medium for automatically screening advertisement pictures
US20170180652A1 (en) Enhanced imaging
CN113392689A (en) Video character tracking method, video processing method, device, equipment and medium
CN113486914B (en) Method, device and storage medium for training neural network for image feature extraction
Gong et al. Discriminative correlation filter for long-time tracking
CN111818364B (en) Video fusion method, system, device and medium
Meng et al. A multi-level weighted representation for person re-identification
CN114329050A (en) Visual media data deduplication processing method, device, equipment and storage medium
Lin et al. Video retrieval for shot cluster and classification based on key feature set

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40053593

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant