CN110866140B - Image feature extraction model training method, image searching method and computer equipment


Info

Publication number
CN110866140B
Authority
CN
China
Prior art keywords
neural network
image
expression
sample
feature vector
Prior art date
Legal status
Active
Application number
CN201911172129.XA
Other languages
Chinese (zh)
Other versions
CN110866140A (en)
Inventor
陈震鸿
颜强
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911172129.XA priority Critical patent/CN110866140B/en
Publication of CN110866140A publication Critical patent/CN110866140A/en
Application granted granted Critical
Publication of CN110866140B publication Critical patent/CN110866140B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Abstract

The application relates to an image feature extraction model training method, an image searching method and computer equipment. The method comprises the following steps: acquiring a plurality of picture groups for training, wherein each picture group at least comprises a reference sample and a similar sample of the reference sample; inputting each sample of the picture group into the corresponding sub-neural network in a neural network model, extracting the semantic feature vector of each sample through the deep neural network of each sub-neural network, and extracting the visual feature vector of each sample through the shallow neural network of each sub-neural network; outputting, by each sub-neural network of the neural network model, the image feature vector of the corresponding sample according to the semantic feature vector and the visual feature vector; and training the neural network model with the goal of minimizing the distance between the image feature vectors of the reference sample and the similar sample, to obtain an image feature extraction model. The method considers both semantic and visual similarity, and can improve accuracy when applied to image searching.

Description

Image feature extraction model training method, image searching method and computer equipment
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to an image feature extraction model training method, an image searching method, and a computer device.
Background
With the rapid development of internet technology, image searching is applied in more and more scenarios. For example, in commodity searching on a shopping platform, similar commodities can be found from a picture, which greatly improves the efficiency of finding a target commodity.
With the rapid development of artificial intelligence technology, neural networks have been applied to image searching to improve its efficiency, rapidly extracting image features for similarity calculation. However, in the traditional method of searching images with a neural network model, the image features extracted by the model are of a single type, so the image searching accuracy is low.
Disclosure of Invention
Based on this, to address the problem of low image search accuracy, it is necessary to provide an image search method, apparatus, computer device and storage medium, as well as an image feature extraction model training method, apparatus, computer device and storage medium.
An image feature extraction model training method, comprising:
acquiring a plurality of picture groups for training, wherein the picture groups at least comprise reference samples and similar samples of the reference samples;
Inputting each sample of the picture group into a corresponding sub-neural network in a neural network model, extracting semantic feature vectors of each sample through a deep neural network of each sub-neural network, and extracting visual feature vectors of each sample through a shallow neural network of each sub-neural network; outputting image feature vectors of corresponding samples by each sub-neural network of the neural network model according to the semantic feature vectors and the visual feature vectors;
and training the neural network model with the goal of minimizing the distance between the image feature vectors of the reference sample and the similar sample, to obtain an image feature extraction model.
An image search method, the method comprising:
acquiring an image to be searched;
inputting the image to be searched into a pre-trained image feature extraction model, obtaining a semantic feature vector of the image to be searched through a deep neural network of the image feature extraction model, obtaining a visual feature vector of the image to be searched through a shallow neural network of the image feature extraction model, and outputting the image feature vector of the image to be searched according to the semantic feature vector and the visual feature vector;
Determining the distance between the image feature vector of the image to be searched and the feature vector of each image in a database;
and determining images similar to the images to be searched according to the distance to obtain image searching results.
An image feature extraction model training apparatus comprising:
the system comprises a picture group acquisition module, a picture group acquisition module and a picture analysis module, wherein the picture group acquisition module is used for acquiring a plurality of picture groups for training, and the picture groups at least comprise reference samples and similar samples of the reference samples;
the feature extraction module is used for inputting each sample of the picture group into a corresponding sub-neural network in the neural network model, extracting semantic feature vectors of each sample through a deep neural network of each sub-neural network, and extracting visual feature vectors of each sample through a shallow neural network of each sub-neural network; the feature fusion module is used for outputting image feature vectors of corresponding samples by each sub-neural network of the neural network model according to the semantic feature vectors and the visual feature vectors;
and the training module is used for training the neural network model with the goal of minimizing the distance between the image feature vectors of the reference sample and the similar sample, to obtain an image feature extraction model.
An image search apparatus, the apparatus comprising:
the image acquisition module is used for acquiring an image to be searched;
the feature vector extraction module is used for inputting the image to be searched into a pre-trained image feature extraction model, obtaining a semantic feature vector of the image to be searched through a deep neural network of the image feature extraction model, obtaining a visual feature vector of the image to be searched through a shallow neural network of the image feature extraction model, and outputting the image feature vector of the image to be searched according to the semantic feature vector and the visual feature vector;
the distance determining module is used for determining the distance between the image feature vector of the image to be searched and the feature vector of each image in the database;
and the retrieval module is used for determining images similar to the image to be searched according to the distance, to obtain an image search result.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method of the above embodiments.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method of the embodiments described above.
According to the image feature extraction model training method, two similar images in each picture group are used for training. During training, the two similar images are respectively input into the corresponding sub-neural networks in the neural network model; the deep neural network of each sub-neural network extracts the semantic feature vector of the corresponding sample, the shallow neural network of each sub-neural network extracts the visual feature vector of the corresponding sample, and each sub-neural network outputs the image feature vector of the corresponding sample according to the semantic feature vector and the visual feature vector. The model is trained with the goal of minimizing the distance between the image feature vectors of the reference sample and the similar sample, yielding the image feature extraction model. In the model training process, each sub-neural network performs feature extraction on one sample of the picture group. The deep neural network, with its greater number of layers, can learn the semantic similarity of images and extracts the semantic feature vector; the shallow neural network is introduced to extract the visual feature vector. The image feature vector integrates the semantic feature vector and the visual feature vector, so both the semantic features and the visual features of the images are comprehensively considered when training the image feature extraction model. Training with the goal of minimizing the distance between the image feature vectors of two similar images makes the image feature extraction model consider the similarity of the two images in terms of both semantics and vision, so that when the model is applied to image search, search precision can be improved.
According to the image searching method, when an image is searched, the image to be searched is input into the pre-trained image feature extraction model; the semantic feature vector of the image to be searched is obtained through the deep neural network of the model, the visual feature vector is obtained through the shallow neural network of the model, and the image feature vector of the image to be searched is obtained from the semantic feature vector and the visual feature vector. Similar images are then determined by the distance between this image feature vector and the feature vector of each image in the database, yielding the search result. Because the search result is determined from the image feature vector, which integrates the semantic feature vector and the visual feature vector, both the semantic features and the visual features of the picture to be searched are comprehensively considered during the search, so the obtained results are similar to the picture to be searched both semantically and visually, improving image search precision.
Drawings
FIG. 1 is an application scenario diagram of an image search method in one embodiment;
FIG. 2 is a flow chart of a training method of an image feature extraction model in one embodiment;
FIG. 3 is a schematic diagram of an image feature extraction model in one embodiment;
FIG. 4 is a schematic diagram of a process for constructing a picture triplet in one embodiment;
FIG. 5 is a schematic diagram of an image feature extraction model according to another embodiment;
FIG. 6 is a schematic diagram of a feature vector extraction module according to an embodiment;
FIG. 7 is a schematic diagram of a conventional network layer of a deep neural network in one embodiment;
FIG. 8 is a schematic diagram of an improved network layer of a deep neural network in one embodiment;
FIG. 9 is a flow diagram of a method of image search in one embodiment;
FIG. 10 is a flowchart of an image search method in another embodiment;
FIG. 11 is a schematic diagram of an operation process for displaying an image search page in one embodiment;
FIG. 12 is a schematic diagram of an image search operation process in one embodiment;
FIG. 13 is a schematic diagram of an image search process in one embodiment;
FIG. 14 is a block diagram of an image feature extraction model training apparatus in one embodiment;
FIG. 15 is a block diagram showing the structure of an image search apparatus in one embodiment;
FIG. 16 is a block diagram showing the structure of an image search apparatus in another embodiment;
FIG. 17 is a block diagram of a computer device in one embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines gain the ability to sense, reason and make decisions.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The scheme provided by the embodiment of the application relates to the technology of artificial intelligence such as machine learning, and is specifically described by the following embodiments:
FIG. 1 is a diagram of an application environment for an image search method in one embodiment. Referring to fig. 1, the image search method is applied to an image search system. The image search system includes a search terminal 110 and a server 130, the search terminal 110 and the server 130 being connected through a network. The server 130 is used for training the image feature extraction model, receiving the image to be searched sent by the search terminal 110 and searching the image, and the search terminal 110 is used for acquiring the image to be searched input by the user and displaying the image search result.
As shown in FIG. 2, in one embodiment, an image feature extraction model training method is provided. The present embodiment is mainly exemplified by the method applied to the server 130 in fig. 1. Referring to fig. 2, the image feature extraction model training method specifically includes the following steps:
s202, a plurality of picture groups for training are obtained, wherein the picture groups at least comprise reference samples and similar samples of the reference samples.
In particular, the plurality of picture groups constitute the training samples for training the image feature extraction model. Each picture group includes at least a reference sample and a similar sample of the reference sample. During training, the reference sample plays the role of the image to be searched, and the similar sample plays the role of a similar image of the image to be searched; that is, a picture group includes at least one image to be searched and a similar image of it. The image feature extraction model training method trains the model using a large number of picture groups in which the similarity relation between the two samples is known.
S204, inputting each sample of the picture group into a corresponding sub-neural network in the neural network model, extracting semantic feature vectors of each sample through a deep neural network of each sub-neural network, and extracting visual feature vectors of each sample through a shallow neural network of each sub-neural network.
Specifically, the neural network model includes a plurality of sub-neural networks, one for each sample in the picture group, which calculate the image feature vectors of the samples. The structure of the neural network model of one embodiment is shown in fig. 3: it includes a first sub-neural network 301 for image feature vector extraction of the reference sample, and a second sub-neural network 302 for image feature vector extraction of the similar sample.
Specifically, the neural network model adopts a twin (siamese) neural network: all sub-neural networks have the same network structure, share parameters, and perform the same processing. Each sub-neural network includes a deep neural network and a shallow neural network. The main role of the deep network is to learn semantic similarity. Semantic similarity refers to the similarity of the semantic features of images; the semantic features of an image characterize its semantic information. An image is made up of many pixels, and its semantic information is the meaning expressed by those pixels as a whole. When semantic similarity is applied to image searching, the returned image search results should depict, as far as possible, the same subject as the image to be searched. For example, when the user searches with the expression of a cat, expressions of cats should be returned.
A deep artificial neural network, with its multi-layer structure, has strong feature invariance and can learn semantic similarity well. In one embodiment, the deep neural network may employ a CNN (Convolutional Neural Network). A CNN is a feed-forward neural network that incorporates convolution calculations. Convolutional neural networks have strong feature representation capability, relatively few weight parameters thanks to weight sharing, and strong invariance to deformations such as translation, scaling and rotation, making them one of the mainstream models in computer vision today.
In another embodiment, the deep neural network may also employ a ResNet (Residual Neural Network), a deep artificial neural network based on a residual learning structure. The residual learning structure propagates across multiple neural network layers through shortcut connections, which effectively alleviates the reduced convergence efficiency caused by increasing the number of layers, so the fitting capacity of the deep neural network can be better exploited.
In other embodiments, the deep neural network may also employ DenseNet, VGGNet, GoogLeNet, and the like.
A deep neural network, having many layers, learns very good image features; it can learn semantic similarity well, capture the characteristics of an image, and find images of the same subject. However, because of the strong feature invariance of a deep network, visual differences are easily ignored, which hinders the learning of visual similarity: for a pair of images that are semantically similar but visually different, the computed similarity is not clearly distinguished, so the calculation accuracy of image similarity is not high enough. Therefore, on the basis of the deep network, a shallow neural network is combined to learn visual similarity, yielding the deep-shallow similarity network model proposed here.
The main function of the shallow network is to learn visual similarity. Visual similarity refers to the similarity of the visual features of images, such as edges, texture and chromaticity. When visual similarity is applied to image searching, the returned image search results should be visually as similar as possible to the image to be searched. For example, when the user searches with the expression of an orange cat, expressions of orange cats should be returned as far as possible, rather than expressions of cats of other colors. The shallow neural network comprises a small number of network layers, which may be fully connected layers, convolution layers and pooling layers.
S206, outputting the image feature vector of the corresponding sample by each sub-neural network of the neural network model according to the semantic feature vector and the visual feature vector.
Each sample is input into the corresponding sub-neural network of the neural network model; the deep neural network extracts the semantic feature vector of the sample, the shallow neural network extracts the visual feature vector of the sample, and the two feature vectors are concatenated as the output of the sub-neural network. That is, the image feature vector integrates the semantic feature vector and the visual feature vector, capturing both the semantic features and the visual features of the image.
Specifically, using the training sample set, the reference sample of the picture group is input into the first sub-neural network 301 in the neural network model; the semantic feature vector of the reference sample is extracted through the deep neural network of the first sub-neural network 301, the visual feature vector of the reference sample is extracted through the shallow neural network of the first sub-neural network 301, the two vectors are concatenated, and the first sub-neural network 301 outputs the image feature vector of the reference sample.
The similar sample of the picture group is input into the second sub-neural network 302 in the neural network model; the semantic feature vector of the similar sample is extracted through the deep neural network of the second sub-neural network 302, and the visual feature vector of the similar sample is extracted through the shallow neural network of the second sub-neural network 302. The two vectors are concatenated, and the second sub-neural network 302 outputs the image feature vector of the similar sample.
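As an illustration of this structure, the following is a minimal PyTorch sketch of one sub-neural network; the class name, the layer sizes, and the choice of a ResNet-50 backbone are assumptions for illustration, since the embodiment does not fix the exact layers:

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class SubNetwork(nn.Module):
        """One sub-neural network: a deep branch extracts the semantic
        feature vector, a shallow branch extracts the visual feature
        vector, and the two are concatenated into the image feature vector."""
        def __init__(self, visual_dim=128):
            super().__init__()
            resnet = models.resnet50(weights=None)
            # Deep branch: ResNet with its classification head removed.
            self.deep = nn.Sequential(*list(resnet.children())[:-1])
            # Shallow branch: a convolution layer plus a max pooling layer.
            self.shallow = nn.Sequential(
                nn.Conv2d(3, visual_dim, kernel_size=7, stride=4),
                nn.ReLU(),
                nn.AdaptiveMaxPool2d(1),
            )

        def forward(self, x):
            semantic = self.deep(x).flatten(1)    # semantic feature vector
            visual = self.shallow(x).flatten(1)   # visual feature vector
            return torch.cat([semantic, visual], dim=1)

Because the sub-neural networks share parameters, the same SubNetwork instance can be applied to the reference sample and the similar sample in turn.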
And S208, training a neural network model with the aim of minimizing the distance between the image feature vectors of the reference sample and the similar sample to obtain an image feature extraction model.
Specifically, the neural network model is trained using the reference sample and the similar sample of each picture group in the training sample set, with model training supervised by a loss function. The goal is to minimize the distance between the image feature vectors of the reference sample and the similar sample, iterating continuously until the loss converges or a preset number of iterations is reached, to obtain the image feature extraction model.
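The training objective described above can be sketched as follows, reusing the hypothetical SubNetwork from the earlier sketch and assuming a data loader that yields (reference, similar) pairs; this illustrates the pairwise objective only, since on its own a pure distance-minimization loss could collapse the embedding, which the triplet loss introduced below avoids:

    import torch
    import torch.nn.functional as F

    model = SubNetwork()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for q_batch, s_batch in pair_loader:     # hypothetical (Q, S) loader
        f_q = model(q_batch)                 # shared parameters: the same
        f_s = model(s_batch)                 # model embeds both samples
        loss = F.pairwise_distance(f_q, f_s).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                     # minimize the feature distance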
According to the image feature extraction model training method, two similar images in each picture group are used for training. During training, the two similar images are respectively input into the corresponding sub-neural networks in the neural network model; the deep neural network of each sub-neural network extracts the semantic feature vector of the corresponding sample, the shallow neural network of each sub-neural network extracts the visual feature vector of the corresponding sample, and each sub-neural network outputs the image feature vector of the corresponding sample according to the semantic feature vector and the visual feature vector. The model is trained with the goal of minimizing the distance between the image feature vectors of the reference sample and the similar sample, yielding the image feature extraction model. In the model training process, each sub-neural network performs feature extraction on one sample of the picture group. The deep neural network, with its greater number of layers, can learn the semantic similarity of images and extracts the semantic feature vector; the shallow neural network is introduced to extract the visual feature vector. The image feature vector integrates the two, so both the semantic features and the visual features of the images are comprehensively considered when training the image feature extraction model. Training with the goal of minimizing the distance between the image feature vectors of two similar images makes the image feature extraction model consider the similarity of the two images in terms of both semantics and vision, so that when the model is applied to image search, search precision can be improved.
In one embodiment, the picture group is a picture triplet, comprising a reference sample, a similar sample of the reference sample, and a negative sample of the reference sample, where the negative sample is a picture dissimilar to the reference sample. By training the image feature extraction model with picture triplets, the model learns the similarity of two images (the reference sample and the similar sample) while also learning to distinguish the difference between the reference sample and a third image (the negative sample), further improving the calculation accuracy of similarity.
Specifically, acquiring a plurality of picture groups for training includes: obtaining a plurality of reference samples for training, a similar sample of each reference sample, and a negative sample of each reference sample, to obtain the picture groups.
Obtaining the reference samples, similar samples and negative samples comprises the following steps: acquiring an image set; for each reference sample, determining the similarity between the reference sample and each image in the image set using a similarity algorithm; the image with the highest similarity is used as the similar sample of the reference sample, and any image whose similarity is lower than a threshold value is used as the negative sample of the reference sample.
Specifically, the similarity of the pictures can be calculated using a hash similarity algorithm, a similarity algorithm based on local features, a similarity algorithm based on a depth classification model, and the like.
Specifically, the hash similarity algorithm is to map pictures with different sizes into codes with fixed dimensions according to a certain method, and the more similar pictures are, the higher the coding similarity corresponding to the pictures is. Two pictures can be considered to be relatively similar as long as the similarity of the hash codes of the two pictures is above a certain threshold. Common hash similarity algorithms are aHash (average hash), dHash (difference hash), pHash (perceptual hash) and the like. The calculation speed of this type of algorithm is very fast.
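As an illustration of this class of algorithm, below is a minimal average-hash (aHash) sketch in Python; the 8x8 size is the conventional choice, and the similarity threshold is left to the caller:

    from PIL import Image

    def ahash(path, size=8):
        """Average hash: downscale, convert to grayscale, then set each
        bit to 1 if the pixel is above the mean brightness."""
        img = Image.open(path).convert("L").resize((size, size))
        pixels = list(img.getdata())
        mean = sum(pixels) / len(pixels)
        return [1 if p > mean else 0 for p in pixels]

    def hash_similarity(h1, h2):
        """Fraction of matching bits; 1.0 means identical hashes."""
        return sum(a == b for a, b in zip(h1, h2)) / len(h1)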
The similarity algorithm based on local features detects local key points of a picture, describes the picture's feature representation using these key points, and calculates the distance between the feature representations of two pictures to obtain their similarity. Common local feature operators are SIFT (scale-invariant feature transform), HOG (histogram of oriented gradients), LBP (local binary patterns), Haar features, and the like. Local feature operators have advantages such as rotation invariance, scale invariance, illumination invariance and resistance to occlusion, and their similarity calculation precision is high.
The similarity algorithm based on a depth classification model first trains the depth classification model with a labeled training set; then the two pictures are respectively fed forward through the depth classification model, and the vector output by the penultimate layer is taken as the encoding of each picture; finally, the distance between the encodings of the two pictures is calculated as their similarity. Common deep classification models include CNN, ResNet and the like, and a public classification data set such as ImageNet can be used to train a 1000-class classifier. A depth classification model can learn the semantic information of pictures well, so this kind of similarity algorithm calculates the semantic similarity of two pictures well, with high similarity calculation accuracy.
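A minimal sketch of this encoding step, using a torchvision ResNet-50 pre-trained on ImageNet (the weights enum requires torchvision 0.13 or later) and taking the penultimate layer's output as the picture encoding; cosine similarity is one plausible choice of encoding distance:

    import torch
    import torch.nn.functional as F
    import torchvision.models as models

    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    encoder = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop final FC
    encoder.eval()

    @torch.no_grad()
    def encode(img):                  # img: (1, 3, 224, 224), normalized
        return encoder(img).flatten(1)

    def picture_similarity(img_a, img_b):
        return F.cosine_similarity(encode(img_a), encode(img_b)).item()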
When constructing picture triplets for training, a similarity algorithm is used: the image with the highest similarity serves as the similar sample of the reference sample, and any image whose similarity is lower than a threshold serves as the negative sample of the reference sample. Constructing picture triplets with a similarity algorithm to obtain the training sample set requires no manual labeling, which improves model training efficiency and reduces the time cost of model training.
In another embodiment, the image feature extraction model of the present application may also be used for expression feature extraction model training. An expression refers to a static or dynamic image expression, not a text-symbol expression. A dynamic expression is composed of multiple frames of static expressions that can be played continuously to form a simple animation. In the training of the image feature extraction model, the process of constructing the expression triplet training set is shown in fig. 4:
Multiple groups of similar expressions are obtained. In the expression store of a social application, there are expression packages uploaded by designers; the expressions in each expression package have very similar subjects or styles, so each expression package can be regarded as a group of similar expressions. Alternatively, a similarity algorithm may be used to calculate picture similarity, and similar pictures are then clustered to obtain multiple groups of similar expressions. Here, similar expressions are expressions of the same style, such as the expressions in the same expression package, or expressions whose main subject is mostly the same but whose local regions differ noticeably.
One expression is randomly selected from the first group of similar expressions as the reference expression, another expression is extracted from the first group as a similar expression of the reference expression, and any one expression is extracted from the second group of similar expressions as a different expression from the reference expression.
Specifically, as shown in fig. 4, two groups of similar expressions are randomly selected. One expression is randomly selected from the first group as the expression a user would search with, denoted Q (for Query); another expression is selected from the same group as a similar expression of Q, denoted S (for Similar); and one expression is randomly selected from the second group as a different expression from Q, denoted D (for Different). These expressions form an expression triplet.
Key frames are extracted from the reference expression, the similar expression and the different expression, correspondingly obtaining the reference sample, the similar sample of the reference sample, and the negative sample of the reference sample.
Because a dynamic expression is composed of multiple frames of pictures, and the model processes only one picture at a time, the expression triplet needs to be converted into a picture triplet. The preprocessing operation comprises: extracting a key frame from the dynamic expression, converting it into a uniform picture format, and then scaling the picture proportionally, cropping it, and so on, so that it meets the model's requirements on input data, thereby obtaining the reference sample corresponding to the reference expression, the similar sample corresponding to the similar expression, and the negative sample corresponding to the different expression. The key frame may be the first frame of the expression, the frame with the largest change in the expression image, or the frame carrying the largest amount of image information.
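A minimal preprocessing sketch, assuming the key frame is the first frame of a GIF expression and a hypothetical 224x224 model input size:

    from PIL import Image

    def extract_key_frame(gif_path, target=224):
        """First frame as key frame: convert to a uniform RGB format,
        scale proportionally, and pad onto a square canvas."""
        gif = Image.open(gif_path)
        gif.seek(0)                                  # first frame
        frame = gif.convert("RGB")
        frame.thumbnail((target, target))            # proportional scaling
        canvas = Image.new("RGB", (target, target))
        canvas.paste(frame, ((target - frame.width) // 2,
                             (target - frame.height) // 2))
        return canvas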
For picture triplets, model training is supervised by a triplet loss function: the neural network model is trained with the goals of minimizing the distance between the image feature vectors of the reference sample and the similar sample and maximizing the distance between the image feature vectors of the reference sample and the negative sample, to obtain the image feature extraction model.
Specifically, as shown in fig. 5, in the overall structure of the model corresponding to the triplet, the reference sample Q, the similar sample S and the negative sample D in the triplet are respectively input to the first sub-neural network 501, the second sub-neural network 502 and the third sub-neural network 503, which correspondingly generate image feature vectors of fixed dimension. The three feature vectors are input to the last layer, which calculates the triplet loss function:

    Loss = Σ_{i=1..N} max(0, Distance(Q_i, S_i) - Distance(Q_i, D_i) + margin)

wherein N represents the number of training samples; Distance represents a distance function: Distance(Q, S) is the distance between the image feature vectors of the reference sample Q and the similar sample S, and Distance(Q, D) is the distance between the image feature vectors of the reference sample Q and the negative sample D. The distance function may be the Euclidean distance, the cosine distance, or the like. margin is a hyperparameter: the larger the margin, the larger the separation between similar samples and negative samples. Compared with a model trained on binary picture groups, a model using the triplet loss function not only minimizes the distance between the two similar images (the similar sample and the reference sample) but also maximizes the distance between the two different images (the reference sample and the negative sample), so that the interval (i.e., margin) between similar and different images exceeds a certain threshold, which effectively improves the calculation precision of image similarity.
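A sketch of this loss in PyTorch, with the margin value chosen arbitrarily for illustration; f_q, f_s and f_d are the image feature vectors produced by the three parameter-sharing sub-neural networks:

    import torch.nn.functional as F

    def triplet_loss(f_q, f_s, f_d, margin=0.2):
        """max(0, Distance(Q, S) - Distance(Q, D) + margin), averaged
        over the batch; Euclidean distance is used here."""
        d_pos = F.pairwise_distance(f_q, f_s)   # Distance(Q, S)
        d_neg = F.pairwise_distance(f_q, f_d)   # Distance(Q, D)
        return F.relu(d_pos - d_neg + margin).mean()

PyTorch also provides an equivalent built-in, torch.nn.TripletMarginLoss, which could be used instead.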
The first sub-neural network 501, the second sub-neural network 502 and the third sub-neural network 503 in fig. 5 are all deep-shallow neural networks and share parameters. As shown in fig. 6, a sub-neural network is composed of two parts: a deep neural network (e.g., ResNet, 50 layers in total) 601 and a shallow neural network 602. The main role of the deep network is to learn semantic similarity, while the main role of the shallow network is to learn visual similarity. Both networks take the picture as input and respectively output a semantic feature vector representation and a visual feature vector representation. The semantic feature vector and the visual feature vector are concatenated to obtain the image feature vector, which serves as the output of the sub-neural network.
The deep neural network 601 includes a plurality of network layers, and an output of each network layer is used as an input of a next network layer, and the deep neural network outputs semantic feature vectors of the image through processing of the plurality of network layers. Specifically, extracting semantic feature vectors of each sample image through a deep neural network of each sub-neural network comprises: and inputting each sample into a deep neural network corresponding to the sub-neural network, obtaining the output of each network layer through each network layer of the deep neural network, taking the output of each network layer as the input of the next network layer, and outputting the semantic feature vector of the corresponding sample by the deep neural network.
As shown in fig. 6, the shallow neural network of each sub-neural network includes a convolution layer and a pooling layer. Each sample is input into the shallow neural network of the corresponding sub-neural network; the convolution layer convolves the sample to obtain a feature vector, and the pooling layer downsamples this feature vector to obtain the visual feature vector of the corresponding sample, which the shallow neural network outputs.
Specifically, the convolution layer is a neural network layer for carrying out convolution operation in the convolution neural network, the convolution layer comprises a plurality of convolution kernels, and each convolution kernel can better capture a characteristic through the convolution operation, so that characteristic extraction by manpower is avoided. The pooling layer is an important component in the convolutional neural network and mainly has the effects of downsampling the characteristics generated by the convolutional layer, reducing the possibility of overfitting and improving the generalization of the convolutional neural network. The maximum pooling layer is one of the common pooling layers, and can rapidly extract the most effective characteristic representation.
In learning semantic similarity, the main content in the picture should be considered with emphasis, rather than all pixels (background information, for example, is learned for visual similarity by the shallow neural network). Thus, an attention mechanism is employed in this embodiment to improve the structure of each network layer in the deep neural network. Taking a ResNet model as the deep neural network, fig. 7 shows an original ResNet block, wherein F(x) represents the output after processing by the several convolution layers in the network layer, and Identity represents an identity transformation, i.e., the input is taken directly as part of the output. The original block performs indiscriminate feature extraction over the whole image and gives no special weight to the key information in the image.
In this embodiment, each sample is input into the deep neural network of the corresponding sub-neural network; the attention layer of each network layer in the deep neural network obtains the weight of each region of the sample, yielding an attention vector, while the convolution layers in each network layer obtain an initial semantic feature vector of the sample; the initial semantic feature vector is then weighted by the attention vector to obtain the output of that network layer. The attention mechanism is modeled on human visual processing: when the human eye scans an image quickly, the visual focus concentrates on the key regions of the image to acquire the important information needed, while other, useless information is suppressed. The attention mechanism can significantly improve the efficiency and accuracy of information processing and help the deep neural network learn more effective feature representations.
Specifically, as shown in fig. 8, on the basis of the original block, an attention network layer processes the input and learns the weight of each region of the input image, yielding an attention map A(x); A(x) is then used to weight the output F(x) of the convolution layers, increasing the weight of important regions (for example, the subject object) and decreasing the weight of unimportant regions.
In this embodiment, by improving the deep neural network with an attention mechanism, the deep neural network can focus on identifying the main content during training and learn more accurate semantic similarity.
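One plausible realization of such an attention-augmented residual block is sketched below; the patent does not fix the form of the attention layer, so the 1x1-convolution-plus-sigmoid attention branch here is an assumption:

    import torch.nn as nn

    class AttentionResBlock(nn.Module):
        """Residual block whose convolutional output F(x) is reweighted
        by a learned per-region attention map A(x) before the identity
        (skip) connection is added."""
        def __init__(self, channels):
            super().__init__()
            self.f = nn.Sequential(                  # F(x)
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
            )
            self.attn = nn.Sequential(               # A(x)
                nn.Conv2d(channels, 1, kernel_size=1),
                nn.Sigmoid(),                        # weights in (0, 1)
            )
            self.relu = nn.ReLU()

        def forward(self, x):
            return self.relu(x + self.attn(x) * self.f(x))  # Identity + A(x)*F(x)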
In practical applications, deep neural networks such as ResNet have relatively many layers and parameters, and train relatively poorly if the parameters are randomly initialized. To improve the ResNet effect, the present technical scheme initializes the ResNet parameter values by pre-training. Specifically, the deep neural network is trained with classified sample images to obtain the initialization parameters of the deep neural network in the image feature extraction model.
The classified sample images are the ImageNet classification dataset, which has 1000 classes and over a million pictures in total. For example, if the deep neural network adopts ResNet, a fully connected layer of 1000 nodes is appended to the end of ResNet, ResNet is trained for multi-class classification on the ImageNet dataset, and Softmax is used as the loss function. After the classification training is finished, the initialization parameters of the deep neural network are obtained; the last layer of ResNet is removed and ResNet is connected into the image feature extraction model. Pre-training the initialization parameters of the deep neural network improves model training efficiency.
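A sketch of the pre-training and trunk extraction, assuming ResNet-50 from torchvision; the training loop itself is elided:

    import torch.nn as nn
    import torchvision.models as models

    resnet = models.resnet50()
    resnet.fc = nn.Linear(resnet.fc.in_features, 1000)  # 1000-node FC head
    # ... train resnet on ImageNet with softmax (cross-entropy) loss ...

    # After pre-training, remove the last layer and use the trunk as the
    # initialized deep neural network of the image feature extraction model:
    deep_branch = nn.Sequential(*list(resnet.children())[:-1])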
As shown in fig. 9, in one embodiment, an image search method is provided. The present embodiment is mainly exemplified by the method applied to the server 130 in fig. 1. Referring to fig. 9, the image search method specifically includes the steps of:
s902, acquiring an image to be searched.
The image to be searched may be input by a user at the search terminal, which receives it and uploads it to the server.
S904, inputting the image to be searched into a pre-trained image feature extraction model, obtaining the semantic feature vector of the image to be searched through the deep neural network of the image feature extraction model, obtaining the visual feature vector of the image to be searched through the shallow neural network of the image feature extraction model, and outputting the image feature vector of the image to be searched according to the semantic feature vector and the visual feature vector.
The image feature extraction model may be trained by the aforementioned method, which is not described herein. The image feature extraction model in this embodiment is a sub-neural network of the neural network model previously trained.
The acquired image to be searched is input into the pre-trained image feature extraction model; the deep neural network outputs the semantic feature vector of the image to be searched, the shallow neural network outputs its visual feature vector, and the semantic feature vector and the visual feature vector are concatenated to obtain the image feature vector of the image to be searched.
S906, determining the distance between the image feature vector of the image to be searched and the feature vector of each image in the database.
The database refers to a target gallery of image searches, storing a large number of images. The image searching is to search the database to obtain the target image similar to the image to be searched.
Specifically, the image feature extraction model outputs a fixed-dimension image feature vector, and the user inputs an image at search time. In order to quickly retrieve similar images in the search system, the images in the database need to be converted into vectors in advance.
Specifically, the images in the database are input into the trained feature vector extraction model, namely the sub-neural network shown in fig. 6, and the output vectors are the image feature vectors of the images. The image feature vectors are input into the search system to construct an index, in preparation for subsequent searching.
An expression is a dynamic image whose frames change dynamically. In the process of converting expressions into vector representations, all expressions in the expression library are preprocessed and a key frame of each expression is extracted; the key frame may be the first frame. The first frame of each expression is input in turn into the trained feature vector extraction module, and the output vector is the numerical representation of that expression. All expression vectors are input into the search system to construct an index, in preparation for subsequent searches.
And during searching, calculating the distance between the image feature vector of the image to be searched and the feature vector of each image in the database, and determining the similarity between the image to be searched and each image in the database according to the distance.
S908, determining images similar to the images to be searched according to the distance, and obtaining image searching results.
Specifically, the more similar two pictures are, the closer their output vectors. Thus, the closer the vector distance, the more similar an image is to the image to be searched. Generally, the N images closest to the image to be searched are taken as the image search results.
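A brute-force sketch of this nearest-neighbor step, assuming the database feature vectors are stacked into a NumPy matrix and cosine similarity is the distance measure; a production search system would typically use an approximate nearest-neighbor index instead of a full scan:

    import numpy as np

    def build_index(features):
        """features: (num_images, dim) matrix of database feature vectors,
        L2-normalized so a dot product equals cosine similarity."""
        return features / np.linalg.norm(features, axis=1, keepdims=True)

    def search(index, query_vec, top_n=10):
        q = query_vec / np.linalg.norm(query_vec)
        sims = index @ q                      # similarity to every image
        order = np.argsort(-sims)[:top_n]     # N most similar images
        return order, sims[order]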
According to the image searching method, when an image is searched, the image to be searched is input into the pre-trained image feature extraction model; the semantic feature vector of the image to be searched is obtained through the deep neural network of the model, the visual feature vector is obtained through the shallow neural network of the model, and the image feature vector of the image to be searched is obtained from the semantic feature vector and the visual feature vector. Similar images are then determined by the distance between this image feature vector and the feature vector of each image in the database, yielding the search result. Because the search result is determined from the image feature vector, which integrates the semantic feature vector and the visual feature vector, both the semantic features and the visual features of the picture to be searched are comprehensively considered during the search, so the obtained results are similar to the picture to be searched both semantically and visually, improving image search precision.
In another embodiment, obtaining semantic feature vectors of an image to be searched through a deep neural network of an image feature extraction model includes: inputting the image to be searched into a deep neural network of an image feature extraction model, and obtaining the output of each network layer through each network layer of the deep neural network; and taking the output of each network layer as the input of the next network layer, and outputting semantic feature vectors of the images to be searched by the deep neural network.
In another embodiment, inputting an image to be searched into a deep neural network of an image feature extraction model, and obtaining an output of each network layer through each network layer of the deep neural network, including: inputting the image to be searched into a deep neural network of an image feature extraction model, acquiring the weight of the image to be searched in each area through an attention layer of each network layer in the deep neural network to obtain an attention vector, and acquiring an initial semantic feature vector of the image to be searched through a convolution layer of each network layer; and weighting the initial semantic feature vector by using the attention vector to obtain the output of each network layer.
Specifically, as shown in fig. 8, on the basis of the original block, an attention network layer processes the input and learns the weight of each region of the input image, yielding an attention map A(x); A(x) is then used to weight the output F(x) of the convolution layers, increasing the weight of important regions (for example, the subject object) and decreasing the weight of unimportant regions.
In the embodiment, by improving the deep neural network and adding the attention mechanism, the deep neural network can be focused on the identification of the main body content in the training process, and more accurate semantic similarity is learned.
In another embodiment, obtaining the visual feature vector of the image to be searched through the shallow neural network of the image feature extraction model includes: and carrying out convolution processing on the images to be searched through a convolution layer of the shallow neural network to obtain feature vectors, inputting the feature vectors into a pooling layer of the shallow neural network to carry out downsampling processing to obtain visual feature vectors of the images to be searched, which are output by the shallow neural network.
In one embodiment, an image search method is provided. The present embodiment is mainly exemplified by the method applied to the search terminal 110 in fig. 1 described above. Referring to fig. 10, the image search method specifically includes the steps of:
s1002, displaying an image search page based on triggering operation of the application interface search control, wherein the image search page comprises an image selection control to be searched.
The image to be searched is the original input of image searching, and the purpose of the image searching is to search out a target image similar to the image from a preset database. The image to be searched may be input by a user through a search terminal. The image searching method can be realized based on any application of the terminal, such as a search engine, a social application or a shopping platform.
Taking the application of the image search method to a social application as an example, fig. 11 is a schematic diagram of an operation process for displaying an image search page. Page 1101, the main entry for searches in the application software, provides a search control; when this control is triggered, for example clicked at the search terminal, a search service page 1102 is displayed that presents the available search services, such as novel, music, image, emoticon and article searches. When the image search control is triggered on this page, the image search service is triggered: an image search page is displayed, which includes an image selection control for the image to be searched.
And S1004, when the triggering operation of the image selection control to be searched is detected, displaying an image selection page.
FIG. 12 is a schematic diagram of an image search operation process in one embodiment. When the image search control is triggered on the search service page 1102, an image search page 1201 is displayed, which includes an image selection control 1204 for the image to be searched. When the user triggers the search image selection control 1204 at the terminal, an image panel pops up at the bottom, producing the image selection page 1202. The image panel of the image selection page 1202 displays the image resources present on the terminal, and the user selects one of the selectable images as the image to be searched.
S1006, acquiring an image to be searched according to an image selection operation for an image selection page.
As shown in fig. 12, after the user selects an image on the image selection page 1202, the image selected by the user is taken as an image to be searched.
For expression searching, according to an image selecting operation for an image selecting page, obtaining an image to be searched comprises: and according to the expression selection operation aiming at the image selection page, obtaining the expression to be searched, and extracting the key frame of the expression to be searched to obtain the image to be searched.
Specifically, an expression refers to a static or dynamic image expression, excluding text-symbol expressions. A dynamic expression is composed of multiple frames of static expressions that can be played continuously to form a simple animation. Since a dynamic expression is composed of multiple frames of pictures, and the model processes only one picture at a time, the expression must be converted into a picture. Extracting a key frame from the expression thus provides the basis for expression search.
And S1008, sending the image to be searched to a server.
Specifically, after obtaining an image to be searched, the terminal sends the image to be searched to a server, and the server searches in a database by utilizing a trained image feature extraction model to obtain an image search result.
Specifically, the server inputs an image to be searched into a pre-trained image feature extraction model, semantic feature vectors and visual feature vectors of the image to be searched are obtained through the image feature extraction model, the image feature vectors of the image to be searched are output according to the semantic feature vectors and the visual feature vectors, the distance between the image feature vectors and feature vectors of all images in a database is determined, and an image similar to the image to be searched is determined according to the distance, so that an image search result is obtained.
The technical implementation by which the server searches the database with the trained image feature extraction model to obtain the image search result has already been described in the server-side image search method above and is not repeated here.
And after the server obtains the image search result, the image search result is sent to the search terminal.
S1010, receiving and displaying the image search result returned by the server.
As mentioned above, the image search result consists of images similar to the image to be searched, determined according to the distance between the image feature vector of the image to be searched and the feature vector of each image in the database. The image feature vector is obtained by inputting the image to be searched into the pre-trained image feature extraction model, which produces the semantic feature vector and the visual feature vector of the image to be searched and outputs the image feature vector according to the two.
As shown in fig. 12, after receiving the image search result returned by the server, the terminal displays the image search result on the image search result display interface 1203.
Specifically, the terminal sorts the images in the image search results according to the similarity to obtain an image search result list, and displays the image search result list. By ordering the image search results according to the degree of similarity, the user can intuitively acquire the most similar search results.
According to the image search method, the user inputs the image to be searched through a triggering operation on the application interface, the image to be searched is sent to the server, and the server performs the search. When searching, the server inputs the image to be searched into the pre-trained image feature extraction model, obtains its semantic feature vector and visual feature vector from the model, derives the image feature vector of the image to be searched from those two vectors, and determines similar images by the distance between that image feature vector and the feature vectors of the images in the database to obtain the search result. Because the search result is determined by the image feature vector, which integrates the semantic feature vector and the visual feature vector, both the semantic features and the visual features of the picture to be searched are considered, so the search result is similar to the picture to be searched both semantically and visually, improving the image search precision.
The following describes the scheme of the present application in detail, taking expression search as an example.
As shown in fig. 13, in order to implement expression search, the implementation of the technical solution of the present application includes four stages, respectively:
The first stage: constructing the triplet training set.
In the expression store of a social application, designers upload sets of related expressions, namely expression packages, and the expressions in each package have very similar subjects or styles, so each expression package can be treated as a group of similar expressions.
Specifically, as shown in fig. 4, two groups of similar expressions are randomly extracted. One expression is randomly selected from the first group as the expression a user would search with, denoted Q (for Query); another expression is selected from the same group as a similar expression of Q, denoted S (for Similar); and one expression is randomly selected from the second group as a different expression of Q, denoted D (for Different). These three expressions form a triplet.
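A minimal sketch of this sampling step, assuming each expression package is held as a Python list of expression file paths (the names packages and sample_triplet are illustrative, not taken from this application):

```python
import random

def sample_triplet(packages):
    """Build one (Q, S, D) triplet from a list of expression packages,
    where each package is a list of expression file paths."""
    first, second = random.sample(packages, 2)  # two distinct packages
    q, s = random.sample(first, 2)              # Q and its similar expression S
    d = random.choice(second)                   # a different expression D
    return q, s, d
```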
Because a dynamic expression is composed of multiple frames of pictures while the model processes only one picture at a time, the expression triplet must be converted into a picture triplet. The preprocessing operation comprises: extracting the first frame from each dynamic expression and converting it into a uniform picture format; then performing operations such as equal-proportion scaling and frame cropping on the picture to meet the model's requirements on input data.
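A sketch of this preprocessing using Pillow; the 224x224 target size and the center crop are illustrative assumptions chosen to match common ResNet input requirements, not values stated in this application:

```python
from PIL import Image

def preprocess_expression(path, size=224):
    """Extract the first frame of a (possibly dynamic) expression,
    convert it to a uniform format, scale proportionally, and crop."""
    img = Image.open(path)
    img.seek(0)                    # first frame of a dynamic expression
    img = img.convert("RGB")       # uniform picture format
    scale = size / min(img.size)   # equal-proportion scaling
    img = img.resize((round(img.width * scale), round(img.height * scale)))
    left, top = (img.width - size) // 2, (img.height - size) // 2
    return img.crop((left, top, left + size, top + size))
```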
The second stage: training the image feature extraction model.
The neural network model structure is shown in fig. 5 and includes a sub-neural network for each sample: the reference sample Q, the similar sample S, and the negative sample D of the triplet are input to the first sub-neural network 501, the second sub-neural network 502, and the third sub-neural network 503 respectively, each of which generates an image feature vector of fixed dimension. The three feature vectors are input to the last layer, which computes the triplet loss function:

Loss = (1/N) * Σ_{i=1}^{N} max(0, Distance(Q_i, S_i) - Distance(Q_i, D_i) + margin)

wherein N represents the number of training samples; Distance(x, y) is a distance function over the image feature vectors of x and y, for which the Euclidean distance, the cosine distance, or the like can be used; and margin is a hyperparameter: the larger the margin, the larger the separation between similar samples and negative samples. Compared with a model network trained on binary picture groups, a model using the triplet loss function not only minimizes the distance between the two similar images (the similar sample and the reference sample) but also maximizes the distance between the two different images (the reference sample and the negative sample), so that the interval between similar images and different images exceeds a threshold (namely margin), which effectively improves the accuracy of image similarity calculation.
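A minimal PyTorch sketch of this loss under the definitions above; the function name, the Euclidean distance choice, and the margin value of 0.2 are illustrative (the application permits the cosine distance as well):

```python
import torch

def triplet_loss(q, s, d, margin=0.2):
    """q, s, d: (N, dim) image feature vectors of the reference,
    similar, and negative samples of N triplets."""
    pos = torch.norm(q - s, dim=1)  # Distance(Q, S), Euclidean
    neg = torch.norm(q - d, dim=1)  # Distance(Q, D)
    # hinge: push the gap between similar and different pairs past margin
    return torch.clamp(pos - neg + margin, min=0).mean()
```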
The first sub-neural network 501, the second sub-neural network 502, and the third sub-neural network 503 in fig. 5 are identical deep-shallow similarity networks that share parameters. As shown in fig. 6, each sub-neural network consists of two parts: a deep neural network, ResNet (50 layers in total) 601, and a shallow neural network 602. The main role of the deep network is to learn semantic similarity, while the main role of the shallow network is to learn visual similarity. Both networks take the picture as input and output a semantic feature vector representation and a visual feature vector representation respectively. The two feature vectors are spliced to obtain the image feature vector, which is the output of the sub-neural network.
The deep neural network comprises a plurality of network layers (BLOCK), the output of each network layer is used as the input of the next network layer, and the deep neural network outputs the semantic feature vector of the image through the processing of the plurality of network layers. Specifically, each sample is input into a deep neural network corresponding to a sub-neural network, the output of each network layer is obtained through each network layer of the deep neural network, the output of each network layer is used as the input of the next network layer, and each deep neural network outputs a semantic feature vector corresponding to the sample.
As shown in fig. 6, the shallow neural network of each sub-neural network includes a convolution layer and a pooling layer. Each sample is input into the shallow neural network of the corresponding sub-neural network; the convolution layer convolves the sample to obtain a feature vector, and the pooling layer downsamples that feature vector, so that each shallow neural network outputs the visual feature vector of the corresponding sample.
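A condensed PyTorch sketch of one such sub-network under this description; the shallow branch's 64-channel convolution and the use of torchvision's resnet50 for the 50-layer deep branch are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DeepShallowNet(nn.Module):
    """One sub-neural network: deep semantic branch + shallow visual branch."""
    def __init__(self):
        super().__init__()
        backbone = resnet50()          # 50-layer deep network
        backbone.fc = nn.Identity()    # expose the 2048-d features
        self.deep = backbone           # learns semantic similarity
        self.shallow = nn.Sequential(  # learns visual similarity
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),  # convolution layer
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # pooling layer (downsampling)
            nn.Flatten(),
        )

    def forward(self, x):
        semantic = self.deep(x)        # semantic feature vector
        visual = self.shallow(x)       # visual feature vector
        return torch.cat([semantic, visual], dim=1)  # spliced image feature vector
```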
In learning semantic similarity, the main content of the picture should be considered with emphasis rather than all pixels (background information, for example, is left to the shallow neural network that learns visual similarity). Thus, this embodiment employs an attention mechanism to improve the structure of each network layer in the deep neural network.
In this embodiment, each sample is input into the deep neural network of the corresponding sub-neural network; the attention layer of each network layer in the deep neural network acquires the weight of the sample in each region to obtain an attention vector, and the convolution layer of each network layer acquires the initial semantic feature vector of the sample; the attention vector is then used to weight the initial semantic feature vector to obtain the output of each network layer. Specifically, as shown in fig. 8, on the basis of the original block, the attention network layer processes the input image and learns its weight in each region; these weights are then applied to the output of the convolution layer, increasing the weight of important regions (for example, the subject object) and decreasing the weight of unimportant regions.
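A minimal sketch of such an attention-augmented block; the 1x1-convolution attention head with sigmoid gating is an illustrative assumption about how the region weights might be produced, since the application only specifies that an attention layer weights the convolution output:

```python
import torch.nn as nn

class AttentionBlock(nn.Module):
    """A network layer (block) whose convolution output is reweighted
    by a learned per-region attention map."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(       # original block convolution
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        self.attention = nn.Sequential(  # learns the region weights
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),                # weights in (0, 1)
        )

    def forward(self, x):
        feat = self.conv(x)              # initial semantic features
        weights = self.attention(x)      # attention map over regions
        return feat * weights            # up-weight subject, down-weight the rest
```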
In this embodiment, by improving the deep neural network with an attention mechanism, the deep neural network can focus on identifying the main body content during training and learn more accurate semantic similarity.
In practical applications, a deep neural network such as ResNet has many layers and a large number of parameters, and trains relatively poorly when the parameters are randomly initialized. To improve the effect of ResNet, this technical scheme initializes the ResNet parameter values by pre-training. Specifically, the deep neural network is trained with classified sample images to obtain the initialization parameters of the deep neural network in the image feature extraction model.
The classified sample images come from the ImageNet classification dataset, which has 1,000 classes and tens of millions of images in total. A fully connected layer of 1,000 nodes is attached to the end of ResNet, and ResNet is trained for multi-class classification on the ImageNet dataset with Softmax as the loss function. After the classification training finishes, the initialization parameters of the deep neural network are obtained; the last layer of ResNet is removed, and ResNet is connected into the image feature extraction model. Pre-training the initialization parameters of the deep neural network in this way improves model training efficiency.
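A sketch of this pre-training arrangement; the training loop itself is elided, and cross-entropy (which applies Softmax internally) stands in for the Softmax loss:

```python
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()
backbone.fc = nn.Linear(backbone.fc.in_features, 1000)  # 1000-node fully connected head
criterion = nn.CrossEntropyLoss()  # Softmax-based classification loss

# ... multi-class training on the ImageNet classification set ...

backbone.fc = nn.Identity()  # remove the last layer after pre-training
# backbone now carries the initialization parameters for the deep
# branch of the image feature extraction model
```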
The neural network model is trained to determine the parameters of each sub-neural network, which yields the image feature extraction model; the image feature extraction model is a sub-neural network of the trained neural network model.
The third stage: expression vectorization.
The sub-neural network outputs an image feature vector of fixed dimension, while the user inputs an image at search time. To quickly retrieve similar expressions in the search system, the images in the database must be converted into vectors in advance.
An expression is a dynamic image whose frames change over time. To convert expressions into vector representations, all expressions in the expression library are preprocessed and a key frame of each expression is extracted; the key frame can be the first frame. The first-frame picture of each expression is input in turn into the trained feature vector extraction module, and the output vector is the numerical representation of that expression. All expression vectors are entered into the search system to build an index, in preparation for subsequent searches.
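A sketch of this vectorization step; preprocess_expression and DeepShallowNet refer to the illustrative helpers above, and the brute-force numpy matrix is an assumed stand-in for whatever index the search system builds:

```python
import numpy as np
import torch
from torchvision.transforms.functional import to_tensor

@torch.no_grad()
def build_index(model, expression_paths):
    """Vectorize every expression in the library; the stacked matrix
    serves as the search index."""
    model.eval()
    vectors = []
    for path in expression_paths:
        frame = preprocess_expression(path)          # key (first) frame
        x = to_tensor(frame).unsqueeze(0)            # 1 x 3 x H x W batch
        vectors.append(model(x).squeeze(0).numpy())  # expression vector
    return np.stack(vectors)
```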
The fourth stage: searching for similar expressions.
After a user inputs an expression, the system extracts its first-frame picture according to the preprocessing described above; the picture is then input into the deep-shallow neural network (namely a sub-neural network) shown in fig. 6 to obtain its vector; the vector is submitted as a query to the search system, which returns the N most similar expressions. The client-side input operations are shown in fig. 11 and 12.
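A sketch of this query step against the brute-force index above; cosine similarity is used here as one of the distance choices the application permits:

```python
import numpy as np

def search_similar(query_vec, index, n=10):
    """Return indices of the n expressions most similar to the query
    vector, ranked by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = m @ q                  # cosine similarity to every expression
    return np.argsort(-scores)[:n]  # top-n most similar expressions
```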
By adopting this scheme, a labeled triplet dataset can be generated automatically without manual annotation, saving a large amount of labeling cost. The designed image feature extraction model learns image features automatically through a neural network, avoiding complicated feature engineering and saving research and development cost. Compared with the prior art, the similarity calculation model is designed from a deep neural network and a shallow neural network, considering both semantic similarity and visual similarity, so the model's calculation accuracy is higher. Further, by improving the deep neural network ResNet with an attention mechanism, ResNet can focus more on identifying the main body content during training and learn more accurate semantic similarity.
As shown in fig. 14, there is provided an image feature extraction model training apparatus including:
the group of pictures acquisition module 1401 is configured to acquire a plurality of groups of pictures for training, where the group of pictures includes at least a reference sample and a similar sample of the reference sample.
The feature extraction module 1402 is configured to input each sample of the image group into a corresponding sub-neural network in the neural network model, extract semantic feature vectors of each sample through a deep neural network of each sub-neural network, and extract visual feature vectors of each sample through a shallow neural network of each sub-neural network.
The feature fusion module 1403 is configured to output an image feature vector of a corresponding sample according to the semantic feature vector and the visual feature vector, by each sub-neural network of the neural network model.
A training module 1404 is configured to train the neural network model with a goal of minimizing a distance between the image feature vectors of the reference sample and the similar sample, and obtain an image feature extraction model.
The image feature extraction model training device trains with two similar images in each training picture group. During training, the two similar images are input into the corresponding sub-neural networks of the neural network model: the deep neural network of each sub-neural network extracts the semantic feature vector of the corresponding sample, the shallow neural network extracts its visual feature vector, and each sub-neural network outputs the image feature vector of the corresponding sample from the two. The model is then trained with the goal of minimizing the distance between the image feature vectors of the reference sample and the similar sample, yielding the image feature extraction model. In this training process, each sub-neural network extracts features from one sample of the picture group. The deep neural network, having more layers, can learn the semantic similarity of images and extract their semantic feature vectors, while the shallow neural network is introduced to extract their visual feature vectors; since the image feature vector integrates both, the semantic features and the visual features of the images are considered together when training the image feature extraction model. Training to minimize the distance between the image feature vectors of two similar images makes the image feature extraction model account for the similarity of the two images both semantically and visually, and when the model is applied to image search, the image search precision can be improved.
In another embodiment, the picture group obtaining module is configured to obtain the picture groups by obtaining a plurality of reference samples for training together with a similar sample and a negative sample of each reference sample.
The training module is used for training the neural network model under supervision of the triplet loss function, with the goals of minimizing the distance between the image feature vectors of the reference sample and the similar sample and maximizing the distance between the image feature vectors of the reference sample and the negative sample, to obtain the image feature extraction model.
In another embodiment, the feature extraction module includes a semantic feature extraction module, configured to input each sample into a deep neural network corresponding to the sub-neural network, obtain, through each network layer of the deep neural network, an output of each network layer, and use the output of each network layer as an input of a next network layer, where the deep neural network outputs a semantic feature vector corresponding to the sample.
The semantic feature extraction module is further configured to input each sample into the deep neural network of the corresponding sub-neural network, obtain an attention vector from the weight of the sample in each region through the attention layer of each network layer in the deep neural network, obtain the initial semantic feature vector of the sample through the convolution layer of each network layer, and weight the initial semantic feature vector with the attention vector to obtain the output of each network layer.
In another embodiment, the feature extraction module further includes a visual feature extraction module, configured to input each sample into a shallow neural network of a corresponding sub-neural network, perform convolution processing on the sample through a convolution layer of the shallow neural network to obtain a feature vector, and input the feature vector into a pooling layer of the shallow neural network to perform downsampling processing to obtain a visual feature vector of the corresponding sample output by the shallow neural network.
In another embodiment, the apparatus further comprises a pre-training module for training the deep neural network using the classified sample images to obtain initialization parameters of the deep neural network in the neural network model.
In another embodiment, the picture group obtaining module is configured to obtain an image set, obtain each reference sample, determine the similarity between the reference sample and each image in the image set using a similarity algorithm, take the image with the highest similarity as the similar sample of the reference sample, and take any image with a similarity below a threshold as the negative sample of the reference sample.
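A sketch of this alternative group construction; the similarity function is passed in because the application does not fix a particular similarity algorithm, and the threshold value is an illustrative assumption:

```python
import numpy as np

def build_group(reference, image_set, similarity, threshold=0.5):
    """Form a (reference, similar, negative) picture group from an image set."""
    scores = np.array([similarity(reference, img) for img in image_set])
    similar = image_set[int(np.argmax(scores))]  # highest-similarity image
    low = [img for img, s in zip(image_set, scores) if s < threshold]
    negative = low[0] if low else None           # any image below the threshold
    return reference, similar, negative
```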
In another embodiment, the picture group obtaining module is used for obtaining multiple groups of similar expressions; randomly extracting two groups of similar expressions; randomly selecting one expression from the first group as a reference expression and another expression from the first group as a similar expression of the reference expression; extracting any one expression from the second group as a different expression of the reference expression; and extracting the key frames of the reference expression, the similar expression, and the different expression, which correspond to the reference sample, the similar sample of the reference sample, and the negative sample of the reference sample respectively.
The present application also provides an image search apparatus, as shown in FIG. 15, comprising:
An image acquisition module 1501 is configured to acquire an image to be searched.
The feature extraction module 1502 is configured to input an image to be searched into a pre-trained image feature extraction model, obtain a semantic feature vector of the image to be searched through a deep neural network of the image feature extraction model, obtain a visual feature vector of the image to be searched through a shallow neural network of the image feature extraction model, and output the image feature vector of the image to be searched according to the semantic feature vector and the visual feature vector.
A distance determining module 1503, configured to determine a distance between an image feature vector of an image to be searched and a feature vector of each image in the database.
The retrieving module 1504 is configured to determine an image similar to the image to be searched according to the distance, and obtain an image search result.
When the image search device performs a search, the image to be searched is input into the pre-trained image feature extraction model; the semantic feature vector of the image to be searched is obtained through the deep neural network of the model and its visual feature vector through the shallow neural network; the image feature vector of the image to be searched is derived from the two, and similar images are determined from the distance between that image feature vector and the feature vectors of the images in the database, giving the search result. Because the search result is determined by the image feature vector, which integrates the semantic feature vector and the visual feature vector, both the semantic features and the visual features of the picture to be searched are considered, so the search result is similar to the picture to be searched both semantically and visually, improving the image search precision.
In another embodiment, the feature extraction module includes a semantic feature extraction module, configured to input the image to be searched into a deep neural network of the image feature extraction model, obtain an output of each network layer through each network layer of the deep neural network, and use the output of each network layer as an input of a next network layer, where the deep neural network outputs a semantic feature vector of the image to be searched.
In another embodiment, the semantic feature extraction module is configured to input an image to be searched into a deep neural network of the image feature extraction model, obtain a attention vector by obtaining weights of the image to be searched in each region through attention layers of network layers in the deep neural network, and obtain an initial semantic feature vector of the image to be searched through convolution layers of the network layers; and weighting the initial semantic feature vector by using the attention vector to obtain the output of each network layer.
The present application also provides another image search apparatus, as shown in fig. 16, including:
the search triggering module 1601 is configured to display an image search page based on a triggering operation for a social application interface search control, where the image search page includes an image selection control to be searched.
The selection triggering module 1602 is configured to display an image selection page when a triggering operation for an image selection control to be searched is detected.
A selecting module 1603, configured to obtain an image to be searched according to an image selecting operation for an image selecting page.
Specifically, the selecting module is configured to obtain an expression to be searched according to an expression selecting operation for an image selecting page, extract a key frame of the expression to be searched, and obtain an image to be searched.
A sending module 1604, configured to send the image to be searched to a server.
The display module 1605 is used for receiving and displaying the image search result returned by the server; the image searching result is determined according to the distance between the image feature vector of the image to be searched and the feature vector of each image in the database, and the image feature vector is obtained according to the semantic feature vector and the visual feature vector of the image to be searched, which are determined by the pre-trained image feature extraction model.
Specifically, the display module is configured to sort each image in the image search results according to the similarity, obtain an image search result list, and display the image search result list.
According to the image search apparatus, the user inputs the image to be searched through a triggering operation on the application interface, the image to be searched is sent to the server, and the server performs the search. When searching, the server inputs the image to be searched into the pre-trained image feature extraction model, obtains its semantic feature vector and visual feature vector from the model, derives the image feature vector of the image to be searched from those two vectors, and determines similar images by the distance between that image feature vector and the feature vectors of the images in the database to obtain the search result. Because the search result is determined by the image feature vector, which integrates the semantic feature vector and the visual feature vector, both the semantic features and the visual features of the picture to be searched are considered, so the search result is similar to the picture to be searched both semantically and visually, improving the image search precision.
FIG. 17 is a block diagram of a computer device in one embodiment. Referring to fig. 17, the computer device may be the terminal or the server in fig. 1. The computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium of the computer device may store an operating system and a computer program which, when executed, may cause the processor to perform the image feature extraction model training method or the image search method. The processor of the computer device provides computing and control capabilities, supporting the operation of the entire computer device. The internal memory may store a computer program which, when executed by the processor, causes the processor to perform the image feature extraction model training method or the image search method. The network interface of the computer device is used for network communication.
It will be appreciated by those skilled in the art that the structure shown in fig. 17 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, the image feature extraction model training apparatus or the image searching apparatus provided in the present application may be implemented as a computer program, which may be executed on a computer device as shown in fig. 17, and a nonvolatile storage medium of the computer device may store respective program modules constituting the image feature extraction model training apparatus or the image searching apparatus. The computer program constituted by the respective program modules is for causing the computer device to execute the steps in the image feature extraction model training method or the image searching method of the respective embodiments of the present application described in the present specification.
In one embodiment, a computer device is provided that includes a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the image feature extraction model training method or the image search method described above. The step of the image feature extraction model training method or the image search method herein may be a step in the image feature extraction model training method or the image search method of the above-described respective embodiments.
In one embodiment, a computer readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the image feature extraction model training method or the image search method described above. The steps here may be the steps in the image feature extraction model training method or the image search method of each of the above embodiments.
It should be noted that the terms "first" and "second" in the embodiments of the present application are used only for distinction and impose no limitation in terms of size, order, subordination, or the like.
It should be understood that although the steps in the embodiments of the present application are numbered, they are not necessarily performed sequentially in the order indicated by the step numbers. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in various embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which need not be performed in sequence but may be performed in turn or alternately with at least a portion of other steps or of the sub-steps or stages of other steps.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware; the program may be stored in a nonvolatile computer readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include nonvolatile and/or volatile memory. The nonvolatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of these technical features have been described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (16)

1. An image feature extraction model training method, comprising:
acquiring a plurality of expression packages;
randomly extracting two expression packages, randomly selecting one expression from a first expression package as a reference expression, extracting the other expression from the first expression package as a similar expression of the reference expression, and extracting any expression from a second expression package as a different expression of the reference expression;
Extracting the key frames of the reference expression to obtain a reference sample of a picture group, extracting the key frames of the similar expression to obtain a similar sample of the reference sample, and extracting the key frames of the different expressions to obtain a negative sample of the reference sample; the key frame is a frame with large expression image variation or a frame with maximum expression image information;
inputting a reference sample of the picture group into a deep neural network and a shallow neural network of a first sub-neural network in a neural network model, extracting a semantic feature vector of the reference sample through the deep neural network of the first sub-neural network, and extracting a visual feature vector of the reference sample through the shallow neural network of the first sub-neural network; inputting the similar samples of the picture group into a deep neural network and a shallow neural network of a second sub-neural network in a neural network model, extracting semantic feature vectors of the similar samples through the deep neural network of the second sub-neural network, and extracting visual feature vectors of the similar samples through the shallow neural network of the second sub-neural network; inputting the negative sample of the picture group into a deep neural network and a shallow neural network of a third sub-neural network in a neural network model, extracting semantic feature vectors of the negative sample through the deep neural network of the third sub-neural network, and extracting visual feature vectors of the negative sample through the shallow neural network of the third sub-neural network; the deep neural network is a pre-trained deep neural network, the deep neural network obtains the weight of the sample in each region through an attention layer to obtain an attention vector, and the initial semantic feature vector of the sample obtained through a convolution layer is weighted based on the attention vector so as to improve the weight of a main object in an image and reduce the weight of a non-main object in the image; the semantic feature vector is used for representing semantic information of the image, and the semantic information of the image is semantic meaning expressed by each pixel of the image; the shallow neural network is a convolution layer and a pooling layer; the visual feature vector refers to the edge, texture and chromaticity features of the image;
Connecting the tail end of the deep neural network model with a layer of full-connection network, and training the deep neural network in advance according to the classified classification data set to obtain initialization parameters of the deep neural network model; the number of the nodes of the full-connection network is the same as the classification category of the classification data set, after the pre-training is finished, the full-connection network connected with the tail end of the deep neural network is removed, the pre-trained deep neural network is obtained, and the pre-trained deep neural network is accessed into an image feature extraction model;
splicing the semantic feature vector of the reference sample and the visual feature vector to obtain an image feature vector of the reference sample, splicing the semantic feature vector of the similar sample and the visual feature vector to obtain an image feature vector of the similar sample, splicing the semantic feature vector of the negative sample and the visual feature vector to obtain an image feature vector of the negative sample, and outputting the image feature vector of the corresponding sample by each sub-neural network of the neural network model;
and training the neural network model based on supervision of a triplet loss function with the aim of minimizing the distance between the image feature vectors of the reference sample and the similar sample and maximizing the distance between the image feature vectors of the reference sample and the negative sample to obtain the image feature extraction model.
2. The method of claim 1, wherein the extracting the semantic feature vector of each sample through the deep neural network of each sub-neural network comprises:
inputting each sample into a deep neural network of a corresponding sub-neural network, and obtaining the output of each network layer through each network layer of the deep neural network;
the output of each network layer is used as the input of the next network layer, and the deep neural network outputs the semantic feature vector of the corresponding sample.
3. The method of claim 2, wherein inputting each sample into the deep neural network of a corresponding sub-neural network, and obtaining the output of each network layer through each network layer of the deep neural network, comprises: inputting each sample into the deep neural network of the corresponding sub-neural network, acquiring the weight of the sample in each region through the attention layer of each network layer in the deep neural network to obtain an attention vector, and acquiring the initial semantic feature vector of the sample through the convolution layer of each network layer;
and weighting the initial semantic feature vector by using the attention vector to obtain the output of each network layer.
4. The method of claim 1, wherein extracting the visual feature vector of each sample through the shallow neural network of each sub-neural network comprises: inputting each sample into the shallow neural network of the corresponding sub-neural network, performing convolution processing on the sample through the convolution layer of the shallow neural network to obtain a feature vector, and inputting the feature vector into the pooling layer of the shallow neural network for downsampling processing to obtain the visual feature vector of the corresponding sample output by the shallow neural network.
5. An expression search method, the method comprising:
acquiring an expression to be searched;
inputting the expression to be searched into a deep neural network and a shallow neural network of a pre-trained image feature extraction model, obtaining a semantic feature vector of the expression to be searched through the deep neural network of the image feature extraction model, obtaining a visual feature vector of the expression to be searched through the shallow neural network of the image feature extraction model, splicing the semantic feature vector and the visual feature vector to obtain an image feature vector of a corresponding sample, and outputting the image feature vector of the expression to be searched by the image feature extraction model;
Determining the distance between the image feature vector of the expression to be searched and the feature vector of each image in a database;
determining images similar to the expression to be searched according to the distance to obtain image searching results;
the step of training the image feature extraction model comprises the following steps:
acquiring a plurality of expression packages;
randomly extracting two expression packages, randomly selecting one expression from a first expression package as a reference expression, extracting the other expression from the first expression package as a similar expression of the reference expression, and extracting any expression from a second expression package as a different expression of the reference expression;
extracting the key frames of the reference expression to obtain a reference sample of a picture group, extracting the key frames of the similar expression to obtain a similar sample of the reference sample, and extracting the key frames of the different expressions to obtain a negative sample of the reference sample; the key frame is a frame with large expression image variation or a frame with maximum expression image information;
inputting a reference sample of the picture group into a deep neural network and a shallow neural network of a first sub-neural network in a neural network model, extracting a semantic feature vector of the reference sample through the deep neural network of the first sub-neural network, and extracting a visual feature vector of the reference sample through the shallow neural network of the first sub-neural network; inputting the similar samples of the picture group into a deep neural network and a shallow neural network of a second sub-neural network in a neural network model, extracting semantic feature vectors of the similar samples through the deep neural network of the second sub-neural network, and extracting visual feature vectors of the similar samples through the shallow neural network of the second sub-neural network; inputting the negative sample of the picture group into a deep neural network and a shallow neural network of a third sub-neural network in a neural network model, extracting semantic feature vectors of the negative sample through the deep neural network of the third sub-neural network, and extracting visual feature vectors of the negative sample through the shallow neural network of the third sub-neural network; the deep neural network is a pre-trained deep neural network, the deep neural network obtains the weight of the sample in each region through an attention layer to obtain an attention vector, and the initial semantic feature vector of the sample obtained through a convolution layer is weighted based on the attention vector so as to improve the weight of a main object in an image and reduce the weight of a non-main object in the image; the semantic feature vector is used for representing semantic information of the image, and the semantic information of the image is semantic meaning expressed by each pixel of the image; the shallow neural network is a convolution layer and a pooling layer; the visual feature vector refers to the edge, texture and chromaticity features of the image;
Connecting the tail end of the deep neural network model with a layer of full-connection network, and training the deep neural network in advance according to the classified classification data set to obtain initialization parameters of the deep neural network model; the number of the nodes of the full-connection network is the same as the classification category of the classification data set, after the pre-training is finished, the full-connection network connected with the tail end of the deep neural network is removed, the pre-trained deep neural network is obtained, and the pre-trained deep neural network is accessed into the image feature extraction model;
splicing the semantic feature vector of the reference sample and the visual feature vector to obtain an image feature vector of the reference sample, splicing the semantic feature vector of the similar sample and the visual feature vector to obtain an image feature vector of the similar sample, splicing the semantic feature vector of the negative sample and the visual feature vector to obtain an image feature vector of the negative sample, and outputting the image feature vector of the corresponding sample by each sub-neural network of the neural network model;
and training the neural network model based on supervision of a triplet loss function with the aim of minimizing the distance between the image feature vectors of the reference sample and the similar sample and maximizing the distance between the image feature vectors of the reference sample and the negative sample to obtain the image feature extraction model.
6. The method according to claim 5, wherein the obtaining the semantic feature vector of the expression to be searched through the deep neural network of the image feature extraction model comprises:
inputting the expression to be searched into a deep neural network of an image feature extraction model, and obtaining the output of each network layer through each network layer of the deep neural network;
and taking the output of each network layer as the input of the next network layer, and outputting the semantic feature vector of the expression to be searched by the deep neural network.
7. The method of claim 6, wherein inputting the expression to be searched into a deep neural network of an image feature extraction model, and obtaining an output of each network layer through each network layer of the deep neural network, comprises:
inputting the expression to be searched into a deep neural network of an image feature extraction model, acquiring the weight of the expression to be searched in each area through an attention layer of each network layer in the deep neural network to obtain an attention vector, and acquiring an initial semantic feature vector of the expression to be searched through a convolution layer of each network layer;
and weighting the initial semantic feature vector by using the attention vector to obtain the output of each network layer.
8. An image feature extraction model training apparatus comprising:
the picture group acquisition module is used for acquiring a plurality of expression packages; randomly extracting two expression packages, randomly selecting one expression from a first expression package as a reference expression, extracting the other expression from the first expression package as a similar expression of the reference expression, extracting any one expression from a second expression package as a different expression of the reference expression, extracting a key frame of the reference expression, obtaining a reference sample of a picture group, extracting a key frame of the similar expression, obtaining a similar sample of the reference sample, extracting key frames of the different expressions, and obtaining a negative sample of the reference sample; the key frame is a frame with large expression image variation or a frame with maximum expression image information;
the feature extraction module is used for inputting a reference sample of the picture group into a deep neural network and a shallow neural network of a first sub-neural network in a neural network model, extracting semantic feature vectors of the reference sample through the deep neural network of the first sub-neural network, and extracting visual feature vectors of the reference sample through the shallow neural network of the first sub-neural network; inputting the similar samples of the picture group into a deep neural network and a shallow neural network of a second sub-neural network in a neural network model, extracting semantic feature vectors of the similar samples through the deep neural network of the second sub-neural network, and extracting visual feature vectors of the similar samples through the shallow neural network of the second sub-neural network; inputting the negative sample of the picture group into a deep neural network and a shallow neural network of a third sub-neural network in a neural network model, extracting semantic feature vectors of the negative sample through the deep neural network of the third sub-neural network, and extracting visual feature vectors of the negative sample through the shallow neural network of the third sub-neural network; the deep neural network is a pre-trained deep neural network, the deep neural network obtains the weight of the sample in each region through an attention layer to obtain an attention vector, and the initial semantic feature vector of the sample obtained through a convolution layer is weighted based on the attention vector so as to improve the weight of a main object in an image and reduce the weight of a non-main object in the image; the semantic feature vector is used for representing semantic information of the image, and the semantic information of the image is semantic meaning expressed by each pixel of the image; the shallow neural network is a convolution layer and a pooling layer; the visual feature vector refers to the edge, texture and chromaticity features of the image; connecting the tail end of the deep neural network model with a layer of full-connection network, and training the deep neural network in advance according to the classified classification data set to obtain initialization parameters of the deep neural network model; the number of the nodes of the full-connection network is the same as the classification category of the classification data set, after the pre-training is finished, the full-connection network connected with the tail end of the deep neural network is removed, the pre-trained deep neural network is obtained, and the pre-trained deep neural network is accessed into an image feature extraction model;
The feature fusion module is used for splicing the semantic feature vector of the reference sample and the visual feature vector to obtain an image feature vector of the reference sample, splicing the semantic feature vector of the similar sample and the visual feature vector to obtain an image feature vector of the similar sample, splicing the semantic feature vector of the negative sample and the visual feature vector to obtain an image feature vector of the negative sample, and outputting the image feature vector of the corresponding sample by each sub-neural network of the neural network model;
and the training module is used for training the neural network model to obtain the image feature extraction model by taking the distance between the image feature vectors of the reference sample and the similar sample as a target and maximizing the distance between the image feature vectors of the reference sample and the negative sample based on supervision of the triplet loss function.
9. The apparatus of claim 8, wherein the feature extraction module comprises a semantic feature extraction module for inputting each sample into a deep neural network of a corresponding sub-neural network, and obtaining an output of each network layer through each network layer of the deep neural network; the output of each network layer is used as the input of the next network layer, and the deep neural network outputs the semantic feature vector of the corresponding sample.
10. The apparatus of claim 9, wherein the semantic feature extraction module is configured to input each sample into a deep neural network corresponding to a sub-neural network, obtain a attention vector by obtaining a weight of the sample in each region through an attention layer of each network layer in the deep neural network, and obtain an initial semantic feature vector of the sample through a convolution layer of each network layer; and weighting the initial semantic feature vector by using the attention vector to obtain the output of each network layer.
11. The apparatus of claim 8, the feature extraction module further comprising a visual feature extraction module configured to input each sample into a shallow neural network of a sub-neural network, perform convolution processing on the sample through a convolution layer of the shallow neural network to obtain a feature vector, and input the feature vector into a pooling layer of the shallow neural network to perform downsampling processing to obtain a visual feature vector of a corresponding sample output by the shallow neural network.
12. An expression search apparatus, the apparatus comprising:
the image acquisition module is used for acquiring expressions to be searched;
the feature vector extraction module is used for inputting the expression to be searched into a deep neural network and a shallow neural network of a pre-trained image feature extraction model, obtaining a semantic feature vector of the expression to be searched through the deep neural network of the image feature extraction model, obtaining a visual feature vector of the expression to be searched through the shallow neural network of the image feature extraction model, splicing the semantic feature vector and the visual feature vector to obtain an image feature vector of a corresponding sample, and outputting the image feature vector of the expression to be searched by the image feature extraction model;
The distance determining module is used for determining the distance between the image feature vector of the expression to be searched and the feature vector of each image in the database;
the retrieval module is used for determining images similar to the expression to be searched according to the distance, to obtain image search results;
the picture group acquisition module is used for acquiring a plurality of expression packages; randomly extracting two expression packages, randomly selecting one expression from a first expression package as a reference expression, extracting the other expression from the first expression package as a similar expression of the reference expression, extracting any one expression from a second expression package as a different expression of the reference expression, extracting a key frame of the reference expression, obtaining a reference sample of a picture group, extracting a key frame of the similar expression, obtaining a similar sample of the reference sample, extracting key frames of the different expressions, and obtaining a negative sample of the reference sample; the key frame is a frame with large expression image variation or a frame with maximum expression image information;
the feature extraction module is used for inputting a reference sample of the picture group into a deep neural network and a shallow neural network of a first sub-neural network in a neural network model, extracting semantic feature vectors of the reference sample through the deep neural network of the first sub-neural network, and extracting visual feature vectors of the reference sample through the shallow neural network of the first sub-neural network; inputting the similar samples of the picture group into a deep neural network and a shallow neural network of a second sub-neural network in a neural network model, extracting semantic feature vectors of the similar samples through the deep neural network of the second sub-neural network, and extracting visual feature vectors of the similar samples through the shallow neural network of the second sub-neural network; inputting the negative sample of the picture group into a deep neural network and a shallow neural network of a third sub-neural network in a neural network model, extracting semantic feature vectors of the negative sample through the deep neural network of the third sub-neural network, and extracting visual feature vectors of the negative sample through the shallow neural network of the third sub-neural network; the deep neural network is a pre-trained deep neural network, the deep neural network obtains the weight of the sample in each region through an attention layer to obtain an attention vector, and the initial semantic feature vector of the sample obtained through a convolution layer is weighted based on the attention vector so as to improve the weight of a main object in an image and reduce the weight of a non-main object in the image; the semantic feature vector is used for representing semantic information of the image, and the semantic information of the image is semantic meaning expressed by each pixel of the image; the shallow neural network is a convolution layer and a pooling layer; the visual feature vector refers to the edge, texture and chromaticity features of the image; connecting the tail end of the deep neural network model with a layer of full-connection network, and training the deep neural network in advance according to the classified classification data set to obtain initialization parameters of the deep neural network model; the number of the nodes of the full-connection network is the same as the classification category of the classification data set, after the pre-training is finished, the full-connection network connected with the tail end of the deep neural network is removed, the pre-trained deep neural network is obtained, and the pre-trained deep neural network is accessed into an image feature extraction model;
the feature fusion module is used for concatenating the semantic feature vector and the visual feature vector of the reference sample to obtain an image feature vector of the reference sample, concatenating the semantic feature vector and the visual feature vector of the similar sample to obtain an image feature vector of the similar sample, and concatenating the semantic feature vector and the visual feature vector of the negative sample to obtain an image feature vector of the negative sample, so that each sub-neural network of the neural network model outputs the image feature vector of its corresponding sample;
and the training module is used for training the neural network model, under supervision of a triplet loss function, with the goals of minimizing the distance between the image feature vectors of the reference sample and the similar sample and maximizing the distance between the image feature vectors of the reference sample and the negative sample, so as to obtain the image feature extraction model.
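The triplet sampling performed by the picture group acquisition module above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the patented implementation: the list-of-packs data layout and the extract_key_frame callback are hypothetical names introduced for the example.

```python
import random

def sample_triplet(expression_packs, extract_key_frame):
    """Build one training picture group (reference, similar, negative).

    Assumes each pack is a list of animated expressions; expressions in
    the same pack are treated as similar, expressions from different
    packs as dissimilar.
    """
    first_pack, second_pack = random.sample(expression_packs, 2)
    reference_expr, similar_expr = random.sample(first_pack, 2)
    negative_expr = random.choice(second_pack)

    # The key frame stands in for the animated expression: e.g. the frame
    # with the largest variation, or the frame carrying the most information.
    return (extract_key_frame(reference_expr),
            extract_key_frame(similar_expr),
            extract_key_frame(negative_expr))
```

Any key-frame rule satisfying the claim (largest inter-frame variation, or maximum image information) can be supplied through the callback.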
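Likewise, the dual-branch sub-neural network of the feature extraction module, together with the concatenation performed by the feature fusion module, might be organized roughly as follows in PyTorch. Every layer size here is an illustrative assumption, since the claims fix only the deep/shallow split and the concatenation, not a concrete architecture.

```python
import torch
import torch.nn as nn

class SubNetwork(nn.Module):
    """One sub-neural network: a pre-trained deep branch for semantic
    features plus a shallow branch (one convolution layer and one pooling
    layer) for low-level visual features; the image feature vector is
    the concatenation of the two."""

    def __init__(self, deep_branch: nn.Module):
        super().__init__()
        self.deep = deep_branch  # pre-trained, classification head removed
        self.shallow = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),          # 32 * 4 * 4 = 512 visual features
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        semantic = self.deep(x).flatten(1)   # semantic feature vector
        visual = self.shallow(x)             # edge/texture/chromaticity cues
        return torch.cat([semantic, visual], dim=1)  # fused image feature
```

Keeping the shallow branch close to the pixels is what preserves the edge, texture and chromaticity information that the deep branch abstracts away.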
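The pre-training with a temporary fully connected head and the triplet supervision of the training module could then be wired up as below. This sketch assumes torchvision's resnet18, with its ImageNet classification head discarded, as a stand-in for the unspecified deep network, and a single weight-shared SubNetwork playing the role of all three sub-neural networks.

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in for the pre-training step: ImageNet weights replace training a
# fully-connected-headed classifier ourselves; the head is then removed.
deep = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
deep.fc = nn.Identity()

model = SubNetwork(deep)  # shared weights act as all three sub-networks
criterion = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(reference, similar, negative):
    """One supervised step: pull the similar sample toward the reference
    and push the negative sample away under the triplet loss."""
    loss = criterion(model(reference), model(similar), model(negative))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```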
13. The apparatus of claim 12, wherein the feature extraction module includes a semantic feature extraction module for inputting the expression to be searched into a deep neural network of an image feature extraction model, and obtaining an output of each network layer through each network layer of the deep neural network; and taking the output of each network layer as the input of the next network layer, and outputting the semantic feature vector of the expression to be searched by the deep neural network.
14. The apparatus of claim 13, wherein the semantic feature extraction module is configured to input the expression to be searched into a deep neural network of an image feature extraction model, obtain a weight of the expression to be searched in each region through an attention layer of each network layer in the deep neural network to obtain an attention vector, and obtain an initial semantic feature vector of the expression to be searched through a convolution layer of each network layer; and weighting the initial semantic feature vector by using the attention vector to obtain the output of each network layer.
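One simple reading of the layer structure in claims 13 and 14: each network layer convolves its input into an initial semantic feature map, derives per-region weights through an attention layer, and passes the weighted map on as the input of the next layer. The channel counts and the sigmoid squashing below are assumptions of this sketch, not claim language.

```python
import torch
import torch.nn as nn

class AttentiveLayer(nn.Module):
    """One network layer of the deep branch: a convolution produces the
    initial semantic features, a 1x1-convolution attention layer produces
    one weight per spatial region, and the output is the weighted map."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.attention = nn.Sequential(
            nn.Conv2d(out_ch, 1, kernel_size=1),  # one weight per region
            nn.Sigmoid(),                         # squashed into (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.conv(x)             # initial semantic feature map
        weights = self.attention(features)  # attention vector over regions
        # Raise the weight of the main object, lower the background's,
        # and hand the weighted map to the next network layer.
        return features * weights
```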
15. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method of any one of claims 1 to 7.
16. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 7.
CN201911172129.XA 2019-11-26 2019-11-26 Image feature extraction model training method, image searching method and computer equipment Active CN110866140B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911172129.XA CN110866140B (en) 2019-11-26 2019-11-26 Image feature extraction model training method, image searching method and computer equipment


Publications (2)

Publication Number Publication Date
CN110866140A CN110866140A (en) 2020-03-06
CN110866140B true CN110866140B (en) 2024-02-02

Family

ID=69656262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911172129.XA Active CN110866140B (en) 2019-11-26 2019-11-26 Image feature extraction model training method, image searching method and computer equipment

Country Status (1)

Country Link
CN (1) CN110866140B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475666B (en) * 2020-03-27 2023-10-10 深圳市墨者安全科技有限公司 Dense vector-based media accurate matching method and system
CN111291739B (en) * 2020-05-09 2020-09-18 腾讯科技(深圳)有限公司 Face detection and image detection neural network training method, device and equipment
CN111651674B (en) * 2020-06-03 2023-08-25 北京妙医佳健康科技集团有限公司 Bidirectional searching method and device and electronic equipment
CN111898619A (en) * 2020-07-13 2020-11-06 上海眼控科技股份有限公司 Picture feature extraction method and device, computer equipment and readable storage medium
CN112261719B (en) * 2020-07-24 2022-02-11 大连理智科技有限公司 Area positioning method combining SLAM technology with deep learning
CN112748941B (en) * 2020-08-06 2023-12-12 腾讯科技(深圳)有限公司 Method and device for updating target application program based on feedback information
CN111950728A (en) * 2020-08-17 2020-11-17 珠海格力电器股份有限公司 Image feature extraction model construction method, image retrieval method and storage medium
CN111709406B (en) * 2020-08-18 2020-11-06 成都数联铭品科技有限公司 Text line identification method and device, readable storage medium and electronic equipment
CN112364193A * 2020-11-17 2021-02-12 同济大学 Image-retrieval-oriented deep neural network model fusing multi-layer features
CN112529150A (en) * 2020-12-01 2021-03-19 华为技术有限公司 Model structure, model training method, image enhancement method and device
CN112633267A (en) * 2020-12-11 2021-04-09 苏州浪潮智能科技有限公司 Method, system, device and medium for positioning text of picture
CN112559869A (en) * 2020-12-18 2021-03-26 上海众源网络有限公司 Comment information display method and device, electronic equipment and storage medium
CN112861926B (en) * 2021-01-18 2023-10-31 平安科技(深圳)有限公司 Coupled multi-task feature extraction method and device, electronic equipment and storage medium
CN112800262A (en) * 2021-02-08 2021-05-14 苏州长嘴鱼软件有限公司 Image self-organizing clustering visualization method and device and storage medium
CN112861975B (en) * 2021-02-10 2023-09-26 北京百度网讯科技有限公司 Classification model generation method, classification device, electronic equipment and medium
CN113127672A (en) * 2021-04-21 2021-07-16 鹏城实验室 Generation method, retrieval method, medium and terminal of quantized image retrieval model
CN113282781B (en) * 2021-05-18 2022-06-28 稿定(厦门)科技有限公司 Image retrieval method and device
CN113342912B (en) * 2021-05-24 2022-03-18 北京百度网讯科技有限公司 Geographical location area coding method, and method and device for establishing coding model
CN113656582B (en) * 2021-08-17 2022-11-18 北京百度网讯科技有限公司 Training method of neural network model, image retrieval method, device and medium
CN114466178A (en) * 2021-09-09 2022-05-10 马上消费金融股份有限公司 Method and device for measuring synchronism of voice and image
CN113591804B (en) * 2021-09-27 2022-02-22 阿里巴巴达摩院(杭州)科技有限公司 Image feature extraction method, computer-readable storage medium, and computer terminal
CN116049660A (en) * 2021-10-28 2023-05-02 腾讯科技(深圳)有限公司 Data processing method, apparatus, device, storage medium, and program product
CN114118379B (en) * 2021-12-02 2023-03-24 北京百度网讯科技有限公司 Neural network training method, image processing method, device, equipment and medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260385A (en) * 2015-09-10 2016-01-20 上海斐讯数据通信技术有限公司 Picture retrieval method
CN106126581A * 2016-06-20 2016-11-16 复旦大学 Sketch-based image retrieval method based on deep learning
CN106649487A (en) * 2016-10-09 2017-05-10 苏州大学 Image retrieval method based on interest target
CA3040165A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial attention model for image captioning
CN108664514A (en) * 2017-03-31 2018-10-16 阿里巴巴集团控股有限公司 A kind of image search method, server and storage medium
CN108805077A (en) * 2018-06-11 2018-11-13 深圳市唯特视科技有限公司 A kind of face identification system of the deep learning network based on triple loss function
CN109033107A (en) * 2017-06-09 2018-12-18 腾讯科技(深圳)有限公司 Image search method and device, computer equipment and storage medium
CN109933802A (en) * 2019-03-25 2019-06-25 腾讯科技(深圳)有限公司 Picture and text matching process, device and storage medium
CN110059217A (en) * 2019-04-29 2019-07-26 广西师范大学 A kind of image text cross-media retrieval method of two-level network
CN110135461A * 2019-04-18 2019-08-16 南开大学 Emotional image retrieval method based on hierarchical attention-aware deep metric learning
CN110163258A * 2019-04-24 2019-08-23 浙江大学 Zero-shot learning method and system based on a semantic attribute attention reassignment mechanism
CN110175615A * 2019-04-28 2019-08-27 华中科技大学 Model training method, domain-adaptive visual place recognition method, and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8645123B2 (en) * 2008-10-27 2014-02-04 Microsoft Corporation Image-based semantic distance
US11379516B2 (en) * 2018-03-29 2022-07-05 Google Llc Similar medical image search

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260385A (en) * 2015-09-10 2016-01-20 上海斐讯数据通信技术有限公司 Picture retrieval method
CN106126581A * 2016-06-20 2016-11-16 复旦大学 Sketch-based image retrieval method based on deep learning
CN106649487A (en) * 2016-10-09 2017-05-10 苏州大学 Image retrieval method based on interest target
CA3040165A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial attention model for image captioning
CN110168573A (en) * 2016-11-18 2019-08-23 易享信息技术有限公司 Spatial attention model for image labeling
CN108664514A (en) * 2017-03-31 2018-10-16 阿里巴巴集团控股有限公司 A kind of image search method, server and storage medium
CN109033107A (en) * 2017-06-09 2018-12-18 腾讯科技(深圳)有限公司 Image search method and device, computer equipment and storage medium
CN108805077A (en) * 2018-06-11 2018-11-13 深圳市唯特视科技有限公司 A kind of face identification system of the deep learning network based on triple loss function
CN109933802A (en) * 2019-03-25 2019-06-25 腾讯科技(深圳)有限公司 Picture and text matching process, device and storage medium
CN110135461A * 2019-04-18 2019-08-16 南开大学 Emotional image retrieval method based on hierarchical attention-aware deep metric learning
CN110163258A * 2019-04-24 2019-08-23 浙江大学 Zero-shot learning method and system based on a semantic attribute attention reassignment mechanism
CN110175615A * 2019-04-28 2019-08-27 华中科技大学 Model training method, domain-adaptive visual place recognition method, and device
CN110059217A (en) * 2019-04-29 2019-07-26 广西师范大学 A kind of image text cross-media retrieval method of two-level network

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Triplet Networks Feature Masking for Sketch-Based Image Retrieval;Omar Seddati;《Analysis and Recognition Conference paper》;296-303 *
An Image Retrieval Method Using Multi-Feature Fusion;肖虎程;《China Master's Theses Full-text Database, Information Science and Technology》;I138-373 *
A Scene Graph Generation Model Combining Multi-Scale Feature Maps and Circular Relational Reasoning;庄志刚;许青林;Computer Science (No. 04);142-147 *
An Image Retrieval Method Combining Semantic Features and Visual Features;杨树极;Computer Development & Applications (No. 02);23-25 *
Research on Image Retrieval Algorithm Based on Semantic Hashing;龚海华;《China Excellent Master's Theses Full-text Database, Information Science and Technology》;2019-05-15;I138-1176 *

Also Published As

Publication number Publication date
CN110866140A (en) 2020-03-06

Similar Documents

Publication Publication Date Title
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN111581510B (en) Shared content processing method, device, computer equipment and storage medium
US20220092351A1 (en) Image classification method, neural network training method, and apparatus
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
Han et al. A unified metric learning-based framework for co-saliency detection
CN108694225B (en) Image searching method, feature vector generating method and device and electronic equipment
Zhang et al. Detection of co-salient objects by looking deep and wide
CN107679250B (en) Multi-task layered image retrieval method based on deep self-coding convolutional neural network
US10459975B1 (en) Method and system for creating an automatic video summary
Mohamed et al. Content-based image retrieval using convolutional neural networks
CN107683469A (en) A kind of product classification method and device based on deep learning
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN107291825A (en) With the search method and system of money commodity in a kind of video
CN111339343A (en) Image retrieval method, device, storage medium and equipment
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN114332680A (en) Image processing method, video searching method, image processing device, video searching device, computer equipment and storage medium
CN112148831B (en) Image-text mixed retrieval method and device, storage medium and computer equipment
Bouchakwa et al. A review on visual content-based and users’ tags-based image annotation: methods and techniques
CN112528136A (en) Viewpoint label generation method and device, electronic equipment and storage medium
Xu et al. Weakly supervised facial expression recognition via transferred DAL-CNN and active incremental learning
CN114329004A (en) Digital fingerprint generation method, digital fingerprint generation device, data push method, data push device and storage medium
CN113033507A (en) Scene recognition method and device, computer equipment and storage medium
CN116662566A (en) Heterogeneous information network link prediction method based on contrast learning mechanism
CN112101154B (en) Video classification method, apparatus, computer device and storage medium
Song et al. Text Siamese network for video textual keyframe detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40021546; Country of ref document: HK)
GR01 Patent grant