CN116383437A - Cross-modal material recommendation method based on convolutional neural network - Google Patents

Cross-modal material recommendation method based on convolutional neural network

Info

Publication number
CN116383437A
CN116383437A, CN202310359270.0A, CN202310359270A
Authority
CN
China
Prior art keywords
picture
text
modal
image
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310359270.0A
Other languages
Chinese (zh)
Inventor
韩忠义
熊亚平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Duoduo Metadata Co ltd
Original Assignee
Suzhou Duoduo Metadata Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Duoduo Metadata Co ltd filed Critical Suzhou Duoduo Metadata Co ltd
Priority to CN202310359270.0A priority Critical patent/CN116383437A/en
Publication of CN116383437A publication Critical patent/CN116383437A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 Querying
    • G06F16/535 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/094 Adversarial learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of neural-network-based material recommendation, and discloses a cross-modal material recommendation method based on a convolutional neural network, which comprises the following steps: S1: mapping texts and images from their original spaces into a CCA space by a deep learning method, describing the similarity between a text and an image by calculating their Euclidean distance, and compensating for the differences between modalities with an adversarial neural network combined with cross-modal hashing; S2: performing key-frame extraction on video modal data with an image perceptual hash algorithm to obtain picture modal data; S3: extracting the picture subject from the picture modal data to obtain subject-picture modal data; S4: extracting feature vectors of the picture modal data with the CNN model and storing them in a vector database; S5: extracting the feature vector of the target text modal content with the CNN model and placing it in a target feature vector database for Euclidean distance calculation; S6: sorting the calculation results of the previous step and recommending the material with the highest similarity to the user.

Description

Cross-modal material recommendation method based on convolutional neural network
Technical Field
The invention relates to the field of neural-network-based material recommendation, and in particular to a cross-modal material recommendation method based on a convolutional neural network.
Background
In recent years, with the development of neural networks and deep learning, major breakthroughs have been made in the image field. Neural networks, particularly convolutional neural networks and their variants, can understand and extract image features in great depth, and the resulting feature vectors can serve many applications such as object detection and image retrieval. Machine learning is a multi-disciplinary field that has been widely applied and developed over recent decades; with the emergence of neural networks and deep learning it has become a research hotspot once again, academic activity around it is vigorous, and its industrial applications are unprecedented.
How to find the desired materials in a massive material database has become a major research hotspot in the field of computer vision. At present, material retrieval websites use text to retrieve materials, that is, text queries are used to retrieve image and video materials. However, image materials contain not only information such as subjects and scenes but also complex information such as subject attributes and the relationships among multiple subjects, and video content contains even richer action and interaction information. A text query alone therefore struggles to describe the retrieval intent accurately, and when the intent is described inaccurately a material retrieval website may push irrelevant image and video materials. To address this, a cross-modal material recommendation method based on a convolutional neural network is provided.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a cross-modal material recommendation method based on a convolutional neural network, which solves the above problems.
In order to achieve the above purpose, the present invention provides the following technical solutions: the method comprises the following steps:
S1: first, mapping texts and images from their original spaces into a CCA space by an unsupervised deep learning method, then describing the similarity between a text and an image by calculating their Euclidean distance, and further compensating for the differences between modalities with an adversarial neural network combined with cross-modal hashing;
S2: performing key-frame extraction on video modal data with an image perceptual hash algorithm to obtain picture modal data;
S3: extracting the picture subject from the picture modal data to generate subject-picture modal data;
S4: extracting feature vectors of the picture modal data with the CNN model trained in the first step and storing them in a vector database;
S5: extracting the feature vector of the target text modal content with the CNN model trained in the first step and placing it in a target feature vector database for Euclidean distance calculation;
S6: sorting the calculation results of the previous step and recommending the material with the highest similarity to the user.
Preferably, the mapping of texts and images from their original spaces into the CCA space in S1 specifically comprises the following steps:
S11: first, extracting the low-level features of each text and each image, obtaining matrices of different dimensions;
S12: after decentering the training data, mapping the data of different dimensions into the same subspace with the CCA algorithm, and then correlating the training text features with the image features;
S13: finally, when retrieving text with an image or retrieving an image with text, first mapping the image and text features into the same subspace as the training data, then calculating the similarity between the test data and the training data of the same modality, so that the corresponding image or text can be found through the trained image-text association.
Preferably, the extraction of key frames from the video modal data in S2 to obtain picture modal data specifically comprises the following steps:
S21: acquiring video material and extracting at least two video frames from the video modal data, the at least two video frames including a video frame Ab, where b is a positive integer no greater than the total number of extracted frames;
S22: obtaining the similarity between the video frame Ab and the remaining video frames, the remaining video frames being the extracted frames other than Ab;
S23: if the similarity is equal to or greater than a similarity threshold, determining the video frame Ab to be a repeated frame and deleting it from the extracted frames to obtain the frames to be cropped;
S24: performing region identification on the frames to be cropped to obtain the regions to be cropped, and cropping those regions to obtain the picture modal data.
Preferably, the extraction of the picture subject from the picture modal data in S3 comprises: inputting the classification-labelled pictures in the picture modal data into a convolutional neural network for supervised learning, obtaining representative features at each layer, and obtaining a prediction result and a global loss function value at the output layer, which is called forward propagation;
calculating the partial derivatives of the loss function with respect to the weight and bias parameter matrices through softmax regression for classification, and optimizing the weights and biases by gradient descent, a process called back propagation;
performing a new forward propagation with the weight and bias matrices updated by gradient descent, and repeating forward and back propagation so that the updated parameters make the loss function smaller and smaller and the prediction more accurate, thereby improving image recognition accuracy;
until the optimum of the global loss function is found, and extracting a suitable neural network model according to the training accuracy obtained by supervised learning over the whole training set.
Preferably, the construction of the CNN model in S4 and S5 specifically comprises the following steps:
S41: preparing a text-modality dataset and a picture-modality dataset containing M pairs (D, I): D1, D2, ..., DM are texts and I1, I2, ..., IM are pictures; a D and an I with the same subscript form a correctly matched pair, while a D and an I with different subscripts form a mismatched pair; D1...DN and I1...IN are placed in the training set, and DN+1...DM and IN+1...IM are placed in the test set;
S42: constructing a network that can generate picture captions with the Show and Tell technique, so that each picture in S41 can be described in words;
S43: training a Doc2Vec-based network on a corpus according to context information, calculating the K-dimensional text feature vector of each text D1, D2, ..., DM and the W-dimensional mean vector of the word vectors in each text, and calculating the W-dimensional mean vector of the word vectors in the captions generated for pictures I1, I2, ..., IM by the Show and Tell technique;
S44: inputting the collected pictures I1, I2, ..., IM into a CNN, and taking the activation values of one or all of the intermediate convolutional layers as the feature vector of the corresponding picture, the feature vector of each picture having dimension J;
S45: using a multi-layer neural network, reducing the J-dimensional picture feature vector through fully connected layers of decreasing width, then concatenating the Jn-dimensional reduced picture feature vector, the K-dimensional text feature vector, the W-dimensional mean vector of the word vectors in the text, and the W-dimensional mean vector of the word vectors in the Show and Tell caption into a mixed feature layer of dimension Jn+K+W+W, and processing the mixed feature layer through Z1, Z2, ...
Preferably, the training of the CNN model constructed in S45 specifically comprises the following steps:
a. inputting the (D, I) combinations of the training set of S41 into the neural network constructed in S45;
b. obtaining a loss function from the error between the output and the target value, and calculating the residuals of all nodes in the fully connected layers of the network constructed in S45 by neural-network residual back propagation;
c. updating the parameters of the fully connected layers and of the Image2Vec network with the residuals so as to reduce the loss, and ending training once the loss function has converged after iterative calculation.
Preferably, in S44, the activation value of the topmost layer of the middle convolution layer is used as the feature vector of the corresponding picture.
Preferably, the sorting of the calculation results of the previous step in S6 specifically comprises the following steps:
S61: matching the feature vectors of the picture modal data stored in the vector database with the feature vector of the target text modal data, and clustering the feature vectors with the k-means algorithm using Euclidean distance as the metric to obtain category groups;
S62: obtaining a specific category from the target text entered by the user; for that category, calculating the spatial vector distance from each material to the category's cluster centre, selecting the centre points of the several categories most frequently collocated with it, and applying the same spatial vector translation to obtain the corresponding points;
S63: taking the several materials whose vectors are nearest in Euclidean distance to the points obtained in S62 as candidate recommended materials, the number of recommendations being based on the collocation counts between the two categories and allocated by weighting according to those counts; let the number of recommendations be C and the collocation counts of the categories be c1, c2, c3, ...;
S64: ranking the candidate materials by distance and collocation count, i.e. by weight: let the collocation count between the two categories be c and the Euclidean distance between the calculated points be d, then weight = c/d; the recommended materials are returned in order of the best weight, so that the materials with the highest similarity can be recommended to the user.
Compared with the prior art, the invention provides a cross-modal material recommendation method based on a convolutional neural network with the following beneficial effects:
1. In the cross-modal material recommendation method based on a convolutional neural network, texts and images are mapped from their original spaces into a CCA space by an unsupervised deep learning method, the similarity between a text and an image is described by calculating their Euclidean distance, and the differences between modalities are further compensated for by an adversarial neural network combined with cross-modal hashing; key-frame extraction is performed on video modal data with an image perceptual hash algorithm to obtain picture modal data; the picture subject is extracted from the picture modal data to generate subject-picture modal data; feature vectors of the picture modal data are then extracted with the CNN model trained in the first step and stored in a vector database; the feature vector of the target text modal content is extracted with the CNN model trained in the first step and placed in a target feature vector database for Euclidean distance calculation; and the calculation results of the previous step are sorted and the material with the highest similarity is recommended to the user, so that the required materials can be recommended quickly and accurately according to the user's retrieval needs.
Drawings
Fig. 1 is a schematic flow chart of a cross-modal material recommendation method based on a convolutional neural network.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, a cross-modal material recommendation method based on a convolutional neural network includes the following steps:
S1: first, mapping texts and images from their original spaces into a CCA space by an unsupervised deep learning method, then describing the similarity between a text and an image by calculating their Euclidean distance, and further compensating for the differences between modalities with an adversarial neural network combined with cross-modal hashing;
S2: performing key-frame extraction on video modal data with an image perceptual hash algorithm to obtain picture modal data;
S3: extracting the picture subject from the picture modal data (most users are only interested in the picture subject) to generate subject-picture modal data;
S4: extracting feature vectors of the picture modal data with the CNN model trained in the first step and storing them in a vector database;
S5: extracting the feature vector of the target text modal content with the CNN model trained in the first step and placing it in a target feature vector database for Euclidean distance calculation;
Clustering is performed with the k-means algorithm using Euclidean distance as the metric, the materials are grouped by category, the collocation frequencies among the materials of each category are counted, and the recommended materials are returned according to the counted collocation frequency and the relative Euclidean distance. Based on the per-layer features extracted by the deep learning neural network, the hash coding module expresses the target image as a binary code of length 684, ensuring that similar images have similar binary codes. Based on the 684-dimensional vectors obtained by hash coding all materials, the large number of existing materials is clustered with the k-means machine learning algorithm using Euclidean distance as the metric, and the hyper-parameter k is adjusted continuously so that the number of materials in each category is relatively uniform. The collocation frequencies among the various materials are counted and the result is stored in a two-dimensional matrix. According to the obtained clustering result, a weight is calculated from the collocation counts and the relative distances, and the recommended results are returned sorted by this weight.
S6: and sequencing the calculation results in the last step, and recommending the material with the highest similarity to the user.
Cross-modal hashing (CMH) models have been proposed to significantly reduce the overhead of large-scale cross-modal data retrieval systems. In many practical applications, however, the continuous arrival of data of new classes requires the cross-modal hashing method to scale well, i.e. to be adaptively updated at minimal cost to accommodate the new classes while maintaining good performance on old data. Unfortunately, existing CMH methods do not meet this scalability requirement. For this reason, the project group proposes a novel extensible cross-modal hashing (ECMH) scheme to achieve efficient, low-cost model extension, the details of which are shown in FIG. 1. ECMH has several desirable characteristics:
1) It has good forward compatibility, so it is unnecessary to update old hash codes;
2) Through a carefully designed 'weak-constraint incremental learning' algorithm, the ECMH model can be extended to support new data classes using only the new data; compared with retraining the model on both new and old data, it can save 91 percent of the time cost;
3) It can reach high retrieval precision on new and old data at the same time.
The perceptual hash algorithm (pHash for short) is a class of hash algorithms mainly used for finding similar pictures.
In the first step, the size is reduced: the quickest way to remove high frequencies and detail while keeping only the structural brightness is to shrink the picture to 8x8, 64 pixels in total, discarding the differences caused by different sizes and aspect ratios.
In the second step, the colours are simplified by converting the shrunken picture to 64-level grayscale, so that all pixels take only 64 values in total.
In the third step, the DCT (discrete cosine transform) is computed; the DCT decomposes the picture into frequency components. Although JPEG uses an 8x8 DCT, a 32x32 DCT is used here.
In the fourth step, the DCT is reduced: although the DCT result is a 32x32 matrix, only the 8x8 block in the top-left corner needs to be kept, since it represents the lowest frequencies in the picture.
In the fifth step, the average of all 64 values is calculated.
In the sixth step, the DCT is reduced further; this is the most important step. From the 8x8 DCT block, a 64-bit hash of 0s and 1s is formed by setting each bit to '1' if the corresponding value is greater than or equal to the DCT average and to '0' if it is smaller. The result does not give the true low-frequency values; it only roughly indicates each frequency's proportion relative to the average. As long as the overall structure of the picture stays the same, the hash value stays the same, so the influence of gamma correction or colour-histogram adjustment is avoided.
In the seventh step, the hash value is calculated by arranging the 64 bits into a 64-bit integer; the combination order is unimportant as long as all pictures use the same order.
Combining the results of the previous step yields a 64-bit integer, which is the fingerprint of the picture. The order of combination does not matter as long as all pictures use the same order (for example, left to right, top to bottom, big-endian).
After the fingerprints are obtained, different pictures can be compared by counting how many of the 64 bits differ, which in theory amounts to calculating a Hamming distance. If no more than 5 bits differ, the two pictures are similar; if more than 10 bits differ, they are two different pictures.
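As an illustration of the seven steps above, the following Python sketch computes a 64-bit pHash fingerprint and the Hamming distance between two fingerprints. It is a minimal sketch rather than the patent's implementation: the helper names are illustrative, and because the text applies a 32x32 DCT in the third step, the sketch resizes the picture to 32x32 rather than the 8x8 mentioned in the first step.

import numpy as np
from PIL import Image
from scipy.fftpack import dct

def phash(img: Image.Image) -> int:
    # Steps 1-2: shrink the picture and convert it to grayscale
    small = np.asarray(img.convert("L").resize((32, 32), Image.LANCZOS), dtype=np.float64)
    # Step 3: 2-D DCT (a 1-D DCT along the rows, then along the columns)
    freq = dct(dct(small, axis=0, norm="ortho"), axis=1, norm="ortho")
    # Step 4: keep only the top-left 8x8 block, i.e. the lowest frequencies
    low = freq[:8, :8]
    # Steps 5-6: compare each of the 64 values with their average to get 64 bits
    bits = (low >= low.mean()).flatten()
    # Step 7: pack the 64 bits into a 64-bit integer fingerprint
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a: int, b: int) -> int:
    # Number of differing bits between two fingerprints
    return bin(a ^ b).count("1")

With these helpers, two pictures whose fingerprints differ in at most 5 bits would be treated as similar and those differing in more than 10 bits as different, matching the thresholds given above.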
The mapping of texts and images from their respective original spaces into the CCA space in S1 specifically comprises the following steps:
S11: first, extracting the low-level features of each text and each image, obtaining matrices of different dimensions;
S12: after decentering the training data, mapping the data of different dimensions into the same subspace with the CCA algorithm, and then correlating the training text features with the image features;
S13: finally, when retrieving text with an image or retrieving an image with text, first mapping the image and text features into the same subspace as the training data, then calculating the similarity between the test data and the training data of the same modality, so that the corresponding image or text can be found through the trained image-text association.
Cross-retrieval between the two most common media types, images and texts, is realized by computer: each image and each text is represented by a feature vector, i.e. image data is mapped into an image feature space and text data into a text feature space. There is, however, no direct link between the two feature spaces. Through training on a number of image-text sample pairs, the CCA algorithm can map both spaces into a common subspace in which the mapped features are linearly correlated and the similarity between feature vectors can be measured directly, providing the theoretical basis for image-text cross-retrieval.
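As a minimal sketch of the CCA mapping just described (the patent does not prescribe a particular library or dimensionality, so the feature sizes, the 32-component shared space and the random placeholder features below are all illustrative assumptions), scikit-learn's CCA can project paired image and text features into a common subspace where Euclidean distance is meaningful:

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(200, 512))   # placeholder low-level image features
text_feats = rng.normal(size=(200, 128))    # placeholder low-level text features

# S12: learn projections from paired samples into a shared CCA subspace
cca = CCA(n_components=32)
cca.fit(image_feats, text_feats)
img_cca, txt_cca = cca.transform(image_feats, text_feats)

# S13: cross-modal similarity as Euclidean distance in the shared subspace
def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

print(euclidean(img_cca[0], txt_cca[0]))    # matched image-text pair
print(euclidean(img_cca[0], txt_cca[1]))    # mismatched pair

In practice the placeholder matrices would be replaced by the low-level text and image features extracted in S11, with the training pairs decentered before fitting.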
The extraction of key frames from the video modal data in S2 to obtain picture modal data specifically comprises the following steps:
S21: acquiring video material and extracting at least two video frames from the video modal data, the at least two video frames including a video frame Ab, where b is a positive integer no greater than the total number of extracted frames;
S22: obtaining the similarity between the video frame Ab and the remaining video frames, the remaining video frames being the extracted frames other than Ab;
S23: if the similarity is equal to or greater than a similarity threshold, determining the video frame Ab to be a repeated frame and deleting it from the extracted frames to obtain the frames to be cropped;
S24: performing region identification on the frames to be cropped to obtain the regions to be cropped, and cropping those regions to obtain the picture modal data; a sketch of the de-duplication in S21-S23 follows this list.
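The de-duplication in S21-S23 can be sketched in Python as follows. This is not the patent's code but a minimal illustration: it reuses the phash() and hamming() helpers from the perceptual-hash sketch above, assumes a Hamming-distance threshold of 5 bits as the similarity threshold, and omits the cropping of S24.

import cv2
from PIL import Image

def extract_keyframes(video_path: str, max_distance: int = 5):
    cap = cv2.VideoCapture(video_path)
    kept_frames, kept_hashes = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Fingerprint the frame (S21-S22); cv2 frames are BGR, PIL expects RGB
        h = phash(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        # S23: drop the frame if it is too similar to an already kept frame
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept_frames.append(frame)
            kept_hashes.append(h)
    cap.release()
    return kept_frames  # frames to be cropped in S24 into picture modal data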
The extraction of the picture subject from the picture modal data in S3 comprises the following: the classification-labelled pictures in the picture modal data are input into a convolutional neural network for supervised learning; representative features are obtained at each layer, and a prediction result and a global loss function value are obtained at the output layer, which is called forward propagation. The partial derivatives of the loss function with respect to the weight and bias parameter matrices are calculated through softmax regression for classification, and the weights and biases are optimized by gradient descent, a process called back propagation. A new forward propagation is then computed with the weight and bias matrices updated by gradient descent, and forward and back propagation are repeated so that the updated parameters make the loss function smaller and smaller and the prediction more accurate, improving image recognition accuracy, until the optimum of the global loss function is found; a suitable neural network model is then extracted according to the training accuracy obtained by supervised learning over the whole training set. A sketch of this supervised training loop follows.
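The following is a minimal PyTorch sketch of the forward/back-propagation loop described above; the ResNet-18 backbone, the number of classes and the train_loader of labelled pictures are assumptions introduced for illustration and are not fixed by the patent.

import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10                                     # assumed number of subject classes
model = models.resnet18(num_classes=num_classes)     # assumed CNN backbone
criterion = nn.CrossEntropyLoss()                    # softmax classification loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_one_epoch(train_loader):
    model.train()
    for images, labels in train_loader:
        logits = model(images)                       # forward propagation
        loss = criterion(logits, labels)             # global loss function value
        optimizer.zero_grad()
        loss.backward()                              # back propagation of partial derivatives
        optimizer.step()                             # gradient-descent update of weights and biases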
The construction of the CNN model in S4 and S5 specifically comprises the following steps:
S41: preparing a text-modality dataset and a picture-modality dataset containing M pairs (D, I): D1, D2, ..., DM are texts and I1, I2, ..., IM are pictures; a D and an I with the same subscript form a correctly matched pair, while a D and an I with different subscripts form a mismatched pair; D1...DN and I1...IN are placed in the training set, and DN+1...DM and IN+1...IM are placed in the test set;
S42: constructing a network that can generate picture captions with the Show and Tell technique, so that each picture in S41 can be described in words;
S43: training a Doc2Vec-based network on a corpus according to context information, calculating the K-dimensional text feature vector of each text D1, D2, ..., DM and the W-dimensional mean vector of the word vectors in each text, and calculating the W-dimensional mean vector of the word vectors in the captions generated for pictures I1, I2, ..., IM by the Show and Tell technique (see the Doc2Vec sketch after this list);
S44: inputting the collected pictures I1, I2, ..., IM into a CNN, and taking the activation values of one or all of the intermediate convolutional layers as the feature vector of the corresponding picture, the feature vector of each picture having dimension J;
S45: using a multi-layer neural network, reducing the J-dimensional picture feature vector through fully connected layers of decreasing width, then concatenating the Jn-dimensional reduced picture feature vector, the K-dimensional text feature vector, the W-dimensional mean vector of the word vectors in the text, and the W-dimensional mean vector of the word vectors in the Show and Tell caption into a mixed feature layer of dimension Jn+K+W+W, and processing the mixed feature layer through Z1, Z2, ...
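The Doc2Vec step of S43 can be sketched with gensim as follows; the toy corpus, the dimensionality K = 100 and the hyper-parameters are illustrative assumptions, and the caption below merely stands in for a description that a Show and Tell captioner might produce.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    ["a", "dog", "runs", "on", "the", "beach"],
    ["sunset", "over", "the", "city", "skyline"],
]
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(corpus)]

# Train a Doc2Vec model on the corpus (K = 100 dimensions here)
model = Doc2Vec(tagged, vector_size=100, window=5, min_count=1, epochs=40)

# K-dimensional feature vector for a new text, e.g. a generated picture caption
caption = ["a", "dog", "plays", "on", "the", "sand"]
text_vec = model.infer_vector(caption)
print(text_vec.shape)   # (100,)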
The training of the CNN model constructed in S45 specifically comprises the following steps:
a. inputting the (D, I) combinations of the training set of S41 into the neural network constructed in S45;
b. obtaining a loss function from the error between the output and the target value, and calculating the residuals of all nodes in the fully connected layers of the network constructed in S45 by neural-network residual back propagation;
c. updating the parameters of the fully connected layers and of the Image2Vec network with the residuals so as to reduce the loss, and ending training once the loss function has converged after iterative calculation.
The activation value of the topmost layer of the middle convolution layer is taken as the feature vector of the corresponding picture in S44.
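A minimal sketch of extracting such intermediate-convolution activations as picture feature vectors follows; the pre-trained ResNet-18 and the choice of its last convolutional stage ("layer4") are illustrative assumptions, since the patent does not name a specific architecture or layer.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.eval()

feature_maps = {}
def hook(module, inputs, output):
    feature_maps["conv"] = output            # activations of the hooked convolutional stage

backbone.layer4.register_forward_hook(hook)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def picture_feature_vector(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        backbone(img)
    # Global-average-pool the activation map into a flat J-dimensional vector
    return feature_maps["conv"].mean(dim=(2, 3)).squeeze(0)

The resulting vectors can be stored in the vector database of S4 and compared by Euclidean distance as in S5.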
The sorting of the calculation results of the previous step in S6 specifically comprises the following steps:
S61: matching the feature vectors of the picture modal data stored in the vector database with the feature vector of the target text modal data, and clustering the feature vectors with the k-means algorithm using Euclidean distance as the metric to obtain category groups;
Text data differs greatly from numerical data: distances between numerical data can be computed directly with a distance formula, whereas text data cannot, so the text data whose distances are to be computed must first be numericised, i.e. the text must be represented numerically;
S62: obtaining a specific category from the target text entered by the user; for that category, calculating the spatial vector distance from each material to the category's cluster centre, selecting the centre points of the several categories most frequently collocated with it, and applying the same spatial vector translation to obtain the corresponding points;
S63: taking the several materials whose vectors are nearest in Euclidean distance to the points obtained in S62 as candidate recommended materials, the number of recommendations being based on the collocation counts between the two categories and allocated by weighting according to those counts; let the number of recommendations be C and the collocation counts of the categories be c1, c2, c3, ...;
S64: ranking the candidate materials by distance and collocation count, i.e. by weight: let the collocation count between the two categories be c and the Euclidean distance between the calculated points be d, then weight = c/d; the recommended materials are returned in order of the best weight, so that the materials with the highest similarity can be recommended to the user. A sketch of this clustering and weighted ranking follows.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A cross-modal material recommendation method based on a convolutional neural network, characterized by comprising the following steps:
S1: first, mapping texts and images from their original spaces into a CCA space by an unsupervised deep learning method, then describing the similarity between a text and an image by calculating their Euclidean distance, and further compensating for the differences between modalities with an adversarial neural network combined with cross-modal hashing;
S2: performing key-frame extraction on video modal data with an image perceptual hash algorithm to obtain picture modal data;
S3: extracting the picture subject from the picture modal data to generate subject-picture modal data;
S4: extracting feature vectors of the picture modal data with the CNN model trained in the first step and storing them in a vector database;
S5: extracting the feature vector of the target text modal content with the CNN model trained in the first step and placing it in a target feature vector database for Euclidean distance calculation;
S6: sorting the calculation results of the previous step and recommending the material with the highest similarity to the user.
2. The cross-modal material recommendation method based on a convolutional neural network according to claim 1, wherein the mapping of texts and images from their original spaces into the CCA space in S1 specifically comprises the following steps:
S11: first, extracting the low-level features of each text and each image, obtaining matrices of different dimensions;
S12: after decentering the training data, mapping the data of different dimensions into the same subspace with the CCA algorithm, and then correlating the training text features with the image features;
S13: finally, when retrieving text with an image or retrieving an image with text, first mapping the image and text features into the same subspace as the training data, then calculating the similarity between the test data and the training data of the same modality, so that the corresponding image or text can be found through the trained image-text association.
3. The cross-modal material recommendation method based on a convolutional neural network according to claim 1, wherein the extraction of key frames from the video modal data in S2 to obtain picture modal data specifically comprises the following steps:
S21: acquiring video material and extracting at least two video frames from the video modal data, the at least two video frames including a video frame Ab, where b is a positive integer no greater than the total number of extracted frames;
S22: obtaining the similarity between the video frame Ab and the remaining video frames, the remaining video frames being the extracted frames other than Ab;
S23: if the similarity is equal to or greater than a similarity threshold, determining the video frame Ab to be a repeated frame and deleting it from the extracted frames to obtain the frames to be cropped;
S24: performing region identification on the frames to be cropped to obtain the regions to be cropped, and cropping those regions to obtain the picture modal data.
4. The cross-modal material recommendation method based on a convolutional neural network according to claim 1, wherein the extraction of the picture subject from the picture modal data in S3 comprises: inputting the classification-labelled pictures in the picture modal data into a convolutional neural network for supervised learning, obtaining representative features at each layer, and obtaining a prediction result and a global loss function value at the output layer, which is called forward propagation;
calculating the partial derivatives of the loss function with respect to the weight and bias parameter matrices through softmax regression for classification, and optimizing the weights and biases by gradient descent, a process called back propagation;
performing a new forward propagation with the weight and bias matrices updated by gradient descent, and repeating forward and back propagation so that the updated parameters make the loss function smaller and smaller and the prediction more accurate, thereby improving image recognition accuracy;
until the optimum of the global loss function is found, and extracting a suitable neural network model according to the training accuracy obtained by supervised learning over the whole training set.
5. The cross-modal material recommendation method based on a convolutional neural network according to claim 1, wherein the construction of the CNN model in S4 and S5 specifically comprises the following steps:
S41: preparing a text-modality dataset and a picture-modality dataset containing M pairs (D, I): D1, D2, ..., DM are texts and I1, I2, ..., IM are pictures; a D and an I with the same subscript form a correctly matched pair, while a D and an I with different subscripts form a mismatched pair; D1...DN and I1...IN are placed in the training set, and DN+1...DM and IN+1...IM are placed in the test set;
S42: constructing a network that can generate picture captions with the Show and Tell technique, so that each picture in S41 can be described in words;
S43: training a Doc2Vec-based network on a corpus according to context information, calculating the K-dimensional text feature vector of each text D1, D2, ..., DM and the W-dimensional mean vector of the word vectors in each text, and calculating the W-dimensional mean vector of the word vectors in the captions generated for pictures I1, I2, ..., IM by the Show and Tell technique;
S44: inputting the collected pictures I1, I2, ..., IM into a CNN, and taking the activation values of one or all of the intermediate convolutional layers as the feature vector of the corresponding picture, the feature vector of each picture having dimension J;
S45: using a multi-layer neural network, reducing the J-dimensional picture feature vector through fully connected layers of decreasing width, then concatenating the Jn-dimensional reduced picture feature vector, the K-dimensional text feature vector, the W-dimensional mean vector of the word vectors in the text, and the W-dimensional mean vector of the word vectors in the Show and Tell caption into a mixed feature layer of dimension Jn+K+W+W, and processing the mixed feature layer through Z1, Z2, ...
6. The cross-modal material recommendation method based on a convolutional neural network according to claim 5, wherein the training of the CNN model constructed in S45 specifically comprises the following steps:
a. inputting the (D, I) combinations of the training set of S41 into the neural network constructed in S45;
b. obtaining a loss function from the error between the output and the target value, and calculating the residuals of all nodes in the fully connected layers of the network constructed in S45 by neural-network residual back propagation;
c. updating the parameters of the fully connected layers and of the Image2Vec network with the residuals so as to reduce the loss, and ending training once the loss function has converged after iterative calculation.
7. The cross-modal material recommendation method based on a convolutional neural network of claim 5, wherein the activation value of the topmost layer of the middle convolutional layer is used as the feature vector of the corresponding picture in S44.
8. The cross-modal material recommendation method based on a convolutional neural network according to claim 1, wherein the sorting of the calculation results of the previous step in S6 specifically comprises the following steps:
S61: matching the feature vectors of the picture modal data stored in the vector database with the feature vector of the target text modal data, and clustering the feature vectors with the k-means algorithm using Euclidean distance as the metric to obtain category groups;
S62: obtaining a specific category from the target text entered by the user; for that category, calculating the spatial vector distance from each material to the category's cluster centre, selecting the centre points of the several categories most frequently collocated with it, and applying the same spatial vector translation to obtain the corresponding points;
S63: taking the several materials whose vectors are nearest in Euclidean distance to the points obtained in S62 as candidate recommended materials, the number of recommendations being based on the collocation counts between the two categories and allocated by weighting according to those counts; let the number of recommendations be C and the collocation counts of the categories be c1, c2, c3, ...;
S64: ranking the candidate materials by distance and collocation count, i.e. by weight: let the collocation count between the two categories be c and the Euclidean distance between the calculated points be d, then weight = c/d; the recommended materials are returned in order of the best weight, so that the materials with the highest similarity can be recommended to the user.
CN202310359270.0A 2023-04-06 2023-04-06 Cross-modal material recommendation method based on convolutional neural network Pending CN116383437A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310359270.0A CN116383437A (en) 2023-04-06 2023-04-06 Cross-modal material recommendation method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310359270.0A CN116383437A (en) 2023-04-06 2023-04-06 Cross-modal material recommendation method based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN116383437A true CN116383437A (en) 2023-07-04

Family

ID=86980282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310359270.0A Pending CN116383437A (en) 2023-04-06 2023-04-06 Cross-modal material recommendation method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN116383437A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117112852A (en) * 2023-10-25 2023-11-24 卓世科技(海南)有限公司 Large language model driven vector database retrieval method and device
CN117112852B (en) * 2023-10-25 2024-02-20 卓世科技(海南)有限公司 Large language model driven vector database retrieval method and system

Similar Documents

Publication Publication Date Title
CN107220365B (en) Accurate recommendation system and method based on collaborative filtering and association rule parallel processing
CN110222218B (en) Image retrieval method based on multi-scale NetVLAD and depth hash
CN107122411B (en) Collaborative filtering recommendation method based on discrete multi-view Hash
CN109284411B (en) Discretization image binary coding method based on supervised hypergraph
CN109271486B (en) Similarity-preserving cross-modal Hash retrieval method
CN107341178B (en) Data retrieval method based on self-adaptive binary quantization Hash coding
CN114461839B (en) Multi-mode pre-training-based similar picture retrieval method and device and electronic equipment
CN114048340B (en) Hierarchical fusion combined query image retrieval method
CN110990596B (en) Multi-mode hash retrieval method and system based on self-adaptive quantization
CN111597371B (en) Multi-mode image retrieval method and system for appearance patent
CN112948601B (en) Cross-modal hash retrieval method based on controlled semantic embedding
CN107220368B (en) Image retrieval method and device
CN113807422A (en) Weighted graph convolutional neural network score prediction model fusing multi-feature information
CN116383437A (en) Cross-modal material recommendation method based on convolutional neural network
CN108399211B (en) Large-scale image retrieval algorithm based on binary characteristics
CN110334290B (en) MF-Octree-based spatio-temporal data rapid retrieval method
CN116204694A (en) Multi-mode retrieval method based on deep learning and hash algorithm
Liang et al. Cross-media semantic correlation learning based on deep hash network and semantic expansion for social network cross-media search
CN108389113B (en) Collaborative filtering recommendation method and system
CN109460772B (en) Spectral band selection method based on information entropy and improved determinant point process
CN112035689A (en) Zero sample image hash retrieval method based on vision-to-semantic network
CN116796038A (en) Remote sensing data retrieval method, remote sensing data retrieval device, edge processing equipment and storage medium
US10824811B2 (en) Machine learning data extraction algorithms
CN116935057A (en) Target evaluation method, electronic device, and computer-readable storage medium
CN114281950B (en) Data retrieval method and system based on multi-graph weighted fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination