CN109543714A - Method, apparatus, electronic device and storage medium for acquiring data features - Google Patents

Method, apparatus, electronic device and storage medium for acquiring data features

Info

Publication number
CN109543714A
CN109543714A
Authority
CN
China
Prior art keywords
feature
image
text
attention
text feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811204515.8A
Other languages
Chinese (zh)
Other versions
CN109543714B (en)
Inventor
张志伟
王希爱
王树强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201811204515.8A priority Critical patent/CN109543714B/en
Publication of CN109543714A publication Critical patent/CN109543714A/en
Application granted granted Critical
Publication of CN109543714B publication Critical patent/CN109543714B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present application provides a method, apparatus, electronic device and storage medium for acquiring data features, comprising: obtaining a multimedia sample that includes image information and text information; extracting a first image feature of the image information and a first text feature of the text information, respectively; and importing the first image feature and the first text feature into an attention mechanism model, which outputs a second text feature based on attention from the first image feature and/or a second image feature based on attention from the first text feature. Based on the attention mechanism, the application captures the association between the first image feature and the first text feature and obtains attention-weighted second text and/or image features, so that the second text feature and the second image feature contain the association between the image information and the text information. Because the attention-weighted features obtained by the application are produced by a single end-to-end attention mechanism model, the application scenario's dependence on multiple models is reduced.

Description

Method, apparatus, electronic device and storage medium for acquiring data features
Technical field
Embodiments of the present application relate to the field of image classification technology, and in particular to a method, apparatus, electronic device and storage medium for acquiring data features.
Background technique
In recent years, the wide application of deep learning in fields such as multimedia sample classification has enabled mobile applications to develop many multimedia classification features, optimizing functions such as information display and content recommendation and improving the user experience.
In the related art, in real-world scenarios, a user who uploads an image will generally also add a short textual description to it. Current image classification functions process the image and its corresponding textual description with a post-fusion technique to obtain corresponding features, and these features can be used to label the image associated with the description. Specifically, the feature-obtaining operation computes a prediction result from the feature points of the image and the distribution of the textual description, and the prediction result is then post-processed by different models to obtain the corresponding features.
However, in current schemes the features of the image and of the text are extracted with a post-fusion technique, that is, processed separately by different and mutually independent models. This cannot efficiently exploit the correlation between image and text within a multimedia sample. Moreover, because each model is optimized independently, the models may run at different speeds, so the final efficiency depends on the slowest model, reducing processing efficiency.
Summary of the invention
Embodiments of the present application provide a method, apparatus, electronic device and storage medium for acquiring data features, to solve the problems in the related art that extracting image and text features separately cannot efficiently use the correlation between image and text and leads to low processing efficiency.
In a first aspect, an embodiment of the present application provides a method for acquiring data features, the method comprising:
obtaining a multimedia sample, the multimedia sample including image information and text information;
extracting a first image feature of the image information and a first text feature of the text information, respectively;
importing the first image feature and the first text feature into an attention mechanism model, and outputting a second text feature based on attention from the first image feature, and/or a second image feature based on attention from the first text feature.
In a second aspect, an embodiment of the present application provides an apparatus for acquiring data features, the apparatus comprising:
an obtaining module, configured to obtain a multimedia sample, the multimedia sample including image information and text information;
a first feature extraction module, configured to extract a first image feature of the image information and a first text feature of the text information, respectively;
a second feature extraction module, configured to import the first image feature and the first text feature into an attention mechanism model and output a second text feature based on attention from the first image feature, and/or a second image feature based on attention from the first text feature.
In a third aspect, an embodiment of the present application further provides an electronic device, including a processor, a memory, and a computer program stored on the memory and runnable on the processor, where the computer program, when executed by the processor, implements the steps of the method for acquiring data features provided by the present application.
In a fourth aspect, an embodiment of the present application further provides a storage medium; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the steps of the method for acquiring data features provided by the present application.
In a fifth aspect, an embodiment of the present application further provides an application program which, when executed by a processor of an electronic device, implements the steps of the method for acquiring data features provided by the present application.
In the embodiments of the present application, a multimedia sample including image information and text information can be obtained; a first image feature of the image information and a first text feature of the text information are extracted, respectively; the first image feature and the first text feature are imported into an attention mechanism model, which outputs a second text feature based on attention from the first image feature and/or a second image feature based on attention from the first text feature. Based on the attention mechanism, the application captures the association between the first image feature and the first text feature and obtains attention-weighted second text and/or image features, so that the second text feature and the second image feature contain the association between the image information and the text information. Moreover, the attention-weighted features obtained by the application are produced by a single end-to-end attention mechanism model, which reduces the application scenario's dependence on multiple models.
The above description is only an overview of the technical solution of the present application. In order to better understand the technical means of the application so that it can be implemented according to the contents of this specification, and to make the above and other objects, features and advantages of the application clearer and easier to understand, specific embodiments of the application are set forth below.
Detailed description of the invention
Various other advantages and benefits will become clear to those of ordinary skill in the art by reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered a limitation of the application. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a flowchart of the steps of a method for acquiring data features provided by an embodiment of the present application;
Fig. 2 is a flowchart of the steps of another method for acquiring data features provided by an embodiment of the present application;
Fig. 3 is a processing interface diagram of another method for acquiring data features provided by an embodiment of the present application;
Fig. 4 is a block diagram of an apparatus for acquiring data features provided by an embodiment of the present application;
Fig. 5 is a block diagram of another apparatus for acquiring data features provided by an embodiment of the present application;
Fig. 6 is a logic block diagram of an electronic device according to another embodiment of the present application;
Fig. 7 is a logic block diagram of an electronic device according to another embodiment of the present application.
Specific embodiment
Exemplary embodiments of the present application are described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the application are shown in the drawings, it should be understood that the application may be implemented in various forms and should not be limited by the embodiments set forth here. On the contrary, these embodiments are provided so that the application can be thoroughly understood and so that its scope can be fully conveyed to those skilled in the art.
Fig. 1 is a flowchart of the steps of a method for acquiring data features provided by an embodiment of the present application. As shown in Fig. 1, the method may include:
Step 101: obtain a multimedia sample, the multimedia sample including image information and text information.
A multimedia sample is a sample containing multimedia information that a user uploads from a local device to the application server of a mobile terminal application. The mobile terminal application can process the multimedia sample, and display and publish the multimedia information contained in it.
In the embodiments of the present application, the obtained multimedia sample includes at least image information and text information. In this application scenario, when uploading a piece of image information, the user may also add a short piece of text describing that image information, so there is a strong association between the text information and the image information.
For example, in the "Moments" function of a social application, suppose a user uploads a travel photo of the seaside together with a sentence, "Went to XX beach today and played the whole day". The multimedia sample uploaded by the user thus includes both the travel photo and the text, so features such as "seaside" and "travel" can be extracted from the multimedia sample and used to label and classify it: the multimedia sample is tagged with labels such as "travel" and "seaside", completing its classification.
It should be noted that, in another implementation, the multimedia sample may also include video information. A video is streaming media composed of a sequence of video frames; key frame images are a specified number of video frames extracted from a frame sequence by some algorithm or rule (in a film, for example, key frames could serve as stills or cover images). Specifically, in this implementation, several key frame pictures can be extracted from the video information and used as the image information.
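The patent does not specify the key-frame selection rule; as an illustrative sketch only (a uniform-stride stand-in for a real key-frame algorithm, not the patent's method), a fixed number of frame indices can be spread evenly over the video:

```python
def sample_key_frames(num_frames, num_keys):
    """Pick `num_keys` frame indices spread uniformly over a video of
    `num_frames` frames (a simple stand-in for a real key-frame algorithm)."""
    if num_keys >= num_frames:
        return list(range(num_frames))
    stride = num_frames / num_keys
    # take the frame at the centre of each of the `num_keys` equal segments
    return [int(stride * i + stride / 2) for i in range(num_keys)]

indices = sample_key_frames(300, 5)
print(indices)  # [30, 90, 150, 210, 270]
```

The selected frames would then be decoded and treated as the image information of the sample.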
Step 102: extract a first image feature of the image information and a first text feature of the text information, respectively.
In practical applications, mobile applications such as social and shopping apps all provide multimedia-sample classification functions, and multimedia sample classification plays an increasingly large role in fields such as information display and content recommendation. Classification of a multimedia sample is performed on the basis of its features. Specifically, a feature is an abstraction of the multimedia sample that describes it, usually expressed in the form of a feature vector. In one implementation, a labeling model can further process the features of the multimedia sample and match them to a corresponding classification label; the multimedia sample is then assigned to the category corresponding to that label, completing its classification.
A feature is a characteristic, or a set of characteristics, that distinguishes one class of objects from other classes and can be extracted by measurement or processing. The main purpose of feature extraction is dimensionality reduction: its core idea is to project the original image samples onto a low-dimensional feature space, obtaining low-dimensional image sample features that best reflect the essence of the samples or best distinguish between sample classes.
For image information, every image has unique characteristics that distinguish it from images of other classes. Some are physical features that can be perceived intuitively, such as brightness, edges, texture and color; others can only be obtained through transformation or processing, such as moments, histograms and principal components. In the embodiments of the present application, the first image feature can be expressed by a feature vector expression, e.g., f = {x1, x2, ..., xn}. Common extraction methods for the first image feature include: (1) geometric methods, a class of texture analysis built on the theory of basic texture elements; (2) model methods, which build a structural model of the image and use the model parameters as texture features, e.g., a convolutional neural network model; (3) signal processing methods, in which texture feature extraction and matching mainly use gray-level co-occurrence matrices, autoregressive texture models, wavelet transforms, and the like.
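Of the features mentioned above, a gray-level histogram is among the simplest to compute; as an illustrative sketch only (not the patent's extraction method, and the bin count is an assumption), a normalized histogram gives a global feature vector f = {x1, ..., xn} for an 8-bit image:

```python
import numpy as np

def gray_histogram(image, bins=4):
    """Normalized gray-level histogram of an 8-bit image: a simple
    global image feature vector."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    return hist / hist.sum()

img = np.array([[0, 64], [128, 255]], dtype=np.uint8)
f = gray_histogram(img)
print(f)  # [0.25 0.25 0.25 0.25]
```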
For text information, the purpose of extracting the first text feature is to represent the text in a form a computer can understand, i.e., to vectorize the text. The first text feature can likewise be extracted by a corresponding text feature extraction model, for example an embedding network model.
Step 103: import the first image feature and the first text feature into an attention mechanism model, and output a second text feature based on attention from the first image feature, and/or a second image feature based on attention from the first text feature.
In the embodiments of the present application, the attention mechanism is in essence derived from the human visual attention mechanism, a signal-processing mechanism specific to human vision: by quickly scanning the global image, human vision obtains the target region that needs attention, commonly described as the focus of attention, and then devotes more attention resources to this region to obtain more detailed information about the target of interest while suppressing other useless information.
Accordingly, the attention mechanism model is a network model built by simulating the human attention mechanism. Based on the attention mechanism, it captures the association between the first image feature and the first text feature; this association can take the form of attention weights, and distributing the attention weights over the corresponding features yields the attention-weighted features. Because these features contain the association between the image information and the text information, using them in subsequent scenarios such as image classification or recommendation makes the classification or recommendation results more accurate. Moreover, the attention-based features extracted in the embodiments of the present application are produced by a single end-to-end attention mechanism model, reducing the application scenario's dependence on multiple models.
Specifically, in the embodiments of the present application, the attention mechanism model can be obtained by training on a large number of training samples containing text and images. For example, after the first image feature and the first text feature are imported into the attention mechanism model, the first image feature can be average-pooled to obtain multiple feature vectors; the degree of association of each feature vector with the first text feature is calculated, yielding a corresponding distribution of attention weights; finally, the attention weights and the first text feature are weighted and summed to obtain the second text feature based on attention from the first image feature. By the same logic, importing the first image feature and the first text feature into the attention mechanism model can also yield the second image feature based on attention from the first text feature.
In a concrete application scenario, suppose a user uploads through a client to the application server a picture of a beach together with a text description of the picture: "Went to the beach with my younger sister today; the sunshine was gorgeous". After preliminary feature extraction, the picture yields multiple first image features expressing, for example, the "sea", "beach" and "beach umbrella" in the picture; after preliminary feature extraction, the text information likewise yields multiple first text features expressing, for example, the "seaside", "travel", "sunshine" and "family" in the text. By importing the first image features and the first text features into the attention mechanism model, the application can output, according to the attention mechanism, second text features based on attention from the first image features: based on the first image features, the "family" feature with a low attention weight among the first text features can be screened out, retaining second text features for "seaside", "sunshine" and "travel". And/or, importing the first image features and the first text features into the attention mechanism model can output, according to the attention mechanism, second image features based on attention from the first text features: based on the first text features, the "travel" feature, which is highly associated with "beach" and "beach umbrella", can be added to the first image features, giving second image features that include "beach", "beach umbrella" and "travel".
Further, when classifying the beach picture according to the second text feature and/or the second image feature, the picture can be tagged with the labels "beach" and "travel" and accurately placed into the landscape and travel categories. The embodiments of the present application thus introduce an attention mechanism model: through attention association based on the text information, the "travel" feature originally missing from the beach picture is added, and/or, through attention association based on the image information, the less important "family" feature in the text information is filtered out, yielding second text features and/or second image features that fuse the degree of association between image and text, so that more accurate results can be obtained in subsequent image classification or image recommendation using the second text feature and the second image feature.
In summary, in the method for acquiring data features provided by the embodiments of the present application, a multimedia sample including image information and text information can be obtained; a first image feature of the image information and a first text feature of the text information are extracted, respectively; the first image feature and the first text feature are imported into an attention mechanism model, which outputs a second text feature based on attention from the first image feature and/or a second image feature based on attention from the first text feature. Based on the attention mechanism, the application captures the association between the first image feature and the first text feature and obtains attention-weighted second text and/or image features, so that the second text feature and the second image feature contain the association between the image information and the text information. Moreover, the attention-weighted features obtained by the application are produced by a single end-to-end attention mechanism model, reducing the application scenario's dependence on multiple models.
Fig. 2 is a flowchart of the steps of another method for acquiring data features provided by an embodiment of the present application. As shown in Fig. 2, the method may include:
Step 201: obtain a multimedia sample, the multimedia sample including image information and text information.
The implementation of this step is similar to the process of step 101 above and is not described in detail here.
Step 202: extract a first image feature of the image information and a first text feature of the text information, respectively.
The implementation of this step is similar to the process of step 102 above and is not described in detail here.
Specifically, in one implementation of the present application, step 202 can be realized by the following steps 2021 to 2022:
Step 2021: import the image information into a convolutional neural network model and output the first image feature corresponding to the image information.
In the embodiments of the present application, a convolutional neural network (CNN) is a type of deep feed-forward neural network containing convolutional layers and pooling layers. In general, the basic structure of a CNN comprises two kinds of layer. The first is the feature extraction layer: the input of each neuron is connected to a local receptive field of the previous layer, from which the local feature is extracted; once a local feature has been extracted, its positional relationship to the other features is also determined. The second is the feature mapping layer: each computational layer of the network consists of multiple feature maps, each feature map is a plane, and all neurons in a plane share equal weights. The feature mapping structure uses a sigmoid function with a small influence-function kernel as the activation function of the convolutional network, so that the feature maps are shift-invariant. Furthermore, since the neurons in one mapping plane share weights, the number of free parameters of the network is reduced. Each convolutional layer in a convolutional neural network is followed by a computational layer for local averaging and secondary extraction; this characteristic two-stage feature extraction structure reduces the feature resolution.
Therefore, for a pixel p in a piece of image information, pixels closer to p generally have a greater influence on it; in addition, by the statistical properties of natural images, the weights of one region can also be used for another region. This weight sharing is, put plainly, convolution kernel sharing: with a convolutional neural network model, convolving a given image with a convolution kernel extracts one feature of the image, different convolution kernels extract different image features, and the first image feature corresponding to the image information is finally obtained. A convolution kernel (also called a filter) is used to extract features: convolving the image with the kernel yields the feature values.
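As a minimal, self-contained illustration of how convolving an image with a shared kernel produces feature values (a single "valid" 2D convolution, not the patent's actual network), the sliding-window computation can be written as:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # feature value = sum of the elementwise product of patch and kernel
            out[r, c] = (image[r:r + kh, c:c + kw] * kernel).sum()
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0]])   # a simple horizontal-edge detector
fmap = conv2d_valid(image, edge_kernel)
print(fmap.shape)  # (4, 3)
```

Using a different kernel (e.g. a vertical-edge or blur kernel) extracts a different feature map from the same image, which is the "different kernels extract different features" point above.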
Step 2022: import the text information into an embedded neural network model and output the first text feature corresponding to the text information.
In the embodiments of the present application, an embedded neural network model may generally include an input layer, an embedding layer, a hidden layer and an output layer.
The input layer is used to input the word vector obtained for each token after the text information is segmented into words.
The embedding layer performs word embedding on the word vectors. Specifically, a projection matrix C (represented as a dense matrix) is first initialized. The number of rows of this projection matrix is the embedding dimension, assumed here to be 500, and the number of columns is the size of the input word vectors, assumed to be 40000; the weights w in the matrix are initialized manually in advance. Each input word vector is multiplied by the projection matrix C, so that each word vector becomes a 500x1 vector; the whole word embedding layer thus has 500x1x40000 dimensions.
The hidden layer takes the vectors output by the word embedding as its input. If the hidden layer has 100 neurons, the number of weights θ is 500x40000x100. After a linear transformation, the result is fed into the tanh activation function, and the output of the activation function is the output of the hidden layer.
The output layer outputs the extracted text features, where the number of outputs equals the number of input words, both being 40000. The vectors output by the hidden layer pass through a softmax function, producing 40000 results; each result is a vector corresponding to one word, giving the probability that this word is each word in the vocabulary.
Therefore, importing the text information into the embedded neural network model outputs the first text feature corresponding to the text information. It should be noted that the first text feature can also be extracted in other ways, which the embodiments of the present application do not limit.
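Treating each input word vector as one-hot, multiplying it by the projection matrix C described above is equivalent to selecting one column of C. This can be sketched as follows (the 500 and 40000 sizes from the description are replaced by small stand-ins so the example runs quickly; the random initialization is an assumption):

```python
import numpy as np

dim, vocab = 5, 8                     # stand-ins for the 500 x 40000 sizes
rng = np.random.default_rng(1)
C = rng.normal(size=(dim, vocab))     # projection matrix: dim rows, vocab columns

def embed(word_index):
    """Multiply a one-hot word vector by C; equivalent to selecting a column."""
    one_hot = np.zeros(vocab)
    one_hot[word_index] = 1.0
    return C @ one_hot                # shape: (dim,)

v = embed(3)
print(v.shape)  # (5,)
```

Real implementations skip the matrix multiplication entirely and index the column directly, which is why embedding layers are usually implemented as lookup tables.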
Step 203: import the first image feature and the first text feature into the attention mechanism model, and output a second text feature based on attention from the first image feature and a second image feature based on attention from the first text feature.
The implementation of this step is similar to the process of step 103 above and is not described in detail here.
Optionally, in one implementation, step 203 can also include:
Step 2031: average-pool the first image feature to obtain corresponding first image feature vectors.
In this implementation, the explanation takes obtaining the second text feature based on attention from the first image feature as an example. In the attention mechanism model, average pooling can usually be used as the aggregation function to characterize the first image feature. Average pooling takes the mean of all values in a local receptive field, reducing the error caused by the increased variance of estimates in a limited neighborhood; for image information, its effect is to retain more of the image's background information. Therefore, average-pooling the first image feature yields corresponding first image feature vectors with reduced error.
Specifically, after the first image feature is average-pooled, multiple first image feature vectors are obtained. Based on the attention mechanism, different weights are assigned to the different first image feature vectors in the input, and the input is ultimately expressed as a weighted sum of the multiple first image feature vectors.
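As an illustration only (the pool size and feature-map layout are assumptions, not taken from the patent), average pooling over non-overlapping 2x2 neighborhoods of a feature map looks like:

```python
import numpy as np

def average_pool_2x2(feature_map):
    """Average each non-overlapping 2x2 block of a (2h, 2w) feature map."""
    h, w = feature_map.shape[0] // 2, feature_map.shape[1] // 2
    # reshape into (h, 2, w, 2) blocks, then average within each block
    return feature_map.reshape(h, 2, w, 2).mean(axis=(1, 3))

fm = np.array([[1.0, 3.0, 5.0, 7.0],
               [1.0, 3.0, 5.0, 7.0]])
pooled = average_pool_2x2(fm)
print(pooled)  # [[2. 6.]]
```

Each pooled value is the local mean, which is why average pooling preserves background statistics rather than only the strongest responses (as max pooling would).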
Step 2032: perform a similarity calculation between the first text feature and the first image feature vector to obtain a third attention weight factor by which the first text feature corresponds to the first image feature.
In the embodiments of the present application, attention based on the first image feature can be understood as an attention weight factor relating the first text feature to the first image feature. The essence of the attention function can be described as a mapping from a query to a series of key-value pairs. Computing the second text feature based on attention from the first image feature is broadly divided into three steps: first, a similarity calculation between the query (the first image feature vector) and the keys (the first text feature) yields the weight factors, where common similarity functions include the dot product, concatenation and a perceptron; second, these weight factors are usually normalized with a softmax function to obtain normalized weights; finally, the normalized weights and the corresponding values (the first text feature) are weighted and summed to obtain the final second text feature.
Step 2033: normalize the third attention weight factor to obtain the corresponding third attention weight.
In this step, the normalization of the third attention weight factor can be performed with the sigmoid function. Because it is monotonically increasing and its inverse is monotonically increasing, the sigmoid function is often used as a threshold function in neural networks; its effect is to map a variable into the interval (0, 1).
Step 2034, the third attention weight and first text feature are weighted read group total, obtained Second text feature of the attention based on the first image feature.
In this step, by after normalized weight and corresponding key assignments value (the first text feature) added The second text feature for summing to the end is weighed, the second text feature combines the association of text information and image information at this time Property, allow and obtains more accurate structure in subsequent prediction or sort operation.
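The three steps of the query-key-value attention described above (similarity, normalization, weighted sum) can be sketched in numpy; the dot-product similarity, the softmax normalization, and all shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(query, keys, values):
    """Three-step attention: dot-product similarity between the query
    and each key, softmax normalization of the weight factors, then a
    weighted sum of the values."""
    weight_factors = keys @ query       # step 1: similarity calculation
    weights = softmax(weight_factors)   # step 2: normalized weights
    return weights @ values             # step 3: weighted summation

# Illustrative: the query is the pooled 256-dim image feature vector;
# the first text feature (10 word vectors) serves as both keys and values.
query = np.random.rand(256)
first_text_feature = np.random.rand(10, 256)
second_text_feature = attention(query, first_text_feature, first_text_feature)
print(second_text_feature.shape)  # (256,)
```

The output is a single vector in the text-feature space whose composition is steered by the image feature, which is the sense in which it "combines" the two modalities.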
Optionally, in another implementation, step 203 may further include:
Step 2035: average-pool the first text feature to obtain a corresponding first text feature vector.
The implementation of this step is similar to that of step 2031 above and is not detailed again here.
Step 2036: perform a similarity calculation between the first image feature and the first text feature vector to obtain a fourth attention weight factor of the first image feature with respect to the first text feature.
The implementation of this step is similar to that of step 2032 above and is not detailed again here.
Step 2037: normalize the fourth attention weight factor to obtain the corresponding fourth attention weight.
The implementation of this step is similar to that of step 2033 above and is not detailed again here.
Step 2038: perform a weighted summation of the fourth attention weight and the first image feature to obtain the second image feature based on attention over the first text feature.
The implementation of this step is similar to that of step 2034 above and is not detailed again here.
Step 204: perform feature merging of the second text feature and the second image feature to obtain a first merged feature of the multimedia sample.
In this embodiment of the present application, after the first image feature and the first text feature have been imported into the attention mechanism model and the second text feature and the second image feature have been output, the second text feature and the second image feature can be merged to obtain the first merged feature. For example, if the second text feature and the second image feature are each 256-dimensional feature vectors, splicing the end of the second text feature to the starting point of the second image feature yields a 512-dimensional first merged feature. The feature merging can be implemented with a concat function, which joins two or more arrays.
After feature merging, a single first merged feature is obtained, which reduces the number of subsequent inputs. For example, when a labeling model is subsequently used for image classification, the single first merged feature can be input, which reduces the number of inputs compared with inputting the second text feature and the second image feature separately.
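The end-to-start splicing of the two 256-dimensional features into one 512-dimensional merged feature can be sketched directly with numpy's concatenation (the specific dimensions are the example values from the text):

```python
import numpy as np

# Illustrative attention-applied features, each 256-dimensional.
second_text_feature = np.random.rand(256)
second_image_feature = np.random.rand(256)

# Splice the end of the text feature to the start of the image feature,
# analogous to the concat function mentioned above.
first_merged_feature = np.concatenate([second_text_feature, second_image_feature])
print(first_merged_feature.shape)  # (512,)
```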
In conclusion the acquisition methods of another kind data characteristics provided by the embodiments of the present application, available multimedia sample This, multimedia sample includes image information and text information;The first characteristics of image and text envelope of image information are extracted respectively First text feature of breath;First characteristics of image and the first text feature are imported into attention Mechanism Model, output is based on first Second text feature of the attention of characteristics of image, and/or the second characteristics of image of the attention based on the first text feature.This Application is based on attention mechanism, captures the relevance between the first characteristics of image and the first text feature, and the note that has been applied The second text feature and/or the second characteristics of image for power mechanism of anticipating, so that the second text feature and the second characteristics of image include Relevance between image information and text information.Also, the attention mechanism that applies that the application obtains is characterized in being based on One end to end attention Mechanism Model realize, reduce dependence of the application scenarios to multi-model.
Fig. 3 is a flowchart of the steps of another method for acquiring data features provided by an embodiment of the present application. As shown in Fig. 3, the method may include:
Step 301: acquire a multimedia sample, the multimedia sample including image information and text information.
The implementation of this step is similar to that of step 101 above and is not detailed again here.
Step 302: extract a first image feature of the image information and a first text feature of the text information, respectively.
The implementation of this step is similar to that of step 102 above and is not detailed again here.
Step 303: import the first image feature and the first text feature into an attention mechanism model, and output a second text feature based on attention over the first image feature and a second image feature based on attention over the first text feature.
The implementation of this step is similar to that of step 203 above and is not detailed again here.
Step 304: import the first image feature into the attention mechanism model, and output a third image feature based on attention over the first image feature.
Optionally, step 304 may further include:
Step 3041: average-pool the first image feature to obtain a corresponding first image feature vector.
In this implementation, obtaining the third image feature based on attention over the first image feature is described. In the attention mechanism model, an aggregation function such as average pooling is typically used to characterize the first image feature. Average pooling averages all values within a local receptive field, which reduces the error caused by the increase in estimate variance due to the limited neighborhood size; for image information, its effect is to retain more of the image's background information. Therefore, average-pooling the first image feature yields a corresponding first image feature vector with reduced error.
Specifically, after the first image feature is average-pooled, multiple first image feature vectors can be obtained. Based on the attention mechanism, different weights are assigned to the different vectors in the input, and the input is ultimately expressed as a weighted sum of the multiple first image feature vectors.
Step 3042: perform a similarity calculation between the first image feature and the first image feature vector to obtain a first attention weight factor of the first image feature with respect to itself.
In this embodiment of the present application, attention based on the first image feature can be understood as the first attention weight factor of the first image feature with respect to itself. The essence of the attention mechanism function can be described as a mapping from a query to a series of key-value pairs. Computing the third image feature based on attention over the first image feature is broadly divided into three steps. In the first step, a similarity calculation is performed between the query (the first image feature vector) and the key (the first image feature) to obtain weight factors; common similarity functions include the dot product, concatenation, a perceptron, and so on. In the second step, these weight factors are usually normalized with a softmax function to obtain normalized weights. Finally, the normalized weights and the corresponding values (the first image feature) are weighted and summed to obtain the final third image feature.
Step 3043: normalize the first attention weight factor to obtain the corresponding first attention weight.
In this step, the normalization of the first attention weight factor can be performed with a sigmoid function. Because the sigmoid function is monotonically increasing and its inverse is also monotonically increasing, it is often used as the threshold function of a neural network; its effect is to map a variable into the interval (0, 1).
Step 3044: perform a weighted summation of the first attention weight and the first image feature to obtain the third image feature based on attention over the first image feature.
In this step, the normalized weights and the corresponding values (the first image feature) are weighted and summed to obtain the final third image feature. At this point the third image feature combines the correlation between the individual features within the image information itself, allowing more accurate results to be obtained in subsequent prediction or classification operations.
In this embodiment of the present application, the attention mechanism may also include a self-attention mechanism (Self-attention Mechanism). When self-attention is used, the attention mechanism model takes a single input. For example, importing the first image feature into the attention mechanism model outputs a third image feature based on attention over the first image feature; the third image feature contains the internal degree of correlation between the individual feature blocks within the image information.
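The single-input self-attention described above can be sketched as follows. This is a minimal illustration under assumed shapes (49 feature blocks of dimension 256); a real model would add learned query/key/value projections, which are omitted here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    """Single input: the feature attends to itself, so queries, keys
    and values all come from the same sequence of feature blocks."""
    scores = x @ x.T            # pairwise similarity between blocks
    weights = softmax(scores)   # normalize each row of weight factors
    return weights @ x          # each block becomes a weighted sum

# Illustrative: 49 image feature blocks (e.g. a flattened 7x7 grid).
first_image_feature = np.random.rand(49, 256)
third_image_feature = self_attention(first_image_feature)
print(third_image_feature.shape)  # (49, 256)
```

Each row of `weights` records how strongly one feature block relates to every other block, which is the "internal correlation between feature blocks" the text refers to.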
Step 305: import the first text feature into the attention mechanism model, and output a third text feature based on attention over the first text feature.
Optionally, step 305 may further include:
Step 3051: average-pool the first text feature to obtain a corresponding first text feature vector.
The implementation of this step is similar to that of step 3041 above and is not detailed again here.
Step 3052: perform a similarity calculation between the first text feature and the first text feature vector to obtain a second attention weight factor of the first text feature with respect to itself.
The implementation of this step is similar to that of step 3042 above and is not detailed again here.
Step 3053: normalize the second attention weight factor to obtain the corresponding second attention weight.
The implementation of this step is similar to that of step 3043 above and is not detailed again here.
Step 3054: perform a weighted summation of the second attention weight and the first text feature to obtain the third text feature based on attention over the first text feature.
The implementation of this step is similar to that of step 3044 above and is not detailed again here.
In this embodiment of the present application, attention based on the first text feature exploits a characteristic of the attention mechanism: it ignores the distance between words and computes dependencies directly, so it can learn the internal structure of a sentence; it is also relatively simple to implement and can be computed in parallel. The resulting third text feature contains the internal degree of correlation between the individual word vectors within the text information.
Step 306: perform feature merging of the second text feature, the second image feature, the third image feature, and the third text feature to obtain a second merged feature of the multimedia sample.
In this embodiment of the present application, after the first image feature and the first text feature have been imported into the attention mechanism model and the second text feature, the second image feature, the third image feature, and the third text feature have been output, these four features can be merged to obtain the second merged feature. For example, if the second text feature, the second image feature, the third image feature, and the third text feature are each 256-dimensional feature vectors, splicing the features end to end yields a 1024-dimensional second merged feature. The feature merging can be implemented with a concat function, which joins two or more arrays.
After feature merging, a single second merged feature is obtained, which reduces the number of subsequent inputs. For example, when a labeling model is subsequently used for image classification, the single second merged feature can be input, which reduces the number of inputs compared with inputting the four features separately.
Step 307: import the second merged feature into a labeling model, and output a classification tag corresponding to the second merged feature.
In this embodiment of the present invention, the labeling model may prestore correspondences between features and labels and implement the mapping from input features to corresponding labels through a preset function, so that a corresponding label is matched for an input feature. In this step, the second merged feature contains not only the correlation between the image information and the text information, but also the internal degree of correlation between the feature blocks of the image information itself and between the word vectors of the text information itself. Therefore, when a user uploads an image and adds a short piece of text describing the scene of the image, the second merged feature is a fusion of the image information and the text information and can accurately express the information contained in both. Importing the second merged feature into the labeling model outputs the classification tag corresponding to the second merged feature, so that the image information is assigned to the corresponding category.
For example, suppose a user uploads a photo of the seashore together with a passage of text, "Went to the XX beach today and played all day." The multimedia sample uploaded by the user then contains this travel photo and this text, so features such as "seashore" and "travel" can be extracted from the multimedia sample and label classification performed, tagging the multimedia sample with labels such as "travel" and "seashore" and completing the classification of the multimedia sample.
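A toy labeling model in the spirit of step 307 can be sketched as follows. The class name, the random weights, and the label set are all illustrative placeholders; the application does not specify the model's internal form, only that it maps a merged feature to a tag:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class LabelingModel:
    """Toy labeling model: a linear map from the merged feature to
    per-label scores, followed by softmax and argmax. A trained model
    would learn these weights; here they are random placeholders."""
    def __init__(self, labels, dim):
        self.labels = labels
        self.weights = np.random.rand(len(labels), dim)

    def classify(self, merged_feature):
        scores = softmax(self.weights @ merged_feature)
        return self.labels[int(np.argmax(scores))]

# Illustrative: a 1024-dim second merged feature, as in the example above.
model = LabelingModel(["travel", "seashore", "food"], dim=1024)
second_merged_feature = np.random.rand(1024)
print(model.classify(second_merged_feature))  # one of the three labels
```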
In conclusion the acquisition methods of another kind data characteristics provided by the embodiments of the present application, available multimedia sample This, multimedia sample includes image information and text information;The first characteristics of image and text envelope of image information are extracted respectively First text feature of breath;First characteristics of image and the first text feature are imported into attention Mechanism Model, output is based on first Second text feature of the attention of characteristics of image, and/or the second characteristics of image of the attention based on the first text feature.This Application is based on attention mechanism, captures the relevance between the first characteristics of image and the first text feature, and the note that has been applied The second text feature and/or the second characteristics of image for power mechanism of anticipating, so that the second text feature and the second characteristics of image include Relevance between image information and text information.Also, the attention mechanism that applies that the application obtains is characterized in being based on One end to end attention Mechanism Model realize, reduce dependence of the application scenarios to multi-model.
Fig. 4 is a block diagram of a device for acquiring data features provided by an embodiment of the present application. As shown in Fig. 4, the device includes:
an acquisition module 401, configured to acquire a multimedia sample, the multimedia sample including image information and text information;
a first feature extraction module 402, configured to extract a first image feature of the image information and a first text feature of the text information, respectively;
a second feature extraction module 403, configured to import the first image feature and the first text feature into an attention mechanism model, and output a second text feature based on attention over the first image feature and/or a second image feature based on attention over the first text feature.
In summary, the device for acquiring data features provided by this embodiment of the present application can acquire a multimedia sample that includes image information and text information; extract a first image feature of the image information and a first text feature of the text information; and import the first image feature and the first text feature into an attention mechanism model to output a second text feature based on attention over the first image feature and/or a second image feature based on attention over the first text feature. Based on the attention mechanism, the present application captures the correlation between the first image feature and the first text feature and obtains an attention-applied second text feature and/or second image feature, so that the second text feature and the second image feature contain the correlation between the image information and the text information. Moreover, the attention-applied features obtained by the present application are produced by a single end-to-end attention mechanism model, which reduces the dependence of application scenarios on multiple models.
Fig. 5 is a block diagram of another device for acquiring data features provided by an embodiment of the present application. As shown in Fig. 5, the device includes:
an acquisition module 501, configured to acquire a multimedia sample, the multimedia sample including image information and text information;
a first feature extraction module 502, configured to extract a first image feature of the image information and a first text feature of the text information, respectively.
Optionally, the first feature extraction module 502 includes:
a first output submodule, configured to import the image information into a convolutional neural network model and output the first image feature corresponding to the image information;
a second output submodule, configured to import the text information into an embedding neural network model and output the first text feature corresponding to the text information.
The device further includes a second feature extraction module 503, configured to import the first image feature and the first text feature into an attention mechanism model, and output a second text feature based on attention over the first image feature and/or a second image feature based on attention over the first text feature.
Optionally, importing the first image feature and the first text feature into the attention mechanism model and outputting the second text feature based on attention over the first image feature includes: average-pooling the first image feature to obtain a corresponding second image feature vector; performing a similarity calculation between the first text feature and the second image feature vector to obtain a third attention weight factor of the first text feature with respect to the first image feature; normalizing the third attention weight factor to obtain a corresponding third attention weight; and performing a weighted summation of the third attention weight and the first text feature to obtain the second text feature based on attention over the first image feature.
Optionally, importing the first image feature and the first text feature into the attention mechanism model and outputting the second image feature based on attention over the first text feature includes: average-pooling the first text feature to obtain a corresponding second text feature vector; performing a similarity calculation between the first image feature and the second text feature vector to obtain a fourth attention weight factor of the first image feature with respect to the first text feature; normalizing the fourth attention weight factor to obtain a corresponding fourth attention weight; and performing a weighted summation of the fourth attention weight and the first image feature to obtain the second image feature based on attention over the first text feature.
Optionally, the second feature extraction module 503 may further include:
a merging submodule, configured to perform feature merging of the second text feature and the second image feature to obtain a first merged feature of the multimedia sample.
The device further includes a third feature extraction module 504, configured to import the first image feature into the attention mechanism model and output a third image feature based on attention over the first image feature.
Optionally, the third feature extraction module 504 includes:
a first averaging submodule, configured to average-pool the first image feature to obtain a corresponding first image feature vector;
a first similarity calculation submodule, configured to perform a similarity calculation between the first image feature and the first image feature vector to obtain a first attention weight factor of the first image feature with respect to itself;
a first normalization submodule, configured to normalize the first attention weight factor to obtain a corresponding first attention weight;
a first summation submodule, configured to perform a weighted summation of the first attention weight and the first image feature to obtain the third image feature based on attention over the first image feature.
The device further includes a fourth feature extraction module 505, configured to import the first text feature into the attention mechanism model and output a third text feature based on attention over the first text feature.
Optionally, the fourth feature extraction module 505 includes:
a second averaging submodule, configured to average-pool the first text feature to obtain a corresponding first text feature vector;
a second similarity calculation submodule, configured to perform a similarity calculation between the first text feature and the first text feature vector to obtain a second attention weight factor of the first text feature with respect to itself;
a second normalization submodule, configured to normalize the second attention weight factor to obtain a corresponding second attention weight;
a second summation submodule, configured to perform a weighted summation of the second attention weight and the first text feature to obtain the third text feature based on attention over the first text feature.
The device further includes a merging module 506, configured to perform feature merging of the second text feature, the second image feature, the third image feature, and the third text feature to obtain a second merged feature of the multimedia sample;
and a labeling module 507, configured to import the second merged feature into a labeling model and output a classification tag corresponding to the second merged feature.
In summary, the device for acquiring data features provided by this embodiment of the present application can acquire a multimedia sample that includes image information and text information; extract a first image feature of the image information and a first text feature of the text information; and import the first image feature and the first text feature into an attention mechanism model to output a second text feature based on attention over the first image feature and/or a second image feature based on attention over the first text feature. Based on the attention mechanism, the present application captures the correlation between the first image feature and the first text feature and obtains an attention-applied second text feature and/or second image feature, so that the second text feature and the second image feature contain the correlation between the image information and the text information. Moreover, the attention-applied features obtained by the present application are produced by a single end-to-end attention mechanism model, which reduces the dependence of application scenarios on multiple models.
Fig. 6 is a block diagram of an electronic device 600 shown according to an exemplary embodiment. For example, the electronic device 600 may be a mobile terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
Referring to Fig. 6, the electronic device 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 typically controls the overall operation of the electronic device 600, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 602 may include one or more processors 620 to execute instructions so as to perform all or part of the steps of the methods described above. In addition, the processing component 602 may include one or more modules to facilitate interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operation on the electronic device 600. Examples of such data include instructions for any application or method operated on the electronic device 600, contact data, phone book data, messages, pictures, videos, and so on. The memory 604 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 606 provides power to the various components of the electronic device 600. The power component 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 600.
The multimedia component 608 includes a screen that provides an output interface between the electronic device 600 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. When the electronic device 600 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capabilities.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 600 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory 604 or transmitted via the communication component 616. In some embodiments, the audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing status assessments of various aspects of the electronic device 600. For example, the sensor component 614 can detect the open/closed state of the electronic device 600 and the relative positioning of components, for example, the display and the keypad of the electronic device 600; the sensor component 614 can also detect a change in position of the electronic device 600 or of a component of the electronic device 600, the presence or absence of user contact with the electronic device 600, the orientation or acceleration/deceleration of the electronic device 600, and a change in temperature of the electronic device 600. The sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
Communication component 616 is configured to facilitate the communication of wired or wireless way between electronic equipment 600 and other equipment. Electronic equipment 600 can access the wireless network based on communication standard, such as WiFi, carrier network (such as 2G, 3G, 4G or 5G), Or their combination.In one exemplary embodiment, communication component 616 receives via broadcast channel and comes from external broadcasting management The broadcast singal or broadcast related information of system.In one exemplary embodiment, the communication component 616 further includes that near field is logical (NFC) module is believed, to promote short range communication.For example, radio frequency identification (RFID) technology, infrared data association can be based in NFC module Meeting (IrDA) technology, ultra wide band (UWB) technology, bluetooth (BT) technology and other technologies are realized.
In the exemplary embodiment, electronic equipment 600 can be by one or more application specific integrated circuit (ASIC), number Word signal processor (DSP), digital signal processing appts (DSPD), programmable logic device (PLD), field programmable gate array (FPGA), controller, microcontroller, microprocessor or other electronic components are realized, obtain multimedia sample for executing, described Multimedia sample includes image information and text information;The first characteristics of image and the text of described image information are extracted respectively First text feature of this information;The first image feature and first text feature are imported into attention Mechanism Model, Export the second text feature of the attention based on the first image feature, and/or the note based on first text feature Second characteristics of image of power of anticipating.
In an exemplary embodiment, a non-transitory storage medium including instructions is also provided, such as the memory 604 including instructions, which are executable by the processor 620 of the electronic device 600 to perform the above method. For example, the non-transitory storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Fig. 7 is a block diagram of an electronic device 700 according to an exemplary embodiment. For example, the electronic device 700 may be provided as a server. Referring to Fig. 7, the electronic device 700 includes a processing component 722, which further includes one or more processors, and memory resources represented by a memory 732 for storing instructions executable by the processing component 722, such as an application program. The application program stored in the memory 732 may include one or more modules, each of which corresponds to a set of instructions. Furthermore, the processing component 722 is configured to execute the instructions to perform: obtaining a multimedia sample, the multimedia sample including image information and text information; extracting a first image feature of the image information and a first text feature of the text information, respectively; and importing the first image feature and the first text feature into an attention mechanism model to output a second text feature based on the attention of the first image feature and/or a second image feature based on the attention of the first text feature.
The electronic device 700 may also include a power supply component 726 configured to perform power management of the electronic device 700, a wired or wireless network interface 750 configured to connect the electronic device 700 to a network, and an input/output (I/O) interface 758. The electronic device 700 may operate based on an operating system stored in the memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
An embodiment of the present application also provides an application program. When the application program is executed by a processor of an electronic device, it implements the steps of: obtaining a multimedia sample including image information and text information as provided by the present application; extracting a first image feature of the image information and a first text feature of the text information, respectively; and importing the first image feature and the first text feature into an attention mechanism model to output a second text feature based on the attention of the first image feature and/or a second image feature based on the attention of the first text feature.
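End to end, the recited steps amount to: extract a per-modality feature for each modality, pass both through the attention model, and combine the results. The following is a minimal sketch of one plausible reading, not the patent's actual implementation: the feature extractors are random stand-ins (a real system would use, e.g., a CNN for images and word embeddings for text), and all names, shapes, and the choice of dot-product attention are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for real feature extractors; the `image` argument is unused
# because this sketch fabricates features instead of reading pixels.
def extract_image_feature(image):
    return rng.standard_normal((49, 32))       # e.g. a 7x7 CNN grid, 32-dim

def extract_text_feature(text):
    return rng.standard_normal((len(text.split()), 32))  # one 32-dim vector per word

def attend(query_vec, feats):
    """Weight the rows of feats (n, d) by similarity to query_vec (d,)."""
    scores = feats @ query_vec                 # dot-product similarity per position
    w = np.exp(scores - scores.max())          # numerically stable softmax
    w /= w.sum()
    return w @ feats                           # attention-weighted sum, shape (d,)

first_image = extract_image_feature("sample.jpg")
first_text = extract_text_feature("a short caption for the sample")

# Cross-modal attention: each modality's pooled vector queries the other.
second_text = attend(first_image.mean(axis=0), first_text)
second_image = attend(first_text.mean(axis=0), first_image)
merged = np.concatenate([second_text, second_image])  # 64-dim merged feature
```

The merged 64-dimensional vector corresponds loosely to the "merged feature" of the dependent claims; concatenation is one common choice, though the patent does not fix the merging operation.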
Those skilled in the art, after considering the specification and practicing the invention disclosed herein, will readily conceive of other embodiments of the invention. This application is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or conventional technical means in the art not disclosed herein. The specification and examples are to be considered exemplary only, with the true scope and spirit of the invention indicated by the following claims.
It should be understood that the invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A method for acquiring data features, wherein the method comprises:
obtaining a multimedia sample, the multimedia sample comprising image information and text information;
extracting a first image feature of the image information and a first text feature of the text information, respectively; and
importing the first image feature and the first text feature into an attention mechanism model, and outputting a second text feature based on the attention of the first image feature and/or a second image feature based on the attention of the first text feature.
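Claim 1 leaves the internals of the attention mechanism model unspecified. One plausible reading, sketched below purely for illustration, is that a pooled vector from one modality acts as the query over the position-level features of the other; the function names, shapes, and dot-product scoring are assumptions, not details from the patent.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def cross_modal_attention(query_vec, token_feats):
    """Weight token_feats (n, d) by their similarity to query_vec (d,)
    and return the attention-weighted sum as a single (d,) feature."""
    scores = token_feats @ query_vec    # dot-product similarity per token
    weights = softmax(scores)           # attention weights, sum to 1
    return weights @ token_feats        # weighted sum over tokens

# e.g. a "second text feature": text tokens attended by a pooled image query
text_tokens = np.random.randn(7, 16)    # 7 word-level features, 16-dim
image_query = np.random.randn(16)       # pooled first image feature
second_text_feature = cross_modal_attention(image_query, text_tokens)
```

Swapping the roles (image positions as `token_feats`, pooled text vector as `query_vec`) would give the corresponding "second image feature".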
2. The method according to claim 1, wherein after importing the first image feature and the first text feature into the attention mechanism model and outputting the second text feature based on the attention of the first image feature and the second image feature based on the attention of the first text feature, the method further comprises:
merging the second text feature and the second image feature to obtain a first merged feature of the multimedia sample.
3. The method according to claim 2, wherein after extracting the first image feature of the image information and the first text feature of the text information, respectively, the method further comprises:
importing the first image feature into the attention mechanism model, and outputting a third image feature based on the attention of the first image feature; and
importing the first text feature into the attention mechanism model, and outputting a third text feature based on the attention of the first text feature.
4. The method according to claim 3, wherein merging the second text feature and the second image feature to obtain the merged feature comprises:
merging the second text feature, the second image feature, the third image feature, and the third text feature to obtain a second merged feature of the multimedia sample.
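The claims do not fix what "feature merging" means; a common realization is vector concatenation. The sketch below assumes that reading and that each feature is a 1-D vector — both are our assumptions, not statements from the patent.

```python
import numpy as np

def merge_features(*features):
    """Merge feature vectors by concatenation, one common choice for
    'feature merging' (the patent leaves the operation unspecified)."""
    return np.concatenate([np.asarray(f, dtype=float).ravel() for f in features])

# Hypothetical 8-dim features standing in for those named in claim 4.
second_text = np.ones(8)
second_image = np.zeros(8)
third_image = np.full(8, 2.0)
third_text = np.full(8, 3.0)
second_merged = merge_features(second_text, second_image, third_image, third_text)
```

The resulting 32-dimensional vector could then be fed to the labeling model of claim 5, e.g. a linear classifier.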
5. The method according to claim 4, wherein the method further comprises:
importing the second merged feature into a labeling model, and outputting a classification tag corresponding to the second merged feature.
6. The method according to claim 3, wherein importing the first image feature into the attention mechanism model and outputting the third image feature based on the attention of the first image feature comprises:
performing average pooling on the first image feature to obtain a corresponding first image feature vector;
performing a similarity calculation between the first image feature and the first image feature vector to obtain a first attention weight factor of the first image feature;
normalizing the first attention weight factor to obtain a corresponding first attention weight; and
performing a weighted summation on the first attention weight and the first image feature to obtain the third image feature based on the attention of the first image feature.
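The four steps of claim 6 map naturally onto a short self-attention pooling routine. A minimal sketch, assuming the first image feature is an (n, d) array of n spatial positions, dot-product similarity, and softmax normalization (the claim names the steps but not these specific operators):

```python
import numpy as np

def attention_pool(first_feature):
    """Self-attention pooling following the four steps of claim 6."""
    feats = np.asarray(first_feature, dtype=float)  # (n, d) position features
    query = feats.mean(axis=0)                      # 1) average pooling -> feature vector
    factors = feats @ query                         # 2) similarity -> attention weight factors
    weights = np.exp(factors - factors.max())       # 3) normalize (softmax)
    weights /= weights.sum()
    return weights @ feats                          # 4) weighted sum -> third feature

third_image_feature = attention_pool(np.random.randn(49, 32))  # e.g. a 7x7 CNN grid
```

The same routine applied to word-level text features yields the third text feature of claim 7.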
7. The method according to claim 3, wherein importing the first text feature into the attention mechanism model and outputting the third text feature based on the attention of the first text feature comprises:
performing average pooling on the first text feature to obtain a corresponding first text feature vector;
performing a similarity calculation between the first text feature and the first text feature vector to obtain a second attention weight factor of the first text feature;
normalizing the second attention weight factor to obtain a corresponding second attention weight; and
performing a weighted summation on the second attention weight and the first text feature to obtain the third text feature based on the attention of the first text feature.
8. An apparatus for acquiring data features, wherein the apparatus comprises:
an obtaining module, configured to obtain a multimedia sample, the multimedia sample comprising image information and text information;
a first feature extraction module, configured to extract a first image feature of the image information and a first text feature of the text information, respectively; and
a second feature extraction module, configured to import the first image feature and the first text feature into an attention mechanism model, and output a second text feature based on the attention of the first image feature and/or a second image feature based on the attention of the first text feature.
9. An electronic device, comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the method for acquiring data features according to any one of claims 1 to 7.
10. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method for acquiring data features according to any one of claims 1 to 7.
CN201811204515.8A 2018-10-16 2018-10-16 Data feature acquisition method and device, electronic equipment and storage medium Active CN109543714B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811204515.8A CN109543714B (en) 2018-10-16 2018-10-16 Data feature acquisition method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN109543714A true CN109543714A (en) 2019-03-29
CN109543714B CN109543714B (en) 2020-03-27

Family

ID=65844122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811204515.8A Active CN109543714B (en) 2018-10-16 2018-10-16 Data feature acquisition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109543714B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222686A (en) * 2019-05-27 2019-09-10 腾讯科技(深圳)有限公司 Object detecting method, device, computer equipment and storage medium
CN111403028A (en) * 2020-03-19 2020-07-10 医渡云(北京)技术有限公司 Medical text classification method and device, storage medium and electronic equipment
CN111625715A (en) * 2020-05-09 2020-09-04 北京达佳互联信息技术有限公司 Information extraction method and device, electronic equipment and storage medium
CN111737458A (en) * 2020-05-21 2020-10-02 平安国际智慧城市科技股份有限公司 Intention identification method, device and equipment based on attention mechanism and storage medium
WO2020211566A1 (en) * 2019-04-18 2020-10-22 腾讯科技(深圳)有限公司 Method and device for making recommendation to user, computing apparatus, and storage medium
CN111832581A (en) * 2020-09-21 2020-10-27 平安科技(深圳)有限公司 Lung feature recognition method and device, computer equipment and storage medium
CN111914113A (en) * 2020-08-07 2020-11-10 大连理工大学 Image retrieval method and related device
CN112085120A (en) * 2020-09-17 2020-12-15 腾讯科技(深圳)有限公司 Multimedia data processing method and device, electronic equipment and storage medium
CN112100442A (en) * 2020-11-13 2020-12-18 腾讯科技(深圳)有限公司 User tendency recognition method, device, equipment and storage medium
CN112231275A (en) * 2019-07-14 2021-01-15 阿里巴巴集团控股有限公司 Multimedia file classification, information processing and model training method, system and equipment
CN112508077A (en) * 2020-12-02 2021-03-16 齐鲁工业大学 Social media emotion analysis method and system based on multi-modal feature fusion
CN112926671A (en) * 2021-03-12 2021-06-08 云知声智能科技股份有限公司 Image text matching method and device, electronic equipment and storage medium
CN113010740A (en) * 2021-03-09 2021-06-22 腾讯科技(深圳)有限公司 Word weight generation method, device, equipment and medium
WO2021197023A1 (en) * 2020-04-02 2021-10-07 北京字节跳动网络技术有限公司 Multimedia resource screening method and apparatus, electronic device and computer storage medium
CN113826119A (en) * 2019-05-23 2021-12-21 谷歌有限责任公司 Pure attention computer vision
CN114077682A (en) * 2022-01-19 2022-02-22 广州拟实网络科技有限公司 Intelligent recognition matching processing method and system for image retrieval and storage medium
CN114550156A (en) * 2022-02-18 2022-05-27 支付宝(杭州)信息技术有限公司 Image processing method and device

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1349370A2 (en) * 2002-03-29 2003-10-01 Canon Kabushiki Kaisha Image processing
CN102012937A (en) * 2010-12-08 2011-04-13 萨·约翰尼 Method and system for releasing advertisements on image in hyper text document
CN102968637A (en) * 2012-12-20 2013-03-13 山东科技大学 Complicated background image and character division method
CN104657468A (en) * 2015-02-12 2015-05-27 中国科学院自动化研究所 Fast video classification method based on images and texts
CN105205096A (en) * 2015-08-18 2015-12-30 天津中科智能识别产业技术研究院有限公司 Text modal and image modal crossing type data retrieval method
CN106997387A (en) * 2017-03-28 2017-08-01 中国科学院自动化研究所 The multi-modal automaticabstracting matched based on text image
CN107066583A (en) * 2017-04-14 2017-08-18 华侨大学 A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity
CN107491541A (en) * 2017-08-24 2017-12-19 北京丁牛科技有限公司 File classification method and device
CN107562812A (en) * 2017-08-11 2018-01-09 北京大学 A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
CN108090400A (en) * 2016-11-23 2018-05-29 中移(杭州)信息技术有限公司 A kind of method and apparatus of image text identification
CN108228757A (en) * 2017-12-21 2018-06-29 北京市商汤科技开发有限公司 Image search method and device, electronic equipment, storage medium, program
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN108256549A (en) * 2017-12-13 2018-07-06 北京达佳互联信息技术有限公司 Image classification method, device and terminal
CN108399409A (en) * 2018-01-19 2018-08-14 北京达佳互联信息技术有限公司 Image classification method, device and terminal
CN108549850A (en) * 2018-03-27 2018-09-18 联想(北京)有限公司 A kind of image-recognizing method and electronic equipment

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020211566A1 (en) * 2019-04-18 2020-10-22 腾讯科技(深圳)有限公司 Method and device for making recommendation to user, computing apparatus, and storage medium
CN113826119A (en) * 2019-05-23 2021-12-21 谷歌有限责任公司 Pure attention computer vision
CN110222686B (en) * 2019-05-27 2021-05-07 腾讯科技(深圳)有限公司 Object detection method, object detection device, computer equipment and storage medium
CN110222686A (en) * 2019-05-27 2019-09-10 腾讯科技(深圳)有限公司 Object detecting method, device, computer equipment and storage medium
CN112231275B (en) * 2019-07-14 2024-02-27 阿里巴巴集团控股有限公司 Method, system and equipment for classifying multimedia files, processing information and training models
CN112231275A (en) * 2019-07-14 2021-01-15 阿里巴巴集团控股有限公司 Multimedia file classification, information processing and model training method, system and equipment
CN111403028A (en) * 2020-03-19 2020-07-10 医渡云(北京)技术有限公司 Medical text classification method and device, storage medium and electronic equipment
CN111403028B (en) * 2020-03-19 2022-12-06 医渡云(北京)技术有限公司 Medical text classification method and device, storage medium and electronic equipment
WO2021197023A1 (en) * 2020-04-02 2021-10-07 北京字节跳动网络技术有限公司 Multimedia resource screening method and apparatus, electronic device and computer storage medium
CN111625715A (en) * 2020-05-09 2020-09-04 北京达佳互联信息技术有限公司 Information extraction method and device, electronic equipment and storage medium
CN111737458A (en) * 2020-05-21 2020-10-02 平安国际智慧城市科技股份有限公司 Intention identification method, device and equipment based on attention mechanism and storage medium
CN111737458B (en) * 2020-05-21 2024-05-21 深圳赛安特技术服务有限公司 Attention mechanism-based intention recognition method, device, equipment and storage medium
CN111914113A (en) * 2020-08-07 2020-11-10 大连理工大学 Image retrieval method and related device
CN112085120A (en) * 2020-09-17 2020-12-15 腾讯科技(深圳)有限公司 Multimedia data processing method and device, electronic equipment and storage medium
CN112085120B (en) * 2020-09-17 2024-01-02 腾讯科技(深圳)有限公司 Multimedia data processing method and device, electronic equipment and storage medium
CN111832581B (en) * 2020-09-21 2021-01-29 平安科技(深圳)有限公司 Lung feature recognition method and device, computer equipment and storage medium
WO2022057309A1 (en) * 2020-09-21 2022-03-24 平安科技(深圳)有限公司 Lung feature recognition method and apparatus, computer device, and storage medium
CN111832581A (en) * 2020-09-21 2020-10-27 平安科技(深圳)有限公司 Lung feature recognition method and device, computer equipment and storage medium
CN112100442B (en) * 2020-11-13 2021-02-26 腾讯科技(深圳)有限公司 User tendency recognition method, device, equipment and storage medium
CN112100442A (en) * 2020-11-13 2020-12-18 腾讯科技(深圳)有限公司 User tendency recognition method, device, equipment and storage medium
CN112508077A (en) * 2020-12-02 2021-03-16 齐鲁工业大学 Social media emotion analysis method and system based on multi-modal feature fusion
CN112508077B (en) * 2020-12-02 2023-01-03 齐鲁工业大学 Social media emotion analysis method and system based on multi-modal feature fusion
CN113010740A (en) * 2021-03-09 2021-06-22 腾讯科技(深圳)有限公司 Word weight generation method, device, equipment and medium
CN113010740B (en) * 2021-03-09 2023-05-30 腾讯科技(深圳)有限公司 Word weight generation method, device, equipment and medium
CN112926671A (en) * 2021-03-12 2021-06-08 云知声智能科技股份有限公司 Image text matching method and device, electronic equipment and storage medium
CN112926671B (en) * 2021-03-12 2024-04-19 云知声智能科技股份有限公司 Image text matching method and device, electronic equipment and storage medium
CN114077682B (en) * 2022-01-19 2022-04-29 广州拟实网络科技有限公司 Intelligent recognition matching processing method and system for image retrieval and storage medium
CN114077682A (en) * 2022-01-19 2022-02-22 广州拟实网络科技有限公司 Intelligent recognition matching processing method and system for image retrieval and storage medium
CN114550156A (en) * 2022-02-18 2022-05-27 支付宝(杭州)信息技术有限公司 Image processing method and device

Also Published As

Publication number Publication date
CN109543714B (en) 2020-03-27

Similar Documents

Publication Publication Date Title
CN109543714A (en) Acquisition methods, device, electronic equipment and the storage medium of data characteristics
CN109359592A (en) Processing method, device, electronic equipment and the storage medium of video frame
CN110148102B (en) Image synthesis method, advertisement material synthesis method and device
Shi et al. A facial expression recognition method based on a multibranch cross-connection convolutional neural network
CN112200062B (en) Target detection method and device based on neural network, machine readable medium and equipment
CN111476306A (en) Object detection method, device, equipment and storage medium based on artificial intelligence
CN109522424A (en) Processing method, device, electronic equipment and the storage medium of data
CN112036331B (en) Living body detection model training method, device, equipment and storage medium
CN109522945B (en) Group emotion recognition method and device, intelligent device and storage medium
CN112052186B (en) Target detection method, device, equipment and storage medium
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN111209970A (en) Video classification method and device, storage medium and server
CN111950570B (en) Target image extraction method, neural network training method and device
CN109871843A (en) Character identifying method and device, the device for character recognition
CN109615006A (en) Character recognition method and device, electronic equipment and storage medium
CN111491187A (en) Video recommendation method, device, equipment and storage medium
CN112651333B (en) Silence living body detection method, silence living body detection device, terminal equipment and storage medium
CN113269612A (en) Article recommendation method and device, electronic equipment and storage medium
CN115239860A (en) Expression data generation method and device, electronic equipment and storage medium
CN113947613B (en) Target area detection method, device, equipment and storage medium
CN111539256A (en) Iris feature extraction method and device and storage medium
CN113642359B (en) Face image generation method and device, electronic equipment and storage medium
CN113821658A (en) Method, device and equipment for training encoder and storage medium
CN113516665A (en) Training method of image segmentation model, image segmentation method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant