CN110287341A - Data processing method, device and readable storage medium - Google Patents

Data processing method, device and readable storage medium

Info

Publication number
CN110287341A
Authority
CN
China
Prior art keywords
data
feature
information
identification
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910559777.4A
Other languages
Chinese (zh)
Inventor
刘巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910559777.4A
Publication of CN110287341A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

Embodiments of this application disclose a data processing method, a device, and a readable storage medium. The method includes: obtaining a data content feature corresponding to multimedia data; determining, according to the data content feature, a first screening vector for screening features related to identification dimension information; screening out, from the data content feature according to the first screening vector, a target information feature associated with the identification dimension information; and determining, according to the target information feature, the data attribute type in the multimedia data that matches the identification dimension information. The embodiments of this application can improve the accuracy of attribute type classification for multimedia data.

Description

Data processing method, device and readable storage medium
Technical field
This application relates to the field of computer technology, and in particular to a data processing method, a device, and a readable storage medium.
Background
With the rapid development of the internet, a large amount of multimedia data (including text, image, and video data) has been produced. By analyzing this multimedia data, the content that users care about can be extracted, or the sentiment orientation contained in the multimedia data or held by its publisher can be judged. Based on the results of such analysis, users can be helped to make better decisions or take appropriate measures to achieve a greater positive effect.
In the prior art, data processing can extract a data content feature from the multimedia data and then predict the sentiment orientation contained in the multimedia data from the extracted data content feature. Although this scheme achieves sentiment orientation prediction for multimedia data, multimedia data contains a large amount of information, and therefore a large amount of interfering information. This introduces errors into the predicted sentiment orientation and reduces the accuracy of the prediction.
Summary of the invention
Embodiments of this application provide a data processing method, a device, and a readable storage medium that can improve the accuracy of attribute type classification for multimedia data.
In one aspect, an embodiment of this application provides a data processing method, including:
Obtaining a data content feature corresponding to multimedia data;
Determining, according to the data content feature, a first screening vector for screening features related to identification dimension information;
Screening out, from the data content feature according to the first screening vector, a target information feature associated with the identification dimension information;
Determining, according to the target information feature, the data attribute type in the multimedia data that matches the identification dimension information.
Wherein obtaining the data content feature corresponding to the multimedia data includes:
Obtaining the multimedia data, and inputting the multimedia data and the identification dimension information into a data identification model;
Obtaining, in the data identification model, the data content feature corresponding to the multimedia data;
The data identification model includes an input layer, a convolutional layer, a reset gate unit, a first fully connected layer, and an output layer. The input layer is used to input the multimedia data and the identification dimension information; the convolutional layer is used to obtain the data content feature corresponding to the multimedia data; the reset gate unit includes a second fully connected layer, which is used to obtain the first screening vector; the first fully connected layer is used to obtain the target information feature; and the output layer is used to output the data attribute type corresponding to the multimedia data.
Wherein the multimedia data includes text data;
Obtaining, in the data identification model, the data content feature corresponding to the multimedia data includes:
In the data identification model, dividing the text data into a plurality of unit characters, and converting each unit character into a unit word vector;
Splicing the unit word vectors into a text matrix corresponding to the text data;
Performing, based on the convolutional layer, feature extraction on the text matrix to obtain the data content feature corresponding to the text data.
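The steps above (unit word vectors, text matrix, convolutional feature extraction) can be sketched as follows. This is a minimal illustration, not the patented implementation: the random embedding table, the kernel width of 2, the ReLU activation, and max-over-time pooling are all assumptions, and the names `build_text_matrix` and `conv_max_pool` are hypothetical.

```python
import random

def build_text_matrix(tokens, embedding):
    """Stack one word vector per unit character, in sentence order."""
    return [embedding[token] for token in tokens]

def conv_max_pool(text_matrix, kernels, biases):
    """Slide each kernel over token windows, apply ReLU, keep the maximum."""
    features = []
    for kernel, bias in zip(kernels, biases):
        width = len(kernel)  # number of consecutive tokens the kernel covers
        best = 0.0           # starting at 0 makes the result equal max-pooled ReLU
        for start in range(len(text_matrix) - width + 1):
            window = text_matrix[start:start + width]
            score = bias + sum(
                w * x
                for kernel_row, token_vec in zip(kernel, window)
                for w, x in zip(kernel_row, token_vec)
            )
            best = max(best, score)
        features.append(best)
    return features

rng = random.Random(0)  # fixed seed so the toy values are repeatable
tokens = ["taste", "increasingly", "ordinary", "and", "price", "increasingly", "expensive"]
# Assumed toy embedding: 4-dimensional random vectors, one per distinct token.
embedding = {t: [rng.uniform(-1, 1) for _ in range(4)] for t in sorted(set(tokens))}
# Three assumed kernels, each covering 2 tokens x 4 embedding dimensions.
kernels = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)] for _ in range(3)]

text_matrix = build_text_matrix(tokens, embedding)
content_feature = conv_max_pool(text_matrix, kernels, [0.0, 0.0, 0.0])
```

Each kernel contributes one entry of the data content feature, so the feature length equals the number of kernels regardless of sentence length.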
Wherein determining, according to the data content feature, the first screening vector for screening features related to the identification dimension information includes:
Converting the identification dimension information into a target word vector;
Inputting the target word vector and the data content feature into the second fully connected layer, and obtaining, based on a first activation function corresponding to the second fully connected layer, the first screening vector for screening features related to the identification dimension information.
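A minimal sketch of this step follows. The text does not name the first activation function or say how the two inputs are combined; the sketch assumes concatenation followed by a sigmoid, which keeps every entry of the screening vector between 0 and 1. The function names and the toy weights are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(weights, biases, inputs):
    """Fully connected layer: one weighted sum per output unit."""
    return [b + sum(w * x for w, x in zip(row, inputs)) for row, b in zip(weights, biases)]

def first_screening_vector(target_word_vector, content_feature, weights, biases):
    """Assumed reset gate: sigmoid(W [target; content] + b)."""
    joined = list(target_word_vector) + list(content_feature)
    return [sigmoid(v) for v in dense(weights, biases, joined)]

target_word_vector = [1.0, 0.0]       # toy vector for the identification dimension
content_feature = [0.5, -1.0, 2.0]    # toy data content feature
weights = [[0.0] * 5 for _ in range(3)]  # zero weights isolate the effect of the biases
biases = [0.0, 10.0, -10.0]
screening = first_screening_vector(target_word_vector, content_feature, weights, biases)
```

With these toy parameters the first entry is exactly 0.5, the second is close to 1 (keep), and the third is close to 0 (suppress).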
Wherein screening out, from the data content feature according to the first screening vector, the target information feature associated with the identification dimension information includes:
Performing vector point-wise multiplication on the first screening vector and the data content feature to obtain a retained information feature associated with the identification dimension information;
Inputting the retained information feature and the target word vector corresponding to the identification dimension information into the first fully connected layer;
Obtaining, based on a second activation function corresponding to the first fully connected layer, the target information feature in the data content feature.
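The screening step can be sketched as below. The choice of tanh for the second activation function and the concatenation of the retained feature with the target word vector are assumptions; the text specifies neither. All names are hypothetical.

```python
import math

def pointwise(u, v):
    """Element-by-element product of two equal-length vectors."""
    return [a * b for a, b in zip(u, v)]

def target_information_feature(screening, content_feature, target_word_vector, weights, biases):
    retained = pointwise(screening, content_feature)  # entries gated toward 0 are dropped
    joined = retained + list(target_word_vector)
    # Assumed second activation function: tanh.
    return [math.tanh(b + sum(w * x for w, x in zip(row, joined)))
            for row, b in zip(weights, biases)]

screening = [1.0, 0.0, 0.5]
content_feature = [2.0, 3.0, 4.0]
retained = pointwise(screening, content_feature)
# Toy weights that simply pass the three retained entries through.
weights = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]]
feature = target_information_feature(screening, content_feature, [1.0, 0.0], weights, [0.0] * 3)
```

The middle entry of the content feature is multiplied by 0 and therefore contributes nothing to the result, which is the filtering effect the paragraph describes.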
Wherein the data identification model further includes an update gate unit, and the update gate unit includes a third fully connected layer;
The method further includes:
Inputting the target word vector corresponding to the identification dimension information and the data content feature into the third fully connected layer, and obtaining, based on a third activation function corresponding to the third fully connected layer, a second screening vector for screening features related to the identification dimension information;
In that case, obtaining the target information feature in the data content feature based on the second activation function corresponding to the first fully connected layer includes:
Obtaining, based on the second activation function corresponding to the first fully connected layer, a first candidate information feature in the data content feature that is associated with the identification dimension information;
Performing vector point-wise multiplication on the second screening vector and the first candidate information feature to obtain the target information feature.
Wherein performing vector point-wise multiplication on the second screening vector and the first candidate information feature to obtain the target information feature includes:
Performing vector point-wise multiplication on the second screening vector and the first candidate information feature to obtain a second candidate information feature;
Determining a global information feature based on the second screening vector and the data content feature;
Determining the target information feature according to the second candidate information feature and the global information feature.
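One way to read the last three steps is the interpolation used by the update gate of a gated recurrent unit. The sketch below assumes exactly that: the global information feature is taken as (1 - z) times the data content feature, and the target information feature is the sum of the two parts. The text states neither formula, so both are assumptions, and `combine_with_update_gate` is a hypothetical name.

```python
def combine_with_update_gate(second_screening, first_candidate, content_feature):
    """Assumed GRU-style mix: z * candidate + (1 - z) * content."""
    second_candidate = [z * h for z, h in zip(second_screening, first_candidate)]
    global_feature = [(1.0 - z) * c for z, c in zip(second_screening, content_feature)]
    return [s + g for s, g in zip(second_candidate, global_feature)]

z = [1.0, 0.0, 0.5]                # second screening vector
first_candidate = [0.2, 0.4, 0.6]  # first candidate information feature
content_feature = [1.0, 2.0, 3.0]  # data content feature
target = combine_with_update_gate(z, first_candidate, content_feature)
```

Under this assumption, a gate value of 1 keeps only the candidate feature, 0 keeps only the original content feature, and intermediate values blend the two.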
Wherein determining, according to the target information feature, the data attribute type in the multimedia data that matches the identification dimension information includes:
Inputting the target information feature into a classifier in the output layer;
Identifying, based on the classifier, the matching probabilities between the target information feature and the multiple attribute types in the classifier, and determining the attribute type with the maximum matching probability as the data attribute type in the multimedia data that matches the identification dimension information.
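A minimal sketch of such a classifier follows, assuming a linear layer followed by softmax; the text only says that the classifier yields matching probabilities. The label names, weights, and the function name `classify` are illustrative assumptions.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(target_feature, weights, biases, labels):
    """Return the label with the highest matching probability and all probabilities."""
    scores = [b + sum(w * x for w, x in zip(row, target_feature))
              for row, b in zip(weights, biases)]
    probabilities = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], dict(zip(labels, probabilities))

labels = ["positive emotion", "negative emotion", "neutral emotion"]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy weights, one row per attribute type
label, probabilities = classify([0.2, 1.5], weights, [0.0, 0.0, 0.0], labels)
```

Here the raw scores are 0.2, 1.5, and 0.85, so the second attribute type receives the largest probability and is returned.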
Wherein the method further includes:
Obtaining sample multimedia data and at least one attribute type label corresponding to the sample multimedia data, where the attribute type label characterizes the data attribute type corresponding to the sample multimedia data;
Obtaining, from the sample multimedia data, sample multimedia sub-data associated with the at least one attribute type label, where each piece of sample multimedia sub-data corresponds to one attribute type label;
Training the data identification model based on the mapping relationship between the sample multimedia sub-data and the at least one attribute type label.
Wherein obtaining, from the sample multimedia data, the sample multimedia sub-data associated with the at least one attribute type label includes:
Adjusting the sample multimedia data to a target size, and determining the size-adjusted sample multimedia data as target sample data;
Generating a sample data matrix corresponding to the target sample data, and obtaining, from the sample data matrix, the sample multimedia sub-data associated with the at least one attribute type label.
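For text samples, adjusting to a target size can be read as truncating or padding the token sequence to a fixed length before building the sample data matrix. The sketch below assumes that reading; the padding token, the zero vector used for padding, and both function names are hypothetical, and the extraction of sub-data per label is not shown because the text does not describe it.

```python
def adjust_to_target_size(tokens, target_size, pad_token="<pad>"):
    """Truncate long samples and pad short ones so every sample has target_size tokens."""
    clipped = list(tokens[:target_size])
    return clipped + [pad_token] * (target_size - len(clipped))

def sample_data_matrix(tokens, embedding, dim):
    """One row per token; tokens without an embedding (including padding) map to zeros."""
    zero = [0.0] * dim
    return [embedding.get(token, zero) for token in tokens]

sample = ["taste", "good", "but", "service", "disappointing"]
# Toy 2-dimensional embedding: token i maps to [i, i].
embedding = {t: [float(i), float(i)] for i, t in enumerate(sample, start=1)}
target_sample = adjust_to_target_size(sample, 8)
matrix = sample_data_matrix(target_sample, embedding, 2)
```

A fixed target size gives every sample matrix the same shape, which is convenient when samples are batched for training.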
In one aspect, an embodiment of this application provides a data processing apparatus, including:
A data acquisition module, configured to obtain a data content feature corresponding to multimedia data;
A first determining module, configured to determine, according to the data content feature, a first screening vector for screening features related to identification dimension information;
A screening module, configured to screen out, from the data content feature according to the first screening vector, a target information feature associated with the identification dimension information;
A second determining module, configured to determine, according to the target information feature, the data attribute type in the multimedia data that matches the identification dimension information.
Wherein the data acquisition module includes:
A data input unit, configured to obtain the multimedia data and input the multimedia data and the identification dimension information into a data identification model;
A content feature acquiring unit, configured to obtain, in the data identification model, the data content feature corresponding to the multimedia data;
The data identification model includes an input layer, a convolutional layer, a reset gate unit, a first fully connected layer, and an output layer. The input layer is used to input the multimedia data and the identification dimension information; the convolutional layer is used to obtain the data content feature corresponding to the multimedia data; the reset gate unit includes a second fully connected layer, which is used to obtain the first screening vector; the first fully connected layer is used to obtain the target information feature; and the output layer is used to output the data attribute type corresponding to the multimedia data.
Wherein the multimedia data includes text data, and the content feature acquiring unit includes:
A first conversion subunit, configured to divide, in the data identification model, the text data into a plurality of unit characters and convert each unit character into a unit word vector;
A splicing subunit, configured to splice the unit word vectors into a text matrix corresponding to the text data;
A feature extraction subunit, configured to perform, based on the convolutional layer, feature extraction on the text matrix to obtain the data content feature corresponding to the text data.
Wherein the first determining module includes:
A second conversion unit, configured to convert the identification dimension information into a target word vector;
A first screening vector determining unit, configured to input the target word vector and the data content feature into the second fully connected layer and obtain, based on a first activation function corresponding to the second fully connected layer, the first screening vector for screening features related to the identification dimension information.
Wherein the screening module includes:
A point-wise multiplication unit, configured to perform vector point-wise multiplication on the first screening vector and the data content feature to obtain a retained information feature associated with the identification dimension information;
A first input unit, configured to input the retained information feature and the target word vector corresponding to the identification dimension information into the first fully connected layer;
A target feature determining unit, configured to obtain, based on a second activation function corresponding to the first fully connected layer, the target information feature in the data content feature.
Wherein the data identification model further includes an update gate unit, and the update gate unit includes a third fully connected layer;
The apparatus further includes:
A third determining module, configured to input the target word vector corresponding to the identification dimension information and the data content feature into the third fully connected layer and obtain, based on a third activation function corresponding to the third fully connected layer, a second screening vector for screening features related to the identification dimension information;
In that case, the target feature determining unit includes:
A first candidate information determining subunit, configured to obtain, based on the second activation function corresponding to the first fully connected layer, a first candidate information feature in the data content feature that is associated with the identification dimension information;
A first feature determining subunit, configured to perform vector point-wise multiplication on the second screening vector and the first candidate information feature to obtain the target information feature.
Wherein the first feature determining subunit includes:
A second candidate information determining subunit, configured to perform vector point-wise multiplication on the second screening vector and the first candidate information feature to obtain a second candidate information feature;
A global feature determining subunit, configured to determine a global information feature based on the second screening vector and the data content feature;
A second feature determining subunit, configured to determine the target information feature according to the second candidate information feature and the global information feature.
Wherein the second determining module includes:
A second input unit, configured to input the target information feature into a classifier in the output layer;
An attribute type determining unit, configured to identify, based on the classifier, the matching probabilities between the target information feature and the multiple attribute types in the classifier, and determine the attribute type with the maximum matching probability as the data attribute type in the multimedia data that matches the identification dimension information.
Wherein the apparatus further includes:
A sample data obtaining module, configured to obtain sample multimedia data and at least one attribute type label corresponding to the sample multimedia data, where the attribute type label characterizes the data attribute type corresponding to the sample multimedia data;
A sample sub-data obtaining module, configured to obtain, from the sample multimedia data, sample multimedia sub-data associated with the at least one attribute type label, where each piece of sample multimedia sub-data corresponds to one attribute type label;
A training module, configured to train the data identification model based on the mapping relationship between the sample multimedia sub-data and the at least one attribute type label.
Wherein the sample sub-data obtaining module includes:
A size adjustment module, configured to adjust the sample multimedia data to a target size and determine the size-adjusted sample multimedia data as target sample data;
A generation module, configured to generate a sample data matrix corresponding to the target sample data and obtain, from the sample data matrix, the sample multimedia sub-data associated with the at least one attribute type label.
In one aspect, an embodiment of this application provides a data processing apparatus, including a processor and a memory;
The processor is connected to the memory, the memory is configured to store a computer program, and the processor is configured to call the computer program to perform the method of the foregoing aspect of the embodiments of this application.
In one aspect, an embodiment of this application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program that includes program instructions, and the program instructions, when executed by a processor, perform the method of the foregoing aspect of the embodiments of this application.
In the embodiments of this application, the data content feature corresponding to the multimedia data is obtained, and a first screening vector for screening features related to the identification dimension information is determined according to the data content feature. The target information feature associated with the identification dimension information can then be screened out from the data content feature according to the first screening vector, and the data attribute type in the multimedia data that matches the identification dimension information is determined according to the target information feature. Because the identification dimension information serves as additional input, the target information feature associated with it can be extracted from the multimedia data, and the data attribute type of the multimedia data is judged on the basis of that feature. This avoids interference from the remaining information in the multimedia data and thereby improves the accuracy of attribute type classification for multimedia data.
Brief description of the drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Clearly, the drawings described below show only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1a is a network architecture diagram provided by an embodiment of this application;
Fig. 1b is a schematic diagram of a scenario of a data sentiment analysis method provided by an embodiment of this application;
Fig. 2 is a schematic flowchart of a data processing method provided by an embodiment of this application;
Fig. 3 is a schematic structural diagram of a data identification model provided by an embodiment of this application;
Fig. 4 is a schematic flowchart of another data processing method provided by an embodiment of this application;
Fig. 5 is a schematic structural diagram of another data identification model provided by an embodiment of this application;
Fig. 6 is a schematic diagram of a scenario of another data sentiment analysis method provided by an embodiment of this application;
Fig. 7 is a schematic diagram of a scenario of another data sentiment analysis method provided by an embodiment of this application;
Fig. 8 is a schematic structural diagram of a data processing apparatus provided by an embodiment of this application;
Fig. 9 is a schematic structural diagram of another data processing apparatus provided by an embodiment of this application.
Detailed description of the embodiments
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings in the embodiments of this application. Clearly, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
Refer to Fig. 1a, which is a network architecture diagram provided by an embodiment of this application. The network architecture may include server 200a and multiple terminal devices (as shown in Fig. 1a, specifically terminal device 100a, terminal device 100b, and terminal device 100c); server 200a can exchange data with each terminal device over the network.
Taking terminal device 100a as an example, when terminal device 100a obtains multimedia data posted by a user, it can send the obtained multimedia data to server 200a. Server 200a can extract from the multimedia data the target information feature associated with the identification dimension information, and then determine the data attribute type corresponding to the multimedia data according to the target information feature; for example, the data attribute type may be positive emotion, negative emotion, neutral emotion, and so on. Server 200a can send the determined data attribute type to terminal device 100a, and terminal device 100a can subsequently perform data statistics based on that data attribute type.
Certainly, if terminal device 100a integrates a multimedia data identification function, terminal device 100a may itself extract the target information feature from the multimedia data and determine the data attribute type of the multimedia data based on the target information feature. The following description takes as an example how terminal device 100a extracts the target information feature and how it determines the data attribute type. Terminal device 100a, terminal device 100b, terminal device 100c, and so on may include mobile phones, tablet computers, laptops, palmtop computers, mobile internet devices (MID), wearable devices (such as smart watches and smart bracelets), and the like.
Refer to Fig. 1b, which is a schematic diagram of a data sentiment analysis scenario provided by an embodiment of this application. Take an internet restaurant as an example: after eating or ordering at the restaurant, a user can post a comment about the restaurant on its web page, and the comment may include evaluations of the restaurant's service, food, taste, price, and so on. To understand how all users evaluate the various aspects of the restaurant, sentiment analysis can be performed on the user comments, and scores for each aspect of the restaurant can then be given. As shown in restaurant comment 10b in Fig. 1b, the comment posted by user 1 is "the taste is increasingly ordinary and the price is increasingly expensive". Terminal device 100a can treat this comment as text data 10c to be analyzed, which may also be called multimedia data.
Because text data 10c to be analyzed is written in Chinese, and Chinese sentences have no delimiters separating the words, terminal device 100a needs to apply a Chinese word segmentation algorithm to text data 10c to obtain the corresponding unit character set 10d: "taste", "increasingly", "ordinary", "and", "price", "increasingly", "expensive". The Chinese word segmentation algorithm may be a dictionary-based segmentation algorithm, a statistics-based segmentation algorithm, or the like, which is not limited here.
Taking the statistics-based segmentation algorithm as an example, text data 10c to be analyzed is taken as input, and a sequence string composed of the tags "BEMS" is output, here "BEBMEBESBEBMES". Text data 10c is then cut into words according to this sequence string, yielding unit character set 10d. Here B indicates that a character begins a word, M that it is in the middle of a word, E that it ends a word, and S that it forms a single-character word.
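The cutting step can be sketched as follows. The tag sequence is the one quoted above; the Chinese sentence is a hypothetical reconstruction of the comment (the translated text does not show the original characters), chosen so that its 14 characters line up with the 14 tags. The function name `cut_by_bmes` is also hypothetical.

```python
def cut_by_bmes(text, tags):
    """Cut text into words using one B/M/E/S tag per character."""
    if len(text) != len(tags):
        raise ValueError("need exactly one tag per character")
    words, current = [], ""
    for char, tag in zip(text, tags):
        current += char
        if tag in ("E", "S"):  # E closes a multi-character word, S is a whole word
            words.append(current)
            current = ""
    if current:                # tolerate a sequence that ends mid-word
        words.append(current)
    return words

# Assumed reconstruction of the comment in Chinese (not shown in the source text).
text = "味道越来越一般而价格越来越贵"
words = cut_by_bmes(text, "BEBMEBESBEBMES")
print(words)  # ['味道', '越来越', '一般', '而', '价格', '越来越', '贵']
```

The seven resulting words correspond one-to-one to the seven entries of unit character set 10d.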
Because unit character set 10d is expressed in natural language, terminal device 100a can use word embedding to convert each unit character in unit character set 10d into a word vector that a computer can process, that is, a numerical representation in which each unit character becomes a vector of fixed length. Terminal device 100a can convert the unit character "taste" into word vector 10e, the unit character "increasingly" into word vector 10f, the unit character "expensive" into word vector 10g, and so on, and then splice the word vectors of all unit characters in unit character set 10d into the text matrix corresponding to text data 10c to be analyzed. The word vectors may be spliced in the order in which the unit characters appear in text data 10c.
Terminal device 100a can obtain data identification model 10j, which can identify, from the text matrix, the data attribute type for the identification dimension information; the data attribute types may include negative emotion, positive emotion, neutral emotion, and so on. Data identification model 10j may include a convolutional layer, a gate control unit, a first fully connected layer, and a classifier (that is, the output layer). The convolutional layer performs convolution and pooling operations on the input text matrix; these operations extract the semantic feature vector of the text matrix, which is also called the data content feature. The gate control unit takes the semantic feature vector output by the convolutional layer and the target word vector corresponding to the identification dimension information as input, and obtains a screening vector for screening features related to the identification dimension information. The first fully connected layer is used to obtain, from the semantic feature vector, the feature information associated with the identification dimension information.
Identification dimension information refers to the specific object of sentiment orientation analysis in data sentiment analysis. It constrains the range from which the target information feature is extracted from the text data, playing a role similar to a threshold. In different application scenarios, identification dimension information can denote different objects. For example, in the internet restaurant scenario it can be any one of objects such as "taste", "service", "price", and "environment"; in a film review scenario it can be any one of objects such as "plot", "music", "costumes", "actors", and "director". One piece of text data may contain evaluations of multiple objects, and the sentiment orientation may differ from object to object. For example, the text data "the taste is good but the service is too disappointing" evaluates two objects, "taste" and "service": the sentiment orientation toward "taste" is positive, and the sentiment orientation toward "service" is negative. When the identification dimension information input to data identification model 10j is "taste", data identification model 10j can filter out the features of this text data that relate to the object "service", so that the retained features are strongly correlated with the object "taste" and the resulting sentiment orientation is the one related to "taste". When the identification dimension information input to data identification model 10j is "service", data identification model 10j can filter out the features that relate to the object "taste", so that the retained features are strongly correlated with the object "service" and the resulting sentiment orientation is the one related to "service". Therefore, to analyze users' sentiment toward the various aspects of the restaurant more accurately, the identification dimension information can be input to data identification model 10j as additional information, so that the sentiment orientation in the text data is analyzed for a single piece of identification dimension information.
User terminal 10a can obtain identification dimension information "taste 10h" and, based on word embedding, convert it into target word vector 10i. After the convolutional layer in data identification model 10j extracts the semantic feature vector, the semantic feature vector and target word vector 10i can be input to the gating unit. Based on the screening vector obtained by the gating unit and the target word vector, the target information feature for identification dimension information "taste 10h" in text data 10c to be analyzed is finally obtained. The classifier in data identification model 10j can identify the degree of match between the target information feature and multiple attribute types, i.e. the matching probability between the target information feature and each attribute type. As shown in Fig. 1b, for identification dimension information "taste 10h" in text data 10c to be analyzed, the matching probability with the "negative sentiment" attribute type is 0.80, the matching probability with the "positive sentiment" attribute type is 0.03, and the matching probability with the "neutral sentiment" attribute type is 0.17.
Based on these matching probabilities, terminal device 100a can determine that the data attribute type corresponding to identification dimension information "taste 10h" in text data 10c to be analyzed is negative sentiment; that is, text data 10c to be analyzed expresses negative sentiment toward "taste 10h".
Following the above process, terminal device 100a can obtain the data attribute type (i.e. sentiment orientation) of each user of the restaurant for each piece of identification dimension information, then aggregate the sentiment orientation toward each piece of identification dimension information across all user comments and score each aspect of the restaurant for later users' reference.
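The aggregation described above can be sketched as follows. The text does not specify a scoring rule, so the share of positive comments per dimension used here is an assumption, as are the function name `score_aspects` and the sample predictions.

```python
from collections import Counter, defaultdict


def score_aspects(predictions):
    """Aggregate (dimension, sentiment) pairs predicted for individual comments."""
    counts = defaultdict(Counter)
    for dimension, sentiment in predictions:
        counts[dimension][sentiment] += 1
    # Hypothetical score: fraction of positive predictions for each dimension.
    return {d: c["positive"] / sum(c.values()) for d, c in counts.items()}


# Hypothetical per-comment predictions for two identification dimensions.
predictions = [
    ("taste", "negative"),
    ("taste", "positive"),
    ("service", "positive"),
    ("service", "positive"),
]
scores = score_aspects(predictions)
print(scores)
```

Any other statistic over the per-dimension counts (for example, positive minus negative share) would fit the same structure.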
Refer to Fig. 2, which is a schematic flowchart of a data processing method provided by an embodiment of the present application. As shown in Fig. 2, the data processing method may include the following steps:
Step S101: obtain the data content feature corresponding to multimedia data;
Specifically, a terminal device (such as terminal device 100a in the embodiment corresponding to Fig. 1b) can obtain multimedia data, which may be any of text data (such as text data 10c to be analyzed in the embodiment corresponding to Fig. 1b), image data, and video data. The terminal device also needs to preprocess the multimedia data according to its type.
When the multimedia data obtained by the terminal device is text data (text data here can be understood as a single independent sentence), the terminal device needs to split the text data into multiple unit characters (such as the unit characters in unit character set 10d in the embodiment corresponding to Fig. 1b), convert each unit character into a unit word vector (such as word vector 10e, word vector 10d, and so on in the embodiment corresponding to Fig. 1b), and then concatenate the resulting unit word vectors into a text matrix.
The detailed process by which the terminal device converts text data into a text matrix may include the following. The terminal device can label the character sequence of the text data based on a hidden Markov model (HMM) and then segment the text data according to the label sequence to obtain multiple unit characters. An HMM can be described by a five-tuple: the observation sequence, the hidden sequence, the initial probabilities of the hidden states, the transition probabilities between hidden states, and the emission probabilities (the probability that a hidden state produces an observation). The initial, transition, and emission probabilities can be obtained by statistics over a large corpus. Starting from the initial hidden state, the probability of the next hidden state is calculated, and the transition probabilities of all subsequent hidden states are calculated in turn; the hidden state sequence with the highest probability is finally taken as the hidden sequence, i.e. the sequence labeling result (which may be called the BEMS label sequence, where B marks the beginning of a word, M the middle, E the end, and S a single-character word).
For example, for the text data "we are Chinese", the HMM yields the labeling result BESBME. Because a sentence can only end in E or S, and each E or S closes a word, the segmentation is BE/S/BME, so the text data is segmented as we / are / Chinese and the resulting unit characters are "we", "are", and "Chinese". Text data may of course also be written in a language such as English; in that case the words in the word sequence are naturally delimited by spaces, so the text can be split directly and the process is simpler.
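The last step of the segmentation, turning a BEMS label sequence into unit characters, can be sketched as below. The sketch assumes the label sequence has already been produced by the HMM, and it assumes the example sentence is the six-character Chinese string shown, which matches the six labels BESBME; the function name `cut_by_tags` is hypothetical.

```python
from typing import List


def cut_by_tags(text: str, tags: str) -> List[str]:
    """Split text into unit characters using one B/M/E/S label per character."""
    if len(text) != len(tags):
        raise ValueError("text and tags must have the same length")
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        # E closes a multi-character word; S is a complete single-character word.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a label sequence that ends on B or M
        words.append(current)
    return words


# Assumed original of the "we are Chinese" example, with its label sequence.
print(cut_by_tags("我们是中国人", "BESBME"))
```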
Next, the terminal device can look up, in a character bag-of-words, the one-hot code corresponding to each unit character, which may be called the first initial vector. The character bag-of-words contains a series of unit characters from the text data together with the one-hot code of each unit character; a one-hot code is a vector that contains a single 1 with all other entries 0. Continuing the example above, the unit characters of the text data are "we", "are", and "Chinese". If the character bag-of-words contains only these three unit characters, the one-hot code of "we" can be expressed as [1,0,0], that of "are" as [0,1,0], and that of "Chinese" as [0,0,1].
If one-hot codes were used directly as the unit word vectors of the unit characters, the relationships between unit characters (such as their positional and semantic relationships in the text data) could not be learned, and when the character bag-of-words contains many unit characters the dimension of a one-hot unit word vector would be very large. The terminal device can therefore obtain a unit word vector conversion model to reduce the high-dimensional first initial vector to a low-dimensional word vector: the input first initial vector is multiplied by the weight matrix of the hidden layer in the unit word vector conversion model, and the product is the unit word vector of the unit character. The unit word vector conversion model can be trained with word2vec (a word vector conversion model) or GloVe (a word embedding tool). The number of rows of the weight matrix equals the dimension of the first initial vector, and the number of columns equals the dimension of the unit word vector. For example, if the first initial vector of a unit character has size 1 × 100 and the weight matrix has size 100 × 10, the unit word vector has size 1 × 10.
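A small sketch of this dimension reduction follows. It shows that multiplying a one-hot row vector by the weight matrix simply selects one row of that matrix. The three-word vocabulary comes from the example above; the 3 × 2 weight values are made up for illustration and are not trained word2vec or GloVe weights.

```python
def one_hot(index, size):
    """Return a vector with a single 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat(vec, mat):
    """Multiply a row vector (1 x n) by a matrix (n x d), giving a row vector (1 x d)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


vocab = {"we": 0, "are": 1, "Chinese": 2}  # character bag-of-words
# Hypothetical hidden-layer weights: rows = vocabulary size, columns = word vector dimension.
W = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]

first_initial_vector = one_hot(vocab["are"], len(vocab))  # [0.0, 1.0, 0.0]
unit_word_vector = vec_mat(first_initial_vector, W)       # equals row 1 of W
print(unit_word_vector)
```

Because the product only picks out a row, practical implementations usually replace the multiplication with a table lookup.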
The terminal device can combine the unit word vectors into a text matrix that represents the text data. If multiple pieces of text data (i.e. multiple sentences) are obtained, their word sequences may differ in length; that is, each piece of text data is split into a different number of unit characters, so the corresponding text matrices have different dimensions. The text matrices therefore need to be zero-padded so that they all have the same size.
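Zero padding can be sketched as below. The text only requires that all text matrices end up the same size; padding to the longest sentence in the batch is an assumption made here, and the function names are hypothetical.

```python
def pad_matrix(matrix, target_rows):
    """Append all-zero rows until the matrix has `target_rows` rows."""
    if len(matrix) > target_rows:
        raise ValueError("matrix has more rows than the target size")
    dim = len(matrix[0])
    padding = [[0.0] * dim for _ in range(target_rows - len(matrix))]
    return [list(row) for row in matrix] + padding


def pad_batch(matrices):
    """Pad every text matrix to the row count of the longest one (assumed policy)."""
    target_rows = max(len(m) for m in matrices)
    return [pad_matrix(m, target_rows) for m in matrices]


short = [[0.1, 0.2]]                          # sentence with 1 unit character
long = [[0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]   # sentence with 3 unit characters
padded = pad_batch([short, long])
print(padded[0])
```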
When the multimedia data obtained by the terminal device is image data, the terminal device can resize the image data to a fixed size. If the image data is an RGB image (i.e. a color image, where R, G, and B denote the red, green, and blue components), the RGB image can be converted to a grayscale image and the grayscale image normalized, i.e. each pixel value in the image data is mapped to the range 0 to 1. If the image data is already a grayscale image, the resized grayscale image can be normalized directly. The input matrix corresponding to the image data is thereby obtained.
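The grayscale conversion and normalization can be sketched as follows. The text does not say how the three color components are combined; the common luminance weights 0.299, 0.587, and 0.114 and the 8-bit maximum of 255 are assumptions, and resizing is omitted for brevity.

```python
def to_gray(rgb_image):
    """Convert rows of (R, G, B) pixels to grayscale using assumed luminance weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb_image]


def normalize(gray_image, max_value=255.0):
    """Map pixel values from [0, max_value] to [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in gray_image]


# A 1 x 2 image: one black pixel and one white pixel (8-bit values assumed).
rgb = [[(0, 0, 0), (255, 255, 255)]]
input_matrix = normalize(to_gray(rgb))
print(input_matrix)
```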
When the multimedia data obtained by the terminal device is video data, the terminal device can split the video data into individual frames and preprocess each frame in the same way as image data above, thereby obtaining the input matrix corresponding to each frame.
The terminal device can obtain identification dimension information (such as identification dimension information "taste 10h" in the embodiment corresponding to Fig. 1b above). The identification dimension information is used to screen out, from the multimedia data, the features strongly correlated with it. The multimedia data may or may not contain the object indicated by the identification dimension information.
For example, when the multimedia data is a user's comment on a restaurant, the user may evaluate the restaurant's food, environment, service, price, and so on in the comment. In other words, the terminal device can predict from the comment data the user's sentiment orientation toward the restaurant's food, environment, service, price, and so on, and the identification dimension information may include "food", "environment", "service", and "price". If a piece of multimedia data contains only an evaluation of the object "food" and the input identification dimension information is "service", the data identification model can filter out that multimedia data (in other words, the model cannot obtain any feature strongly correlated with identification dimension information "service", and the classifier finds no matching data attribute type). If a piece of multimedia data contains evaluations of both "food" and "service" and the input identification dimension information is "service", the data identification model can filter out the features related to the object "food", so that the retained features are strongly correlated with identification dimension information "service"; the classifier can then predict the data attribute type corresponding to those features.
Likewise, the terminal device can look up the one-hot code corresponding to the identification dimension information in a dimension information bag-of-words; this one-hot code may be called the second initial vector. The dimension information bag-of-words may contain multiple pieces of identification dimension information and the one-hot code of each. Based on a target word vector conversion model, the terminal device can reduce the high-dimensional second initial vector to a low-dimensional target word vector: the input second initial vector is multiplied by the weight matrix of the hidden layer in the target word vector conversion model, and the product is the target word vector corresponding to the identification dimension information. The target word vector conversion model can be trained with word2vec (a word vector conversion model) or GloVe (a word embedding tool).
The terminal device can obtain a data identification model (such as data identification model 10j in the embodiment corresponding to Fig. 1b) that can identify the data attribute type for the identification dimension information in the multimedia data. The data identification model may include a convolutional layer, a reset gate unit, a first fully connected layer, and an output layer (classifier). The preprocessed multimedia data (such as the text matrix obtained by word vector conversion, or the input matrix of image or video data after preprocessing such as grayscale conversion and normalization) is input to the data identification model and first passes through the convolutional layer. The convolutional layer has one or more convolution kernels (a kernel may also be called a filter or a receptive field). In the convolution operation, the kernel is multiplied element-wise with the submatrix at each position of the input matrix and the products are summed. The number of rows H_out and the number of columns W_out of the output matrix are jointly determined by the size of the input matrix, the size of the kernel, the stride, and the padding: H_out = (H_in - H_kernel + 2 × padding) / stride + 1 and W_out = (W_in - W_kernel + 2 × padding) / stride + 1, where H_in and H_kernel denote the number of rows of the input matrix and of the kernel, and W_in and W_kernel denote the number of columns of the input matrix and of the kernel.
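The output-size formula and the sliding-window operation can be checked with a short sketch. It assumes integer (floor) division when the stride does not divide evenly, which the text does not state, and the naive `conv2d` below handles only the unpadded case; both function names are hypothetical.

```python
def conv_output_size(size_in, size_kernel, stride=1, padding=0):
    """Rows or columns of the output: (in - kernel + 2 * padding) / stride + 1."""
    return (size_in - size_kernel + 2 * padding) // stride + 1


def conv2d(matrix, kernel, stride=1):
    """Unpadded convolution: element-wise multiply and sum at each kernel position."""
    h_out = conv_output_size(len(matrix), len(kernel), stride)
    w_out = conv_output_size(len(matrix[0]), len(kernel[0]), stride)
    output = []
    for i in range(h_out):
        row = []
        for j in range(w_out):
            acc = 0.0
            for a in range(len(kernel)):
                for b in range(len(kernel[0])):
                    acc += matrix[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        output.append(row)
    return output


matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
kernel = [[1, 1],
          [1, 1]]
print(conv_output_size(3, 2))   # rows of the output matrix
print(conv2d(matrix, kernel))
```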
After convolution, the output matrix is pooled by a pooling layer. Pooling computes aggregate statistics over the extracted output matrix and may be average pooling or max pooling. Average pooling computes the mean of each row (or column) of the output matrix to represent that row (or column); max pooling takes the maximum value of each row (or column) of the output matrix to represent that row (or column). Convolution and pooling together extract the most salient data content feature of the multimedia data, which may also be called the data content feature vector.
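Both pooling variants, applied per row as described, can be sketched in a few lines. The function names and the sample matrix are illustrative only.

```python
def max_pool_rows(matrix):
    """Represent each row of the output matrix by its maximum value."""
    return [max(row) for row in matrix]


def avg_pool_rows(matrix):
    """Represent each row of the output matrix by its mean value."""
    return [sum(row) / len(row) for row in matrix]


conv_output = [[1.0, 5.0, 3.0],
               [2.0, 2.0, 8.0]]
print(max_pool_rows(conv_output))
print(avg_pool_rows(conv_output))
```

Pooling per column works the same way on the transposed matrix, for example `max_pool_rows(list(zip(*conv_output)))`.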
Step S102: determine, according to the data content feature, a first screening vector for screening features related to the identification dimension information;
Specifically, the terminal device can input the data content feature output by the convolutional layer in the data identification model, together with the target word vector corresponding to the identification dimension information, to the reset gate unit (which contains a second fully connected layer). Based on the first activation function of the second fully connected layer, the first screening vector for screening information associated with the identification dimension information is obtained. The reset gate unit controls the degree to which information in the data content feature that is unrelated to the identification dimension information is ignored: the smaller the value of the reset gate, the more information in the data content feature is ignored, i.e. the more information unrelated to the identification dimension information is discarded. The first screening vector R (the value of the reset gate) is jointly determined by the input data content feature X_s, the target word vector W_a, the weight matrix W_r of the second fully connected layer, and the first activation function Sigmoid, namely:
$$R = \sigma\left(W_r\,[X_s, W_a]\right) \tag{1}$$
Here [ , ] denotes vector concatenation and σ denotes the first activation function Sigmoid. The Sigmoid function keeps the network output between 0 and 1, so the value of the reset gate stays between 0 and 1, which suits data screening: when the value of the reset gate is 0, any number multiplied by 0 is 0, so that part of the data is discarded; when the value of the reset gate is 1, any number multiplied by 1 is itself, so that part of the data is fully retained. The Sigmoid function in formula (1) can also be replaced by the Tanh function. The main difference between the two is that Tanh takes values between -1 and 1 rather than between 0 and 1; it can likewise be used to screen data.
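Formula (1) can be sketched in plain Python as below. The text does not give the shape of the weight matrix; the sketch assumes it has one row per element of the data content feature, so that the screening vector can later be multiplied element-wise with that feature. All numeric values are made up.

```python
import math


def sigmoid(x):
    """Logistic function, with output strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def reset_gate(x_s, w_a, W_r):
    """First screening vector R = sigmoid(W_r [X_s, W_a])."""
    concat = x_s + w_a  # list concatenation stands in for [X_s, W_a]
    return [sigmoid(sum(w * v for w, v in zip(row, concat))) for row in W_r]


x_s = [0.5, -1.0]   # data content feature (hypothetical values)
w_a = [0.2]         # target word vector (hypothetical values)
# Assumed shape: len(x_s) rows, len(x_s) + len(w_a) columns.
W_r = [[1.0, 0.0, 2.0],
       [0.0, 3.0, -1.0]]
R = reset_gate(x_s, w_a, W_r)
print(R)
```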
Step S103: screen out, from the data content feature according to the first screening vector, the target information feature associated with the identification dimension information;
Specifically, the terminal device can multiply the first screening vector and the data content feature element-wise to obtain, from the data content feature, the retained information feature associated with the identification dimension information. The retained information feature and the target word vector are input to the first fully connected layer in the data identification model, and the second activation function of the first fully connected layer yields the target information feature for the identification dimension information in the data content feature. The target information feature can be understood as a new feature vector that combines the retained information feature and the identification dimension information; it is the output vector of the first fully connected layer, namely:
$$O = \mathrm{ReLU}\left(W_H\,[R * X_s, W_a]\right) \tag{2}$$
Here [ , ] denotes vector concatenation, * denotes element-wise multiplication (also called the Hadamard product, i.e. corresponding elements are multiplied), O denotes the output vector of the first fully connected layer, i.e. the target information feature, W_H denotes the weight matrix of the first fully connected layer, and R_s = R * X_s denotes the retained information feature above. ReLU (Rectified Linear Unit) denotes the second activation function: it outputs 0 for inputs below 0 and has slope 1 for inputs above 0, i.e. it outputs the input itself, which can reduce the interdependence between parameters in the data identification model.
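Formula (2) can be sketched in the same style. The screening vector, feature values, and weight matrix below are hypothetical, chosen so that the gating effect is easy to see: a gate value of 0 removes the second feature element entirely.

```python
def relu(x):
    """Rectified linear unit: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)


def target_information_feature(r, x_s, w_a, W_h):
    """O = ReLU(W_H [R * X_s, W_a]), where * is element-wise multiplication."""
    retained = [ri * xi for ri, xi in zip(r, x_s)]  # R_s = R * X_s
    concat = retained + w_a                         # [R_s, W_a]
    return [relu(sum(w * v for w, v in zip(row, concat))) for row in W_h]


r = [1.0, 0.0]      # first screening vector: keep element 0, drop element 1
x_s = [2.0, 3.0]    # data content feature
w_a = [1.0]         # target word vector
W_h = [[1.0, 1.0, 1.0],
       [-1.0, 0.0, 0.0]]
O = target_information_feature(r, x_s, w_a, W_h)
print(O)
```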
Step S104: determine, according to the target information feature, the data attribute type in the multimedia data that matches the identification dimension information.
Specifically, the terminal device inputs the target information feature to the classifier in the data identification model. The classifier can identify the matching probability between the target information feature and each of the attribute types in the classifier. The maximum matching probability is found among these matching probabilities, and the attribute type with the maximum matching probability is determined as the data attribute type in the multimedia data that matches the identification dimension information.
If the multimedia data is user comment data in a restaurant scenario, the data attribute types may include positive sentiment, negative sentiment, and neutral sentiment. Positive sentiment means that the user shows a favorable tendency toward some aspect of the restaurant in the comment; for example, from the user comment "I really like the food at this restaurant" it can be learned that the user shows positive sentiment toward the restaurant's food. Negative sentiment means that the user shows a dissatisfied tendency toward some aspect of the restaurant in the comment. Neutral sentiment means that the user makes an objective evaluation of some aspect of the restaurant.
Refer also to Fig. 3, which is a schematic structural diagram of a data identification model provided by an embodiment of the present application. Taking text data as the multimedia data, the structure of the data identification model is as shown in Fig. 3. The data identification model may include an embedding layer (including word embedding 20b and target embedding 20e), convolutional layer 20c, fully connected layer 20f (also called the second fully connected layer), fully connected layer 20h (also called the first fully connected layer), and classifier 20i. Fully connected layer 20f and dot product 20g together may be called the reset gate unit.
Word embedding 20b is a network layer built for text data. It takes as input the one-hot codes of multiple unit characters, such as the one-hot code of each unit character in unit character set 20b, and, based on the character conversion model in word embedding 20b, converts the high-dimensional one-hot codes into low-dimensional unit word vectors to obtain the text matrix corresponding to the text data, i.e. the matrix corresponding to one sentence. For the detailed process of word segmentation and word vector conversion of the text data, refer to the description of step S101 in the embodiment corresponding to Fig. 2 above; it is not repeated here. Target embedding 20e converts identification dimension information (such as price 20d) into a target word vector: it takes as input the one-hot code corresponding to the identification dimension information and outputs the corresponding target word vector. For the detailed conversion process, likewise refer to the description of step S101 in the embodiment corresponding to Fig. 2 above; it is not repeated here.
Convolutional layer 20c is used to extract the data content feature (which can be understood as the semantic feature) from the text matrix. It takes as input the text matrix output by word embedding 20b, and through the convolution and pooling operations of the convolutional layer the data content feature in the text matrix can be extracted; that is, the key information in the text data is extracted from the corresponding text matrix. For example, if the text data is "I went to eat yesterday and felt the taste was quite good", then after the corresponding text matrix is input to convolutional layer 20c, the output data content feature may contain information such as "felt the taste was quite good", while unimportant information (such as "I went to eat yesterday") can be filtered out.
Fully connected layer 20f takes as input the data content feature output by convolutional layer 20c and the target word vector output by target embedding 20e, and through an activation function (which may be called the first activation function and may be the Tanh function or the Sigmoid function) outputs the first screening vector, i.e. the vector that controls the degree to which information in the data content feature unrelated to the identification dimension information is ignored. The data content feature output by convolutional layer 20c and the first screening vector output by fully connected layer 20f are multiplied element-wise in dot product 20g to obtain the retained information feature R_s, which represents the information in the data content feature that is related to the identification dimension information.
Fully connected layer 20h takes as input the retained information feature R_s and the target word vector output by target embedding 20e, and through an activation function (which may be called the second activation function and may be the ReLU function) outputs the target information feature. The target information feature combines the retained information feature R_s and the identification dimension information into a new feature associated with the identification dimension information.
The target information feature output by fully connected layer 20h is input to classifier 20i, which may be a softmax classifier (a multi-class classifier). The classifier computes the matching probability between the target information feature and each attribute type in the classifier; that is, the classifier outputs as many matching probability values as it has attribute types. The attribute type with the maximum matching probability value is determined as the data attribute type for the identification dimension information in the text data. For example, if the classifier has three data attribute types, namely positive sentiment, negative sentiment, and neutral sentiment, its output is a 3-dimensional vector such as [0.10, 0.70, 0.20]. From this it can be determined that, for the identification dimension information in the text data, the matching probability of positive sentiment is 0.10, the matching probability of negative sentiment is 0.70, and the matching probability of neutral sentiment is 0.20. Negative sentiment can then be determined as the data attribute type corresponding to the text data.
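The classification step can be sketched as a softmax followed by picking the most probable attribute type. The label order (positive, negative, neutral) follows the example above; the logits passed to `softmax` are made up, and the function names are hypothetical.

```python
import math

LABELS = ["positive", "negative", "neutral"]  # order taken from the example above


def softmax(logits):
    """Turn raw scores into matching probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(probabilities, labels=LABELS):
    """Return the attribute type with the maximum matching probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best], probabilities[best]


print(classify([0.10, 0.70, 0.20]))       # probabilities from the example above
print(classify(softmax([0.2, 2.1, 0.9])))  # hypothetical classifier logits
```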
In this embodiment of the present application, the data content feature corresponding to the multimedia data is obtained by the convolutional layer in the data identification model, and the first screening vector for screening features related to the identification dimension information is determined by the reset gate unit. The first fully connected layer then screens out, from the data content feature, the target information feature associated with the identification dimension information, and the data attribute type in the multimedia data that matches the identification dimension information is determined according to the target information feature. The identification dimension information is thus input to the data identification model as additional information, and a gating mechanism and a convolutional network are used to extract the target information feature associated with the identification dimension information from the multimedia data. Determining the data attribute type from this target information feature avoids interference from the remaining information in the multimedia data and can therefore improve the accuracy of attribute type classification of multimedia data.
Refer to Fig. 4, which is a schematic flowchart of another data processing method provided by an embodiment of the present application. As shown in Fig. 4, the method may include the following steps:
Step S201: obtain sample multimedia data and at least one attribute type label corresponding to the sample multimedia data;
Specifically, the terminal device can obtain sample multimedia data and at least one attribute type label corresponding to the sample multimedia data. The sample multimedia data may be text data, video data, or image data. It can be acquired in either of two ways: by obtaining directly from a public database multimedia data in which each sample has already been given an attribute type label, where the attribute type label indicates the data attribute type corresponding to the sample multimedia data; or by obtaining the required multimedia data directly from the network, manually labeling each piece of multimedia data with an attribute type label, and using the manually labeled multimedia data as the sample multimedia data.
Each piece of sample multimedia data may contain multiple subject targets (i.e. multiple pieces of identification dimension information), and different subject targets may have different attribute type labels. A single piece of sample multimedia data may therefore correspond to multiple attribute type labels.
Step S202: obtain, from the sample multimedia data, the sample multimedia sub-data associated with the at least one attribute type label;
Specifically, the terminal device can divide the sample multimedia data into multiple pieces of sample multimedia sub-data, each corresponding to one attribute type label. In some databases the sample multimedia data has already been divided, with the data for each subject target in the sample multimedia data designated as sample multimedia sub-data; in that case the terminal device can obtain directly from the database the sample multimedia sub-data associated with the at least one attribute type label. Alternatively, the terminal device can obtain the sample multimedia sub-data contained in each piece of sample multimedia data according to the at least one attribute type label corresponding to that sample multimedia data.
The terminal device can obtain a large amount of sample multimedia data. Because the pieces of sample multimedia data differ in size, each can be adjusted to a target size so that all sample multimedia data has the same size. The resized sample multimedia data is determined as the target sample data, the sample data matrix corresponding to each piece of target sample data is generated, and the sample multimedia sub-data corresponding to each attribute type label is obtained based on the sample data matrix.
The detailed process of generating the sample data matrix differs for different types of sample multimedia data. When the sample multimedia data is sample text data, the terminal device can perform word segmentation on each piece of sample text data to obtain its sample unit characters and, based on the character conversion model, convert each sample unit character into a sample unit word vector (the unit word vectors of all unit characters have the same dimension). The text matrix of each piece of sample text data can then be generated from the unit word vectors of the sample unit characters it contains. Because the pieces of sample text data contain different numbers of unit characters, the generated text matrices have different numbers of rows (all text matrices have the same number of columns, which is the dimension of the unit word vector). The text matrices therefore need to be resized: each is zero-padded to a fixed size such as m × n, where m is the number of rows of the adjusted text matrix and n is the number of columns of the adjusted text matrix, i.e. the dimension of the word vectors. The adjusted text matrix is determined as the sample data matrix.
When the sample multimedia data is sample image data or sample video data, the terminal device can split sample video data into individual frames, which are then handled like sample image data. The sample image data can undergo a series of preprocessing steps such as grayscale conversion, normalization, and resizing. Grayscale conversion turns an RGB image into a grayscale image; if the sample image data is already a grayscale image, this step is unnecessary. Normalization maps the pixel values in the grayscale sample image data to the range 0 to 1. Resizing adjusts the sample image data to the target size, i.e. adjusts every piece of sample image data to a fixed size. The sample image data preprocessed in this way can serve as the sample data matrix for subsequent processing.
Step S203, train the data identification model based on the mapping relationship between the sample multimedia sub-data and the at least one attribute type label;
Specifically, the terminal device may train the data identification model based on the sample data matrix and the attribute type labels corresponding to the different identification dimension information in the sample data matrix. The training process of the data identification model may include: initializing the network weights; propagating the input sample data matrix forward through each network layer to obtain an output value; computing the error between the network output value and the target value; and updating the weights according to the computed error. These steps are iterated until the computed error is less than or equal to a set expected value, at which point training ends and all parameters of the data identification model are saved. The data identification model then has the data identification function.
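The four training steps above can be illustrated with a toy example. This sketch trains a single sigmoid unit with a mean squared error, which is an assumption made to keep the example short; it is not the network or the loss of the data identification model, and the names train, target_error, and max_iters are hypothetical.

```python
import math
import random
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def train(samples: List[Tuple[List[float], float]], lr: float = 0.5,
          target_error: float = 0.05, max_iters: int = 5000, seed: int = 0):
    rng = random.Random(seed)
    dim = len(samples[0][0])
    # Step 1: weight initialization.
    weights = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    bias = 0.0
    error = float("inf")
    for _ in range(max_iters):
        grad_w = [0.0] * dim
        grad_b = 0.0
        error = 0.0
        for x, target in samples:
            # Step 2: forward propagation to obtain the output value.
            out = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            # Step 3: error between the output value and the target value.
            diff = out - target
            error += diff * diff / len(samples)
            delta = 2.0 * diff * out * (1.0 - out) / len(samples)
            for i, xi in enumerate(x):
                grad_w[i] += delta * xi
            grad_b += delta
        # Stop once the error is no larger than the set expected value.
        if error <= target_error:
            break
        # Step 4: weight update according to the computed error.
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias, error


# Toy data: the label equals the first input component.
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
weights, bias, final_error = train(samples)
```

The returned weights and bias play the role of the saved model parameters.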
Optionally, the data identification model may be trained by fine-tuning. Typically, when little sample multimedia data is available, the terminal device may obtain initialized model parameters, that is, use the parameters of an already trained model as the initialization parameters of the data identification model, and then fine-tune them for the specific classification and identification task to achieve data identification.
Step S204, obtain multimedia data, input the multimedia data and identification dimension information into the data identification model, and obtain the data content feature corresponding to the multimedia data in the data identification model;
Step S205, determine, according to the data content feature, a first screening vector for screening features related to the identification dimension information;
For the specific implementation of steps S204 to S205, refer to the description of steps S101 to S102 in the embodiment corresponding to Fig. 2 above; details are not repeated here.
Step S206, input the target word vector corresponding to the identification dimension information and the data content feature into the third fully connected layer, and obtain, based on the third activation function corresponding to the third fully connected layer, a second screening vector for screening features related to the identification dimension information;
Specifically, the data identification model may further include an update gate unit, and the update gate unit may include a third fully connected layer. Given the data content feature output by the convolutional layer of the data identification model, the terminal device may input the data content feature and the target word vector corresponding to the identification dimension information into the update gate unit (that is, into the third fully connected layer that the update gate unit contains) and, based on the third activation function corresponding to the third fully connected layer, obtain the second screening vector for screening features associated with the identification dimension information. The update gate unit controls how much of the information associated with the identification dimension information is retained from the data content feature: the larger the value of the update gate, the more information in the data content feature is retained. The second screening vector Z (that is, the value of the update gate) is jointly determined by the input data content feature X_s, the target word vector W_a, the weight matrix W_z of the third fully connected layer, and the third activation function Sigmoid, namely:
Z = σ(W_z [X_s, W_a])    (3)
where [ ] denotes vector concatenation and σ denotes the third activation function Sigmoid. The Sigmoid function ensures that the value of the update gate stays between 0 and 1, which is useful for screening data: when the value of the update gate is 0, any number multiplied by 0 equals 0, so that part of the data is filtered out; when the value of the update gate is 1, any number multiplied by 1 equals itself, so that part of the data is retained completely. The Sigmoid function in formula (3) may also be replaced by the Tanh function. The main difference between the two is that the value range of the Tanh function is between -1 and 1 rather than between 0 and 1; it can likewise be used to screen data.
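The screening behavior of a gate value, and the different value ranges of the two activation functions, can be checked with a short sketch. The helper apply_gate and the sample feature values are hypothetical and serve only as an illustration.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def apply_gate(gate: List[float], features: List[float]) -> List[float]:
    """Multiply each feature by its gate value (element-wise)."""
    return [g * f for g, f in zip(gate, features)]


features = [2.0, -3.0, 0.5]
filtered = apply_gate([0.0, 0.0, 0.0], features)   # gate 0: data filtered out
retained = apply_gate([1.0, 1.0, 1.0], features)   # gate 1: data fully retained

inputs = [-5.0, 0.0, 5.0]
sigmoid_values = [sigmoid(x) for x in inputs]      # all strictly inside (0, 1)
tanh_values = [math.tanh(x) for x in inputs]       # all strictly inside (-1, 1)
```

A Tanh gate can take negative values, so it can flip the sign of a feature in addition to scaling it, which a Sigmoid gate cannot do.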
Step S207, perform element-wise multiplication (vector dot multiplication) of the first screening vector and the data content feature to obtain a retained information feature associated with the identification dimension information;
Specifically, the terminal device may multiply the first screening vector R output by the second fully connected layer element-wise with the data content feature X_s output by the convolutional layer to obtain the retained information feature R_s associated with the identification dimension information, namely:
R_s = R * X_s    (4)
The retained information feature R_s is the information feature associated with the identification dimension information that is extracted from the data content feature X_s based on the first screening vector R.
Step S208, input the retained information feature and the target word vector corresponding to the identification dimension information into the first fully connected layer, and obtain, based on the second activation function corresponding to the first fully connected layer, a first candidate information feature in the data content feature that is associated with the identification dimension information;
Specifically, to ensure the accuracy of the target information feature extracted from the multimedia data, and given the update gate unit in the data identification model, the terminal device may first input the retained information feature R_s and the target word vector W_a corresponding to the identification dimension information into the first fully connected layer, and determine the vector output by the second activation function corresponding to the first fully connected layer as the first candidate information feature H. That is, formula (2) above becomes:
H = ReLU(W_H [R_s, W_a])    (5)
For the meaning of formula (5), refer to the description of formula (2) in step S103 of the embodiment corresponding to Fig. 2 above; details are not repeated here.
Step S209, perform element-wise multiplication (vector dot multiplication) of the second screening vector and the first candidate information feature to obtain a second candidate information feature;
Specifically, the terminal device may multiply the second screening vector Z output by the third fully connected layer with the first candidate information feature H output by the first fully connected layer, that is, multiply corresponding elements, to obtain the second candidate information feature H', namely: H' = Z * H.
Step S210, determine a global information feature based on the second screening vector and the data content feature, and determine the target information feature according to the second candidate information feature and the global information feature;
Specifically, because the second candidate information feature H' is jointly determined by the first screening vector R and the second screening vector Z, H' is the information feature associated with the identification dimension information that the data identification model has learned from the multimedia data. To make the final information feature more comprehensive, the terminal device may obtain the global information feature G from the second screening vector Z output by the third fully connected layer and the data content feature X_s output by the convolutional layer, namely: G = (1 - Z) * X_s. The factor 1 - Z balances the second candidate information feature H'.
The terminal device may add the second candidate information feature H' and the global information feature G to obtain the target information feature O for the identification dimension information in the data content feature X_s, namely: O = G + H'.
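Steps S206 to S210 can be combined into one sketch of the gated computation. Formulas (3), (4), (5) and the expressions for H', G, and O follow the description; the form of the first screening vector, R = σ(W_R [X_s, W_a]), is assumed here by analogy with formula (3). Bias terms are omitted, and the function and parameter names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, v: Vector) -> Vector:
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def sigmoid_vec(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def relu_vec(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def hadamard(a: Vector, b: Vector) -> Vector:
    return [x * y for x, y in zip(a, b)]


def gated_unit(x_s: Vector, w_a: Vector,
               w_r: Matrix, w_z: Matrix, w_h: Matrix) -> Vector:
    r = sigmoid_vec(matvec(w_r, x_s + w_a))    # R, assumed form (reset gate)
    z = sigmoid_vec(matvec(w_z, x_s + w_a))    # formula (3), update gate
    r_s = hadamard(r, x_s)                     # formula (4)
    h = relu_vec(matvec(w_h, r_s + w_a))       # formula (5)
    h_prime = hadamard(z, h)                   # H' = Z * H
    g = hadamard([1.0 - zi for zi in z], x_s)  # G = (1 - Z) * X_s
    return [gi + hi for gi, hi in zip(g, h_prime)]  # O = G + H'


x_s = [2.0, -4.0]                  # data content feature (dimension 2)
w_a = [1.0]                        # target word vector (dimension 1)
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w_h = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
o = gated_unit(x_s, w_a, w_r=zeros, w_z=zeros, w_h=w_h)
```

With all-zero gate weights both gates equal 0.5, so R_s = [1.0, -2.0], H = [1.0, 0.0], H' = [0.5, 0.0], G = [1.0, -2.0], and O = [1.5, -2.0].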
Step S211, determine, according to the target information feature, the data attribute type in the multimedia data that matches the identification dimension information.
For the specific implementation of step S211, refer to the description of step S104 in the embodiment corresponding to Fig. 2 above; details are not repeated here.
Please also refer to Fig. 5, which is a structural schematic diagram of another data identification model provided by an embodiment of the present application. Taking text data as the multimedia data, the structure of the data identification model is as shown in Fig. 5. The data identification model may include an embedding layer (including word embedding 20b and target embedding 20e), convolutional layer 20c, fully connected layer 20f (also referred to as the second fully connected layer), fully connected layer 20h (also referred to as the first fully connected layer), fully connected layer 30a, and classifier 20i. Fully connected layer 20f and dot product 20g may be called the reset gate unit, and fully connected layer 30a and dot product 30c may be called the update gate unit.
For the specific functions of word embedding 20b, target embedding 20e, convolutional layer 20c, fully connected layer 20f, and fully connected layer 20h, refer to the description in the embodiment corresponding to Fig. 3 above; details are not repeated here.
Fully connected layer 30a takes as input the data content feature output by convolutional layer 20c and the target word vector output by target embedding 20e and, through an activation function (which may be called the third activation function and may be a Tanh function or a Sigmoid function), outputs the second screening vector, that is, a vector that controls how much of the information associated with the identification dimension information is retained from the data content feature. The second screening vector output by fully connected layer 30a and the first candidate information feature output by fully connected layer 20h are multiplied element-wise by dot product 30b to obtain the second candidate information feature, which represents the information related to the identification dimension information that is further extracted from the first candidate information feature. The second screening vector output by fully connected layer 30a is subtracted from 1, and the result is multiplied element-wise with the data content feature output by convolutional layer 20c by dot product 30c to obtain the global information feature.
The second candidate information feature and the global information feature are input into classifier 20i, which may be a softmax classifier (a multi-class classifier). Through computation, the matching probabilities between the target information feature and the multiple attribute types in the classifier are obtained; that is, the classifier outputs as many matching probability values as it has attribute types. The attribute type with the largest matching probability value is determined as the data attribute type of the text data for the identification dimension information.
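The classification step can be sketched as a softmax over one score per attribute type followed by selecting the largest matching probability. The linear projection that produces the scores, its weight values, and the label names are assumptions for illustration; the description only states that a softmax classifier may be used.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: one probability per attribute type."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature: List[float], weights: List[List[float]],
             labels: List[str]) -> Tuple[str, List[float]]:
    # Assumed linear projection: one score per attribute type.
    logits = [sum(w * f for w, f in zip(row, feature)) for row in weights]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


labels = ["positive", "negative", "neutral"]
weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # hypothetical values
target_feature = [1.5, -2.0]                     # target information feature O
predicted, probabilities = classify(target_feature, weights, labels)
```

The classifier returns three matching probabilities because three attribute types are configured, and the attribute type with the largest probability is selected.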
Based on the data identification model in the embodiment corresponding to Fig. 5 above, the embodiment of the present application uses the Restaurant data set as an example to experimentally verify the data identification effect of the model. The Restaurant data set is a public database from the SemEval (semantic evaluation) workshop and contains user reviews of restaurants (the review sentences are in English). A sentence in a user review may express the reviewer's different attitudes toward different aspects of a restaurant, for example "Average to good Thai food, but terrible delivery". This sentence expresses the reviewer's attitudes toward two aspects, "food" and "delivery": the sentiment toward "food" is positive, and the sentiment toward "delivery" is negative. To verify the data identification effect of the data identification model on restaurant review data more accurately and comprehensively, multiple different data sets are created in the embodiment of the present application, including data sets that are large and relatively easy to identify and data sets that are small and hard to identify. In a large, lower-difficulty data set, each piece of review data may express different attitudes toward different aspects or only an attitude toward a single aspect. A small, high-difficulty data set consists of sentences that express opposite or different sentiments toward different aspects of a restaurant (the aspects are also called identification dimension information); if a sentence contains 4 aspects, it has 4 copies in the data set, and each copy is associated with a different aspect and attribute type label.
In the embodiment of the present application, the experimental data sets may consist of data set 1 (the user review data from 2014 to 2016 in the Restaurant data set), data set 2 (the review data with multiple aspect targets and multiple attribute type labels among the 2014 to 2016 user review data in the Restaurant data set), data set 3 (the 2014 user review data in the Restaurant data set), and data set 4 (the review data with multiple aspect targets and multiple attribute type labels among the 2014 user review data in the Restaurant data set). In data set 1 and data set 2, the attribute types of the review data include positive, negative, and neutral; in data set 3 and data set 4, the attribute types of the review data include positive, negative, neutral, and conflict. The experimental data sets may include at least 5 aspect targets: food, price, service, environment, and other. The specific composition of the experimental data sets is shown in Table 1:
Table 1
Here "-" indicates that there is no data and the entry can be ignored; this is not repeated below. "Training" in Table 1 denotes the amount of training data for the corresponding attribute type in the experiments; for example, in data set 1 the amount of training data for the "positive" attribute type is 2710 and that for the "negative" attribute type is 1198. "Test" in Table 1 denotes the amount of test data for the corresponding attribute type in the experiments; for example, in data set 1 the amount of test data for the "positive" attribute type is 1505 and that for the "negative" attribute type is 680. The training data is used to train the data identification model, and the test data is used to verify the effect of the trained data identification model.
In the experiments of the embodiment of the present application, the word embedding layer in the data identification model may be initialized with GloVe (a word embedding tool) vectors, which are vectors obtained by pre-training on a large amount of unlabeled data. During training, a stochastic gradient descent algorithm may be used to optimize the values of the model parameters. Some of the training parameters may be set as follows: the learning rate is set to 1e-3, and the number of epochs is set to 10. The learning rate controls the size of each parameter update. If the learning rate is too large, each update is large and the network model may skip over the optimal value; if the learning rate is too small, each update is small and network training becomes slow and time-consuming. The number of epochs is the number of times the entire training data is passed through during training.
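The effect of the learning rate can be illustrated on a toy problem. The sketch below applies plain gradient descent to f(w) = w², which is an assumption chosen for illustration and not the loss of the data identification model; the function name descend and the specific learning rates other than 1e-3 are hypothetical.

```python
def descend(learning_rate: float, steps: int, start: float = 1.0) -> float:
    """Minimize f(w) = w**2 by gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w              # derivative of w**2
        w -= learning_rate * gradient   # update size scales with the rate
    return w


too_small = descend(1e-3, steps=10)   # barely moves toward the optimum w = 0
moderate = descend(0.4, steps=10)     # approaches the optimum quickly
too_large = descend(1.1, steps=10)    # overshoots the optimum and diverges
```

On this toy function a rate of 1e-3 needs many more than 10 steps to approach the optimum, while a rate above 1.0 makes every update overshoot so that the distance from the optimum grows.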
To verify the effectiveness of the data identification model of the embodiment of the present application on the experimental data sets, the embodiment of the present application compares the data identification model with multiple existing network models. The specific results are shown in Table 2:
Table 2
Model                       Test 1        Test 2        Test 3        Test 4
Model 1                     -             -             75.32         -
Model 2                     -             -             82.93         -
Model 3                     83.91±0.49    66.32±2.28    78.29±0.68    45.62±0.90
Model 4                     84.28±0.15    50.43±0.38    79.47±0.32    44.94±0.01
Model 5                     84.48±0.06    50.08±0.31    78.67±0.35    44.49±1.52
Model 6                     85.92±0.27    70.75±1.19    79.35±0.34    50.55±1.83
Data identification model   86.51±0.13    72.01±0.53    80.21±0.31    51.24±1.75
All values in Table 2 are the test accuracy (%) of each model on target text analysis in the experiments. The test accuracy is obtained by counting the amount of data that a model predicts correctly on a test data set and dividing it by the total amount of data. The experiment may be repeated multiple times, with the average of the accuracies obtained over the repeated experiments taken as the final test accuracy. Test 1 denotes the test data of data set 1 above, Test 2 denotes the test data of data set 2, Test 3 denotes the test data of data set 3, and Test 4 denotes the test data of data set 4. The existing network models are as follows:
Model 1: a prediction model for target sentiment analysis (aspect-category sentiment analysis, ACSA) published by the National Research Council of Canada, which uses a support vector machine (Support Vector Machine, SVM) as the base classifier. Model 1 may use features such as part-of-speech tags.
Model 2: the main difference from Model 1 is the features used; Model 2 uses sentiment lexicon features. Both Model 1 and Model 2 require a large amount of labeled data.
Model 3: a network model that performs target sentiment analysis with an attention mechanism. It mainly uses long short-term memory units (Long Short-Term Memory, LSTM) and can take the identification dimension information as an additional input to the LSTM.
Model 4: a convolutional neural network model (Convolutional Neural Networks, CNN), which applies a convolutional neural network to the text classification task and extracts sentiment information from text data by using convolution kernels of different sizes. The kernel sizes may be 3 × 3, 4 × 4, and 5 × 5.
Model 5: a gated convolutional neural network (Gated Convolutional Networks, GCN), which cannot take the identification dimension information as an additional input.
Model 6: a gated convolutional neural network for target sentiment analysis (GCAE), which builds on the GCN model and can take the identification dimension information as an additional input.
As Table 2 shows, the test accuracy obtained by the data identification model of the embodiment of the present application on each data set is generally higher than that of the existing network models, and the advantage is most pronounced on the small, harder data sets (such as data set 2 and data set 4). On data set 2, the test accuracy of the data identification model is about 6 percentage points higher than that of Model 3 and about 1.3 percentage points higher than that of Model 6. On data set 4, the test accuracy of the data identification model is about 7 percentage points higher than that of Model 5 and about 1 percentage point higher than that of Model 6. The data identification model therefore performs well in target sentiment analysis and can effectively improve the accuracy of text data classification.
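The test accuracy defined for Table 2 can be computed as sketched below. The predictions and labels are made-up toy values, not experimental data, and reading the "±" entries in Table 2 as a standard deviation over repeated runs is an assumption; the description only states that the accuracies of repeated experiments are averaged.

```python
from statistics import mean, stdev
from typing import List, Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Correctly predicted amount divided by total amount, in percent."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Toy labels and the predictions of three repeated runs (illustrative only).
labels = ["positive", "negative", "neutral", "positive"]
runs: List[List[str]] = [
    ["positive", "negative", "neutral", "negative"],
    ["positive", "negative", "neutral", "positive"],
    ["positive", "positive", "neutral", "positive"],
]
per_run = [accuracy(run, labels) for run in runs]
final_accuracy = mean(per_run)   # average over the repeated experiments
spread = stdev(per_run)          # assumed meaning of the "±" column
```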
Please also refer to Fig. 6, which is a schematic diagram of a scenario of another data sentiment analysis method provided by an embodiment of the present application. Based on the data identification model in the embodiment corresponding to Fig. 5 above, data sentiment analysis can be performed on multimedia data in different scenarios. As shown in Fig. 6, for a film (such as the film "Your Story"), a user may comment on the film online after watching it. The aspects of the user comments (that is, the identification dimension information) may include the film's plot, actors, music, and props, and the user comments are divided into three data attribute types: positive sentiment, negative sentiment, and neutral sentiment. The comments posted by users may be displayed on the comment display page 60a corresponding to the film.
The terminal device 100a may obtain the comment data posted by each user and treat each user's comment data as text data to be analyzed (that is, multimedia data). For example, the comment posted by user 1 is "The actors' acting is excellent; if the music were a little more pleasant it would be perfect". The terminal device 100a may take this comment as text data to be analyzed 60b and, based on the trained data identification model, determine that the data attribute type for the actors in text data 60b is positive sentiment and that the data attribute type for the music in text data 60b is negative sentiment. Further, the terminal device may take the comment posted by user 2, "The plot is too clichéd; if the acting were not so good, I would have stopped watching long ago", as text data to be analyzed 60c, and the comment posted by user 3, "I really like the costumes in it, they are so beautiful", as text data to be analyzed 60d. Based on the data identification model, it can be determined that the data attribute type for the plot in text data 60c is negative sentiment, that the data attribute type for the actors in text data 60c is positive sentiment, and that the data attribute type for the props in text data 60d is positive sentiment. After the data attribute types of each user's comment data for the various aspects of the film have been determined based on the data identification model, the terminal device 100a may compile statistics to obtain the users' scores for each aspect of the film. For example, according to the users' comment data, the plot score of the film is 3.6 (out of 5), the actor score is 4.7, the music score is 4.0, and the prop score is 4.2. These scores may be displayed on the comment display page 60a of the terminal device 100a so that subsequent users can decide whether to watch the film according to the detailed scores.
Please also refer to Fig. 7, which is a schematic diagram of a scenario of another data sentiment analysis method provided by an embodiment of the present application. Taking image data as the multimedia data, as shown in Fig. 7, the terminal device may take picture 70a as image data to be analyzed, perform preprocessing such as image graying, normalization, and image resizing on picture 70a, obtain the identification dimension information "face 70b" for picture 70a, and, based on a target word vector conversion model, convert the identification dimension information "face 70b" into target word vector 70c.
The terminal device may obtain data identification model 70d, which can identify the data attribute types for different identification dimension information in picture 70a. The attribute types corresponding to the image data may include sad emotion, happy emotion, and normal emotion. The terminal device inputs the preprocessed picture 70a into data identification model 70d and, based on the convolutional layer in data identification model 70d, obtains the data content feature corresponding to picture 70a. The terminal device inputs the data content feature output by the convolutional layer and the target word vector into the gating unit of data identification model 70d (including the reset gate unit and the update gate unit) and obtains, from the data content feature, the target information feature for the identification dimension information "face 70b". The classifier in data identification model 70d can identify the degree of matching between the target information feature and the multiple attribute types, that is, the matching probabilities between the target information feature and the multiple attribute types. As shown in Fig. 7, the matching probability between the data attribute type corresponding to the identification dimension information "face 70b" in picture 70a and the "sad emotion" attribute type is 0.05, the matching probability with the "happy emotion" attribute type is 0.75, and the matching probability with the "normal emotion" attribute type is 0.20.
According to these matching probabilities, the terminal device can determine that the data attribute type corresponding to the identification dimension information "face 70b" in picture 70a is happy emotion.
In the embodiment of the present application, the data content feature corresponding to the multimedia data can be obtained through the convolutional layer in the data identification model, and the screening vectors for screening features related to the identification dimension information (including the first screening vector and the second screening vector) can be determined based on the reset gate unit and the update gate unit. The target information feature associated with the identification dimension information can then be screened out from the data content feature based on the first screening vector and the second screening vector, and the data attribute type in the multimedia data that matches the identification dimension information can be determined according to the target information feature. It can be seen that the identification dimension information is input into the data identification model as additional information, the gate mechanism and the convolutional network are used to extract the target information feature associated with the identification dimension information from the multimedia data, and the target information feature is determined by nonlinear combination. Judging the data attribute type of the multimedia data based on the target information feature avoids interference from the remaining information in the multimedia data and thus improves the accuracy of classifying the attribute type of the multimedia data. Compared with a traditional attention mechanism, the gate mechanism also improves parallelizability and thus training speed.
Please refer to Fig. 8, which is a structural schematic diagram of a data processing apparatus provided by an embodiment of the present application. As shown in Fig. 8, the data processing apparatus 1 may include: a data acquisition module 10, a first determining module 20, a screening module 30, and a second determining module 40;
the data acquisition module 10 is configured to obtain the data content feature corresponding to the multimedia data;
the first determining module 20 is configured to determine, according to the data content feature, a first screening vector for screening features related to the identification dimension information;
the screening module 30 is configured to screen out, according to the first screening vector, the target information feature associated with the identification dimension information from the data content feature;
the second determining module 40 is configured to determine, according to the target information feature, the data attribute type in the multimedia data that matches the identification dimension information.
For the specific functional implementation of the data acquisition module 10, the first determining module 20, the screening module 30, and the second determining module 40, refer to steps S101 to S104 in the embodiment corresponding to Fig. 2 above; details are not repeated here.
Please also refer to Fig. 8. The data processing apparatus 1 may further include: a sample data acquisition module 50, a sample sub-data acquisition module 60, a training module 70, and a third determining module 80;
the sample data acquisition module 50 is configured to obtain sample multimedia data and at least one attribute type label corresponding to the sample multimedia data; the attribute type label is used to characterize the data attribute type corresponding to the sample multimedia data;
the sample sub-data acquisition module 60 is configured to obtain, from the sample multimedia data, the sample multimedia sub-data associated with the at least one attribute type label; each piece of sample multimedia sub-data corresponds to one attribute type label;
the training module 70 is configured to train the data identification model based on the mapping relationship between the sample multimedia sub-data and the at least one attribute type label;
the third determining module 80 is configured to input the target word vector corresponding to the identification dimension information and the data content feature into the third fully connected layer, and obtain, based on the third activation function corresponding to the third fully connected layer, a second screening vector for screening features related to the identification dimension information.
For the specific functional implementation of the sample data acquisition module 50, the sample sub-data acquisition module 60, and the training module 70, refer to steps S201 to S203 in the embodiment corresponding to Fig. 4 above; for the specific functional implementation of the third determining module 80, refer to step S206 in the embodiment corresponding to Fig. 4 above; details are not repeated here.
Please also refer to Fig. 8. The data acquisition module 10 may include: a data input unit 101 and a content feature acquisition unit 102;
the data input unit 101 is configured to obtain multimedia data and input the multimedia data and the identification dimension information into the data identification model;
the content feature acquisition unit 102 is configured to obtain, in the data identification model, the data content feature corresponding to the multimedia data.
For the specific functional implementation of the data input unit 101 and the content feature acquisition unit 102, refer to step S101 in the embodiment corresponding to Fig. 2 above; details are not repeated here.
Referring also to Fig. 8, the first determining module 20 may include a second conversion unit 201 and a first screening vector determination unit 202;
Second conversion unit 201, configured to convert the identification dimension information into a target word vector;
First screening vector determination unit 202, configured to input the target word vector and the data content feature into the second fully connected layer and to obtain, based on the first activation function corresponding to the second fully connected layer, the first screening vector for screening features related to the identification dimension information.
For the specific functional implementation of the second conversion unit 201 and the first screening vector determination unit 202, refer to step S102 in the embodiment corresponding to Fig. 2 above; details are not repeated here.
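The first screening vector described above can be sketched as a single gated layer. The following Python is a minimal illustration and not the patent's implementation: the sigmoid activation, the concatenation of the two inputs, the random weights, and all names (dense, first_screening_vector, word_vec, feature) are assumptions, because the source only states that a second fully connected layer with a first activation function produces the vector.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(weights, bias, inputs, activation):
    """One fully connected layer: activation(W @ inputs + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def first_screening_vector(target_word_vector, content_feature, weights, bias):
    """Return one gate value in (0, 1) per component of the content feature.

    Assumption: the two inputs are concatenated and the activation is a sigmoid.
    """
    joined = list(target_word_vector) + list(content_feature)
    return dense(weights, bias, joined, sigmoid)


rng = random.Random(0)
word_vec = [0.2, -0.1, 0.4]      # hypothetical embedding of the identification dimension
feature = [0.5, -0.3, 0.8, 0.1]  # hypothetical output of the convolutional layer
W = [[rng.uniform(-1.0, 1.0) for _ in range(len(word_vec) + len(feature))]
     for _ in feature]
b = [0.0] * len(feature)

gate = first_screening_vector(word_vec, feature, W, b)
```

Under these assumptions, a gate value close to 1 keeps the corresponding component of the content feature and a value close to 0 suppresses it.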
Referring also to Fig. 8, the screening module 30 may include a dot product unit 301, a first input unit 302, and a target feature determination unit 303;
Dot product unit 301, configured to perform a vector dot product of the first screening vector and the data content feature to obtain the retained information feature associated with the identification dimension information;
First input unit 302, configured to input the retained information feature and the target word vector corresponding to the identification dimension information into the first fully connected layer;
Target feature determination unit 303, configured to obtain the target information feature in the data content feature based on the second activation function corresponding to the first fully connected layer.
For the specific functional implementation of the dot product unit 301, the first input unit 302, and the target feature determination unit 303, refer to step S103 in the embodiment corresponding to Fig. 2 above; details are not repeated here.
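The three units of the screening module can be sketched as follows. This is an illustration under stated assumptions: the "vector dot product" is read here as an element-wise product (so that the result is still a feature vector), tanh stands in for the unspecified second activation function, and the weights and names (screen_target_feature, retained, target) are hypothetical.

```python
import math


def dense(weights, bias, inputs, activation):
    """One fully connected layer: activation(W @ inputs + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def screen_target_feature(gate, content_feature, target_word_vector, weights, bias):
    # Dot product unit: element-wise gating of the content feature (assumed reading).
    retained = [g * h for g, h in zip(gate, content_feature)]
    # First input unit: feed the retained feature together with the target word vector.
    joined = retained + list(target_word_vector)
    # Target feature determination unit: first fully connected layer, tanh assumed.
    return retained, dense(weights, bias, joined, math.tanh)


gate = [1.0, 0.0, 0.5]             # hypothetical first screening vector
content_feature = [0.4, 0.9, -0.6]
word_vec = [0.1, 0.2]
# Hypothetical weights that simply pass the retained components through.
weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
]
bias = [0.0, 0.0, 0.0]

retained, target = screen_target_feature(gate, content_feature, word_vec, weights, bias)
```

In this toy setup the second component of the content feature is gated to zero, so it contributes nothing to the target information feature.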
Referring also to Fig. 8, the second determining module 40 may include a second input unit 401 and an attribute type determination unit 402;
Second input unit 401, configured to input the target information feature into the classifier in the output layer;
Attribute type determination unit 402, configured to identify, based on the classifier, the matching probabilities between the target information feature and the multiple attribute types in the classifier, and to determine the attribute type with the maximum matching probability as the data attribute type in the multimedia data that matches the identification dimension information.
For the specific functional implementation of the second input unit 401 and the attribute type determination unit 402, refer to step S104 in the embodiment corresponding to Fig. 2 above; details are not repeated here.
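Selecting the attribute type with the maximum matching probability can be sketched as below. The source does not name the classifier, so the linear scoring followed by softmax, the example attribute types, and the function names are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(target_feature, class_weights, class_bias, attribute_types):
    """Return the attribute type with the highest matching probability."""
    scores = [
        sum(w * x for w, x in zip(row, target_feature)) + b
        for row, b in zip(class_weights, class_bias)
    ]
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return attribute_types[best], probabilities


attribute_types = ["positive", "neutral", "negative"]  # hypothetical attribute types
target_feature = [1.0, -0.5]                           # hypothetical target information feature
class_weights = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]]
class_bias = [0.0, 0.0, 0.0]

label, probabilities = classify(target_feature, class_weights, class_bias, attribute_types)
```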
Referring also to Fig. 8, the sample sub-data acquisition module 60 may include a size adjustment module 601 and a generation module 602;
Size adjustment module 601, configured to adjust the sample multimedia data to a target size and to determine the size-adjusted sample multimedia data as target sample data;
Generation module 602, configured to generate the sample data matrix corresponding to the target sample data and to obtain, from the sample data matrix, the sample multimedia sub-data associated with the at least one attribute type label.
For the specific functional implementation of the size adjustment module 601 and the generation module 602, refer to step S202 in the embodiment corresponding to Fig. 4 above; details are not repeated here.
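For text samples, adjusting to a target size and generating a sample data matrix might look like the following sketch. Padding with a placeholder unit, truncating from the end, the zero vector for unknown units, and the names adjust_to_target_size and to_sample_matrix are assumptions; the source does not specify how the size adjustment is performed, and the extraction of sub-data from the matrix is not covered here.

```python
def adjust_to_target_size(units, target_size, pad_unit="<pad>"):
    """Truncate or pad a sequence of units so it has exactly target_size entries."""
    clipped = list(units[:target_size])
    return clipped + [pad_unit] * (target_size - len(clipped))


def to_sample_matrix(units, embedding, dim):
    """Stack one vector per unit; unknown units (including padding) map to zeros."""
    zero = [0.0] * dim
    return [list(embedding.get(unit, zero)) for unit in units]


embedding = {"g": [1.0, 0.0], "o": [0.0, 1.0], "d": [1.0, 1.0]}  # hypothetical vectors

short_sample = adjust_to_target_size(list("go"), 4)
long_sample = adjust_to_target_size(list("goodfood"), 4)
matrix = to_sample_matrix(short_sample, embedding, dim=2)
```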
Referring also to Fig. 8, the content feature acquisition unit 102 may include a first conversion subunit 1021, a splicing subunit 1022, and a feature extraction subunit 1023;
First conversion subunit 1021, configured to divide, in the data identification model, the text data into multiple unit characters and to convert each unit character into a unit word vector;
Splicing subunit 1022, configured to splice the unit word vectors into the text matrix corresponding to the text data;
Feature extraction subunit 1023, configured to perform feature extraction on the text matrix based on the convolutional layer to obtain the data content feature corresponding to the text data.
For the specific functional implementation of the first conversion subunit 1021, the splicing subunit 1022, and the feature extraction subunit 1023, refer to step S101 in the embodiment corresponding to Fig. 2 above; details are not repeated here.
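The three subunits above (character splitting, splicing into a text matrix, convolutional feature extraction) can be sketched in a few lines. The embedding table, the single kernel, the absence of an activation or pooling step, and the function names are assumptions for illustration only; the source states only that a convolutional layer extracts the data content feature from the text matrix.

```python
def text_to_matrix(text, embedding, dim):
    """Split text into unit characters and stack their word vectors row by row."""
    zero = [0.0] * dim
    return [list(embedding.get(ch, zero)) for ch in text]


def conv1d(matrix, kernel, bias=0.0):
    """Slide one kernel of shape (window, dim) along the text matrix.

    Returns one response per window position (a one-dimensional feature map).
    """
    window = len(kernel)
    responses = []
    for start in range(len(matrix) - window + 1):
        total = bias
        for row, kernel_row in zip(matrix[start:start + window], kernel):
            total += sum(x * k for x, k in zip(row, kernel_row))
        responses.append(total)
    return responses


embedding = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}  # hypothetical vectors
text_matrix = text_to_matrix("abc", embedding, dim=2)
kernel = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical kernel covering two characters
feature_map = conv1d(text_matrix, kernel)
```

A real model would typically apply many kernels and combine their feature maps; one kernel is used here only to keep the arithmetic easy to follow.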
Referring also to Fig. 8, the target feature determination unit 303 may include a first candidate information determination subunit 3031 and a first feature determination subunit 3032;
First candidate information determination subunit 3031, configured to obtain, based on the second activation function corresponding to the first fully connected layer, the first candidate information feature in the data content feature that is associated with the identification dimension information;
First feature determination subunit 3032, configured to perform a vector dot product of the second screening vector and the first candidate information feature to obtain the target information feature.
For the specific functional implementation of the first candidate information determination subunit 3031 and the first feature determination subunit 3032, refer to steps S208 to S210 in the embodiment corresponding to Fig. 4 above; details are not repeated here.
Referring also to Fig. 8, the first feature determination subunit 3032 may include a second candidate information determination subunit 30321, a global feature determination subunit 30322, and a second feature determination subunit 30323;
Second candidate information determination subunit 30321, configured to perform a vector dot product of the second screening vector and the first candidate information feature to obtain the second candidate information feature;
Global feature determination subunit 30322, configured to determine the global information feature based on the second screening vector and the data content feature;
Second feature determination subunit 30323, configured to determine the target information feature according to the second candidate information feature and the global information feature.
For the specific functional implementation of the second candidate information determination subunit 30321, the global feature determination subunit 30322, and the second feature determination subunit 30323, refer to steps S209 to S210 in the embodiment corresponding to Fig. 4 above; details are not repeated here.
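One plausible way to combine the second screening vector with the candidate and global features is the update rule used in gated recurrent units. The sketch below borrows that rule as an assumption: the source says only that the global information feature is determined from the second screening vector and the data content feature, and that the target information feature is determined from the second candidate information feature and the global information feature. The (1 - z) weighting, the final addition, and the function name are therefore hypothetical.

```python
def combine_with_update_gate(second_gate, first_candidate, content_feature):
    """GRU-style combination (assumed): z * candidate + (1 - z) * content."""
    # Second candidate information feature: gate applied to the first candidate.
    second_candidate = [z * c for z, c in zip(second_gate, first_candidate)]
    # Global information feature: complementary gate applied to the content feature.
    global_feature = [(1.0 - z) * h for z, h in zip(second_gate, content_feature)]
    # Target information feature: sum of the two parts.
    return [c + g for c, g in zip(second_candidate, global_feature)]


second_gate = [1.0, 0.0, 0.5]       # hypothetical second screening vector
first_candidate = [0.2, 0.4, 0.6]   # hypothetical first candidate information feature
content_feature = [1.0, 2.0, 3.0]   # hypothetical data content feature

target = combine_with_update_gate(second_gate, first_candidate, content_feature)
```

With a gate value of 1 the result follows the candidate, with 0 it follows the original content feature, and intermediate values blend the two.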
In the embodiments of the present application, the data content feature corresponding to the multimedia data can be obtained through the convolutional layer in the data identification model, and the screening vectors (including the first screening vector and the second screening vector) for screening features related to the identification dimension information can be determined based on the reset gate unit and the update gate unit. The target information feature associated with the identification dimension information can then be screened out of the data content feature based on the first screening vector and the second screening vector, and the data attribute type in the multimedia data that matches the identification dimension information can be determined according to the target information feature. It can be seen that the identification dimension information is input into the data identification model as additional information, and that a gating mechanism and a convolutional network are used to extract from the multimedia data the target information feature associated with the identification dimension information. Determining the data attribute type of the multimedia data from that feature avoids interference from the remaining information in the multimedia data and can therefore improve the accuracy of attribute type classification. In addition, compared with a traditional attention mechanism, the gating mechanism improves parallelism and can therefore increase training speed.
Referring to Fig. 9, Fig. 9 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application. As shown in Fig. 9, the data processing apparatus 1000 may include a processor 1001, a network interface 1004, and a memory 1005; in addition, the data processing apparatus 1000 may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display and a keyboard, and optionally may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, for example at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the processor 1001. As shown in Fig. 9, the memory 1005, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application.
In the data processing apparatus 1000 shown in Fig. 9, the network interface 1004 can provide a network communication function, the user interface 1003 is mainly used to provide an input interface for the user, and the processor 1001 can be used to invoke the device control application stored in the memory 1005 to implement:
Obtaining the data content feature corresponding to multimedia data;
Determining, according to the data content feature, a first screening vector for screening features related to the identification dimension information;
Screening out of the data content feature, according to the first screening vector, the target information feature associated with the identification dimension information;
Determining, according to the target information feature, the data attribute type in the multimedia data that matches the identification dimension information.
It should be understood that the data processing apparatus 1000 described in this embodiment of the present application can carry out the description of the data processing method in the embodiment corresponding to either Fig. 2 or Fig. 4 above, and can also carry out the description of the data processing apparatus 1 in the embodiment corresponding to Fig. 8 above; details are not repeated here. The description of the beneficial effects of the same method is likewise not repeated.
In addition, it should be noted that an embodiment of the present application further provides a computer-readable storage medium that stores the computer program executed by the data processing apparatus 1 mentioned above. The computer program includes program instructions, and when the processor executes the program instructions, it can carry out the description of the data processing method in the embodiment corresponding to either Fig. 2 or Fig. 4 above; details are therefore not repeated here. The description of the beneficial effects of the same method is likewise not repeated. For technical details not disclosed in the computer-readable storage medium embodiment of the present application, refer to the description of the method embodiments of the present application.
A person of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The above disclosure is merely a preferred embodiment of the present application and certainly cannot be used to limit the scope of its claims; equivalent changes made according to the claims of the present application therefore still fall within the scope covered by the present application.

Claims (13)

1. A data processing method, characterized by comprising:
Obtaining a data content feature corresponding to multimedia data;
Determining, according to the data content feature, a first screening vector for screening features related to identification dimension information;
Screening out of the data content feature, according to the first screening vector, a target information feature associated with the identification dimension information;
Determining, according to the target information feature, the data attribute type in the multimedia data that matches the identification dimension information.
2. The method according to claim 1, characterized in that obtaining the data content feature corresponding to the multimedia data comprises:
Obtaining the multimedia data, and inputting the multimedia data and the identification dimension information into a data identification model;
Obtaining, in the data identification model, the data content feature corresponding to the multimedia data;
The data identification model comprises an input layer, a convolutional layer, a reset gate unit, a first fully connected layer, and an output layer; the input layer is used to input the multimedia data and the identification dimension information, the convolutional layer is used to obtain the data content feature corresponding to the multimedia data, the reset gate unit comprises a second fully connected layer, the second fully connected layer is used to obtain the first screening vector, the first fully connected layer is used to obtain the target information feature, and the output layer is used to output the data attribute type corresponding to the multimedia data.
3. The method according to claim 2, characterized in that the multimedia data comprises text data;
Obtaining, in the data identification model, the data content feature corresponding to the multimedia data comprises:
Dividing, in the data identification model, the text data into multiple unit characters, and converting each unit character into a unit word vector;
Splicing the unit word vectors into a text matrix corresponding to the text data;
Performing feature extraction on the text matrix based on the convolutional layer to obtain the data content feature corresponding to the text data.
4. The method according to claim 2, characterized in that determining, according to the data content feature, the first screening vector for screening features related to the identification dimension information comprises:
Converting the identification dimension information into a target word vector;
Inputting the target word vector and the data content feature into the second fully connected layer, and obtaining, based on a first activation function corresponding to the second fully connected layer, the first screening vector for screening features related to the identification dimension information.
5. The method according to claim 2, characterized in that screening out of the data content feature, according to the first screening vector, the target information feature associated with the identification dimension information comprises:
Performing a vector dot product of the first screening vector and the data content feature to obtain a retained information feature associated with the identification dimension information;
Inputting the retained information feature and the target word vector corresponding to the identification dimension information into the first fully connected layer;
Obtaining the target information feature in the data content feature based on a second activation function corresponding to the first fully connected layer.
6. The method according to claim 5, characterized in that the data identification model further comprises an update gate unit, and the update gate unit comprises a third fully connected layer;
The method further comprises:
Inputting the target word vector corresponding to the identification dimension information and the data content feature into the third fully connected layer, and obtaining, based on a third activation function corresponding to the third fully connected layer, a second screening vector for screening features related to the identification dimension information;
Obtaining the target information feature in the data content feature based on the second activation function corresponding to the first fully connected layer then comprises:
Obtaining, based on the second activation function corresponding to the first fully connected layer, a first candidate information feature in the data content feature that is associated with the identification dimension information;
Performing a vector dot product of the second screening vector and the first candidate information feature to obtain the target information feature.
7. The method according to claim 6, characterized in that performing the vector dot product of the second screening vector and the first candidate information feature to obtain the target information feature comprises:
Performing a vector dot product of the second screening vector and the first candidate information feature to obtain a second candidate information feature;
Determining a global information feature based on the second screening vector and the data content feature;
Determining the target information feature according to the second candidate information feature and the global information feature.
8. The method according to any one of claims 1 to 7, characterized in that determining, according to the target information feature, the data attribute type in the multimedia data that matches the identification dimension information comprises:
Inputting the target information feature into the classifier in the output layer;
Identifying, based on the classifier, the matching probabilities between the target information feature and the multiple attribute types in the classifier, and determining the attribute type with the maximum matching probability as the data attribute type in the multimedia data that matches the identification dimension information.
9. The method according to claim 1, characterized by further comprising:
Obtaining sample multimedia data and at least one attribute type label corresponding to the sample multimedia data; the attribute type label is used to characterize the data attribute type corresponding to the sample multimedia data;
Obtaining, from the sample multimedia data, sample multimedia sub-data associated with the at least one attribute type label; each piece of sample multimedia sub-data corresponds to one attribute type label;
Training the data identification model based on the mapping relationship between the sample multimedia sub-data and the at least one attribute type label.
10. The method according to claim 9, characterized in that obtaining, from the sample multimedia data, the sample multimedia sub-data associated with the at least one attribute type label comprises:
Adjusting the sample multimedia data to a target size, and determining the size-adjusted sample multimedia data as target sample data;
Generating a sample data matrix corresponding to the target sample data, and obtaining, from the sample data matrix, the sample multimedia sub-data associated with the at least one attribute type label.
11. A data processing apparatus, characterized by comprising:
A data acquisition module, configured to obtain a data content feature corresponding to multimedia data;
A first determining module, configured to determine, according to the data content feature, a first screening vector for screening features related to identification dimension information;
A screening module, configured to screen out of the data content feature, according to the first screening vector, a target information feature associated with the identification dimension information;
A second determining module, configured to determine, according to the target information feature, the data attribute type in the multimedia data that matches the identification dimension information.
12. A data processing apparatus, characterized by comprising: a processor and a memory;
The processor is connected to the memory, wherein the memory is used to store a computer program and the processor is used to invoke the computer program to execute the method according to any one of claims 1 to 10.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, execute the method according to any one of claims 1 to 10.
CN201910559777.4A 2019-06-26 2019-06-26 A kind of data processing method, device and readable storage medium storing program for executing Pending CN110287341A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910559777.4A CN110287341A (en) 2019-06-26 2019-06-26 A kind of data processing method, device and readable storage medium storing program for executing

Publications (1)

Publication Number Publication Date
CN110287341A true CN110287341A (en) 2019-09-27

Family

ID=68005726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910559777.4A Pending CN110287341A (en) 2019-06-26 2019-06-26 A kind of data processing method, device and readable storage medium storing program for executing

Country Status (1)

Country Link
CN (1) CN110287341A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110955789A (en) * 2019-12-31 2020-04-03 腾讯科技(深圳)有限公司 Multimedia data processing method and equipment
CN110955789B (en) * 2019-12-31 2024-04-12 腾讯科技(深圳)有限公司 Multimedia data processing method and equipment
CN111274393A (en) * 2020-01-17 2020-06-12 深圳数联天下智能科技有限公司 Method and device for constructing knowledge base about article and computing equipment
CN111274393B (en) * 2020-01-17 2024-04-09 深圳数联天下智能科技有限公司 Method and device for constructing knowledge base about articles and computing equipment
CN111339255A (en) * 2020-02-26 2020-06-26 腾讯科技(深圳)有限公司 Target emotion analysis method, model training method, medium, and device
CN111339255B (en) * 2020-02-26 2023-04-18 腾讯科技(深圳)有限公司 Target emotion analysis method, model training method, medium, and device
CN112890828A (en) * 2021-01-14 2021-06-04 重庆兆琨智医科技有限公司 Electroencephalogram signal identification method and system for densely connecting gating network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination