CN108509427B - Data processing method and application of text data - Google Patents


Info

Publication number: CN108509427B
Application number: CN201810370375.5A
Authority: CN (China)
Prior art keywords: data, feature, word, emotion, text data
Inventor: Yang Peng (杨鹏)
Current and original assignee: Beijing Huiwen Technology Group Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Other languages: Chinese (zh)
Other versions: CN108509427A
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)

Application filed by Beijing Huiwen Technology Group Co ltd
Priority to CN201810370375.5A
Publication of CN108509427A
Application granted
Publication of CN108509427B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology


Abstract

The application relates to a data processing method and a data processing apparatus for text data, and to an electronic device. The data processing method comprises the following steps: acquiring text data of a user; extracting the text data by a first feature extraction method to obtain first feature data; extracting the text data by a second feature extraction method to obtain second feature data; and training a hybrid convolutional neural network model with the first feature data and the second feature data, the hybrid convolutional neural network model including a mixed layer for mixing the first feature data and the second feature data. Training the hybrid convolutional neural network model on multi-feature data in this way can improve the effectiveness, reliability and robustness of the model.

Description

Data processing method and application of text data
Technical Field
The present invention relates generally to the field of data processing, and more particularly to a data processing method, a data processing apparatus and an electronic device for processing text data.
Background
With the development and popularization of internet technology, electronic commerce accounts for an ever greater proportion of people's daily life and shopping. While users consume on e-commerce platforms, a large amount of e-commerce data related to products is generated, for example review data about the products. Mining this data, for example mining its sentiment information, in order to understand products more comprehensively and to optimize products and industries, has therefore become a hot field of current scientific research.
A large number of algorithms have been applied to e-commerce data mining; they play important roles, and many of them are applied very successfully across wide application fields. However, these algorithms also have deficiencies in different aspects, such as robustness and predictability.
Accordingly, there is a need for improved data processing schemes for e-commerce data mining.
Disclosure of Invention
The present application is proposed to solve the above-mentioned technical problems. Embodiments of the present application provide a data processing method, a data processing apparatus, and an electronic device, which train a hybrid convolutional neural network model based on multi-feature data, and can improve effectiveness, reliability, and robustness of the hybrid convolutional neural network model.
According to an aspect of the present application, there is provided a data processing method including: acquiring text data of a user; extracting the text data by a first feature extraction method to obtain first feature data; extracting the text data by a second feature extraction method to obtain second feature data; and training a hybrid convolutional neural network model with the first feature data and the second feature data, the hybrid convolutional neural network model including a mixture layer for mixing the first feature data and the second feature data.
In the data processing method, the first feature extraction method is a feature extraction method for extracting emotion word features of the text data; and the second feature extraction method is a feature extraction method for extracting a word frequency feature of the text data.
In the above data processing method, the extracting the text data by a first feature extraction method to obtain first feature data includes: performing word vector conversion on the text data to obtain a word vector space containing a word vector for each word in the text data; performing word segmentation on the text data; screening out emotion words matched with the emotion dictionary based on the emotion dictionary; and selecting an emotion word vector corresponding to the emotion word in the word vector space as the first feature data.
In the data processing method, before the step of screening out the emotion words matched with the emotion dictionary based on the emotion dictionary, the method comprises the following steps: screening emotional characteristic seed words from open source resources according to a preset rule; and constructing the emotion dictionary based on the emotion characteristic seed words.
In the data processing method, the step of constructing the emotion dictionary based on the emotion feature seed words includes: selecting, in the word vector space, the k candidate words whose distance from the emotion feature seed words satisfies a preset distance; adding the k candidate words to the emotion dictionary as updated emotion feature seed words; and reducing the value of k and updating into the emotion dictionary the k candidate words whose distance from the updated emotion feature seed words satisfies the preset distance, so as to construct an emotion dictionary of a preset scale.
In the above data processing method, the extracting of the text data by the second feature extraction method to obtain second feature data includes: performing word segmentation on the text data; performing word frequency statistics on each word in the text data through a language dictionary; removing low-frequency words whose word frequency is lower than a preset word frequency threshold from the text data; numbering the remaining words (excluding the low-frequency words) in descending order of word frequency to create a word frequency dictionary; screening out of the text data the word frequency words that match the word frequency dictionary; and converting the word frequency words into their numbers in the word frequency dictionary to serve as the second feature data.
In the above data processing method, the mixed layer is located between an embedding layer and a convolutional layer of the hybrid convolutional neural network model, the mixed layer being used to: receive the first feature data and the vectorized second feature data obtained by converting the second feature data through the embedding layer, where the vector corresponding to the i-th feature is a matrix

v_i ∈ R^(k_i × m),

in which k_i denotes the length of the text extracted by the i-th feature and is a variable value, and m is the result of the high-dimensional mapping and is a fixed value; and convert the first feature data and the vectorized second feature data into a mixed word vector

v ∈ R^((k_1 + k_2 + … + k_n) × m),

where n denotes the number of features.
In the above data processing method, the mixed layer is located between the pooling layer and the fully-connected layer of the hybrid convolutional neural network model, and the mixed layer is used for: combining the one-dimensional vector obtained after the pooling operation with the vector corresponding to the multi-feature data.
In the data processing method, the text data is comment data of the e-commerce website of the user, and the comment data includes comment information and comment star rating.
In the above data processing method, the data processing method further includes: obtaining comment information of an E-commerce website of a user to be mined; and obtaining emotion information of the user through the trained hybrid convolutional neural network.
According to another aspect of the present application, there is also provided a data processing apparatus comprising: a text data acquisition unit for acquiring text data of a user; a first feature extraction unit configured to extract the text data by a first feature extraction method to obtain first feature data; a second feature extraction unit configured to extract the text data by a second feature extraction method to obtain second feature data; and a model training unit for training a hybrid convolutional neural network model with the first feature data and the second feature data, the hybrid convolutional neural network model including a hybrid layer for mixing the first feature data and the second feature data.
In the above data processing apparatus, the first feature extraction method is a feature extraction method for extracting an emotion word feature of the text data; and the second feature extraction method is a feature extraction method for extracting a word frequency feature of the text data.
In the above data processing apparatus, the first feature extraction unit is configured to: performing word vector conversion on the text data to obtain a word vector space containing a word vector for each word in the text data; performing word segmentation on the text data; screening out emotion words matched with the emotion dictionary based on the emotion dictionary; and converting the emotion words into corresponding emotion word vectors based on the word vector space as the first feature data.
In the data processing apparatus, before the first feature extraction unit selects an emotion word matching the emotion dictionary based on the emotion dictionary, the first feature extraction unit is further configured to: screening emotional characteristic seed words from open source resources according to a preset rule; and constructing the emotion dictionary based on the emotion characteristic seed words.
In the data processing apparatus, the first feature extraction unit being configured to construct the emotion dictionary based on the emotion feature seed words includes: selecting, in the word vector space, the k candidate words whose distance from the emotion feature seed words satisfies a preset distance; adding the k candidate words to the emotion dictionary as updated emotion feature seed words; and reducing the value of k and updating into the emotion dictionary the k candidate words whose distance from the updated emotion feature seed words satisfies the preset distance, so as to construct an emotion dictionary of a preset scale.
In the above data processing apparatus, the second feature extraction unit is configured to: perform word segmentation on the text data; perform word frequency statistics on each word in the text data through a language dictionary; remove low-frequency words whose word frequency is lower than a preset word frequency threshold from the text data; number the remaining words (excluding the low-frequency words) in descending order of word frequency to create a word frequency dictionary; screen out of the text data the word frequency words that match the word frequency dictionary; and convert the word frequency words into their numbers in the word frequency dictionary to serve as the second feature data.
In the above data processing apparatus, the mixed layer is located between the embedding layer and the convolutional layer of the hybrid convolutional neural network model, the mixed layer being used to: receive the first feature data and the vectorized second feature data obtained by converting the second feature data through the embedding layer, where the vector corresponding to the i-th feature is a matrix

v_i ∈ R^(k_i × m),

in which k_i denotes the length of the text extracted by the i-th feature and is a variable value, and m is the result of the high-dimensional mapping and is a fixed value; and convert the first feature data and the vectorized second feature data into a mixed word vector

v ∈ R^((k_1 + k_2 + … + k_n) × m),

where n denotes the number of features.
In the above data processing apparatus, the mixed layer is located between the pooling layer and the fully-connected layer of the hybrid convolutional neural network model, and the mixed layer is configured to: concatenate the one-dimensional vectors obtained after the pooling operation.
In the above data processing apparatus, the text data is comment data of the user on an e-commerce site, the comment data including comment information and a comment star rating.
In the data processing apparatus, the text data acquisition unit is used for acquiring comment data of the e-commerce website from the user to be mined; and the trained hybrid convolutional neural network model obtains the emotion information of the user.
According to yet another aspect of the present application, there is provided an electronic device including: a processor; and a memory in which are stored computer program instructions which, when executed by the processor, cause the processor to carry out the data processing method as described above.
The data processing method, the data processing device and the electronic equipment can train the hybrid convolutional neural network model based on multi-feature data, so that the effectiveness, the reliability and the robustness of the hybrid convolutional neural network model are improved.
Drawings
These and/or other aspects and advantages of the present invention will become more apparent and more readily appreciated from the following detailed description of the embodiments of the invention, taken in conjunction with the accompanying drawings of which:
fig. 1 illustrates a flow chart of a data processing method according to an embodiment of the present application.
Fig. 2 illustrates a schematic diagram of a convolutional neural network in a data processing method according to an embodiment of the present application.
Fig. 3 illustrates a schematic diagram of feature mixing in a data processing method according to an embodiment of the present application.
Fig. 4 illustrates a schematic diagram of processing three or more features in a data processing method according to an embodiment of the present application.
FIG. 5 illustrates a block diagram of a data processing apparatus according to an embodiment of the present application.
FIG. 6 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are merely some embodiments of the present application and not all embodiments of the present application, with the understanding that the present application is not limited to the example embodiments described herein.
Summary of the application
As described above, a number of data mining algorithms are applied in e-commerce data processing. The first class comprises early unsupervised learning algorithms, which rely on large numbers of prior dictionaries and hand-written rules to perform data mining; such methods depend heavily on human experience and prior knowledge and do not scale. The second class comprises conventional machine learning methods, such as the SVM (Support Vector Machine) and the Bayesian classifier. The third class comprises deep learning neural network methods. Although the latter two classes, especially deep learning neural networks, achieve good results, they still suffer from defects such as method complexity, poor robustness and severe overfitting. In particular, when mining e-commerce data with a deep learning neural network, for example when mining its sentiment information, the structure of the deep learning network model is generally complex and not adjustable. In addition, such a model is often trained on feature data with a single lexical feature, extracted by a single feature extraction method, so that the effectiveness and accuracy of the model are difficult to guarantee.
Based on the above technical problems, the basic idea of the application is to extract multi-feature data with different lexical features by multiple feature extraction methods. The multi-feature data is then mixed in a Convolutional Neural Network (CNN) model to obtain a hybrid convolutional neural network model, so that the hybrid convolutional neural network model is trained with the multi-feature data. Training the hybrid convolutional neural network model on multi-feature data can improve its effectiveness, reliability and robustness. In addition, the model structure of the hybrid convolutional neural network does not depend on correlations between the different feature data within the multi-feature data, but only on the specific feature extraction methods. Thus, the manner in which the multi-feature data is mixed in the hybrid convolutional neural network is adjustable.
Based on this, the application provides a data processing method, a data processing apparatus and an electronic device, which extract the text data by a first feature extraction method to obtain first feature data; extracting the text data by a second feature extraction method to obtain second feature data; further, a hybrid convolutional neural network model is trained with the first feature data and the second feature data. Therefore, the hybrid convolutional neural network model is trained through multi-feature data, so that the quality and the superiority of the hybrid convolutional neural network model are improved, and the reliability and the effectiveness of data mining are improved.
It should be noted that the above basic concept of the present application can be applied to processing text data of various users, and is not limited to e-commerce data of users. That is, the present application can be applied to data processing systems for various text data.
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Illustrative method
Fig. 1 illustrates a flow chart of a data processing method according to an embodiment of the present application. As shown in fig. 1, a data processing method according to an embodiment of the present application includes: S110, acquiring text data of a user; S120, extracting the text data by a first feature extraction method to obtain first feature data; S130, extracting the text data by a second feature extraction method to obtain second feature data; and S140, training a hybrid convolutional neural network model with the first feature data and the second feature data, the hybrid convolutional neural network model including a mixed layer for mixing the first feature data and the second feature data.
Specifically, in the data processing method according to the embodiment of the application, the text data may be comment data of an e-commerce website of a user, and the hybrid convolutional neural network model is used for obtaining emotion information of the user based on the comment data of the e-commerce website of the user. Hereinafter, a data processing method according to an embodiment of the present application will be described taking this as an example.
In step S110, text data of the user is acquired. For example, as described above, the text data of the user is comment data from the user's e-commerce website, which includes comment information and a comment star rating. In the subsequent training of the hybrid convolutional neural network model, the comment information serves as the training input and the comment star rating serves as the data label. In other words, in the data processing method according to the embodiment of the present application, the hybrid convolutional neural network model is trained by supervised learning.
It is worth mentioning that executing the step of obtaining the user's comment data from the e-commerce website involves data collection and data screening. In particular, the data collection work can be completed with Python tools, selectively relying on Html web page processing and on the input and output data streams of those tools to acquire and organize the data. To eliminate cases in which the comment information and the comment star rating do not correspond, for example because of maliciously posted fake reviews, a certain degree of additional data screening can be applied to improve the quality of the training data.
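As a loose illustration of the additional screening mentioned above (the field names and the simple heuristics below are illustrative assumptions, not part of the patent):

```python
# Hypothetical sketch of the review-screening step: drop entries that are
# empty, too short to carry sentiment, or verbatim duplicates (a cheap
# symptom of review spam). Field names "text"/"stars" are assumptions.

def screen_reviews(reviews, min_length=4):
    """Keep (text, stars) pairs that look like genuine, usable samples."""
    seen = set()
    cleaned = []
    for r in reviews:
        text, stars = r.get("text", "").strip(), r.get("stars")
        if not text or stars is None:   # missing fields
            continue
        if len(text) < min_length:      # too short to carry sentiment
            continue
        if text in seen:                # verbatim duplicate
            continue
        seen.add(text)
        cleaned.append({"text": text, "stars": stars})
    return cleaned

raw = [
    {"text": "Great product, works as advertised", "stars": 5},
    {"text": "Great product, works as advertised", "stars": 5},  # duplicate
    {"text": "ok", "stars": 3},                                  # too short
    {"text": "Arrived broken, very disappointed", "stars": 1},
]
print(len(screen_reviews(raw)))  # 2 usable samples remain
```

A real pipeline would add checks that the star rating is plausible for the review text, which the patent leaves unspecified.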
In steps S120 and S130, the text data is extracted in a first feature extraction method to obtain first feature data, and the text data is extracted in a second feature extraction method to obtain second feature data. In particular, in the data processing method according to the embodiment of the present application, the first feature data extraction method is a data feature extraction method for extracting emotion word features of the text data, and the second feature data extraction method is a feature extraction method for extracting word frequency features of the text data. That is, in this embodiment of the present application, the first feature data is emotion word feature data, and the second feature data is word frequency feature data.
Here, in the data processing method according to the embodiment of the present application, the first feature data is acquired by a first feature data extraction method different from the second feature data extraction method, and the second feature data is acquired by a second feature data extraction method, the first feature data having lexical features different from the second feature data. In this way, in the data processing method according to the embodiment of the application, the mixed convolutional neural network model can be trained by extracting multi-feature data of different lexical features from text data through multiple feature extraction methods. Here, the lexical feature refers to a feature formed based on text, words, or the like, that is, a shallower-layer feature that relates only to a text level and not to a semantic or conceptual level.
More specifically, in step S120, the text data is extracted in a first feature extraction method for extracting emotion word features to obtain first feature data. First, word vector conversion is performed on the text data to obtain a word vector space containing a word vector for each word in the text data. That is, in the process of executing step S120, word vector conversion is performed on the text data.
In particular, in the data processing method according to the embodiment of the present application, the word vector conversion of the text data may be performed as follows: first, each word of the text data is modeled to obtain a high-dimensional representation of each word; then a Hidden Markov Model is constructed from the high-dimensional representation of each word and the connection probabilities between words, forming a word vector space corresponding to the text data, in which the distance between points represents the semantic relation between different words.
After the word vector space corresponding to the text data is obtained through the method, word segmentation is further carried out on the text data, and emotion words matched with an emotion dictionary in word segmentation of the text data are screened out on the basis of the emotion dictionary. And finally, selecting an emotion word vector corresponding to the emotion words in the word vector space as the first feature data, wherein the first feature data is emotion word feature data.
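The emotion-word selection just described can be sketched as follows; the toy emotion dictionary and word vector space are illustrative assumptions:

```python
# Minimal sketch of the first feature-extraction path: segment the text,
# keep only the words found in the emotion dictionary, and look up their
# word vectors in the word vector space. Toy data stands in for a real
# trained vector space.

emotion_dictionary = {"excellent", "terrible", "love", "hate"}

# word -> word vector (3-dimensional toy vectors)
word_vector_space = {
    "excellent": [0.9, 0.1, 0.2],
    "terrible":  [-0.8, 0.2, 0.1],
    "product":   [0.0, 0.5, 0.5],
    "love":      [0.7, 0.3, 0.1],
}

def extract_emotion_features(tokens):
    """Return the emotion-word vectors (the first feature data)."""
    return [word_vector_space[w] for w in tokens
            if w in emotion_dictionary and w in word_vector_space]

tokens = ["excellent", "product", "love"]   # result of word segmentation
features = extract_emotion_features(tokens)
print(len(features))  # 2 emotion-word vectors: "excellent" and "love"
```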
Here, in the step of screening the text data to obtain the emotion words of the text data, the emotion dictionary is first constructed. In the data processing method according to the embodiment of the application, the emotion dictionary may be constructed based on the following manner: firstly, selecting emotional characteristic seed words from open source resources according to a preset rule, and then constructing the emotional dictionary based on the emotional characteristic seed words. In particular, in the process of constructing the emotion dictionary based on the emotion feature seeds, k candidate words in the word vector space, the distance between which and the emotion feature seed word meets a preset distance, need to be selected, and the k candidate words are used as updated emotion feature seed words and added to the emotion dictionary. Further, the value of k is reduced, and the k candidate words with the distance between the updated emotional feature seed words and the preset distance are updated to the emotional dictionary to construct the emotional dictionary with the preset scale.
That is, the above operation expands every word in the lexicon in each round. For example, assume there are originally 100 words and the initial value of k is 5; after the first round of expansion, at most 500 words (without repetition) are added. In the second round of expansion, k becomes 4; the previously updated 600 words are expanded, and 500 × 4 = 2000 new words are introduced based on the newly added 500 words. Note, firstly, that in the second round of expansion the 4 words closest to an original word must already have been expanded, but for ease of implementation the same updating method is still adopted. Furthermore, after two rounds of expansion each word has in fact spawned 20 new words, though not necessarily the 20 words closest to the original word, because the second round of expansion is based on the results of the first round of updating.
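The expansion procedure above might be sketched as follows, assuming a word-vector space with Euclidean distance (the patent does not fix a distance metric, and the toy vectors and helper names are illustrative):

```python
# Sketch of emotion-dictionary construction by iterative expansion:
# each round adds, for every current seed word, its k nearest neighbours
# in the word vector space, then shrinks k for the next round.
import math

vectors = {
    "good":  (1.0, 0.0), "great": (1.1, 0.1), "fine": (0.9, 0.2),
    "nice":  (1.2, 0.0), "bad":   (-1.0, 0.0), "poor": (-1.1, 0.1),
}

def nearest_k(word, k, exclude):
    """k words closest to `word` in the vector space, excluding `exclude`."""
    dist = lambda a, b: math.dist(vectors[a], vectors[b])
    cands = [w for w in vectors if w != word and w not in exclude]
    return sorted(cands, key=lambda w: dist(w, word))[:k]

def build_emotion_dictionary(seeds, k, target_size):
    dictionary = set(seeds)
    frontier = set(seeds)
    while k > 0 and len(dictionary) < target_size:
        added = set()
        for seed in frontier:                  # expand every current seed
            added |= set(nearest_k(seed, k, dictionary))
        dictionary |= added
        frontier = added                       # new words become seeds
        k -= 1                                 # shrink k each round
    return dictionary

d = build_emotion_dictionary({"good"}, k=2, target_size=5)
print(sorted(d))  # ['fine', 'good', 'great', 'nice']
```

The `target_size` parameter plays the role of the "preset scale" in the text; note that "bad"/"poor" are never reached because they sit far from the positive seed.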
In step S130, the text data is extracted by a second feature extraction method for extracting word frequency features to obtain second feature data. Firstly, performing word segmentation on the text data; and then, performing word frequency statistics on each word in the text data through a language dictionary. Further, low-frequency words (including zero-frequency words) with the word frequency lower than a preset word frequency threshold value in the text data are removed, and the rest words except the low-frequency words in the text data are numbered in a descending order to create a word frequency dictionary. Further, the word frequency words matched with the word frequency dictionary in the text data are screened out based on the word frequency dictionary, and the word frequency words are converted into numbers in the word frequency dictionary to serve as the second feature data.
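A minimal sketch of this word-frequency path, with an assumed frequency threshold standing in for the preset word frequency threshold:

```python
# Second feature-extraction path: count word frequencies, drop
# low-frequency words, number the remaining words in descending
# frequency order, and map each text to the resulting numbers.
from collections import Counter

def build_frequency_dictionary(segmented_texts, min_freq=2):
    counts = Counter(w for text in segmented_texts for w in text)
    kept = [w for w, c in counts.most_common() if c >= min_freq]
    # number words starting from 1, highest frequency first
    return {w: i for i, w in enumerate(kept, start=1)}

def to_frequency_features(tokens, freq_dict):
    """Second feature data: words mapped to their dictionary numbers."""
    return [freq_dict[w] for w in tokens if w in freq_dict]

texts = [["good", "phone", "good", "screen"],
         ["good", "screen", "bad", "battery"]]
fd = build_frequency_dictionary(texts)          # {'good': 1, 'screen': 2}
print(to_frequency_features(["good", "screen", "battery"], fd))  # [1, 2]
```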
Further, after the first feature data and the second feature data are extracted by the first feature extraction method and the second feature extraction method, respectively, step S140 is performed: training a hybrid convolutional neural network model with the first feature data and the second feature data, the hybrid convolutional neural network model including a hybrid layer for mixing the first feature data and the second feature data. That is, in the data processing method according to the embodiment of the present application, the hybrid convolutional neural network model is trained with multi-feature data in which the first feature data and the second feature data are mixed.
More specifically, in the data processing method according to the embodiment of the present application, the hybrid convolutional neural network model includes an embedding layer, a convolutional layer, a pooling layer and a fully-connected layer in addition to the mixed layer. In particular, the position of the mixed layer within the hybrid convolutional neural network is adjustable; that is, the structure of the hybrid convolutional neural network is adjustable, and the way in which the mixed layer mixes the first feature data and the second feature data changes correspondingly with the position of the mixed layer in the model.
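When the mixed layer sits between the embedding layer and the convolutional layer, its mixing operation amounts to stacking the per-feature word-vector matrices; a minimal pure-Python sketch (a real model would use tensors):

```python
# Each feature contributes a k_i x m matrix (k_i rows of m-dimensional
# word vectors, k_i varying per feature, m fixed); the mixed layer stacks
# them into one (k_1 + ... + k_n) x m mixed matrix.

def mix_features(feature_matrices):
    """Concatenate per-feature word-vector matrices along the row axis."""
    m = len(feature_matrices[0][0])
    for mat in feature_matrices:
        assert all(len(row) == m for row in mat), "all vectors share dim m"
    mixed = []
    for mat in feature_matrices:   # k_i varies per feature, m is fixed
        mixed.extend(mat)
    return mixed

first = [[0.1, 0.2], [0.3, 0.4]]   # k_1 = 2 emotion-word vectors
second = [[0.5, 0.6]]              # k_2 = 1 vectorized frequency word
mixed = mix_features([first, second])
print(len(mixed), len(mixed[0]))   # 3 2, i.e. (k_1 + k_2) x m
```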
To more clearly illustrate the different ways in which the mixed layer mixes the multi-feature data when located at different positions of the hybrid convolutional neural network model, the embedding layer, the convolutional layer, the pooling layer and the fully-connected layer of the hybrid convolutional neural network are described first.
Similar to the existing convolutional neural network model, the embedding layer is used to convert text data of words into word vector form, thereby converting sentences into a matrix. The convolutional layer is used for local feature extraction to perform feature learning. The pooling layer is used for feature screening. The full connection layer is used for converting the vector of the first preset dimension into the vector of the second preset dimension.
Fig. 2 illustrates a schematic diagram of a convolutional neural network in a data processing method according to an embodiment of the present application. As shown in fig. 2, the embedding layer further includes an input layer and a conversion layer, wherein the input of the input layer is standardized to a vector or a matrix with a fixed length; in particular, in the data processing method according to the embodiment of the present application, the input layer receives the comment information of the user on the e-commerce platform in vector form or matrix form. The conversion layer is set for the normalization format requirement: its function is to convert a word vector into a matrix, and its specific operation is to convert each value of the word vector into a vector, thereby realizing the conversion from vector to matrix. For example, assuming that the maximum value in the word vector is n and the dimension of the output word vector is m, this layer contains an n × m matrix and converts the value i into the output of the ith row of this matrix. In the big-data training process, this layer simulates a distributed word-vector acquisition method, continuously applying an HMM model and related processing to obtain a word-vector acquisition means that meets the requirements.
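The conversion layer's value-to-row lookup can be sketched as follows (a minimal NumPy illustration; the matrix values and dimensions are made up and stand in for the trained n × m matrix described above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 4                          # n = maximum value in the word vector, m = output dimension
embedding = rng.normal(size=(n, m))  # the n x m matrix held by the conversion layer

word_vector = np.array([3, 1, 4])    # integer values produced upstream
matrix = embedding[word_vector - 1]  # the value i selects the ith row (1-indexed)
print(matrix.shape)                  # (3, 4): the vector has become a matrix
```

Each scalar in the input vector is replaced by an m-dimensional row, which is exactly the vector-to-matrix conversion the layer performs.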
The convolutional layer performs local feature extraction, more specifically, local feature extraction by a convolution kernel. If the output matrix of the convolutional layer is denoted s, each row of the output matrix s is expressed as

s_i = g(α · [v_t : v_{t+h−1}] + b),

where α is the convolution kernel, v is the input vector of the convolutional layer, g is the activation function, b is the offset, t is the convolution start position, h is the length of the convolution kernel, and s_i denotes the ith row of the matrix s, with t = i.
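The row-wise convolution above can be sketched numerically (a minimal illustration with an arbitrary kernel and tanh assumed as the activation function g; all values are made up):

```python
import numpy as np

def conv_rows(v, alpha, b, g=np.tanh):
    """s_i = g(alpha . [v_t : v_{t+h-1}] + b) with t = i,
    sliding the kernel over the rows of the input matrix v."""
    h = alpha.shape[0]               # length of the convolution kernel
    return np.array([g(np.sum(alpha * v[t:t + h]) + b)
                     for t in range(v.shape[0] - h + 1)])

v = np.arange(12, dtype=float).reshape(6, 2)  # 6 word vectors of dimension 2
alpha = np.full((2, 2), 0.1)                  # convolution kernel, h = 2
s = conv_rows(v, alpha, b=0.0)
print(s.shape)   # (5,) -- one local feature per start position t
```

With 6 input rows and a kernel of length 2, there are 6 − 2 + 1 = 5 valid start positions, hence 5 output values.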
The role of the pooling layer is to further screen features, wherein, in the data processing method according to the embodiment of the present application, the pooling layer may further screen local features extracted by the convolutional layer by means of mean pooling, maximum pooling or a combination of mean pooling and maximum pooling.
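The pooling options mentioned above (mean, maximum, or their combination) can be illustrated on toy feature maps (values are made up):

```python
import numpy as np

features = np.array([[0.2, 0.9, 0.4],
                     [0.7, 0.1, 0.5]])   # two feature maps from the convolutional layer

max_pooled  = features.max(axis=1)       # maximum pooling: keeps the strongest response
mean_pooled = features.mean(axis=1)      # mean pooling: keeps the average response
combined = np.concatenate([max_pooled, mean_pooled])  # combination of both
print(combined.shape)   # (4,)
```

Each feature map collapses to a single value per pooling operator, which is the screening of local features described above.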
Further, in the data processing method according to an embodiment of the present application, the mixed layer may be disposed between the embedding layer and the convolutional layer of the hybrid convolutional neural network, or between the pooling layer and the fully-connected layer of the hybrid neural network model, for fusing the first feature data and the second feature data into the hybrid convolutional neural network model in a specific manner. According to the relative position of the mixed layer and the convolutional layer in the flow of data, the hybrid convolutional neural network model in which the mixed layer is arranged before the convolutional layer is defined as a forward hybrid convolutional neural network model, and the corresponding multi-feature data mixing method is the forward mixing method. Correspondingly, the hybrid convolutional neural network model in which the mixed layer is disposed after the convolutional layer is defined as a backward hybrid convolutional neural network model, and the corresponding multi-feature data mixing method is the backward mixing method.
More specifically, in the forward hybrid convolutional neural network model, the mixed layer first receives the first feature data and the vectorized second feature data obtained by converting the second feature data through the embedding layer. The vectors corresponding to the first feature data and the vectorized second feature data are

x_i ∈ R^{k_i × m}, i = 1, 2, …, n,

where k_i denotes the length of the text extracted by the ith feature and is an indeterminate value, m is the dimension of the high-dimensional mapping and is a determinate value, and R denotes the vector space. Further, the mixed layer converts the first feature data and the vectorized second feature data into a mixed word vector

X = [x_1; x_2; …; x_n] ∈ R^{(k_1 + k_2 + … + k_n) × m},

where n denotes the number of features. That is, the mixed layer performs matrix splicing.
That is, the forward mixing method includes: receiving the first feature data and the vectorized second feature data obtained by converting the second feature data through the embedding layer, where the vectors corresponding to the first feature data and the vectorized second feature data are x_i ∈ R^{k_i × m}, in which k_i denotes the length of the text extracted by the ith feature and is an indeterminate value, and m is the dimension of the high-dimensional mapping and is a determinate value; and converting the first feature data and the vectorized second feature data into the mixed word vector X = [x_1; x_2; …; x_n] ∈ R^{(k_1 + k_2 + … + k_n) × m}, where n denotes the number of features.
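The forward mixing (row-wise matrix splicing before the convolutional layer) can be sketched as follows (a minimal illustration; k1, k2, and m are arbitrary):

```python
import numpy as np

m = 8                        # common embedding dimension after the high-dimensional mapping
x1 = np.ones((5, m))         # first feature: k1 = 5 emotion-word vectors
x2 = np.ones((12, m))        # second feature: k2 = 12 word-frequency word vectors
mixed = np.vstack([x1, x2])  # row-wise matrix splicing performed by the mixed layer
print(mixed.shape)           # (17, 8): a (k1 + k2) x m matrix fed to the convolutional layer
```

Because all feature matrices share the column dimension m, stacking rows is well defined even though each k_i is indeterminate.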
Accordingly, in the backward hybrid convolutional neural network model, the hybrid layer concatenates the one-dimensional vectors obtained after the pooling operation to obtain a hybrid vector.
That is, the backward mixing method includes concatenating one-dimensional vectors obtained after the pooling operation.
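The backward mixing (concatenation of the pooled one-dimensional vectors before the fully-connected layer) can be sketched with toy values:

```python
import numpy as np

pooled1 = np.array([0.3, 0.8])              # 1-D vector from the pooling of branch 1
pooled2 = np.array([0.1, 0.6, 0.9])         # 1-D vector from the pooling of branch 2
mixed = np.concatenate([pooled1, pooled2])  # mixed vector fed to the fully-connected layer
print(mixed)   # [0.3 0.8 0.1 0.6 0.9]
```

Here each feature branch runs through its own convolution and pooling first, so the mixed layer only needs to join flat vectors.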
Thus, it should be fully understood that the structure of the hybrid convolutional neural network model is adjustable: depending on the position of the mixed layer in the hybrid convolutional neural network model, either the forward hybrid convolutional neural network model or the backward hybrid convolutional neural network model is obtained. Both the forward hybrid convolutional neural network model and the backward hybrid convolutional neural network model can be used for emotion information mining based on the comment data of users on an e-commerce website, and the choice between them depends on the specific situation.
It is worth mentioning that, in the data processing method according to the embodiment of the present application, the structural adjustability of the hybrid convolutional neural network model does not depend on the relationship between the multiple features (i.e., the relationship between the first feature data and the second feature data), but only on the specific feature extraction methods (i.e., the first feature extraction method and the second feature extraction method). The algorithmic principle is proved as follows:
suppose that selecting word vectors for two lexical-feature-based features yields [a_1, a_2, …, a_k, b_1, b_2, …, b_{k′}], where a = [a_1, a_2, …, a_k] is the word-vector result selected by the first feature selection system, b = [b_1, b_2, …, b_{k′}] is the word-vector result selected by the second feature selection system, and the final output y_pre is produced through the fully-connected layer. This process is then equivalent to

y_pre = vH = Σ_{t ≤ k} h_t a_t + Σ_{p > k} h_p b_p,

where H denotes the weight matrix of the fully-connected layer, H = [h_1, h_2, …, h_k, h_{k+1}, …, h_{k+k′}].
For the CNN network, introduce the loss function

L = (y_true − y_pre)².

Then

∂L/∂h_t = −2(y_true − y_pre) a_t, for t ≤ k,

∂L/∂h_p = −2(y_true − y_pre) b_p, for p > k,

from which it can be deduced that, apart from the shared scalar residual (y_true − y_pre), the gradient of each fully-connected weight h_t depends only on the feature value a_t produced by its own extraction method, and likewise ∂L/∂h_p depends only on b_p; there is no cross term between a and b.
Therefore, it can be seen from the above derivation that the structure of the hybrid convolutional neural network model does not depend on the relationship of the multi-feature data to each other, but only on the specific feature extraction method. Thus, the structure of the hybrid convolutional neural network model is adjustable, i.e., the positional variability of the mixed layer is guaranteed under this condition.
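The derivation above can be checked numerically (a sketch assuming a single linear fully-connected output; the analytic gradient −2(y_true − y_pre)·v_j is compared against central differences, and all values are random illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=4)   # word-vector values from the first feature system (k = 4)
b = rng.normal(size=3)   # word-vector values from the second feature system (k' = 3)
H = rng.normal(size=7)   # fully-connected weights [h_1, ..., h_{k+k'}]
y_true = 1.0

v = np.concatenate([a, b])
y_pre = v @ H
grad = -2.0 * (y_true - y_pre) * v   # analytic: dL/dh_j = -2(y_true - y_pre) v_j

def loss(H_):
    return (y_true - v @ H_) ** 2

# central-difference check of the analytic gradient, weight by weight
eps = 1e-6
I = np.eye(7)
num = np.array([(loss(H + eps * I[j]) - loss(H - eps * I[j])) / (2 * eps)
                for j in range(7)])
```

Each weight's gradient is its own feature value scaled by the shared residual, so the a-part and b-part of the gradient never mix, matching the structural-adjustability claim.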
Fig. 3 illustrates a schematic diagram of feature mixing in a data processing method according to an embodiment of the present application. As shown in fig. 3, the comment text of the user is respectively subjected to word frequency filtering and emotion word-word vector conversion. As described above, the text data after the word frequency filtering is converted into word vectors through high-dimensional mapping, and enters the convolutional network. In addition, a corpus is formed by a seed emotion dictionary to perform emotion word vector conversion, and then the corpus enters a convolution network. The solid line of fig. 3 shows the backward mixing method, i.e. mixing before the full connection layer after passing through the convolutional network. On the other hand, the dashed line in fig. 3 shows a forward mixing method, i.e., mixing the obtained word frequency features and emotion word features, and then entering the convolutional network.
Further, after the hybrid convolutional neural network model is trained through the data processing method, comment information of the e-commerce website to be mined accordingly can be input, so that emotion information of the user can be obtained through the trained hybrid convolutional neural network. Here, the hybrid convolutional neural network model is trained by multi-feature data, so that the effectiveness, reliability and robustness of the hybrid convolutional neural network model are improved, and finally, the obtained emotion information of the user has relatively higher reliability and predictability.
Accordingly, in this embodiment of the present application, the data processing method further includes: obtaining comment information of an E-commerce website of a user to be mined; and obtaining emotion information of the user through the trained hybrid convolutional neural network.
In addition, it is worth mentioning that, in the data processing method according to the embodiment of the present application, the hybrid convolutional neural network model has extensibility. That is, in the data processing method according to the embodiment of the present application, a third feature data extraction method may be further introduced to obtain third feature data, where the third feature data has different lexical features from the first feature data and the second feature data. For example, in a specific embodiment, the third feature data includes semantic concept feature data or rating object topic feature data, and the like, and the embodiment of the present invention is not limited in any way.
Fig. 4 illustrates a schematic diagram of processing three or more features in a data processing method according to an embodiment of the present application. As shown in fig. 4, the comment text is subjected to feature extraction by the feature extractor 1, the feature extractor 2, …, and the feature extractor k to obtain a plurality of feature data. Wherein, for example, the data extracted by the feature extractor 2 needs to be mapped in high dimension to obtain a word vector. Then, the plurality of feature data are subjected to convolutional neural networks 1 to k, mixed before the full connection layer, and finally output.
Here, it should be understood that although the above describes the hybrid convolutional neural network model as being used for emotion information mining based on the comment data of users on an e-commerce website as an example, those skilled in the art can understand that the data processing method according to the embodiment of the present application can also be applied to data mining of other text information. This application is not intended to be limiting in any way.
Schematic device
FIG. 5 illustrates a block diagram of a data processing apparatus according to an embodiment of the present application.
As shown in fig. 5, the data processing apparatus 200 according to the embodiment of the present application includes: a text data acquisition unit 210 for acquiring text data of a user; a first feature extraction unit 220 configured to extract the text data acquired by the text data acquisition unit 210 by a first feature extraction method to obtain first feature data; a second feature extraction unit 230 configured to extract the text data acquired by the text data acquisition unit 210 by a second feature extraction method to obtain second feature data; and a model training unit 240 configured to train a hybrid convolutional neural network model including a hybrid layer for mixing the first feature data and the second feature data with the first feature data obtained by the first feature extraction unit 220 and the second feature data obtained by the second feature extraction unit 230.
In one example, in the above-described data processing apparatus 200, the first feature extraction method is a feature extraction method for extracting an emotion word feature of the text data; and the second feature extraction method is a feature extraction method for extracting a word frequency feature of the text data.
In one example, in the data processing apparatus 200 described above, the first feature extraction unit 220 is configured to: performing word vector conversion on the text data to obtain a word vector space containing a word vector for each word in the text data; performing word segmentation on the text data; screening out emotion words matched with the emotion dictionary based on the emotion dictionary; and converting the emotion words into corresponding emotion word vectors based on the word vector space as the first feature data.
In one example, in the data processing apparatus 200, before the first feature extraction unit 220 filters out the emotion words matching the emotion dictionary based on the emotion dictionary, the first feature extraction unit is further configured to: screening emotional characteristic seed words from open source resources according to a preset rule; and constructing the emotion dictionary based on the emotion characteristic seed words.
In one example, in the data processing apparatus 200, the construction of the emotion dictionary by the first feature extraction unit 220 based on the emotion feature seed words includes: selecting k candidate words whose distance from the emotion feature seed words in the word vector space meets a preset distance; adding the k candidate words to the emotion dictionary as the updated emotion feature seed words; and reducing the value of k and continuing to add, to the emotion dictionary, the k candidate words whose distance from the updated emotion feature seed words meets the preset distance, so as to construct the emotion dictionary with a preset scale.
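The iterative dictionary construction described above can be sketched as follows (a hypothetical implementation: the patent specifies a preset distance criterion but no concrete metric, so cosine similarity is assumed here, and the function name, toy two-dimensional word vectors, and parameters are all illustrative):

```python
import numpy as np

def expand_seed_words(word_vecs, seeds, k, rounds, target_size):
    """Iteratively add the k candidate words closest (by cosine similarity)
    to the current seed words, shrinking k each round, until the dictionary
    reaches the target size."""
    lexicon = set(seeds)

    def nearest(word, k):
        v = word_vecs[word]
        candidates = [w for w in word_vecs if w not in lexicon]
        candidates.sort(key=lambda w: -np.dot(word_vecs[w], v)
                        / (np.linalg.norm(word_vecs[w]) * np.linalg.norm(v)))
        return candidates[:k]

    for _ in range(rounds):
        for s in list(lexicon):
            lexicon.update(nearest(s, k))   # newly added words become updated seeds
            if len(lexicon) >= target_size:
                return lexicon
        k = max(1, k - 1)                   # reduce the value of k, as described above
    return lexicon

word_vecs = {"good":  np.array([1.0, 0.0]),
             "great": np.array([0.9, 0.1]),
             "bad":   np.array([-1.0, 0.0]),
             "table": np.array([0.0, 1.0])}
lexicon = expand_seed_words(word_vecs, ["good"], k=1, rounds=1, target_size=2)
print(lexicon)   # {'good', 'great'}
```

Starting from the single seed "good", the nearest candidate by cosine similarity is "great", so the dictionary grows to the target size and the process stops.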
In one example, in the data processing apparatus 200, the second feature extraction unit 230 is configured to: performing word segmentation on the text data; performing word frequency statistics on each word in the text data through a language dictionary; removing low-frequency words with word frequency lower than a preset word frequency threshold value in the text data; numbering the remaining words other than the low-frequency words in the text data in a descending order manner to create a word frequency dictionary; screening out word frequency words matched with the word frequency dictionary from the text data based on the word frequency dictionary; and converting the word frequency words into numbers in the word frequency dictionary to serve as the second feature data.
In one example, in the above data processing apparatus 200, the mixed layer is located between the embedding layer and the convolutional layer of the hybrid neural network model, the mixed layer being used to: receive the first feature data and the vectorized second feature data obtained by converting the second feature data by the embedding layer, wherein the vectors corresponding to the first feature data and the vectorized second feature data are

x_i ∈ R^{k_i × m},

where k_i denotes the length of the text extracted by the ith feature and is an indeterminate value, and m is the dimension of the high-dimensional mapping and is a determinate value; and convert the first feature data and the vectorized second feature data into a mixed word vector

X = [x_1; x_2; …; x_n] ∈ R^{(k_1 + k_2 + … + k_n) × m},

where n denotes the number of features.
In one example, in the above data processing apparatus 200, the mixed layer is located between the pooling layer and the fully-connected layer of the hybrid neural network model, the mixed layer being used to: splice the one-dimensional vectors corresponding to the multi-feature data obtained after the pooling operation into a mixed vector.
In one example, in the data processing apparatus 200 described above, the text data is comment data of the user's e-commerce website, the comment data including comment information and a comment star rating.
In one example, in the data processing apparatus 200, the text data obtaining unit 210 is configured to obtain comment data of an e-commerce website of the user to be mined; and the mixed convolutional neural network model obtains the emotion information of the user.
Here, it can be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the above-described data processing apparatus 200 have been described in detail in the data processing method described above with reference to fig. 1 to 4, and thus, a repetitive description thereof will be omitted.
As described above, the data processing apparatus according to the embodiment of the present application can be implemented in various terminal devices, such as a server for user data mining. In one example, the data processing apparatus according to the embodiment of the present application may be integrated into the terminal device as a software module and/or a hardware module. For example, the data processing means may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the data processing means can also be one of a plurality of hardware modules of the terminal device.
Alternatively, in another example, the data processing apparatus and the terminal device may be separate devices, and the data processing apparatus may be connected to the terminal device through a wired and/or wireless network and transmit the interaction information according to an agreed data format.
Illustrative electronic device
Next, an electronic apparatus according to an embodiment of the present application is described with reference to fig. 6.
FIG. 6 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
As shown in fig. 6, the electronic device 10 includes one or more processors 11 and memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. On which one or more computer program instructions may be stored that may be executed by the processor 11 to implement the data processing methods of the various embodiments of the application described above and/or other desired functions. Various contents such as comment data of the user at the e-commerce website may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 13 may be, for example, a keyboard, a mouse, or the like.
The output device 14 can output various information including emotion information of the user to the outside. The output devices 14 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 10 relevant to the present application are shown in fig. 6, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Illustrative computer program product
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the data processing method according to various embodiments of the present application described in the above-mentioned "exemplary methods" section of this specification.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a data processing method according to various embodiments of the present application described in the "exemplary methods" section above of the present specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (12)

1. A data processing method of text data includes:
acquiring text data of a user;
extracting the text data by a first feature extraction method to obtain first feature data;
extracting the text data by a second feature extraction method to obtain second feature data; and
training a hybrid convolutional neural network model with the first feature data and the second feature data, the hybrid convolutional neural network model including a mixture layer for mixing the first feature data and the second feature data,
the first feature extraction method is a feature extraction method for extracting emotion word features of the text data; and
the second feature extraction method is a feature extraction method for extracting a word frequency feature of the text data,
extracting the text data to obtain first feature data with a first feature extraction method includes: performing word vector conversion on the text data to obtain a word vector space containing a word vector for each word in the text data; performing word segmentation on the text data; screening out emotion words matched with the emotion dictionary based on the emotion dictionary; and selecting an emotion word vector corresponding to the emotion word in the word vector space as the first feature data,
extracting the text data with a second feature extraction method to obtain second feature data includes: performing word segmentation on the text data; performing word frequency statistics on each word in the text data through a language dictionary; removing low-frequency words with word frequency lower than a preset word frequency threshold value in the text data; numbering the remaining words other than the low-frequency words in the text data in a descending order manner to create a word frequency dictionary; screening out word frequency words matched with the word frequency dictionary from the text data based on the word frequency dictionary; and converting the word frequency words into numbers in the word frequency dictionary as the second feature data,
wherein the mixed layer is located between the embedding layer and the convolution layer of the hybrid convolutional neural network model, the mixed layer being used to: receiving the first feature data and vectorized second feature data obtained by converting the second feature data by the embedding layer, wherein the vectors corresponding to the first feature data and the vectorized second feature data are

x_i ∈ R^{k_i × m},

wherein k_i denotes the length of the text extracted by the ith feature and is an uncertain value, and m is the result of the high-dimensional mapping and is a determined value; and converting the first feature data and the vectorized second feature data into a mixed word vector

X = [x_1; x_2; …; x_n] ∈ R^{(k_1 + k_2 + … + k_n) × m},
Wherein n represents the number of features, or
Wherein the mixed layer is located between the pooling layer and the full-link layer of the mixed convolutional neural network model, and the mixed layer is used for: and splicing the one-dimensional vectors obtained after the pooling operation.
2. The data processing method of claim 1, wherein before screening out emotion words matching the emotion dictionary based on emotion dictionary, further comprising:
screening emotional characteristic seed words from open source resources according to a preset rule; and
and constructing the emotion dictionary based on the emotion feature seed words.
3. The data processing method of claim 2, wherein constructing the emotion dictionary based on the emotion feature seed word comprises:
selecting k candidate words whose distance from the emotion feature seed words in the word vector space meets a preset distance;
adding the k candidate words as updated emotional characteristic seed words to the emotional dictionary; and
reducing the value of k, and continuing to add, to the emotion dictionary, the k candidate words whose distance from the updated emotion feature seed words meets the preset distance, so as to construct the emotion dictionary with the preset scale.
4. The data processing method of any one of claims 1 to 3, wherein the text data is comment data of the user's E-commerce website, the comment data including comment information and a comment star rating.
5. The data processing method of claim 4, further comprising:
obtaining comment data of the E-commerce website of the user to be mined; and
and obtaining the emotion information of the user through the trained hybrid convolutional neural network.
6. A data processing apparatus of text data, comprising:
a text data acquisition unit for acquiring text data of a user;
a first feature extraction unit configured to extract the text data by a first feature extraction method to obtain first feature data;
a second feature extraction unit configured to extract the text data by a second feature extraction method to obtain second feature data; and
a model training unit for training a hybrid convolutional neural network model with the first feature data and the second feature data, the hybrid convolutional neural network model including a hybrid layer for mixing the first feature data and the second feature data,
the first feature extraction method is a feature extraction method for extracting emotion word features of the text data; and
the second feature extraction method is a feature extraction method for extracting a word frequency feature of the text data,
the first feature extraction unit extracting the text data by a first feature extraction method to obtain first feature data includes: performing word vector conversion on the text data to obtain a word vector space containing a word vector for each word in the text data; performing word segmentation on the text data; screening out emotion words matched with the emotion dictionary based on the emotion dictionary; and selecting an emotion word vector corresponding to the emotion word in the word vector space as the first feature data,
the second feature extraction unit extracting the second feature data from the text data by the second feature extraction method includes: performing word segmentation on the text data; performing word frequency statistics on each word in the text data through a language dictionary; removing low-frequency words whose word frequency is lower than a preset word frequency threshold from the text data; numbering the remaining words in the text data in descending order of word frequency to create a word frequency dictionary; screening out, from the text data, word frequency words that match the word frequency dictionary; and converting the word frequency words into their numbers in the word frequency dictionary as the second feature data,
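The word-frequency dictionary construction just described — count, drop low-frequency words, number the rest in descending order of frequency, then encode the text as numbers — can be sketched as follows; the corpus, threshold, and function names are illustrative:

```python
from collections import Counter

def build_word_freq_dictionary(corpus_words, min_freq=2):
    """Count word frequencies, drop words below the frequency threshold,
    and number the remaining words in descending order of frequency
    (1-based ids), as the claimed word frequency dictionary."""
    counts = Counter(corpus_words)
    kept = [w for w, c in counts.most_common() if c >= min_freq]
    return {w: i for i, w in enumerate(kept, start=1)}

def encode(words, freq_dict):
    """Second feature data: words mapped to their dictionary numbers;
    words absent from the dictionary are screened out."""
    return [freq_dict[w] for w in words if w in freq_dict]

corpus = "good good good bad bad ok rare".split()
freq_dict = build_word_freq_dictionary(corpus, min_freq=2)   # {'good': 1, 'bad': 2}
codes = encode("good bad rare good".split(), freq_dict)      # [1, 2, 1]
```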
wherein the mixed layer is located between the embedding layer and the convolution layer of the hybrid convolutional neural network model, the mixed layer being used for: receiving the first feature data and vectorized second feature data obtained by converting the second feature data through the embedding layer, the vector corresponding to the i-th feature data being

x_i ∈ R^(k_i × m)

where k_i denotes the length of the text extracted by the i-th feature and is a variable value, and m is the dimension of the high-dimensional mapping and is a fixed value; and converting the first feature data and the vectorized second feature data into a mixed word vector

X = [x_1; x_2; …; x_n] ∈ R^((k_1 + k_2 + … + k_n) × m)

where n denotes the number of features, or
wherein the mixed layer is located between the pooling layer and the fully-connected layer of the hybrid convolutional neural network model, the mixed layer being used for: splicing the one-dimensional vectors obtained after the pooling operation.
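Both placements of the mixed layer amount to concatenation: before the convolution layer, the per-feature word-vector matrices are stacked along the text-length axis; after pooling, the one-dimensional vectors are spliced end to end before the fully-connected layer. A minimal dependency-free sketch (shapes and names are illustrative, not from the patent):

```python
def mix_before_conv(feature_mats):
    """Pre-convolution mixing: stack the per-feature (k_i x m) word-vector
    matrices along the length axis into one ((k_1 + ... + k_n) x m) matrix."""
    mixed = []
    for mat in feature_mats:
        mixed.extend(mat)
    return mixed

def mix_after_pooling(pooled_vecs):
    """Post-pooling mixing: splice the 1-D pooled vectors end to end
    before the fully-connected layer."""
    return [v for vec in pooled_vecs for v in vec]

m = 4                                          # embedding dimension (fixed)
x1 = [[0.1] * m for _ in range(3)]             # k_1 = 3 emotion-word vectors
x2 = [[0.2] * m for _ in range(5)]             # k_2 = 5 embedded frequency-word vectors
mixed = mix_before_conv([x1, x2])              # (3 + 5) rows, each of dimension 4
pooled = mix_after_pooling([[1.0, 1.0], [0.0, 0.0, 0.0]])
```

In a framework such as Keras, the same operation would typically be a `Concatenate` layer with the appropriate axis; the plain-list version above only illustrates the shapes involved.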
7. The data processing apparatus according to claim 6, wherein the first feature extraction unit, before screening out an emotion word matching the emotion dictionary based on the emotion dictionary, is further configured to:
screening emotion feature seed words from open-source resources according to a preset rule; and
constructing the emotion dictionary based on the emotion feature seed words.
8. The data processing apparatus of claim 6, wherein, in constructing the emotion dictionary based on the emotion feature seed words, the first feature extraction unit is further configured for:
selecting k candidate words in the word vector space whose distance from the emotion feature seed words meets a preset distance;
adding the k candidate words, as updated emotion feature seed words, to the emotion dictionary; and
reducing the value of k, and continuing to add to the emotion dictionary those candidate words whose distance from the updated emotion feature seed words meets the preset distance, so as to construct the emotion dictionary at a preset scale.
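One way to read the dictionary-expansion steps above (add the k nearest candidates within a preset distance of the seeds, make them the new seeds, shrink k, and repeat until the preset scale is reached) is the following sketch; the Euclidean distance metric, the stopping rule, and all names are assumptions, since the claims do not fix them:

```python
import math

def expand_emotion_dictionary(word_vectors, seeds, k, max_dist, target_size):
    """Iteratively grow the emotion dictionary from seed words:
    each round adds the k nearest candidates within max_dist of the
    current seeds, uses the added words as the updated seeds, reduces k,
    and stops once the preset scale (target_size) is reached."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    dictionary = set(seeds)
    while k > 0 and len(dictionary) < target_size:
        candidates = []
        for word, vec in word_vectors.items():
            if word in dictionary:
                continue
            d = min(dist(vec, word_vectors[s]) for s in seeds)
            if d <= max_dist:
                candidates.append((d, word))
        new_words = [w for _, w in sorted(candidates)[:k]]
        if not new_words:
            break
        dictionary.update(new_words)
        seeds = new_words        # the added words become the updated seed words
        k -= 1                   # reduce the value of k each round
    return dictionary

# Toy word-vector space: "glad" and "joyful" sit near the seed, "table" does not.
vectors = {"happy": (0.0, 0.0), "glad": (0.1, 0.0),
           "joyful": (0.2, 0.0), "table": (5.0, 5.0)}
expanded = expand_emotion_dictionary(vectors, ["happy"], k=2,
                                     max_dist=0.5, target_size=4)
```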
9. The data processing apparatus according to any one of claims 6 to 8, wherein the text data is comment data of the user on an e-commerce website, the comment data including comment information and a comment star rating.
10. The data processing apparatus of claim 9,
the text data acquisition unit is used for acquiring comment data of the user to be mined from the e-commerce website; and
the trained hybrid convolutional neural network model obtains the emotion information of the user.
11. An electronic device, comprising:
a processor; and
memory in which computer program instructions are stored, which, when executed by the processor, cause the processor to carry out the data processing method according to any one of claims 1 to 5.
12. A computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to carry out the data processing method of any one of claims 1-5.
CN201810370375.5A 2018-04-24 2018-04-24 Data processing method and application of text data Active CN108509427B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810370375.5A CN108509427B (en) 2018-04-24 2018-04-24 Data processing method and application of text data


Publications (2)

Publication Number Publication Date
CN108509427A CN108509427A (en) 2018-09-07
CN108509427B true CN108509427B (en) 2022-03-11

Family

ID=63383278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810370375.5A Active CN108509427B (en) 2018-04-24 2018-04-24 Data processing method and application of text data

Country Status (1)

Country Link
CN (1) CN108509427B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291178B (en) * 2018-12-06 2024-09-17 北京嘀嘀无限科技发展有限公司 Dialogue classification method and device, electronic equipment and storage medium
CN111291179B (en) * 2018-12-06 2023-12-08 北京嘀嘀无限科技发展有限公司 Dialogue classification method and device, electronic equipment and storage medium
CN109711538B (en) * 2018-12-14 2021-01-15 安徽寒武纪信息科技有限公司 Operation method, device and related product
CN109508461A (en) * 2018-12-29 2019-03-22 重庆猪八戒网络有限公司 Order price prediction technique, terminal and medium based on Chinese natural language processing
CN109960726B (en) * 2019-02-13 2024-01-23 平安科技(深圳)有限公司 Text classification model construction method, device, terminal and storage medium
CN111104789B (en) * 2019-11-22 2023-12-29 华中师范大学 Text scoring method, device and system
CN112528650B (en) * 2020-12-18 2024-04-02 恩亿科(北京)数据科技有限公司 Bert model pre-training method, system and computer equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824922A (en) * 2016-03-16 2016-08-03 重庆邮电大学 Emotion classifying method fusing intrinsic feature and shallow feature
CN106682704A (en) * 2017-01-20 2017-05-17 中国科学院合肥物质科学研究院 Method of disease image identification based on hybrid convolutional neural network fused with context information


Also Published As

Publication number Publication date
CN108509427A (en) 2018-09-07

Similar Documents

Publication Publication Date Title
CN108509427B (en) Data processing method and application of text data
US11157693B2 (en) Stylistic text rewriting for a target author
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN109101537B (en) Multi-turn dialogue data classification method and device based on deep learning and electronic equipment
CN108959482B (en) Single-round dialogue data classification method and device based on deep learning and electronic equipment
Kenyon-Dean et al. Resolving event coreference with supervised representation learning and clustering-oriented regularization
CN111680159B (en) Data processing method and device and electronic equipment
WO2020062770A1 (en) Method and apparatus for constructing domain dictionary, and device and storage medium
CN109670050B (en) Entity relationship prediction method and device
CN111414561B (en) Method and device for presenting information
CN111401084A (en) Method and device for machine translation and computer readable storage medium
CN108536868B (en) Data processing method and device for short text data on social network
CN113836938B (en) Text similarity calculation method and device, storage medium and electronic device
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
WO2014073206A1 (en) Information-processing device and information-processing method
JP7343566B2 (en) Data generation method, computer device, and computer program using language models
CN112015904A (en) Method, system, and computer-readable medium for determining latent topics for a corpus of documents
CN110968697A (en) Text classification method, device and equipment and readable storage medium
CN112349410A (en) Training method, triage method and system for triage model of department triage
CN110750642A (en) CNN-based Chinese relation classification method and system
CN113434683A (en) Text classification method, device, medium and electronic equipment
Hasan et al. Sentiment analysis using out of core learning
US12050867B2 (en) Language model based writing aid method, device and system
WO2021200200A1 (en) Information processing device and information processing method
WO2023116572A1 (en) Word or sentence generation method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: A201, Tonghui building, 1132 Huihe South Street, Gaobeidian, Chaoyang District, Beijing 100124

Applicant after: Beijing Huiwen Technology (Group) Co.,Ltd.

Address before: 100000 Room 203, Baolan financial innovation center, No. 137, Xiwai street, Xicheng District, Beijing

Applicant before: BEIJING HUIWEN TECHNOLOGY DEVELOPMENT CO.,LTD.

GR01 Patent grant