CN111177392A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number: CN111177392A
Application number: CN201911419495.0A
Authority: CN (China)
Prior art keywords: vector, classified, text, characteristic, classification
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventor: 刘志煌
Current Assignee: Tencent Cloud Computing Beijing Co Ltd
Original Assignee: Tencent Cloud Computing Beijing Co Ltd
Application filed by Tencent Cloud Computing Beijing Co Ltd
Priority to: CN201911419495.0A
Publication of: CN111177392A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The application provides a data processing method and a data processing device, relating to the technical field of data processing. The method comprises the following steps: acquiring text data to be classified, dividing the text data into at least one piece of text sub-data to be classified, and determining the feature field corresponding to the attribute field in each piece of text sub-data; vectorizing each piece of text sub-data in a first character order and in a second, opposite character order to obtain a first vector feature and a second vector feature respectively, and combining the two into the vector feature of the text sub-data; and obtaining the classification result of each piece of text sub-data by learning the weight influence of each feature field on the vector-feature analysis result of that sub-data, the classification results of all pieces of text sub-data together constituting the classification result of the text data. The method and the device solve the problem of long-distance dependence between fields in the sub-data to be classified and improve classification accuracy.

Description

Data processing method and device
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a data processing method and device.
Background
In recent years, with the rapid development of the internet and the information industry, the volume of unstructured text data has grown rapidly. Faced with massive text data, efficient information management and data mining have become research hotspots, and information extraction technology is attracting increasing attention.
To better determine the information people care about, phrases, sentences and passages expressed in natural language generally need to be analyzed and computed, exploring the emotions, concepts, preferences and even value orientations implied by the text.
In the prior art, a corpus classification library usually needs to be established and classification labels annotated for all corpora in the library. However, because of the large volume of text category data, classification using a corpus classification library has low accuracy and poor extensibility.
Another approach classifies with sequence classification methods such as conditional random fields or hidden Markov models, but these cannot solve the problem of long-distance dependence between different fields in text data.
Disclosure of Invention
The embodiment of the application provides a data processing method and device with better extensibility and generalization, which at least solve the problem of long-distance dependence between different fields in text data.
In one aspect, an embodiment of the present application provides a data processing method, where the method includes:
acquiring text data to be classified, dividing the text data to be classified into at least one piece of text sub-data to be classified according to the attribute fields in the text data, and determining the feature field corresponding to the attribute field in each piece of text sub-data to be classified;
obtaining a first vector feature according to a vector processing mode in a first character arrangement order for each piece of text sub-data to be classified; obtaining a second vector feature according to a vector processing mode in a second character arrangement order, the first character arrangement order being opposite to the second character arrangement order; and obtaining the vector feature of each piece of text sub-data to be classified according to the first vector feature and the second vector feature;
obtaining the classification result of each piece of text sub-data to be classified by learning the weight influence of each feature field on the vector-feature analysis result of that sub-data, and taking the classification results of all pieces of text sub-data to be classified as the classification result of the text data to be classified, wherein a first vector processing module, a second vector processing module and a classification module form a trained text data classification model obtained by iterative training on training samples.
In one aspect, an embodiment of the present application provides a data processing apparatus, including:
the feature field determining unit is used for acquiring text data to be classified, dividing the text data to be classified into at least one piece of text sub-data to be classified according to the attribute fields in the text data, and determining the feature field corresponding to the attribute field in each piece of text sub-data to be classified;
the vectorization unit is used for obtaining a first vector feature according to a vector processing mode in a first character arrangement order for each piece of text sub-data to be classified, obtaining a second vector feature according to a vector processing mode in a second character arrangement order, the first character arrangement order being opposite to the second character arrangement order, and obtaining the vector feature of each piece of text sub-data to be classified according to the first vector feature and the second vector feature;
the classification result determining unit is used for obtaining the classification result of each piece of text sub-data to be classified by learning the weight influence of each feature field on the vector-feature analysis result of that sub-data, and taking the classification results of all pieces of text sub-data to be classified as the classification result of the text data to be classified, wherein the first vector processing module, the second vector processing module and the classification module form a trained text data classification model obtained by iterative training on training samples.
Optionally, the classification result determining unit is specifically configured to:
determining the classification label of the characteristic field in the text subdata to be classified according to the stored classification label of the characteristic field;
and obtaining the classification result of each piece of text sub-data to be classified by learning the weight influence of each feature field, and of the classification label of the feature field in the text sub-data, on the vector-feature analysis result of that sub-data.
Optionally, the training unit is further configured to:
acquiring an optional training sample, and determining each attribute field and the feature field corresponding to each attribute field in the optional training sample according to the acquired attribute field set and feature field set;
and dividing the optional training sample into a plurality of training samples according to each attribute field in the optional training sample and the feature field corresponding to each attribute field.
Optionally, the training unit is further configured to:
acquiring text data to be labeled, and labeling the text data to be labeled according to the determined attribute fields and the feature fields corresponding to the attribute fields, to obtain a labeling sequence for each piece of text data to be labeled;
determining a text mining rule according to the labeling sequences of the text data to be labeled; determining, according to the text mining rule, each newly added attribute field and its corresponding feature field in each piece of text data to be labeled; adding the newly added attribute fields and corresponding feature fields to the determined attribute fields and feature fields; and updating the attribute field set and the feature field set.
Optionally, the training unit is specifically configured to:
determining sequence rules in the labeling sequences;
taking the sequence rules occurring more than a set number of times in the labeling sequences as frequent sequence rules;
and taking the frequent sequence rules whose confidence meets the set confidence threshold as the text mining rules, wherein the confidence of a frequent sequence rule is determined according to the occurrence frequency of the frequent sequence and the number of attribute fields and category labels of feature fields in the frequent sequence.
Optionally, the feature fields include one or more of emotion feature fields, degree feature fields and negation feature fields, and the classification result is one of positive, negative and neutral.
In one aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any one of the data processing methods when executing the computer program.
In one aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a computer device, which, when executed by the computer device, causes the computer device to execute the steps of any one of the data processing methods described above.
According to the data processing method provided by the embodiment of the application, after the text data to be classified is obtained, the attribute fields in the text data are determined, and the text data is divided into a plurality of pieces of text sub-data to be classified according to the attribute fields, realizing a preliminary fine-grained division of the text.
A first vector feature and a second vector feature are obtained by vectorizing each piece of text sub-data in the first character arrangement order and in the second character arrangement order respectively, and the vector feature of each piece of text sub-data is obtained from the first and second vector features. Through these two vectorization passes in opposite orders, features of the sub-data in both processing directions are captured, thereby solving the problem of long-distance dependence between fields in the sub-data to be classified.
By learning the weight influence of each feature field on the vector-feature analysis result of each piece of text sub-data, the classification result of each piece of text sub-data can be obtained. This realizes fine-grained classification of the text data, improves classification accuracy, and solves the long-distance dependence problem between fields in sub-data that exists in the prior art.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a schematic view of an application scenario of a data processing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an LSTM provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a training process of an LSTM according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a text data classification model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a web page including review content provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a classified evaluation display provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
For convenience of understanding, terms referred to in the embodiments of the present application are explained below:
machine Learning (ML): the method is a multi-field cross discipline and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. It is the core of artificial intelligence and the fundamental way to make computer have intelligence. The core of machine learning is "using an algorithm to parse data, learn from it, and then make a decision or prediction about something in the world". This means that computers are not taught how to develop an algorithm to accomplish a task, as it is explicitly writing a program to perform some task.
Deep Learning (DL, Deep Learning): is a new research direction in the field of machine learning, which is introduced into machine learning to make it closer to the original target, Artificial Intelligence (AI).
Artificial Intelligence (AI, Artificial Intelligence): the method is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence, a field of research that includes robotics, language recognition, image recognition, natural language processing, and expert systems, among others.
Natural Language Processing (NLP): is an important branch of the field of data science and comprises important processes of analyzing, understanding and extracting information from texts in an efficient mode. By utilizing NLPs and their components, large amounts of textual data can be organized, a large number of automated tasks performed, and various problems solved, such as automatic summarization, machine translation, named entity recognition, relationship extraction, emotion analysis, speech recognition, and topic segmentation.
FastText: a text classification model and one of the natural language processing models; it is a fast, supervised text classification algorithm. The input of FastText is a word sequence, and the output is the probability that the word sequence belongs to each category. The words and phrases in the sequence constitute a feature vector, which is mapped to an intermediate layer by a linear transformation, and from the intermediate layer to the classification labels.
LSTM (Long Short-Term Memory): the long short-term memory model, a natural language processing model and a variant of the RNN (Recurrent Neural Network) model. LSTM can learn long-range dependency relationships; it structurally improves the hidden layer of the traditional RNN and is suitable for processing time-series data such as text data.
Bi-LSTM (Bi-directional Long Short-Term Memory): the bidirectional long short-term memory model, formed by combining a forward LSTM and a backward LSTM, which can better capture context information.
In specific practice, the inventor of the present application found that when classifying text data, one optional method is classification through a corpus classification library. For example, if the text data to be classified is "I am good" and the corpus classification library stores the classification label of "good" as "emotion label", or further as "positive emotion label", then the classification label of the text data "I am good" is determined to be "emotion label" or "positive emotion label".
However, the corpus classification library must be customized by relying on dictionaries and language experts, and because of the large amount of text data to be classified, the difficulty of building the corpus classification library grows rapidly.
In the prior art, the classification label of a text is also determined by analyzing the text with a pre-trained classification model; a common classification model is FastText. FastText can determine the classification label of text data, but its structure is too simple: it usually adds up the word vectors of the words contained in the text and uses the resulting vector as the basis for predicting the classification label, so the accuracy of the predicted emotion polarity cannot be guaranteed.
There are also methods in the prior art that obtain classification labels by analyzing the text with classification models such as RNNs, but these methods cannot solve the problem of long-distance dependency between fields in text data. For example, if the text data to be classified is "the weather today is good but somewhat cold", there is a long-distance dependency between the field "cold" and the field "weather".
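For concreteness, the following is a minimal sketch of the FastText-style prediction just described, in which the sentence vector is simply the mean of the word vectors and a single linear layer produces the label probabilities; the vocabulary, dimensions and random weights are illustrative assumptions, not part of this application.

```python
import numpy as np

# Toy FastText-style classifier: average the word vectors of the sequence,
# apply one linear transformation, then softmax over the class labels.
# Vocabulary, dimensions and weights are illustrative assumptions.
rng = np.random.default_rng(0)
vocab = {"room": 0, "large": 1, "air": 2, "good": 3}
embed_dim, n_classes = 8, 3                   # e.g. positive / negative / neutral
E = rng.normal(size=(len(vocab), embed_dim))  # word embedding table
W = rng.normal(size=(embed_dim, n_classes))   # linear layer to the labels
b = np.zeros(n_classes)

def predict(words):
    v = E[[vocab[w] for w in words]].mean(axis=0)  # sentence vector = mean of word vectors
    logits = v @ W + b
    p = np.exp(logits - logits.max())              # numerically stable softmax
    return p / p.sum()

print(predict(["room", "large"]))  # probabilities for the three classes
```

Because the mean discards word order entirely, such a model cannot capture which field modifies which, which is the weakness the following design addresses.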
Based on the defects of the prior art, the inventor of the present application conceived a data processing method. In the embodiment of the present application, after the text data to be classified is obtained, the attribute fields in it are determined, and the text data is divided into a plurality of pieces of text sub-data to be classified according to the attribute fields, realizing a preliminary fine-grained division of the text.
Each piece of text sub-data to be classified is then input into a trained first vector processing module in a first character arrangement order to obtain a first vector feature; at the same time, each piece of text sub-data is input into a trained second vector processing module in a second character arrangement order to obtain a second vector feature, and the vector feature of each piece of text sub-data is obtained from the first and second vector features. Through these two vectorization passes in opposite orders, features of the sub-data in both processing directions are captured, thereby solving the problem of long-distance dependence between fields in the sub-data to be classified.
The classification result of the text data to be classified is obtained from the determined vector features and the trained classification module, so that the trained text data classification model in the embodiment of the application realizes fine-grained classification of text data, improves classification accuracy, and solves the long-distance dependence problem between fields in sub-data that exists in the prior art.
After introducing the design concept of the embodiments of the present application, some brief descriptions of applicable application scenarios are given below. It should be noted that the application scenarios described below are only used to illustrate the embodiments of the present application and do not constitute a limitation. In specific implementation, the technical solution provided by the embodiments of the present application can be applied flexibly according to actual needs.
To further illustrate the technical solutions provided by the embodiments of the present application, a detailed description is given below with reference to the accompanying drawings and specific embodiments. Although the embodiments provide the method operation steps shown in the following embodiments or figures, more or fewer steps may be included in the method on the basis of conventional or non-inventive labor. For steps with no necessary logical causal relationship, the execution order is not limited to that provided by the embodiments of the present application.
Fig. 1 is a schematic view of an application scenario of a data processing method according to an embodiment of the present application. The application scenario at least includes a data processing device 101 and a database 102, and the data processing device 101 and the database 102 may be located in the same local area network or in different networks. The data processing apparatus 101 and the database 102 are connected by a wired network or a wireless network.
In this embodiment of the application, the data processing device 101 obtains text data to be classified and a trained text data classification model from the database 102, the data processing device 101 performs fine-grained division on the text data to be classified, that is, the text data to be classified is divided into a plurality of text subdata to be classified according to each attribute field, and each text subdata to be classified is classified by the trained text data classification model to obtain a classification result.
In this embodiment of the present application, the data processing device 101 may store the classification result in the database 102, or send the classification result to the application terminal 103. Therefore, the application scenario of the data processing method further includes the application terminal 103, which may display or further process the classification result.
Optionally, in this embodiment of the application, the application scenario of the data processing method may further include a training device 104, which can obtain a trained text data classification model from training data and a text classification model to be trained, and store the trained text data classification model in the database 102.
It should be understood that the data processing device 101 and the application terminal 103 in the embodiment of the present application include, but are not limited to, electronic devices such as desktop computers, mobile phones, mobile computers and tablet computers, and may include servers; a server may be a server cluster or a single server. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
Similarly, in this embodiment of the present application, the database 102 may be a cloud database. A cloud database is a storage system that, through functions such as cluster application, grid technology and distributed storage file systems, integrates a large number of storage devices (also called storage nodes) of different types in a network to work cooperatively through application software or application interfaces, and provides data storage and service access functions externally. In the embodiment of the present application, the data processing device 101 and the application terminal 103 may access the cloud database through its access structure.
Of course, the method provided in the embodiment of the present application is not limited to be used in the application scenario shown in fig. 1, and may also be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described in the following method embodiments, and will not be described in detail herein.
The following describes the technical solution provided in the embodiment of the present application with reference to the application scenario shown in fig. 1.
Referring to fig. 2, an embodiment of the present application provides a data processing method, including the following steps:
step S201, obtaining text data to be classified, dividing the text data to be classified into at least one text subdata to be classified according to an attribute field in the text data to be classified, and determining a feature field corresponding to the attribute field in each text subdata to be classified according to the attribute field in the text data to be classified.
Specifically, in the embodiment of the present application, the text data to be classified may come from converted speech data, or from articles, comment sentences and the like.
In the embodiment of the present application, only text data having attribute fields needs to have its feature fields classified, so text data without attribute fields need not be classified. That is, in the present application, one or more attribute fields may exist in the text data to be classified, and the attribute fields represent the attribute meaning of the text data; for example, each entity appearing in the text data to be classified is an attribute field of the text data.
In order to realize the classification of the fine-grained text data, in the embodiment of the present application, a plurality of attribute fields in the text data to be classified are divided into a plurality of text sub-data to be classified, and only one attribute field is included in the text sub-data to be classified. Illustratively, the text data to be classified is "this room is large and the air is good", semantic bodies expressed in the text data to be classified are "room" and "air", so attribute fields in the text data to be classified are "room" and "air", and the text data to be classified is divided into two text subdata to be classified, i.e., "this room is large" and "the air is good", according to different attribute fields.
In the embodiment of the present application, it is further required to determine feature fields in the sub-data of the text to be classified, where the feature fields are used to describe features of attribute fields in the sub-data of the text to be classified, and the features are generally adjectives or adverbs.
In an optional embodiment, negation words in the text sub-data to be classified also affect its classification result, so negation words can also serve as feature fields of the text sub-data. Illustratively, if the text sub-data to be classified is "room not comfortable", the attribute field is "room", and the feature fields are "not" and "comfortable".
In the embodiment of the application, the feature fields in the text data to be classified can be determined through the feature fields in the stored feature field set, the feature fields in the stored feature field set are matched with the text data to be classified, and the matched fields are the feature fields. Illustratively, if the feature fields in the saved feature field set include "good", "comfortable", "beautiful", "big", etc., the feature fields in the text data to be classified can be determined by the saved feature field set.
Further, in this embodiment of the present application, the saved feature field set may include an adjective feature field set, an adverb feature field set, and a negative word feature field set, and the feature fields in the saved feature field set are determined by the adjective feature field set, the adverb feature field set, and the negative word feature field set, respectively.
Optionally, in this embodiment of the present application, the stored feature field set may be determined according to historical text data to be classified, or may be determined by an expert knowledge base or a professional domain thesaurus.
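As an illustration of the division and matching described above, the following is a minimal sketch of step S201 under simplifying assumptions: the attribute-field set and feature-field set are given, and each sub-text is taken as the span running from one attribute field up to the next. The token lists, field sets and splitting heuristic are assumptions for the example; the embodiment does not prescribe an exact division rule.

```python
# Sketch of step S201 under simplifying assumptions: the attribute-field and
# feature-field sets are given, and each sub-text is the span running from one
# attribute field up to the next.
ATTRIBUTE_FIELDS = {"room", "air"}
FEATURE_FIELDS = {"large", "good", "comfortable", "not"}

def split_by_attributes(tokens):
    """Divide a token list into sub-data, one attribute field per sub-text."""
    cuts = [i for i, tok in enumerate(tokens) if tok in ATTRIBUTE_FIELDS]
    return [tokens[s:e] for s, e in zip(cuts, cuts[1:] + [len(tokens)])]

def match_feature_fields(sub_tokens):
    """Feature fields are the tokens found in the saved feature-field set."""
    return [tok for tok in sub_tokens if tok in FEATURE_FIELDS]

tokens = ["this", "room", "is", "large", "and", "the", "air", "is", "good"]
for sub in split_by_attributes(tokens):
    print(sub, "->", match_feature_fields(sub))
# ['room', 'is', 'large'] -> ['large']
# ['air', 'is', 'good'] -> ['good']
```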
Optionally, in this embodiment, the feature fields include one or more of emotion feature fields, degree feature fields and negation feature fields; that is, the task of the data processing in this embodiment is to determine the emotion classification of the text data to be classified. Optionally, the emotion classification result is one of a positive classification result, a negative classification result and a neutral classification result. For example, "good", "great" and "like" characterize a positive emotion classification; "bad", "hard" and "annoying" characterize a negative emotion classification; and "nearly", "can" and "okay" characterize a neutral classification.
Emotion classification is a research field that analyzes the subjective feelings people hold toward entity objects such as products, services, organizations, individuals, events, topics and their attributes, including opinions, emotions, evaluations, views and attitudes. Text sentiment analysis has great practical and research value; for example, identifying sentiment information about specific commodity attributes from commodity evaluation data can provide decision support and reference for merchants, other users, manufacturing enterprises and the like.
Of course, in the embodiment of the present application, the text data to be classified may also be other types of data to be classified, for example, the text data to be classified is classified into a first type of data, a second type of data, and the like, and the first type of data and the second type of data are determined according to the attribute field and the feature field.
Step S202, obtaining a first vector characteristic according to a vector processing mode of a first character arrangement sequence of each text subdata to be classified; and obtaining a second vector characteristic according to a vector processing mode of a second character arrangement sequence of each text subdata to be classified, and obtaining the vector characteristic of each text data to be classified according to the first vector characteristic and the second vector characteristic, wherein the first character arrangement sequence is opposite to the second character arrangement sequence.
Specifically, in the embodiment of the present application, through vector processing manners in different directions, features of the sub-data to be classified in different processing directions can be obtained, that is, the problem of long-distance dependence between fields in the sub-data to be classified is solved.
In this embodiment of the present application, the vector processing may be performed by different vector processing models. Specifically, in an optional embodiment, each piece of text sub-data to be classified is input to the trained first vector processing module in the first character arrangement order to obtain a first vector feature; simultaneously, each piece of text sub-data is input to the trained second vector processing module in the second character arrangement order to obtain a second vector feature, and the vector feature of each piece of text sub-data is obtained from the first and second vector features, the first character arrangement order being opposite to the second.
Specifically, after each text subdata to be classified is determined, vectorization processing is performed on each text subdata to be classified through the trained first vector processing module and the trained second vector processing module.
Specifically, in order to determine the dependency relationship between fields in the text sub-data to be classified, the text sub-data to be classified is input into two vector processing modules through a first character arrangement order and a second character arrangement order, the two vector processing modules are trained vector processing modules, and the first vector processing module and the second vector processing module may have the same or different structures.
Illustratively, for the text sub-data to be classified "the room is large" (in Chinese the three characters 房, 间, 大), the characters are input into the first vector processing module in the first character order 房, 间, 大, and into the second vector processing module in the second character order 大, 间, 房.
In an alternative embodiment, the first vector processing module is an LSTM-based deep learning processing module, the second vector processing module is an LSTM-based deep learning processing module, and the first vector processing module and the second vector processing module have the same structure.
Specifically, LSTM is an advanced recurrent neural network. It solves the vanishing-gradient problem of ordinary recurrent neural networks trained with the backpropagation-through-time algorithm, can process sequence problems more efficiently, and can learn long-term dependencies in sequences.
LSTM introduces the concept of a gate on top of the RNN. A gate is a fully connected layer that controls the amount of information passed: its input is a vector and its output is a real number between 0 and 1. The gates in the LSTM are of three types: the forget gate, the input gate and the output gate. The forget gate controls how much of the state cell at the previous time passes into the state cell at the current time; the input gate controls how much of the input signal at the current time passes into the state cell at the current time; the output gate controls how much of the current state cell's signal passes into the output at the current time.
In an alternative embodiment, the LSTM is trained with a backpropagation algorithm. The structure of the LSTM is shown in Fig. 3, and the specific steps of the training process, shown in Fig. 4, include:
step S401, calculating the output value of each neuron of the network in the forward direction, including ft、it、ct、ot、htFive vectors, the direction of signal propagation of which is shown in figure 3.
The input signal x_t at the current time t and the output signal h_{t-1} at the previous time are combined and passed together through the forget gate, which determines the information to be discarded. The forget gate output is given by Equation 1:
f_t = σ(W_f · x + b_f)    (Equation 1)
where σ is the sigmoid activation function:
σ(z) = 1 / (1 + e^(−z))    (Equation 2)
In the formulas, W_f is the forget gate weight, x = [h_{t-1}, x_t] is the vector formed by combining the output of the previous time and the input of the current time, and b_f is the forget gate bias.
The state cell at the current time is then updated: the input signal passes through the input gate and a tanh layer respectively to obtain the input gate output and the currently input candidate state cell; the result of multiplying the previous state cell element-wise by the forget gate information and the result of multiplying the candidate state cell element-wise by the input gate information are summed to obtain the state cell at the current time, as given by Equations 3 to 5:
i_t = σ(W_i · x + b_i)    (Equation 3)
c̃_t = tanh(W_c · x + b_c)    (Equation 4)
c_t = f_t * c_{t-1} + i_t * c̃_t    (Equation 5)
In the formulas, W_i and W_c are the input gate weight and the weight of the currently input state cell, respectively; x = [h_{t-1}, x_t] is the vector formed by combining the output of the previous time and the input of the current time; b_i and b_c are the input gate bias and the bias of the currently input state cell; and * denotes element-wise multiplication of corresponding matrix elements.
The output of the previous time and the input of the current time are combined, and the output information is determined through the output gate; the information passed by the output gate is multiplied element-wise by the current cell state passed through the tanh function to obtain the output at the current time, as given by Equations 7 and 8:
o_t = σ(W_o · x + b_o)    (Equation 7)
h_t = o_t * tanh(c_t)    (Equation 8)
In Equation 7, W_o is the output gate weight, x = [h_{t-1}, x_t] is the vector formed by combining the output of the previous time and the input of the current time, and b_o is the output gate bias; * denotes element-wise multiplication of corresponding matrix elements.
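For illustration, the following is a minimal NumPy sketch of one forward LSTM step following the equations above; the dimensions and random parameters are assumptions made for the example, not values from this application.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # Equation 2

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One forward LSTM step following the equations above. Each weight
    matrix acts on x = [h_{t-1}, x_t], the concatenated vector used there."""
    x = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ x + b["f"])        # Equation 1: forget gate
    i_t = sigmoid(W["i"] @ x + b["i"])        # Equation 3: input gate
    c_hat = np.tanh(W["c"] @ x + b["c"])      # Equation 4: candidate state cell
    c_t = f_t * c_prev + i_t * c_hat          # Equation 5: element-wise update
    o_t = sigmoid(W["o"] @ x + b["o"])        # Equation 7: output gate
    h_t = o_t * np.tanh(c_t)                  # Equation 8: output
    return h_t, c_t

# Illustrative dimensions and random parameters (assumptions, not patent values):
rng = np.random.default_rng(1)
d_in, d_hid = 4, 5
W = {k: rng.normal(size=(d_hid, d_hid + d_in)) for k in "fico"}
b = {k: np.zeros(d_hid) for k in "fico"}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), W, b)
```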
Step S402, calculating the error term δ of each neuron of the network backward. Error propagation in the LSTM proceeds in two directions: one is propagation along the time direction, i.e., the error is propagated from the current time back to each previous time; the other is propagation along the network structure, i.e., layer by layer from the output layer to the previous layers. Both the error propagated along the time direction and the error propagated along the network structure are determined.
In step S403, the gradient of each weight is updated according to the error term.
Specifically, in the embodiment of the present application, the gradient calculation is divided into two parts: updating the weight parameters and updating the bias parameters. An optional method is gradient descent.
The above embodiment specifically describes the LSTM training process. In training, LSTM structures in two directions are adopted, and each LSTM is trained by the method of the above embodiment. The difference is that, in the embodiment of the present application, after the first vector processing module and the second vector processing module obtain their vector processing results, the results of the two directions need to be combined; the vector input to the classification module is the combined vector processing result, the classification module classifies according to it, and the model parameters of the first and second vector processing modules are adjusted according to the classification result and the true classification.
Optionally, in this embodiment of the application, before each piece of text sub-data to be classified is input into the first vector processing module in the first character arrangement order and into the second vector processing module in the second character arrangement order, word embedding is first performed on the text sub-data. Word embedding is a numerical representation of text data; in general, a field is mapped to a high-dimensional vector that represents the field. For example, the field "machine learning" may be represented as [1,2,3], the field "deep learning" as [2,3,3], and so on.
In an alternative embodiment, word embedding may be performed by the Word2Vec method. Word2Vec is a statistical method that can effectively learn independent word embeddings from a text corpus. Word2Vec is a simple neural network consisting of several layers: an input layer, a hidden layer and an output layer. The input layer takes the numerical vector representation of the text data and passes it to the hidden layer; the hidden layer performs feature processing using the neural network model and outputs to the output layer, which applies a normalized classification (softmax) function to obtain the probability of each prediction result, i.e., the word embedding result for each field.
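As an illustration, the following sketch trains word embeddings with the gensim implementation of Word2Vec (assuming gensim >= 4.0 is available); the toy corpus and vector size are assumptions made for the example.

```python
# Minimal Word2Vec sketch using gensim (assuming gensim >= 4.0 is installed);
# the toy corpus and vector size are assumptions made for the example.
from gensim.models import Word2Vec

corpus = [
    ["the", "room", "is", "large"],
    ["the", "air", "is", "good"],
    ["the", "room", "is", "comfortable"],
]
model = Word2Vec(sentences=corpus, vector_size=16, window=2, min_count=1, epochs=50)
vec = model.wv["room"]   # the embedding learned for the field "room"
print(vec.shape)         # -> (16,)
```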
In the embodiment of the application, after the first vector feature and the second vector feature are obtained, the vector feature of each piece of text sub-data to be classified is obtained from them; that is, two vector processing modules with opposite processing directions yield different vector features, and the vector feature of each piece of text sub-data is determined from these two features.
In an optional embodiment, the process by which the first and second vector processing modules obtain the vector features of the text sub-data to be classified is described below, on the premise that the two modules have the same processing efficiency.
Suppose the text sub-data to be classified has N fields, N ≥ 1. The N fields are input to the trained first vector processing module in the first character arrangement order, and to the trained second vector processing module in the second character arrangement order.
Specifically, at the first time, the 1st field is input to the first processing module of the first vector processing module, and at the same time the Nth field is input to the Nth processing module of the trained second vector processing module. At the second time, the 2nd field is input to the second processing module of the first vector processing module, which determines a processing result from the processing result of the first processing module and the 2nd field and sends it to the third processing module; similarly, the (N-1)th field is input to the (N-1)th processing module of the trained second vector processing module, which determines a processing result from the processing result of the Nth processing module and the (N-1)th field and sends it to the (N-2)th processing module.
When the ith processing module of the first vector processing module has determined the first vector feature and the (N-i+1)th processing module of the second vector processing module has determined the second vector feature, the vector sum of the first vector feature and the second vector feature is taken as the feature vector of the ith field corresponding to the ith processing module.
Exemplarily, take the text sub-data to be classified "the room is large" (the three Chinese characters 房, 间, 大). First, 房 is input into the first processing module of the first vector processing module to obtain a first processing result; the first processing result and 间 are input into the second processing module of the first vector processing module to obtain a second processing result; and the second processing result and 大 are input into the third processing module of the first vector processing module to obtain a third processing result.
Similarly, 大 is first input into the third processing module of the second vector processing module to obtain a first processing result; the first processing result and 间 are input into the second processing module of the second vector processing module to obtain a second processing result; and the second processing result and 房 are input into the first processing module of the second vector processing module to obtain a third processing result.
The feature vector of 房 is determined from the first processing result of the first vector processing module and the third processing result of the second vector processing module; the feature vector of 间 is determined from the second processing result of the first vector processing module and the second processing result of the second vector processing module; and the feature vector of 大 is determined from the third processing result of the first vector processing module and the first processing result of the second vector processing module.
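A compact sketch of this bidirectional combination is given below, reusing the sigmoid() and lstm_step() functions from the earlier LSTM sketch. Running the same sequence forward and backward and summing the per-position hidden vectors is one plausible reading of the combination described above; the random parameters remain illustrative assumptions.

```python
import numpy as np

# Reuses sigmoid() and lstm_step() from the earlier LSTM sketch.
def run_direction(embeddings, W, b, d_hid):
    """Run one LSTM direction over a list of per-field embedding vectors."""
    h, c, outs = np.zeros(d_hid), np.zeros(d_hid), []
    for x_t in embeddings:
        h, c = lstm_step(x_t, h, c, W, b)
        outs.append(h)
    return outs

def bidirectional_features(embeddings, W_fwd, b_fwd, W_bwd, b_bwd, d_hid):
    fwd = run_direction(embeddings, W_fwd, b_fwd, d_hid)              # 房, 间, 大
    bwd = run_direction(embeddings[::-1], W_bwd, b_bwd, d_hid)[::-1]  # 大, 间, 房
    # Each field's feature vector sums the forward and backward hidden
    # vectors produced for that field's position.
    return [f + g for f, g in zip(fwd, bwd)]
```

With lstm_step in scope, bidirectional_features returns one combined vector per field, which is what the classification module consumes next.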
Step S203, obtaining the classification result of each piece of text sub-data to be classified by learning the weight influence of each feature field on the vector-feature analysis result of that sub-data, and taking the classification results of all pieces of text sub-data to be classified as the classification result of the text data to be classified, wherein the first vector processing module, the second vector processing module and the classification module form a trained text data classification model obtained by iterative training on training samples.
Specifically, each feature field has different weight influence on the vector feature vector analysis result of each text subdata to be classified, for example, the weight influence of the feature field 'good' on the attribute field 'air' is large, and the weight influence of the feature field 'good' on the field 'today' is small.
Optionally, in this embodiment of the present application, a deep learning method is used to learn the weight influence of each feature field on the vector feature vector analysis result of each text sub-data to be classified.
In an optional embodiment, the vector features and feature fields of each piece of text sub-data to be classified are used as input values of the trained classification module to obtain the classification result of each piece of text sub-data; the classification results of all pieces of text sub-data are taken as the classification result of the text data to be classified; and the first vector processing module, the second vector processing module and the classification module form the trained text data classification model, obtained by iterative training on training samples.
Optionally, in this embodiment of the application, since the classification label of each feature field also has a weight influence on the vector-feature analysis result of each piece of text sub-data to be classified, the training samples may include attribute fields, the feature fields corresponding to the attribute fields, and the classification label of each feature field.
Specifically, in the embodiment of the present application, the vector features and the feature fields of each piece of text sub-data to be classified are input into the trained classification module to obtain the classification result of each piece of text sub-data. The classification module determines the classification result mainly from the vector features of the text sub-data, with the feature fields assisting the determination.
It can be understood that, in the embodiment of the application, the feature fields affect the classification result of the text sub-data to be classified; by feeding the feature fields into the classification module, their weight in the classification process is increased and classification accuracy is improved.
In the embodiment of the present application, the first vector processing module 501, the second vector processing module 502 and the classification module 503 form the trained text data classification model 500. For example, as shown in Fig. 5, the first vector processing module 501 in the trained text data classification model 500 is a processing module of LSTM structure, and likewise the second vector processing module 502 is also a processing module of LSTM structure.
The text sub-data to be classified, X_1, X_2, X_3, ..., X_n, input into the trained text data classification model 500 is first word-embedded by the word vector tool Word2Vec 504, then input to the first vector processing module 501 in the first character arrangement order and to the second vector processing module 502 in the second character arrangement order, obtaining the respective processing results; the vector feature of each field is determined from the first and second vector processing results.
After the vector features of each field are obtained, the feature fields in the text sub-data to be classified and the vector features of each field are input into the classification module 503, so as to obtain the classification result of the text sub-data to be classified.
In an alternative embodiment, as shown in Fig. 5, the vector features and the feature fields may also be concatenated by the fully connected layer 505, and the concatenation result sent to the classification module 503.
Further, in this embodiment of the application, if the classification label of a feature field can be obtained, the classification label is also fed into the classification module so that the classification result can be obtained more accurately. An optional method is to determine the classification label of the feature field in the text sub-data according to the stored classification labels of feature fields. For example, if the text sub-data to be classified is "the room is large" and the feature field is "large", then the classification label of "large", i.e. the positive classification label, is fed into the classification module 503.
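This embodiment does not fix the internal architecture of classification module 503; the sketch below illustrates one plausible realization of "learning the weight influence of each field", using an attention-style weighting over the per-field vectors followed by a softmax classifier. All dimensions and weights are placeholder assumptions.

```python
import numpy as np

# One plausible realization of the classification module (an assumption; the
# embodiment does not fix the architecture): attention-style weights let
# feature fields such as "large" influence the pooled representation more
# strongly, followed by a fully connected layer and softmax over the classes.
rng = np.random.default_rng(2)
d_feat, n_classes = 10, 3
w_att = rng.normal(size=d_feat)               # learned attention vector
W_cls = rng.normal(size=(d_feat, n_classes))  # fully connected layer
b_cls = np.zeros(n_classes)

def classify(field_vectors):
    H = np.stack(field_vectors)        # (N, d_feat): one vector per field
    scores = H @ w_att                 # weight influence of each field
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()               # normalized field weights
    pooled = alpha @ H                 # weighted sum of the field vectors
    logits = pooled @ W_cls + b_cls
    p = np.exp(logits - logits.max())
    return p / p.sum()                 # P(positive), P(negative), P(neutral)

print(classify([rng.normal(size=d_feat) for _ in range(4)]))
```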
In another optional embodiment, in addition to inputting the text sub-data to be classified into the trained first vector processing module in the first character arrangement order to obtain the first vector feature and into the trained second vector processing module in the second character arrangement order to obtain the second vector feature, the attribute field itself is word-embedded and then input into the first and second vector processing modules to obtain first and second vector features for the attribute field.
Optionally, in this embodiment of the application, in the word embedding process, the processing result of each sub-data of the text to be classified and the processing result of the vector of the attribute field are combined and input to the first vector processing module and the second vector processing module.
Optionally, in this embodiment of the present application, the text data classification model is obtained by iterative training according to a training sample, where the training sample at least includes an attribute field, a feature field corresponding to the attribute field, and a classification label of each feature field.
Specifically, a training sample is obtained for each training pass; the training sample comprises attribute fields and the feature fields corresponding to the attribute fields, and each feature field carries a classification label. The training sample is input to the first vector processing module in the model to be trained in the first character arrangement order to obtain a first vector feature for the training sample; meanwhile, the training sample is input to the second vector processing module in the model to be trained in the second character arrangement order to obtain a second vector feature, and the vector feature of the training sample is obtained from the first and second vector features.
The vector features of the training sample and the classification labels of the feature fields in the training sample are used as input values of the classification module in the model to be trained; a loss function for the training pass is determined from the classification result of the classification module and the classification labels of the feature fields in the training sample, and the parameters of the model to be trained are adjusted according to the loss function. After repeated iterative training, the text classification model is obtained once the loss function of the model to be trained is determined to meet the convergence condition.
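The following condensed PyTorch sketch mirrors this training loop under stated assumptions: an embedding layer feeding a bidirectional LSTM and a linear classifier, trained with cross-entropy. The layer sizes, optimizer and the random placeholder batches are assumptions for illustration; the embodiment does not prescribe them.

```python
import torch
import torch.nn as nn

# Condensed PyTorch sketch of the training procedure described above. The
# architecture (embedding -> bidirectional LSTM -> linear classifier), layer
# sizes, cross-entropy loss and placeholder batches are all assumptions.
class TextClassifier(nn.Module):
    def __init__(self, vocab_size=1000, d_emb=32, d_hid=64, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)
        # bidirectional=True covers the two opposite character orders at once
        self.lstm = nn.LSTM(d_emb, d_hid, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * d_hid, n_classes)

    def forward(self, token_ids):
        out, _ = self.lstm(self.emb(token_ids))
        return self.cls(out.mean(dim=1))  # pool per-field vectors, then classify

model = TextClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):  # iterate until the loss meets the convergence condition
    tokens = torch.randint(0, 1000, (8, 12))  # placeholder training batch
    labels = torch.randint(0, 3, (8,))        # placeholder classification labels
    loss = loss_fn(model(tokens), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```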
In an optional embodiment, the training samples at least include attribute fields and feature fields corresponding to the attribute fields, and they may be determined from the attribute field set, the feature field set and the optional training samples. The optional training samples are text data usable as training data; these data may or may not include attribute fields.
In the embodiment of the application, a set number of text data are obtained as optional training samples, and the optional training samples are divided into a plurality of training samples according to the attribute field set and the feature field set.
In an optional embodiment, the attribute field set and the feature field set are determined from text data to be labeled, which may be a part of the optional training samples or may be other text data.
Specifically, text data to be labeled is obtained and labeled according to the determined attribute fields and the feature fields corresponding to the attribute fields, so as to obtain a labeling sequence of the text data to be labeled; the determined attribute fields and their corresponding feature fields may be obtained from dictionaries and word banks.
Determining a text mining rule according to the labeling sequence of each text data to be labeled, determining each newly added attribute field and the characteristic field corresponding to each attribute field in each text data to be labeled according to the text mining rule, adding each newly added attribute field and the characteristic field corresponding to each attribute field into each determined attribute field and the characteristic field corresponding to each attribute field, and updating an attribute field set and a characteristic field set.
That is to say, in the embodiment of the present application, iterative labeling is performed through a part of attribute fields and a part of feature fields, then fields in an attribute field set and fields in a feature field set are updated, and then training samples are determined according to the fields in the updated attribute field set and the fields in the feature field set.
Specifically, in the embodiment of the present application, in the process of determining the text mining rules, frequently occurring sequence rules in the text are determined, and the frequent sequences satisfying a set confidence are taken as the text mining rules.
First, the process of determining frequent sequence rules in the text is explained: the text data to be labeled is labeled according to the determined attribute fields and the feature fields corresponding to the attribute fields, so as to obtain the labeling sequence of the text data to be labeled.
Illustratively, the text data to be labeled is "the room of this hotel is very large". It is first labeled according to the part of speech of each field, giving the labeling result: this/r hotel/n 's/u room/n very/d large/a, where r denotes pronouns, n nouns, u auxiliary words, d adverbs, and a adjectives.
The part-of-speech-tagged text data is then labeled a second time according to the existing attribute fields and feature fields. Optionally, the fields include attribute fields, and the feature fields include emotion words, degree adverbs and negation words. The labeling process traverses the text data and marks these 4 kinds of fields with different labels: for example, attribute words are labeled "#", emotion words are labeled "+", degree adverbs are labeled "&", and negation words are labeled "!".
Specifically, in the first labeling pass, the text data "the room is very comfortable, the service is very good, and the price is not cheap" is segmented and part-of-speech tagged as: room/n (noun), very/d (adverb), comfortable/a (adjective), |, service/n (noun), very/d (adverb), good/a (adjective), |, price/n (noun), not/d (adverb), cheap/a (adjective).
Then the second labeling pass marks the attribute-field and feature-field words, with clauses separated by "|": "room", "service" and "price" are attribute fields, labeled "#"; "comfortable", "good" and "cheap" are emotion words, labeled "+"; "very" is a degree adverb, labeled "&"; "not" is a negation word, labeled "!". The resulting labeling sequence is: "#/n, &/d, +/a, |, #/n, &/d, +/a, |, #/n, !/d, +/a".
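As an illustration of this two-pass labeling, the following minimal Python sketch maps (word, part-of-speech) pairs to labeling-sequence elements; the dictionaries, function name and label symbols are assumptions based on the example above, not the patent's implementation, and the first pass (segmentation and POS tagging) is presumed done:

```python
ATTRIBUTE_WORDS = {"room", "service", "price"}      # labeled "#"
EMOTION_WORDS = {"comfortable", "good", "cheap"}    # labeled "+"
DEGREE_ADVERBS = {"very"}                           # labeled "&"
NEGATION_WORDS = {"not"}                            # labeled "!"

def label_sequence(tagged: list[tuple[str, str]]) -> list[str]:
    """Map each (word, pos) pair to its labeling-sequence element."""
    out = []
    for word, pos in tagged:
        if word in ATTRIBUTE_WORDS:
            out.append(f"#/{pos}")
        elif word in EMOTION_WORDS:
            out.append(f"+/{pos}")
        elif word in DEGREE_ADVERBS:
            out.append(f"&/{pos}")
        elif word in NEGATION_WORDS:
            out.append(f"!/{pos}")
        else:
            out.append(f"/{pos}")   # unlabeled: part of speech only
    return out

# "room very comfortable | service very good | price not cheap"
tagged = [("room", "n"), ("very", "d"), ("comfortable", "a"),
          ("service", "n"), ("very", "d"), ("good", "a"),
          ("price", "n"), ("not", "d"), ("cheap", "a")]
print(label_sequence(tagged))
# ['#/n', '&/d', '+/a', '#/n', '&/d', '+/a', '#/n', '!/d', '+/a']
```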
In the embodiment of the present application, according to the result of the labeling sequence, the sequence rules whose occurrence times in the labeling sequence are greater than the set number can be determined, and these sequence rules are used as frequent sequence rules.
The determination of frequent sequences is explained through a specific embodiment. The acquired text data is: "the room of this hotel is very large, and the cost performance is very high". The text data is divided into two clauses, clause 1: "the room of this hotel is very large", and clause 2: "the cost performance is very high". Through word segmentation and part-of-speech tagging, labeling sequence 1 is: this/r hotel/n 's/u room/n very/d big/a; labeling sequence 2 is: cost performance/n very/d high/a.
In the embodiment of the present application, the PrefixSpan algorithm may be used to mine frequent sequence rules. Specifically, all parts of speech and their numbers of occurrences are first counted, with the result shown in table 1:
TABLE 1
Part of speech   /r   /n   /u   /d   /a
Occurrences       1    3    1    2    2
If the minimum support is set to 0.5 and there are 2 samples, the occurrence threshold is 2 × 0.5 = 1; that is, sequence rules occurring more than 1 time are taken as frequent sequence rules.
As can be seen from table 1, the part-of-speech elements satisfying at least 2 occurrences in table 1 are shown in table 2:
TABLE 2
Part of speech   /n   /d   /a
Occurrences       3    2    2
Then the frequent prefixes containing only one part-of-speech element, together with their corresponding suffixes, that satisfy the minimum support threshold are shown in table 3:
TABLE 3
[Table 3, rendered as an image in the original: the length-1 frequent prefixes /n, /d and /a together with their projected suffixes.]
Next, the elements in the suffixes that also satisfy the minimum support threshold are found and appended to the prefix pattern, with the results shown in table 4:
TABLE 4
[Table 4, rendered as an image in the original: the length-2 frequent prefixes, e.g. /n /d, /n /a and /d /a, with their projected suffixes.]
The same approach yields a frequent prefix that includes three part-of-speech elements, as shown in table 5:
TABLE 5
[Table 5, rendered as an image in the original: the length-3 frequent prefix /n /d /a.]
The algorithm iterates until the corresponding suffix contains no element meeting the support threshold; the longest frequent prefix obtained (here, the frequent prefix containing three part-of-speech elements) is the mined frequent part-of-speech sequence pattern, namely: n/d/a.
That is, through the above process, a frequent sequence rule, i.e., the frequent part-of-speech sequence pattern obtained above, can be determined.
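The mining step can be illustrated with a compact PrefixSpan sketch; the data and the occurrence threshold (count ≥ 2 for the two clauses) follow the example above, while the function and variable names are assumptions for illustration only:

```python
def prefixspan(sequences, min_count, prefix=None):
    """Return {tuple(pattern): count} for all patterns meeting min_count."""
    prefix = prefix or []
    results = {}
    counts = {}
    # Sequence-level support: count each item once per sequence it occurs in.
    for seq in sequences:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, count in counts.items():
        if count < min_count:
            continue
        pattern = prefix + [item]
        results[tuple(pattern)] = count
        # Project: keep the suffix after the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in sequences if item in seq]
        results.update(prefixspan(projected, min_count, pattern))
    return results

clauses = [["/r", "/n", "/u", "/n", "/d", "/a"],  # "the room of this hotel is very large"
           ["/n", "/d", "/a"]]                    # "the cost performance is very high"
patterns = prefixspan(clauses, min_count=2)
print(max(patterns, key=len))  # ('/n', '/d', '/a') -- the mined pattern n/d/a
```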
Further, in the embodiment of the present application, after the frequent sequence rule is determined, it is also necessary to determine whether the confidence of the frequent sequence rule meets the set confidence, and in the embodiment of the present application, the set confidence may be set according to different classification tasks.
Specifically, for the frequent part-of-speech sequence pattern obtained by support mining in the previous step, the set confidence is 0.1. In the embodiment of the application, the text data involves four category labels, namely attribute word, emotion word, negation word and degree adverb, so the probability of each category label is 1/4 = 0.25; as long as the confidence is greater than 0.1, the set confidence is satisfied and the frequent sequence rule is taken as a text mining rule.
Illustratively, in the above, the n/d/a of the first clause is a text mining rule, while the n/d/a of the second clause is not, because it contains no category label or its category proportion does not meet the confidence requirement.
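The patent does not give an exact confidence formula, so the following sketch encodes only one plausible reading of the check above: the confidence of a matched pattern instance is taken as the proportion of its elements carrying a category label, compared against the set confidence of 0.1. All names are hypothetical.

```python
SET_CONFIDENCE = 0.1

def rule_confidence(matched_elements: list[str]) -> float:
    """matched_elements: labeling-sequence elements such as '#/n', '&/d', '/a'."""
    labeled = sum(1 for e in matched_elements if e[0] in "#+&!")
    return labeled / len(matched_elements) if matched_elements else 0.0

print(rule_confidence(["#/n", "&/d", "+/a"]) > SET_CONFIDENCE)  # True  -> mining rule
print(rule_confidence(["/n", "/d", "/a"]) > SET_CONFIDENCE)     # False -> rejected
```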
Specifically, in the embodiment of the present application, each time the text mining rules are determined, the newly mined attribute fields and feature fields in the text data can be added to the already determined attribute fields and feature fields, and the fields in the attribute field set and the feature field set are updated through multiple iterations.
The optional training text data can then be divided into a plurality of training samples using the fields in the updated attribute field set and feature field set.
Illustratively, for the new text data "the location of the hotel is very near, the air is particularly good, and the room is quite comfortable", word segmentation and part-of-speech tagging are performed, and labeling is carried out according to the existing training labels. Assuming the existing attribute field is "room" and the existing feature fields are "very" and "good", the resulting labeling sequence is "/r,/n,/u,/n, &/d,/a, |,/n,/d,/a, |, #/n,/d,/a".
By examining the labeling sequence with the set confidence of 0.1 — that is, as long as a sequence carries one or more kinds of category information, the rule requirement can be met — it is determined that "/n, &/d,/a", "/n,/d,/a" and "#/n, &/d,/a" are all text mining rules. The newly added attribute fields are thus "location" and "air", the newly added degree feature fields are "particularly" and "quite", and the newly added emotion feature fields are "near" and "comfortable". The updated attribute field set is: location, air, room; the updated degree feature field set is: very, particularly, quite; and the updated emotion feature field set is: near, good, comfortable.
To better explain the embodiment of the present application, the data processing method provided by the embodiment is described below in connection with a specific implementation scenario. In this scenario, to analyze users' sentiment toward a certain commodity, it is necessary to obtain the users' evaluation content for the commodity and perform emotion classification on that content, the classes being positive, negative and neutral.
In the embodiment of the present application, the text data to be classified is the user evaluation content displayed on a web page. As shown in fig. 6, the evaluation content of user A is "the commodity is very cost-effective", and the evaluation content of user B is "the commodity is fairly good, but the cost performance is not very high". Through the determined attribute field set, degree feature field set, emotion feature field set and negation feature field set, the field information in the evaluation content is identified: "commodity" is the attribute field; "very" and "fairly" are degree feature fields; "cost-effective" and "high" are emotion feature fields; and "not" is the negation feature field.
Further, the classification labels corresponding to the emotion feature fields are obtained: "cost-effective" carries a positive (commendatory) label, which may be assigned the value 1; "cost performance" carries a neutral label, which may be assigned 0; and "high" carries a positive label, which may be assigned 1.
For each user's evaluation content, word embedding is performed first; in the word embedding process, the embedding result of "commodity" is combined with the embedding results of the feature fields. The combined vector data is input to the trained first vector processing module in the first character arrangement order to obtain the first vector feature, and is simultaneously input to the trained second vector processing module in the second character arrangement order to obtain the second vector feature; the vector features of the two users' comment contents are then determined from the first vector feature and the second vector feature. Here the first vector processing module is a first LSTM model and the second vector processing module is a second LSTM model.
The vector features, the feature fields in the comments and the labels of those feature fields are then input into the classification model, here a softmax classification model; the classification result obtained through it is a positive classification for the comment content of user A and a neutral classification for the comment of user B.
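A hedged inference sketch of this step, reusing the hypothetical fwd_lstm, bwd_lstm and classifier modules from the training sketch earlier (for brevity, the feature-field classification labels described above are not concatenated into the input here):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify(sample_vecs: torch.Tensor) -> str:
    """sample_vecs: (seq_len, 1, EMBED_DIM) combined comment/attribute embeddings."""
    _, (h_fwd, _) = fwd_lstm(sample_vecs)                   # first character order
    _, (h_bwd, _) = bwd_lstm(torch.flip(sample_vecs, [0]))  # second (reversed) order
    feature = torch.cat([h_fwd[-1], h_bwd[-1]], dim=1)      # combined vector feature
    probs = F.softmax(classifier(feature), dim=1)           # softmax classification
    return ["negative", "neutral", "positive"][int(probs.argmax())]
```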
In the embodiment of the present application, there is also a comment from user C, as shown in fig. 6: "the price of the commodity is very low, and the freight is cheap". Two attribute fields, "commodity" and "freight", can be identified in this comment; in the embodiment of the present application, the content under each attribute is classified as its own text data to be classified. Likewise, through the results of the data processing, it can be determined that user C's comment on "commodity" is a positive comment and the comment on "freight" is a positive comment, thereby achieving fine-grained classification of the evaluation content.
In the embodiment of the present application, after the evaluation contents of multiple users have been classified, users can conveniently view them at a fine granularity through comment tags. For example, as shown in fig. 7, the comment tags may be the comment classification results, i.e., positive, negative and neutral comments (e.g., good reviews and poor reviews), and may further include the attribute fields appearing in the comment contents, such as "commodity", "freight" and "service", to facilitate viewing. After clicking a tag, the user can see the evaluation content corresponding to that tag.
Based on the above embodiments, referring to fig. 8, an embodiment of the invention provides a data processing apparatus 800, including:
a feature field determining unit 801, configured to acquire text data to be classified, divide the text data to be classified into at least one text subdata to be classified according to an attribute field in the text data to be classified, and determine a feature field corresponding to the attribute field in each text subdata to be classified according to the attribute field in the text data to be classified;
the vectorization unit 802 is configured to obtain a first vector feature according to a vector processing manner of a first character arrangement order of each text sub-data to be classified; obtaining a second vector characteristic according to a vector processing mode of a second character arrangement sequence of each text subdata to be classified, and obtaining the vector characteristic of each text data to be classified according to the first vector characteristic and the second vector characteristic, wherein the first character arrangement sequence is opposite to the second character arrangement sequence;
the classification result determining unit 803 is configured to obtain a classification result of each text subdata to be classified by learning the weight influence of each feature field on the analysis result of the vector feature of each text subdata to be classified, and to take the classification results of all the text subdata to be classified as the classification result of the text data to be classified, wherein the first vector processing module, the second vector processing module and the classification module form a trained text data classification model, and the text data classification model is obtained by iterative training according to training samples.
Optionally, the vectorization unit 802 is further configured to:
performing word vector transformation on the text data to be classified to obtain a first word vector;
respectively carrying out word vector transformation on text attribute characteristics in the text data to be classified to obtain a second word vector;
determining word vectors of the text data to be classified according to the first word vectors and the second word vectors;
the vectorization unit 802 is specifically configured to:
and obtaining a first vector characteristic according to a vector processing mode of a first character arrangement sequence of each word vector of the text data to be classified, and obtaining a second vector characteristic according to a vector processing mode of a second character arrangement sequence of each word vector of the text data to be classified.
Optionally, the characteristic field determining unit 801 is specifically configured to:
determining fields matched with all the characteristic fields in the stored characteristic field set in the text subdata to be classified, and taking the matched fields as the characteristic fields corresponding to the attribute fields in the text subdata to be classified.
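A minimal sketch of this matching step, with an assumed stored feature field set (the names and contents are illustrative only):

```python
FEATURE_FIELDS = {"cost-effective", "high", "cheap", "very", "not"}  # stored set

def match_feature_fields(sub_text: str) -> list[str]:
    """Return the stored feature fields that appear in the sub-text to classify."""
    return [f for f in FEATURE_FIELDS if f in sub_text]

print(match_feature_fields("the cost performance is not very high"))
# e.g. ['high', 'very', 'not'] (set iteration order may vary)
```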
Optionally, the classification result determining unit 803 is specifically configured to:
determining the classification label of the characteristic field in the text subdata to be classified according to the stored classification label of the characteristic field;
and taking the vector characteristics of each text subdata to be classified, the characteristic fields and the classification labels of the characteristic fields as input values of a trained classification module to obtain a classification result of each text subdata to be classified.
Optionally, the apparatus further includes a training unit 804, where the training unit 804 is specifically configured to:
acquiring a training sample aiming at each training, wherein the training sample comprises an attribute field and a characteristic field corresponding to the attribute field, and each characteristic field is provided with a classification label;
inputting the training samples to a first vector processing module in a model to be trained according to a first character arrangement sequence to obtain first vector features aiming at the training samples; meanwhile, the training samples are input to a second vector processing module in the model to be trained according to a second character arrangement sequence to obtain second vector features aiming at the training samples, and the vector features of the training samples are obtained according to the first vector features and the second vector features;
taking the vector features of the training samples and the classification labels of the feature fields in the training samples as input values of the classification module in the model to be trained, determining a loss function of the training process according to the classification result of the classification module in the model to be trained and the classification labels of the feature fields in the training samples, and adjusting parameters in the model to be trained according to the loss function;
and after repeated iterative training, obtaining a text classification model when determining that the loss function of the model to be trained meets the convergence condition.
Optionally, the training unit 804 is further configured to:
acquiring an optional training sample, and determining each attribute field and a characteristic field corresponding to each attribute field in the optional training sample according to the acquired attribute field set and characteristic field set;
and dividing the optional training samples into a plurality of training samples according to each attribute field in the optional training samples and the characteristic field corresponding to each attribute field.
Optionally, the training unit 804 is further configured to:
acquiring text data to be labeled, labeling the text data to be labeled according to the determined attribute fields and the characteristic fields corresponding to the attribute fields to obtain a labeling sequence of the text data to be labeled;
determining a text mining rule according to the labeling sequence of each text data to be labeled, determining each newly added attribute field and the characteristic field corresponding to each attribute field in each text data to be labeled according to the text mining rule, adding each newly added attribute field and the characteristic field corresponding to each attribute field into each determined attribute field and the characteristic field corresponding to each attribute field, and updating the attribute field set and the characteristic field set.
Optionally, the training unit 804 is specifically configured to:
determining a sequence rule in each labeling sequence according to each labeling sequence;
taking the sequence rule with the occurrence frequency more than the set number in each labeling sequence as a frequent sequence rule;
and taking the frequent sequence rules whose confidence meets the set confidence as the text mining rules, wherein the confidence of a frequent sequence rule is determined according to the occurrence frequency of the frequent sequence and the number of attribute fields and category labels of characteristic fields in the frequent sequence.
Optionally, the feature fields include one or more of an emotional feature field, a degree feature field and a negative feature field, and the classification result is at least one of a positive, a negative and a neutral classification result.
Based on the same technical concept, the embodiment of the present application provides a computer device, as shown in fig. 9, including at least one processor 901 and a memory 902 connected to the at least one processor, where a specific connection medium between the processor 901 and the memory 902 is not limited in this embodiment of the present application, and the processor 901 and the memory 902 are connected through a bus in fig. 9 as an example. The bus may be divided into an address bus, a data bus, a control bus, etc.
In the embodiment of the present application, the memory 902 stores instructions executable by the at least one processor 901, and the at least one processor 901 may execute the steps included in the foregoing data processing method by executing the instructions stored in the memory 902.
The processor 901 is the control center of the computer device, and may connect the various parts of the device by using various interfaces and lines, and performs the data processing by executing the instructions stored in the memory 902 and calling the data stored in the memory 902. Optionally, the processor 901 may include one or more processing units, and the processor 901 may integrate an application processor, which mainly handles the operating system, user interface, application programs and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 901. In some embodiments, the processor 901 and the memory 902 may be implemented on the same chip, or in some embodiments they may be implemented separately on their own chips.
The processor 901 may be a general-purpose processor, such as a Central Processing Unit (CPU), a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
The memory 902, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 902 may include at least one type of storage medium, for example a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and so on. The memory 902 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these. The memory 902 of the embodiments of the present application may also be circuitry or any other device capable of performing a storage function, for storing program instructions and/or data.
Based on the same technical concept, embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a computer device, the program causing the computer device to perform the steps of the data processing method when the program runs on the computer device.
The computer-readable storage medium may be any available medium or data storage device that can be accessed by a computer, including but not limited to magnetic memories (e.g., floppy disks, hard disks, magnetic tapes, magneto-optical disks (MOs), etc.), optical memories (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memories (e.g., ROM, EPROM, EEPROM, NAND flash, Solid State Disks (SSDs)), etc.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (15)

1. A method of data processing, the method comprising:
the method comprises the steps of obtaining text data to be classified, dividing the text data to be classified into at least one text subdata to be classified according to attribute fields in the text data to be classified, and determining characteristic fields corresponding to the attribute fields in each text subdata to be classified according to the attribute fields in the text data to be classified;
obtaining a first vector characteristic according to a vector processing mode of a first character arrangement sequence of each text subdata to be classified; obtaining a second vector characteristic according to a vector processing mode of a second character arrangement sequence of each text subdata to be classified, and obtaining the vector characteristic of each text data to be classified according to the first vector characteristic and the second vector characteristic, wherein the first character arrangement sequence is opposite to the second character arrangement sequence;
the method comprises the steps of obtaining a classification result of each text subdata to be classified by learning the weight influence of each characteristic field on a vector characteristic vector analysis result of each text subdata to be classified, taking the classification results of all the text subdata to be classified as the classification results of the text data to be classified, enabling a first vector processing module, a second vector processing module and the classification modules to form a trained text data classification model, and enabling the text data classification model to be obtained through iterative training according to a training sample.
2. The method of claim 1, wherein the first vector feature is obtained by vector processing according to a first character arrangement order of each text sub-data to be classified; obtaining a second vector characteristic according to a vector processing mode of a second character arrangement sequence of each text subdata to be classified, wherein the method comprises the following steps:
performing word vector transformation on the text data to be classified to obtain a first word vector;
respectively carrying out word vector transformation on text attribute characteristics in the text data to be classified to obtain a second word vector;
determining word vectors of the text data to be classified according to the first word vectors and the second word vectors;
obtaining a first vector characteristic according to a vector processing mode of a first character arrangement sequence of each text subdata to be classified; obtaining a second vector characteristic according to a vector processing mode of a second character arrangement sequence of each text subdata to be classified, wherein the vector processing mode comprises the following steps:
and obtaining a first vector characteristic according to a vector processing mode of a first character arrangement sequence of each word vector of the text data to be classified, and obtaining a second vector characteristic according to a vector processing mode of a second character arrangement sequence of each word vector of the text data to be classified.
3. The method of claim 1, wherein the determining the characteristic field corresponding to the attribute field in each sub-text data to be classified comprises:
determining fields matched with all the characteristic fields in the stored characteristic field set in the text subdata to be classified, and taking the matched fields as the characteristic fields corresponding to the attribute fields in the text subdata to be classified.
4. The method of claim 1, wherein before obtaining the classification result of each text subdata to be classified by learning the weight influence of each feature field on the analysis result of the vector feature of each text subdata to be classified, the method further comprises:
determining the classification label of the characteristic field in the text subdata to be classified according to the stored classification label of the characteristic field;
the obtaining of the classification result of each text subdata to be classified by learning the weight influence of each characteristic field on the vector characteristic vector analysis result of each text subdata to be classified includes:
and obtaining the classification result of each text subdata to be classified by learning the weight influence of each characteristic field and the classification label of the characteristic field in the text subdata to be classified on the vector characteristic vector analysis result of each text subdata to be classified.
5. The method of claim 1, wherein the trained text data classification model is iteratively trained from training samples, comprising:
acquiring a training sample aiming at each training, wherein the training sample comprises an attribute field and a characteristic field corresponding to the attribute field, and each characteristic field is provided with a classification label;
inputting the training samples to a first vector processing module in a model to be trained according to a first character arrangement sequence to obtain first vector features aiming at the training samples; meanwhile, the training samples are input to a second vector processing module in the model to be trained according to a second character arrangement sequence to obtain second vector features aiming at the training samples, and the vector features of the training samples are obtained according to the first vector features and the second vector features;
taking the vector features of the training samples and the classification labels of the feature fields in the training samples as input values of the classification module in the model to be trained, determining a loss function of the training process according to the classification result of the classification module in the model to be trained and the classification labels of the feature fields in the training samples, and adjusting parameters in the model to be trained according to the loss function;
and after repeated iterative training, obtaining a text classification model when determining that the loss function of the model to be trained meets the convergence condition.
6. The method of claim 5, wherein prior to obtaining the training samples, further comprising:
acquiring an optional training sample, and determining each attribute field and a characteristic field corresponding to each attribute field in the optional training sample according to the acquired attribute field set and characteristic field set;
and dividing the optional training samples into a plurality of training samples according to each attribute field in the optional training samples and the characteristic field corresponding to each attribute field.
7. The method of claim 6, wherein before obtaining the optional training samples, further comprising:
acquiring text data to be labeled, labeling the text data to be labeled according to the determined attribute fields and the characteristic fields corresponding to the attribute fields to obtain a labeling sequence of the text data to be labeled;
determining a text mining rule according to the labeling sequence of each text data to be labeled, determining each newly added attribute field and the characteristic field corresponding to each attribute field in each text data to be labeled according to the text mining rule, adding each newly added attribute field and the characteristic field corresponding to each attribute field into each determined attribute field and the characteristic field corresponding to each attribute field, and updating the attribute field set and the characteristic field set.
8. The method of claim 7, wherein determining the text mining rule according to the labeling sequence of each text data to be labeled comprises:
determining a sequence rule in each labeling sequence according to each labeling sequence;
taking the sequence rule with the occurrence frequency more than the set number in each labeling sequence as a frequent sequence rule;
and taking the frequent sequence rules whose confidence meets the set confidence as the text mining rules, wherein the confidence of a frequent sequence rule is determined according to the occurrence frequency of the frequent sequence and the number of attribute fields and category labels of characteristic fields in the frequent sequence.
9. A method according to any one of claims 1 to 8 wherein the feature fields include at least one or more of an emotional feature field, a degree feature field and a negative feature field, and wherein the classification result is at least one of a positive, a negative and a neutral classification result.
10. A data processing apparatus, comprising:
the characteristic field determining unit is used for acquiring text data to be classified, dividing the text data to be classified into at least one text subdata to be classified according to the attribute field in the text data to be classified, and determining a characteristic field corresponding to the attribute field in each text subdata to be classified according to the attribute field in the text data to be classified;
the vectorization unit is used for obtaining a first vector characteristic according to a vector processing mode of a first character arrangement sequence of each text subdata to be classified; obtaining a second vector characteristic according to a vector processing mode of a second character arrangement sequence of each text subdata to be classified, and obtaining the vector characteristic of each text data to be classified according to the first vector characteristic and the second vector characteristic, wherein the first character arrangement sequence is opposite to the second character arrangement sequence;
the classification result determining unit is used for obtaining a classification result of each text subdata to be classified by learning the weight influence of each characteristic field on the analysis result of the vector feature of each text subdata to be classified, and taking the classification results of all the text subdata to be classified as the classification result of the text data to be classified, and the first vector processing module, the second vector processing module and the classification module form a trained text data classification model which is obtained by iterative training according to training samples.
11. The apparatus of claim 10, wherein the vectorization unit is further configured to:
performing word vector transformation on the text data to be classified to obtain a first word vector;
respectively carrying out word vector transformation on text attribute characteristics in the text data to be classified to obtain a second word vector;
determining word vectors of the text data to be classified according to the first word vectors and the second word vectors;
the vectorization unit is specifically configured to:
and obtaining a first vector characteristic according to a vector processing mode of a first character arrangement sequence of each word vector of the text data to be classified, and obtaining a second vector characteristic according to a vector processing mode of a second character arrangement sequence of each word vector of the text data to be classified.
12. The apparatus according to claim 10, wherein the characteristic field determining unit is specifically configured to:
determining fields matched with all the characteristic fields in the stored characteristic field set in the text subdata to be classified, and taking the matched fields as the characteristic fields corresponding to the attribute fields in the text subdata to be classified.
13. The apparatus according to claim 10, further comprising a training unit, the training unit being specifically configured to:
acquiring a training sample aiming at each training, wherein the training sample comprises an attribute field and a characteristic field corresponding to the attribute field, and each characteristic field is provided with a classification label;
inputting the training samples to a first vector processing module in a model to be trained according to a first character arrangement sequence to obtain first vector features aiming at the training samples; meanwhile, the training samples are input to a second vector processing module in the model to be trained according to a second character arrangement sequence to obtain second vector features aiming at the training samples, and the vector features of the training samples are obtained according to the first vector features and the second vector features;
taking the vector features of the training samples and the classification labels of the feature fields in the training samples as input values of the classification module in the model to be trained, determining a loss function of the training process according to the classification result of the classification module in the model to be trained and the classification labels of the feature fields in the training samples, and adjusting parameters in the model to be trained according to the loss function;
and after repeated iterative training, obtaining a text classification model when determining that the loss function of the model to be trained meets the convergence condition.
14. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any one of claims 1 to 9 are performed by the processor when the program is executed.
15. A computer-readable storage medium, in which a computer program is stored which is executable by a computer device, and which, when run on the computer device, causes the computer device to carry out the steps of the method as claimed in any one of claims 1 to 9.
CN201911419495.0A 2019-12-31 2019-12-31 Data processing method and device Pending CN111177392A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911419495.0A CN111177392A (en) 2019-12-31 2019-12-31 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911419495.0A CN111177392A (en) 2019-12-31 2019-12-31 Data processing method and device

Publications (1)

Publication Number Publication Date
CN111177392A true CN111177392A (en) 2020-05-19

Family

ID=70657665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911419495.0A Pending CN111177392A (en) 2019-12-31 2019-12-31 Data processing method and device

Country Status (1)

Country Link
CN (1) CN111177392A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102194013A (en) * 2011-06-23 2011-09-21 上海毕佳数据有限公司 Domain-knowledge-based short text classification method and text classification system
CN105095223A (en) * 2014-04-25 2015-11-25 阿里巴巴集团控股有限公司 Method for classifying texts and server
EP3451192A1 (en) * 2016-05-25 2019-03-06 Huawei Technologies Co., Ltd. Text classification method and apparatus
CN106503266A (en) * 2016-11-30 2017-03-15 政和科技股份有限公司 Document Classification Method and device
CN106897439A (en) * 2017-02-28 2017-06-27 百度在线网络技术(北京)有限公司 The emotion identification method of text, device, server and storage medium
CN107491531A (en) * 2017-08-18 2017-12-19 华南师范大学 Chinese network comment sensibility classification method based on integrated study framework
CN108647205A (en) * 2018-05-02 2018-10-12 深圳前海微众银行股份有限公司 Fine granularity sentiment analysis model building method, equipment and readable storage medium storing program for executing
CN109948148A (en) * 2019-02-28 2019-06-28 北京学之途网络科技有限公司 A kind of text information emotion determination method and decision maker
CN110309304A (en) * 2019-06-04 2019-10-08 平安科技(深圳)有限公司 A kind of file classification method, device, equipment and storage medium
CN110399487A (en) * 2019-07-01 2019-11-01 广州多益网络股份有限公司 A kind of file classification method, device, electronic equipment and storage medium
CN110580288A (en) * 2019-08-23 2019-12-17 腾讯科技(深圳)有限公司 text classification method and device based on artificial intelligence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen Guoqing, Huo Jiazhen: Information Systems Research in China: Opportunities and Challenges in the Context of Emerging Technologies, Shanghai: Tongji University Press, pages 255-256 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632971A (en) * 2020-12-18 2021-04-09 上海明略人工智能(集团)有限公司 Word vector training method and system for entity matching
CN112632971B (en) * 2020-12-18 2023-08-25 上海明略人工智能(集团)有限公司 Word vector training method and system for entity matching
CN112989761A (en) * 2021-05-20 2021-06-18 腾讯科技(深圳)有限公司 Text classification method and device
CN113536806A (en) * 2021-07-18 2021-10-22 北京奇艺世纪科技有限公司 Text classification method and device
CN113536806B (en) * 2021-07-18 2023-09-08 北京奇艺世纪科技有限公司 Text classification method and device
CN114579752A (en) * 2022-05-09 2022-06-03 中国人民解放军国防科技大学 Long text classification method and device based on feature importance and computer equipment
CN114579752B (en) * 2022-05-09 2023-05-26 中国人民解放军国防科技大学 Feature importance-based long text classification method and device and computer equipment
CN117540306A (en) * 2024-01-09 2024-02-09 腾讯科技(深圳)有限公司 Label classification method, device, equipment and medium for multimedia data
CN117540306B (en) * 2024-01-09 2024-04-09 腾讯科技(深圳)有限公司 Label classification method, device, equipment and medium for multimedia data

Similar Documents

Publication Publication Date Title
Pota et al. Multilingual POS tagging by a composite deep architecture based on character-level features and on-the-fly enriched word embeddings
Nguyen et al. Recurrent neural network-based models for recognizing requisite and effectuation parts in legal texts
CN111177392A (en) Data processing method and device
Zhou et al. Joint extraction of multiple relations and entities by using a hybrid neural network
Feng et al. Attention based hierarchical LSTM network for context-aware microblog sentiment classification
Liu et al. Co-attention networks based on aspect and context for aspect-level sentiment analysis
Nguyen et al. Variants of long short-term memory for sentiment analysis on Vietnamese students’ feedback corpus
Liu et al. A novel aspect-based sentiment analysis network model based on multilingual hierarchy in online social network
Chen et al. Fine-grained sentiment analysis of Chinese reviews using LSTM network.
Kumar et al. ATE-SPD: simultaneous extraction of aspect-term and aspect sentiment polarity using Bi-LSTM-CRF neural network
Mankolli et al. Machine learning and natural language processing: Review of models and optimization problems
Wei et al. GP-GCN: Global features of orthogonal projection and local dependency fused graph convolutional networks for aspect-level sentiment classification
Kumar et al. Deep learning-based frameworks for aspect-based sentiment analysis
Sun et al. A new LSTM network model combining TextCNN
Nguyen et al. Nested named entity recognition using multilayer recurrent neural networks
Lin et al. Multi-channel word embeddings for sentiment analysis
Ouyang et al. Chinese named entity recognition based on B-LSTM neural network with additional features
Suresh Kumar et al. Local search five‐element cycle optimized reLU‐BiLSTM for multilingual aspect‐based text classification
Wang et al. Sentiment analysis of commodity reviews based on ALBERT-LSTM
Fu et al. CGSPN: cascading gated self-attention and phrase-attention network for sentence modeling
Jimmy et al. Bilstm-crf Manipuri ner with character-level word representation
Meng et al. Unsupervised word embedding learning by incorporating local and global contexts
Long et al. A method of Chinese named entity recognition based on CNN-BiLSTM-CRF model
Yang et al. BERT-BiLSTM-CRF for Chinese sensitive vocabulary recognition
Hong et al. Combining gated recurrent unit and attention pooling for sentimental classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination