CN111447574A

CN111447574A - Short message classification method, device, system and storage medium

Info

Publication number: CN111447574A
Application number: CN201811612165.9A
Authority: CN
Inventors: 王浩
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Liaoning Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Liaoning Co Ltd
Priority date: 2018-12-27
Filing date: 2018-12-27
Publication date: 2020-07-24
Anticipated expiration: 2038-12-27
Also published as: CN111447574B

Abstract

The invention discloses a short message classification method, device, system and storage medium. The method comprises: performing word segmentation on a training text of an industry short message, and determining a word segmentation result of the training text using the frequency of the words obtained by the segmentation and the frequency of the phrases formed among those words; constructing a word segmentation array and determining a feature vector of the word segmentation result; training a short message classification model on the feature vector of the word segmentation result, and using a cost function with a weight decay term during training to obtain the model parameters of the trained short message classification model; and constructing a word segmentation vector weight matrix from the model parameters and determining, based on that matrix, the short message classification to which a short message text to be classified belongs, wherein the matrix elements of the word segmentation vector weight matrix represent the weight values of the feature vectors of the word segmentation results. The method provided by the embodiments of the invention can improve the accuracy of word segmentation and the accuracy and performance of model classification.

Description

Short message classification method, device, system and storage medium

Technical Field

The invention relates to the technical field of computers, in particular to a short message classification method, a short message classification device, a short message classification system and a storage medium.

Background

To address spam short messages sent through industry ports, government departments have issued management regulations that explicitly require short message service providers and short message content providers not to send commercial short messages to users without the users' consent or request. The regulations also specify that ports used by these providers for sending service management and service type short messages must not be used for sending commercial short messages. Related laws further state explicitly that a medium publishing an advertisement must examine the qualification of the advertisement publisher and the legitimacy of the advertisement content. To better regulate industry ports, the content transmitted through them needs to be managed and controlled.

Generally, the content of short messages sent through an industry port can be analyzed by intelligent semantic analysis and classified. Specifically, training sample information and test sample information of short message content (classified industry short messages) can be managed. The training samples are used by a background application system, together with a corresponding machine learning algorithm, to train a classification model; the test samples are used to evaluate the trained classification model. The background application system then classifies industry short message content using the classification model.

However, before the classification model can classify an industry short message, the message must first be segmented into words. Because the word bank is limited and current word segmentation algorithms, such as the forward maximum matching method, have shortcomings, the segmentation can be inaccurate, which in turn affects the accuracy of the classification model.

Disclosure of Invention

The embodiments of the invention provide a short message classification method, device, system and storage medium, which segment industry short messages with the aid of word frequency to improve segmentation accuracy, thereby improving the accuracy of the trained classification model.

According to an aspect of the embodiments of the present invention, a method for classifying short messages is provided, including:

performing word segmentation processing on a training text of an industry short message, and determining a word segmentation result of the training text by using the frequency of words obtained by the word segmentation processing and the frequency of phrases formed among the words;

constructing a word segmentation array according to the word segmentation result of the training text, matching words contained in the word segmentation result of the training text with words in the word segmentation array, and determining a feature vector of the word segmentation result according to the matching result;

training a short message classification model on the feature vector of the word segmentation result, and using a cost function with a weight decay term during training to obtain the model parameters of the trained short message classification model;

and constructing a word segmentation vector weight matrix by using the model parameters, determining short message classification to which the short message text to be classified belongs based on the word segmentation vector weight matrix, wherein matrix elements in the word segmentation vector weight matrix are used for expressing the weight values of the feature vectors of word segmentation results.

According to another aspect of the embodiments of the present invention, there is provided a short message classification apparatus, including:

the short message text word segmentation module is used for performing word segmentation processing on a training text of an industry short message, and determining a word segmentation result of the training text by using the frequency of words obtained by the word segmentation processing and the frequency of phrases formed among the words;

the feature vector determining module is used for constructing a word segmentation array according to the word segmentation result of the training text, matching words contained in the word segmentation result of the training text with words in the word segmentation array, and determining the feature vector of the word segmentation result according to the matching result;

the classification model training module is used for training the short message classification model on the feature vector of the word segmentation result, and for obtaining the model parameters of the trained short message classification model by using a cost function with a weight decay term during training;

and the text classification determination module is used for constructing a word segmentation vector weight matrix by using the model parameters, determining the short message classification to which the short message text to be classified belongs based on the word segmentation vector weight matrix, and matrix elements in the word segmentation vector weight matrix are used for representing the weight values of the feature vectors of the word segmentation results.

According to another aspect of the embodiments of the present invention, there is provided a short message classification system, including: a memory and a processor; the memory is used for storing programs; the processor is used for reading the executable program codes stored in the memory so as to execute the short message classification method.

According to another aspect of the embodiments of the present invention, there is provided a computer-readable storage medium, in which instructions are stored, and when the instructions are executed on a computer, the instructions cause the computer to execute the short message classification method according to the above aspects.

According to the short message classification method, device, system and storage medium of the embodiments of the invention, word frequency is introduced: the word segmentation result of the training text is determined using the frequency of the words obtained by word segmentation and the frequency of the phrases formed among those words, which improves segmentation accuracy. In addition, a cost function with a weight decay term is used when training the classification model, which improves parameter optimization and thereby the accuracy and performance of model classification.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a flowchart illustrating a short message classification method according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating a structure of a prefix tree according to one embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a short message classification device according to an embodiment of the present invention;

FIG. 4 is a block diagram illustrating an exemplary hardware architecture of a computing device in which methods and apparatus according to embodiments of the invention may be implemented.

Detailed Description

Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

In the embodiments of the invention, a background application system can classify industry short message content using a short message classification model. First, interference filtering can be applied to the short message as preprocessing to improve classification accuracy. Next, a word segmentation algorithm segments the message according to a word segmentation word bank to form a content word sequence. Finally, the content word sequence is fed into the classification model, which outputs a percentage likelihood for each content category; the category with the highest percentage, provided that it exceeds a preset threshold, is taken as the content category of the short message.
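The final selection rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `pick_category`, the category names and the threshold value 0.5 are all assumptions made for the example.

```python
def pick_category(probabilities, threshold=0.5):
    """Return the most likely category if it exceeds the threshold, else None.

    `probabilities` maps a category name to the percentage likelihood that the
    classification model assigned to it (expressed here as a fraction of 1).
    """
    category, probability = max(probabilities.items(), key=lambda item: item[1])
    return category if probability > threshold else None


# Hypothetical model outputs for two short messages.
confident = pick_category({"marketing": 0.7, "service_notice": 0.3})
uncertain = pick_category({"marketing": 0.4, "service_notice": 0.35, "other": 0.25})
```

In this sketch a message whose best category does not exceed the threshold is left unclassified (`None`), so that it can be handed to an administrator for manual review.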

The system supports deduplicating the classification results of industry short messages by similar content, which makes it convenient for an administrator to check whether an analysis result is correct. If a recognition result is wrong, the administrator can correct it, and the corrected result can be stored as a sample for subsequent retraining and verification of the model.

As noted above, when a prior-art classification model is used to classify industry short messages, word segmentation must be performed first, and the segmentation can be inaccurate because of the limited word bank and the forward maximum matching method used by current word segmentation algorithms.

In addition, the core classification algorithm in the prior art is the softmax regression algorithm, an extension of logistic regression. This algorithm requires high-quality sample data: the closer the samples are to the real distribution, the better the trained model. Samples with fuzzy boundaries, such as industry short messages, are difficult to judge, so the accuracy of the prior-art core classification algorithm is limited, which affects the precision of the trained short message classification model.

To solve the above problems, the embodiments of the invention provide a short message classification method, device, system and storage medium. Word frequency is used when segmenting industry short messages to improve segmentation accuracy, which is crucial for industry short message classification; in addition, a cost function with a weight decay term is used during model training to improve the accuracy and performance of model classification.

For better understanding of the present invention, the method for classifying short messages according to the embodiments of the present invention will be described in detail below with reference to the accompanying drawings, and it should be noted that these embodiments are not intended to limit the scope of the present invention.

Fig. 1 is a flowchart illustrating a short message classification method according to an embodiment of the present invention. As shown in fig. 1, the short message classification method 100 in the embodiment of the present invention includes the following steps:

Step S110, performing word segmentation processing on the training text of the industry short message, and determining a word segmentation result of the training text by using the frequency of the words obtained by the word segmentation processing and the frequency of the phrases formed among the words.

Step S120, constructing a word segmentation array according to the word segmentation result of the training text, matching the words contained in the word segmentation result of the training text with the words in the word segmentation array, and determining the feature vector of the word segmentation result according to the matching result.

Step S130, training the short message classification model on the feature vector of the word segmentation result, and using a cost function with a weight decay term during training to obtain the model parameters of the trained short message classification model.

Step S140, constructing a word segmentation vector weight matrix from the model parameters, and determining, based on the word segmentation vector weight matrix, the short message classification to which the short message text to be classified belongs, wherein the matrix elements of the word segmentation vector weight matrix represent the weight values of the feature vectors of the word segmentation results.

According to the short message classification method provided by the embodiments of the invention, word frequency is introduced: the word segmentation result of the training text is determined using the frequency of the words obtained by word segmentation and the frequency of the phrases formed among those words, which improves segmentation accuracy. In addition, parameter optimization is improved when training the classification model, which improves the accuracy and performance of model classification.

In an embodiment, the step S110 may specifically include:

Step S111, determining the weight values of the words in a dictionary, matching the training text against the dictionary to obtain matching results under different matching modes, and re-determining the weight values of the words in the dictionary according to the matching results.

In an embodiment, step S111 may specifically include:

Step S111-01, acquiring the weight values of the words in the dictionary, a weight value being set for each word in the dictionary when the dictionary is used for the first time.

In this step, a dynamic weight may be assigned to each word after the dictionary is read in. During word segmentation for text classification, whenever a word in the dictionary is matched successfully, the weight of that word is increased.

Step S111-02, following the specified order of the training text, taking the current character of the training text as a starting character and each character after it in turn as an ending character, and determining in sequence whether the character string running from the starting character to each ending character matches a word in the dictionary.

Step S111-03, if a character string matches a word in the dictionary, increasing the weight value of the matched word; then taking the next character after the current character as the new starting character, following the specified order of the training text, until the next character is empty, thereby obtaining the matching results of the training text against the dictionary under different matching modes.

Step S112, determining the weight value of the phrase in the phrase relation table, determining the phrase existing in the matching result according to the phrase relation table, and re-determining the weight value of the phrase in the phrase relation table according to the phrase in the matching result.

As an example, the text is scanned from left to right. Each current character is taken as the starting character and each following character in turn as an ending character, and the resulting strings are matched against the dictionary in a loop; the loop exits when the next character is empty. The matching results of the whole process are recorded and saved as a prefix tree, and the weight of each successfully matched word is increased.

Fig. 2 shows a structural diagram of a prefix tree according to an embodiment of the present invention. As shown in fig. 2, in one embodiment, when the input training text is "people in china", the training text is matched with the dictionary to obtain the prefix tree.

As can be seen from fig. 2, the branches in the prefix tree represent matching results in different matching manners. For example, the branches present in the prefix tree include: "is/china/people", "is/chinese/people" and "is/middle/country/people". And w represents a weight value of a word in the matching result.

As can be seen from fig. 2, the words that are successfully matched may include: "China", "people", "Chinese", and "national people", so the weight value of the successfully matched word in the dictionary can be increased.
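The matching procedure of steps S111-01 to S111-03 can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the increment of 1 per successful match, and the toy dictionary for the string "中国人民" are chosen for the example and are not taken from FIG. 2. The branches of the prefix tree are represented simply as lists of words.

```python
def match_all(text, dictionary):
    """Try every (start, end) substring of `text` against the dictionary.

    `dictionary` maps word -> weight and is updated in place: each successful
    match raises the word's weight by 1 (assumed increment).
    Returns a list of (start, end, word) triples.
    """
    matches = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            if candidate in dictionary:
                dictionary[candidate] += 1
                matches.append((start, end, candidate))
    return matches


def enumerate_segmentations(text, matches):
    """List every way to cover `text` completely with matched words."""
    by_start = {}
    for start, end, word in matches:
        by_start.setdefault(start, []).append((end, word))

    def walk(position):
        if position == len(text):
            return [[]]
        paths = []
        for end, word in by_start.get(position, []):
            for rest in walk(end):
                paths.append([word] + rest)
        return paths

    return walk(0)


# Toy dictionary (assumption); every word starts with weight 1.
dictionary = {w: 1 for w in ["中", "国", "人", "民", "中国", "国人", "人民", "中国人"]}
matches = match_all("中国人民", dictionary)
paths = enumerate_segmentations("中国人民", matches)
```

Each list in `paths` corresponds to one branch, that is, one matching mode; for longer texts an actual prefix tree or dynamic programming would avoid enumerating every branch explicitly.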

In an embodiment, step S112 may specifically include:

Step S112-01, acquiring the weight values of the phrases in the phrase relationship table, a weight value being set for each phrase when the phrase relationship table is used for the first time.

Step S112-02, determining, according to the phrase relationship table, whether a phrase relationship exists between the words of the matching results under the different matching modes.

Step S112-03, if a phrase relationship exists between the words of the matching results under the different matching modes, acquiring the phrases in the matching results, and increasing the weight values of the phrases in the phrase relationship table that are the same as the phrases in the matching results.

As an example, referring to FIG. 2, the weights of the words on each path of the prefix tree are counted separately, and the phrase relationship table is used to determine whether a phrase relationship exists between the words on each path. If a phrase relationship exists, the weight of the phrase is counted; if not, the phrase weight is set to 0.

Step S113, calculating the weight value of the matching result under each matching mode based on the re-determined weight values of the words and the re-determined weight values of the phrases in that matching result.

In this step, each path represents the matching result under one matching mode, and its weight value may be calculated from the weight values of the words and phrases on that path.

In one embodiment, the weight value of the matching result may be calculated by the following expression (1):

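$$
\text{Weight} = \ln\Bigl(\text{smoothing parameter} \times \frac{\text{total frequency of word A}}{\text{frequency of all words}} + (1 - \text{smoothing parameter}) \times \Bigl((1 - \text{smoothing factor}) \times \frac{\text{phrase frequency}}{\text{total frequency of word A}} + \text{smoothing factor}\Bigr)\Bigr) \tag{1}
$$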

Here, Weight represents the weight value of the matching result under the corresponding matching mode, word A represents one word in the matching result, the total frequency of word A represents the frequency of word A in the matching result, and the frequency of all words represents the sum of the frequencies of all words in the matching result.

In the above expression (1), the smoothing parameter and the smoothing factor are two parameter constants in the weight value expression for calculating the matching result. As an example, the smoothing parameter may take the value of 0.1, and the smoothing factor may take the value of: 1/frequency of all words + 0.00001.

Step S114, selecting the matching result under the matching mode with the largest calculated weight value as the word segmentation result of the training text.

In this step, with the weights of the words and the weights of the phrases in the matching results under the different matching modes as references, the matching result with the largest calculated weight value is selected as the final word segmentation result of the training text.
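Steps S113 and S114 can be illustrated with the short sketch below. It reads expression (1) as an interpolation in which the smoothing parameter multiplies the word-frequency term, and it sums the per-word values over a path to score that path; both readings are assumptions, as are the function names, the representation of a phrase as an adjacent word pair, and the toy frequencies.

```python
import math


def pair_weight(freq_a, phrase_freq, total_freq, smoothing=0.1):
    """Expression (1) for one word A and the phrase it forms with the next word."""
    factor = 1.0 / total_freq + 0.00001  # smoothing factor value given in the text
    return math.log(
        smoothing * freq_a / total_freq
        + (1 - smoothing) * ((1 - factor) * phrase_freq / freq_a + factor)
    )


def path_weight(path, word_freq, phrase_freq):
    """Assumed aggregation: sum expression (1) over the words of one path."""
    total = sum(word_freq.values())
    weight = 0.0
    for i, word in enumerate(path):
        following = path[i + 1] if i + 1 < len(path) else None
        # A pair that is absent from the phrase relationship table has weight 0.
        pair_freq = phrase_freq.get((word, following), 0)
        weight += pair_weight(word_freq[word], pair_freq, total)
    return weight


def best_segmentation(paths, word_freq, phrase_freq):
    """Step S114: keep the path with the largest weight value."""
    return max(paths, key=lambda p: path_weight(p, word_freq, phrase_freq))


# Toy frequencies (assumptions, not data from the patent).
word_freq = {"中": 5, "国": 5, "人": 5, "民": 5, "中国": 20, "人民": 20}
phrase_freq = {("中国", "人民"): 15}
paths = [["中", "国", "人", "民"], ["中国", "人民"]]
best = best_segmentation(paths, word_freq, phrase_freq)
```

With these toy numbers the two-word path wins, because its words are more frequent and its words form a phrase recorded in the table.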

In an embodiment, the short message classification method 100 may further include:

for each word in the word segmentation result of the training text, increasing the weight value of the identical word in the dictionary; and for each phrase in the word segmentation result of the training text, increasing the weight value of the identical phrase in the phrase relationship table.

As an example, with continued reference to FIG. 2, assuming that the phrase "people in china" exists in the phrase relationship table and the final word segmentation result is "is/china/people", the weight values of the corresponding words in the dictionary are increased, and the weight value of the phrase in the phrase relationship table is increased as well.
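The feedback described above can be sketched as a small update routine. The function name `reinforce`, the increment of 1, and the modelling of a phrase as a pair of adjacent words are assumptions made for illustration; the weights shown are invented.

```python
def reinforce(segmentation, word_weights, phrase_weights):
    """Raise the weight of every word and adjacent word pair in the final result."""
    for word in segmentation:
        if word in word_weights:
            word_weights[word] += 1
    for pair in zip(segmentation, segmentation[1:]):
        if pair in phrase_weights:
            phrase_weights[pair] += 1


# Hypothetical dictionary and phrase relationship table.
word_weights = {"is": 3, "china": 7, "people": 6, "chinese": 2}
phrase_weights = {("china", "people"): 4}
reinforce(["is", "china", "people"], word_weights, phrase_weights)
```

Words that did not appear in the chosen segmentation, such as "chinese" here, keep their previous weight.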

Through steps S111 to S114, word frequency is introduced: the occurrence frequencies of the segmented words are counted and the weight value of each candidate matching result is calculated to determine the word segmentation result of the short message text, which improves the accuracy of the word segmentation result.

In the embodiment of the present invention, through the above steps S111 to S114, the word segmentation processing is performed on each training text of the short message text, so as to obtain word segmentation results of all training texts.

In an embodiment, step S120 may specifically include:

Step S121, collecting the word segmentation results of all training texts, and performing deduplication on them to obtain the deduplicated word segmentation result.

Step S122, determining the dimension of the feature vector according to the number of words in the deduplicated word segmentation result, constructing a word segmentation array, and storing the words of the deduplicated word segmentation result in the word segmentation array, wherein the dimension of the word segmentation array is equal to the dimension of the feature vector.

In the above steps, all words obtained by segmenting all training texts are deduplicated, and the number of distinct words is used as the feature dimension of the short message text; each deduplicated word is then read and stored in the word segmentation array in reading order for subsequent feature vector conversion.

Step S123, for the word segmentation result of each training text, matching its words against the words in the word segmentation array, and determining the values of the feature vector of the training text according to the positions of the matched words in the word segmentation array, thereby obtaining the feature vector of the word segmentation result.

Through steps S121 to S123, the number of distinct words of the training texts is first counted as the feature dimension. As an example, if the training texts of industry short messages yield N words after word segmentation and deduplication, the feature vector dimension of all short message texts is N. For example, when N is equal to 10, all dimensions of the feature vector default to an initial value of 0, i.e., [0, 0, 0, 0, 0, 0, 0, 0, 0, 0].

Then, after the feature vector dimension statistics is completed, the feature vector conversion can be performed.

The words segmented from each training text are matched against the word segmentation array, and after a successful match the vector value at the word's position in the array is set to 1. For example, suppose the words stored in the word segmentation array are [Chinese] [people] [reality] [emotion] [flexibility] [time] [pipelining] [all] [probably], and the training text [Chinese people] is segmented into [Chinese] [people]. The positions of these two words in the trained word segmentation array are 1 and 2, so the feature vector of the text is [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]. Similarly, the text [time, such as pipelining] is segmented into [time] [pipelining], whose positions in the trained word segmentation array are 6 and 7, so the feature vector of that text is [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]. With this feature transformation, every text can be described by a feature vector of the same dimension regardless of its length.
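Steps S121 to S123 can be sketched as follows. The function names and the three toy training texts are assumptions; words that are not in the word segmentation array are simply ignored in this sketch, which is one possible choice rather than a behavior stated in the text.

```python
def build_word_array(segmented_texts):
    """Deduplicate all segmented words, keeping the order in which they are read."""
    seen = set()
    word_array = []
    for words in segmented_texts:
        for word in words:
            if word not in seen:
                seen.add(word)
                word_array.append(word)
    return word_array


def to_feature_vector(words, word_array):
    """Set position i to 1 when the i-th word of the array occurs in the text."""
    index = {word: i for i, word in enumerate(word_array)}
    vector = [0] * len(word_array)
    for word in words:
        position = index.get(word)
        if position is not None:
            vector[position] = 1
    return vector


# Toy segmentation results of three training texts.
segmented = [["Chinese", "people"], ["time", "pipelining"], ["Chinese", "time"]]
word_array = build_word_array(segmented)
vector = to_feature_vector(["time", "pipelining"], word_array)
```

Because the vector length depends only on the word segmentation array, short and long texts are mapped to vectors of the same dimension.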

In actual use, the number of training texts is large and the number of distinct words after segmentation is larger still, so the dimension of the text feature vector is usually on the order of tens of thousands. Most dimensions of a short message's vector have the value 0; to improve computational efficiency in the operations of the following embodiments, only the dimensions whose value is 1 need to be computed.
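The efficiency remark above can be illustrated with a short sketch: since the feature vector contains only 0 and 1, the inner product with a row of weights reduces to summing the weights at the positions whose value is 1. The function names and the weight values are assumptions for illustration.

```python
def active_positions(feature_vector):
    """Indices of the dimensions whose value is 1."""
    return [i for i, value in enumerate(feature_vector) if value == 1]


def class_score(weight_row, positions):
    """Inner product of a 0/1 feature vector with one row of weights."""
    return sum(weight_row[i] for i in positions)


feature_vector = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
weight_row = [0.2, -0.1, 0.0, 0.4, 0.3, 0.5, 0.25, -0.6, 0.1, 0.0]  # invented weights
positions = active_positions(feature_vector)
score = class_score(weight_row, positions)
```

The result equals the full dot product, but the cost grows with the number of words in the message rather than with the size of the vocabulary.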

In an embodiment, step S130 may specifically include:

Step S131, constructing a short message classification model using a softmax regression model, the short message classification model being used to determine, for the word segmentation result, a probability value for each preset text type.

The softmax regression model used in the embodiments of the invention is a generalization of the logistic regression model to multi-class problems. In one embodiment, using the softmax regression model, the mathematical model of the short message classification model can be expressed as the following expression (2):

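$$
h_\theta(x_i) =
\begin{bmatrix}
p(y_i = 1 \mid x_i; \theta) \\
p(y_i = 2 \mid x_i; \theta) \\
\vdots \\
p(y_i = k \mid x_i; \theta)
\end{bmatrix}
= \frac{1}{\sum_{j=1}^{k} e^{\theta_j^{T} x_i}}
\begin{bmatrix}
e^{\theta_1^{T} x_i} \\
e^{\theta_2^{T} x_i} \\
\vdots \\
e^{\theta_k^{T} x_i}
\end{bmatrix}
\tag{2}
$$

In expression (2) above, the class label $y_i \in \{1, 2, \ldots, k\}$, where k is the number of text classifications; $\theta_1, \theta_2, \ldots, \theta_k$ are the parameters to be solved by the model, from which the weight matrix can be constructed; and $x_i$ denotes the feature vector of the short message text.

The factor $\frac{1}{\sum_{j=1}^{k} e^{\theta_j^{T} x_i}}$ normalizes the probability distribution so that all probabilities sum to 1. With the classification model, the output probabilities are compared and the class with the highest probability is taken as the final classification result.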

Step S132, determining a cost function of the short message classification model and adding a weight decay term to the cost function, the weight decay term being used to make the cost function converge.

In this step, the cost function can be expressed as the following expression (3):

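$$
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_i = j\} \log \frac{e^{\theta_j^{T} x_i}}{\sum_{l=1}^{k} e^{\theta_l^{T} x_i}} \right] + \frac{\lambda}{2} \sum_{j=1}^{k} \sum_{l=1}^{N} \theta_{jl}^{2} \tag{3}
$$

In expression (3) above, m is the number of training texts, N is the dimension of the feature vector, and $1\{\cdot\}$ is the indicator function, whose value rule is: 1{expression whose value is true} = 1, and 1{expression whose value is false} = 0. The term $\frac{\lambda}{2} \sum_{j=1}^{k} \sum_{l=1}^{N} \theta_{jl}^{2}$ is the decay term added to the cost function.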

In the embodiments of the invention, a decay term is added to the cost function of the short message classification model built on the softmax model; the decay term penalizes excessively large parameter values during model solving. After the weight decay term is added (λ > 0), the cost function becomes strictly convex, which guarantees a unique solution. In the solving process, the Hessian matrix is generally used to determine the extrema of a multivariate function; the weight decay term makes the Hessian matrix of the cost function invertible, which ensures convergence of the algorithm. It also drives the resulting parameters θ toward 0, which reduces model complexity and increases the robustness of the algorithm.
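A softmax model with a weight-decayed cost function, in the standard form that the description of expressions (2) and (3) suggests, can be sketched as below. The function names are assumptions, class labels are 0-based here instead of 1 to k, and the toy data are invented; this is an illustration of the technique rather than the patented implementation.

```python
import math


def softmax_probabilities(theta, x):
    """Class probabilities for feature vector x; theta holds one weight row per class."""
    scores = [sum(t * v for t, v in zip(row, x)) for row in theta]
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cost(theta, xs, ys, lam):
    """Mean negative log-likelihood plus the weight decay term (lam / 2) * sum(theta^2)."""
    data_term = 0.0
    for x, y in zip(xs, ys):
        data_term -= math.log(softmax_probabilities(theta, x)[y])
    data_term /= len(xs)
    decay_term = lam / 2.0 * sum(t * t for row in theta for t in row)
    return data_term + decay_term


# Two toy training texts with 4-dimensional feature vectors and 2 classes.
xs = [[1, 1, 0, 0], [0, 0, 1, 1]]
ys = [0, 1]
zero_theta = [[0.0] * 4 for _ in range(2)]
initial_cost = cost(zero_theta, xs, ys, lam=0.01)
```

With all parameters at zero every class is equally likely, so the cost equals ln 2 for two classes; the decay term then adds a penalty that grows with the squared size of the parameters.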

Step S133, solving the short message classification model by gradient descent using the feature vectors of the word segmentation results and the cost function, to obtain the parameters of the short message classification model.

In one embodiment, step S133 may specifically include:

Step S133-01, acquiring the cost function with the added weight decay term, and obtaining the gradient of the cost function by differentiation.

In this step, the cost function is differentiated to obtain a gradient formula as shown in the following expression (4):

In the above expression (4), ∇_{θ_j} J(θ) is itself a vector, whose l-th element ∂J(θ)/∂θ_{jl} represents the partial derivative of J(θ) with respect to the l-th component of θ_j. Substituting the partial derivative formula into a gradient descent method or the like to minimize J(θ) yields the following expression (5):
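For reference, assuming the cost function is the standard softmax regression cost with an attenuation term (λ/2)·Σθ², its gradient with respect to θ_j and the corresponding gradient descent update take the forms below. These are reconstructions under that assumption and may differ in detail from expressions (4) and (5). The symbols m (number of training texts) and x_i, y_i (feature vector and specified type of the i-th training text) are assumed here.

```latex
% Gradient of the cost function with respect to \theta_j, in the role of expression (4)
\nabla_{\theta_j} J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}
  \left[ x_i \left( 1\{y_i = j\} - p(y_i = j \mid x_i; \theta) \right) \right]
  + \lambda\,\theta_j

% Gradient descent update, in the role of expression (5)
\theta_j := \theta_j - \alpha\,\nabla_{\theta_j} J(\theta), \qquad j = 1, \dots, k
```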

and S133-02, solving the minimum value of the cost function by using the gradient of the cost function and the feature vector of the word segmentation result to obtain the parameters of the short message classification model.

In this step, the feature vectors and the specified types of the training texts are substituted into the gradient formula of expression (4) and the cost function of expression (3), and the iteration is repeated until the weight matrix that minimizes the cost function J(θ) is obtained.

In the above expressions, α is the learning rate, with a value in the range 0.005 to 0.03, and λ is the coefficient of the weight attenuation term, with a value in the range 0.001 to 0.1.
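The iteration in step S133-02 can be sketched as follows. This is a minimal illustration, assuming the standard softmax regression cost with an attenuation term (λ/2)·Σθ² and plain batch gradient descent; the function names, the toy data, and the fixed number of steps are hypothetical and are not taken from the embodiment.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cost_and_gradient(theta, xs, ys, lam):
    """Return the cost with weight attenuation and its gradient.

    theta: k rows (one per type) of n weights.
    xs: feature vectors; ys: type indices (0-based).
    """
    m, k, n = len(xs), len(theta), len(theta[0])
    cost = 0.0
    grad = [[0.0] * n for _ in range(k)]
    for x, y in zip(xs, ys):
        probs = softmax([sum(w * v for w, v in zip(row, x)) for row in theta])
        cost -= math.log(probs[y]) / m
        for j in range(k):
            err = (1.0 if j == y else 0.0) - probs[j]  # 1{y = j} - p(y = j | x)
            for l in range(n):
                grad[j][l] -= x[l] * err / m
    for j in range(k):
        for l in range(n):
            cost += 0.5 * lam * theta[j][l] ** 2   # weight attenuation term
            grad[j][l] += lam * theta[j][l]
    return cost, grad

def train(xs, ys, k, alpha=0.03, lam=0.001, steps=2000):
    """Batch gradient descent: theta_j := theta_j - alpha * gradient_j."""
    theta = [[0.0] * len(xs[0]) for _ in range(k)]
    for _ in range(steps):
        _, grad = cost_and_gradient(theta, xs, ys, lam)
        theta = [[w - alpha * g for w, g in zip(row, grow)]
                 for row, grow in zip(theta, grad)]
    return theta

# Toy data invented for illustration: 4 words, 2 types.
xs = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
ys = [0, 0, 1, 1]
theta = train(xs, ys, k=2)
```

The default values of `alpha` and `lam` are chosen from within the ranges stated above; a real implementation would more likely stop on a convergence test than after a fixed number of steps.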

Step S134, a word segmentation vector weight matrix is constructed by using the parameters of the short message classification model, and matrix elements in the word segmentation vector weight matrix are used for representing the weight values of the feature vectors of the word segmentation result.

According to steps S131 to S134, the weight matrix of the training result is constructed from the model parameters obtained by solving. Because a cost function with a weight attenuation term is used during model training, the accuracy and performance of model classification are improved.

In the embodiment of the invention, the word segmentation array obtained by training is stored in a file and serves as the basis for converting short message text into feature vectors when industry short messages are classified, so that the conversion is consistent with the one used during training. The weight matrix constructed from the parameters obtained by solving the short message classification model is stored as well.
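Storing and reloading the two training artifacts can be sketched as follows. The embodiment only states that they are stored; the JSON format, the single-file layout, and the names `save_model` and `load_model` are assumptions made for illustration.

```python
import json
import os
import tempfile

def save_model(path, seg_array, weight_matrix):
    """Store the word segmentation array and the weight matrix in one JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"seg_array": seg_array, "weights": weight_matrix}, f,
                  ensure_ascii=False)

def load_model(path):
    """Read back the array and matrix so classification uses the training-time word order."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["seg_array"], data["weights"]

# Hypothetical toy artifacts: 3 words, 2 types.
path = os.path.join(tempfile.mkdtemp(), "sms_model.json")
save_model(path, ["loan", "code", "sale"], [[0.9, -0.2, 0.1], [-0.3, 0.8, 0.0]])
seg_array, weights = load_model(path)
```

Keeping the array and the matrix together is one way to guarantee that column i of the matrix always refers to word i of the array.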

In an embodiment, the step S140 may specifically include:

step S141, performing word segmentation processing on the short message text to be classified, and determining a word segmentation result of the short message text to be classified by using the frequency of words in the short message text to be classified obtained through the word segmentation processing and the frequency of phrases formed among the words in the short message text to be classified.

Step S142, determining a feature vector of the word segmentation result of the short message text to be classified according to the constructed word segmentation array and the word segmentation result of the short message text to be classified.

According to the steps S141 and S142, the word segmentation result of the short message text to be classified is obtained by the word segmentation processing method in the embodiment, and the feature vector of the word segmentation result of the short message text to be classified is determined by using the word segmentation result and the stored word segmentation array.

As an example, the text to be classified is converted into feature vectors by the above-mentioned word segmentation processing method and text vector conversion method.

Words that are not present in the word segmentation array are not represented in the feature vector. For example, the word segmentation result of the text [nice time good like flowing] is [nice] [time] [good like] [flowing], and the words stored in the word segmentation array are [Chinese] [human] [reality] [emotion] [flexibility] [time] [flowing] [all] [approximately]. The two words [time] and [flowing] are at positions 6 and 7 of the word segmentation array, while the two words [nice] and [good like] do not exist in the array; because they were not seen in training, they cannot be described in the feature vector. The value of the text converted into the feature vector is therefore still [0, 0, 0, 0, 0, 1, 1, 0, 0].
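The conversion in this example can be sketched as follows, using the nine words listed above with [time] and [flowing] at positions 6 and 7 (counting from 1). The function name `to_feature_vector` is hypothetical, and the 0/1 encoding is assumed from the example.

```python
def to_feature_vector(seg_result, seg_array):
    """Return a 0/1 vector with one component per word in seg_array."""
    position = {word: i for i, word in enumerate(seg_array)}
    vector = [0] * len(seg_array)
    for word in seg_result:
        if word in position:            # words not seen in training are skipped
            vector[position[word]] = 1
    return vector

seg_array = ["Chinese", "human", "reality", "emotion", "flexibility",
             "time", "flowing", "all", "approximately"]
seg_result = ["nice", "time", "good like", "flowing"]
print(to_feature_vector(seg_result, seg_array))
# [0, 0, 0, 0, 0, 1, 1, 0, 0]
```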

In the embodiment of the invention, in order to avoid and reduce the situation that the feature vector cannot completely describe the text, the number and diversity of training samples are required to meet certain requirements and cannot be too small when the mathematical model is trained.

Step S143, substituting the feature vector of the word segmentation result of the short message text to be classified into the word segmentation vector weight matrix constructed from the model parameters, to obtain the weight values at the positions in the word segmentation vector weight matrix that correspond to the words in the short message text to be classified.

As an example, in the word segmentation vector weight matrix, a row may represent a type of short message text, such as a loan type, a verification code type, or an advertisement type, and a column represents a word in the word segmentation result. Each component of the word segmentation vector weight matrix represents the possibility that the word at the position of its column belongs to the type represented by its row; the higher the weight value, the higher the possibility that the word belongs to that type.

As a specific example, suppose the training texts are classified into three types, type A, type B, and type C, and the vocabulary used for model training, that is, the number of words obtained by performing word segmentation on all the training texts, is 100. The finally trained data model is then a 3 × 100 weight matrix.

The 100 weight values of the first row represent the possibility that each of the 100 trained words belongs to type A; the 100 weight values of the second row represent the possibility that each of the 100 trained words belongs to type B; and the 100 weight values of the third row represent the possibility that each of the 100 trained words belongs to type C.

In practice, the vocabulary used for training will probably be tens of thousands or even hundreds of thousands, and the weight matrix of the classification model will be very large. However, the vocabulary amount of the text to be classified is far lower than the total word amount of training, so that when the matrix is actually used, not all components are required to participate in the operation, and the calculation is performed only according to the weight value of the vocabulary in the text to be classified at the position of the corresponding column in the matrix.
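Restricting the calculation to the columns of the words that actually occur in the text can be sketched as follows. The function names and the small 3 × 5 matrix are hypothetical; the sketch assumes a 0/1 feature vector, so that the per-type score is simply the sum of the weights in the active columns.

```python
def active_columns(feature_vector):
    """Indices of the components equal to 1, i.e. the words present in the text."""
    return [i for i, v in enumerate(feature_vector) if v == 1]

def type_scores(weight_matrix, feature_vector):
    """For each type (row), sum only the weights in the active columns."""
    cols = active_columns(feature_vector)
    return [sum(row[c] for c in cols) for row in weight_matrix]

# Toy weight matrix invented for illustration: 3 types, 5 trained words.
weights = [
    [0.5, -0.1, 0.2, 0.0, 0.3],    # type A
    [-0.2, 0.4, 0.1, 0.6, -0.3],   # type B
    [0.1, 0.1, -0.4, 0.2, 0.0],    # type C
]
x = [0, 1, 0, 1, 0]                # only words 2 and 4 occur in the text
scores = type_scores(weights, x)
```

With a vocabulary of tens of thousands of words and a text of a few dozen words, this touches a few dozen columns per row instead of the whole row.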

Step S144, substituting the weight values at the corresponding positions into the trained short message classification model to obtain the short message classification to which the short message text to be classified belongs.

As an example, the feature vector of the word segmentation result of the short message text to be classified is substituted into the word segmentation vector weight matrix of the short message classification model, and the sequence number of each component with a value of 1 in the feature vector is used to find the corresponding column in the weight matrix. The weight values in the corresponding columns are respectively substituted into the following expression (6):

In the above expression (6), θ₁, θ₂, …, θ_k are the parameters of the model obtained by solving, that is, the weight values in the weight matrix constructed from the model parameters, and x_i represents the feature vector of the word segmentation result of the short message text to be classified.
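For reference, the probability produced by a standard softmax regression model with parameters θ₁, θ₂, …, θ_k has the form below. It is given as an assumption about the shape of expression (6) and may differ from it in detail; y_i (the type of the text with feature vector x_i) and the summation index c are assumed symbols.

```latex
p(y_i = j \mid x_i; \theta) = \frac{e^{\theta_j^{T} x_i}}{\sum_{c=1}^{k} e^{\theta_c^{T} x_i}},
\qquad j = 1, \dots, k
```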

In this step, the weight values in the corresponding columns are substituted into expression (6), the probability that the text to be classified belongs to each short message type is solved, and the type with the maximum probability value is taken as the classification result of the text.
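Choosing the type with the maximum probability can be sketched as follows, assuming the probabilities come from a softmax over the per-type scores. The function name `classify`, the type names, and the toy weights are hypothetical.

```python
import math

def classify(weight_matrix, feature_vector, type_names):
    """Return (most probable type, list of probabilities per type)."""
    scores = [sum(w * v for w, v in zip(row, feature_vector))
              for row in weight_matrix]
    top = max(scores)                          # subtract the maximum for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return type_names[best], probs

type_names = ["loan", "verification code", "advertisement"]
weights = [
    [1.2, -0.5, 0.0, 0.1],
    [-0.4, 1.5, 0.2, 0.0],
    [0.0, 0.1, 0.9, 0.8],
]
x = [0, 1, 1, 0]
label, probs = classify(weights, x, type_names)
print(label)
# verification code
```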

According to the short message classification method provided by the embodiment of the invention, word frequency is introduced: the word segmentation result is determined by counting the number of occurrences of segmented words and calculating weights. This improves word segmentation accuracy and addresses the errors that arise in the prior art when word segmentation is performed with a forward maximum matching algorithm.

In addition, according to the short message classification method provided by the embodiment of the invention, the accuracy and performance of model classification are improved by optimizing the algorithm parameters, which allows industry short messages to be analyzed better. This addresses two problems of the prior art in classifying industry short messages: low performance of the classification model when there are a large number of training samples, and inaccurate classification caused by fuzzy boundaries between short message contents. The accuracy and precision of industry short message classification are thereby improved.

The following describes a short message classification device according to an embodiment of the present invention in detail with reference to the accompanying drawings.

Fig. 3 is a schematic structural diagram of a short message classification device according to an embodiment of the present invention. As shown in fig. 3, the short message classification apparatus 300 includes:

the short message text word segmentation module 310 is configured to perform word segmentation processing on a training text of an industry short message, and determine a word segmentation result of the training text by using the frequency of words obtained through the word segmentation processing and the frequency of phrases formed among the words;

the feature vector determination module 320 is configured to construct a word segmentation array according to the word segmentation result of the training text, match words included in the word segmentation result of the training text with words in the word segmentation array, and determine a feature vector of the word segmentation result according to the matching result;

the classification model training module 330 is configured to train a short message classification model through the feature vectors of the word segmentation results, and obtain model parameters of the trained short message classification model by using a cost function of a weight attenuation term in the training process;

the text classification determining module 340 is configured to construct a word segmentation vector weight matrix by using the model parameters, determine a short message classification to which the short message text to be classified belongs based on the word segmentation vector weight matrix, and use matrix elements in the word segmentation vector weight matrix to represent weight values of feature vectors of word segmentation results.

In one embodiment, the short message text word segmentation module 310 may include:

the dictionary matching unit is used for determining the weight value of the words in the dictionary, matching the training text with the dictionary to obtain matching results in different matching modes, and re-determining the weight value of the words in the dictionary according to the matching results;

the phrase relation determining unit is used for determining the weight value of the phrases in the phrase relation table, determining the phrases existing in the matching result according to the phrase relation table, and re-determining the weight value of the phrases in the phrase relation table according to the phrases in the matching result;

the weight value determining unit is used for calculating the weight values of the matching results in different matching modes based on the weight values of the words in the matching results in different re-determined matching modes and the weight values of the word groups in the matching results in different re-determined matching modes;

and the word segmentation result determining unit is used for selecting the matching result under the matching mode with the maximum weight value obtained by calculation as the word segmentation result of the training text.
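The selection performed by the last two units can be sketched as follows. How the word weights and phrase weights are combined is not specified in this passage, so the sketch assumes a plain sum, treats a phrase as an ordered pair of adjacent words, and uses hypothetical names and toy weights.

```python
def result_weight(candidate, word_weight, phrase_weight):
    """Weight of one matching result: word weights plus weights of adjacent-word phrases."""
    total = sum(word_weight.get(w, 0) for w in candidate)
    total += sum(phrase_weight.get(pair, 0)
                 for pair in zip(candidate, candidate[1:]))
    return total

def pick_segmentation(candidates, word_weight, phrase_weight):
    """Select the matching result with the maximum weight as the segmentation result."""
    return max(candidates,
               key=lambda c: result_weight(c, word_weight, phrase_weight))

# Two matching results for the same toy text, invented for illustration.
candidates = [["bank", "card", "number"], ["bankcard", "number"]]
word_weight = {"bank": 3, "card": 2, "number": 4, "bankcard": 6}
phrase_weight = {("bankcard", "number"): 5}
best = pick_segmentation(candidates, word_weight, phrase_weight)
print(best)
# ['bankcard', 'number']
```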

In an embodiment, the dictionary matching unit is further specifically configured to:

acquiring the weight value of a word in a dictionary, and setting the weight value for each word in the dictionary when the dictionary is used for the first time;

according to the appointed sequence of the training text, taking the current character in the training text as a starting character, taking each character after the current character as an ending character, and sequentially determining whether a character string formed by the starting character to each ending character is matched with a word in a dictionary;

and if the matching result is matched with the word in the dictionary, increasing the weight value of the matched word, and taking the next character of the current character as a new starting character according to the specified sequence of the training text until the next character is an empty character to obtain the matching result of the training text and the dictionary in different matching modes.
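The scan performed by the dictionary matching unit can be sketched as follows. The initial weight of 1, the increment of 1 per match, and the inclusion of single-character strings are assumptions, since the passage does not state them; the function name and the toy dictionary are hypothetical.

```python
def match_dictionary(text, word_weight):
    """Find every dictionary word in text and raise its weight by 1 per match.

    Returns the matches as (start, end, word) tuples, with end exclusive.
    """
    matches = []
    for start in range(len(text)):                    # current character as starting character
        for end in range(start + 1, len(text) + 1):   # each candidate ending character
            piece = text[start:end]
            if piece in word_weight:
                word_weight[piece] += 1
                matches.append((start, end, piece))
    return matches

# Weights are set when the dictionary is first used (initial value 1 is assumed).
word_weight = dict.fromkeys(["bank", "card", "bankcard", "an"], 1)
matches = match_dictionary("bankcard", word_weight)
print(matches)
# [(0, 4, 'bank'), (0, 8, 'bankcard'), (1, 3, 'an'), (4, 8, 'card')]
```

The overlapping matches returned here are the raw material from which the different matching results are assembled.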

In one embodiment, the phrase relationship determining unit is specifically configured to:

acquiring the weight value of the phrases in the phrase relation table, and setting the weight value for each phrase in the phrase relation table when the phrase relation table is used for the first time;

determining, according to the phrase relation table, whether a phrase relationship exists between the words of the matching results in the different matching modes;

and if a phrase relationship exists between the words of the matching results in the different matching modes, acquiring the phrase in the matching result, and increasing the weight value of the phrase in the phrase relation table that is the same as the phrase in the matching result.
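The update of the phrase relation table can be sketched as follows. The sketch assumes that a phrase is an ordered pair of adjacent words, that the table is a dictionary keyed by such pairs, and that each occurrence raises the weight by 1; none of these details is fixed by the passage, and all names are hypothetical.

```python
def update_phrase_weights(candidates, phrase_weight):
    """Raise the weight of every table phrase found between adjacent words."""
    found = []
    for words in candidates:                    # one matching result per matching mode
        for pair in zip(words, words[1:]):      # adjacent words may form a phrase
            if pair in phrase_weight:
                phrase_weight[pair] += 1
                found.append(pair)
    return found

candidates = [["bank", "card", "number"], ["bankcard", "number"]]
phrase_weight = {("bankcard", "number"): 1,
                 ("card", "number"): 1,
                 ("credit", "card"): 1}
found = update_phrase_weights(candidates, phrase_weight)
print(found)
# [('card', 'number'), ('bankcard', 'number')]
```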

In one embodiment, the feature vector determination module 320 includes:

the word segmentation result duplication removing unit is used for obtaining all word segmentation results through the word segmentation result of each training text, and carrying out duplication removing processing on all word segmentation results to obtain the duplicate-removed word segmentation results;

the word segmentation array construction unit is used for determining the dimension of the feature vector according to the number of words in the de-duplicated word segmentation result, constructing a word segmentation array, and storing the words of the de-duplicated word segmentation result in the word segmentation array, wherein the dimension of the word segmentation array is equal to the dimension of the feature vector;

and the characteristic vector value determination unit is used for matching the words in the word segmentation result of the training text with the words in the word segmentation array according to the word segmentation result of each training text, and determining the value of the characteristic vector of the training text according to the position of the matched word in the word segmentation array to obtain the characteristic vector of the word segmentation result.
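The three units above can be sketched together as follows. Keeping the words in first-seen order and using a 0/1 encoding are assumptions; the function names and the toy word segmentation results are hypothetical.

```python
def build_seg_array(seg_results):
    """De-duplicate all words, keeping first-seen order, to form the word segmentation array."""
    return list(dict.fromkeys(word for result in seg_results for word in result))

def feature_vectors(seg_results, seg_array):
    """One 0/1 vector per training text, with a 1 at the position of each of its words."""
    index = {word: i for i, word in enumerate(seg_array)}
    vectors = []
    for result in seg_results:
        vec = [0] * len(seg_array)      # dimension equals the array length
        for word in result:
            vec[index[word]] = 1
        vectors.append(vec)
    return vectors

# Word segmentation results of two toy training texts.
seg_results = [["loan", "rate", "low"], ["code", "valid", "low"]]
seg_array = build_seg_array(seg_results)
vectors = feature_vectors(seg_results, seg_array)
print(seg_array)
# ['loan', 'rate', 'low', 'code', 'valid']
print(vectors)
# [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```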

In one embodiment, the classification model training module 330 may include:

the regression model determining unit is used for constructing a short message classification model by using a softmax regression model, wherein the short message classification model is used for determining the probability value of a word segmentation result for each preset text type;

the cost function determination unit is used for determining a cost function of the short message classification model, and a weight attenuation item is added in the cost function and is used for converging the cost function;

the classification model solving unit is used for solving the short message classification model by a gradient descent method, using the feature vector of the word segmentation result and the cost function, to obtain the parameters of the short message classification model;

and the weight matrix construction unit is used for constructing a word segmentation vector weight matrix by using the parameters of the short message classification model, and matrix elements in the word segmentation vector weight matrix are used for expressing the weight values of the feature vectors of the word segmentation result.

In an embodiment, the classification model solving unit is further specifically configured to:

obtaining a cost function added with a weight attenuation term, and obtaining the gradient of the cost function through derivation operation; and solving the minimum value of the cost function by utilizing the gradient of the cost function and the feature vector of the word segmentation result to obtain the parameters of the short message classification model.

In one embodiment, the text classification determination module 340 includes:

the text word segmentation unit is used for performing word segmentation processing on the short message text to be classified, and determining a word segmentation result of the short message text to be classified by using the frequency of words in the short message text to be classified obtained through the word segmentation processing and the frequency of phrases formed among the words in the short message text to be classified;

the feature vector determining unit is used for determining the feature vector of the word segmentation result of the short message text to be classified according to the constructed word segmentation array and the word segmentation result of the short message text to be classified;

the weight value determining unit is used for substituting the feature vector of the word segmentation result of the short message text to be classified into the word segmentation vector weight matrix constructed from the model parameters, to obtain the weight values at the positions in the word segmentation vector weight matrix that correspond to the words in the short message text to be classified;

and the short message classification determining unit is used for respectively bringing the weight values at the corresponding positions into the trained short message classification model to obtain the short message classification to which the short message text to be classified belongs.

According to the short message classification device provided by the embodiment of the invention, word frequency is introduced: the word segmentation result of the training text is determined by using the frequency of the words obtained by word segmentation processing and the frequency of the phrases formed among the words, which improves word segmentation accuracy. In addition, a cost function with a weight attenuation term is used when training the classification model, which optimizes the algorithm parameters and improves the accuracy and performance of model classification.

It is to be understood that the invention is not limited to the particular arrangements and instrumentality described in the above embodiments and shown in the drawings. For convenience and brevity of description, detailed description of a known method is omitted here, and for the specific working processes of the system, the module and the unit described above, reference may be made to corresponding processes in the foregoing method embodiments, which are not described herein again.

Fig. 4 is a block diagram illustrating an exemplary hardware architecture of a computing device capable of implementing the short message classification method and apparatus according to the embodiments of the present invention.

As shown in fig. 4, computing device 400 includes an input device 401, an input interface 402, a central processor 403, a memory 404, an output interface 405, and an output device 406. The input interface 402, the central processor 403, the memory 404, and the output interface 405 are connected to each other through a bus 410, and the input device 401 and the output device 406 are connected to the bus 410 through the input interface 402 and the output interface 405, respectively, and further connected to other components of the computing device 400.

Specifically, the input device 401 receives input information from the outside and transmits the input information to the central processor 403 through the input interface 402; the central processor 403 processes the input information based on computer-executable instructions stored in the memory 404 to generate output information, stores the output information temporarily or permanently in the memory 404, and then transmits the output information to the output device 406 through the output interface 405; output device 406 outputs the output information outside of computing device 400 for use by a user.

In one embodiment, the computing device 400 shown in fig. 4 may be implemented as a short message classification system that may include: a memory configured to store a program; and a processor configured to run the program stored in the memory to execute the short message classification method described in the above embodiments.

According to an embodiment of the invention, the process described above with reference to the flow chart may be implemented as a computer software program. For example, embodiments of the invention include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network, and/or installed from a removable storage medium.

The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be, for example, a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape) or a semiconductor medium (e.g., a solid state disk).

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A short message classification method comprises the following steps:

and constructing a word segmentation vector weight matrix by using the model parameters, determining short message classification to which the short message text to be classified belongs based on the word segmentation vector weight matrix, wherein matrix elements in the word segmentation vector weight matrix are used for representing weight values of feature vectors of the word segmentation result.

2. The short message classification method according to claim 1, wherein the performing word segmentation processing on the training text of the industry short message, and determining the word segmentation result of the training text by using the frequency of the words obtained by the word segmentation processing and the frequency of the word group formed between the words comprises:

determining the weight value of the words in the dictionary, matching the training text with the dictionary to obtain matching results in different matching modes, and re-determining the weight value of the words in the dictionary according to the matching results;

determining the weight value of the phrases in a phrase relation table, determining the phrases existing in the matching result according to the phrase relation table, and re-determining the weight value of the phrases in the phrase relation table according to the phrases in the matching result;

calculating the weight values of the matching results in different matching modes based on the re-determined weight values of the words in the matching results in different matching modes and the re-determined weight values of the word groups in the matching results in different matching modes;

and selecting the matching result under the matching mode with the maximum weight value obtained by calculation as the word segmentation result of the training text.

3. The short message classification method according to claim 2, wherein the determining the weight values of the words in the dictionary, matching the training text with the dictionary to obtain matching results in different matching modes, and re-determining the weight values of the words in the dictionary according to the matching results comprises:

acquiring a weight value of a word in a dictionary, and setting the weight value for each word in the dictionary when the dictionary is used for the first time;

according to the appointed sequence of the training text, taking the current character in the training text as a starting character, taking each character after the current character as an ending character, and determining whether a character string formed by the starting character to each ending character is matched with a word in the dictionary or not in sequence;

and if the matching result is matched with the word in the dictionary, increasing the weight value of the matched word, and taking the next character of the current character as a new starting character according to the specified sequence of the training text until the next character is a null character to obtain the matching result of the training text and the dictionary in different matching modes.

4. The short message classification method according to claim 2, wherein the determining the weight value of the word group in the word group relationship table, determining the word group existing in the matching result according to the word group relationship table, and re-determining the weight value of the word group in the word group relationship table according to the word group in the matching result comprises:

acquiring a weight value of a phrase in a phrase relation table, and setting the weight value for each phrase in the phrase relation table when the phrase relation table is used for the first time;

determining, according to the phrase relation table, whether a phrase relationship exists between the words of the matching results in the different matching modes;

and if a phrase relationship exists between the words of the matching results in the different matching modes, acquiring the phrase in the matching result, and increasing the weight value of the phrase in the phrase relation table that is the same as the phrase in the matching result.

5. The short message classification method according to claim 1, wherein the constructing a word segmentation array according to the word segmentation result of the training text, matching words included in the word segmentation result of the training text with words in the word segmentation array, and determining the feature vector of the word segmentation result according to the matching result comprises:

obtaining all word segmentation results through the word segmentation result of each training text, and performing duplication elimination processing on all the word segmentation results to obtain duplication eliminated word segmentation results;

determining a feature vector dimension according to the number of words in the de-duplicated word segmentation result, constructing a word segmentation array, and storing the words in the de-duplicated word segmentation result in the word segmentation array, wherein the dimension of the word segmentation array is equal to the feature vector dimension;

and aiming at the word segmentation result of each training text, matching the words in the word segmentation result of the training text with the words in the word segmentation array, and determining the value of the feature vector of the training text according to the position of the matched words in the word segmentation array to obtain the feature vector of the word segmentation result.

6. The short message classification method according to claim 1, wherein the training of the short message classification model through the feature vector of the word segmentation result and the use of the cost function of the weight attenuation term in the training process to obtain the model parameters of the trained short message classification model comprises:

constructing a short message classification model by using a softmax regression model, wherein the short message classification model is used for determining the probability value of the word segmentation result for each preset text type;

determining a cost function of the short message classification model, and adding a weight attenuation item in the cost function, wherein the weight attenuation item is used for converging the cost function;

solving the short message classification model by a gradient descent method, using the feature vector of the word segmentation result and the cost function, to obtain the parameters of the short message classification model;

and constructing the word segmentation vector weight matrix by using the parameters of the short message classification model, wherein matrix elements in the word segmentation vector weight matrix are used for expressing the weight values of the feature vectors of the word segmentation result.

7. The short message classification method according to claim 6, wherein the obtaining parameters of the short message classification model by solving the short message classification model through the feature vector of the word segmentation result and the cost function by a gradient descent method comprises:

obtaining a cost function added with the weight attenuation term, and obtaining the gradient of the cost function through derivation operation;

and solving the minimum value of the cost function by using the gradient of the cost function and the feature vector of the word segmentation result to obtain the parameters of the short message classification model.

8. The short message classification method according to claim 1, wherein classifying the short message text to be classified based on the model parameters, and determining the short message classification to which the short message text to be classified belongs comprises:

performing word segmentation processing on the short message text to be classified, and determining a word segmentation result of the short message text to be classified by using the frequency of the words obtained by the word segmentation processing and the frequency of phrases formed among those words;

determining a feature vector of the word segmentation result of the short message text to be classified according to the constructed word segmentation array and the word segmentation result of the short message text to be classified;

substituting the feature vector of the word segmentation result of the short message text to be classified into the model parameter to construct a word segmentation vector weight matrix, and obtaining the weight value of the word in the short message text to be classified at the corresponding position in the word segmentation vector weight matrix;

and substituting the weight values at the corresponding positions into the trained short message classification model to obtain the short message classification to which the short message text to be classified belongs.
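As a rough illustration of the lookup-and-classify steps of claim 8, the sketch below assumes that the text has already been segmented, that the feature vector marks which words of the word segmentation array occur in the text, and that the weight matrix has one row per preset text type. The vocabulary, weights, class names, and function names are hypothetical values chosen for the example.

```python
import math

def feature_vector(tokens, segmentation_array):
    # 1 where the word of the array occurs in the text, 0 otherwise (assumed encoding).
    present = set(tokens)
    return [1 if word in present else 0 for word in segmentation_array]

def classify(tokens, segmentation_array, weight_matrix, class_names):
    x = feature_vector(tokens, segmentation_array)
    # For each class, collect the weights at the positions of the words that occur.
    scores = [sum(w for w, flag in zip(row, x) if flag) for row in weight_matrix]
    # Softmax over the class scores, shifted by the maximum for stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs

# Hypothetical word segmentation array, weight matrix, and text types.
demo_array = ["balance", "recharge", "flight", "boarding"]
demo_weights = [[2.0, 1.5, -1.0, -1.2],   # weights for "telecom"
                [-1.0, -0.8, 1.8, 2.1]]   # weights for "travel"
demo_classes = ["telecom", "travel"]

demo_label, demo_probs = classify(["your", "flight", "boarding", "gate"],
                                  demo_array, demo_weights, demo_classes)
```

Words that are not in the word segmentation array ("your", "gate") contribute nothing to the scores, so only the weights at the matched positions decide the classification.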

9. A short message classification device, comprising:

the feature vector determination module is used for constructing a word segmentation array according to the word segmentation result of the training text, matching words contained in the word segmentation result of the training text with words in the word segmentation array, and determining the feature vector of the word segmentation result according to the matching result;

the classification model training module is used for training a short message classification model through the feature vector of the word segmentation result and obtaining model parameters of the trained short message classification model by using a cost function with a weight attenuation term in the training process;

and the text classification determining module is used for constructing a word segmentation vector weight matrix by using the model parameters, determining short message classification to which the short message text to be classified belongs based on the word segmentation vector weight matrix, wherein matrix elements in the word segmentation vector weight matrix are used for representing weight values of feature vectors of word segmentation results.
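The work of the feature vector determination module can be sketched as follows. The claim does not fix how the array is ordered or how a match is encoded, so this sketch assumes that the word segmentation array holds each distinct training word once, in first-seen order, and that a match is recorded as 1 and a miss as 0. All names and sample words are hypothetical.

```python
def build_segmentation_array(segmented_texts):
    # Collect each distinct word once, in the order it is first seen.
    array, seen = [], set()
    for tokens in segmented_texts:
        for word in tokens:
            if word not in seen:
                seen.add(word)
                array.append(word)
    return array

def to_feature_vector(tokens, array):
    # Match the words of one segmentation result against the array.
    present = set(tokens)
    return [1 if word in present else 0 for word in array]

# Hypothetical word segmentation results of two training texts.
training_results = [["flight", "delayed", "gate"],
                    ["balance", "recharge", "flight"]]
segmentation_array = build_segmentation_array(training_results)
training_vectors = [to_feature_vector(t, segmentation_array) for t in training_results]
```

Every feature vector has the same length as the word segmentation array, which lets the training module treat the training texts as fixed-length inputs.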

10. The apparatus for classifying short messages according to claim 9, wherein the short message text word segmentation module comprises:

11. The short message classification device of claim 9, wherein the classification model training module comprises:

the regression model determining unit is used for constructing a short message classification model by using a softmax regression model, and the short message classification model is used for determining the probability value of the word segmentation result for each preset text type;

a cost function determining unit, configured to determine a cost function of the short message classification model and add a weight attenuation term to the cost function, wherein the weight attenuation term is used to make the cost function converge;

the classification model solving unit is used for solving the short message classification model by a gradient descent method, using the feature vector of the word segmentation result and the cost function, to obtain the parameters of the short message classification model;

and the weight matrix construction unit is used for constructing the word segmentation vector weight matrix by using the parameters of the short message classification model, wherein matrix elements in the word segmentation vector weight matrix are used for expressing the weight values of the feature vectors of the word segmentation result.

12. A short message classification system is characterized by comprising a memory and a processor;

the memory is used for storing executable program codes;

the processor is used for reading the executable program code stored in the memory to execute the short message classification method of any one of claims 1 to 8.

13. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of classifying short messages according to any one of claims 1 to 8.