CN111428034A - Training method of classification model, and classification method and device of comment information


Info

Publication number
CN111428034A
Authority
CN
China
Prior art keywords
comment, sequence, model, vocabulary, information
Prior art date
Legal status
Pending
Application number
CN202010206016.3A
Other languages
Chinese (zh)
Inventor
刘中伟
张一凡
刘云
Current Assignee
JD Digital Technology Holdings Co Ltd
Original Assignee
JD Digital Technology Holdings Co Ltd
Priority date
Filing date
Publication date
Application filed by JD Digital Technology Holdings Co Ltd
Priority to CN202010206016.3A
Publication of CN111428034A

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
            • G06F 16/30 Information retrieval of unstructured textual data
              • G06F 16/35 Clustering; Classification
          • G06F 18/00 Pattern recognition
            • G06F 18/20 Analysing
              • G06F 18/25 Fusion techniques
                • G06F 18/254 Fusion techniques of classification results, e.g. of results related to same input data
              • G06F 18/29 Graphical models, e.g. Bayesian networks
                • G06F 18/295 Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a training method for a classification model, together with a method and device for classifying comment information. The training process comprises: obtaining a first comment sample and a second comment sample, where the sample data in the first comment sample corresponds one-to-one with the sample data in the second comment sample; training an original first model on the first comment sample and an original second model on the second comment sample to obtain a trained first model and a trained second model; and combining the two trained models into the final classification model. The classification model is used to determine the comment level of comment information. Because the training samples adopted during training fully take into account the relations between the words in the comment information, they provide efficient sample data for model training, and the constructed classification model outputs more accurate classification results for comment information.

Description

Training method of classification model, and classification method and device of comment information
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a training method of a classification model, and a classification method and device of comment information.
Background
With the continuous development of computer and internet technologies, people can conveniently obtain the information and services they need from network platforms or applications, such as shopping platforms, map applications, and meal-ordering applications. A user can view other users' evaluations of an article or a service on a network platform or in an application in order to make an independent choice.
As service platforms and applications have multiplied, practices such as review brushing (posting fake reviews in bulk) and negative-review marketing inevitably occur, so that review text and review level become inconsistent. To deal with this, a platform or application should classify comment information by level, so as to ensure that the comment information it carries is authentic and credible.
At present, comment-level classification of the comment information on a service platform mostly depends on patterns in the model's training data; the intrinsic characteristics of the text are not sufficiently mined, so the classification accuracy is not high.
Disclosure of Invention
The embodiment of the invention provides a training method for a classification model, and a method and device for classifying comment information, which improve the accuracy of comment information classification.
A first aspect of the present invention provides a method for training a classification model for determining a comment level of comment information, the method including:
obtaining a first comment sample, wherein the first comment sample comprises a first comment vocabulary sequence and a first comment grade sequence corresponding to the first comment vocabulary sequence; the first comment vocabulary sequence is obtained by performing word segmentation processing on original comment information;
acquiring a second comment sample, wherein the second comment sample comprises a second comment vocabulary sequence and a second comment grade sequence corresponding to the second comment vocabulary sequence; the second comment vocabulary sequence is obtained by performing word segmentation processing and vocabulary combination on the original comment information;
training an original first model according to the first comment sample to obtain a trained first model, and training an original second model according to the second comment sample to obtain a trained second model;
and combining the trained first model and the trained second model to obtain the classification model.
In one possible implementation manner, the performing word segmentation and word combination on the original comment information includes:
performing word segmentation processing on the original comment information to obtain an original comment vocabulary sequence;
and combining the words of the original comment word sequence by adopting a preset window distance to obtain a second comment word sequence, wherein the second comment word sequence comprises a plurality of groups of comment words.
In one possible implementation, the original comment information includes punctuation marks; the method for combining the words of the original comment word sequence by adopting the preset window distance to obtain the second comment word sequence comprises the following steps:
combining the positions of the punctuations and performing vocabulary combination on the original comment vocabulary sequence by adopting a preset window distance to obtain a second comment vocabulary sequence; and the vocabularies before and after the punctuation mark are not combined.
Optionally, the first comment vocabulary sequence includes a plurality of comment vocabularies with part of speech tagged.
Optionally, the second comment vocabulary sequence includes a plurality of sets of comment vocabularies, and each set of comment vocabularies includes at least two vocabularies with part of speech tagged.
Optionally, the first sequence of comment levels includes the number of different comment levels in the first sequence of comment words, and the second sequence of comment levels includes the number of different comment levels in the second sequence of comment words.
In a possible implementation manner, the combining the trained first model and the trained second model to obtain the classification model includes:
and carrying out weighted combination on the trained first model and the trained second model to obtain the classification model.
Optionally, the first model and the second model are both hidden Markov models (HMMs).
A second aspect of the present invention provides a method for classifying comment information, including:
obtaining comment information to be classified;
preprocessing the comment information to obtain a first comment vocabulary sequence and a second comment vocabulary sequence; the first comment vocabulary sequence comprises a plurality of comment vocabularies marked with parts of speech, the second comment vocabulary sequence comprises a plurality of groups of comment vocabularies, and each group of comment vocabularies comprises at least two vocabularies marked with parts of speech;
inputting the first comment vocabulary sequence and the second comment vocabulary sequence into a classification model to obtain a comment grade of the comment information;
the classification model is obtained by training an original first model and an original second model and combining the trained first model and the trained second model and is used for determining the comment grade of the comment information.
In a possible implementation manner, the preprocessing the comment information to obtain a first comment vocabulary sequence and a second comment vocabulary sequence includes:
performing word segmentation processing on the comment information to obtain a first comment vocabulary sequence;
and post-processing the first comment vocabulary sequence to obtain a second comment vocabulary sequence.
In a possible implementation manner, the performing word segmentation and word combination on the comment information to obtain the second comment word sequence includes:
performing word segmentation processing on the comment information to obtain an original comment vocabulary sequence;
and carrying out vocabulary combination on the original comment vocabulary sequence by adopting a preset window distance to obtain the second comment vocabulary sequence.
Optionally, the comment information includes punctuation marks.
In a possible implementation manner, the performing vocabulary combination on the original comment vocabulary sequence by using a preset window distance to obtain the second comment vocabulary sequence includes:
combining the positions of the punctuations and performing vocabulary combination on the original comment vocabulary sequence by adopting a preset window distance to obtain a second comment vocabulary sequence; and the vocabularies before and after the punctuation mark are not combined.
In a possible implementation manner, the inputting the first comment vocabulary sequence and the second comment vocabulary sequence into a classification model to obtain a comment level of the comment information includes:
inputting the first comment vocabulary sequence into the first model in the classification model to obtain a first comment grade sequence corresponding to the first comment vocabulary sequence;
inputting the second comment vocabulary sequence into the second model in the classification model to obtain a second comment grade sequence corresponding to the second comment vocabulary sequence;
and determining the comment grade of the comment information according to the first comment grade sequence and the second comment grade sequence.
Optionally, the first sequence of comment levels includes the number of different comment levels in the first sequence of comment words, and the second sequence of comment levels includes the number of different comment levels in the second sequence of comment words.
In one possible implementation, the determining the comment level of the comment information according to the first comment level sequence and the second comment level sequence includes:
determining the comprehensive number of different comment grades according to the number of different comment grades in the first comment grade sequence and the number of different comment grades in the second comment grade sequence;
and taking the comment grade with the maximum comprehensive quantity as the comment grade of the comment information.
In one possible implementation, the determining a composite number of different review levels according to the number of different review levels in the first sequence of review levels and the number of different review levels in the second sequence of review levels includes:
and carrying out weighted summation on the number of the same comment grade in the first comment grade sequence and the second comment grade sequence to obtain the comprehensive number corresponding to each comment grade.
Optionally, the first model and the second model are both hidden Markov models (HMMs).
A third aspect of the present invention provides a training apparatus for a classification model, including:
the comment processing device comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring a first comment sample, and the first comment sample comprises a first comment vocabulary sequence and a first comment level sequence corresponding to the first comment vocabulary sequence; the first comment vocabulary sequence is obtained by performing word segmentation processing on original comment information;
the obtaining module is further configured to obtain a second comment sample, where the second comment sample includes a second comment vocabulary sequence and a second comment level sequence corresponding to the second comment vocabulary sequence; the second comment vocabulary sequence is obtained by performing word segmentation processing and vocabulary combination on original comment information;
the processing module is used for training an original first model according to the first comment sample to obtain a trained first model, and training an original second model according to the second comment sample to obtain a trained second model;
and combining the trained first model and the trained second model to obtain the classification model.
A fourth aspect of the present invention provides a classification device of comment information, including:
the obtaining module is used for obtaining comment information to be classified;
the processing module is used for preprocessing the comment information to obtain a first comment vocabulary sequence and a second comment vocabulary sequence; the first comment vocabulary sequence comprises a plurality of comment vocabularies marked with parts of speech, the second comment vocabulary sequence comprises a plurality of groups of comment vocabularies, and each group of comment vocabularies comprises at least two vocabularies marked with parts of speech;
inputting the first comment vocabulary sequence and the second comment vocabulary sequence into a classification model to obtain a comment grade of the comment information;
the classification model is obtained by training an original first model and an original second model and combining the trained first model and the trained second model and is used for determining the comment grade of the comment information.
A fifth aspect of the present invention provides a training apparatus for classification models, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first aspects.
A sixth aspect of the present invention provides a classification device of comment information, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the second aspects.
A seventh aspect of the present invention provides a computer-readable storage medium storing a computer program which, when executed on a computer, causes the computer to perform the method of any one of the first aspects.
An eighth aspect of the present invention provides a computer-readable storage medium storing a computer program which, when executed on a computer, causes the computer to perform the method of any one of the second aspects.
The embodiment of the invention provides a training method for a classification model, together with a method and device for classifying comment information. The training method comprises: obtaining a first comment sample and a second comment sample, wherein the first comment sample comprises a first comment vocabulary sequence and a corresponding first comment level sequence, and the second comment sample comprises a second comment vocabulary sequence and a corresponding second comment level sequence; the first comment vocabulary sequence corresponds to the second comment vocabulary sequence. An original first model is trained on the first comment sample to obtain a trained first model, an original second model is trained on the second comment sample to obtain a trained second model, and the two trained models are combined into the final classification model, which is used to determine the comment level of comment information. Because the training samples adopted during training fully take into account the relations between the words in the comment information, the sampling efficiency and accuracy of the sample data are high, providing efficient sample data for model training, and the classification results output by the constructed classification model are more accurate.
Drawings
FIG. 1 is a flowchart of a method for training a classification model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for training a classification model according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for classifying comment information according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a training apparatus for classification models according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a comment information classification apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the hardware structure of a training apparatus for classification models according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the hardware structure of a comment information classification apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Current network service platforms all provide an evaluation section for articles or services, through which a user can evaluate an ordered article (such as a household appliance or a personal item) or a service (such as a consulting or delivery service). A user's evaluation reflects his or her preference for, or willingness to recommend, that article or service. However, as service platforms have multiplied, there are cases of malicious negative comments, brushed (fake) reviews, untruthful review information, mismatches between review text and review level, and the like. To ensure that its comment information stays credible, a service platform should be able to intelligently classify and screen comment information.
In the related art, service platforms mainly classify comment information in the following ways:
one is to use text embedding plus classification model. The method mainly takes characters, words or whole sentences in the comment information as input of a classification model, learns the comment grade corresponding to the comment information by using the classification model, and then predicts new comment information. The characters, words or sentences can adopt the classic method in natural language processing, such as Word bag, TF-IDF, Word2Vec and the like, and the classification model can adopt the common supervised machine learning model including DNN, tree model, naive Bayes and the like.
The second is to adopt a clustering model. Such a model needs no labels for the comment information and can group comments without knowing their comment levels. The comment information is first embedded, a common clustering method is then applied to obtain several categories, and the categories are finally identified using prior knowledge. The clustering model can be any unsupervised machine learning model, including K-means, hierarchical clustering, autoencoders, and the like.
The third is to adopt a topic model, an unsupervised model from the operations research paradigm that repeatedly resamples the data and corrects the class distribution of the samples, finally yielding, for each sample, the class to which it most likely belongs; it is likewise a modelling paradigm that needs no labels.
The first scheme suffers from loss of model input information: every layer of text embedding loses information, and the loss grows as the embedding dimensionality shrinks. Meanwhile, the classification model depends on patterns in the training data and does not mine the inherent characteristics of the text, so the classification accuracy is not high.
The clustering model of the second scheme shows considerable uncertainty in practical applications and is sensitive to incremental data: the same comment can receive different labels under different training periods or training regimes, so classification stability cannot be guaranteed in practice and accuracy is not high. In addition, clustering models are poorly interpretable and cannot give a good explanation for a specific business.
The topic model of the third scheme has, besides the problems of the clustering model, an uncontrollable sampling process.
In summary, current classification models for comment information all suffer from low classification accuracy and poor stability. In view of these problems, embodiments of the present invention provide a method for training a classification model; the classification model trained according to the embodiments of the present invention outputs classification results with good stability and accuracy.
Specifically, the classification model provided by the embodiment of the invention adopts a hidden Markov model (HMM) as its underlying structure. By the Markov property, the model is guaranteed to converge once the transition matrix is determined, which avoids large oscillations in the classification results caused by repeated runs or growth in the data volume. The classification method provided by the embodiment of the invention does not embed characters, words, or sentences, so it avoids losing a large amount of information and retains the inherent characteristics of the comment data to the maximum extent for subsequent modelling and prediction tasks. In addition, the classification method better considers the relations between the words in the comment information and establishes efficient sample data before modelling, avoiding the low sampling efficiency and poor accuracy that might otherwise occur.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
FIG. 1 is a flowchart of a method for training a classification model according to an embodiment of the present invention. As shown in FIG. 1, the training method for the classification model comprises two parts: the collection of training samples and the training of the classification model.
The collection process for training samples comprises: performing word segmentation processing on the original comment information to obtain a first comment vocabulary sequence, and performing vocabulary combination on the segmented vocabulary sequence to obtain a second comment vocabulary sequence.
As an example, after word segmentation of the original comment information, non-text symbols (e.g., punctuation marks) and stop words (e.g., common words such as "of", "is", "this", "that") may further be removed, giving the first comment vocabulary sequence. Different stop words can be configured for the actual application scenario; the embodiment of the invention places no restriction on this. It should be noted that the order of these operations is not fixed: word segmentation may also be performed after the non-text symbols and stop words have been removed from the original comment information.
Illustratively, suppose the original comment information from a user about an article is "the chair is very firm, but the comfort is average". After word segmentation and removal of punctuation marks and stop words, the first comment vocabulary sequence is obtained: "chair", "very", "firm", "comfort", "general".
As an example, after performing word segmentation processing on the original comment information, word combination may be further performed on the word sequence, a comment word combination may be constructed according to a window distance of 2, and the word sequence after word segmentation processing is traversed from left to right to obtain a second comment word sequence. Optionally, if the original comment information includes stop words, the stop words may be removed first after the word segmentation processing, and then the vocabulary combination may be performed. If the original comment information also includes punctuation marks, the words before and after the punctuation marks are not combined when the words are combined.
Illustratively, the first comment vocabulary sequence is represented as (W1, W2, …, Wn), where Wi is the i-th comment vocabulary in the first comment vocabulary sequence, i = 1, 2, …, n. Combining vocabularies with a window distance of 2 gives the second comment vocabulary sequence (C1, C2, …, Cn-1), where Cj is the j-th comment vocabulary group in the second comment vocabulary sequence, j = 1, 2, …, n-1, i.e. Cj combines Wj and Wj+1. It should be noted that window distances of other lengths may be set according to actual requirements; this is not limited in the embodiment of the present invention.
Illustratively, combining the first comment vocabulary sequence obtained in the example above, namely "chair", "very", "firm", "comfort", "general", two vocabularies at a time yields the second comment vocabulary sequence: "chair very", "very firm", "firm comfort", "comfort general".
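As a concrete illustration of this step, the following Python sketch builds the window-2 combinations from an already segmented vocabulary sequence. The function name combine_window is ours, not from the patent, and the English tokens stand in for the translated example above.

```python
def combine_window(words, window=2):
    """Slide a window of the given size over the segmented words and
    join each group of adjacent words into one combined comment vocabulary."""
    return [" ".join(words[i:i + window])
            for i in range(len(words) - window + 1)]

first_sequence = ["chair", "very", "firm", "comfort", "general"]
second_sequence = combine_window(first_sequence)
print(second_sequence)
# ['chair very', 'very firm', 'firm comfort', 'comfort general']
```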
The training process of the classification model comprises a training process of a first model in the classification model, a training process of a second model in the classification model and a post-processing process. Specifically, a first comment vocabulary sequence and a second comment vocabulary sequence of a preset number are obtained, a first model is trained based on the first comment vocabulary sequence of the preset number, and meanwhile, a second model is trained based on the second comment vocabulary sequence of the preset number. And carrying out post-processing on output results of the first model and the second model to finally obtain the comment grade of the original comment information. The post-processing refers to fusing the output results of the first model and the second model according to a preset fusion rule to obtain the comment grade of the original comment information.
The training method provided by the embodiment of the invention thus has two sets of training sample data, which are fed into different models within the classification model for training. When producing output, the classification model considers the classification results of both models at once and outputs the final classification result according to a preset fusion rule, so the resulting classification is more accurate.
The above training process is described in detail below with reference to FIG. 2. FIG. 2 is a flowchart of a training method for a classification model according to an embodiment of the present invention; as shown in FIG. 2, the training method according to the embodiment of the present invention includes the following steps:
step 101, a first comment sample is obtained, wherein the first comment sample comprises a first comment vocabulary sequence and a first comment grade sequence corresponding to the first comment vocabulary sequence.
The first comment vocabulary sequence is obtained by performing word segmentation processing on original comment information.
Specifically, the original comment information of a service platform's comment section can be read in using a common programming language (such as Python, R, or SQL). The read-in original comment information is then segmented into words, for example with the Jieba word segmenter.
As an example, after the word segmentation process, a built-in stop-word and punctuation removal method may be adopted to remove common words such as "of", "is", "this", "that" and all punctuation marks from the comment information, finally giving the first comment vocabulary sequence. As in the foregoing embodiments, the execution order of word segmentation, punctuation removal, and stop-word removal is not limited in any way by the embodiments of the present invention.
Illustratively, suppose the original comment information is "the color of this kettle is very nice and the quality is good". After word segmentation and removal of stop words and punctuation marks, the first comment vocabulary sequence "kettle", "color", "very nice", "quality", "good" is obtained.
As can be seen from the above example, the first comment vocabulary sequence includes a plurality of comment vocabularies, and the first comment vocabulary sequence is arranged directly in the order of the vocabularies in the original comment information.
Optionally, the first comment vocabulary sequence includes a plurality of comment vocabularies tagged with parts of speech; specifically, an ICTCLAS-compatible tagging method may be adopted to tag the parts of speech of the comment vocabularies in the first comment vocabulary sequence.
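As a sketch of this preprocessing, the snippet below uses the jieba library's POS tagger (whose tag set is ICTCLAS-compatible) to segment a comment and drop stop words and punctuation. The stop-word list here is a toy placeholder, production lists are much larger, and treating the tag "x" as punctuation follows jieba's usual convention.

```python
import jieba.posseg as pseg  # POS tagging with an ICTCLAS-compatible tag set

STOP_WORDS = {"的", "是", "这个", "那个"}  # toy placeholder stop-word list

def first_comment_sequence(comment):
    """Segment a raw comment into (word, part-of-speech) pairs,
    dropping stop words and punctuation (usually tagged 'x' by jieba)."""
    return [(pair.word, pair.flag)
            for pair in pseg.lcut(comment)
            if pair.word not in STOP_WORDS and pair.flag != "x"]

print(first_comment_sequence("这个水壶颜色很正，质量很好"))
```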
While obtaining the first comment vocabulary sequence, a manually labelled first comment level sequence corresponding to it must also be obtained. The first comment level sequence comprises the numbers of the different comment levels in the first comment vocabulary sequence.
The comment grades in the embodiment of the invention comprise good comment, medium comment and bad comment. Of course, other forms of comment levels are also possible, such as very good, general, bad, very bad, and the like, and the embodiment of the present invention is not limited in any way as to the specific form of comment level.
As an example, the first comment level sequence may be represented as (S1, S2, S3), where S1 is the number of "good comment" words in the first comment vocabulary sequence, S2 is the number of "medium comment" words, and S3 is the number of "bad comment" words.
Illustratively, for the first comment vocabulary sequence "kettle", "color", "very nice", "quality", "good", the corresponding first comment level sequence may be represented as (2,0,0), where "very nice" and "good" are good-comment words. For the first comment vocabulary sequence "kettle", "color", "general", "quality", "good", the corresponding first comment level sequence may be represented as (1,1,0), where "general" is a medium-comment word and "good" is a good-comment word.
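In the patent these level sequences are manually labelled; purely to reproduce the worked example, the toy sketch below derives the (S1, S2, S3) triple from small hypothetical sentiment lexicons that are ours, not the patent's:

```python
GOOD_WORDS = {"very nice", "good"}  # hypothetical toy lexicons; the patent
MEDIUM_WORDS = {"general"}          # relies on manual labels instead
BAD_WORDS = {"bad"}

def level_sequence(words):
    """Count how many words of the vocabulary sequence fall into each
    sentiment lexicon, giving the (S1, S2, S3) comment level sequence."""
    return (sum(w in GOOD_WORDS for w in words),
            sum(w in MEDIUM_WORDS for w in words),
            sum(w in BAD_WORDS for w in words))

print(level_sequence(["kettle", "color", "very nice", "quality", "good"]))  # (2, 0, 0)
print(level_sequence(["kettle", "color", "general", "quality", "good"]))    # (1, 1, 0)
```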
And 102, obtaining a second comment sample, wherein the second comment sample comprises a second comment vocabulary sequence and a second comment grade sequence corresponding to the second comment vocabulary sequence.
The second comment vocabulary sequence is obtained by performing word segmentation processing and vocabulary combination on the original comment information. Specifically, the second comment vocabulary sequence includes a plurality of sets of comment vocabularies, and each set of comment vocabularies includes at least two vocabularies with part of speech tagged.
Alternatively, the preset window distance may be set to 2. Of course, different window distances may be set according to actual requirements, and the embodiment of the present invention is not limited in any way.
As an example, after the original comment information is segmented, a built-in stop-word removal method may be adopted to remove common words such as "of", "is", "this", "that" from the comment information, giving an original comment vocabulary sequence. Vocabulary combination with a preset window distance is then applied to this original vocabulary sequence to obtain the second comment vocabulary sequence, which comprises a plurality of groups of comment vocabularies.
For example, take the original comment information "the color of this kettle is very nice". After word segmentation and stop-word removal, the obtained vocabulary sequence is "kettle", "color", "very nice". Combining this sequence with the preset window distance of 2 gives the vocabulary combination sequence "kettle color", "color very nice". The second comment vocabulary sequence obtained in this way fully reflects the internal relations among the vocabularies and captures the real meaning of the original comment information.
Optionally, if the original comment information includes a punctuation mark, when the original comment vocabulary sequence is subjected to vocabulary combination by using the preset window distance, the original comment vocabulary sequence is subjected to vocabulary combination by combining the position of the punctuation mark, so as to obtain a second comment vocabulary sequence. Wherein, the words before and after the punctuation mark are not combined.
Illustratively, take the original comment information "the color of the kettle is very nice, and the quality is good". Word segmentation and stop-word removal give the vocabulary sequence "kettle / color / very nice / , / quality / good". Taking the position of the punctuation mark into account and combining vocabularies with the preset window distance of 2 gives the second comment vocabulary sequence "kettle color", "color very nice", "quality good".
It should be noted that if the original comment information is segmented and the stop words and punctuation marks are removed first, and the vocabulary sequence is then combined with the preset window distance of 2, the obtained second comment vocabulary sequence is "kettle color", "color very nice", "very nice quality", "quality good". This sequence carries redundant information: "very nice quality" is a meaningless combination, as there is no relation between "very nice" and "quality"; "very nice" is an adjective ending the first clause, while "quality" is a noun beginning the second clause. It is therefore necessary to use the positions of the punctuation marks to prune the vocabulary combinations semantically.
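The punctuation-aware combination can be sketched as follows: split the segmented token list at punctuation marks, then window-combine inside each clause only. The function and variable names are ours, and the English placeholder tokens stand in for the Chinese example above.

```python
PUNCTUATION = frozenset({",", "，", "。", "！", "？", "."})

def combine_with_punctuation(tokens, window=2):
    """Window-combine adjacent words, but never across a punctuation mark:
    split the token list into clauses at punctuation first, then combine
    inside each clause."""
    clauses, current = [], []
    for tok in tokens:
        if tok in PUNCTUATION:
            if current:
                clauses.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        clauses.append(current)
    combined = []
    for clause in clauses:
        combined.extend(" ".join(clause[i:i + window])
                        for i in range(len(clause) - window + 1))
    return combined

tokens = ["kettle", "color", "very nice", ",", "quality", "good"]
print(combine_with_punctuation(tokens))
# ['kettle color', 'color very nice', 'quality good']
```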
While obtaining the second comment vocabulary sequence, a manually labelled second comment level sequence corresponding to it must also be obtained. The second comment level sequence comprises the numbers of the different comment levels in the second comment vocabulary sequence.
Similar to the first comment level sequence described above, the second comment level sequence can also be represented as (S1, S2, S3), where S1 is the number of "good comment" vocabulary groups in the second comment vocabulary sequence, S2 is the number of "medium comment" groups, and S3 is the number of "bad comment" groups.
For example, for the second comment vocabulary sequence "kettle color", "color very nice", "quality good", the corresponding second comment level sequence may be represented as (2,0,0), where "color very nice" and "quality good" are good-comment vocabulary groups.
Step 103, training the original first model according to the first comment sample to obtain a trained first model, and training the original second model according to the second comment sample to obtain a trained second model.
In the embodiment of the present invention, the first comment sample and the second comment sample are obtained based on the same original comment information, and have a certain difference, which can be specifically referred to in step 101 and step 102.
The method comprises the steps of obtaining a preset number of original comment information, and respectively obtaining a preset number of first comment samples and a preset number of second comment samples. Training an original first model according to a preset number of first comment samples, and simultaneously training an original second model according to a preset number of second comment samples, so that the first model and the second model can output comment level sequences corresponding to comment vocabulary sequences with high accuracy.
Optionally, the first model and the second model are both hidden Markov models (HMMs); that is, the first model may be a first HMM and the second model a second HMM. The input of the first HMM is a first comment vocabulary sequence, and its output is the first comment level sequence corresponding to that vocabulary sequence. The input of the second HMM is a second comment vocabulary sequence, and its output is the second comment level sequence corresponding to that vocabulary sequence.
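The patent does not spell out how the HMM parameters are fitted. Assuming standard supervised maximum-likelihood estimation, with comment levels as hidden states and comment vocabularies as observations, a minimal training sketch could look as follows; all function and variable names are ours.

```python
from collections import defaultdict

def train_hmm(samples):
    """Estimate HMM parameters by counting over labelled samples.
    Each sample is a list of (vocabulary, level) pairs, where the level
    ("good" / "medium" / "bad") is the hidden state and the vocabulary
    is the observation. Returns initial-state, transition, and emission
    probability tables."""
    init = defaultdict(float)
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))
    for sample in samples:
        levels = [level for _, level in sample]
        init[levels[0]] += 1
        for prev, nxt in zip(levels, levels[1:]):
            trans[prev][nxt] += 1
        for word, level in sample:
            emit[level][word] += 1

    def normalise(table):  # turn raw counts into probabilities
        total = sum(table.values())
        return {k: v / total for k, v in table.items()}

    return (normalise(init),
            {s: normalise(t) for s, t in trans.items()},
            {s: normalise(t) for s, t in emit.items()})
```

Decoding a new vocabulary sequence into a level sequence would then use the standard Viterbi algorithm over these tables.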
And 104, combining the trained first model and the trained second model to obtain a classification model.
As an example, combining the trained first model and the trained second model to obtain a classification model includes: and carrying out weighted combination on the trained first model and the trained second model to obtain a classification model.
That is, the classification model provided by the embodiment of the invention comprises a first model and a second model. The input of the classification model is a first comment vocabulary sequence and a second comment vocabulary sequence corresponding to the same original comment information. The output of the classification model is the comment level corresponding to that original comment information, determined from the output of the first model and the output of the second model within the classification model.
In a possible implementation manner, the comment grade corresponding to the original comment information is obtained by weighting the output result of the first model and the output result of the second model.
The training method for the classification model provided by the embodiment of the invention comprises: obtaining a first comment sample and a second comment sample, wherein the first comment sample comprises a first comment vocabulary sequence and a corresponding first comment level sequence, and the second comment sample comprises a second comment vocabulary sequence and a corresponding second comment level sequence; the first comment vocabulary sequence corresponds to the second comment vocabulary sequence. An original first model is trained on the first comment sample to obtain a trained first model, an original second model is trained on the second comment sample to obtain a trained second model, and the two trained models are combined into the final classification model, which is used to determine the comment level of comment information. Because the training samples adopted during training fully take into account the relations between the words in the comment information, the sampling efficiency and accuracy of the sample data are high, providing efficient sample data for model training, and the classification results output by the constructed classification model are more accurate. In addition, because the first model and the second model in the classification model both adopt HMM structures, the stability of the classification results is higher.
Based on the classification model constructed in the above embodiment, the following embodiment explains in detail how the service platform obtains the comment level corresponding to the new comment information according to the constructed classification model from the perspective of actual application of the model.
FIG. 3 is a flowchart of a method for classifying comment information according to an embodiment of the present invention. As shown in FIG. 3, the classification method provided in the embodiment of the present invention includes the following steps:
step 201, obtaining comment information to be classified.
Step 202, preprocessing the comment information to obtain a first comment vocabulary sequence and a second comment vocabulary sequence.
The first comment vocabulary sequence comprises a plurality of comment vocabularies marked with parts of speech, the second comment vocabulary sequence comprises a plurality of groups of comment vocabularies, and each group of comment vocabularies comprises at least two vocabularies marked with parts of speech.
Specifically, in a possible implementation manner, preprocessing the comment information to obtain a first comment vocabulary sequence includes: performing word segmentation processing on the comment information to obtain the first comment vocabulary sequence. Optionally, if the comment information includes stop words and/or punctuation marks, performing word segmentation processing on the comment information to obtain an original comment vocabulary sequence further includes: removing the stop words and/or punctuation marks from the original comment vocabulary sequence.
Specifically, in a possible implementation manner, the preprocessing is performed on the comment information to obtain a second comment vocabulary sequence, which includes: and performing word segmentation processing and word combination on the comment information to obtain a second comment word sequence. The implementation manner of the vocabulary combination is the same as that of step 102 in the above embodiment, which can be referred to the above embodiment specifically, and is not described herein again. Optionally, if the comment information includes stop words, after the comment information is subjected to word segmentation processing to obtain an original comment vocabulary sequence, the stop words in the original comment vocabulary sequence are removed, and then vocabulary combination is performed. Furthermore, if the comment information also comprises punctuation marks, word segmentation processing is carried out on the comment information, stop words are removed, vocabulary combination is carried out by combining the positions of the punctuation marks, the punctuation marks in the vocabulary sequence are removed while the vocabulary combination is completed, and a second comment vocabulary sequence is obtained.
Step 203, inputting the first comment vocabulary sequence and the second comment vocabulary sequence into the classification model to obtain the comment grade of the comment information.
The classification model is obtained by training an original first model and an original second model and combining the trained first model and the trained second model and is used for determining the comment grade of the comment information.
Optionally, the first model and the second model are both hidden Markov models (HMMs).
Specifically, in a possible implementation manner, inputting the first comment vocabulary sequence and the second comment vocabulary sequence into the classification model to obtain the comment level of the comment information includes: inputting the first comment vocabulary sequence into the first model of the classification model to obtain a first comment level sequence corresponding to the first comment vocabulary sequence; inputting the second comment vocabulary sequence into the second model of the classification model to obtain a second comment level sequence corresponding to the second comment vocabulary sequence; and determining the comment level of the comment information according to the first comment level sequence and the second comment level sequence.
Wherein the first sequence of comment grades comprises the number of different comment grades in the first sequence of comment words. The second sequence of comment levels includes a number of different comment levels in the second sequence of comment words.
As an example, determining a comment level of the comment information according to the first comment level sequence and the second comment level sequence includes: determining the comprehensive number of different comment grades according to the number of different comment grades in the first comment grade sequence and the number of different comment grades in the second comment grade sequence; and taking the comment grade with the maximum comprehensive quantity as the comment grade of the comment information.
Specifically, in a possible implementation manner, determining the comprehensive number of different comment levels according to the number of different comment levels in the first comment level sequence and the number of different comment levels in the second comment level sequence includes: and carrying out weighted summation on the number of the same comment grade in the first comment grade sequence and the second comment grade sequence to obtain the comprehensive number corresponding to each comment grade.
For ease of understanding, the process of step 203 is described in detail below by way of an example.
1. Assume the first comment vocabulary sequence is "a", "b", "c", "d", "e" and the second comment vocabulary sequence is "ab", "bc", "cd", "de". The two sequences thus correspond to each other, and the second comment vocabulary sequence carries richer semantic information than the first.
2. Input the first comment vocabulary sequence into the first model of the classification model to obtain the first comment level sequence (3,2,0), where 3 means that 3 comment vocabularies in the first comment vocabulary sequence are good-comment words, 2 means that 2 are medium-comment words, and 0 means there are no bad-comment words.
3. Input the second comment vocabulary sequence into the second model of the classification model to obtain the second comment level sequence (2,2,0), where the first 2 means that 2 comment vocabulary groups in the second comment vocabulary sequence are good-comment groups, and the second 2 means that 2 groups are medium-comment groups.
4. Set the weight of the first model's output to 70% and that of the second model's output to 30%. The comprehensive number for the good-comment level is then 3 × 70% + 2 × 30% = 2.7, and the comprehensive number for the medium-comment level is 2 × 70% + 2 × 30% = 2.0.
5. The comment level with the largest comprehensive number is taken as the comment level of the comment information, so the comment level finally output in this example is good comment.
It should be noted that, although the first model's vocabulary sequence carries less semantic information than the second model's, its observation space is less affected by sparsity, so the first model usually needs the higher weight value (for example, 70% for the output of the first model and 30% for the output of the second model in the example above). Of course, the proportions (i.e. weight values) of the two models' outputs may also be adjusted according to actual requirements to construct an optimal combined model; this is not limited in the embodiment of the present invention.
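To make this post-processing concrete, here is a minimal sketch of the weighted fusion in steps 4 and 5 above; the function name is ours and the 70/30 split merely mirrors the example.

```python
LEVELS = ("good", "medium", "bad")

def fuse(first_counts, second_counts, w1=0.7, w2=0.3):
    """Weighted summation of the two models' per-level counts, then pick
    the level with the largest comprehensive number."""
    comprehensive = [w1 * a + w2 * b
                     for a, b in zip(first_counts, second_counts)]
    return LEVELS[comprehensive.index(max(comprehensive))], comprehensive

level, scores = fuse((3, 2, 0), (2, 2, 0))
print(level, scores)  # -> good, [2.7, 2.0, 0.0] up to floating-point rounding
```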
The method for classifying comment information provided by the embodiment of the invention is based on the combined model created above, which contains a first model and a second model. In general, the input of the combined model comprises a first comment vocabulary sequence, obtained by word segmentation of the comment information, and a second comment vocabulary sequence, obtained by word segmentation plus vocabulary combination. The first comment vocabulary sequence is analysed by the first model to give a first comment level sequence, the second comment vocabulary sequence is analysed by the second model to give a second comment level sequence, and after the combined model's post-processing (i.e. weighted summation) the comment level of the comment information is output. The classification results obtained in this way have higher accuracy and stability.
In the embodiment of the present invention, the training device or the classification device may be divided into functional modules according to the method embodiments; for example, one functional module may be provided per function, or two or more functions may be integrated into one processing module. The integrated module can be realised in hardware or as a software functional module. It should be noted that the division into modules in the embodiment of the present invention is schematic and is only a division by logical function; other divisions are possible in an actual implementation. The following description assumes one functional module per function.
FIG. 4 is a schematic structural diagram of a training apparatus for a classification model according to an embodiment of the present invention. As shown in FIG. 4, the training apparatus 300 according to the embodiment of the present invention includes:
the obtaining module 301 is configured to obtain a first comment sample, where the first comment sample includes a first comment vocabulary sequence and a first comment level sequence corresponding to the first comment vocabulary sequence; the first comment vocabulary sequence is obtained by performing word segmentation processing on original comment information;
the obtaining module 301 is further configured to obtain a second comment sample, where the second comment sample includes a second comment vocabulary sequence and a second comment level sequence corresponding to the second comment vocabulary sequence; the second comment vocabulary sequence is obtained by performing word segmentation processing and vocabulary combination on the original comment information;
a processing module 302, configured to train an original first model according to the first comment sample to obtain a trained first model, and train an original second model according to the second comment sample to obtain a trained second model;
and combining the trained first model and the trained second model to obtain the classification model.
Optionally, the processing module 302 is further configured to:
performing word segmentation processing on the original comment information to obtain an original comment vocabulary sequence;
and performing vocabulary combination on the original comment vocabulary sequence by using a preset window distance to obtain the second comment vocabulary sequence, where the second comment vocabulary sequence includes a plurality of groups of comment vocabularies.
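One plausible reading of vocabulary combination with a preset window distance is that each vocabulary is grouped with the vocabularies that follow it within the window. A minimal Python sketch under that assumption (the function name and the exact window semantics are assumptions, not taken from the disclosure):

    def combine_with_window(vocab_seq, window=2):
        # vocab_seq: the original comment vocabulary sequence produced by word
        # segmentation. Returns a list of groups of comment vocabularies.
        groups = []
        for i, word in enumerate(vocab_seq):
            for j in range(i + 1, min(i + window, len(vocab_seq))):
                groups.append((word, vocab_seq[j]))
        return groups

    # combine_with_window(["logistics", "fast", "quality", "good"], window=2)
    # -> [("logistics", "fast"), ("fast", "quality"), ("quality", "good")]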
Optionally, the original comment information includes punctuation marks; the processing module 302 is specifically configured to:
performing vocabulary combination on the original comment vocabulary sequence by using a preset window distance in combination with the positions of the punctuation marks to obtain the second comment vocabulary sequence, where the vocabularies before and after a punctuation mark are not combined.
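The punctuation constraint can be layered on top of the previous sketch: the sequence is split at punctuation marks and combination is applied within each resulting segment only, so that no group spans a punctuation mark (the punctuation set below is an illustrative assumption):

    PUNCTUATION = set("，。！？；,.!?;")

    def combine_respecting_punctuation(vocab_seq, window=2):
        groups, segment = [], []
        for token in vocab_seq:
            if token in PUNCTUATION:
                # Close the current segment: vocabularies before and after the
                # punctuation mark are never combined with each other.
                groups.extend(combine_with_window(segment, window))
                segment = []
            else:
                segment.append(token)
        groups.extend(combine_with_window(segment, window))
        return groups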
Optionally, the first comment vocabulary sequence includes a plurality of comment vocabularies with part of speech tagged.
Optionally, the second comment vocabulary sequence includes a plurality of sets of comment vocabularies, and each set of comment vocabularies includes at least two vocabularies with part of speech tagged.
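As a concrete illustration (the part-of-speech tags below follow common conventions and are assumptions, not taken from the disclosure), for a comment such as "logistics fast, quality good" the two sequences might look like:

    # First comment vocabulary sequence: vocabularies with part-of-speech tags.
    first_seq = [("logistics", "n"), ("fast", "a"), ("quality", "n"), ("good", "a")]

    # Second comment vocabulary sequence: groups of at least two tagged
    # vocabularies; the punctuation mark between "fast" and "quality" prevents
    # a ("fast", "quality") group from being formed.
    second_seq = [(("logistics", "n"), ("fast", "a")),
                  (("quality", "n"), ("good", "a"))]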
Optionally, the first comment grade sequence includes the number of different comment grades in the first comment vocabulary sequence, and the second comment grade sequence includes the number of different comment grades in the second comment vocabulary sequence.
Optionally, the processing module 302 is specifically configured to:
and carrying out weighted combination on the trained first model and the trained second model to obtain the classification model.
Optionally, the first model and the second model are both hidden Markov (HMM) models.
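Since both models are hidden Markov models, decoding a comment vocabulary sequence (the observations) into a comment grade sequence (the hidden states) amounts to finding the most probable state path, for example with the Viterbi algorithm. The following self-contained sketch uses a crude smoothing floor and assumes the start, transition and emission probabilities have already been estimated from the comment samples during training:

    import math

    def viterbi(observations, states, start_p, trans_p, emit_p):
        # Work in log space to avoid numerical underflow; unseen vocabularies
        # receive a tiny floor probability as a crude form of smoothing.
        def emit(s, o):
            return math.log(emit_p[s].get(o, 1e-9))

        V = [{s: math.log(start_p[s]) + emit(s, observations[0]) for s in states}]
        path = {s: [s] for s in states}
        for obs in observations[1:]:
            V.append({})
            new_path = {}
            for s in states:
                # Best predecessor state for s at this step.
                prob, prev = max(
                    (V[-2][p] + math.log(trans_p[p][s]) + emit(s, obs), p)
                    for p in states)
                V[-1][s] = prob
                new_path[s] = path[prev] + [s]
            path = new_path
        best = max(states, key=lambda s: V[-1][s])
        return path[best]  # the decoded comment grade sequence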
The training apparatus provided in the embodiment of the present invention is configured to execute the technical solution of the method embodiment shown in fig. 2, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 5 is a schematic structural diagram of a comment information classification apparatus according to an embodiment of the present invention. As shown in fig. 5, the classification apparatus 400 according to the embodiment of the present invention includes:
an obtaining module 401, configured to obtain comment information to be classified;
the processing module 402 is configured to pre-process the comment information to obtain a first comment vocabulary sequence and a second comment vocabulary sequence; the first comment vocabulary sequence comprises a plurality of comment vocabularies marked with parts of speech, the second comment vocabulary sequence comprises a plurality of groups of comment vocabularies, and each group of comment vocabularies comprises at least two vocabularies marked with parts of speech;
inputting the first comment vocabulary sequence and the second comment vocabulary sequence into a classification model to obtain a comment grade of the comment information;
the classification model is obtained by training an original first model and an original second model and combining the trained first model and the trained second model, and is used for determining the comment grade of the comment information.
Optionally, the processing module 402 is specifically configured to:
performing word segmentation processing on the comment information to obtain a first comment vocabulary sequence;
and performing word segmentation processing and word combination on the comment information to obtain a second comment word sequence.
Optionally, the processing module 402 is specifically configured to:
performing word segmentation processing on the comment information to obtain an original comment vocabulary sequence;
and carrying out vocabulary combination on the original comment vocabulary sequence by adopting a preset window distance to obtain the second comment vocabulary sequence.
Optionally, the comment information includes punctuation marks.
Optionally, the processing module 402 is specifically configured to:
performing vocabulary combination on the original comment vocabulary sequence by using a preset window distance in combination with the positions of the punctuation marks to obtain the second comment vocabulary sequence, where the vocabularies before and after a punctuation mark are not combined.
Optionally, the processing module 402 is specifically configured to:
inputting the first comment vocabulary sequence into the first model in the classification model to obtain a first comment grade sequence corresponding to the first comment vocabulary sequence;
inputting the second comment vocabulary sequence into the second model in the classification model to obtain a second comment grade sequence corresponding to the second comment vocabulary sequence;
and determining the comment grade of the comment information according to the first comment grade sequence and the second comment grade sequence.
Optionally, the first comment grade sequence includes the number of different comment grades in the first comment vocabulary sequence, and the second comment grade sequence includes the number of different comment grades in the second comment vocabulary sequence.
Optionally, the processing module 402 is specifically configured to:
determining the composite number of each comment grade according to the number of different comment grades in the first comment grade sequence and the number of different comment grades in the second comment grade sequence;
and taking the comment grade with the maximum composite number as the comment grade of the comment information.
Optionally, the processing module 402 is specifically configured to:
and performing weighted summation on the numbers of the same comment grade in the first comment grade sequence and the second comment grade sequence to obtain the composite number corresponding to each comment grade.
Optionally, the first model and the second model are both hidden Markov (HMM) models.
The classification device provided in the embodiment of the present invention is used for implementing the technical solution of the method embodiment shown in fig. 3, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 6 is a schematic hardware structure diagram of a training apparatus for a classification model according to an embodiment of the present invention. As shown in fig. 6, the training apparatus 500 provided in this embodiment includes:
at least one processor 501 (only one processor is shown in FIG. 6); and
a memory 502 communicatively coupled to the at least one processor; wherein
the memory 502 stores instructions executable by the at least one processor 501, the instructions being executable by the at least one processor 501 to enable the at least one processor 501 to perform the steps of the method embodiment of fig. 2 as previously described.
Fig. 7 is a schematic hardware structure diagram of a comment information classification apparatus according to an embodiment of the present invention. As shown in fig. 7, the classification apparatus 600 provided in this embodiment includes:
at least one processor 601 (only one processor is shown in FIG. 7); and
a memory 602 communicatively coupled to the at least one processor; wherein
the memory 602 stores instructions executable by the at least one processor 601 to cause the at least one processor 601 to perform the steps of the method embodiment of fig. 3 described above.
The classification device of the embodiment of the invention can be arranged on any network service platform to realize accurate classification of newly added comment information.
The embodiment of the present invention further provides a computer-readable storage medium, in which computer-executable instructions are stored; when executed by a processor, the computer-executable instructions are used for implementing the technical solution of the method embodiment shown in fig. 2.
The embodiment of the present invention further provides a computer-readable storage medium, in which computer-executable instructions are stored; when executed by a processor, the computer-executable instructions are used for implementing the technical solution of the method embodiment shown in fig. 3.
An embodiment of the present invention further provides a program which, when executed by a processor, performs the technical solution in any one of the foregoing method embodiments.
The embodiment of the present invention further provides a computer program product, which includes program instructions, where the program instructions are used to implement the technical solutions in any of the foregoing method embodiments.
An embodiment of the present invention further provides a chip, including: a processing module and a communication interface, wherein the processing module can execute the technical scheme in any one of the method embodiments.
Further, the chip may also include a storage module (e.g., a memory) configured to store instructions; the processing module is configured to execute the instructions stored in the storage module, and executing these instructions causes the processing module to perform the technical solution in any one of the foregoing method embodiments.
It should be understood that the processor mentioned in the embodiments of the present invention may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
It should also be understood that the memory referred to in the embodiments of the present invention may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be Read-Only Memory (ROM), Programmable ROM (PROM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory.
It should be noted that when the processor is a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, the memory (memory module) is integrated in the processor.
It should be noted that the memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (22)

1. A training method of a classification model, wherein the classification model is used for determining comment grades of comment information, and the method comprises the following steps:
obtaining a first comment sample, wherein the first comment sample comprises a first comment vocabulary sequence and a first comment grade sequence corresponding to the first comment vocabulary sequence; the first comment vocabulary sequence is obtained by performing word segmentation processing on original comment information;
acquiring a second comment sample, wherein the second comment sample comprises a second comment vocabulary sequence and a second comment grade sequence corresponding to the second comment vocabulary sequence; the second comment vocabulary sequence is obtained by performing word segmentation processing and vocabulary combination on the original comment information;
training an original first model according to the first comment sample to obtain a trained first model, and training an original second model according to the second comment sample to obtain a trained second model;
and combining the trained first model and the trained second model to obtain the classification model.
2. The method of claim 1, wherein the performing word segmentation processing and vocabulary combination on the original comment information comprises:
performing word segmentation processing on the original comment information to obtain an original comment vocabulary sequence;
and performing vocabulary combination on the original comment vocabulary sequence by using a preset window distance to obtain the second comment vocabulary sequence, wherein the second comment vocabulary sequence comprises a plurality of groups of comment vocabularies.
3. The method of claim 2, wherein the original comment information includes punctuation marks, and the performing vocabulary combination on the original comment vocabulary sequence by using a preset window distance to obtain the second comment vocabulary sequence comprises:
performing vocabulary combination on the original comment vocabulary sequence by using a preset window distance in combination with the positions of the punctuation marks to obtain the second comment vocabulary sequence, wherein the vocabularies before and after a punctuation mark are not combined.
4. The method of any of claims 1-3, wherein the first comment vocabulary sequence includes a plurality of comment vocabularies tagged with part of speech.
5. The method of any of claims 1-3, wherein the second comment vocabulary sequence includes a plurality of groups of comment vocabularies, each group of comment vocabularies including at least two vocabularies tagged with part of speech.
6. The method of any of claims 1-3, wherein the first comment grade sequence includes the number of different comment grades in the first comment vocabulary sequence, and the second comment grade sequence includes the number of different comment grades in the second comment vocabulary sequence.
7. The method according to any of claims 1-3, wherein said combining the trained first model and the trained second model to obtain the classification model comprises:
and carrying out weighted combination on the trained first model and the trained second model to obtain the classification model.
8. The method of any of claims 1-3, wherein the first model and the second model are both hidden Markov (HMM) models.
9. A method for classifying comment information, comprising:
obtaining comment information to be classified;
preprocessing the comment information to obtain a first comment vocabulary sequence and a second comment vocabulary sequence; the first comment vocabulary sequence comprises a plurality of comment vocabularies marked with parts of speech, the second comment vocabulary sequence comprises a plurality of groups of comment vocabularies, and each group of comment vocabularies comprises at least two vocabularies marked with parts of speech;
inputting the first comment vocabulary sequence and the second comment vocabulary sequence into a classification model to obtain a comment grade of the comment information;
the classification model is obtained by training an original first model and an original second model and combining the trained first model and the trained second model, and is used for determining the comment grade of the comment information.
10. The method of claim 9, wherein preprocessing the comment information to obtain a first sequence of comment words and a second sequence of comment words comprises:
performing word segmentation processing on the comment information to obtain the first comment vocabulary sequence;
and performing word segmentation processing and vocabulary combination on the comment information to obtain the second comment vocabulary sequence.
11. The method of claim 10, wherein the performing word segmentation processing and vocabulary combination on the comment information to obtain the second comment vocabulary sequence comprises:
performing word segmentation processing on the comment information to obtain an original comment vocabulary sequence;
and carrying out vocabulary combination on the original comment vocabulary sequence by adopting a preset window distance to obtain the second comment vocabulary sequence.
12. The method of claim 11, wherein the comment information includes punctuation marks, and the performing vocabulary combination on the original comment vocabulary sequence by using a preset window distance to obtain the second comment vocabulary sequence comprises:
performing vocabulary combination on the original comment vocabulary sequence by using a preset window distance in combination with the positions of the punctuation marks to obtain the second comment vocabulary sequence, wherein the vocabularies before and after a punctuation mark are not combined.
13. The method of any of claims 9-12, wherein the inputting the first comment vocabulary sequence and the second comment vocabulary sequence into a classification model to obtain a comment grade of the comment information comprises:
inputting the first comment vocabulary sequence into the first model in the classification model to obtain a first comment grade sequence corresponding to the first comment vocabulary sequence;
inputting the second comment vocabulary sequence into the second model in the classification model to obtain a second comment grade sequence corresponding to the second comment vocabulary sequence;
and determining the comment grade of the comment information according to the first comment grade sequence and the second comment grade sequence.
14. The method of claim 13, wherein the first comment grade sequence includes the number of different comment grades in the first comment vocabulary sequence, and the second comment grade sequence includes the number of different comment grades in the second comment vocabulary sequence.
15. The method of claim 13, wherein the determining the comment grade of the comment information according to the first comment grade sequence and the second comment grade sequence comprises:
determining the composite number of each comment grade according to the number of different comment grades in the first comment grade sequence and the number of different comment grades in the second comment grade sequence;
and taking the comment grade with the maximum composite number as the comment grade of the comment information.
16. The method of claim 15, wherein the determining the composite number of each comment grade according to the number of different comment grades in the first comment grade sequence and the number of different comment grades in the second comment grade sequence comprises:
and performing weighted summation on the numbers of the same comment grade in the first comment grade sequence and the second comment grade sequence to obtain the composite number corresponding to each comment grade.
17. The method of any of claims 9-12, wherein the first model and the second model are both hidden Markov (HMM) models.
18. A training device for classification models, comprising:
the comment processing device comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring a first comment sample, and the first comment sample comprises a first comment vocabulary sequence and a first comment level sequence corresponding to the first comment vocabulary sequence; the first comment vocabulary sequence is obtained by performing word segmentation processing on original comment information;
the obtaining module is further configured to obtain a second comment sample, where the second comment sample includes a second comment vocabulary sequence and a second comment level sequence corresponding to the second comment vocabulary sequence; the second comment vocabulary sequence is obtained by performing word segmentation processing and vocabulary combination on the original comment information;
the processing module is used for training an original first model according to the first comment sample to obtain a trained first model, and training an original second model according to the second comment sample to obtain a trained second model;
and combining the trained first model and the trained second model to obtain the classification model.
19. An apparatus for classifying comment information, comprising:
the obtaining module is used for obtaining comment information to be classified;
the processing module is used for preprocessing the comment information to obtain a first comment vocabulary sequence and a second comment vocabulary sequence; the first comment vocabulary sequence comprises a plurality of comment vocabularies marked with parts of speech, the second comment vocabulary sequence comprises a plurality of groups of comment vocabularies, and each group of comment vocabularies comprises at least two vocabularies marked with parts of speech;
inputting the first comment vocabulary sequence and the second comment vocabulary sequence into a classification model to obtain a comment grade of the comment information;
the classification model is obtained by training an original first model and an original second model and combining the trained first model and the trained second model, and is used for determining the comment grade of the comment information.
20. A training device for classification models, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
21. An apparatus for classifying comment information, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 9-17.
22. A computer-readable storage medium for storing a computer program which, when executed on a computer, causes the computer to perform the method of any one of claims 1-8 or the method of any one of claims 9-17.
CN202010206016.3A 2020-03-23 2020-03-23 Training method of classification model, and classification method and device of comment information Pending CN111428034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010206016.3A CN111428034A (en) 2020-03-23 2020-03-23 Training method of classification model, and classification method and device of comment information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010206016.3A CN111428034A (en) 2020-03-23 2020-03-23 Training method of classification model, and classification method and device of comment information

Publications (1)

Publication Number Publication Date
CN111428034A 2020-07-17

Family

ID=71548681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010206016.3A Pending CN111428034A (en) 2020-03-23 2020-03-23 Training method of classification model, and classification method and device of comment information

Country Status (1)

Country Link
CN (1) CN111428034A (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156163A (en) * 2015-04-15 2016-11-23 株式会社日立制作所 File classification method and device
CN105975594A (en) * 2016-05-09 2016-09-28 清华大学 Sentiment classification method and device based on combined feature vector and SVM[perf] (Support Vector Machine)
CN106202177A (en) * 2016-06-27 2016-12-07 腾讯科技(深圳)有限公司 A kind of file classification method and device
CN106649603A (en) * 2016-11-25 2017-05-10 北京资采信息技术有限公司 Webpage text data sentiment classification designated information push method
CN109697657A (en) * 2018-12-27 2019-04-30 厦门快商通信息技术有限公司 A kind of dining recommending method, server and storage medium
CN109857868A (en) * 2019-01-25 2019-06-07 北京奇艺世纪科技有限公司 Model generating method, file classification method, device and computer readable storage medium
CN110363302A (en) * 2019-06-13 2019-10-22 阿里巴巴集团控股有限公司 Training method, prediction technique and the device of disaggregated model
CN110580288A (en) * 2019-08-23 2019-12-17 腾讯科技(深圳)有限公司 text classification method and device based on artificial intelligence

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111984792A (en) * 2020-09-02 2020-11-24 深圳壹账通智能科技有限公司 Website classification method and device, computer equipment and storage medium
WO2022048363A1 (en) * 2020-09-02 2022-03-10 深圳壹账通智能科技有限公司 Website classification method and apparatus, computer device, and storage medium
CN113457167A (en) * 2021-06-29 2021-10-01 网易(杭州)网络有限公司 Training method of user classification network, user classification method and device

Similar Documents

Publication Publication Date Title
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
Swathi et al. An optimal deep learning-based LSTM for stock price prediction using twitter sentiment analysis
CN110427623B (en) Semi-structured document knowledge extraction method and device, electronic equipment and storage medium
CN109992646B (en) Text label extraction method and device
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN108304373B (en) Semantic dictionary construction method and device, storage medium and electronic device
CN111291188B (en) Intelligent information extraction method and system
US20200372025A1 (en) Answer selection using a compare-aggregate model with language model and condensed similarity information from latent clustering
CN109597493B (en) Expression recommendation method and device
CN109299245B (en) Method and device for recalling knowledge points
CN108846097B (en) User interest tag representation method, article recommendation device and equipment
CN111368075A (en) Article quality prediction method and device, electronic equipment and storage medium
CN103123633A (en) Generation method of evaluation parameters and information searching method based on evaluation parameters
CN110750635B (en) French recommendation method based on joint deep learning model
CN103870000A (en) Method and device for sorting candidate items generated by input method
CN111832290A (en) Model training method and device for determining text relevancy, electronic equipment and readable storage medium
CN111666376B (en) Answer generation method and device based on paragraph boundary scan prediction and word shift distance cluster matching
CN111401065A (en) Entity identification method, device, equipment and storage medium
CN115080750B (en) Weak supervision text classification method, system and device based on fusion prompt sequence
CN111400449B (en) Regular expression extraction method and device
CN111428034A (en) Training method of classification model, and classification method and device of comment information
CN112784580A (en) Financial data analysis method and device based on event extraction
CN110275953B (en) Personality classification method and apparatus
CN113076720A (en) Long text segmentation method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant after: Jingdong Technology Holding Co.,Ltd.

Address before: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant before: Jingdong Digital Technology Holding Co.,Ltd.

Address after: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant after: Jingdong Digital Technology Holding Co.,Ltd.

Address before: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Beijing Economic and Technological Development Zone, 100176

Applicant before: JINGDONG DIGITAL TECHNOLOGY HOLDINGS Co.,Ltd.