CN113393276A - Comment data classification method and device and computer readable medium

Comment data classification method and device and computer readable medium

Info

Publication number
CN113393276A
CN113393276A
Authority
CN
China
Prior art keywords
classification
baseline
confidence
algorithm
comment data
Prior art date
Legal status
Granted
Application number
CN202110715966.3A
Other languages
Chinese (zh)
Other versions
CN113393276B (en)
Inventor
王泰舟
Current Assignee
Shiheng Shanghai Technology Service Co ltd
Original Assignee
Shiheng Shanghai Technology Service Co ltd
Priority date
Filing date
Publication date
Application filed by Shiheng Shanghai Technology Service Co ltd
Priority to CN202110715966.3A
Publication of CN113393276A
Application granted
Publication of CN113393276B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 - Commerce
    • G06Q30/02 - Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 - Market modelling; Market analysis; Collecting market data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a comment data classification method and device and a computer readable medium. The classification method comprises the following steps: classifying comment data with a classifier to obtain a classification result and a confidence of the comment data; and comparing the confidence with a preset confidence range, and taking the classification result whose confidence falls within the preset confidence range as the final classification result. Because the method yields a confidence for each classification result, comparing that confidence with the preset confidence range makes it possible to judge whether the classification result is accurate, thereby saving manual review time.

Description

Comment data classification method and device and computer readable medium
Technical Field
The invention relates mainly to the field of big data, and in particular to a comment data classification method and device and a computer readable medium.
Background
In the big data era, the reviews a merchant receives on an e-commerce platform, especially bad comments, play an important role in guiding the merchant to improve its operations. The bare count of bad comments cannot reflect why customers submitted them and offers the merchant no effective guidance. However, the specific text content of bad comments is varied and the data volume is huge, so it is difficult to extract conclusions with guiding significance from the text content by manual effort alone and use them to guide the merchant's direction of improvement.
Disclosure of Invention
The invention aims to provide a comment data classification method, device and computer readable medium that achieve high accuracy and save manual effort.
The technical solution adopted by the present invention to solve the above technical problem is a comment data classification method, comprising: classifying comment data with a classifier to obtain a classification result and a confidence of the comment data; and comparing the confidence with a preset confidence range, and taking the classification result whose confidence is within the preset confidence range as the final classification result.
In an embodiment of the present invention, the method further includes training the classifier using training data with classification labels, which includes: performing word segmentation on the training data with the classification labels to obtain a first word segmentation result; converting the first word segmentation result into first word vectors using M word vector algorithms, where M is an integer greater than or equal to 1; classifying the first word vectors using N machine learning algorithms, where N is an integer greater than or equal to 1; obtaining M × N first classification results corresponding to each piece of training data; comparing the first classification results with the classification labels to obtain the accuracy of each word vector algorithm and machine learning algorithm pair, and taking a combination of a word vector algorithm and a machine learning algorithm whose accuracy is greater than a baseline threshold as a baseline algorithm, where the first classification result corresponding to the baseline algorithm is the first baseline classification result; and obtaining, for each piece of training data, a first confidence of the baseline algorithm from the first baseline classification result and all of the first classification results.
In one embodiment of the present invention, the step of obtaining a first confidence of the baseline algorithm from the first baseline classification result and all of the first classification results includes calculating the first confidence C1 of the baseline algorithm using the following formula:
C1 = k1 / (M × N)
wherein k1 represents the number of first classification results that are identical to the first baseline classification result.
In an embodiment of the present invention, the method further includes: taking the comment data whose confidence is outside the preset confidence range as training data, and assigning classification labels to that training data; and training the classifier using the training data.
In an embodiment of the present invention, classifying the comment data with the classifier includes: performing word segmentation on the comment data to obtain a second word segmentation result; converting the second word segmentation result into second word vectors using the M word vector algorithms; classifying the second word vectors using the N machine learning algorithms; obtaining M × N second classification results of the comment data; obtaining a baseline classification result of the baseline algorithm; and obtaining the confidence from the baseline classification result and the second classification results.
In an embodiment of the present invention, the step of obtaining the confidence from the baseline classification result and the second classification results includes calculating the confidence C using the following formula:
C = k / (M × N)
wherein k represents the number of second classification results that are identical to the baseline classification result.
In an embodiment of the present invention, the method further includes: setting a confidence threshold and, when the confidence is greater than the confidence threshold, setting the confidence to 1, and otherwise setting it to 0.
In an embodiment of the invention, the comment data includes bad comment data.
In an embodiment of the present invention, the word vector algorithm includes a custom dictionary containing custom words that represent bad comment classifications, each custom word having a corresponding weight.
To solve the above technical problem, the present invention further provides a comment data classification device, comprising: a memory for storing instructions executable by a processor; and a processor for executing the instructions to implement the method described above.
The present invention also provides a computer readable medium storing computer program code, which when executed by a processor implements the method as described above.
The comment data classification method of the invention yields the confidence corresponding to each classification result, and whether a classification result is accurate can be judged by comparing its confidence with the preset confidence range, thereby saving manual review time.
Drawings
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below, wherein:
FIG. 1 is an exemplary flow diagram of a method of classification of review data in accordance with an embodiment of the present invention;
FIG. 2 is a schematic flow chart of training classifiers in a classification method according to an embodiment of the invention;
FIG. 3 is an exemplary flowchart of classifying comment data using a classifier in a classification method according to an embodiment of the present invention;
FIG. 4 is a system block diagram of a classification device for comment data according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described herein, and thus the present invention is not limited to the specific embodiments disclosed below.
As used in this application and the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; the steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements.
The relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise. Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description. Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate. In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
It should be noted that the terms "first", "second", and the like are used only for convenience in distinguishing the corresponding components and, unless otherwise stated, have no special meaning; they are therefore not to be construed as limiting the scope of protection of this application. Further, although the terms used in the present application are chosen from publicly known and commonly used terms, some of them may have been selected by the applicant at his or her discretion, and their detailed meanings are described in the relevant parts of this description. The application should therefore be understood not only through the actual terms used but also through the meaning each term carries.
Flow charts are used herein to illustrate operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in the exact order shown. Rather, various steps may be processed in reverse order or simultaneously, and other operations may be added to or removed from these processes.
Fig. 1 is an exemplary flowchart of a classification method of comment data according to an embodiment of the present invention. Referring to fig. 1, the classification method of this embodiment includes the steps of:
step S110: classifying the comment data with a classifier to obtain a classification result and a confidence of the comment data; and
step S120: comparing the confidence with a preset confidence range, and taking the classification result whose confidence is within the preset confidence range as the final classification result.
The above steps are explained in detail below.
In step S110, comment data refers to any data related to comments, such as a user's comments on goods on an e-commerce platform or on a logistics service on a logistics platform. Generally, to collect user feedback, a platform provides three options: good comment, medium comment and bad comment. Some platforms also provide a score-based evaluation, e.g. prompting the user to score from 1 to 10, and some provide a star-rating function, e.g. lighting 1 to 5 stars. Such comment data falls into simple categories and the guidance it provides is limited. For example, counting the ratio of the number of bad comments to the total order volume can only indicate the quantity or its trend, not the reasons behind the bad comments, what has already improved, or what still needs improvement.
Some platforms also provide a text input function so that a user can enter a comment as text or voice. A platform may additionally offer fixed comment phrases for the user to select, making quick input convenient. Clearly, text content reflects a user's experience and needs more precisely, so extracting useful information from comment text is of great significance for merchants improving their operations.
In some embodiments, the comment data includes only bad comment data, i.e., the comment text corresponding to comments marked as bad by users. In these embodiments, the comment text corresponding to a bad comment cannot be empty.
In some embodiments, the comment data includes all types, i.e., the comment text corresponding to good, medium and bad comments. Understandably, a user may also enter comment text alongside a good comment, and it may contain information such as dissatisfaction or suggestions for improvement.
The classifier in step S110 is obtained by training on a large amount of comment data and is adapted to classify comment data and produce the classification result and confidence corresponding to the comment data.
The invention does not limit the classification categories of the comment data or their number. The user can set the classification categories and their number according to what information is to be obtained from the comment data.
The invention is not limited to the specific type of classifier, and any classifier in the art may be used.
In some embodiments, the classification method of the present invention further comprises the step of training a classifier using training data having classification labels.
FIG. 2 is a schematic flow chart of training the classifier in the classification method according to an embodiment of the present invention. Referring to fig. 2, training the classifier in this embodiment includes the following steps:
step S210: performing word segmentation on the training data with the classification labels to obtain a first word segmentation result.
The training data here is also comment data as described in step S110, but its purpose is different: the training data in step S210 is the comment data used to train the classifier.
In this step, some comment data may be selected from the mass of comment data as the objects to be labeled. For example, 5000 pieces of comment data are selected as the data set for training the classifier, with part used as training samples and part as test samples; if 3000 pieces are selected as training samples, those 3000 pieces are the training data and the remaining 2000 pieces serve as test samples.
Before step S210 is performed, the method may further include assigning classification labels to the training data, so that each piece of training data has its corresponding classification label. In some embodiments, the training data may be assigned classification labels by manual labeling; the invention does not limit this.
In step S210, the classification labels of the training data may serve as the reference result. The content to be labeled may be set as classification tags, for example: product quality, logistics, service attitude, and so on. Tags may also be denoted by numbers or letters, for example the product quality tag is A, the logistics tag is B, and the service attitude tag is C.
For example, text content containing "tastes bad" or "contains a foreign object" is manually assigned the "product quality" label. Text content containing "not delivered on time" or "never delivered" is manually assigned the "logistics" label. Text content containing "slow to respond" or "poor attitude" is manually assigned the "service attitude" label. These examples are merely illustrative and do not limit the specific content or number of the classification labels of the present invention.
In some embodiments, a piece of training data may have multiple classification labels. For example, the training data "tastes bad, and the delivery was slow" carries the two classification labels A and B at the same time.
Various word segmentation methods in the art may be adopted in step S210 to obtain a first word segmentation result that carries the key information of the training data. For example, after segmenting a piece of training data meaning "it really tastes bad", the first segmentation result consists of an intensifier, the key phrase "tastes bad", and a sentence-final particle. In this step, words with no actual meaning and redundant words can be removed from the first segmentation result; in the example above only "tastes bad" remains after removal, and this first segmentation result is enough for the data to be assigned "product quality" as its classification label.
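As an illustration of the word segmentation in step S210, the following minimal Python sketch assumes the open-source jieba Chinese segmenter and a hypothetical stopword list; the patent itself does not prescribe a specific segmentation tool or stopword set.

    import jieba  # a common Chinese word segmentation library (an assumed choice)

    STOPWORDS = {"太", "了", "的"}  # hypothetical words with no actual meaning

    def segment(text: str) -> list[str]:
        """Cut a comment into tokens and drop meaningless or redundant words."""
        return [w for w in jieba.lcut(text) if w.strip() and w not in STOPWORDS]

    # e.g. a comment meaning "it really tastes bad" reduces to its key token,
    # which is enough to assign the "product quality" label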
Step S220: and converting the first word segmentation result into a first word vector by adopting an M word vector algorithm. Wherein M is an integer of 1 or more.
To be usable by the machine learning algorithms, the first word segmentation result is converted in step S220 into first word vectors that a machine learning algorithm can consume. The invention does not limit the specific word vector algorithm; various word vector algorithms in the art may be employed, such as count, tf-idf and word2vec.
In some embodiments, the word vector algorithm includes a custom dictionary containing custom words that represent the bad comment classifications, each custom word having a corresponding weight. In these embodiments, a custom dictionary may be built according to the configured classification labels. A merchant can place the information it considers more important in the custom words and assign the custom words different weights according to their importance.
The M word vector algorithms are M different word vector algorithms.
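As an illustration of step S220, the sketch below builds M = 2 of the word vector algorithms named above (count and tf-idf), assuming scikit-learn; the toy corpus of space-joined segmented comments is also an assumption.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # space-joined tokens produced by the segmentation step, e.g. segment() above
    corpus = ["难吃 有异物", "配送 不及时", "回复 不及时 态度 差"]

    vectorizers = {"count": CountVectorizer(), "tfidf": TfidfVectorizer()}
    # one first-word-vector matrix per word vector algorithm
    first_word_vectors = {name: v.fit_transform(corpus)
                          for name, v in vectorizers.items()}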
In some embodiments, step S220 further comprises storing the M word vector algorithm models in a storage medium, so that other steps of the classification method can invoke the M word vector algorithms.
Step S230: the first word vectors are classified using N machine learning algorithms. Wherein N is an integer of 1 or more.
The invention does not limit the specific machine learning algorithm. Various machine learning algorithms in the art may be employed, such as the naive Bayes algorithm, logistic regression, decision trees, and so on.
In some embodiments, step S230 further comprises storing the N machine learning algorithm models in a storage medium, so that other steps of the classification method can invoke the N machine learning algorithms.
The invention does not limit the values of M and N; M is at least equal to 1, and N is at least equal to 1.
Step S240: corresponding to each piece of training data, M × N first classification results are obtained.
Combining steps S220 and S230 yields M × N combinations of a word vector algorithm and a machine learning algorithm; accordingly, M × N first classification results are obtained for each piece of training data, where each first classification result corresponds to one word vector algorithm and machine learning algorithm pair.
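Steps S220-S240 can be pictured as fitting an M × N grid of algorithm pairs. The sketch below uses scikit-learn with M = 2 vectorizers and the N = 3 example learners mentioned above; it is an illustration under those assumptions, not the patent's mandated implementation.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.tree import DecisionTreeClassifier

    def fit_grid(corpus, labels):
        """Fit every (word vector algorithm, machine learning algorithm) pair."""
        vecs = {"count": CountVectorizer(), "tfidf": TfidfVectorizer()}
        make_clf = {
            "nb": MultinomialNB,
            "logreg": lambda: LogisticRegression(max_iter=1000),
            "tree": DecisionTreeClassifier,
        }
        grid = {}
        for vname, vec in vecs.items():
            X = vec.fit_transform(corpus)  # the first word vectors
            for cname, make in make_clf.items():
                grid[(vname, cname)] = (vec, make().fit(X, labels))
        return grid

    def predict_grid(grid, texts):
        """Return {pair: predictions}, i.e. M * N classification results per text."""
        return {pair: clf.predict(vec.transform(texts))
                for pair, (vec, clf) in grid.items()}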
At step S240 it is not yet clear which combination of word vector algorithm and machine learning algorithm yields the best classification result.
In a preferred embodiment, the machine learning algorithm employs a logistic regression method.
Step S250: and comparing the first classification result with the classification mark to obtain the accuracy of each group of word vector algorithm and machine learning algorithm, taking the combination of the word vector algorithm and the machine learning algorithm with the accuracy greater than a baseline threshold value as a baseline algorithm, and taking the first classification result corresponding to the baseline algorithm as the first baseline classification result.
In step S250, all the first classification results are compared with the classification labels. For example, for 5000 pieces of training data, each word vector algorithm and machine learning algorithm pair produces 5000 first classification results; these are compared with the classification labels of the 5000 pieces of training data. Where a classification result is the same as the classification label, the classification result is accurate; where it differs, the classification result is inaccurate. In this way, the accuracy of each word vector algorithm and machine learning algorithm pair is obtained.
A baseline threshold Th is set, and any combination of a word vector algorithm and a machine learning algorithm whose accuracy is greater than the baseline threshold Th is taken as a baseline algorithm. According to this embodiment, multiple sets of baseline algorithms may be obtained.
In some embodiments, the single word vector algorithm and machine learning algorithm combination with the highest accuracy may be selected as the baseline algorithm.
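A sketch of the baseline selection in step S250, reusing predict_grid() from the sketch above; the threshold value Th = 0.8 is an assumption for illustration.

    from sklearn.metrics import accuracy_score

    def pick_baseline(grid_preds, labels, th=0.8):
        """Score each pair against the labels and return the best pair above Th."""
        scores = {pair: accuracy_score(labels, preds)
                  for pair, preds in grid_preds.items()}
        eligible = {pair: s for pair, s in scores.items() if s > th}
        # the embodiment above may keep several baselines; this variant keeps one
        return max(eligible, key=eligible.get) if eligible else None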
Step S260: a first confidence level of the baseline algorithm is obtained from the first baseline classification result and all of the first classification results, corresponding to each piece of training data.
Although the baseline algorithm selected in step S250 already has a certain accuracy, for the application of the classifier a first confidence C1 of the algorithm is computed in step S260.
In some embodiments, the first confidence C1 is calculated using the following formula (1):
C1 = k1 / (M × N)    (1)
where k1 represents the number of first classification results that are identical to the first baseline classification result.
For example, for the training data Data1, suppose the first baseline classification result obtained by the baseline algorithm Abase is a1. Among the other M × N - 1 algorithm combinations, k1 - 1 combinations also yield a1, while the remaining M × N - k1 combinations yield results other than a1. The first confidence C1 of the baseline algorithm Abase is then:
C1 = k1 / (M × N)
the closer the value of the first confidence C1 is to 1, the more reliable the result of the baseline algorithm Abase is.
In some embodiments, the classification method of the present invention further comprises testing the classifier with the test data. For these embodiments, the training data in steps S210-S260 may be replaced with test data.
In some embodiments, a first confidence threshold of the first confidence may be set, and the above steps S210-S260 of training the classifier are repeatedly performed until the first confidence reaches the set first confidence threshold.
Referring to fig. 1, after steps S210-S260 a trained classifier is obtained, and in step S110 the trained classifier can be used to machine-classify comment data that has no classification label.
Fig. 3 is an exemplary flowchart of classifying the comment data by using a classifier in the classification method according to an embodiment of the present invention. Referring to fig. 3, the step of classifying the comment data by using the classifier in this embodiment includes:
step S310: performing word segmentation on the comment data to obtain a second word segmentation result.
This step is similar to step S210, except that the comment data has no classification label; machine classification by the classifier is therefore required to obtain the classification result.
Step S320: and converting the second word segmentation result into a second word vector by adopting an M word vector algorithm.
This step is similar to step S220, wherein the M word vector algorithms used are identical to the M word vector algorithms used in step S220.
In some embodiments, the M word vector algorithms are stored in a storage medium. The M word vector algorithms may be read from the storage medium at step S320.
Step S330: and classifying the second word vectors by adopting N machine learning algorithms.
This step is similar to step S230, wherein the N machine learning algorithms used are identical to the N machine learning algorithms used in step S230.
In some embodiments, the N machine learning algorithms are stored in a storage medium. The N machine learning algorithms may be read from the storage medium at step S330.
Step S340: and obtaining M × N second classification results of the comment data.
Step S340 is similar to step S240. This step does not limit the number of pieces of comment data; for each piece of comment data, M × N second classification results are obtained.
Since steps S310-S340 are similar in content to steps S210-S240, refer to the descriptions of steps S210-S240 for the related details, which are not repeated here.
Step S350: a baseline classification result of the baseline algorithm is obtained.
In step S350, the baseline algorithm is the baseline algorithm obtained in step S250. The M × N second classification results obtained in step S340 already include the baseline classification result of the baseline algorithm, and step S350 extracts that baseline classification result.
Step S360: and obtaining confidence according to the baseline classification result and the second classification result.
In step S360, the confidence C obtained from the baseline classification result and the second classification results is the confidence corresponding to the comment data.
In some embodiments, the confidence C is calculated using the following formula (2):
C = k / (M × N)    (2)
where k represents the number of second classification results that are identical to the baseline classification result.
In some embodiments, the classification method of the present invention further comprises: setting a confidence threshold and, when the confidence is greater than the confidence threshold, setting the confidence to 1, and otherwise setting it to 0. For example, with a confidence threshold of 80%, when the confidence C is greater than 80%, C is set to 1; otherwise, C is set to 0.
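The thresholding can be written directly; the 80% default mirrors the example above and is otherwise an arbitrary assumption.

    def binarize(confidence: float, threshold: float = 0.8) -> int:
        """Map a raw confidence to 1 (trusted) or 0 (needs manual review)."""
        return 1 if confidence > threshold else 0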
Referring to fig. 1, in these embodiments the preset confidence range in step S120 may be set to "equal to 1", and the classification result whose confidence equals 1 is taken as the final classification result.
In some embodiments, the classification method of the present invention further comprises:
step S270: taking the comment data whose confidence is outside the preset confidence range as training data, and assigning classification labels to that training data; and
step S280: the classifier is trained using the training data.
In these embodiments, a confidence outside the preset confidence range indicates that the classifier's classification result for that part of the comment data is inaccurate. To improve the classification accuracy of the classifier, that part of the comment data is given classification labels and used as new training data to train the classifier.
Comment data is not analyzed only once. After comment data has been collected for a further period of time, it can again be classified by the above classification method; meanwhile, the data whose classification-result confidence falls outside the preset confidence range is used as training data, and the classifier is retrained through steps S270-S280. The classifier is thereby gradually optimized and its classification accuracy further improved.
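A sketch of the retraining loop of steps S270-S280, reusing the helpers from the sketches above; manual_label stands for the human labeling step and is a hypothetical hook, not part of the patent.

    def retrain_cycle(grid, baseline_pair, new_comments, corpus, labels, manual_label):
        """Route low-confidence comments to manual labeling, then refit the grid."""
        preds = predict_grid(grid, new_comments)
        for i, text in enumerate(new_comments):
            if binarize(first_confidence(preds, baseline_pair, i)) == 0:
                corpus.append(text)                # becomes new training data
                labels.append(manual_label(text))  # classification label from review
        return fit_grid(corpus, labels)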
In some embodiments, if the preset confidence range is "equal to 1", the comment data whose confidence C equals 0 is used as training data to retrain the classifier.
With the comment data classification method of the invention, after a new piece of comment data is classified, the classification result of the baseline algorithm and its corresponding confidence are obtained. When the confidence is within the preset confidence range, the classification result is reliable and can be adopted directly without confirmation by the merchant; when the confidence is outside the preset confidence range, the classification result is unreliable and is not adopted, and the merchant is prompted to review it. The merchant's review time can thus be greatly reduced. The invention offers accurate classification results, high efficiency and labor savings. Because the classification results are highly accurate, they help a merchant find problems in production and operation from the classified comments and improve them more effectively.
The invention also comprises a comment data classification device, which comprises a memory and a processor, wherein the memory is configured to store instructions executable by the processor, and the processor is configured to execute the instructions to implement the comment data classification method described above.
Fig. 4 is a system block diagram of a classification device for comment data according to an embodiment of the present invention. Referring to fig. 4, the classification device 400 may include an internal communication bus 401, a processor 402, a read-only memory (ROM) 403, a random access memory (RAM) 404, and a communication port 405. When used on a personal computer, the classification device 400 may further include a hard disk 406. The internal communication bus 401 enables data communication among the components of the classification device 400. The processor 402 may make determinations and issue prompts. In some embodiments, the processor 402 may consist of one or more processors. The communication port 405 enables data communication between the classification device 400 and the outside; in some embodiments, the classification device 400 may send and receive information and data from a network through the communication port 405. The classification device 400 may also include various forms of program storage units and data storage units, such as the hard disk 406, the read-only memory (ROM) 403 and the random access memory (RAM) 404, capable of storing various data files used for computer processing and/or communication, as well as possible program instructions executed by the processor 402. The processor executes these instructions to implement the main parts of the method, and the results processed by the processor are transmitted to the user device through the communication port and displayed on the user interface.
The classification method described above may be implemented as a computer program, stored in the hard disk 406, and loaded into the processor 402 for execution, so as to implement the classification method of the present application.
The invention also comprises a computer-readable medium having stored thereon computer program code which, when executed by a processor, implements the comment data classification method described hereinbefore.
When the comment data classification method is implemented as a computer program, it may be stored in a computer-readable storage medium as an article of manufacture. For example, computer-readable storage media may include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips), optical disks (e.g., compact disk (CD), digital versatile disk (DVD)), smart cards, and flash memory devices (e.g., electrically erasable programmable read-only memory (EEPROM), cards, sticks, key drives). In addition, the various storage media described herein can represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" can include, without being limited to, wireless channels and various other media (and/or storage media) capable of storing, containing, and/or carrying code and/or instructions and/or data.
It should be understood that the above-described embodiments are illustrative only. The embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processor may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and/or other electronic units designed to perform the functions described herein, or a combination thereof.
Aspects of the present application may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software, which may be referred to as a "data block," "module," "engine," "unit," "component," or "system." The processor may be one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, or a combination thereof. Furthermore, aspects of the present application may be embodied as a computer product, including computer readable program code, embodied in one or more computer readable media. For example, computer-readable media may include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips), optical disks (e.g., compact disk (CD), digital versatile disk (DVD)), smart cards, and flash memory devices (e.g., cards, sticks, key drives).
The computer readable medium may comprise a propagated data signal with the computer program code embodied therein, for example, on a baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, and the like, or any suitable combination. The computer readable medium can be any computer readable medium that can communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device. Program code on a computer readable medium may be propagated over any suitable medium, including radio, electrical cable, fiber optic cable, radio frequency signals, or the like, or any combination of the preceding.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing disclosure is by way of example only, and is not intended to limit the present application. Various modifications, improvements and adaptations to the present application may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present application and thus fall within the spirit and scope of the exemplary embodiments of the present application.
Also, this application uses specific language to describe embodiments of the application. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the present application is included in at least one embodiment of the present application. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the present application may be combined as appropriate.
Numerals describing the number of components, attributes, etc. are used in some embodiments, it being understood that such numerals used in the description of the embodiments are modified in some instances by the use of the modifier "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the number allows a variation of ± 20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of the individual embodiments. In some embodiments, the numerical parameter should take into account the specified significant digits and employ a general digit preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the range are approximations, in the specific examples, such numerical values are set forth as precisely as possible within the scope of the application.

Claims (11)

1. A method of classifying comment data, comprising:
classifying the comment data with a classifier to obtain a classification result and a confidence of the comment data; and
comparing the confidence with a preset confidence range, and taking the classification result whose confidence is within the preset confidence range as a final classification result.
2. The classification method of claim 1, further comprising training the classifier using training data having classification labels, including:
performing word segmentation on the training data with the classification labels to obtain a first word segmentation result;
converting the first word segmentation result into first word vectors using M word vector algorithms, wherein M is an integer greater than or equal to 1;
classifying the first word vectors using N machine learning algorithms, wherein N is an integer greater than or equal to 1;
obtaining M × N first classification results corresponding to each piece of training data;
comparing the first classification results with the classification labels to obtain the accuracy of each word vector algorithm and machine learning algorithm pair, and taking a combination of a word vector algorithm and a machine learning algorithm whose accuracy is greater than a baseline threshold as a baseline algorithm, wherein the first classification result corresponding to the baseline algorithm is a first baseline classification result; and
obtaining, for each piece of the training data, a first confidence of the baseline algorithm from the first baseline classification result and all of the first classification results.
3. The classification method of claim 2, wherein the step of obtaining a first confidence of the baseline algorithm from the first baseline classification result and all of the first classification results comprises calculating the first confidence C1 of the baseline algorithm using the following formula:
C1 = k1 / (M × N)
wherein k1 represents the number of first classification results that are identical to the first baseline classification result.
4. The classification method of claim 2, further comprising:
taking the comment data whose confidence is outside the preset confidence range as training data, and assigning classification labels to the training data; and
training the classifier using the training data.
5. The classification method of claim 2, wherein the step of classifying the comment data using the classifier includes:
performing word segmentation on the comment data to obtain a second word segmentation result;
converting the second word segmentation result into second word vectors using the M word vector algorithms;
classifying the second word vectors using the N machine learning algorithms;
obtaining M × N second classification results of the comment data;
obtaining a baseline classification result of the baseline algorithm; and
obtaining the confidence from the baseline classification result and the second classification results.
6. The classification method of claim 5, wherein the step of obtaining the confidence from the baseline classification result and the second classification results comprises calculating the confidence C using the following formula:
C = k / (M × N)
wherein k represents the number of second classification results that are identical to the baseline classification result.
7. The classification method of claim 6, further comprising: setting a confidence threshold and, when the confidence is greater than the confidence threshold, setting the confidence to 1, and otherwise setting the confidence to 0.
8. The classification method of claim 1, wherein the comment data includes bad comment data.
9. The classification method of claim 2, wherein the word vector algorithm includes a custom dictionary containing custom words that represent bad comment classifications, each of the custom words having a corresponding weight.
10. A classification apparatus of comment data, comprising:
a memory for storing instructions executable by the processor;
a processor for executing the instructions to implement the method of any one of claims 1-9.
11. A computer-readable medium having stored thereon computer program code which, when executed by a processor, implements the method of any of claims 1-9.
CN202110715966.3A, filed 2021-06-25: Comment data classification method, comment data classification device and computer-readable medium (Active; granted as CN113393276B)

Priority Applications (1)

Application Number: CN202110715966.3A; Priority Date: 2021-06-25; Filing Date: 2021-06-25; Title: Comment data classification method, comment data classification device and computer-readable medium

Publications (2)

CN113393276A, published 2021-09-14
CN113393276B, published 2023-06-16

Family ID: 77624058; Country: CN

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573046A (en) * 2015-01-20 2015-04-29 成都品果科技有限公司 Comment analyzing method and system based on term vector
CN106445919A (en) * 2016-09-28 2017-02-22 上海智臻智能网络科技股份有限公司 Sentiment classifying method and device
CN107357837A (en) * 2017-06-22 2017-11-17 华南师范大学 The electric business excavated based on order-preserving submatrix and Frequent episodes comments on sensibility classification method
CN109034893A (en) * 2018-07-20 2018-12-18 成都中科大旗软件有限公司 A kind of tourist net comment sentiment analysis and QoS evaluating method
CN109670542A (en) * 2018-12-11 2019-04-23 田刚 A kind of false comment detection method based on comment external information
CN110059183A (en) * 2019-03-22 2019-07-26 重庆邮电大学 A kind of automobile industry User Perspective sensibility classification method based on big data
CN110489550A (en) * 2019-07-16 2019-11-22 招联消费金融有限公司 File classification method, device and computer equipment based on combination neural net
CN110704710A (en) * 2019-09-05 2020-01-17 上海师范大学 Chinese E-commerce emotion classification method based on deep learning
CN110879938A (en) * 2019-11-14 2020-03-13 中国联合网络通信集团有限公司 Text emotion classification method, device, equipment and storage medium
CN111753082A (en) * 2020-03-23 2020-10-09 北京沃东天骏信息技术有限公司 Text classification method and device based on comment data, equipment and medium
CN112287662A (en) * 2020-10-29 2021-01-29 平安科技(深圳)有限公司 Natural language processing method, device and equipment based on multiple machine learning models


Also Published As

CN113393276B (en), published 2023-06-16


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant