CN111813932A - Text data processing method, text data classification device and readable storage medium - Google Patents

Text data processing method, text data classification device and readable storage medium

Info

Publication number
CN111813932A
CN111813932A (application number CN202010556440.0A)
Authority
CN
China
Prior art keywords
classification
text
text data
rule
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010556440.0A
Other languages
Chinese (zh)
Other versions
CN111813932B (en)
Inventor
彭团民
徐泽宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202010556440.0A priority Critical patent/CN111813932B/en
Publication of CN111813932A publication Critical patent/CN111813932A/en
Application granted granted Critical
Publication of CN111813932B publication Critical patent/CN111813932B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a text data processing method, a text data classification device and a readable storage medium, wherein the method includes: acquiring a target sample from a training sample set of a text classification model; inputting the target sample into the text classification model and a rule model respectively, to obtain a first classification result output by the text classification model and a second classification result output by the rule model, wherein the first classification result includes a prediction classification corresponding to the text data, the rule model includes a plurality of classification rules, each classification rule corresponds to the same or a different text classification, and the second classification result is used to indicate the classification rule hit by the text data; and, in a case where a target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample, updating the target sample according to the target classification. In this way, the amount of training sample data can be increased, the labeling accuracy of the training samples can be ensured, and the amount of manual labeling can be reduced.

Description

Text data processing method, text data classification device and readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a text data processing method, a text data classification device, and a readable storage medium.
Background
Users can publish various kinds of text, such as news information and comment information, through various platforms. However, as the volume of data grows and the demand for a clean network environment rises, text content needs to be reviewed before publication to determine whether it may be published. In the related art, a text classification model is generally trained, and sensitive text or abnormal text is identified based on the text classification model, so that the text content is audited.
However, in the above process, training texts are usually labeled manually. If the training texts are labeled incorrectly, or there are too few of them, the accuracy of the text classification model is seriously affected, and accurate auditing of text content based on the text classification model cannot be ensured.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides a text data processing method, a text data classification device, and a readable storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a text data processing method, including:
acquiring a target sample from a training sample set of a text classification model, wherein the target sample comprises text data or comprises the text data and a classification labeled for the text data;
respectively inputting the target sample into a text classification model and a rule model to obtain a first classification result output by the text classification model and a second classification result output by the rule model, wherein the first classification result comprises a prediction classification corresponding to the text data, the rule model comprises a plurality of classification rules, each classification rule corresponds to the same or different text classifications, and the second classification result is used for indicating a classification rule hit by the text data;
and under the condition that the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample, updating the target sample according to the target classification.
Optionally, the text data originates from any one of: news information, instant messaging messages, short texts;
the classification of the text data includes an abnormal text category.
Optionally, the target classification is determined by:
determining the text classification corresponding to the hit classification rule as a target classification under the condition that the type of the classification rule hit by the text data is a target type;
determining the predicted classification as the target classification in a case where the type of the hit classification rule is not the target type, the predicted classification is consistent with the text classification corresponding to the hit classification rule, and the confidence of the predicted classification is greater than a first confidence threshold, wherein the first classification result further comprises the confidence of the predicted classification;
determining the prediction classification as the target classification if the confidence of the prediction classification is greater than a second confidence threshold and the second classification result indicates that the text data misses a classification rule, wherein the second confidence threshold is greater than the first confidence threshold.
Optionally, the method further comprises:
outputting the target sample when any one of the following conditions is met, wherein after the target sample is labeled, the labeled sample is used as a training sample of the text classification model:
the second classification result indicates that the text data does not hit a classification rule, and the confidence of the prediction classification in the first classification result is less than a third confidence threshold or the confidence belongs to a classification boundary confidence range, wherein the first classification result further comprises the confidence of the prediction classification;
the prediction classification is different from the text classification corresponding to the hit classification rule, and the type of the hit classification rule is not the target type.
Optionally, the method further comprises:
determining the hit rate and the error rate of each classification rule, wherein the hit rate is the ratio of the number of hits of the classification rule to the total number of rule-matching operations, and the error rate is the ratio, among target samples that hit the classification rule, of the number of times the prediction classification obtained by inputting the target sample into the text classification model differs from the text classification corresponding to the classification rule, to the number of hits of the classification rule;
outputting a classification rule to be updated, wherein the classification rule to be updated is a classification rule of which the hit rate is greater than a hit threshold and the error rate is greater than an error threshold;
updating the classification rule to be updated in response to an update instruction for the classification rule to be updated.
Optionally, the method further comprises:
outputting a sample to be processed, wherein the confidence of the corresponding prediction classification of the sample to be processed is greater than a preset threshold, and the corresponding second classification result indicates that the classification rule is not hit;
in response to a rule setting instruction for the sample to be processed, adding a classification rule indicated by the rule setting instruction to the rule model.
Optionally, the method further comprises:
under the condition that the target sample comprises text data and a classification labeled for the text data, determining a loss value of the text classification model according to a prediction classification corresponding to the target sample and the classification labeled for the text data;
and updating the text classification model according to the loss value when the loss value is larger than a classification threshold value.
In a second aspect, a method for classifying text data is provided, the method comprising:
acquiring text data to be classified;
inputting the text data to be classified into a text classification model, and obtaining the classification of the text data to be classified, wherein a training sample set corresponding to the text classification model is generated by any one of the text data processing methods in the first aspect.
Optionally, the text data to be classified originates from any one of: news information, instant messaging messages, short texts;
the classification of the text data comprises an abnormal text category;
the step of inputting the text data to be classified into a text classification model to obtain the classification of the text data to be classified comprises the following steps:
and inputting the text data to be classified into the text classification model to obtain the text data to be classified that corresponds to the abnormal text category.
In a third aspect, there is provided a text data processing apparatus, including:
the system comprises a first obtaining module, a second obtaining module and a third obtaining module, wherein the first obtaining module is configured to obtain a target sample from a training sample set of a text classification model, and the target sample comprises text data or comprises the text data and a classification labeled for the text data;
a first input module, configured to input the target sample into a text classification model and a rule model respectively, and obtain a first classification result output by the text classification model and a second classification result output by the rule model, where the first classification result includes a predicted classification corresponding to the text data, the rule model includes a plurality of classification rules, each classification rule corresponds to the same or different text classifications, and the second classification result is used to indicate a classification rule hit by the text data;
the first updating module is configured to update the target sample according to the target classification when the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample.
Optionally, the text data originates from any one of: news information, instant messaging messages, short texts;
the classification of the text data includes an abnormal text category.
Optionally, the apparatus further comprises:
a first determination module configured to determine the target classification by:
determining the text classification corresponding to the hit classification rule as a target classification under the condition that the type of the classification rule hit by the text data is a target type;
determining the predicted classification as the target classification in a case where the type of the hit classification rule is not the target type, the predicted classification is consistent with the text classification corresponding to the hit classification rule, and the confidence of the predicted classification is greater than a first confidence threshold, wherein the first classification result further comprises the confidence of the predicted classification;
determining the prediction classification as the target classification if the confidence of the prediction classification is greater than a second confidence threshold and the second classification result indicates that the text data misses a classification rule, wherein the second confidence threshold is greater than the first confidence threshold.
Optionally, the apparatus further comprises:
a first output module, configured to output the target sample when any one of the following conditions is satisfied, wherein after the target sample is labeled, the labeled sample is used as a training sample of the text classification model:
the second classification result indicates that the text data does not hit a classification rule, and the confidence of the prediction classification in the first classification result is less than a third confidence threshold or the confidence belongs to a classification boundary confidence range, wherein the first classification result further comprises the confidence of the prediction classification;
the prediction classification is different from the text classification corresponding to the hit classification rule, and the type of the hit classification rule is not the target type.
Optionally, the apparatus further comprises:
a second determining module, configured to determine, for each of the classification rules, a hit rate and an error rate of the classification rule, wherein the hit rate is the ratio of the number of hits of the classification rule to the total number of rule-matching operations, and the error rate is the ratio, among target samples that hit the classification rule, of the number of times the prediction classification obtained by inputting the target sample into the text classification model differs from the text classification corresponding to the classification rule, to the number of hits of the classification rule;
a second output module configured to output a classification rule to be updated, wherein the classification rule to be updated is a classification rule in which the hit rate is greater than a hit threshold and the error rate is greater than an error threshold;
a second updating module configured to update the classification rule to be updated in response to an update instruction for the classification rule to be updated.
Optionally, the apparatus further comprises:
the third output module is configured to output a to-be-processed sample, wherein the to-be-processed sample is a sample of which the confidence coefficient of the corresponding prediction classification is greater than a preset threshold value and the corresponding second classification result indicates that the classification rule is not hit;
an adding module configured to add a classification rule indicated by a rule setting instruction to the rule model in response to the rule setting instruction for the sample to be processed.
Optionally, the apparatus further comprises:
a third determining module configured to determine a loss value of the text classification model according to a prediction classification corresponding to the target sample and a classification labeled for the text data when the target sample includes the text data and the classification labeled for the text data;
a third updating module configured to update the text classification model according to the loss value if the loss value is greater than a classification threshold.
In a fourth aspect, there is provided an apparatus for classifying text data, the apparatus comprising:
a second obtaining module configured to obtain text data to be classified;
a second input module, configured to input the text data to be classified into a text classification model, and obtain a classification of the text data to be classified, where a training sample set corresponding to the text classification model is generated by any one of the text data processing methods in the first aspect.
Optionally, the text data to be classified originates from any one of: news information, instant messaging messages, short texts;
the classification of the text data comprises an abnormal text category;
the second input module includes:
and inputting the text data to be classified into the text classification model to obtain the text data to be classified corresponding to the abnormal text classification.
In a fifth aspect, there is provided a text data processing apparatus, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring a target sample from a training sample set of a text classification model, wherein the target sample comprises text data or comprises the text data and a classification labeled for the text data;
respectively inputting the target sample into a text classification model and a rule model to obtain a first classification result output by the text classification model and a second classification result output by the rule model, wherein the first classification result comprises a prediction classification corresponding to the text data, the rule model comprises a plurality of classification rules, each classification rule corresponds to the same or different text classifications, and the second classification result is used for indicating a classification rule hit by the text data;
and under the condition that the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample, updating the target sample according to the target classification.
In a sixth aspect, there is provided a text data classification apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring text data to be classified;
inputting the text data to be classified into a text classification model, and obtaining the classification of the text data to be classified, wherein a training sample set corresponding to the text classification model is generated by any one of the text data processing methods of the first aspect.
In a seventh aspect, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, perform the steps of the method of any of the first or second aspects.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
In the above technical solution, a target sample is obtained from a training sample set of a text classification model, the target sample is input into the text classification model and a rule model respectively, a first classification result output by the text classification model and a second classification result output by the rule model are obtained, and the target sample is updated according to the target classification in a case where the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample. By this technical solution, the training sample set can contain both labeled and unlabeled text data, part of the text data can be labeled automatically, the amount of training sample data can be effectively increased, the labeling accuracy of the training samples can be ensured, and the workload of manual labeling can be reduced. Moreover, the classification of the labeled text data in the training sample set can be verified, which further improves the accuracy of the training samples and ensures the accuracy of the text classification model obtained based on the training sample set, thereby providing technical support for accurate auditing and filtering of text content, enabling safe publication of text content, and purifying the network environment.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating a method of processing text data according to an exemplary embodiment.
Fig. 2 is a block diagram illustrating a text data processing apparatus according to an exemplary embodiment.
Fig. 3 is a block diagram illustrating a text data processing apparatus according to an exemplary embodiment.
Fig. 4 is a block diagram illustrating a text data processing apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a text data processing method according to an exemplary embodiment. As shown in Fig. 1, the method may include the following steps:
in step 11, target samples are obtained from a training sample set of a text classification model, where the target samples include text data or include text data and a classification labeled for the text data.
When the text classification model is trained, the text data is usually pre-labeled in a manual labeling manner, so that the classification corresponding to each text data can be determined, and the target sample including the text data and the classification labeled for the text data can be directly used for training the text classification model.
Exemplarily, the text data originates from any one of: news information, instant messaging messages, short text. Illustratively, the news information may be crawled from news pages by crawler technology or obtained from a news information publishing platform. The news information may be the original text data obtained in the above manner, or may be text data obtained by processing the original text data, for example by extracting keywords, key sentences, or key paragraphs. Short text may refer to text with a small number of words posted in microblogs, forums, and the like. As described in the background, in order to purify the network environment, text data usually needs to be audited before publication to determine whether it contains sensitive information; for example, the sensitive information may be politically sensitive information or vulgar content, that is, the audit determines whether the text is sensitive text. Low-quality content, such as clickbait headlines and copied text, may also be audited in order to improve the quality of data in the network environment. Therefore, the classification of the text data may include an abnormal text category, where the abnormal text category may be used to represent a sensitive text type and/or a low-quality content type, and text belonging to the sensitive text type is text containing sensitive information. The text data can thus be classified by the text classification model, so that abnormal text is accurately identified and auditing and filtering of text content are implemented.
However, in an actual use scenario, the number of training samples obtained by manual labeling is limited, which may affect the accuracy of the classification results of the trained text classification model. Therefore, in order to increase the number of training samples and reduce the manual workload, the training sample set in the present disclosure may include unlabeled samples, i.e., the target sample may contain only text data, and the text data in such a target sample can be labeled automatically through the subsequent steps. For example, labeling news information is difficult because its text is long or its content is complex, which increases the labeling workload and affects labeling accuracy. In the embodiments of the present disclosure, in order to increase the number of training samples, the training sample set may be augmented with partly unlabeled news information, which increases the accuracy and training speed of the text classification model to some extent.
In step 12, a target sample is respectively input into a text classification model and a rule model, and a first classification result output by the text classification model and a second classification result output by the rule model are obtained, wherein the first classification result includes a prediction classification corresponding to the text data, the rule model includes a plurality of classification rules, each classification rule corresponds to the same or different text classifications, and the second classification result is used for indicating a classification rule hit by the text data.
In this embodiment, the plurality of classification rules included in the rule model may be determined based on the text data labeled with classifications in the training sample set. For example, an annotator can analyze the text data annotated with the same category to determine a classification rule corresponding to that category, and the category is determined as the text classification corresponding to the classification rule. Illustratively, a classification rule may be represented by keywords, key sentence patterns, or the like, or by a regular expression. Therefore, by inputting the target sample into the rule model, the text data in the target sample can be matched against the classification rules in the rule model to obtain the second classification result.
If the text data hits the classification rule in the rule model, the second classification result may include the classification rule hit by the text data, and if the text data misses the classification rule, the second classification result may be a null value to indicate that the text data misses the classification rule.
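By way of illustration only, a minimal Python sketch of such a rule model is given below. It is not part of the claimed embodiments; the names `ClassificationRule` and `RuleModel`, the example rules, and the choice of regular-expression matching are hypothetical and introduced here solely to make the matching step concrete.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ClassificationRule:
    name: str         # identifier of the rule
    pattern: str      # keyword or regular expression characterizing the rule
    text_class: str   # text classification corresponding to the rule
    rule_type: str    # e.g. "blacklist" (the target type) or "in_doubt"

class RuleModel:
    def __init__(self, rules: List[ClassificationRule]):
        self.rules = rules

    def classify(self, text: str) -> Optional[ClassificationRule]:
        """Return the first classification rule hit by the text, or None
        (playing the role of the null second classification result)."""
        for rule in self.rules:
            if re.search(rule.pattern, text):
                return rule
        return None

# Hypothetical example rules: a blacklist rule trusted on its own, and an
# in-doubt rule that must be cross-checked against the model's prediction.
rules = [
    ClassificationRule("r_sensitive", r"(sensitive_word_a|sensitive_word_b)", "abnormal", "blacklist"),
    ClassificationRule("r_clickbait", r"^(shocking|you won't believe)", "abnormal", "in_doubt"),
]
second_result = RuleModel(rules).classify("you won't believe what happened today")
```

In this sketch a miss is reported as None, corresponding to the null second classification result described above.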
In step 13, when the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample, the target sample is updated according to the target classification.
By the steps, the first classification result and the second classification result corresponding to the text data in the target sample can be determined, the first classification result and the second classification result can be integrated to determine the target classification corresponding to the text data, and the accuracy of the determined target classification can be guaranteed to a certain extent.
As an example, when the target sample includes text data but does not include a classification labeled for the text data, the target classification is not included in the target sample. In this case the target sample may be directly updated according to the target classification; for example, the target classification may be labeled for the text data, thereby implementing automatic labeling of the text data. The updated target sample then includes the text data and the target classification, and may therefore be used in the training process of the text classification model, increasing the amount of training sample data of the text classification model.
As another example, when the target sample includes text data and a classification labeled for the text data, it is determined whether the target sample includes the target classification, that is, whether the classification labeled for the text data is the same as the target classification. If they are the same, the target classification is included in the target sample, which indicates that the classification labeled for the text data in the target sample is accurate and does not need to be updated. If the classification labeled for the text data is different from the target classification, the target classification is not included in the target sample and the classification labeled for the text data in the target sample is inaccurate; the classification labeled for the text data in the target sample can then be updated to the target classification, thereby updating the target sample. This ensures the accuracy of the training samples used to train the text classification model in the subsequent training process and improves the accuracy of the text classification model.
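The two cases above can be summarized in a short, hedged sketch; the container `TargetSample` and the helper `update_target_sample` are hypothetical names introduced here for illustration and are not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TargetSample:
    text: str
    label: Optional[str] = None   # None for an unlabeled sample

def update_target_sample(sample: TargetSample, target_class: Optional[str]) -> bool:
    """Update the sample when a target classification was determined and the
    sample does not already contain it. Returns True if the sample changed."""
    if target_class is None:          # no target classification could be determined
        return False
    if sample.label == target_class:  # sample already contains the target classification
        return False
    # Unlabeled sample: automatic labeling. Labeled sample with a different
    # label: the (presumably inaccurate) label is corrected to the target classification.
    sample.label = target_class
    return True
```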
In the above technical solution, a target sample is obtained from a training sample set of a text classification model, the target sample is input into the text classification model and a rule model respectively, a first classification result output by the text classification model and a second classification result output by the rule model are obtained, and the target sample is updated according to the target classification in a case where the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample. By this technical solution, the training sample set can contain both labeled and unlabeled text data, part of the text data can be labeled automatically, the amount of training sample data can be effectively increased, the labeling accuracy of the training samples can be ensured, and the workload of manual labeling can be reduced. Moreover, the classification of the labeled text data in the training sample set can be verified, which further improves the accuracy of the training samples and ensures the accuracy of the text classification model obtained based on the training sample set, thereby providing technical support for accurate auditing and filtering of text content, enabling safe publication of text content, and purifying the network environment.
In order to make those skilled in the art better understand the technical solutions provided by the embodiments of the present disclosure, the above steps are described in detail below.
Optionally, the target classification is determined by:
in a first mode, when the type of the classification rule hit by the text data is a target type, the text classification corresponding to the hit classification rule is determined as the target classification.
When the classification rules in the rule model are preset, the type of each classification rule can be set, and the type is used to indicate whether the target classification can be determined according to the classification rule alone when the classification rule is hit. For example, the types may include a blacklist type and an in-doubt type, where the blacklist type may be used as the target type, that is, the target type indicates that the target classification can be determined according to only the text classification corresponding to the classification rule; as long as the type of the hit classification rule is the blacklist type, the text classification corresponding to the hit classification rule may be directly determined as the target classification. For a classification rule of the in-doubt type, the target classification needs to be determined through comprehensive analysis in combination with the first classification result.
In a second mode, when the type of the hit classification rule is not the target type, the prediction classification is consistent with the text classification corresponding to the hit classification rule, and the confidence of the prediction classification is greater than a first confidence threshold, the prediction classification is determined as the target classification, wherein the first classification result further includes the confidence of the prediction classification.
In this embodiment, the type of the hit classification rule not being the target type means that the target classification cannot be directly determined according to the text classification corresponding to the hit classification rule alone, and the target classification needs to be determined in combination with the first classification result. When the prediction classification is consistent with the text classification corresponding to the hit classification rule, the classifications determined by inputting the target sample into the text classification model and into the rule model are the same, and a confidence of the prediction classification greater than the first confidence threshold indicates that the prediction classification output by the text classification model is relatively reliable. In this case, the prediction classification, i.e., the text classification corresponding to the hit classification rule, may be determined as the target classification. The first confidence threshold may be set according to an actual usage scenario, which is not limited by the present disclosure.
In a third way, in the case that the confidence of the prediction classification is greater than a second confidence threshold value and the second classification result indicates that the text data misses a classification rule, the prediction classification is determined as the target classification, wherein the second confidence threshold value is greater than the first confidence threshold value. The second confidence threshold may be set according to an actual usage scenario, which is not limited by the present disclosure, that is, the confidence level of the prediction classification determined in this manner is higher than the confidence level of the prediction classification determined in the second manner. For example, the second confidence threshold may be set to a larger value, and the confidence of the prediction classification is larger than the second confidence threshold to indicate that the confidence of the prediction classification is higher.
In this embodiment, the second classification result indicates that the text data misses the classification rule, that is, the text data does not match the classification rule in the rule model, in this case, if the confidence of the prediction classification obtained by the text classification model is greater than the second confidence threshold and the second confidence threshold is greater than the first confidence threshold, that is, the prediction classification obtained by the text classification model is credible, at this time, the prediction classification may be directly determined as the target classification.
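Combining the three modes, a minimal sketch of the decision logic might read as follows. It reuses the hypothetical `ClassificationRule` from the earlier sketch, and the threshold values are illustrative placeholders, since the disclosure leaves the confidence thresholds to the actual usage scenario.

```python
from typing import Optional

FIRST_CONF_THRESHOLD = 0.6    # illustrative values only; the thresholds are
SECOND_CONF_THRESHOLD = 0.9   # set according to the actual usage scenario
TARGET_RULE_TYPE = "blacklist"

def determine_target_class(pred_class: str, pred_conf: float,
                           hit_rule: Optional["ClassificationRule"]) -> Optional[str]:
    """Combine the first (model) and second (rule) classification results."""
    # Mode 1: the hit rule is of the target (blacklist) type -> trust the rule alone.
    if hit_rule is not None and hit_rule.rule_type == TARGET_RULE_TYPE:
        return hit_rule.text_class
    # Mode 2: an in-doubt rule is hit, the model agrees with it, and the model
    # is confident enough.
    if (hit_rule is not None and pred_class == hit_rule.text_class
            and pred_conf > FIRST_CONF_THRESHOLD):
        return pred_class
    # Mode 3: no rule is hit; accept the prediction only at the stricter threshold.
    if hit_rule is None and pred_conf > SECOND_CONF_THRESHOLD:
        return pred_class
    return None   # target classification cannot be determined automatically
```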
Therefore, by the technical scheme, the target classification corresponding to the text data can be determined based on the text classification model and the rule model, the accuracy of the determined target classification can be ensured, and the false labeling in the manual labeling process can be effectively avoided. And moreover, the method can accurately and automatically label the text data which are not labeled, and provides data support for accurately judging the classification of the labeled text data. In addition, data cleaning can be carried out on training samples of the text classification model while the text classification model is trained, accuracy of the training samples of the text classification model is guaranteed, accuracy of the trained text classification model is further guaranteed, and text classification and text auditing based on the text classification model can be conveniently carried out subsequently.
Optionally, the method further comprises:
outputting the target sample when any one of the following conditions is met, wherein after the target sample is labeled, the labeled target sample is used as a training sample of the text classification model:
a first condition that the second classification result indicates that the text data miss a classification rule and that a confidence of the predicted classification in the first classification result is less than a third confidence threshold or that the confidence belongs to a classification boundary confidence range, wherein the first classification result further includes a confidence of the predicted classification.
As an example, the third confidence threshold is smaller than the first confidence threshold and may be set to a relatively small value; a confidence of the prediction classification smaller than the third confidence threshold indicates that the reliability of the prediction classification is low, i.e., the determined prediction classification of the text data may be inaccurate. At the same time, the second classification result indicates that the text data hits no classification rule, so the classification of the text data cannot be determined from the second classification result either. In this case the target sample can be labeled manually, so that the classification corresponding to the text data in the target sample is determined, and the manually labeled sample can then be used as a training sample of the text classification model.
As another example, when classifying text data based on a trained text classification model, a corresponding classification confidence threshold may be generally set, and when the confidence of a classification obtained by the text classification model is greater than the classification confidence threshold, the classification obtained by the text classification model is determined as the classification of the text data. Therefore, when the text classification model is trained, the accuracy of the text classification model is greatly influenced by the obtained training samples with the confidence degrees of the prediction classifications near the classification confidence threshold value. Therefore, in this embodiment, a classification boundary confidence range may be determined based on the classification confidence threshold, and if the set classification confidence threshold is 75% and the floating range is 10%, the classification boundary confidence range may be determined to be [ 70%, 80% ], and therefore, when the second classification result indicates that the text data does not hit the classification rule, if the determined confidence of the predicted classification is 74%, and the determined confidence belongs to the classification boundary confidence range, the target sample may be output, so as to perform manual labeling on the target sample.
A second condition that the prediction classification is different from the text classification corresponding to the hit classification rule and that the type of the hit classification rule is not the target type.
In this embodiment, when the type of the hit classification rule is not the target type, the target classification of the text data cannot be determined only according to the text classification corresponding to the hit classification rule; and since the prediction classification is different from that text classification, i.e., different classifications are determined for the same text data, the target classification of the target sample may be determined by manual labeling.
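A hedged sketch of the two output conditions follows, again using illustrative threshold values and the hypothetical rule object from the earlier sketches.

```python
THIRD_CONF_THRESHOLD = 0.3            # illustrative values only
BOUNDARY_CONF_RANGE = (0.70, 0.80)    # e.g. classification threshold 75% with a 10% floating range
TARGET_RULE_TYPE = "blacklist"

def needs_manual_labeling(pred_class, pred_conf, hit_rule) -> bool:
    """Return True when the target sample should be output for manual labeling."""
    # Condition 1: no rule is hit, and the model is either very unsure or sits
    # within the classification boundary confidence range.
    if hit_rule is None:
        low_conf = pred_conf < THIRD_CONF_THRESHOLD
        on_boundary = BOUNDARY_CONF_RANGE[0] <= pred_conf <= BOUNDARY_CONF_RANGE[1]
        return low_conf or on_boundary
    # Condition 2: an in-doubt rule is hit but the model disagrees with it.
    if hit_rule.rule_type != TARGET_RULE_TYPE and pred_class != hit_rule.text_class:
        return True
    return False
```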
In the technical scheme, the target samples of the target classification of the text data which have great influence on the accuracy of the text classification model or cannot be determined are output so as to be manually labeled, so that the labeling accuracy of the target samples can be ensured, and the training efficiency and accuracy of the text classification model are further improved.
In an actual use scenario, in the training process of the text classification model, whether the classification rule in the rule model is accurate or not can be determined according to the training process of the text classification model, so that the dynamic update of the classification rule is realized, the accuracy of a target sample is improved, and meanwhile, the accuracy of the text classification model can also be improved in a feedback manner. Accordingly, the present disclosure also provides the following embodiments.
Optionally, the method further comprises:
determining the hit rate and the error rate of each classification rule, wherein the hit rate is the ratio of the number of hits of the classification rule to the total number of rule-matching operations, and the error rate is the ratio, among target samples that hit the classification rule, of the number of times the prediction classification obtained by inputting the target sample into the text classification model differs from the text classification corresponding to the classification rule, to the number of hits of the classification rule.
For example, suppose there are N classification rules in the rule model. For a classification rule R1, the number of hits is C1, the total number of rule-matching operations performed by inputting target samples into the rule model is MC, and the hit rate of the classification rule is the ratio of C1 to MC. In addition, in the embodiments of the present disclosure, the target sample is input into the text classification model and the rule model respectively, so whenever a target sample hits the classification rule R1, a prediction classification output by the text classification model is also obtained. When the text classification corresponding to the classification rule R1 differs from the prediction classification, there is a disagreement between the text classification model and the rule model, and the error count W is incremented by one, where W is initially 0; the error rate is the ratio of the error count W to the number of hits C1 of the classification rule. The hit rate and the error rate of each classification rule can be determined in the above manner, and details are not repeated here.
And outputting a classification rule to be updated, wherein the classification rule to be updated is a classification rule of which the hit rate is greater than a hit threshold and the error rate is greater than an error threshold.
The hit threshold and the error threshold may be set according to an actual usage scenario, which is not limited in this disclosure. When the hit rate is greater than the hit threshold, the text data in the training sample set hits the classification rule a large number of times; when the error rate is greater than the error threshold, among the samples that hit the classification rule, the prediction classification obtained by the text classification model differs from the text classification corresponding to the classification rule a large number of times. This may be caused by an inaccurate setting of the classification rule, for example a matching range that is too broad, and the classification rule then needs to be updated in order to improve its accuracy.
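A minimal sketch of the hit-rate and error-rate bookkeeping and of selecting the rules to be updated is given below; the class `RuleStats`, the helper `rules_to_update`, and the threshold values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

class RuleStats:
    """Bookkeeping for the hit rate and error rate of each classification rule."""

    def __init__(self):
        self.total_matches = 0          # MC: total number of rule-matching operations
        self.hits = defaultdict(int)    # C:  hits per rule
        self.errors = defaultdict(int)  # W:  model/rule disagreements per rule

    def record(self, hit_rule, pred_class):
        self.total_matches += 1
        if hit_rule is None:
            return
        self.hits[hit_rule.name] += 1
        if pred_class != hit_rule.text_class:
            self.errors[hit_rule.name] += 1

    def hit_rate(self, rule_name):
        return self.hits[rule_name] / self.total_matches if self.total_matches else 0.0

    def error_rate(self, rule_name):
        hits = self.hits[rule_name]
        return self.errors[rule_name] / hits if hits else 0.0

def rules_to_update(stats, rules, hit_threshold=0.05, error_threshold=0.3):
    """Rules whose hit rate and error rate both exceed their (illustrative) thresholds."""
    return [r for r in rules
            if stats.hit_rate(r.name) > hit_threshold and stats.error_rate(r.name) > error_threshold]
```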
For example, the classification rule to be updated may be output in a display interface or as an exported list, and the text data hitting the classification rule may also be output, so that the user can further analyze the classification rule according to the text data. The user can then trigger an update instruction by inputting or importing a rule on the display interface, and the classification rule to be updated is updated in response to the update instruction for the classification rule to be updated. For example, the rule indicated by the update instruction may replace the classification rule to be updated, thereby implementing the update.
Therefore, by the technical scheme, the classification rules in the rule model can be dynamically updated, on one hand, the accuracy of the classification rules can be improved, on the other hand, the accuracy of target sample labeling can be further improved, and data support is provided for obtaining an accurate text classification model, so that the accuracy and efficiency of text data classification and auditing can be improved, and the user experience can be improved.
Optionally, the method further comprises:
outputting a sample to be processed, wherein the confidence of the prediction classification corresponding to the sample to be processed is greater than a preset threshold, and the corresponding second classification result indicates that no classification rule is hit.
The preset threshold may be the second confidence threshold, or may be set according to an actual usage scenario. A confidence of the corresponding prediction classification greater than the preset threshold indicates that the prediction classification is highly reliable; that such a sample nevertheless hits no classification rule may be caused by the classification rules in the rule model being incomplete.
In this embodiment, the to-be-processed sample may be output and displayed to a user, and the user may perform analysis based on the to-be-processed sample, so as to determine a corresponding classification rule in the to-be-processed sample. For example, the user may input or import the classification rule based on a visual interface to trigger a rule setting instruction, and in response to the rule setting instruction for the sample to be processed, add the classification rule indicated by the rule setting instruction to the rule model.
By the above technical solution, classification rules are added to the rule model, which on the one hand ensures the comprehensiveness of the classification rules in the rule model and widens the application range of the text data processing method, and on the other hand improves the hit rate of the classification rules, the efficiency of determining the target classification of text data, and the efficiency of cleaning the training samples. It also facilitates automatic labeling of samples in the training sample set.
Optionally, the method further comprises:
and under the condition that the target sample comprises text data and the classification labeled for the text data, determining a loss value of the text classification model according to the prediction classification corresponding to the target sample and the classification labeled for the text data.
And updating the text classification model according to the loss value when the loss value is larger than a classification threshold value.
In this embodiment, the target sample includes text data and a classification labeled for the text data, that is, the target sample is a labeled sample, and after the target sample is input into the text classification model to obtain a prediction classification, the text classification model may be trained according to the prediction classification and the classification labeled for the text data.
If the determined loss value is greater than the classification threshold, the accuracy of the text classification model does not yet meet the training standard, and the text classification model is updated according to the loss value, for example by gradient descent. After the update, a target sample is acquired from the training sample set again and the above steps are executed, so that the text classification model learns new features and its accuracy improves. Training of the text classification model can thus be completed once the loss value is less than or equal to the classification threshold, and text data can then be accurately classified and audited based on the trained text classification model.
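By way of a non-limiting sketch, and assuming a PyTorch implementation (the disclosure does not specify the model architecture or framework), one training step with the loss-threshold check and a gradient-descent update could look like this; the feature extraction from raw text is omitted, and the stand-in model, threshold value, and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

# A deliberately simple stand-in for the text classification model described above;
# the real model architecture is not specified by the disclosure.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # gradient-descent update
criterion = nn.CrossEntropyLoss()
LOSS_THRESHOLD = 0.1   # illustrative classification threshold for the loss value

def train_step(features: torch.Tensor, labels: torch.Tensor) -> float:
    """One update of the text classification model on labeled target samples."""
    logits = model(features)
    loss = criterion(logits, labels)
    if loss.item() > LOSS_THRESHOLD:   # only update when the loss exceeds the threshold
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```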
Therefore, by the above technical solution, the target samples can be cleaned and labeled while the text classification model is trained, which improves the efficiency of training the text classification model, saves training time, and effectively guarantees both the quality and the quantity of the target samples, thereby improving the accuracy of the text classification model and the accuracy of text data processing.
The present disclosure also provides a method for classifying text data, which may include:
acquiring text data to be classified;
inputting the text data to be classified into a text classification model, and obtaining the classification of the text data to be classified, wherein a training sample set corresponding to the text classification model is generated by any one of the text data processing methods.
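As a hedged sketch of this classification method, again assuming the PyTorch stand-in model from the training sketch, an illustrative classification confidence threshold, and an assumed index for the abnormal text category (none of which are specified by the disclosure):

```python
import torch

ABNORMAL_CLASS_INDEX = 1      # hypothetical index of the abnormal text category
CLASS_CONF_THRESHOLD = 0.75   # illustrative classification confidence threshold

def classify_text(features: torch.Tensor, model: torch.nn.Module) -> str:
    """Classify a single piece of text (already converted to model features)."""
    with torch.no_grad():
        probs = torch.softmax(model(features), dim=-1)
    conf, pred = torch.max(probs, dim=-1)
    if pred.item() == ABNORMAL_CLASS_INDEX and conf.item() > CLASS_CONF_THRESHOLD:
        return "abnormal"     # the text should be filtered or rejected for publication
    return "normal"
```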
Therefore, by the technical scheme, the text classification model for accurately classifying can be obtained, so that the text data to be classified can be accurately classified, the accuracy of the classification result can be ensured, technical support can be provided for accurate auditing and filtering of the text content, the text content can be safely published, the network environment is purified, and the user experience is improved.
Optionally, the text data to be classified originates from any one of: news information, instant messaging messages, short texts;
the classification of the text data comprises an abnormal text category;
the step of inputting the text data to be classified into a text classification model to obtain the classification of the text data to be classified comprises the following steps:
and inputting the text data to be classified into the text classification model to obtain the text data to be classified that corresponds to the abnormal text category.
The manner of obtaining the text data is described in detail above and is not repeated here. For example, in order to purify the network environment, news information generally needs to be audited before publication to determine whether it contains abnormal text. The abnormal text may be politically sensitive information that needs to be filtered or vulgar content, that is, it is determined whether the news information is abnormal text; low-quality content such as clickbait headlines and copied text can also be audited to improve the quality of data in the network environment. Thus, in this embodiment, the classification of the text data may include an abnormal text category, where the abnormal text category may be used to represent a sensitive text type and/or a low-quality content type, and text belonging to the sensitive text type is text containing sensitive information.
Therefore, in the above technical solution, the text data to be classified, such as news information, can be input into the text classification model, so that the news information can be classified by the text classification model, and when it is determined that the news information contains abnormal text, that is, it is determined that the news information is classified into an abnormal text type, the news information is output, thereby automatically auditing the text content of the news information can be realized, and auditing and filtering of the text content can be realized.
The present disclosure also provides a text data processing apparatus, as shown in fig. 2, the apparatus 10 includes:
a first obtaining module 100, configured to obtain a target sample from a training sample set of a text classification model, where the target sample includes text data or includes text data and a classification labeled for the text data;
a first input module 200, configured to input the target sample into a text classification model and a rule model respectively, and obtain a first classification result output by the text classification model and a second classification result output by the rule model, where the first classification result includes a predicted classification corresponding to the text data, the rule model includes a plurality of classification rules, each of the classification rules corresponds to the same or different text classifications, and the second classification result is used to indicate a classification rule hit by the text data;
a first updating module 300, configured to update the target sample according to the target classification when the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample.
Optionally, the text data originates from any one of: news information, instant messaging messages, short texts;
the classification of the text data includes an abnormal text category.
Optionally, the apparatus further comprises:
a first determination module configured to determine the target classification by:
determining the text classification corresponding to the hit classification rule as a target classification under the condition that the type of the classification rule hit by the text data is a target type;
determining the predicted classification as the target classification in a case where the type of the hit classification rule is not the target type, the predicted classification is consistent with the text classification corresponding to the hit classification rule, and the confidence of the predicted classification is greater than a first confidence threshold, wherein the first classification result further comprises the confidence of the predicted classification;
determining the prediction classification as the target classification if the confidence of the prediction classification is greater than a second confidence threshold and the second classification result indicates that the text data misses a classification rule, wherein the second confidence threshold is greater than the first confidence threshold.
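One possible shape of this three-branch decision is sketched below; the rule representation, the concrete threshold values, and the use of a "blacklist" tag as the target type are assumptions made solely for the sketch, and only the first hit rule is considered for simplicity.

```python
# Sketch of one possible target-classification decision, following the three
# branches above; thresholds, rule format and the target type are assumed.
FIRST_CONFIDENCE_THRESHOLD = 0.8    # assumed value
SECOND_CONFIDENCE_THRESHOLD = 0.95  # assumed value, larger than the first

def decide_target_classification(predicted_label, confidence, hit_rules,
                                 target_type="blacklist"):
    # Branch 1: a hit rule of the target type decides the classification.
    for rule in hit_rules:
        if rule["type"] == target_type:
            return rule["label"]

    # Branch 2: a rule was hit but not of the target type; accept the
    # prediction when it agrees with the rule's text classification and
    # exceeds the first confidence threshold.
    if hit_rules:
        if (predicted_label == hit_rules[0]["label"]
                and confidence > FIRST_CONFIDENCE_THRESHOLD):
            return predicted_label
        return None

    # Branch 3: no rule hit; accept the prediction only with the stricter
    # second confidence threshold.
    if confidence > SECOND_CONFIDENCE_THRESHOLD:
        return predicted_label
    return None
```

Note that, consistent with the description above, the second branch accepts the prediction only when a hit rule corroborates it and the first threshold is cleared, while the third branch demands the stricter second threshold because no rule corroborates the prediction.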
Optionally, the apparatus further comprises:
a first output module, configured to output the target sample when any one of the following conditions is satisfied, wherein after the target sample is labeled, the labeled sample is used as a training sample of the text classification model:
the second classification result indicates that the text data does not hit a classification rule, and the confidence of the prediction classification in the first classification result is less than a third confidence threshold or the confidence belongs to a classification boundary confidence range, wherein the first classification result further comprises the confidence of the prediction classification;
the prediction classification is different from the text classification corresponding to the hit classification rule, and the text classification corresponding to the hit classification rule is not the target type.
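A compact way to picture these two output conditions is the following sketch, where the third confidence threshold, the classification-boundary range, and the rule representation are illustrative assumptions only.

```python
# Sketch of the two conditions for sending a target sample to manual
# labeling; thresholds and the boundary range are assumed values.
THIRD_CONFIDENCE_THRESHOLD = 0.6   # assumed value
BOUNDARY_RANGE = (0.45, 0.55)      # assumed classification-boundary range

def needs_manual_labeling(predicted_label, confidence, hit_rules,
                          target_type="blacklist"):
    # Condition 1: no rule hit, and the prediction is either weak or lies
    # in the classification-boundary confidence range.
    if not hit_rules:
        low = confidence < THIRD_CONFIDENCE_THRESHOLD
        boundary = BOUNDARY_RANGE[0] <= confidence <= BOUNDARY_RANGE[1]
        return low or boundary

    # Condition 2: the prediction disagrees with a hit rule that is not of
    # the target type (only the first hit rule is considered here).
    rule = hit_rules[0]
    return rule["type"] != target_type and predicted_label != rule["label"]
```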
Optionally, the apparatus further comprises:
a second determining module, configured to determine, for each of the classification rules, a hit rate and an error rate of the classification rule, where the hit rate is the ratio of the number of times the classification rule is hit to the total number of rule-matching attempts, and the error rate is the ratio of the number of times that, for target samples hitting the classification rule, the prediction classification obtained by inputting the target sample into the text classification model differs from the text classification corresponding to the classification rule, to the number of times the classification rule is hit;
a second output module configured to output a classification rule to be updated, wherein the classification rule to be updated is a classification rule in which the hit rate is greater than a hit threshold and the error rate is greater than an error threshold;
a second updating module configured to update the classification rule to be updated in response to an update instruction for the classification rule to be updated.
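The per-rule statistics can be pictured as in the sketch below; the counter layout and the hit/error threshold values are assumptions chosen for illustration.

```python
# Sketch of the per-rule hit-rate / error-rate statistics used to pick
# classification rules to be updated; counter names and thresholds are assumed.
HIT_THRESHOLD = 0.3    # assumed value
ERROR_THRESHOLD = 0.5  # assumed value

def rules_to_update(stats, total_matches):
    """stats: {rule_id: {"hits": int, "prediction_mismatches": int}}
    total_matches: total number of rule-matching attempts."""
    to_update = []
    for rule_id, s in stats.items():
        if s["hits"] == 0 or total_matches == 0:
            continue
        hit_rate = s["hits"] / total_matches
        # Error rate: among samples hitting this rule, the fraction whose
        # model prediction differs from the rule's text classification.
        error_rate = s["prediction_mismatches"] / s["hits"]
        if hit_rate > HIT_THRESHOLD and error_rate > ERROR_THRESHOLD:
            to_update.append(rule_id)
    return to_update
```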
Optionally, the apparatus further comprises:
a third output module, configured to output a to-be-processed sample, where the to-be-processed sample is a sample whose corresponding prediction classification has a confidence greater than a preset threshold and whose corresponding second classification result indicates that no classification rule is hit;
an adding module configured to add a classification rule indicated by a rule setting instruction to the rule model in response to the rule setting instruction for the sample to be processed.
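A minimal sketch of selecting such to-be-processed samples and registering a new rule is given below; the sample fields, the preset threshold value, and the keyword-based rule format are assumptions made for the example.

```python
# Sketch of picking candidate samples for new rules and registering a rule
# in response to a rule-setting instruction; all names are illustrative.
PRESET_CONFIDENCE_THRESHOLD = 0.9  # assumed value

def pending_rule_candidates(samples):
    """Pick samples whose prediction is confident but which hit no rule,
    so that an operator can decide whether a new classification rule is needed."""
    return [s for s in samples
            if s["confidence"] > PRESET_CONFIDENCE_THRESHOLD and not s["hit_rules"]]

def add_rule(rule_model, keyword, label, rule_type="whitelist"):
    """rule_model: a list of rule dicts (assumed representation);
    a rule here is assumed to be a simple keyword-to-label mapping."""
    rule_model.append({"keyword": keyword, "label": label, "type": rule_type})
```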
Optionally, the apparatus further comprises:
a third determining module configured to determine a loss value of the text classification model according to a prediction classification corresponding to the target sample and a classification labeled for the text data when the target sample includes the text data and the classification labeled for the text data;
a third updating module configured to update the text classification model according to the loss value if the loss value is greater than a classification threshold.
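The loss-driven update can be sketched as follows; the disclosure only states that a loss value is computed from the prediction classification and the labeled classification, so the use of PyTorch, cross-entropy loss, and the concrete threshold value are assumptions for illustration.

```python
# Sketch of the loss-driven update for labeled target samples; framework,
# loss function and threshold value are assumed for illustration.
import torch
import torch.nn.functional as F

CLASSIFICATION_THRESHOLD = 0.1  # assumed loss threshold

def maybe_update(model, optimizer, input_ids, labeled_class):
    """input_ids: tensor encoding of the target sample's text data;
    labeled_class: index of the classification labeled for the text data."""
    logits = model(input_ids)                               # shape (1, num_classes)
    loss = F.cross_entropy(logits, torch.tensor([labeled_class]))
    # Update the text classification model only when the loss value exceeds
    # the classification threshold.
    if loss.item() > CLASSIFICATION_THRESHOLD:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```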
The present disclosure also provides a text data classification apparatus, the apparatus including:
a second obtaining module configured to obtain text data to be classified;
a second input module, configured to input the text data to be classified into a text classification model, and obtain a classification of the text data to be classified, where a training sample set corresponding to the text classification model is generated by any one of the text data processing methods in the first aspect.
Optionally, the text data to be classified originates from any one of: news information, instant messaging messages, short texts;
the classification of the text data comprises an abnormal text category;
the second input module includes:
a submodule configured to input the text data to be classified into the text classification model to obtain the text data to be classified that corresponds to the abnormal text category.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present disclosure also provides a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of processing or method of classifying text data provided by the present disclosure.
Fig. 3 is a block diagram illustrating a text data processing apparatus 800 according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 3, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the method of processing text data described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power component 806 provides power to the various components of device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed status of the device 800 and the relative positioning of components, such as the display and keypad of the device 800. The sensor assembly 814 may also detect a change in the position of the device 800 or of a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in the temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described text data processing methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the apparatus 800 to perform the above-described method of processing text data is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which contains a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned text data processing method when executed by the programmable apparatus.
Fig. 4 is a block diagram illustrating a text data processing apparatus 1900 according to an example embodiment. For example, the apparatus 1900 may be provided as a server. Referring to fig. 4, the device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions, e.g., application programs, executable by the processing component 1922. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the above-described text data processing method.
The device 1900 may also include a power component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958. The device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

1. A method for processing text data, comprising:
acquiring a target sample from a training sample set of a text classification model, wherein the target sample comprises text data or comprises the text data and a classification labeled for the text data;
respectively inputting the target sample into a text classification model and a rule model to obtain a first classification result output by the text classification model and a second classification result output by the rule model, wherein the first classification result comprises a prediction classification corresponding to the text data, the rule model comprises a plurality of classification rules, each classification rule corresponds to the same or different text classifications, and the second classification result is used for indicating a classification rule hit by the text data;
and under the condition that the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample, updating the target sample according to the target classification.
2. The method of claim 1, wherein the textual data originates from any one of: news information, instant messaging messages, short texts;
the classification of the text data includes an abnormal text category.
3. The method of claim 1, wherein the target classification is determined by:
determining the text classification corresponding to the hit classification rule as a target classification under the condition that the type of the classification rule hit by the text data is a target type;
determining the predicted classification as the target classification under the condition that the text classification corresponding to the hit classification rule is not the target type, the predicted classification is consistent with the text classification corresponding to the hit classification rule, and the confidence of the predicted classification is greater than a first confidence threshold, wherein the first classification result further comprises the confidence of the predicted classification;
determining the prediction classification as the target classification if the confidence of the prediction classification is greater than a second confidence threshold and the second classification result indicates that the text data misses a classification rule, wherein the second confidence threshold is greater than the first confidence threshold.
4. The method of claim 1, further comprising:
outputting the target sample when any one of the following conditions is met, wherein after the target sample is labeled, the labeled sample is used as a training sample of the text classification model:
the second classification result indicates that the text data does not hit a classification rule, and the confidence of the prediction classification in the first classification result is less than a third confidence threshold or the confidence belongs to a classification boundary confidence range, wherein the first classification result further comprises the confidence of the prediction classification;
the prediction classification is different from the text classification corresponding to the hit classification rule, and the text classification corresponding to the hit classification rule is not the target type.
5. The method of claim 1, further comprising:
determining, for each classification rule, a hit rate and an error rate of the classification rule, wherein the hit rate is the ratio of the number of times the classification rule is hit to the total number of rule-matching attempts, and the error rate is the ratio of the number of times that, for target samples hitting the classification rule, the prediction classification obtained by inputting the target sample into the text classification model differs from the text classification corresponding to the classification rule, to the number of times the classification rule is hit;
outputting a classification rule to be updated, wherein the classification rule to be updated is a classification rule of which the hit rate is greater than a hit threshold and the error rate is greater than an error threshold;
updating the classification rule to be updated in response to an update instruction for the classification rule to be updated.
6. The method of claim 1, further comprising:
outputting a sample to be processed, wherein the confidence of the prediction classification corresponding to the sample to be processed is greater than a preset threshold, and the corresponding second classification result indicates that no classification rule is hit;
in response to a rule setting instruction for the sample to be processed, adding a classification rule indicated by the rule setting instruction to the rule model.
7. The method of claim 1, further comprising:
under the condition that the target sample comprises text data and a classification labeled for the text data, determining a loss value of the text classification model according to a prediction classification corresponding to the target sample and the classification labeled for the text data;
and updating the text classification model according to the loss value when the loss value is larger than a classification threshold value.
8. A method for classifying text data, the method comprising:
acquiring text data to be classified;
inputting the text data to be classified into a text classification model to obtain the classification of the text data to be classified, wherein a training sample set corresponding to the text classification model is generated by the text data processing method according to any one of claims 1 to 7.
9. The method according to claim 8, characterized in that the text data to be classified originates from any one of the following: news information, instant messaging messages, short texts;
the classification of the text data comprises an abnormal text category;
the step of inputting the text data to be classified into a text classification model to obtain the classification of the text data to be classified comprises the following steps:
inputting the text data to be classified into the text classification model to obtain the text data to be classified that corresponds to the abnormal text category.
10. A processing apparatus of text data, comprising:
the system comprises a first obtaining module, a second obtaining module and a third obtaining module, wherein the first obtaining module is configured to obtain a target sample from a training sample set of a text classification model, and the target sample comprises text data or comprises the text data and a classification labeled for the text data;
a first input module, configured to input the target sample into a text classification model and a rule model respectively, and obtain a first classification result output by the text classification model and a second classification result output by the rule model, where the first classification result includes a predicted classification corresponding to the text data, the rule model includes a plurality of classification rules, each classification rule corresponds to the same or different text classifications, and the second classification result is used to indicate a classification rule hit by the text data;
the first updating module is configured to update the target sample according to the target classification when the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample.
11. An apparatus for classifying text data, the apparatus comprising:
a second obtaining module configured to obtain text data to be classified;
a second input module, configured to input the text data to be classified into a text classification model, and obtain a classification of the text data to be classified, where a training sample set corresponding to the text classification model is generated by the text data processing method according to any one of claims 1 to 7.
12. A processing apparatus of text data, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring a target sample from a training sample set of a text classification model, wherein the target sample comprises text data or comprises the text data and a classification labeled for the text data;
respectively inputting the target sample into a text classification model and a rule model to obtain a first classification result output by the text classification model and a second classification result output by the rule model, wherein the first classification result comprises a prediction classification corresponding to the text data, the rule model comprises a plurality of classification rules, each classification rule corresponds to the same or different text classifications, and the second classification result is used for indicating a classification rule hit by the text data;
and under the condition that the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample, updating the target sample according to the target classification.
13. An apparatus for classifying text data, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring text data to be classified;
inputting the text data to be classified into a text classification model to obtain the classification of the text data to be classified, wherein a training sample set corresponding to the text classification model is generated by the text data processing method according to any one of claims 1 to 7.
14. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 9.
CN202010556440.0A 2020-06-17 2020-06-17 Text data processing method, text data classifying device and readable storage medium Active CN111813932B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010556440.0A CN111813932B (en) 2020-06-17 2020-06-17 Text data processing method, text data classifying device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010556440.0A CN111813932B (en) 2020-06-17 2020-06-17 Text data processing method, text data classifying device and readable storage medium

Publications (2)

Publication Number Publication Date
CN111813932A true CN111813932A (en) 2020-10-23
CN111813932B CN111813932B (en) 2023-11-14

Family

ID=72844808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010556440.0A Active CN111813932B (en) 2020-06-17 2020-06-17 Text data processing method, text data classifying device and readable storage medium

Country Status (1)

Country Link
CN (1) CN111813932B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018166499A1 (en) * 2017-03-17 2018-09-20 腾讯科技(深圳)有限公司 Text classification method and device, and storage medium
WO2020087974A1 (en) * 2018-10-30 2020-05-07 北京字节跳动网络技术有限公司 Model generation method and device
CN110580290A (en) * 2019-09-12 2019-12-17 北京小米智能科技有限公司 method and device for optimizing training set for text classification
CN111159412A (en) * 2019-12-31 2020-05-15 腾讯科技(深圳)有限公司 Classification method and device, electronic equipment and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Jun; WANG Suge: "Cross-domain text sentiment classification based on a stepwise-optimized classification model", Computer Science (计算机科学), No. 07 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966103A (en) * 2021-02-05 2021-06-15 成都信息工程大学 Mixed attention mechanism text title matching method based on multi-task learning
CN112966103B (en) * 2021-02-05 2022-04-19 成都信息工程大学 Mixed attention mechanism text title matching method based on multi-task learning
CN114547317A (en) * 2022-04-28 2022-05-27 飞狐信息技术(天津)有限公司 Text auditing method and device
CN115098680A (en) * 2022-06-29 2022-09-23 腾讯科技(深圳)有限公司 Data processing method, data processing apparatus, electronic device, medium, and program product

Also Published As

Publication number Publication date
CN111813932B (en) 2023-11-14

Similar Documents

Publication Publication Date Title
CN110580290B (en) Method and device for optimizing training set for text classification
CN107102746B (en) Candidate word generation method and device and candidate word generation device
CN108121736B (en) Method and device for establishing subject term determination model and electronic equipment
EP3173948A1 (en) Method and apparatus for recommendation of reference documents
CN111539443A (en) Image recognition model training method and device and storage medium
CN111813932B (en) Text data processing method, text data classifying device and readable storage medium
CN105528403B (en) Target data identification method and device
CN110688527A (en) Video recommendation method and device, storage medium and electronic equipment
CN107784034B (en) Page type identification method and device for page type identification
CN111832315B (en) Semantic recognition method, semantic recognition device, electronic equipment and storage medium
CN108345625B (en) Information mining method and device for information mining
CN112291614A (en) Video generation method and device
CN111046927A (en) Method and device for processing labeled data, electronic equipment and storage medium
CN113343028B (en) Method and device for training intention determination model
CN111428806B (en) Image tag determining method and device, electronic equipment and storage medium
CN110110046B (en) Method and device for recommending entities with same name
CN111246255B (en) Video recommendation method and device, storage medium, terminal and server
CN111382061B (en) Test method, test device, test medium and electronic equipment
CN111324214B (en) Statement error correction method and device
CN111832297A (en) Part-of-speech tagging method and device and computer-readable storage medium
CN109145151B (en) Video emotion classification acquisition method and device
CN112784151A (en) Method and related device for determining recommendation information
CN110147426B (en) Method for determining classification label of query text and related device
CN111079421A (en) Text information word segmentation processing method, device, terminal and storage medium
CN114466204B (en) Video bullet screen display method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant