CN111813932B - Text data processing method, text data classification method, device, and readable storage medium



Publication number
CN111813932B
Authority
CN
China
Prior art keywords
classification
text
text data
rule
model
Prior art date
Legal status
Active
Application number
CN202010556440.0A
Other languages
Chinese (zh)
Other versions
CN111813932A (en)
Inventor
彭团民
徐泽宇
Current Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202010556440.0A
Publication of CN111813932A
Application granted
Publication of CN111813932B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G06F 16/353: Clustering; Classification into predefined classes
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/254: Fusion techniques of classification results, e.g. of results related to same input data

Abstract

The present disclosure relates to a text data processing method, a text data classification method, a device, and a readable storage medium. The processing method includes: obtaining a target sample from a training sample set of a text classification model; inputting the target sample into the text classification model and a rule model respectively, to obtain a first classification result output by the text classification model and a second classification result output by the rule model, where the first classification result includes a prediction classification corresponding to the text data, the rule model includes a plurality of classification rules each corresponding to the same or a different text classification, and the second classification result indicates the classification rule hit by the text data; and, when a target classification corresponding to the text data is determined from the first classification result and the second classification result and the target sample does not already contain the target classification, updating the target sample according to the target classification. In this way, the amount of training data can be increased, the labeling accuracy of the training samples can be ensured, and the amount of manual labeling can be reduced.

Description

Text data processing method, text data classification method, device, and readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular to a text data processing method, a text data classification method, a device, and a readable storage medium.
Background
Users can publish various kinds of text, such as news and comments, through various platforms. However, as the amount of data grows and the demand for a clean network environment increases, text content needs to be reviewed before publication to determine whether it may be released. In the related art, a text classification model is trained and used to identify sensitive or abnormal text, thereby realizing the review of text content.
However, the training text usually has to be labeled manually. If the training text is labeled incorrectly or there is too little of it, the accuracy of the text classification model suffers severely, and accurate review of text content based on the model cannot be guaranteed.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a method of processing text data, a method of classifying text data, an apparatus, and a readable storage medium.
According to a first aspect of an embodiment of the present disclosure, there is provided a method for processing text data, including:
Obtaining a target sample from a training sample set of a text classification model, wherein the target sample comprises text data or comprises the text data and a classification marked for the text data;
respectively inputting the target sample into a text classification model and a rule model to obtain a first classification result output by the text classification model and a second classification result output by the rule model, wherein the first classification result comprises a prediction classification corresponding to the text data, the rule model comprises a plurality of classification rules, each classification rule corresponds to the same or different text classification, and the second classification result is used for indicating the classification rule hit by the text data;
and updating the target sample according to the target classification when the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample.
Optionally, the text data is derived from any one of the following: news information, instant messaging messages, short text;
the classification of the text data includes an abnormal text category.
Optionally, the target classification is determined by:
determining a text classification corresponding to the hit classification rule as the target classification under the condition that the type of the classification rule hit by the text data is the target type;
determining the prediction classification as the target classification when the type of the hit classification rule is not the target type, the prediction classification is consistent with the text classification corresponding to the hit classification rule, and the confidence of the prediction classification is greater than a first confidence threshold, wherein the first classification result further comprises the confidence of the prediction classification;
determining the prediction classification as the target classification when the confidence of the prediction classification is greater than a second confidence threshold and the second classification result indicates that the text data hits no classification rule, wherein the second confidence threshold is greater than the first confidence threshold.
Optionally, the method further comprises:
outputting the target sample when any one of the following conditions is met, wherein after the target sample is marked, the marked sample is used as a training sample of the text classification model:
The second classification result indicates that the text data does not hit the classification rule, and the confidence of the prediction classification in the first classification result is smaller than a third confidence threshold or the confidence belongs to a classification boundary confidence range, wherein the first classification result further comprises the confidence of the prediction classification;
the prediction classification is different from the text classification corresponding to the hit classification rule, and the type of the hit classification rule is not the target type.
Optionally, the method further comprises:
determining, for each classification rule, a hit rate and an error rate of the classification rule, wherein the hit rate is the ratio of the number of hits of the classification rule to the total number of rule matches, and the error rate is the ratio of the number of times that, when a target sample hits the classification rule, the prediction classification obtained by inputting the target sample into the text classification model differs from the text classification corresponding to the classification rule, to the number of hits of the classification rule;
outputting a classification rule to be updated, wherein the classification rule to be updated is a classification rule whose hit rate is greater than a hit threshold and whose error rate is greater than an error threshold;
and updating the classification rule to be updated in response to an update instruction for the classification rule to be updated.
Optionally, the method further comprises:
outputting a sample to be processed, wherein the sample to be processed is a sample whose corresponding prediction classification has a confidence greater than a preset threshold and whose corresponding second classification result indicates that no classification rule is hit;
in response to a rule setting instruction for the sample to be processed, adding a classification rule indicated by the rule setting instruction to the rule model.
Optionally, the method further comprises:
determining a loss value of the text classification model according to the prediction classification corresponding to the target sample and the classification marked for the text data under the condition that the target sample comprises the text data and the classification marked for the text data;
and updating the text classification model according to the loss value under the condition that the loss value is larger than a classification threshold value.
In a second aspect, there is provided a method of classifying text data, the method comprising:
acquiring text data to be classified;
inputting the text data to be classified into a text classification model to obtain the classification of the text data to be classified, wherein a training sample set corresponding to the text classification model is generated by any one of the text data processing methods of the first aspect.
Optionally, the text data to be classified is derived from any one of the following: news information, instant messaging messages, short text;
the classification of the text data includes an abnormal text category;
the inputting of the text data to be classified into the text classification model to obtain the classification of the text data to be classified includes:
inputting the text data to be classified into the text classification model to obtain the text data to be classified that corresponds to the abnormal text category.
In a third aspect, there is provided a processing apparatus of text data, including:
a first acquisition module configured to acquire a target sample from a training sample set of a text classification model, wherein the target sample comprises text data or comprises text data and a classification annotated for the text data;
the first input module is configured to input the target sample into a text classification model and a rule model respectively, and obtain a first classification result output by the text classification model and a second classification result output by the rule model, wherein the first classification result comprises a prediction classification corresponding to the text data, the rule model comprises a plurality of classification rules, each classification rule corresponds to the same or different text classification, and the second classification result is used for indicating the classification rule hit by the text data;
And the first updating module is configured to update the target sample according to the target classification when the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample.
Optionally, the text data is derived from any one of the following: news information, instant messaging messages, short text;
the classification of the text data includes an abnormal text category.
Optionally, the apparatus further comprises:
a first determination module configured to determine the target classification by:
determining a text classification corresponding to the hit classification rule as the target classification under the condition that the type of the classification rule hit by the text data is the target type;
determining the prediction classification as the target classification when the type of the hit classification rule is not the target type, the prediction classification is consistent with the text classification corresponding to the hit classification rule, and the confidence of the prediction classification is greater than a first confidence threshold, wherein the first classification result further comprises the confidence of the prediction classification;
In the event that the confidence of the predicted classification is greater than a second confidence threshold, and the second classification result indicates that the text data misses a classification rule, the predicted classification is determined to be the target classification, wherein the second confidence threshold is greater than the first confidence threshold.
Optionally, the apparatus further comprises:
the first output module is configured to output the target sample when any one of the following conditions is met, wherein after the target sample is marked, the marked sample is used as a training sample of the text classification model:
the second classification result indicates that the text data does not hit the classification rule, and the confidence of the prediction classification in the first classification result is smaller than a third confidence threshold or the confidence belongs to a classification boundary confidence range, wherein the first classification result further comprises the confidence of the prediction classification;
the prediction classification is different from the text classification corresponding to the hit classification rule, and the type of the hit classification rule is not the target type.
Optionally, the apparatus further comprises:
a second determining module configured to determine, for each classification rule, a hit rate and an error rate of the classification rule, wherein the hit rate is the ratio of the number of hits of the classification rule to the total number of rule matches, and the error rate is the ratio of the number of times that, when a target sample hits the classification rule, the prediction classification obtained by inputting the target sample into the text classification model differs from the text classification corresponding to the classification rule, to the number of hits of the classification rule;
The second output module is configured to output a classification rule to be updated, wherein the classification rule to be updated is a classification rule that the hit rate is greater than a hit threshold value and the error rate is greater than an error threshold value;
and the second updating module is configured to respond to an updating instruction for the classification rule to be updated and update the classification rule to be updated.
Optionally, the apparatus further comprises:
the third output module is configured to output a sample to be processed, wherein the sample to be processed is a sample whose corresponding prediction classification has a confidence greater than a preset threshold and whose corresponding second classification result indicates that no classification rule is hit;
an adding module configured to respond to a rule setting instruction for the sample to be processed and add a classification rule indicated by the rule setting instruction to the rule model.
Optionally, the apparatus further comprises:
a third determining module configured to determine a loss value of the text classification model according to a predicted classification corresponding to the target sample and the classification labeled for the text data, in a case where the target sample includes text data and a classification labeled for the text data;
And a third updating module configured to update the text classification model according to the loss value if the loss value is greater than a classification threshold.
In a fourth aspect, there is provided a text data classifying apparatus, the apparatus comprising:
the second acquisition module is configured to acquire text data to be classified;
the second input module is configured to input the text data to be classified into a text classification model to obtain the classification of the text data to be classified, wherein a training sample set corresponding to the text classification model is generated by any one of the text data processing methods of the first aspect.
Optionally, the text data to be classified is derived from any one of the following: news information, instant messaging messages, short text;
the classification of the text data includes an abnormal text category;
the second input module is specifically configured to input the text data to be classified into the text classification model to obtain the text data to be classified that corresponds to the abnormal text category.
In a fifth aspect, there is provided a processing apparatus for text data, including:
a processor;
a memory for storing processor-executable instructions;
Wherein the processor is configured to:
obtaining a target sample from a training sample set of a text classification model, wherein the target sample comprises text data or comprises the text data and a classification marked for the text data;
respectively inputting the target sample into a text classification model and a rule model to obtain a first classification result output by the text classification model and a second classification result output by the rule model, wherein the first classification result comprises a prediction classification corresponding to the text data, the rule model comprises a plurality of classification rules, each classification rule corresponds to the same or different text classification, and the second classification result is used for indicating the classification rule hit by the text data;
and updating the target sample according to the target classification when the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample.
In a sixth aspect, there is provided a text data classifying apparatus including:
a processor;
a memory for storing processor-executable instructions;
Wherein the processor is configured to:
acquiring text data to be classified;
inputting the text data to be classified into a text classification model to obtain the classification of the text data to be classified, wherein a training sample set corresponding to the text classification model is generated by any one of the text data processing methods in the first aspect.
In a seventh aspect, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, perform the steps of the method of any of the first or second aspects above.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects:
in the above technical solution, a target sample is obtained from the training sample set of a text classification model and is input into the text classification model and a rule model respectively, so as to obtain a first classification result output by the text classification model and a second classification result output by the rule model; when a target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample, the target sample is updated according to the target classification. With this technical solution, the training sample set may contain both labeled and unlabeled text data, and part of the text data can be labeled automatically, so the amount of training data can be effectively increased, the labeling accuracy of the training samples can be ensured, and the workload of manual labeling can be reduced. In addition, the classifications of the labeled text data in the training sample set can be verified, which further improves the accuracy of the training samples and ensures the accuracy of the text classification model trained on this training sample set, thereby providing technical support for accurate review and filtering of text content, enabling safe publication of text content, and helping keep the network environment clean.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating a method of processing text data according to an exemplary embodiment.
Fig. 2 is a block diagram illustrating a text data processing apparatus according to an exemplary embodiment.
Fig. 3 is a block diagram illustrating a text data processing apparatus according to an exemplary embodiment.
Fig. 4 is a block diagram illustrating a text data processing apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
Fig. 1 is a flowchart illustrating a method of processing text data according to an exemplary embodiment, and as shown in fig. 1, may include the steps of:
in step 11, a target sample is obtained from a training sample set of a text classification model, wherein the target sample comprises text data or comprises text data and a classification annotated for the text data.
When the text classification model is trained, text data is usually labeled in advance by manual labeling so that the classification corresponding to each piece of text data is determined; a target sample that includes text data and the classification labeled for the text data can then be used directly for training the text classification model.
Illustratively, the text data is derived from any one of the following: news information, instant messaging messages, and short text. The news information may be obtained, for example, by crawling news pages with a crawler, or from a news publishing platform. It may be the original text data obtained in this way, or text data obtained by processing the original text data, for example by extracting keywords, key sentences, and key paragraphs. Short text may refer to text with a small number of words posted on a microblog, a forum, or the like. As described in the background, in order to keep the network environment clean, text data generally needs to be reviewed before publication to determine whether it contains sensitive information, for example politically sensitive information or vulgar text, that is, whether the text is a sensitive text; low-quality content such as clickbait and duplicated text may also be reviewed to improve the quality of data in the network environment. Therefore, the classification of the text data may include an abnormal text category, where the abnormal text category may represent a sensitive text type and/or a low-quality content category. Text belonging to the sensitive text category, that is, text containing sensitive information, can be identified by the text classification model, so that abnormal text is classified accurately and review and filtering of text content is realized.
However, in an actual use scenario, the number of training samples obtained by manual labeling is limited, which affects the accuracy of the classification results of the trained text classification model. Therefore, to increase the number of training samples and reduce the amount of manual work, the training sample set in the present disclosure may include unlabeled samples, that is, a target sample may contain only text data, and the text data in such a target sample can be labeled automatically through the subsequent steps. For example, when news information is labeled manually, the longer text or more complex content increases the labeling workload and affects labeling accuracy. In an embodiment of the present disclosure, in order to increase the number of training samples, the training sample set may be augmented with partly unlabeled news information, which improves the accuracy and training speed of the text classification model to some extent.
In step 12, the target sample is respectively input into a text classification model and a rule model, and a first classification result output by the text classification model and a second classification result output by the rule model are obtained, wherein the first classification result comprises a prediction classification corresponding to the text data, the rule model comprises a plurality of classification rules, each classification rule corresponds to the same or different text classification, and the second classification result is used for indicating the classification rule hit by the text data.
In this embodiment, the plurality of classification rules included in the rule model may be determined based on the text data in the training sample set that has been labeled with classifications. For example, an annotator may analyze the text data labeled with the same classification to determine a classification rule for that classification, and that classification is taken as the text classification corresponding to the rule. A classification rule may be expressed, for example, by keywords, key sentences, or regular expressions. The target sample is then input into the rule model, and the text data in the target sample is matched against the classification rules in the rule model to obtain the second classification result.
If the text data hits a classification rule in the rule model, the second classification result may include the classification rule hit by the text data; if the text data hits no classification rule, the second classification result may be null to indicate that no classification rule is hit.
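As an illustration only, the following minimal sketch shows what such a rule model might look like, assuming regex-style rules, a "blacklist"/"in_doubt" rule type field, and made-up rule contents; none of these names or values are prescribed by the disclosure.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical rule representation: a pattern, the text classification the rule
# maps to, and a rule type ("blacklist" corresponds to the target type below).
@dataclass
class ClassificationRule:
    name: str
    pattern: str        # keyword, key-sentence, or regular-expression pattern
    text_class: str     # text classification the rule corresponds to
    rule_type: str      # "blacklist" (the target type) or "in_doubt"

class RuleModel:
    def __init__(self, rules: List[ClassificationRule]):
        self.rules = rules

    def classify(self, text: str) -> Optional[ClassificationRule]:
        """Return the first classification rule hit by the text (the 'second
        classification result'); None means no classification rule is hit."""
        for rule in self.rules:
            if re.search(rule.pattern, text, flags=re.IGNORECASE):
                return rule
        return None

# Example with made-up rules and text.
rule_model = RuleModel([
    ClassificationRule("contact_spam", r"add my wechat|加微信", "advertisement", "blacklist"),
    ClassificationRule("clickbait", r"shocking|you won't believe", "low_quality", "in_doubt"),
])
hit = rule_model.classify("You won't believe what happened next")
print(hit.text_class if hit else "no rule hit")   # -> low_quality
```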
In step 13, in the case that the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample, the target sample is updated according to the target classification.
The first classification result and the second classification result corresponding to the text data in the target sample can be determined through the steps, so that the target classification corresponding to the text data can be determined by integrating the first classification result and the second classification result, and the accuracy of the determined target classification can be ensured to a certain extent.
As an example, when the target sample includes text data but no classification labeled for the text data, the target sample necessarily does not contain the target classification. In this case, the target sample can be updated directly according to the target classification, for example by labeling the text data with the target classification, so that the text data is labeled automatically. The updated target sample then includes the text data and the target classification and can be used in the training of the text classification model, which increases the amount of training sample data for the model.
As another example, when the target sample includes text data and a classification labeled for the text data, it is determined whether the target classification is contained in the target sample, that is, whether the classification labeled for the text data is the same as the target classification. If they are the same, the target sample already contains the target classification, which indicates that the labeled classification is accurate and no update is needed. If they are different, the target sample does not contain the target classification, which indicates that the labeled classification is inaccurate; the classification labeled for the text data can then be updated to the target classification. Updating the target sample in this way ensures the accuracy of the training samples subsequently used to train the text classification model and thus improves the accuracy of the text classification model.
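A minimal sketch of this update step, assuming for illustration that a target sample is represented as a dict with a "text" field and an optional "label" field (this representation is an assumption, not part of the disclosure):

```python
from typing import Optional

def update_target_sample(sample: dict, target_class: Optional[str]) -> dict:
    """Update a target sample according to the determined target classification.
    'target_class' is None when no target classification could be determined."""
    if target_class is None:
        return sample                      # nothing to do; may be sent for manual labeling
    if sample.get("label") == target_class:
        return sample                      # existing label already matches the target classification
    # Either the sample was unlabeled or its label disagrees with the target
    # classification: (re)label it so it can be used as a training sample.
    sample["label"] = target_class
    return sample
```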
In the above technical solution, a target sample is obtained from the training sample set of a text classification model and is input into the text classification model and a rule model respectively, so as to obtain a first classification result output by the text classification model and a second classification result output by the rule model; when a target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample, the target sample is updated according to the target classification. With this technical solution, the training sample set may contain both labeled and unlabeled text data, and part of the text data can be labeled automatically, so the amount of training data can be effectively increased, the labeling accuracy of the training samples can be ensured, and the workload of manual labeling can be reduced. In addition, the classifications of the labeled text data in the training sample set can be verified, which further improves the accuracy of the training samples and ensures the accuracy of the text classification model trained on this training sample set, thereby providing technical support for accurate review and filtering of text content, enabling safe publication of text content, and helping keep the network environment clean.
In order to enable those skilled in the art to better understand the technical solutions provided by the embodiments of the present disclosure, the above steps are described in detail below.
Optionally, the target classification is determined by:
in a first manner, when the type of the hit classification rule of the text data is a target type, determining the text classification corresponding to the hit classification rule as the target classification.
When the classification rule in the rule model is preset, the type of the classification rule can be set, and the type is used for representing whether the target classification can be determined only according to the classification rule when the classification rule is hit. For example, the types may include a blacklist type and an in-doubt type, where the blacklist type may be used as the target type, that is, to characterize that the target classification is determined only according to the text classification corresponding to the classification rule, so long as the type of the hit classification rule is the blacklist type, the text classification corresponding to the hit classification rule may be directly determined as the target classification. Aiming at the classification rule with the type of in-doubt, the target classification needs to be determined by combining the comprehensive analysis of the first classification result.
In a second manner, when the type of the hit classification rule is not the target type, the prediction classification is consistent with the text classification corresponding to the hit classification rule, and the confidence of the prediction classification is greater than a first confidence threshold, the prediction classification is determined as the target classification, wherein the first classification result further comprises the confidence of the prediction classification.
In this embodiment, the fact that the type of the hit classification rule is not the target type means that the target classification cannot be determined from the text classification corresponding to the hit rule alone; the first classification result must also be taken into account. When the prediction classification is consistent with the text classification corresponding to the hit rule, the classification determined by the text classification model and the classification determined by the rule model for the target sample are the same. If, in addition, the confidence of the prediction classification is greater than the first confidence threshold, the prediction classification output by the text classification model is relatively reliable, and the prediction classification, which is also the text classification corresponding to the hit rule, may be determined as the target classification. The first confidence threshold may be set according to the actual usage scenario, which is not limited by the present disclosure.
In a third manner, the prediction classification is determined as the target classification if the confidence of the prediction classification is greater than a second confidence threshold and the second classification result indicates that the text data hits no classification rule, wherein the second confidence threshold is greater than the first confidence threshold. The second confidence threshold may also be set according to the actual usage scenario, which is not limited by the present disclosure; because it is larger than the first confidence threshold, a prediction classification accepted in this manner is more reliable than one accepted in the second manner. For example, the second confidence threshold may be set to a relatively large value, so that a confidence greater than this threshold indicates that the prediction classification is highly credible.
In this embodiment, the second classification result indicating that the text data hits no classification rule means that the text data matches none of the classification rules in the rule model. In this case, if the confidence of the prediction classification output by the text classification model is greater than the second confidence threshold, which is in turn greater than the first confidence threshold, the prediction classification can be trusted and may be determined directly as the target classification.
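Pulling the three manners together, a sketch of the decision logic might look as follows. It reuses the hypothetical ClassificationRule from the earlier sketch, and the two threshold values are placeholders chosen only for illustration, not values fixed by the disclosure.

```python
from typing import Optional

FIRST_CONF_THRESHOLD = 0.6    # illustrative value
SECOND_CONF_THRESHOLD = 0.9   # illustrative value; must exceed the first threshold

def decide_target_classification(pred_class: str, pred_conf: float,
                                 hit_rule: Optional["ClassificationRule"]) -> Optional[str]:
    """Combine the model output (first classification result) with the rule hit
    (second classification result). Returns None when no target classification
    can be decided automatically."""
    # First manner: a hit on a blacklist-type (target type) rule decides directly.
    if hit_rule is not None and hit_rule.rule_type == "blacklist":
        return hit_rule.text_class
    # Second manner: in-doubt rule hit, model agrees, confidence above first threshold.
    if hit_rule is not None:
        if pred_class == hit_rule.text_class and pred_conf > FIRST_CONF_THRESHOLD:
            return pred_class
        return None
    # Third manner: no rule hit, but the model is confident above the second threshold.
    if pred_conf > SECOND_CONF_THRESHOLD:
        return pred_class
    return None
```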
Therefore, through the technical scheme, the target classification corresponding to the text data can be determined based on the text classification model and the rule model, the accuracy of the determined target classification can be ensured, and the false labeling in the manual labeling process can be effectively avoided. And the unlabeled text data can be accurately and automatically labeled, and data support is provided for accurately judging the classification of the labeled text data. In addition, the training sample of the text classification model can be subjected to data cleaning while the text classification model is trained, so that the accuracy of the training sample of the text classification model is ensured, the accuracy of the trained text classification model is further ensured, and the text classification and text auditing based on the text classification model can be conveniently carried out later.
Optionally, the method further comprises:
outputting the target sample when any one of the following conditions is met, wherein after the target sample is marked, the marked target sample is used as a training sample of the text classification model:
a first condition, the second classification result indicates that the text data does not hit the classification rule, and the confidence of the prediction classification in the first classification result is smaller than a third confidence threshold or the confidence belongs to a classification boundary confidence range, wherein the first classification result further comprises the confidence of the prediction classification.
As an example, the third confidence threshold is smaller than the first confidence threshold and may be set to a relatively small value; a prediction confidence below the third confidence threshold indicates that the prediction classification is unreliable and that the prediction determined for the text data is not accurate. If, at the same time, the second classification result indicates that the text data hits no classification rule, the classification of the text data cannot be determined from the second classification result either. In this case, the target sample can be labeled manually so that the classification corresponding to its text data is determined, and the manually labeled sample is then used for training the text classification model.
As another example, when text data is classified with a trained text classification model, a classification confidence threshold is usually set, and the classification output by the model is accepted as the classification of the text data only when its confidence is greater than this threshold. During training, therefore, training samples whose prediction confidence lies near the classification confidence threshold have a large influence on the accuracy of the text classification model. In this embodiment, a classification boundary confidence range can accordingly be determined from the classification confidence threshold. For example, if the classification confidence threshold is 75% and the floating range is 10%, the classification boundary confidence range is [70%, 80%]; when the second classification result indicates that the text data hits no classification rule and the confidence of the prediction classification is, say, 74%, the confidence falls within the classification boundary confidence range, so the target sample is output for manual labeling.
Under the second condition, the prediction classification is different from the text classification corresponding to the hit classification rule, and the type of the hit classification rule is not the target type.
In this embodiment, when the type of the hit classification rule is not the target type, the target classification of the text data cannot be determined from the text classification corresponding to the hit rule alone; and since the prediction classification differs from that text classification, two different classifications have been determined for the same text data. In this case, the target classification of the target sample can be determined by manual labeling, as illustrated in the sketch below.
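A sketch of these two output conditions, again using the hypothetical rule representation above; the third confidence threshold and the boundary range are example values, not values fixed by the disclosure.

```python
def needs_manual_labeling(pred_class, pred_conf, hit_rule,
                          third_conf_threshold=0.3,
                          boundary_range=(0.70, 0.80)):
    """Return True if the target sample should be output for manual labeling."""
    # First condition: no rule hit, and the model is either very unsure or its
    # confidence sits inside the classification boundary confidence range.
    if hit_rule is None:
        low_confidence = pred_conf < third_conf_threshold
        near_boundary = boundary_range[0] <= pred_conf <= boundary_range[1]
        return low_confidence or near_boundary
    # Second condition: a non target-type (in-doubt) rule is hit but the model disagrees.
    return hit_rule.rule_type != "blacklist" and pred_class != hit_rule.text_class
```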
In the technical scheme, the target sample which has a large influence on the accuracy of the text classification model or cannot determine the target classification of the text data is output for manual labeling, so that the accuracy of labeling of the target sample can be ensured, and the training efficiency and accuracy of the text classification model are further improved.
In an actual use scene, in the training process of the text classification model, whether the classification rule in the rule model is accurate or not can be determined according to the training process of the text classification model, so that the dynamic update of the classification rule is realized, the accuracy of a target sample is improved, and meanwhile, the accuracy of the text classification model can be improved in a feedback manner. Accordingly, the present disclosure also provides the following examples.
Optionally, the method further comprises:
For each classification rule, the hit rate and the error rate of the classification rule are determined, wherein the hit rate is the ratio of the number of hits of the classification rule to the total number of rule matches, and the error rate is the ratio of the number of times that, when a target sample hits the classification rule, the prediction classification obtained by inputting the target sample into the text classification model differs from the text classification corresponding to the classification rule, to the number of hits of the classification rule.
For example, suppose the rule model contains N classification rules. For classification rule R1, the number of hits is C1, and the total number of rule-matching operations performed by inputting target samples into the rule model is MC; the hit rate of R1 is then the ratio of C1 to MC. In addition, in the embodiments of the present disclosure each target sample is input into both the text classification model and the rule model, so whenever a target sample hits classification rule R1 it also receives a prediction classification from the text classification model. Whenever the text classification corresponding to R1 differs from that prediction classification, the two models disagree, and an error counter W (initially 0) is incremented by one; the error rate of R1 is then the ratio of W to C1. The hit rate and error rate of each classification rule can be determined in the same way, which is not repeated here.
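A sketch of this per-rule bookkeeping, following the C1 (hits), MC (total matches), and W (errors) notation above; the class and method names are illustrative only.

```python
from collections import defaultdict

class RuleStats:
    """Track per-rule hit rate and error rate while target samples are processed."""
    def __init__(self):
        self.total_matches = 0                 # MC: total rule-matching operations
        self.hits = defaultdict(int)           # C1 per rule name
        self.errors = defaultdict(int)         # W per rule name

    def record(self, hit_rule, pred_class):
        """Record one target sample: the rule it hit (or None) and the model's prediction."""
        self.total_matches += 1
        if hit_rule is None:
            return
        self.hits[hit_rule.name] += 1
        if pred_class != hit_rule.text_class:  # model and rule disagree
            self.errors[hit_rule.name] += 1

    def hit_rate(self, rule_name):
        return self.hits[rule_name] / self.total_matches if self.total_matches else 0.0

    def error_rate(self, rule_name):
        return self.errors[rule_name] / self.hits[rule_name] if self.hits[rule_name] else 0.0

    def rules_to_update(self, hit_threshold, error_threshold):
        """Rules whose hit rate and error rate both exceed their thresholds."""
        return [name for name in self.hits
                if self.hit_rate(name) > hit_threshold
                and self.error_rate(name) > error_threshold]
```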
A classification rule to be updated is then output, where a classification rule to be updated is one whose hit rate is greater than a hit threshold and whose error rate is greater than an error threshold.
The hit threshold and the error threshold may be set according to the actual usage scenario, which is not limited in the present disclosure. A hit rate greater than the hit threshold means that text data in the training sample set hits the classification rule frequently, and an error rate greater than the error threshold means that the prediction classification obtained from the text classification model frequently differs from the text classification corresponding to the rule, which suggests that the rule may be inaccurate and needs to be updated.
For example, the classification rule to be updated may be output through a display interface or exported as a list, and the text data that hit the rule may be output at the same time so that a user can further analyze the rule against that text data. The user can then trigger an update instruction by entering or importing a rule through the display interface; in response to the update instruction for the classification rule to be updated, the rule is updated, for example by replacing the classification rule to be updated with the rule indicated by the update instruction.
Therefore, through the technical scheme, the classification rules in the rule model can be dynamically updated, on one hand, the accuracy of the classification rules can be improved, meanwhile, the accuracy of target sample labeling can be further improved, data support is provided for obtaining an accurate text classification model, and therefore the accuracy and efficiency of text data classification and auditing can be improved, and the user experience is improved.
Optionally, the method further comprises:
A sample to be processed is output, where the sample to be processed is a sample whose prediction classification has a confidence greater than a preset threshold and whose second classification result indicates that no classification rule is hit.
The preset threshold may be the second confidence threshold or may be set according to the actual usage scenario. A prediction confidence above the preset threshold indicates that the prediction classification is reliable, and yet the sample hits no classification rule; this situation may be caused by the classification rules in the rule model being incomplete.
In this embodiment, the sample to be processed may be output and displayed to the user, who can analyze it to determine the classification rule it corresponds to. For example, the user may input or import the classification rule through a visual interface, thereby triggering a rule setting instruction; in response to the rule setting instruction for the sample to be processed, the classification rule indicated by the instruction is added to the rule model.
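A sketch of selecting such samples and applying a rule setting instruction, reusing the hypothetical RuleModel above and assuming a classifier callable that returns a (predicted class, confidence) pair; the names and the threshold are illustrative only.

```python
def candidate_samples_for_new_rules(samples, rule_model, classifier,
                                    preset_threshold=0.9):
    """Select 'samples to be processed': high-confidence predictions that hit no rule."""
    candidates = []
    for sample in samples:
        pred_class, pred_conf = classifier(sample["text"])
        if pred_conf > preset_threshold and rule_model.classify(sample["text"]) is None:
            candidates.append((sample, pred_class, pred_conf))
    return candidates

def apply_rule_setting_instruction(rule_model, new_rule):
    """Add the classification rule indicated by a rule setting instruction to the rule model."""
    rule_model.rules.append(new_rule)
```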
With this technical solution, new classification rules are added to the rule model, which ensures the comprehensiveness of the classification rules and widens the applicability of the text data processing method. It also increases the hit rate of the classification rules, improves the efficiency of determining target classifications for text data and of cleaning the training samples, and facilitates automatic labeling of the samples in the training sample set.
Optionally, the method further comprises:
and under the condition that the target sample comprises text data and a classification marked for the text data, determining a loss value of the text classification model according to the prediction classification corresponding to the target sample and the classification marked for the text data.
And updating the text classification model according to the loss value under the condition that the loss value is larger than a classification threshold value.
In this embodiment, the target sample includes text data and a classification labeled for the text data, i.e., the target sample is a labeled sample, and after the target sample is input into the text classification model to obtain a prediction classification, the text classification model may be trained according to the prediction classification and the classification labeled for the text data.
The loss value may be determined with an existing loss function. If the determined loss value is greater than the classification threshold, the accuracy of the text classification model does not yet meet the training standard, and the model is updated according to the loss value, for example by gradient descent. After the update, a target sample can be obtained from the training sample set again and the above steps repeated, so that the text classification model keeps learning new features and its accuracy improves. Training of the text classification model can be regarded as complete when the loss value is less than or equal to the classification threshold, and text data can then be classified and reviewed accurately based on the trained model.
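A minimal training-step sketch, assuming for illustration that the text classification model is a PyTorch classifier taking already-encoded inputs; the cross-entropy loss and the classification threshold value are example choices, not requirements of the disclosure.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, batch_inputs, batch_labels,
                  classification_threshold=0.05):
    """One labeled-sample update: compute the loss between the prediction
    classification and the labeled classification, and update the model only
    while the loss exceeds the classification threshold."""
    logits = model(batch_inputs)                      # prediction classifications
    loss = F.cross_entropy(logits, batch_labels)      # compare with labeled classes
    if loss.item() > classification_threshold:        # accuracy not yet at training standard
        optimizer.zero_grad()
        loss.backward()                               # gradient-descent style update
        optimizer.step()
    return loss.item()
```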
Therefore, through the technical scheme, the target sample can be cleaned, marked and the text classification model can be trained synchronously, so that the efficiency of the trained text classification model can be improved, the training time of the text classification model is saved, the high quality and the data volume of the target sample can be effectively ensured, the accuracy of the text classification model can be improved, and the accuracy of text data processing can be improved.
The present disclosure also provides a method of classifying text data, the method may include:
acquiring text data to be classified;
inputting the text data to be classified into a text classification model to obtain the classification of the text data to be classified, wherein a training sample set corresponding to the text classification model is generated by any one of the text data processing methods.
Therefore, through the technical scheme, the text classification model for accurately classifying can be obtained, so that the text data to be classified can be accurately classified, the accuracy of classification results can be ensured, technical support can be provided for accurately auditing and filtering the text content, safe release of the text content is realized, the network environment is purified, and the user experience is improved.
Optionally, the text data to be classified is derived from any one of the following: news information, instant messaging messages, short text;
the classification of the text data includes an abnormal text category;
The inputting of the text data to be classified into the text classification model to obtain the classification of the text data to be classified includes:
inputting the text data to be classified into the text classification model to obtain the text data to be classified corresponding to the abnormal text category.
The manner of obtaining the text data has been described in detail above and is not repeated here. For example, in order to keep the network environment clean, news information usually needs to be reviewed before publication to determine whether it contains abnormal text, for example politically sensitive information or vulgar text that should be filtered out, that is, whether the news information is abnormal text; low-quality content such as clickbait and duplicated text may also be reviewed to improve the quality of data in the network environment. Therefore, in this embodiment the classification of the text data may include an abnormal text category, where the abnormal text category may represent a sensitive text type, that is, text containing sensitive information, and/or a low-quality content category.
Therefore, in this technical solution, text data to be classified, such as news information, can be input into the text classification model so that the model classifies it; when the news information is determined to contain abnormal text, that is, when it is classified into the abnormal text category, the news information is output. In this way, the text content of news information can be reviewed and filtered automatically, the accuracy of the review is ensured, the workload of reviewers is effectively reduced, and the efficiency of text data review is improved.
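A sketch of such an automated review step, assuming a trained classifier callable that returns a (predicted class, confidence) pair and an "abnormal" class label; the names and the confidence threshold are illustrative only.

```python
def review_news(news_items, classifier, confidence_threshold=0.75):
    """Split news items into abnormal text (held back for auditing) and publishable text."""
    abnormal, publishable = [], []
    for text in news_items:
        pred_class, pred_conf = classifier(text)
        if pred_class == "abnormal" and pred_conf > confidence_threshold:
            abnormal.append(text)          # flagged for auditing / filtering
        else:
            publishable.append(text)
    return abnormal, publishable
```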
The present disclosure further provides a processing apparatus for text data, as shown in fig. 2, the apparatus 10 includes:
a first obtaining module 100 configured to obtain a target sample from a training sample set of a text classification model, wherein the target sample comprises text data, or comprises text data and a classification annotated for the text data;
a first input module 200 configured to input the target sample into a text classification model and a rule model respectively, and obtain a first classification result output by the text classification model and a second classification result output by the rule model, wherein the first classification result comprises a prediction classification corresponding to the text data, the rule model comprises a plurality of classification rules, each classification rule corresponds to the same or different text classification, and the second classification result is used for indicating a classification rule hit by the text data;
a first updating module 300 is configured to update the target sample according to the target classification when it is determined that the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample.
Optionally, the text data is derived from any one of the following: news information, instant messaging messages, short text;
the classification of the text data includes an abnormal text category.
Optionally, the apparatus further comprises:
a first determination module configured to determine the target classification by:
determining a text classification corresponding to the hit classification rule as the target classification under the condition that the type of the classification rule hit by the text data is the target type;
determining the prediction classification as the target classification when the type of the hit classification rule is not the target type, the prediction classification is consistent with the text classification corresponding to the hit classification rule, and the confidence of the prediction classification is greater than a first confidence threshold, wherein the first classification result further comprises the confidence of the prediction classification;
in the event that the confidence of the predicted classification is greater than a second confidence threshold, and the second classification result indicates that the text data misses a classification rule, the predicted classification is determined to be the target classification, wherein the second confidence threshold is greater than the first confidence threshold.
Optionally, the apparatus further comprises:
the first output module is configured to output the target sample when any one of the following conditions is met, wherein after the target sample is marked, the marked sample is used as a training sample of the text classification model:
the second classification result indicates that the text data does not hit the classification rule, and the confidence of the prediction classification in the first classification result is smaller than a third confidence threshold or the confidence belongs to a classification boundary confidence range, wherein the first classification result further comprises the confidence of the prediction classification;
the predictive classification is different from the text classification corresponding to the hit classification rule, and the text classification corresponding to the hit classification rule is not the target type.
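One possible reading of these two output conditions is sketched below; the third confidence threshold, the classification boundary confidence range, and the result structures are illustrative assumptions.

THIRD_CONFIDENCE_THRESHOLD = 0.6          # illustrative value
BOUNDARY_CONFIDENCE_RANGE = (0.45, 0.55)  # illustrative classification boundary range

def needs_manual_annotation(first_result, second_result, target_type="blacklist"):
    """Decide whether the target sample should be output for manual labeling."""
    prediction = first_result["label"]
    confidence = first_result["confidence"]
    hit_rule = second_result["hit_rule"]

    low, high = BOUNDARY_CONFIDENCE_RANGE
    # Condition 1: no rule is hit and the model is not confident, or the
    # confidence falls into the classification boundary range.
    if hit_rule is None and (confidence < THIRD_CONFIDENCE_THRESHOLD or low <= confidence <= high):
        return True
    # Condition 2: a rule is hit, the rule is not of the target type, and the
    # prediction classification differs from the rule's text classification.
    if hit_rule is not None and hit_rule["type"] != target_type and prediction != hit_rule["label"]:
        return True
    return False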
Optionally, the apparatus further comprises:
a second determining module configured to determine, for each classification rule, a hit rate and an error rate of the classification rule, wherein the hit rate is the ratio of the number of times the classification rule is hit to the total number of rule matches, and the error rate is the ratio of the number of times a target sample hits the classification rule while the prediction classification obtained by inputting the target sample into the text classification model differs from the text classification corresponding to the classification rule, to the number of times the classification rule is hit;
The second output module is configured to output a classification rule to be updated, wherein the classification rule to be updated is a classification rule whose hit rate is greater than a hit threshold and whose error rate is greater than an error threshold;
and the second updating module is configured to respond to an updating instruction for the classification rule to be updated and update the classification rule to be updated.
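For illustration, the hit rate and error rate statistics might be computed as in the following sketch; the counter names and the threshold values are assumptions made for the example.

def find_rules_to_update(rule_stats, hit_threshold=0.1, error_threshold=0.3):
    """Return the classification rules that should be output for updating.

    rule_stats maps a rule identifier to counters collected during matching:
    "hits" is the number of times the rule was hit, "total_matches" is the total
    number of rule matches, and "mismatches" counts hits whose prediction
    classification differed from the rule's text classification.
    """
    to_update = []
    for rule_id, stats in rule_stats.items():
        if stats["total_matches"] == 0 or stats["hits"] == 0:
            continue
        hit_rate = stats["hits"] / stats["total_matches"]
        error_rate = stats["mismatches"] / stats["hits"]
        # A frequently hit rule that often disagrees with the model is a candidate for update.
        if hit_rate > hit_threshold and error_rate > error_threshold:
            to_update.append(rule_id)
    return to_update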
Optionally, the apparatus further comprises:
The third output module is configured to output a sample to be processed, wherein the sample to be processed is a sample whose corresponding prediction classification has a confidence greater than a preset threshold and whose corresponding second classification result indicates that no classification rule is hit;
an adding module configured to respond to a rule setting instruction for the sample to be processed and add a classification rule indicated by the rule setting instruction to the rule model.
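The cooperation of the third output module and the adding module could, for example, be sketched as follows; the sample and result structures and the rule_model.add interface are hypothetical.

def collect_rule_candidates(samples, results, confidence_threshold=0.9):
    """Output samples that hit no classification rule but were classified with
    high confidence, so that a classification rule can be set for them."""
    candidates = []
    for sample, (first_result, second_result) in zip(samples, results):
        if second_result["hit_rule"] is None and first_result["confidence"] > confidence_threshold:
            candidates.append(sample)
    return candidates

def add_rule(rule_model, rule_setting):
    """Add the classification rule indicated by a rule setting instruction to the rule model."""
    rule_model.add(rule_setting)  # e.g. a keyword or regex pattern plus its text classification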
Optionally, the apparatus further comprises:
a third determining module configured to determine a loss value of the text classification model according to a predicted classification corresponding to the target sample and the classification labeled for the text data, in a case where the target sample includes text data and a classification labeled for the text data;
And a third updating module configured to update the text classification model according to the loss value if the loss value is greater than a classification threshold.
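A simplified reading of the loss-based update is sketched below, assuming a two-class setting and a hypothetical text_model interface; the disclosure does not fix a particular loss function, so the cross-entropy style loss used here is only an example.

import math

def maybe_update_model(text_model, sample, classification_threshold=0.1):
    """Compute a loss value for a labeled target sample and update the model when it is large."""
    if "label" not in sample:
        return  # only samples with an annotated classification contribute a loss value

    predicted_label, confidence = text_model.predict_proba(sample["text"])
    # Probability assigned to the annotated classification (two classes assumed).
    prob_of_true_label = confidence if predicted_label == sample["label"] else 1.0 - confidence
    loss = -math.log(max(prob_of_true_label, 1e-12))

    if loss > classification_threshold:
        text_model.update(sample["text"], sample["label"], loss)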
The present disclosure also provides a text data classifying apparatus, the apparatus including:
the second acquisition module is configured to acquire text data to be classified;
the second input module is configured to input the text data to be classified into a text classification model to obtain the classification of the text data to be classified, wherein the training sample set corresponding to the text classification model is generated by the text data processing method described above.
Optionally, the text data to be classified is derived from any one of the following: news information, instant messaging messages, short text;
the classification of the text data includes an abnormal text category;
the second input module includes:
inputting the text data to be classified into the text classification model to obtain the text data to be classified corresponding to the abnormal text category.
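For illustration, auditing with the trained text classification model could then be sketched as follows; the "abnormal" label name and the model interface are assumptions.

def filter_abnormal(texts, text_model, abnormal_label="abnormal"):
    """Classify the texts to be audited and return those in the abnormal text category."""
    abnormal = []
    for text in texts:
        predicted_label, _confidence = text_model.predict_proba(text)
        if predicted_label == abnormal_label:
            abnormal.append(text)
    return abnormal

# Usage example: flagged = filter_abnormal(news_items, trained_text_model)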
The specific manner in which the respective modules of the apparatuses in the above embodiments perform operations has been described in detail in the embodiments of the method, and is not repeated here.
The present disclosure also provides a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the text data processing method or the text data classifying method provided by the present disclosure.
Fig. 3 is a block diagram illustrating a text data processing apparatus 800 according to an exemplary embodiment. For example, apparatus 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 3, apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the apparatus 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the text data processing method described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
The power component 806 provides power to the various components of the device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 800 is in an operational mode, such as a photographing mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focus and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the apparatus 800. For example, the sensor assembly 814 may detect an on/off state of the device 800 and a relative positioning of components, such as a display and a keypad of the device 800. The sensor assembly 814 may also detect a change in position of the device 800 or of a component of the device 800, the presence or absence of user contact with the device 800, an orientation or acceleration/deceleration of the device 800, and a change in temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements, for performing the above-described text data processing method.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as the memory 804 including instructions executable by the processor 820 of the apparatus 800 to perform the above-described text data processing method. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned text data processing method when being executed by the programmable apparatus.
Fig. 4 is a block diagram illustrating a text data processing apparatus 1900 according to an example embodiment. For example, the apparatus 1900 may be provided as a server. Referring to fig. 4, the apparatus 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that are executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the processing methods of text data described above.
The apparatus 1900 may further include a power component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

1. A method for processing text data, comprising:
obtaining a target sample from a training sample set of a text classification model, wherein the target sample comprises text data or comprises the text data and a classification marked for the text data;
respectively inputting the target sample into a text classification model and a rule model to obtain a first classification result output by the text classification model and a second classification result output by the rule model, wherein the first classification result comprises a prediction classification corresponding to the text data, the rule model comprises a plurality of classification rules, each classification rule corresponds to the same or different text classification, and the second classification result is used for indicating the classification rule hit by the text data;
And updating the target sample according to the target classification when the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample.
2. The method of claim 1, wherein the text data is derived from any one of: news information, instant messaging messages, short text;
the classification of the text data includes an abnormal text category.
3. The method of claim 1, wherein the target classification is determined by:
determining a text classification corresponding to the hit classification rule as the target classification under the condition that the type of the classification rule hit by the text data is the target type;
determining the prediction classification as the target classification when the text classification corresponding to the hit classification rule is not the target type, the prediction classification is consistent with the text classification corresponding to the hit classification rule, and the confidence of the prediction classification is greater than a first confidence threshold, wherein the first classification result further comprises the confidence of the prediction classification;
In the event that the confidence of the predicted classification is greater than a second confidence threshold, and the second classification result indicates that the text data misses a classification rule, the predicted classification is determined to be the target classification, wherein the second confidence threshold is greater than the first confidence threshold.
4. The method according to claim 1, wherein the method further comprises:
outputting the target sample when any one of the following conditions is met, wherein after the target sample is marked, the marked sample is used as a training sample of the text classification model:
the second classification result indicates that the text data does not hit the classification rule, and the confidence of the prediction classification in the first classification result is smaller than a third confidence threshold or the confidence belongs to a classification boundary confidence range, wherein the first classification result further comprises the confidence of the prediction classification;
the predictive classification is different from the text classification corresponding to the hit classification rule, and the text classification corresponding to the hit classification rule is not the target type.
5. The method according to claim 1, wherein the method further comprises:
determining, for each classification rule, a hit rate and an error rate of the classification rule, wherein the hit rate is the ratio of the number of times the classification rule is hit to the total number of rule matches, and the error rate is the ratio of the number of times a target sample hits the classification rule while the prediction classification obtained by inputting the target sample into the text classification model differs from the text classification corresponding to the classification rule, to the number of times the classification rule is hit;
outputting a classification rule to be updated, wherein the classification rule to be updated is a classification rule that the hit rate is greater than a hit threshold value and the error rate is greater than an error threshold value;
and responding to an updating instruction aiming at the classification rule to be updated, and updating the classification rule to be updated.
6. The method according to claim 1, wherein the method further comprises:
outputting a sample to be processed, wherein the sample to be processed is a sample whose corresponding prediction classification has a confidence greater than a preset threshold and whose corresponding second classification result indicates that no classification rule is hit;
in response to a rule setting instruction for the sample to be processed, adding a classification rule indicated by the rule setting instruction to the rule model.
7. The method according to claim 1, wherein the method further comprises:
determining a loss value of the text classification model according to the prediction classification corresponding to the target sample and the classification marked for the text data under the condition that the target sample comprises the text data and the classification marked for the text data;
and updating the text classification model according to the loss value under the condition that the loss value is larger than a classification threshold value.
8. A method of classifying text data, the method comprising:
acquiring text data to be classified;
inputting the text data to be classified into a text classification model to obtain the classification of the text data to be classified, wherein a training sample set corresponding to the text classification model is generated by the processing method of any one of claims 1-7.
9. The method of claim 8, wherein the text data to be classified is derived from any one of: news information, instant messaging messages, short text;
the classification of the text data includes an abnormal text category;
inputting the text data to be classified into a text classification model to obtain the classification of the text data to be classified, wherein the method comprises the following steps:
Inputting the text data to be classified into the text classification model to obtain the text data to be classified corresponding to the abnormal text category.
10. A processing apparatus for text data, comprising:
a first acquisition module configured to acquire a target sample from a training sample set of a text classification model, wherein the target sample comprises text data or comprises text data and a classification annotated for the text data;
the first input module is configured to input the target sample into a text classification model and a rule model respectively, and obtain a first classification result output by the text classification model and a second classification result output by the rule model, wherein the first classification result comprises a prediction classification corresponding to the text data, the rule model comprises a plurality of classification rules, each classification rule corresponds to the same or different text classification, and the second classification result is used for indicating the classification rule hit by the text data;
and the first updating module is configured to update the target sample according to the target classification when the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample.
11. A text data classification apparatus, the apparatus comprising:
the second acquisition module is configured to acquire text data to be classified;
the second input module is configured to input the text data to be classified into a text classification model to obtain the classification of the text data to be classified, wherein the training sample set corresponding to the text classification model is generated by the method for processing text data according to any one of claims 1-7.
12. A processing apparatus for text data, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
obtaining a target sample from a training sample set of a text classification model, wherein the target sample comprises text data or comprises the text data and a classification marked for the text data;
respectively inputting the target sample into a text classification model and a rule model to obtain a first classification result output by the text classification model and a second classification result output by the rule model, wherein the first classification result comprises a prediction classification corresponding to the text data, the rule model comprises a plurality of classification rules, each classification rule corresponds to the same or different text classification, and the second classification result is used for indicating the classification rule hit by the text data;
And updating the target sample according to the target classification when the target classification corresponding to the text data is determined according to the first classification result and the second classification result and the target classification is not included in the target sample.
13. A text data classification apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring text data to be classified;
inputting the text data to be classified into a text classification model to obtain the classification of the text data to be classified, wherein a training sample set corresponding to the text classification model is generated by the processing method of any one of claims 1-7.
14. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the method of any of claims 1-9.
CN202010556440.0A 2020-06-17 2020-06-17 Text data processing method, text data classifying device and readable storage medium Active CN111813932B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010556440.0A CN111813932B (en) 2020-06-17 2020-06-17 Text data processing method, text data classifying device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010556440.0A CN111813932B (en) 2020-06-17 2020-06-17 Text data processing method, text data classifying device and readable storage medium

Publications (2)

Publication Number Publication Date
CN111813932A CN111813932A (en) 2020-10-23
CN111813932B true CN111813932B (en) 2023-11-14

Family

ID=72844808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010556440.0A Active CN111813932B (en) 2020-06-17 2020-06-17 Text data processing method, text data classifying device and readable storage medium

Country Status (1)

Country Link
CN (1) CN111813932B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966103B (en) * 2021-02-05 2022-04-19 成都信息工程大学 Mixed attention mechanism text title matching method based on multi-task learning
CN114547317B (en) * 2022-04-28 2022-07-08 飞狐信息技术(天津)有限公司 Text auditing method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018166499A1 (en) * 2017-03-17 2018-09-20 腾讯科技(深圳)有限公司 Text classification method and device, and storage medium
CN110580290A (en) * 2019-09-12 2019-12-17 北京小米智能科技有限公司 method and device for optimizing training set for text classification
WO2020087974A1 (en) * 2018-10-30 2020-05-07 北京字节跳动网络技术有限公司 Model generation method and device
CN111159412A (en) * 2019-12-31 2020-05-15 腾讯科技(深圳)有限公司 Classification method and device, electronic equipment and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cross-domain text sentiment classification based on stepwise optimization of classification models; Zhang Jun; Wang Suge; Computer Science (07); full text *

Also Published As

Publication number Publication date
CN111813932A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
EP3173948A1 (en) Method and apparatus for recommendation of reference documents
CN111539443B (en) Image recognition model training method and device and storage medium
CN110874145A (en) Input method and device and electronic equipment
EP3734472A1 (en) Method and device for text processing
CN111813932B (en) Text data processing method, text data classifying device and readable storage medium
CN111046927B (en) Method and device for processing annotation data, electronic equipment and storage medium
CN113779257A (en) Method, device, equipment, medium and product for analyzing text classification model
CN113920293A (en) Information identification method and device, electronic equipment and storage medium
CN116069612A (en) Abnormality positioning method and device and electronic equipment
CN110110046B (en) Method and device for recommending entities with same name
CN111222316B (en) Text detection method, device and storage medium
CN110738267B (en) Image classification method, device, electronic equipment and storage medium
CN111079421B (en) Text information word segmentation processing method, device, terminal and storage medium
CN110650364B (en) Video attitude tag extraction method and video-based interaction method
CN109842688B (en) Content recommendation method and device, electronic equipment and storage medium
CN107301188B (en) Method for acquiring user interest and electronic equipment
CN108108356B (en) Character translation method, device and equipment
CN110213062B (en) Method and device for processing message
CN109145151B (en) Video emotion classification acquisition method and device
CN107526683B (en) Method and device for detecting functional redundancy of application program and storage medium
CN111414731B (en) Text labeling method and device
CN114338587B (en) Multimedia data processing method and device, electronic equipment and storage medium
CN111428806B (en) Image tag determining method and device, electronic equipment and storage medium
CN112036507B (en) Training method and device of image recognition model, storage medium and electronic equipment
CN112711643B (en) Training sample set acquisition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant