CN109739989B - Text classification method and computer equipment - Google Patents

Text classification method and computer equipment

Info

Publication number
CN109739989B
Authority
CN
China
Prior art keywords
text data
sub
text
category
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811653926.5A
Other languages
Chinese (zh)
Other versions
CN109739989A (en)
Inventor
李斌
禹庆华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qianxin Technology Group Co Ltd
Original Assignee
Qianxin Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qianxin Technology Group Co Ltd
Priority to CN201811653926.5A
Publication of CN109739989A
Application granted
Publication of CN109739989B
Legal status: Active
Anticipated expiration

Abstract

The present disclosure provides a text classification method, including: acquiring a text to be classified; obtaining a first classification result based on the full text data of the text to be classified; obtaining a second classification result based on one or more sub-text data extracted from the full-text data; and determining the classification result of the text to be classified according to the first classification result and the second classification result. The present disclosure also provides a computer device.

Description

Text classification method and computer equipment
Technical Field
The present disclosure relates to a text classification method and a computer device.
Background
Text classification is the process of determining text categories from text content under a given classification system. Text classification is an important part of natural language processing and has wide application, including news classification, mail classification, spam recognition, illegal web page recognition and the like.
Existing text classification schemes classify texts based on their entire content. Because the entire content of a text contains a large amount of interference information irrelevant to classification, the discriminative features of text classification are buried in this interference, and an accurate classification result cannot be obtained.
Disclosure of Invention
One aspect of the present disclosure provides a text classification method, including: acquiring a text to be classified; obtaining a first classification result based on the full text data of the text to be classified; obtaining a second classification result based on one or more sub-text data extracted from the full-text data; and determining the classification result of the text to be classified according to the first classification result and the second classification result.
Optionally, the obtaining a first classification result based on the full text data of the text to be classified includes: inputting the full-text data into a full-text classification model corresponding to a plurality of preset categories, determining a first score of the full-text data on each preset category in the plurality of preset categories based on the full-text classification model, and taking the preset category with the highest first score as the category corresponding to the full-text data.
Optionally, before obtaining the second classification result based on one or more sub-text data extracted from the full-text data, the method further includes: one or more sub-text data are extracted from the full-text data. The extracting one or more sub-text data from the full-text data includes: matching the keywords in a preset keyword set with the full text data; for a first keyword which is successfully matched, extracting a character string with a first preset length before the first keyword and/or a character string with a second preset length after the first keyword from the full text data; and combining the extracted character string and the first keyword into sub-text data according to the position sequence in the full-text data.
Optionally, the obtaining a second classification result based on one or more sub-text data extracted from the full-text data includes: for first sub-text data, inputting the first sub-text data into a sub-text classification model corresponding to the plurality of preset categories, and determining a second score of the first sub-text data for each of the plurality of preset categories based on the sub-text classification model; and calculating a third score of the one or more sub-text data for each preset category based on the second score of each sub-text data in the one or more sub-text data for each preset category, and taking the preset category with the highest third score as the category corresponding to the one or more sub-text data.
Optionally, the calculating a third score of the one or more sub-text data for each preset category based on the second score of each sub-text data of the one or more sub-text data for each preset category includes: and for any preset category in the plurality of preset categories, carrying out weighted summation on the second scores of the sub-text data about the preset category to obtain a third score of the one or more sub-text data about the preset category.
Optionally, the determining the classification result of the text to be classified according to the first classification result and the second classification result includes: and calculating to obtain a comprehensive score of the text to be classified about each preset category according to the first score of the full text data about each preset category and the third score of the one or more sub text data about each preset category, and taking the preset category with the highest comprehensive score as the category corresponding to the text to be classified.
Optionally, the calculating a composite score of the text to be classified about each preset category according to the first score of the full text data about each preset category and the third score of the one or more sub-text data about each preset category includes: setting a first weight corresponding to the full-text data and a second weight corresponding to the one or more sub-text data; and for any preset category in the plurality of preset categories, according to the first weight and the second weight, carrying out weighted summation on the first score of the full text data about the preset category and the third score of the one or more sub text data about the preset category to obtain the comprehensive score of the text to be classified about the preset category.
Optionally, the obtaining a second classification result based on one or more sub-text data extracted from the full-text data includes: for first sub-text data, inputting the first sub-text data into a sub-text classification model corresponding to the plurality of preset categories, determining a score of the first sub-text data with respect to each preset category based on the sub-text classification model, and taking the preset category with the highest score as the category corresponding to the first sub-text data. When a first category exists among the categories corresponding to the sub-text data in the one or more sub-text data, the category corresponding to the one or more sub-text data is determined to be the first category; and when the categories corresponding to all of the sub-text data in the one or more sub-text data are a second category, the category corresponding to the one or more sub-text data is determined to be the second category.
Optionally, the preset categories include a first category and a second category. The determining the classification result of the text to be classified according to the first classification result and the second classification result includes: when the category corresponding to the full text data and the category corresponding to the one or more sub-text data are both a second category, determining that the category corresponding to the text to be classified is the second category; and when the category corresponding to the full text data and/or the category corresponding to the one or more sub-text data is a first category, determining that the category corresponding to the text to be classified is the first category.
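This two-category combination rule amounts to a logical OR over the two classifiers: the text is assigned the first category if either the full-text result or the sub-text result flags it, and the second category only when both agree. A minimal sketch follows; the category names "illegal" and "legal" are illustrative stand-ins for the first and second categories, not terms prescribed by the disclosure:

```python
def combine_binary(full_text_category, sub_text_category,
                   first_category="illegal", second_category="legal"):
    """Two-category combination rule: the text belongs to the second
    category only when both the full-text result and the sub-text result
    agree on it; otherwise it belongs to the first category.
    Category names here are illustrative.
    """
    if full_text_category == first_category or sub_text_category == first_category:
        return first_category
    return second_category
```

In a filtering scenario (e.g. illegal web page recognition), this rule favors recall: a single positive signal from either branch is enough to flag the text.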
Another aspect of the present disclosure provides a text classification apparatus, including: the device comprises an acquisition module, a first classification module, a second classification module and a comprehensive classification module. The acquisition module is used for acquiring the text to be classified. The first classification module is used for obtaining a first classification result based on the full text data of the text to be classified. The second classification module is used for obtaining a second classification result based on one or more sub-text data extracted from the full text data. And the comprehensive classification module is used for determining the classification result of the text to be classified according to the first classification result and the second classification result.
Optionally, the first classification module is configured to input the full-text data into a full-text classification model corresponding to a plurality of preset categories, determine, based on the full-text classification model, a first score of the full-text data with respect to each preset category in the plurality of preset categories, and take a preset category with a highest first score as a category corresponding to the full-text data.
Optionally, the apparatus further includes a sub-text extraction module, configured to extract one or more sub-text data from the full-text data before the second classification module obtains a second classification result based on the one or more sub-text data extracted from the full-text data. The sub-text extraction module includes a matching sub-module, an extraction sub-module and a combination sub-module. The matching sub-module is used for matching the keywords in the preset keyword set with the full-text data. The extraction sub-module is used for extracting, for a first keyword that is successfully matched, a character string with a first preset length before the first keyword and/or a character string with a second preset length after the first keyword from the full-text data. The combination sub-module is used for combining the extracted character strings and the first keyword into sub-text data according to their positional order in the full-text data.
Optionally, the second classification module comprises a first prediction sub-module and a computation sub-module. The first prediction sub-module is configured to, for a first sub-text data, input the first sub-text data into a sub-text classification model corresponding to the plurality of preset categories, and determine a second score for the first sub-text data with respect to each of the plurality of preset categories based on the sub-text classification model. And the calculation sub-module is used for calculating a third score of each preset category of the one or more sub-text data based on a second score of each preset category of each sub-text data in the one or more sub-text data, and taking the preset category with the highest third score as the category corresponding to the one or more sub-text data.
Optionally, the calculating sub-module is specifically configured to, for any preset category in the plurality of preset categories, perform weighted summation on the second scores of the sub-text data about the preset category to obtain a third score of the one or more sub-text data about the preset category.
Optionally, the comprehensive classification module includes a comprehensive calculation sub-module, configured to calculate, according to the first score of the full text data with respect to each preset category and the third score of the one or more sub-text data with respect to each preset category, a comprehensive score of the text to be classified with respect to each preset category, and use the preset category with the highest comprehensive score as the category corresponding to the text to be classified.
Optionally, the comprehensive calculation sub-module is configured to set a first weight corresponding to the full-text data and a second weight corresponding to the one or more sub-text data; and for any preset category in the plurality of preset categories, according to the first weight and the second weight, carrying out weighted summation on the first score of the full text data about the preset category and the third score of the one or more sub text data about the preset category to obtain the comprehensive score of the text to be classified about the preset category.
Optionally, the second classification module comprises a second prediction sub-module and a first determination sub-module. The second prediction sub-module is used for inputting, for first sub-text data, the first sub-text data into the sub-text classification model corresponding to the plurality of preset categories, determining a score of the first sub-text data with respect to each preset category based on the sub-text classification model, and taking the preset category with the highest score as the category corresponding to the first sub-text data. The first determination sub-module is used for determining the category corresponding to the one or more sub-text data to be a first category when the first category exists among the categories corresponding to the sub-text data in the one or more sub-text data; and determining the category corresponding to the one or more sub-text data to be a second category when the categories corresponding to all of the sub-text data in the one or more sub-text data are the second category.
Optionally, the preset categories include a first category and a second category. The comprehensive classification module comprises a second determining submodule and is used for determining that the category corresponding to the text to be classified is a second category when the category corresponding to the full text data and the category corresponding to the one or more sub-text data are both the second category; and when the category corresponding to the full text data and/or the category corresponding to the one or more sub-text data is a first category, determining that the category corresponding to the text to be classified is the first category.
Another aspect of the present disclosure provides a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method as described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of the disclosure provides a computer program comprising computer executable instructions for implementing the method as described above when executed.
Drawings
For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario of a text classification method and a computer device according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow diagram of a text classification method according to an embodiment of the disclosure;
FIG. 3A schematically illustrates a schematic diagram of a text classification process according to an embodiment of the disclosure;
FIG. 3B schematically shows a schematic diagram of a text classification process according to another embodiment of the present disclosure;
FIG. 4 schematically shows a block diagram of a text classification apparatus according to an embodiment of the present disclosure;
FIG. 5 schematically shows a block diagram of a text classification apparatus according to another embodiment of the present disclosure; and
FIG. 6 schematically shows a block diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B, and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). Where a convention analogous to "at least one of A, B, or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
Some block diagrams and/or flow diagrams are shown in the figures. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, which execute via the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks. The techniques of this disclosure may be implemented in hardware and/or software (including firmware, microcode, etc.). In addition, the techniques of this disclosure may take the form of a computer program product on a computer-readable storage medium having instructions stored thereon for use by or in connection with an instruction execution system.
Embodiments of the present disclosure provide a text classification method and a computer device capable of applying the method. The method comprises a text acquisition stage, a classification stage and a comprehensive processing stage. In the text acquisition stage, the text to be classified is acquired. In the classification stage, a first classification result corresponding to the full text data of the text to be classified and a second classification result corresponding to one or more sub-text data extracted from the full text data are obtained respectively. In the comprehensive processing stage, the classification result of the text to be classified is determined according to the first classification result and the second classification result.
Fig. 1 schematically shows an application scenario of a text classification method and a computer device according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a scenario in which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, but does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the application scenario may include terminal devices 101, 102, 103, a network 104 and a server/server cluster 105. The network 104 serves to provide a medium of communication links between the terminal devices 101, 102, 103 and the server/server cluster 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 101, 102, 103 to interact with the server/server cluster 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server/server cluster 105 may be a server or a server cluster providing various services. Such a background management server or server cluster may analyze and process received data, such as user requests, and feed the processing result back to the terminal device.
It should be noted that the text classification method provided by the embodiments of the present disclosure may be generally performed by the server/server cluster 105. Accordingly, the text classification apparatus provided by the embodiments of the present disclosure may be generally disposed in the server/server cluster 105. The text classification method provided by the embodiment of the present disclosure may also be performed by a server or a server cluster different from the server/server cluster 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server/server cluster 105. Correspondingly, the text classification apparatus provided in the embodiments of the present disclosure may also be disposed in a server or a server cluster different from the server/server cluster 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server/server cluster 105.
It should be understood that the number of end devices, networks, and server/server clusters in fig. 1 is illustrative only. There may be any number of end devices, networks, and server/server clusters, as desired for implementation.
Fig. 2 schematically shows a flow chart of a text classification method according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S201 to S204.
In operation S201, a text to be classified is acquired.
In operation S202, a first classification result is obtained based on full text data of the text to be classified.
In operation S203, a second classification result is obtained based on one or more sub-text data extracted from the full-text data.
In operation S204, a classification result of the text to be classified is determined according to the first classification result and the second classification result.
As can be seen, for a text to be classified, the method shown in FIG. 2 obtains a first classification result based on the full text data of the text on the one hand, and a second classification result based on sub-text data extracted from the full text data on the other hand, and combines the two to obtain the classification result of the text. The classification is thus considered both from the perspective of the text as a whole and from the perspective of the key information within it; combining these two perspectives can effectively improve the recall rate and the accuracy rate of the text classification result.
In an embodiment of the present disclosure, the obtaining, by the operation S202, a first classification result based on full text data of the text to be classified includes: inputting the full-text data into a full-text classification model corresponding to a plurality of preset categories, determining a first score of the full-text data on each preset category in the plurality of preset categories based on the full-text classification model, and taking the preset category with the highest first score as the category corresponding to the full-text data.
After the text to be classified is obtained, it is preprocessed to obtain the full text data; the full text classification model corresponds to a plurality of preset categories. The full text data is input into the full text classification model, and a first score of the full text data with respect to each preset category is obtained. The first score of the full text data with respect to a preset category X indicates a score predicting that the full text data belongs to this preset category X. When the first score of the full text data with respect to a preset category X is greater than or equal to its first score with respect to any other preset category, the preset category X is taken as the category corresponding to the full text data. The advantage of classifying the full text data with the full text classification model is that the features in the full text data, and the correlations among them, can be considered as a whole. However, because the full text data contains a large amount of interference information irrelevant to the classification, the discriminative features may be buried in this interference, and an accurate first classification result cannot be obtained from it alone. For this reason, the first classification result and the second classification result are subsequently combined.
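The selection of the full-text category from the first scores can be sketched as follows. The patent does not specify the model type, so `full_text_model` below is a hypothetical stand-in for any scorer that returns one score per preset category:

```python
def classify_full_text(full_text_data, full_text_model, preset_categories):
    """Return the first scores and the category with the highest first score.

    `full_text_model(text, category)` is a hypothetical callable returning
    the model's score for `text` belonging to `category`.
    """
    # First score of the full-text data with respect to each preset category.
    first_scores = {c: full_text_model(full_text_data, c) for c in preset_categories}
    # The preset category with the highest first score is taken as the
    # category corresponding to the full-text data.
    best = max(preset_categories, key=lambda c: first_scores[c])
    return first_scores, best
```

Any model producing per-category scores (e.g. a softmax classifier over the preset categories) fits this interface.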
In an embodiment of the present disclosure, before obtaining the second classification result based on the one or more sub-text data extracted from the full-text data in operation S203, the method shown in fig. 2 further includes: one or more sub-text data are extracted from the full-text data.
Specifically, the extracting one or more sub-text data from the full-text data includes: matching the keywords in a preset keyword set with the full text data; for a first keyword which is successfully matched, extracting a character string with a first preset length before the first keyword and/or a character string with a second preset length after the first keyword from the full text data; and combining the extracted character string and the first keyword into sub-text data according to the position sequence in the full-text data.
The preset keyword set includes one or more keywords that play a key role in text classification. The full text data is matched against the preset keyword set; at each matching position, the successfully matched keyword is combined with the character strings before and/or after it to obtain the contextual text data of the keyword at that position, which is taken as sub-text data. Because the sub-text data retains the key keyword information while discarding the redundant information in the full text data, classification based on the sub-text data can focus more closely on the discriminative features relevant to text classification than classification based on the full text data. Furthermore, the sub-text data includes not only the keyword itself but also its context; since a keyword exhibits its key role in text classification only within a specific context, this makes the second classification result more accurate.
In one embodiment of the present disclosure, the obtaining of the second classification result based on the one or more sub-text data extracted from the full-text data in operation S203 includes: for a first sub-text data, inputting the first sub-text data into a sub-text classification model corresponding to the plurality of preset categories, determining a second score of the first sub-text data for each of the plurality of preset categories based on the sub-text classification model; and calculating a third score of each preset category of the one or more sub-text data based on a second score of each preset category of each sub-text data in the one or more sub-text data, and taking the preset category with the highest third score as the category corresponding to the one or more sub-text data.
And the sub-text classification model is consistent with the preset classification corresponding to the full-text classification model. And for any sub-document data, inputting the sub-document data into a sub-document classification model, and respectively obtaining second scores of the sub-document data about each preset classification. The second score of the sub-text data with respect to a preset category X represents a score of the prediction sub-text data belonging to the preset category X.
When there is only one sub-text data, the second score of that sub-text data with respect to each preset category is taken as the third score of the sub-text data as a whole with respect to each preset category. When the second score of the sub-text data with respect to a preset category X is greater than or equal to its second score with respect to any other preset category, the preset category X is taken as the category corresponding to the sub-text data.
When there are multiple sub-text data, the third score of the sub-text data as a whole with respect to each preset category is calculated based on the second score of each sub-text data with respect to each preset category. When the third score of the sub-text data as a whole with respect to a preset category X is greater than or equal to its third score with respect to any other preset category, the preset category X is taken as the category corresponding to the sub-text data as a whole.
In this process, the sub-text data are classified one by one by the sub-text classification model to obtain a classification result for each sub-text data, and these results are then combined to obtain the second classification result. Because different sub-text data contain the same or different keyword information in different contexts, different sub-text data can express different classification-relevant characteristics of the full text data. The classification of each sub-text data by the sub-text classification model is thus equivalent to extracting and classifying the feature corresponding to that sub-text data, and the second classification result comprehensively considers the various features relevant to text classification.
Specifically, calculating the third score of the one or more sub-text data with respect to each preset category based on the second score of each piece of sub-text data with respect to each preset category includes: for any one of the plurality of preset categories, performing a weighted summation of the second scores of the pieces of sub-text data with respect to that preset category to obtain the third score of the one or more sub-text data with respect to that preset category.
The keywords in the preset keyword set may have different grades, a grade representing the degree to which the corresponding keyword contributes to text classification. For example, if the presence of a certain keyword in a text is by itself sufficient to judge the text illegal, that keyword has the highest grade. A corresponding weight can be set for each piece of sub-text data according to the grade of the keyword it contains, and the second scores of the pieces of sub-text data with respect to the same preset category X are weighted and summed based on these weights to obtain the third score of the sub-text data as a whole with respect to the preset category X.
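For illustration, the weighted summation described above can be sketched in a few lines of Python. The function names, the dictionary representation of scores, and the per-sub-text weights are assumptions for this sketch, not part of the disclosed method itself:

```python
# Hypothetical sketch of the weighted-summation step: each piece of
# sub-text data carries a weight (e.g. derived from the grade of its
# matched keyword), and its per-category second scores are combined
# into third scores for the sub-text data as a whole.

def third_scores(second_scores, weights):
    """second_scores: list of dicts {category: second score}, one dict
    per piece of sub-text data. weights: matching list of weights.
    Returns a dict {category: weighted sum} -- the third scores."""
    totals = {}
    for scores, w in zip(second_scores, weights):
        for category, s in scores.items():
            totals[category] = totals.get(category, 0.0) + w * s
    return totals

def best_category(scores):
    # The text only requires ">=", so ties may resolve to either side.
    return max(scores, key=scores.get)
```

For example, two sub-texts with weights 1.0 and 2.0 yield third scores that favor the category the more heavily weighted sub-text supports.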
On this basis, in an embodiment of the present disclosure, determining the classification result of the text to be classified according to the first classification result and the second classification result in operation S204 includes: calculating a comprehensive score of the text to be classified with respect to each preset category from the first score of the full-text data and the third score of the one or more sub-text data with respect to that preset category, and taking the preset category with the highest comprehensive score as the category corresponding to the text to be classified.
Specifically, the calculating the comprehensive score of the text to be classified about each preset category according to the first score of the full text data about each preset category and the third score of the one or more sub-text data about each preset category includes: setting a first weight corresponding to the full-text data and a second weight corresponding to the one or more sub-text data; and for any preset category in the plurality of preset categories, according to the first weight and the second weight, carrying out weighted summation on the first score of the full text data about the preset category and the third score of the one or more sub text data about the preset category to obtain the comprehensive score of the text to be classified about the preset category.
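A minimal sketch of this two-weight combination, with illustrative names (the dictionary form of the scores and the function names are assumptions of this sketch):

```python
# Combine the full-text first scores with the sub-text third scores
# using a first weight (for the full-text data) and a second weight
# (for the one or more sub-text data), per preset category.

def composite_scores(first_scores, third_scores, first_weight, second_weight):
    """first_scores / third_scores: dicts {category: score} over the
    same preset categories. Returns the comprehensive scores."""
    return {c: first_weight * first_scores[c] + second_weight * third_scores[c]
            for c in first_scores}

def classify(first_scores, third_scores, first_weight, second_weight):
    comp = composite_scores(first_scores, third_scores, first_weight, second_weight)
    return max(comp, key=comp.get)
```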
The scheme of this embodiment obtains the classification result of the text to be classified based on the first classification result and the second classification result. The first classification result takes the overall features of the full-text data and the associations among them as its classification basis, while the second classification result takes the discriminative features relevant to text classification as its basis, so that the overall classification trend is captured accurately while the discriminative features receive focused attention, making the classification result precise. Specifically, a first weight and a second weight may be set for the first classification result and the second classification result, respectively, according to the relative importance of the full-text classification process and the sub-text classification process; the first score and the third score with respect to the same preset category X are weighted and summed based on these weights to obtain a comprehensive score with respect to the preset category X, and the final classification result is determined from the comprehensive scores with respect to the preset categories.
In another embodiment of the present disclosure, obtaining the second classification result based on the one or more sub-text data extracted from the full-text data in operation S203 includes: for first sub-text data, inputting the first sub-text data into a sub-text classification model corresponding to the plurality of preset categories, determining scores of the first sub-text data with respect to each preset category based on the sub-text classification model, and taking the preset category with the highest score as the category corresponding to the first sub-text data. When a first category exists among the categories corresponding to the pieces of sub-text data in the one or more sub-text data, the category corresponding to the one or more sub-text data is determined to be the first category; when the categories corresponding to all pieces of sub-text data are the second category, the category corresponding to the one or more sub-text data is determined to be the second category.
The sub-text classification model and the full-text classification model correspond to the same preset categories, which include a first category with the highest risk level and a second category with the lowest risk level. For any piece of sub-text data, the sub-text data is input into the sub-text classification model to obtain a second score of the sub-text data with respect to each preset category, where the second score with respect to a preset category X represents the predicted score that the sub-text data belongs to X. When the second score of the sub-text data with respect to a preset category X is greater than or equal to its second score with respect to any other preset category, X is taken as the category corresponding to the sub-text data.
When there is only one piece of sub-text data, the category corresponding to that piece is the category corresponding to the sub-text data as a whole. When there are multiple pieces, if the first category appears among the categories corresponding to the pieces, the first category is taken as the category of the sub-text data as a whole; if the categories of all pieces are the second category, the second category is taken as the category of the whole.
In this process, the pieces of sub-text data are classified one by one by the sub-text classification model to obtain the category corresponding to each piece, and it is then checked whether the first category, with the highest risk level, appears among them; if so, the first category is directly taken as the category corresponding to the sub-text data as a whole, i.e., the second classification result. This process suits situations in which the preset categories include a category requiring special attention (such as the first category), giving that category the highest decision weight so as to eliminate security threats that texts of that category may pose.
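The risk-priority rule just described reduces to a simple membership test. The category labels below are hypothetical placeholders for the first (highest-risk) and second (lowest-risk) categories:

```python
# Sketch of the risk-priority aggregation: if any piece of sub-text
# data is predicted as the first (highest-risk) category, the sub-text
# data as a whole takes the first category; otherwise every piece fell
# into the second category, which then applies to the whole.

FIRST_CATEGORY = "first"    # illustrative label, highest risk level
SECOND_CATEGORY = "second"  # illustrative label, lowest risk level

def aggregate_subtext_categories(categories):
    if FIRST_CATEGORY in categories:
        return FIRST_CATEGORY
    return SECOND_CATEGORY
```

The single-sub-text case needs no special handling: a one-element list trivially yields that element's category.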
On this basis, in an embodiment of the present disclosure, determining the classification result of the text to be classified according to the first classification result and the second classification result in operation S204 includes: when the category corresponding to the full-text data and the category corresponding to the one or more sub-text data are both the second category, determining that the category corresponding to the text to be classified is the second category; when the category corresponding to the full-text data and/or the category corresponding to the one or more sub-text data is the first category, determining that the category corresponding to the text to be classified is the first category.
The scheme of this embodiment obtains the classification result of the text to be classified based on the first classification result and the second classification result. The first classification result takes the overall features of the full-text data and the associations among them as its classification basis, while the second classification result takes the discriminative features relevant to text classification as its basis, so that the overall classification trend is captured accurately while the discriminative features receive focused attention, making the classification result precise. Specifically, when the first category, with the highest risk level, appears among the category corresponding to the full-text data and the category corresponding to the sub-text data, the classification result of the text to be classified is the first category; when both are the second category, with the lowest risk level, the classification result is the second category. This process suits situations in which the preset categories include a category requiring special attention (such as the first category), giving that category the highest decision weight so as to eliminate security threats that texts of that category may pose.
The method shown in fig. 2 is described below with reference to fig. 3A to 3B in conjunction with specific embodiments:
fig. 3A schematically shows a schematic diagram of a text classification process according to an embodiment of the disclosure.
As shown in fig. 3A, in the text classification process, a text to be classified is first obtained and preprocessed to obtain full-text data. The preprocessing includes conventional operations such as removing HTML tags from the text to be classified, performing word segmentation, and removing redundant words. The preprocessed full text is then matched against a preset keyword set; the successfully matched keywords are found in the full text, giving N (N > 0) matching positions in total, and a certain length range before and after each matching position is extracted as the context text of the corresponding keyword. These context texts are called sub-text data, and in this example N pieces of sub-text data are extracted. The preset keyword set contains important words collected for the specific classification task; keyword matching extracts the key content of the full text that is relevant to the classification task and excludes the interference of irrelevant text.
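The extraction step above can be sketched as follows. The function name and the fixed window length are assumptions of this sketch; a production system would more likely use a multi-pattern matcher over all keywords at once:

```python
# Minimal sketch of sub-text extraction: locate every (non-overlapping)
# occurrence of each keyword in the preprocessed full text and cut a
# fixed-length context window around the match position.

def extract_subtexts(full_text, keywords, window=60):
    subtexts = []
    for kw in keywords:
        start = 0
        while True:
            pos = full_text.find(kw, start)
            if pos == -1:
                break
            left = max(0, pos - window)
            right = min(len(full_text), pos + len(kw) + window)
            subtexts.append(full_text[left:right])
            start = pos + len(kw)  # continue after this match
    return subtexts
```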
On one hand, the full-text data is input into a full-text classification model, and the classification result corresponding to the full-text data is predicted with this model. In the training stage, the full-text classification model uses the full texts in various corpora as its training set, so as to train its capability of classifying whole text contents. Various choices of full-text classification model are possible, such as a convolutional neural network for text classification (Text-CNN) or a recurrent neural network (RNN). The full-text classification model outputs the classification result corresponding to the full-text data, corresponding to the first classification result in fig. 2. Specifically, the full-text classification model corresponds to 3 preset categories: category 1, category 2, and category 3. The classification result corresponding to the full-text data includes: a first score a(1) of the full-text data with respect to category 1, a first score a(2) with respect to category 2, and a first score a(3) with respect to category 3.
On the other hand, each of the N pieces of sub-text data is input into a sub-text classification model, and the classification result corresponding to each piece is predicted with this model. In the training stage, the sub-text classification model uses the context texts of the keywords of the preset keyword set, as found in various corpora, as its training set, so as to train its capability of classifying sub-text data. Various choices of sub-text classification model are possible, such as a convolutional neural network for text classification (Text-CNN) or a recurrent neural network (RNN). The sub-text classification model outputs the classification result corresponding to each piece of sub-text data, yielding N classification results for the N pieces, corresponding to the second classification result in fig. 2. Specifically, the sub-text classification model corresponds to the same 3 preset categories: category 1, category 2, and category 3. For the i-th piece of sub-text data (1 ≤ i ≤ N, i an integer), the classification result includes: a second score b_i(1) with respect to category 1, a second score b_i(2) with respect to category 2, and a second score b_i(3) with respect to category 3.
Weights are then assigned to the models according to their importance in the specific classification task, and the classification results of the full-text classification model and the sub-text classification model are combined under this weight assignment scheme to obtain the final classification result. In this example, the full-text classification model, which outputs the classification result corresponding to the full-text data, is assigned a weight α, and the sub-text classification model outputting the classification result corresponding to the i-th piece of sub-text data is assigned a weight β_i. The comprehensive score of the text to be classified with respect to category 1 is then:
S(1) = α·a(1) + Σ_{i=1}^{N} β_i·b_i(1)
the comprehensive score with respect to category 2:
S(2) = α·a(2) + Σ_{i=1}^{N} β_i·b_i(2)
and the comprehensive score with respect to category 3:
S(3) = α·a(3) + Σ_{i=1}^{N} β_i·b_i(3)
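This weighted combination, using the quantities a(k), b_i(k), α, and β_i defined above, can be sketched as follows (the dictionary-based representation and the function names are assumptions of this sketch):

```python
# Sketch of the comprehensive-score computation:
#   S(k) = alpha * a[k] + sum_i beta[i] * b[i][k]
# where a maps each category k to the full-text first score and
# b[i] maps each category k to the i-th sub-text's second score.

def composite_score(a, b, alpha, beta, k):
    return alpha * a[k] + sum(beta_i * b_i[k] for beta_i, b_i in zip(beta, b))

def final_category(a, b, alpha, beta, categories):
    # The category with the highest comprehensive score wins.
    return max(categories, key=lambda k: composite_score(a, b, alpha, beta, k))
```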
The category with the highest comprehensive score is selected as the category of the text to be classified, yielding the final classification result, which corresponds to the classification result of the text to be classified in fig. 2.
Compared with the prior art, the scheme of this embodiment accurately locates the key text content, avoids interference from irrelevant text content, and effectively improves the recall rate and accuracy of text classification.
Fig. 3B schematically shows a schematic diagram of a text classification process according to another embodiment of the present disclosure.
As shown in fig. 3B, the text classification process is specifically one for determining whether a web page contains illegal content (content related to pornography, gambling, terrorism, and the like). First, a crawler system crawls the web page source code, which carries HTML tags and serves as the text to be classified. Preprocessing operations such as removing the HTML tags and word segmentation are performed on the web page source code, and the preprocessed web page text serves as the full-text data. The preset keyword set includes keywords that play a key role in the classification, such as "hong kong horse race", "ao betting", "one code middle special", "one night" and the like. The original web page text is matched against the preset keyword set, specifically with a multi-pattern matching algorithm, and the successfully matched keywords are found, giving N (N > 0) matching positions in total. Character strings of 60 characters before and after each keyword matching position are extracted and combined with the matched keyword in their order of position in the original text, yielding a keyword context text of about 120 characters as one piece of sub-text data; in this embodiment, N pieces of sub-text data are extracted.
On one hand, full text data is input into a full text classification model, and a classification result corresponding to the full text data is predicted by using the full text classification model. The full Text classification model in this example employs a Text classification (Text-CNN) model based on convolutional neural networks, corresponding to two preset classes: "violation" and "normal". The full-text classification model outputs a classification result corresponding to the full-text data, corresponding to the first classification result in fig. 2, which is "violation" or "normal".
On the other hand, each of the N sub-text data is input to a sub-text classification model, and a classification result corresponding to each sub-text data is predicted using the sub-text classification model. In this example, the sub-Text classification model also employs a Text classification (Text-CNN) model based on a convolutional neural network. The sub-text classification model outputs the classification result corresponding to each sub-text data, and obtains N classification results corresponding to N sub-text data, where the classification result corresponding to each sub-text data is "violation" or "normal" corresponding to the second classification result in fig. 2.
To combine the classification results output by the two models, the following weight assignment scheme is adopted: when "violation" appears among the classification results, a weight of 1 is assigned to each model outputting "violation" and a weight of 0 to each model outputting "normal"; when "violation" does not appear among the classification results, a weight of 1 is assigned to the models outputting "normal". That is, if any classification result is "violation", the final classification result of the web page is "violation"; only if all classification results are "normal" is the final result "normal". This gives the "violation" result the highest priority and sensitivity, so as to avoid the security threats posed by violating web pages.
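The weight assignment scheme above is equivalent to a logical OR over the per-model labels; a minimal sketch (function name is illustrative):

```python
# Sketch of the violation-priority rule: the page is "violation" if the
# full-text model or any sub-text model outputs "violation"; it is
# "normal" only when every model outputs "normal".

def final_webpage_label(full_text_label, subtext_labels):
    all_labels = [full_text_label] + list(subtext_labels)
    return "violation" if "violation" in all_labels else "normal"
```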
Tests show that, in the illegal-content classification task, the accuracy of the classification result is improved by at least 10% compared with the prior art.
Fig. 4 schematically shows a block diagram of a text classification apparatus according to an embodiment of the present disclosure.
As shown in fig. 4, the text classification apparatus 400 includes: an acquisition module 410, a first classification module 420, a second classification module 430, and a composite classification module 440.
The obtaining module 410 is used for obtaining the text to be classified.
The first classification module 420 is configured to obtain a first classification result based on full text data of the text to be classified.
The second classification module 430 is configured to obtain a second classification result based on one or more sub-text data extracted from the full text data.
The comprehensive classification module 440 is configured to determine a classification result of the text to be classified according to the first classification result and the second classification result.
Fig. 5 schematically shows a block diagram of a text classification apparatus according to another embodiment of the present disclosure.
As shown in fig. 5, the text classification apparatus 500 includes: an acquisition module 410, a first classification module 420, a second classification module 430, and a composite classification module 440.
The first classification module 420 is configured to input the full-text data into a full-text classification model corresponding to a plurality of preset categories, determine a first score of the full-text data with respect to each preset category in the plurality of preset categories based on the full-text classification model, and take the preset category with the highest first score as the category corresponding to the full-text data.
In an embodiment of the present disclosure, the text classification apparatus 500 further includes a sub-text extraction module 450, configured to extract one or more sub-text data from the full-text data before the second classification module obtains a second classification result based on the one or more sub-text data extracted from the full-text data.
The sub-text extraction module 450 may include: a matching sub-module 451, an extraction sub-module 452, and a combination sub-module 453.
The matching sub-module 451 is configured to match the full-text data against the keywords in the preset keyword set. The extraction sub-module 452 is configured to, for a first keyword that is successfully matched, extract from the full-text data a character string of a first preset length before the first keyword and/or a character string of a second preset length after the first keyword. The combination sub-module 453 is configured to combine the extracted character strings and the first keyword into one piece of sub-text data in the order of their positions in the full-text data.
In one embodiment of the present disclosure, the second classification module 430 includes a first prediction submodule 431 and a calculation submodule 432.
The first prediction sub-module 431 is configured to, for a first sub-text data, input the first sub-text data into a sub-text classification model corresponding to the plurality of preset categories, and determine a second score of the first sub-text data for each of the plurality of preset categories based on the sub-text classification model. And the calculating sub-module 432 is configured to calculate a third score of each preset category of the one or more sub-text data based on a second score of each preset category of each sub-text data of the one or more sub-text data, and set the preset category with the highest third score as the category corresponding to the one or more sub-text data.
The calculating sub-module 432 is specifically configured to, for any preset category in the plurality of preset categories, perform weighted summation on the second scores of the sub-text data with respect to the preset category to obtain the third scores of the one or more sub-text data with respect to the preset category.
In an embodiment of the present disclosure, the comprehensive classification module 440 includes a comprehensive calculation sub-module 441, configured to calculate a comprehensive score of the text to be classified about each preset category according to the first score of the full text data about each preset category and the third score of the one or more sub-text data about each preset category, and use the preset category with the highest comprehensive score as the category corresponding to the text to be classified.
Specifically, as an optional embodiment, the comprehensive calculation sub-module 441 is configured to set a first weight corresponding to the full-text data and a second weight corresponding to the one or more sub-text data; and for any preset category in the plurality of preset categories, according to the first weight and the second weight, carrying out weighted summation on the first score of the full text data about the preset category and the third score of the one or more sub text data about the preset category to obtain the comprehensive score of the text to be classified about the preset category.
In one embodiment of the present disclosure, the second classification module 430 includes a second prediction sub-module 433 and a first determination sub-module 434. The second prediction submodule 433 is configured to, for a first sub-text data, input the first sub-text data into a sub-text classification model corresponding to the plurality of preset categories, determine scores of the first sub-text data with respect to each preset category based on the sub-text classification model, and set a preset category with a highest score as a category corresponding to the first sub-text data. The first determining sub-module 434 is configured to determine, when a first category exists in the categories corresponding to the respective sub-text data in the one or more sub-texts, the category corresponding to the one or more sub-text data as the first category; and when the categories corresponding to the sub-text data in the one or more sub-texts are all second categories, determining the categories corresponding to the one or more sub-text data as the second categories.
Wherein the preset categories include a first category and a second category. In an embodiment of the present disclosure, the comprehensive classification module 440 includes a second determining sub-module 442, configured to determine that the category corresponding to the text to be classified is a second category when both the category corresponding to the full text data and the category corresponding to the one or more sub-text data are the second category; and when the category corresponding to the full text data and/or the category corresponding to the one or more sub-text data is a first category, determining that the category corresponding to the text to be classified is the first category.
It should be noted that the implementation, solved technical problems, implemented functions, and achieved technical effects of each module/unit/subunit and the like in the apparatus part embodiment are respectively the same as or similar to the implementation, solved technical problems, implemented functions, and achieved technical effects of each corresponding step in the method part embodiment, and are not described herein again.
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
For example, any number of the obtaining module 410, the first classification module 420, the second classification module 430, the comprehensive classification module 440, and the sub-text extraction module 450 may be combined and implemented in one module, or any one of the modules may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the obtaining module 410, the first classifying module 420, the second classifying module 430, the comprehensive classifying module 440, and the sub-text extracting module 450 may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented by any one of three implementations of software, hardware, and firmware, or implemented by a suitable combination of any several of them. Alternatively, at least one of the obtaining module 410, the first classifying module 420, the second classifying module 430, the comprehensive classifying module 440 and the sub-text extracting module 450 may be at least partially implemented as a computer program module which, when executed, may perform a corresponding function.
Fig. 6 schematically shows a block diagram of a computer device adapted to implement the above described method according to an embodiment of the present disclosure. The computer device shown in fig. 6 is only an example and should not bring any limitation to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the computer device 600 includes a processor 610 and a computer-readable storage medium 620. The computer device 600 may perform a method according to an embodiment of the present disclosure.
In particular, the processor 610 may comprise, for example, a general purpose microprocessor, an instruction set processor and/or related chip set and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 610 may also include onboard memory for caching purposes. The processor 610 may be a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
Computer-readable storage medium 620, for example, may be a non-volatile computer-readable storage medium, specific examples including, but not limited to: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and so on.
The computer-readable storage medium 620 may include a computer program 621, which computer program 621 may include code/computer-executable instructions that, when executed by the processor 610, cause the processor 610 to perform a method according to an embodiment of the disclosure, or any variation thereof.
The computer program 621 may be configured with computer program code comprising, for example, computer program modules. In an example embodiment, the code in the computer program 621 may include one or more program modules, for example modules 621A, 621B, and so on. It should be noted that the division and number of modules are not fixed; those skilled in the art may use suitable program modules or combinations thereof according to the actual situation, so that when these program modules are executed by the processor 610, the processor 610 can carry out the method according to the embodiments of the present disclosure or any variation thereof.
According to an embodiment of the present disclosure, at least one of the obtaining module 410, the first classification module 420, the second classification module 430, the comprehensive classification module 440 and the sub-text extraction module 450 may be implemented as a computer program module described with reference to fig. 6, which, when executed by the processor 610, may implement the text classification method described above.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or coupled in various ways, even if such combinations or couplings are not expressly recited in the present disclosure. In particular, the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or coupled in various ways without departing from the spirit or teachings of the present disclosure. All such combinations and/or couplings fall within the scope of the present disclosure.
While the disclosure has been shown and described with reference to certain exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims and their equivalents. Accordingly, the scope of the present disclosure should not be limited to the above-described embodiments, but should be defined not only by the appended claims, but also by equivalents thereof.

Claims (7)

1. A method of text classification, comprising:
acquiring a text to be classified;
obtaining a first classification result based on full-text data of the text to be classified, wherein the first classification result comprises a first score of the full-text data with respect to each of a plurality of preset categories;
obtaining a second classification result based on one or more sub-text data extracted from the full-text data, wherein the second classification result comprises a third score of the one or more sub-text data with respect to each preset category;
wherein extracting the one or more sub-text data from the full-text data comprises:
matching keywords in a preset keyword set against the full-text data;
for a first keyword that is successfully matched, extracting from the full-text data a character string of a first preset length before the first keyword and/or a character string of a second preset length after the first keyword; and
combining the extracted character string and the first keyword into sub-text data in the order of their positions in the full-text data;
for first sub-text data, inputting the first sub-text data into a sub-text classification model corresponding to the plurality of preset categories, and determining, based on the sub-text classification model, a second score of the first sub-text data with respect to each of the plurality of preset categories; and
calculating the third score of the one or more sub-text data with respect to each preset category based on the second score of each of the one or more sub-text data with respect to each preset category, and taking the preset category with the highest third score as the category corresponding to the one or more sub-text data; and
determining a classification result of the text to be classified according to the first classification result and the second classification result, which comprises:
calculating a comprehensive score of the text to be classified with respect to each preset category according to the first score of the full-text data with respect to each preset category and the third score of the one or more sub-text data with respect to each preset category, and taking the preset category with the highest comprehensive score as the category corresponding to the text to be classified.
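The sub-text extraction step recited above can be illustrated with a short sketch. This is one possible reading of the claim, not the patented implementation itself; the function name, the example text, the keyword set, and the window lengths are all invented for the illustration.

```python
def extract_sub_texts(full_text, keywords, before_len=10, after_len=10):
    """For each occurrence of each keyword in the full text, take the
    character string of a preset length before it and after it, and join
    the pieces with the keyword in their original positional order."""
    sub_texts = []
    for kw in keywords:
        start = full_text.find(kw)
        while start != -1:
            left = full_text[max(0, start - before_len):start]
            right = full_text[start + len(kw):start + len(kw) + after_len]
            # the pieces are concatenated in position order: before, keyword, after
            sub_texts.append(left + kw + right)
            start = full_text.find(kw, start + 1)
    return sub_texts

print(extract_sub_texts("the quick brown fox jumps", ["fox"], 6, 6))  # ['brown fox jumps']
```

Each extracted string would then be scored by a sub-text classification model; the model itself (e.g. how the second scores are produced) is not fixed by the claim.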
2. The method of claim 1, wherein the obtaining a first classification result based on full text data of the text to be classified comprises:
inputting the full-text data into a full-text classification model corresponding to a plurality of preset categories, determining, based on the full-text classification model, a first score of the full-text data with respect to each of the plurality of preset categories, and taking the preset category with the highest first score as the category corresponding to the full-text data.
3. The method of claim 1, wherein the calculating a third score of the one or more sub-text data with respect to each preset category based on the second score of each of the one or more sub-text data with respect to each preset category comprises:
for any preset category of the plurality of preset categories, performing a weighted summation of the second scores of the respective sub-text data with respect to the preset category to obtain the third score of the one or more sub-text data with respect to the preset category.
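The weighted summation of claim 3 can be sketched as follows. The data layout (a dict of category scores per sub-text) and the default weights are assumptions for the example, not part of the claim.

```python
def third_scores(second_scores, weights=None):
    """second_scores: one dict per sub-text, mapping each preset category
    to that sub-text's second score.  The third score of the group for a
    category is the weighted sum of the second scores over the sub-texts."""
    if weights is None:
        weights = [1.0] * len(second_scores)  # equal weights by default
    categories = second_scores[0].keys()
    return {c: sum(w * s[c] for w, s in zip(weights, second_scores))
            for c in categories}

# Two sub-texts, two categories, equal weights of 0.5:
print(third_scores([{"a": 0.8, "b": 0.2}, {"a": 0.4, "b": 0.6}], [0.5, 0.5]))
```

With weights 0.5 and 0.5 the result is {"a": 0.6, "b": 0.4}; the claim leaves the choice of weights open.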
4. The method of claim 1, wherein the calculating a comprehensive score of the text to be classified with respect to each preset category according to the first score of the full-text data with respect to each preset category and the third score of the one or more sub-text data with respect to each preset category comprises:
setting a first weight corresponding to the full-text data and a second weight corresponding to the one or more sub-text data; and
for any preset category of the plurality of preset categories, performing, according to the first weight and the second weight, a weighted summation of the first score of the full-text data with respect to the preset category and the third score of the one or more sub-text data with respect to the preset category, to obtain the comprehensive score of the text to be classified with respect to the preset category.
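The final combination of claims 1 and 4 can be sketched in the same style. The function names and the example weights are illustrative assumptions; the claim only requires a weighted summation per category followed by an arg-max.

```python
def comprehensive_scores(first_scores, third_scores, w_full, w_sub):
    """Per-category weighted sum of the full-text first score and the
    sub-text third score (claim 4)."""
    return {c: w_full * first_scores[c] + w_sub * third_scores[c]
            for c in first_scores}

def classify(first_scores, third_scores, w_full=0.5, w_sub=0.5):
    """Take the preset category with the highest comprehensive score
    as the category of the text to be classified (claim 1)."""
    scores = comprehensive_scores(first_scores, third_scores, w_full, w_sub)
    return max(scores, key=scores.get)

# Full text favors "a", sub-texts favor "b"; the weights decide the outcome:
print(classify({"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}, 0.5, 0.5))  # a
print(classify({"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}, 0.2, 0.8))  # b
```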
5. The method of claim 2, wherein the obtaining a second classification result based on one or more sub-text data extracted from the full text data comprises:
for first sub-text data, inputting the first sub-text data into sub-text classification models corresponding to the plurality of preset categories, determining, based on the sub-text classification models, a score of the first sub-text data with respect to each preset category, and taking the preset category with the highest score as the category corresponding to the first sub-text data;
when a first category exists among the categories corresponding to the respective sub-text data of the one or more sub-text data, determining the category corresponding to the one or more sub-text data to be the first category; and
when the categories corresponding to the respective sub-text data of the one or more sub-text data are all a second category, determining the category corresponding to the one or more sub-text data to be the second category.
6. The method of claim 1 or 5, wherein:
the preset categories comprise a first category and a second category;
the determining the classification result of the text to be classified according to the first classification result and the second classification result includes:
when the category corresponding to the full-text data and the category corresponding to the one or more sub-text data are both the second category, determining that the category corresponding to the text to be classified is the second category; and
when the category corresponding to the full-text data and/or the category corresponding to the one or more sub-text data is the first category, determining that the category corresponding to the text to be classified is the first category.
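The two-category decision rule of claims 5 and 6 reduces to a simple disjunction, sketched below. The concrete labels "sensitive" and "normal" are illustrative assumptions (the patent's background is sensitive-text detection, but the claims name only a first and a second category).

```python
def final_category(full_text_category, sub_text_categories,
                   first_category="sensitive", second_category="normal"):
    """Claims 5-6: the text belongs to the first category if the full text
    or any sub-text is classified into it; otherwise it belongs to the
    second category."""
    if full_text_category == first_category or first_category in sub_text_categories:
        return first_category
    return second_category

print(final_category("normal", ["normal", "sensitive"]))  # sensitive
```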
7. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the text classification method of any one of claims 1 to 6.
CN201811653926.5A 2018-12-29 2018-12-29 Text classification method and computer equipment Active CN109739989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811653926.5A CN109739989B (en) 2018-12-29 2018-12-29 Text classification method and computer equipment


Publications (2)

Publication Number Publication Date
CN109739989A CN109739989A (en) 2019-05-10
CN109739989B (en) 2021-05-18

Family

ID=66363028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811653926.5A Active CN109739989B (en) 2018-12-29 2018-12-29 Text classification method and computer equipment

Country Status (1)

Country Link
CN (1) CN109739989B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334209A (en) * 2019-05-23 2019-10-15 平安科技(深圳)有限公司 File classification method, device, medium and electronic equipment
CN111881287B (en) * 2019-09-10 2021-08-17 马上消费金融股份有限公司 Classification ambiguity analysis method and device
CN113157901B (en) * 2020-01-22 2024-02-23 腾讯科技(深圳)有限公司 User generated content filtering method and related device
CN112149403A (en) * 2020-10-16 2020-12-29 军工保密资格审查认证中心 Method and device for determining confidential text
CN113449109A (en) * 2021-07-06 2021-09-28 广州华多网络科技有限公司 Security class label detection method and device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8856123B1 (en) * 2007-07-20 2014-10-07 Hewlett-Packard Development Company, L.P. Document classification
CN105630827A (en) * 2014-11-05 2016-06-01 阿里巴巴集团控股有限公司 Information processing method and system, and auxiliary system
CN106874253A (en) * 2015-12-11 2017-06-20 腾讯科技(深圳)有限公司 Recognize the method and device of sensitive information
CN108399164A (en) * 2018-03-27 2018-08-14 国网黑龙江省电力有限公司电力科学研究院 Electronic government documents classification hierarchy system based on template
CN108446388A (en) * 2018-03-22 2018-08-24 平安科技(深圳)有限公司 Text data quality detecting method, device, equipment and computer readable storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100088 Building 3 332, 102, 28 Xinjiekouwai Street, Xicheng District, Beijing

Applicant after: Qianxin Technology Group Co., Ltd.

Address before: 100088 Building 3 332, 102, 28 Xinjiekouwai Street, Xicheng District, Beijing

Applicant before: BEIJING QI'ANXIN SCIENCE & TECHNOLOGY CO., LTD.

GR01 Patent grant