CN110968687A - Method and device for classifying texts

Info

Publication number: CN110968687A
Application number: CN201811156700.4A
Authority: CN
Inventors: 陈云枫
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2018-09-30
Filing date: 2018-09-30
Publication date: 2020-04-07
Anticipated expiration: 2038-09-30
Also published as: CN110968687B

Abstract

The invention discloses a method and a device for classifying texts, relating to the technical field of natural language processing, which bring classification results closer to the requirements of different services and improve both the quality of the classification results and the efficiency of classification. The main technical scheme is as follows: judging whether text data to be classified matches preset strong rule logic, wherein the preset strong rule logic is used for distinguishing whether the text data belongs to a category irrelevant to the service requirements; if so, determining the classification of the text data according to the matching result corresponding to the preset strong rule logic; if not, classifying the text data through a preset text classification model, wherein the preset text classification model comprises preset weak rule logic, and the preset weak rule logic is used for expanding features according to the service requirements when the text data is classified, so that the classification result obtained matches the service requirements. The invention is applied to optimizing the execution of text classification.

Description

Method and device for classifying texts

Technical Field

The invention relates to the technical field of natural language processing, in particular to a method and a device for classifying texts.

Background

With the continuous development of science and technology, machine learning can be applied to judge the category of texts. Currently, text classification using machine learning mainly proceeds in two steps: first, a text classification model is trained on texts with labeled categories; second, the model processes an original text without a labeled category to predict the category to which it belongs, thereby classifying the original text. However, the content involved in different specific services can differ greatly. If only one general text classification model is used, the requirements of different services cannot be met; if a separate text classification model is trained for each specific service, a large cost is incurred, and the process of classifying the original text becomes complicated, redundant and inefficient.

Disclosure of Invention

In view of this, the present invention provides a method and an apparatus for classifying texts, and mainly aims to optimize a processing flow for performing classification on an original text, so that a classification result is closer to requirements of different services, the quality of the classification result is improved, and meanwhile, the classification efficiency is greatly improved.

In order to solve the above problems, the present invention mainly provides the following technical solutions:

in one aspect, the present invention provides a method for classifying texts, including:

judging whether the text data to be classified is matched with preset strong rule logic or not, wherein the preset strong rule logic is used for distinguishing whether the text data belongs to a category irrelevant to service requirements or not;

if yes, determining the classification of the text data according to a matching result corresponding to the preset strong rule logic;

if not, performing classification processing on the text data through a preset text classification model, wherein the preset text classification model comprises a preset weak rule logic, and the preset weak rule logic is used for expanding characteristics according to the service requirements when performing classification processing on the text data so as to enable classification results obtained correspondingly by the classification processing to be matched with the service requirements.

Optionally, the preset strong rule logic includes rule bodies and rule matching results corresponding to each rule body, and the rule bodies are written in regular expressions.

Optionally, the judging whether the text data to be classified matches the preset strong rule logic includes:

acquiring regular expression information corresponding to each rule body, wherein the regular expression information comprises the screening logic of the regular expression;

screening the text data according to the screening logic of the regular expression;

judging whether a target text matching the screening logic of the regular expression is screened out from the text data;

and if so, determining that the text data matches the preset strong rule logic.

Optionally, the performing classification processing on the text data through a preset text classification model includes:

performing word segmentation on the text data;

performing vectorization on the resulting word segments, and outputting a plurality of feature dimensions corresponding to the text data and dimension information corresponding to each feature dimension;

performing feature selection on the plurality of feature dimensions by using a feature selector, and outputting the screened feature dimensions and corresponding dimension information;

expanding the feature dimensions of the text data and obtaining corresponding dimension information according to the preset weak rule logic;

inputting the screened feature dimensions and the corresponding dimension information, the expanded feature dimensions and the corresponding dimension information into a classifier, and outputting a classification result for performing prediction on the text data.
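The final step above, in which the classifier receives both the screened feature dimensions and the expanded ones, can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration only: the feature names, the `WEIGHTS` and `BIAS` values, and the linear scoring rule are hypothetical stand-ins, since the source does not specify which classifier is used.

```python
from typing import Dict

def merge_features(selected: Dict[str, float], expanded: Dict[str, float]) -> Dict[str, float]:
    """Join the feature selector's output with the rule-derived dimensions."""
    merged = dict(selected)
    merged.update(expanded)
    return merged

# Hypothetical weights, standing in for whatever a trained classifier would hold.
WEIGHTS = {"good": -0.5, "gift": 0.4, "rule_group_2": 1.5, "rule_group_3": 1.0}
BIAS = -1.0

def predict(features: Dict[str, float]) -> str:
    """Toy linear classifier: a positive score means spam."""
    score = BIAS + sum(WEIGHTS.get(dim, 0.0) * value for dim, value in features.items())
    return "spam" if score > 0 else "not spam"

selected = {"good": 1, "gift": 1}                  # screened feature dimensions
expanded = {"rule_group_2": 1, "rule_group_3": 1}  # dimensions added by weak rule logic

print(predict(selected))                            # word features only
print(predict(merge_features(selected, expanded)))  # word features plus rule features
```

With these assumed weights, the word features alone score below zero, while adding the rule-derived dimensions pushes the score above zero, which is the kind of shift toward the service requirements that the expanded features are meant to provide.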

Optionally, the preset weak rule logic includes a plurality of rule groups, each rule group corresponds to a plurality of rule bodies, each rule body is written in a regular expression, and each rule body corresponds to one rule matching result.

Optionally, the expanding the feature dimension of the text data and obtaining corresponding dimension information according to a preset weak rule logic includes:

acquiring a rule group contained in the preset weak rule logic;

determining the rule group as an expanded feature dimension;

judging whether the text data hits the rule logic of the rule group;

if so, using information indicating that the text data hits the rule logic of the rule group as a rule matching result, and determining the rule matching result as the dimension information corresponding to the feature dimension;

if not, using information indicating that the text data misses the rule logic of the rule group as a rule matching result, and determining the rule matching result as the dimension information corresponding to the feature dimension.

Optionally, the judging whether the text data hits the rule logic of the rule group includes:

within the same rule group, querying regular expression information corresponding to each rule body, wherein the regular expression information comprises the screening logic of the regular expression;

judging, according to the screening logic of the regular expression, whether the text data hits the rule logic of the rule body;

if yes, determining that the text data hits the rule logic of the rule group;

if not, determining that the text data does not hit the rule logic of the rule group when the text data hits the rule logic of none of the rule bodies in the same rule group.

In order to achieve the above object, according to another aspect of the present invention, a storage medium is provided, and the storage medium includes a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the method for classifying texts.

In order to achieve the above object, according to another aspect of the present invention, a processor for executing a program is provided, wherein the program executes the method for classifying texts.

In another aspect, the present invention further provides an apparatus for classifying texts, including:

the judging unit is used for judging whether text data to be classified matches preset strong rule logic, wherein the preset strong rule logic is used for distinguishing whether the text data belongs to a category irrelevant to service requirements;

the determining unit is used for determining the classification of the text data according to a matching result corresponding to the preset strong rule logic when the judging unit judges that the text data is matched with the preset strong rule logic;

and the execution unit is used for performing classification processing on the text data through a preset text classification model when the judging unit judges that the text data does not match the preset strong rule logic, wherein the preset text classification model comprises preset weak rule logic, and the preset weak rule logic is used for expanding features according to the service requirements when the classification processing is performed on the text data, so that the classification result obtained by the classification processing matches the service requirements.

Optionally, the judging unit includes:

the acquisition module is used for acquiring regular expression information corresponding to each rule body, and the regular expression information contains screening logic of regular expressions;

the screening module is used for screening the text data according to the screening logic of the regular expression acquired by the acquisition module;

the judging module is used for judging whether a target text matched with the screening logic of the regular expression is screened out from the text data;

and the determining module is used for determining that the text data matches the preset strong rule logic when the judging module judges that a target text matching the screening logic of the regular expression is screened out from the text data.

Optionally, the execution unit includes:

the word segmentation module is used for performing word segmentation on the text data;

the vectorization processing module is used for performing vectorization on the word segments obtained by the word segmentation module and outputting a plurality of feature dimensions corresponding to the text data and dimension information corresponding to each feature dimension;

the feature selection module is used for performing feature selection on the plurality of feature dimensions by using a feature selector and outputting the screened feature dimensions and corresponding dimension information;

the expansion module is used for expanding the feature dimensions of the text data and obtaining corresponding dimension information according to the preset weak rule logic;

and the execution module is used for inputting the feature dimensions and the corresponding dimension information screened by the feature selection module, the feature dimensions expanded by the expansion module and the corresponding dimension information into a classifier and outputting a classification result for performing prediction on the text data.

Optionally, the expansion module includes:

the obtaining submodule is used for obtaining a rule group contained in the preset weak rule logic;

the first determining submodule is used for determining the rule group acquired by the acquiring submodule as an expanded characteristic dimension;

the first judgment submodule is used for judging whether the text data hits the rule logic of the rule group acquired by the acquisition submodule;

a first execution submodule, configured to, when the first judgment submodule judges that the text data hits the rule logic of the rule group, use information indicating that the text data hits the rule logic of the rule group as a rule matching result, and determine the rule matching result as the dimension information corresponding to the feature dimension;

and the second execution submodule is used for, when the first judgment submodule judges that the text data does not hit the rule logic of the rule group, using information indicating that the text data misses the rule logic of the rule group as a rule matching result, and determining the rule matching result as the dimension information corresponding to the feature dimension.

Optionally, the first judgment submodule includes:

the query submodule is used for querying regular expression information corresponding to each rule body under the same rule group, wherein the regular expression information comprises screening logic of regular expressions;

the second judgment submodule is used for judging whether the text data hits the rule logic of the rule body according to the screening logic of the regular expression;

a second determining submodule, configured to determine that the text data hits the rule logic of the rule group when the second judgment submodule judges that the text data hits the rule logic of the rule body;

a third determining submodule, configured to determine that the text data does not hit the rule logic of the rule group when the second judgment submodule judges that the text data hits the rule logic of none of the rule bodies in the same rule group.

By the technical scheme, the technical scheme provided by the invention at least has the following advantages:

The invention provides a method and a device for classifying texts. Strong rule logic supplied by the service is first used to perform forced classification matching on an original text. If the strong rule logic is satisfied, the category of the original text is determined directly from the matching result of the forced classification matching. If it is not satisfied, an improved text classification model classifies the original text; because the improved model contains weak rule logic supplied by the service, it can expand the features of the text data after the original text is vectorized, and classify the original text using the expanded features as well. Compared with the prior art, this solves the problems that classifying original texts with an existing text classification model cannot meet the requirements of different service contents, wastes cost, and is complicated, redundant and inefficient. By introducing the service's strong rule logic and weak rule logic into the classification of the original text, the invention brings the classification result closer to the requirements of different services and improves the quality of the classification result.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

fig. 1 is a flowchart of a method for classifying texts according to an embodiment of the present invention;

fig. 2 is a flowchart of another method for classifying texts according to an embodiment of the present invention;

fig. 3 is a flowchart of performing classification processing on a text by using a text classification model according to an embodiment of the present invention;

fig. 4 is a block diagram illustrating an apparatus for classifying texts according to an embodiment of the present invention;

fig. 5 is a block diagram of another apparatus for classifying text according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

The embodiment of the invention provides a method for classifying texts, as shown in fig. 1, the method introduces a service strong rule logic and a service weak rule logic when performing classification processing on an original text, and optimizes the process of performing classification processing on the original text, and the embodiment of the invention provides the following specific steps:

101. Judging whether the text data to be classified matches the preset strong rule logic.

In the embodiment of the present invention, the text data refers to an original text to be classified. The original text may be collected by a web crawler or other technologies, for example by using a web crawler to obtain comment text data from a web page; the embodiment of the present invention does not limit the specific method for obtaining the original text.

The preset strong rule logic is used for performing forced classification matching on the text data in advance. It is provided by the service party and is mandatory: once the text data matches the strong rule logic, the text data is classified directly according to the corresponding matching result, which determines the category to which the text data belongs.

In the embodiment of the present invention, performing forced classification matching on text data in advance with strong rule logic is equivalent to performing classification at the highest priority. For example, when comment text data collected from a comment website is screened to judge whether a piece of comment text data to be classified is spam text, the content of the comment text data can be searched for keywords related to gambling, profanity or advertising, such as "gambling" or "register and win ten million gifts". If such a keyword is found, the comment text data can be directly judged to be spam text. The rule of searching the content of comment text data for keywords related to gambling, profanity or advertising can serve as strong rule logic provided by the service party, since such keywords directly indicate whether the comment text data is spam text.

102a. If the text data is judged to match the preset strong rule logic, determining the classification of the text data according to the matching result corresponding to the preset strong rule logic.

In the embodiment of the present invention, for example, the preset strong rule logic is provided by the service party and is used for judging whether the text data is spam text. When the text data is judged to match the preset strong rule logic, the text data can be directly judged to be spam text, and the category to which the text data belongs is determined to be spam text.

102b. If the text data is judged not to match the preset strong rule logic, performing classification processing on the text data through a preset text classification model.

The preset text classification model comprises preset weak rule logic, and the preset weak rule logic is used for expanding features according to the service requirements when classification processing is performed on the text data, so that the classification result obtained by the classification processing matches the service requirements.

In the embodiment of the invention, the preset weak rule logic is provided by the service party and is used for expanding the features of the text data, so that the original text is classified using the expanded features as well and the classification result is closer to the requirements of different services. The preset text classification model differs from an existing text classification model. An existing classification model classifies text data as follows: word segmentation is performed on the text data, the word segments are vectorized, a feature selector performs feature selection on the vectorization result, and a classifier processes the selected features to obtain the classification result predicted for the text data. In the embodiment of the present invention, the existing classification model is improved at training time by adding weak rule logic according to the service requirements. When the improved text classification model classifies text data, after the feature selector has performed feature selection on the vectorization result as in the existing flow, additional features are expanded according to the weak rule logic, and the classifier processes the features output by the feature selector together with the expanded features. The classification result finally predicted for the text data is therefore closer to the service requirements than the result predicted by the existing model.

In the embodiment of the present invention, if the text data is judged not to match the preset strong rule logic (continuing the example in 102a above, the text data is thereby indirectly judged not to be spam text, fraud text, a rumor, or the like), the text data is further classified by the improved text classification model. As the service content changes, the existing text classification model can be improved simply and flexibly by adding weak rule logic, which keeps the classification result close to the requirements of different services and improves the quality of the classification result.
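Steps 101, 102a and 102b amount to a two-stage dispatch, which the following Python sketch illustrates. It is only a sketch under stated assumptions: the keyword list `SPAM_KEYWORDS`, the category labels, and `placeholder_model` are hypothetical and stand in for the strong rule logic and the preset text classification model that the service party would actually supply.

```python
from typing import Callable, Optional

# Hypothetical keywords standing in for strong rule logic supplied by the service party.
SPAM_KEYWORDS = ("gambling", "casino")

def strong_rule_category(text: str) -> Optional[str]:
    """Step 101: return a category if a strong rule matches, otherwise None."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in SPAM_KEYWORDS):
        return "spam"
    return None

def classify(text: str, model: Callable[[str], str]) -> str:
    """Try the strong rules first; fall back to the model only when none matches."""
    category = strong_rule_category(text)
    if category is not None:
        return category   # step 102a: the rule result decides the category
    return model(text)    # step 102b: the classification model decides

def placeholder_model(text: str) -> str:
    # Stand-in for the preset text classification model (assumption).
    return "positive" if "good" in text.lower() else "other"

print(classify("Register at the online casino today", placeholder_model))
print(classify("The movie is good.", placeholder_model))
```

Because the strong rules run first, the model is never consulted for texts they already decide, which is what gives the strong rule logic the highest priority.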

The embodiment of the invention provides a method and a device for classifying texts. Strong rule logic supplied by the service is first used to perform forced classification matching on an original text. If the strong rule logic is satisfied, the category of the original text is determined directly from the matching result corresponding to the preset strong rule logic. If it is not satisfied, an improved text classification model classifies the original text; because the improved model contains the service's weak rule logic, the features of the text data can be expanded according to the weak rule logic after the original text is vectorized, and the original text is classified using the expanded features as well. Compared with the prior art, this solves the problems that classifying original texts with an existing text classification model cannot meet the requirements of different service contents, wastes cost, and is complicated, redundant and inefficient. By introducing the service's strong rule logic and weak rule logic into the classification of the original text, the embodiment of the invention brings the classification result closer to the requirements of different services and improves the quality of the classification result.

In order to describe the above embodiment in more detail, the embodiment of the present invention further provides another method for classifying texts, as shown in fig. 2. The method judges whether text data matches the strong rule logic through the screening logic of regular expressions, and expands the features of the text data according to a plurality of rule groups included in the weak rule logic. The embodiment of the present invention provides the following specific steps:

201. Judging whether the text data to be classified matches the preset strong rule logic.

Wherein the preset strong rule logic is used for distinguishing whether the text data belongs to a category irrelevant to the service requirement.

In the embodiment of the invention, the preset strong rule logic comprises rule bodies and rule matching results corresponding to each rule body, and the rule bodies are written by regular expressions. For example: the strong rule logic written is as follows:

body | result
'.*entertainment city.*' | 1
'.*bet platform.*' | 1
'.*bet net.*' | 1
'.*registration.{0,4}casino.*' | 1
'.*registration.{0,4}gambling.*' |
'.*registration.{0,4}online.*' | 1
'.*Xinglujing.*' | 1
'.*I feel.*' | 0
'.*I consider.*' | 0
'.*dotting.{0,1}mark.*' | 0

Here, "body" is the rule body and "result" is the rule matching result. A result of "1" indicates that the text data hits the strong rule logic, and "0" indicates that the text data misses the rule body corresponding to the strong rule logic. For example, if the strong rule logic is used for judging whether the text data is spam text, a matched rule result of "1" indicates that the text data is spam text, and a matched rule result of "0" indicates that the text data is not spam text.

In the embodiment of the present invention, the specific steps of judging whether the text data matches the preset strong rule logic may include: acquiring the regular expression information corresponding to each rule body, wherein the regular expression information comprises the screening logic of the regular expression; screening the text data according to the screening logic of the regular expression; judging whether a target text matching the screening logic of the regular expression is screened out from the text data; and if so, determining that the text data matches the preset strong rule logic. It should be noted that when the text data matches the preset strong rule logic, there are two possible matching relationships between the text data and the strong rule logic: the text data hits the strong rule logic (the rule matching result is "1"), or the text data misses the rule body corresponding to the strong rule logic (the rule matching result is "0").

Further, through a specific application scenario, detailed description is made on a specific step of judging whether the text data is logically matched with the strong rule:

for example, a specific application scenario is to filter the comment text data of a certain comment website to remove spam text data, and the following 4 pieces of comment text data have been collected.

Text 1: I feel the movie good.

Text 2: The movie is good.

Text 3: Today registered 16788 online casinos send 100 beans.

Text 4: Registering Bowin app and sending thousand yuan good gift.

Texts 1 to 4 are each matched against the strong rule logic above. To present the matching process clearly, how each text is matched, the corresponding matching result, and the classification result determined from the matching result are listed in Table 1 below.

Table 1

Text 1 and text 3 each match the strong rule logic. According to the rule results of the rules they match, text 1 is determined not to be spam text, and text 3 is determined to be spam text.
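The matching of texts 1 to 4 can be reproduced with a short Python sketch. The two rule bodies below are hypothetical English stand-ins chosen to fit the translated example texts, not the rule bodies of the embodiment, and returning the result of the first matching rule is an assumption, since the source does not state how several simultaneous matches would be resolved.

```python
import re
from typing import Optional

# Hypothetical rule bodies with their rule matching results: 1 = spam, 0 = not spam.
STRONG_RULES = [
    (r".*casino.*", 1),
    (r".*I feel.*", 0),
]
COMPILED = [(re.compile(body, re.IGNORECASE), result) for body, result in STRONG_RULES]

def match_strong_rules(text: str) -> Optional[int]:
    """Return the result of the first rule body that matches, or None if no rule matches."""
    for pattern, result in COMPILED:
        if pattern.search(text):
            return result
    return None

texts = [
    "I feel the movie good.",
    "The movie is good.",
    "Today registered 16788 online casinos send 100 beans.",
    "Registering Bowin app and sending thousand yuan good gift.",
]
for text in texts:
    print(match_strong_rules(text), "<-", text)
```

Under these assumed rules, text 1 yields 0 (not spam), text 3 yields 1 (spam), and texts 2 and 4 yield `None`, meaning they match no strong rule and would be passed on to the text classification model.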

202a. If the text data is judged to match the preset strong rule logic, determining the classification of the text data according to the matching result corresponding to the preset strong rule logic.

In the embodiment of the present invention, continuing the above example of judging whether the text data is spam text, and as shown in Table 1 above, when the text data matches the strong rule logic, the classification of the text data is determined according to the corresponding rule matching result: if the text data hits the strong rule logic (the rule matching result is "1"), the text data is determined to be spam text; if the text data misses the rule body corresponding to the strong rule logic (the rule matching result is "0"), the text data is determined not to be spam text.

In the embodiment of the present invention, if the text data is judged not to match the preset strong rule logic, the text data is classified by the preset text classification model according to the following steps 202b to 206b.

202b. Performing word segmentation on the text data.

203b. Performing vectorization on the word segments, and outputting a plurality of feature dimensions corresponding to the text data and dimension information corresponding to each feature dimension.

204b. Performing feature selection on the plurality of feature dimensions by using a feature selector, and outputting the screened feature dimensions and corresponding dimension information.

In steps 202b to 204b above, word segmentation is first performed on the text data to be classified; for example, the text data may be segmented by using dependency syntax, and the embodiment of the present invention does not limit the word segmentation method. Next, a vectorizer performs vectorization on the word segments and outputs a plurality of feature dimensions corresponding to the text data and the dimension information corresponding to each feature dimension. Then, a feature selector performs feature selection on the output feature dimensions and outputs the screened feature dimensions and the corresponding dimension information. These steps are also performed in existing text classification models and are not described in detail here.
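Steps 202b to 204b can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the embodiment's implementation: whitespace splitting replaces real word segmentation, a bag-of-words count replaces the vectorizer, and the feature selector is reduced to a fixed, hypothetical vocabulary `KEPT` that a real selector would instead learn from training data.

```python
from collections import Counter
from typing import Dict, List, Set

def segment(text: str) -> List[str]:
    """Toy word segmentation: lowercase, split on whitespace, strip punctuation."""
    words = (w.strip(".,!?") for w in text.lower().split())
    return [w for w in words if w]

def vectorize(tokens: List[str]) -> Dict[str, int]:
    """Bag-of-words: each distinct word is a feature dimension, its count the dimension information."""
    return dict(Counter(tokens))

def select_features(vector: Dict[str, int], kept: Set[str]) -> Dict[str, int]:
    """Toy feature selector: keep only the dimensions in a preselected vocabulary."""
    return {dim: value for dim, value in vector.items() if dim in kept}

KEPT = {"movie", "good"}  # hypothetical result of fitting a selector on training data

tokens = segment("The movie is good, really good.")
vector = vectorize(tokens)
selected = select_features(vector, KEPT)
print(tokens)
print(vector)
print(selected)
```

The dictionary returned by `select_features` corresponds to the screened feature dimensions and their dimension information, which are later combined with the dimensions expanded from the weak rule logic.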

205b, expanding the characteristic dimension of the text data and obtaining corresponding dimension information according to the preset weak rule logic.

The preset weak rule logic comprises a plurality of rule groups, each rule group corresponds to a plurality of rule bodies, each rule body is written as a regular expression, and each rule body corresponds to one rule matching result. For example, the weak rule logic may be written as follows:

group body result
1 'Megao' 1
2'. on the line 1
Entertainment 1
Bet 1
2'. registration. {0,4} app. 1
2'. registration. {0,4} App.' 1
3'. Myanmar' 1
3'. nine five extreme' 1
2'. The win
1'. la vicas.' 1
3'. sending. {0,4} good gift. 1
4'. The' 1 is a fantasy of {0,10}
4'. best comment,. 1
4' difference score 1

Here, "group" is the rule group (one of "1", "2", "3" and "4"), "body" is the rule body, and "result" is the rule matching result; a "1" in the rule matching result indicates that the text data hits the rule logic of the rule body.
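One way to hold such a rule table in memory is sketched below. The rule bodies are hypothetical English patterns written for illustration, not the rules listed in the patent, and the names `WEAK_RULE_ROWS` and `group_rules` are assumptions.

```python
from collections import defaultdict

# Each row is (group, body, result); the bodies are illustrative regular expressions.
WEAK_RULE_ROWS = [
    (1, r".*casino.*", 1),
    (2, r".*register.{0,4}app.*", 1),
    (2, r".*online.*", 1),
    (3, r".*free.{0,4}gift.*", 1),
    (4, r".*good comment.*", 1),
    (4, r".*bad score.*", 1),
]


def group_rules(rows):
    """Collect the rule bodies and results of each rule group under its group number."""
    groups = defaultdict(list)
    for group, body, result in rows:
        groups[group].append((body, result))
    return dict(groups)


RULE_GROUPS = group_rules(WEAK_RULE_ROWS)
```

Keying the rules by group number makes it straightforward to treat each rule group as a single expanded feature dimension later on.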

In the embodiment of the present invention, the specific steps of expanding the feature dimensions of the text data and obtaining the corresponding dimension information according to the weak rule logic may be as follows:

First, the rule groups contained in the preset weak rule logic are obtained.

In the embodiment of the present invention, following the above weak rule logic, "group" takes the values "1", "2", "3" and "4", so the 4 rule groups contained in the weak rule logic are obtained.

Second, each rule group is determined as an expanded feature dimension.

Third, whether the text data hits the rule logic of the rule group is judged.

In the embodiment of the present invention, the specific steps of determining whether the text data hits the rule logic of a rule group may include: within the same rule group, querying the regular expression information corresponding to each rule body, where the regular expression information includes the screening logic of the regular expression; judging, according to the screening logic of the regular expression, whether the text data hits the rule logic of the rule body; if so, determining that the text data hits the rule logic of the rule group; if not, and the text data hits the rule logic of none of the rule bodies in the same rule group, determining that the text data does not hit the rule logic of the rule group.
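The group-level hit test described above amounts to a logical OR over the rule bodies of one group. A minimal sketch, assuming each rule body is a Python regular expression and using hypothetical names and patterns:

```python
import re


def hits_rule_body(text, body):
    """True when the text matches the regular expression of a single rule body."""
    return re.search(body, text) is not None


def hits_rule_group(text, bodies):
    """A rule group is hit as soon as any one of its rule bodies is hit;
    it is missed only when none of its rule bodies is hit."""
    return any(hits_rule_body(text, body) for body in bodies)


# Hypothetical rule bodies for one rule group.
group_4 = [r".*good comment.*", r".*bad score.*"]
```

Because `any` stops at the first hit, the remaining rule bodies of a group are not evaluated once one of them matches.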

Further, the specific steps of determining whether the text data hits the rule logic of a rule group are described in detail through a specific application scenario, in which the weak rule logic is likewise used to determine whether the text data is a spam text. Text 2 and text 4 are matched against the rule logic; how the matching is performed, the corresponding matching results, and the classification results determined from those matching results are given in table two below.

Table two

As shown in table two, a rule group corresponds to a plurality of rule bodies, and each rule body corresponds to a rule matching result. The following excerpt shows the 4th rule group of the above weak rule logic:

group body result
4'. The' 1 is a fantasy of {0,10}
4'. best comment,. 1
4' difference score 1

In the embodiment of the present invention, text 2 matches the "good comment" rule body of the 4th rule group, that is, text 2 hits the rule logic of that rule body. When the text data hits the rule logic of any one rule body in a rule group, it is determined that the text data hits the rule logic of that rule group; text 2 therefore hits the rule logic of the 4th rule group, which may be marked as "1" as the corresponding rule result. Conversely, if the text data hits the rule logic of none of the rule bodies in a rule group, it is determined that the text data does not hit the rule logic of that rule group; text 2 does not hit the rule logic of the 1st, 2nd and 3rd rule groups, each of which may be marked as "0". This gives the feature dimension array [..., 0, 0, 0, 1] recorded in table two, in which the omitted part "..." corresponds to the screened feature dimensions output in step 204b, and the last 4 feature dimensions are those expanded according to the weak rule logic. From left to right, these 4 feature dimensions correspond to the 4 rule groups of the weak rule logic; a "0" indicates that the rule logic of the rule group is not hit, and a "1" indicates that it is hit.

Fourth, if so, the information corresponding to the text data hitting the rule logic of the rule group is taken as the rule matching result, and the rule matching result is determined as the dimension information corresponding to the feature dimension.

Fifth, if not, the information corresponding to the text data missing the rule logic of the rule group is taken as the rule matching result, and the rule matching result is determined as the dimension information corresponding to the feature dimension.

In the embodiment of the present invention, the "fourth" and "fifth" steps are described in detail using the example of table two. Text 2 matches the 4th rule group, that is, text 2 hits the rule logic of the 4th rule group and is marked as "1"; this "1" is the rule result obtained when text 2 is matched against the 4th rule group, and it is the dimension information of the 4th expanded feature dimension in the feature dimension array [..., 0, 0, 0, 1] in table two. Correspondingly, text 2 does not match the 1st, 2nd and 3rd rule groups, that is, text 2 does not hit the rule logic of those rule groups, and each is marked as "0"; these "0" values are the rule results obtained when text 2 is matched against the 1st, 2nd and 3rd rule groups, and they are the dimension information of the 1st, 2nd and 3rd expanded feature dimensions in the feature dimension array [..., 0, 0, 0, 1] in table two.
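The construction of the array [..., 0, 0, 0, 1] can be sketched as below. The rule bodies, the sample text and the screened feature values are hypothetical stand-ins chosen so that only the 4th rule group is hit; `expand_features` is an assumed name.

```python
import re

# Hypothetical rule groups, keyed by group number; each value lists the rule bodies.
RULE_GROUPS = {
    1: [r".*casino.*"],
    2: [r".*register.{0,4}app.*", r".*online.*"],
    3: [r".*free.{0,4}gift.*"],
    4: [r".*good comment.*", r".*bad score.*"],
}


def expand_features(text, rule_groups):
    """Return one 0/1 value per rule group, in ascending group order:
    1 when any rule body of the group is hit, 0 otherwise."""
    return [
        1 if any(re.search(body, text) for body in rule_groups[group]) else 0
        for group in sorted(rule_groups)
    ]


screened = [0, 2, 1]  # assumed output of the feature selector for this text
text_2 = "seller asked me to leave a good comment"
feature_array = screened + expand_features(text_2, RULE_GROUPS)
```

The screened dimensions come first and the four expanded dimensions are appended at the end, mirroring the layout of the array in table two.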

206b, inputting the screened feature dimensions and corresponding dimension information, together with the expanded feature dimensions and corresponding dimension information, into a classifier, and outputting a predicted classification result for the text data.

In the embodiment of the present invention, the expanded feature dimensions and corresponding dimension information obtained from the weak rule logic are introduced into the existing text classification model, which improves the existing model. Because the weak rule logic is set according to changing service requirements, processing the screened feature dimensions and corresponding dimension information together with the expanded feature dimensions and corresponding dimension information in the classifier makes the predicted classification result for the text data closer to the different requirements of the service and improves the quality of the output classification result.
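Step 206b can be illustrated with a toy classifier trained on the concatenated features. The patent does not name a classifier, so the perceptron below is only a stand-in, and the training vectors and labels are invented for the example (label 1 is taken to mean spam).

```python
def predict(x, weights, bias):
    """Linear decision: 1 when the weighted sum is positive, else 0."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def train_perceptron(samples, labels, epochs=10):
    """Classic perceptron updates over the concatenated feature vectors."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(x, weights, bias)
            if error != 0:
                weights = [w + error * v for w, v in zip(weights, x)]
                bias += error
    return weights, bias


# Assumed screened features (2 dimensions) and expanded features (4 rule groups).
screened = [[1, 0], [0, 1], [1, 1], [0, 2]]
expanded = [[0, 0, 0, 1], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
samples = [s + e for s, e in zip(screened, expanded)]
labels = [1, 0, 1, 0]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(x, weights, bias) for x in samples]
```

The point of the sketch is only that the classifier sees both feature sets as one vector; any classifier that accepts a fixed-length numeric vector could take the perceptron's place.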

Further, an embodiment of the present invention provides a process for classifying text through a text classification model. As shown in fig. 3, the service strong rule logic is first used to perform forced classification matching on the original text. If the service strong rule logic is satisfied, the classification of the original text is determined directly from the matching result of the forced classification matching; if it is not satisfied, the improved text classification model is used to classify the original text. As shown on the left side of fig. 3, the existing text classification model is improved by adding weak rule logic that expands the features of the text data. Because the improved text classification model includes the service weak rule logic, when it classifies the text data, as shown on the right side of fig. 3, the feature selector of the existing classification flow first performs feature selection on the vectorization result, some features are then expanded according to the weak rule logic, and the original text is classified using the expanded features as well. The classification result finally predicted for the text data is therefore closer to the service requirement than the result predicted by the existing model.
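The overall flow of fig. 3 can be sketched as a small dispatcher: strong rules first, and the improved model only when no strong rule applies. All names, rule bodies and the stand-in feature extractor and classifier below are assumptions made for illustration.

```python
import re


def strong_rule_label(text, strong_rules):
    """Return the label of the first strong rule hit, or None when none is hit."""
    for body, label in strong_rules:
        if re.search(body, text):
            return label
    return None


def improved_model(text, base_features, weak_groups, classifier):
    """Append one 0/1 dimension per weak rule group to the base features, then classify."""
    features = list(base_features(text))
    features += [
        1 if any(re.search(body, text) for body in bodies) else 0
        for bodies in weak_groups
    ]
    return classifier(features)


def classify(text, strong_rules, base_features, weak_groups, classifier):
    label = strong_rule_label(text, strong_rules)
    if label is not None:
        return label, "strong rule"
    return improved_model(text, base_features, weak_groups, classifier), "model"


# Hypothetical stand-ins for the rules, the feature pipeline and the classifier.
STRONG = [(r".*free lottery.*", "spam")]
WEAK = [[r".*register.{0,4}app.*"], [r".*good comment.*"]]


def word_count_feature(text):
    return [len(text.split())]


def toy_classifier(features):
    # Treat any weak rule group hit (all dimensions after the first) as spam.
    return "spam" if sum(features[1:]) > 0 else "not spam"
```

For example, `classify("please register app to continue", STRONG, word_count_feature, WEAK, toy_classifier)` skips the strong rules and is decided by the model using the expanded dimensions.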

In order to achieve the above object, according to another aspect of the present invention, an embodiment of the present invention further provides a storage medium, where the storage medium includes a stored program, and when the program runs, a device on which the storage medium is located is controlled to execute the method for classifying texts.

In order to achieve the above object, according to another aspect of the present invention, an embodiment of the present invention further provides a processor, configured to execute a program, where the program executes the method for classifying texts.

Further, as an implementation of the method shown in fig. 1 and fig. 2, an embodiment of the present invention provides an apparatus for classifying texts. The apparatus embodiment corresponds to the method embodiment; for ease of reading, the details of the method embodiment are not repeated one by one here, but it should be clear that the apparatus in this embodiment can implement all of the contents of the method embodiment. The apparatus is applied to optimizing the classification processing performed on text by adding the strong rule logic and weak rule logic of a service. Specifically, as shown in fig. 4, the apparatus includes:

a judging unit 41, configured to judge whether text data to be classified is matched with preset strong rule logic, where the preset strong rule logic is used to distinguish whether the text data belongs to a category unrelated to a service requirement;

a determining unit 42, configured to determine, when the judging unit 41 judges that the text data matches the preset strong rule logic, the classification of the text data according to the matching result corresponding to the preset strong rule logic;

an executing unit 43, configured to, when the judging unit 41 judges that the text data does not match the preset strong rule logic, perform classification processing on the text data through a preset text classification model, where the preset text classification model includes a preset weak rule logic, and the preset weak rule logic is configured to expand features according to the service requirement when performing classification processing on the text data, so that the classification result obtained by the classification processing matches the service requirement.

Further, as shown in fig. 5, the preset strong rule logic includes rule bodies and rule matching results corresponding to each of the rule bodies, and the rule bodies are written in regular expressions.

Further, as shown in fig. 5, the judging unit 41 includes:

an obtaining module 411, configured to obtain regular expression information corresponding to each rule body, where the regular expression information includes a screening logic of a regular expression;

a screening module 412, configured to perform screening processing on the text data according to the screening logic of the regular expression obtained by the obtaining module 411;

a judging module 413, configured to judge whether a target text that matches the screening logic of the regular expression is screened out from the text data;

a determining module 414, configured to determine that the text data matches the preset strong rule logic when the judging module 413 judges that a target text matching the screening logic of the regular expression is screened out from the text data.

Further, as shown in fig. 5, the execution unit 43 includes:

a word segmentation module 431, configured to perform word segmentation on the text data;

a vectorization processing module 432, configured to perform vectorization processing on the segmented words obtained by the word segmentation module 431, and output a plurality of feature dimensions corresponding to the text data and dimension information corresponding to each feature dimension;

a feature selection module 433, configured to perform feature selection on the plurality of feature dimensions by using a feature selector, and output the feature dimensions after being filtered and corresponding dimension information;

an extension module 434, configured to extend the feature dimension of the text data and obtain corresponding dimension information according to a preset weak rule logic;

an executing module 435, configured to input the feature dimensions and the corresponding dimension information filtered by the feature selecting module 433, the feature dimensions expanded by the expanding module 434, and the corresponding dimension information into a classifier, and output a classification result of performing prediction on the text data.

Further, as shown in fig. 5, the preset weak rule logic includes a plurality of rule groups, each rule group corresponds to a plurality of rule bodies, each rule body is written in a regular expression, and each rule body corresponds to a rule matching result.

Further, as shown in fig. 5, the extension module 434 includes:

an obtaining submodule 4341, configured to obtain a rule group included in the preset weak rule logic;

a first determining submodule 4342, configured to determine the rule group obtained by the obtaining submodule 4341 as an expanded feature dimension;

a first judging submodule 4343, configured to judge whether the text data hits the rule logic of the rule group obtained by the obtaining submodule 4341;

a first executing submodule 4344, configured to, when the first judging submodule 4343 judges that the text data hits the rule logic of the rule group, take the information corresponding to the text data hitting the rule logic of the rule group as a rule matching result, and determine the rule matching result as the dimension information corresponding to the feature dimension;

a second executing submodule 4345, configured to, when the first judging submodule 4343 judges that the text data misses the rule logic of the rule group, take the information corresponding to the text data missing the rule logic of the rule group as a rule matching result, and determine the rule matching result as the dimension information corresponding to the feature dimension.

Further, as shown in fig. 5, the first judging submodule 4343 includes:

a query submodule 43431, configured to query, in the same rule group, regular expression information corresponding to each rule body, where the regular expression information includes a screening logic of a regular expression;

a second judging submodule 43432, configured to judge, according to the screening logic of the regular expression, whether the text data hits the rule logic of the rule body;

a second determining submodule 43433, configured to determine that the text data hits the rule logic of the rule group when the second judging submodule 43432 judges that the text data hits the rule logic of the rule body;

a third determining submodule 43434, configured to determine that the text data misses the rule logic of the rule group when the second judging submodule 43432 judges that the text data hits the rule logic of none of the rule bodies in the same rule group.

In summary, the present invention first uses the service strong rule logic to perform forced classification matching on the original text, judging through the screening logic of the regular expressions whether the text data matches the strong rule logic. If the text data satisfies the service strong rule logic, the classification of the original text is determined directly from the matching result of the forced classification matching. If it does not, the improved text classification model is used to classify the original text. Because the improved text classification model contains the service weak rule logic, the features of the text data can be expanded, after the original text has been vectorized, according to the plurality of rule groups contained in the weak rule logic, and the original text is then classified using the expanded features as well. By introducing the service strong rule logic and weak rule logic into the classification of the original text, the invention brings the classification result closer to the requirements of different services and improves the quality of the classification result.

The device for classifying the texts comprises a processor and a memory, wherein the judging unit, the determining unit, the executing unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided. By adjusting the kernel parameters, the processing flow for classifying the original text is optimized, so that the classification result is closer to the requirements of different services, the quality of the classification result is improved, and the classification efficiency is also greatly improved.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

An embodiment of the present invention provides a storage medium on which a program is stored, which, when executed by a processor, implements the method of classifying text.

The embodiment of the invention provides a processor, which is used for running a program, wherein the method for classifying texts is executed when the program runs.

The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor executes the program and realizes the following steps:

a method of classifying text, the method comprising: judging whether the text data to be classified is matched with preset strong rule logic or not, wherein the preset strong rule logic is used for distinguishing whether the text data belongs to a category irrelevant to service requirements or not; if yes, determining the classification of the text data according to a matching result corresponding to the preset strong rule logic; if not, performing classification processing on the text data through a preset text classification model, wherein the preset text classification model comprises a preset weak rule logic, and the preset weak rule logic is used for expanding characteristics according to the service requirements when performing classification processing on the text data so as to enable classification results obtained correspondingly by the classification processing to be matched with the service requirements.

Further, the preset strong rule logic includes rule bodies and rule matching results corresponding to each rule body, and the rule bodies are written in regular expressions.

Further, the judging whether the text data matches the preset strong rule logic includes:

acquiring regular expression information corresponding to each rule body, wherein the regular expression information comprises screening logic of regular expressions; screening the text data according to the screening logic of the regular expression; judging whether a target text matched with the screening logic of the regular expression is screened out from the text data; and if so, determining that the text data matches the preset strong rule logic.

Further, the performing classification processing on the text data through a preset text classification model includes: performing word segmentation on the text data; vectorization processing is carried out on the word segmentation, and a plurality of characteristic dimensions corresponding to the text data and dimension information corresponding to each characteristic dimension are output; performing feature selection on the plurality of feature dimensions by using a feature selector, and outputting the screened feature dimensions and corresponding dimension information; according to preset weak rule logic, expanding the characteristic dimension of the text data and obtaining corresponding dimension information; inputting the screened feature dimensions and the corresponding dimension information, the expanded feature dimensions and the corresponding dimension information into a classifier, and outputting a classification result for performing prediction on the text data.

Further, the preset weak rule logic includes a plurality of rule groups, each rule group corresponds to a plurality of rule bodies, each rule body is written in a regular expression, and each rule body corresponds to a rule matching result.

Further, expanding the feature dimension of the text data and obtaining corresponding dimension information according to a preset weak rule logic includes: acquiring a rule group contained in the preset weak rule logic; determining the rule group as an expanded feature dimension; judging whether the text data hits the rule logic of the rule group; if so, taking the information corresponding to the text data hitting the rule logic of the rule group as a rule matching result, and determining the rule matching result as the dimension information corresponding to the feature dimension; if not, taking the information corresponding to the text data missing the rule logic of the rule group as a rule matching result, and determining the rule matching result as the dimension information corresponding to the feature dimension.

Further, the judging whether the text data hits the rule logic of the rule group includes: within the same rule group, querying regular expression information corresponding to each rule body, wherein the regular expression information comprises screening logic of regular expressions; judging whether the text data hits the rule logic of the rule body according to the screening logic of the regular expression; if yes, determining that the text data hits the rule logic of the rule group; if not, and the text data hits the rule logic of none of the rule bodies in the same rule group, determining that the text data does not hit the rule logic of the rule group.

The device herein may be a server, a PC, a PAD, a mobile phone, etc.

The present application further provides a computer program product adapted to perform program code for initializing the following method steps when executed on a data processing device: judging whether the text data to be classified is matched with preset strong rule logic or not, wherein the preset strong rule logic is used for distinguishing whether the text data belongs to a category irrelevant to service requirements or not; if yes, determining the classification of the text data according to a matching result corresponding to the preset strong rule logic; if not, performing classification processing on the text data through a preset text classification model, wherein the preset text classification model comprises a preset weak rule logic, and the preset weak rule logic is used for expanding characteristics according to the service requirements when performing classification processing on the text data so as to enable classification results obtained correspondingly by the classification processing to be matched with the service requirements.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal or a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method of classifying text, the method comprising:

2. The method of claim 1, wherein the preset strong rule logic comprises rule bodies and rule matching results corresponding to each of the rule bodies, and the rule bodies are written in regular expressions.

3. The method of claim 2, wherein said judging whether the text data to be classified matches the preset strong rule logic comprises:

4. The method according to claim 1, wherein the performing a classification process on the text data through a preset text classification model comprises:

performing word segmentation on the text data;

5. The method of claim 4, wherein the pre-set weak rule logic comprises a plurality of rule groups, wherein the rule groups correspond to a plurality of rule bodies, wherein the rule bodies are written in regular expressions, and wherein one of the rule bodies corresponds to one of the rule matching results.

6. The method of claim 5, wherein expanding the feature dimensions of the text data and obtaining corresponding dimension information according to a preset weak rule logic comprises:

acquiring a rule group contained in the preset weak rule logic;

determining the rule group as an expanded feature dimension;

judging whether the text data hits the rule logic of the rule group;

7. The method of claim 6, wherein the determining whether the text data hits in the rule logic of the rule group comprises:

if yes, determining that the text data hits the rule logic of the rule group;

8. An apparatus for classifying text, the apparatus comprising:

9. A storage medium comprising a stored program, wherein the program, when executed, controls an apparatus on which the storage medium is located to perform a method of classifying text as claimed in any one of claims 1-7.

10. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of classifying text according to any one of claims 1-7.