CN114328930A

CN114328930A - Text classification method and system based on entity extraction

Info

Publication number: CN114328930A
Application number: CN202111666994.7A
Authority: CN
Inventors: 章明珠; 钟志成; 刘宇
Original assignee: Chengdu Siwei Century Technology Co ltd
Current assignee: Chengdu Siwei Century Technology Co ltd
Priority date: 2021-12-31
Filing date: 2021-12-31
Publication date: 2022-04-12

Abstract

The invention discloses a text classification method and system based on entity extraction. The method comprises customizing keywords for each text category and then executing the following steps: S100: performing character-based entity labeling of the customized keywords in texts of known categories to obtain training samples; S200: training the keyword extraction model for entity recognition using the training samples; S300: extracting keywords from the text to be classified using the trained keyword extraction model; S400: using a similarity detection model to calculate the similarity between the extracted keywords and the customized keywords of each text category; S500: classifying the text to be classified based on the similarities. The method is highly reusable and more accurate, remains applicable to similarity detection when texts are numerous or very long, and can be applied to the automatic classification of text data in enterprise data assets.

Description

Text classification method and system based on entity extraction

Technical Field

The invention relates to the technical field of text processing, in particular to a text classification method and system based on entity extraction.

Background

The existing text similarity detection method mostly uses Simhash similarity, and the steps are summarized as follows:

(1) Segment the text and apply a keyword extraction method to take the N words (features) with the highest weights in the document, together with their weights, giving a set of (feature: weight) pairs of length N. The larger the weight, the more important the word is to the text.

(2) Apply an ordinary hash to each word in the set to obtain a corresponding 64-bit binary number (hash), so that the (feature: weight) set becomes a (hash: weight) set of length N.

(3) For each binary number in the (hash: weight) set, take the positive weight at every position whose bit is 1 and the negative weight at every position whose bit is 0, giving N lists of length 64. For example, if a word's hashed binary number and weight are (010111: 5), the list obtained in step (3) is [-5, 5, -5, 5, 5, 5]. The binary number here has only 6 bits for illustration; in practical applications it usually has 64 bits.

(4) Sum the N lists obtained in step (3) position by position to obtain a single list of length 64. For example, summing the lists [-5, 5, -5, 5, 5, 5], [-3, -3, -3, 3, -3, 3] and [1, -1, -1, 1, 1, 1] position by position gives the list [-7, 1, -9, 9, 3, 9].

(5) Check the sign of each value in the list obtained in step (4): a negative value gives 0 at the corresponding position and a positive value gives 1, which yields the Simhash value of the text. For example, the list [-7, 1, -9, 9, 3, 9] yields 010111, so the Simhash value of the text is 010111.

(6) XOR the Simhash values of the two texts. If the number of 1 bits in the result exceeds a user-defined threshold M, the two texts are judged dissimilar; otherwise they are judged similar.
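Steps (1) to (6) can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation of any particular library: the function names are hypothetical, MD5 truncated to 64 bits stands in for the unspecified "ordinary hash", and a column sum of exactly zero is mapped to 0, a case the description above does not cover.

```python
import hashlib

def signed_column_sums(hash_weight_pairs):
    """Steps (3)-(4): add +weight where a bit is 1, -weight where it is 0."""
    bits = len(hash_weight_pairs[0][0])
    totals = [0] * bits
    for bit_string, weight in hash_weight_pairs:
        for i, bit in enumerate(bit_string):
            totals[i] += weight if bit == "1" else -weight
    return totals

def fingerprint(totals):
    """Step (5): positive sums become 1, everything else 0 (zero is an assumption)."""
    return "".join("1" if total > 0 else "0" for total in totals)

def word_hash(word, bits=64):
    """Step (2): an ordinary hash of the word; MD5 is an assumed choice."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") & ((1 << bits) - 1)
    return format(value, "0{}b".format(bits))

def simhash(weighted_features, bits=64):
    """Steps (2)-(5) for a {feature: weight} mapping produced by step (1)."""
    pairs = [(word_hash(word, bits), weight)
             for word, weight in weighted_features.items()]
    return fingerprint(signed_column_sums(pairs))

def hamming_distance(a, b):
    """Step (6): number of positions where the two fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))

def is_similar(a, b, threshold_m):
    """Similar when the number of differing bits does not exceed M."""
    return hamming_distance(a, b) <= threshold_m

# The 6-bit worked example from the text.
example = [("010111", 5), ("000101", 3), ("100111", 1)]
totals = signed_column_sums(example)
print(totals, fingerprint(totals))
```

Because the fingerprint depends only on the hashes of the surface words, two different words with the same meaning contribute unrelated bit patterns, which is the limitation discussed next.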

This text similarity detection method is conceptually simple: it first extracts keywords from each text and then measures the similarity of the two texts from how frequently the words occur. Its drawback is equally clear: it cannot recognize near-synonyms or synonyms. In practice, words with similar meanings, such as "compensation" and "indemnity", are treated as two different words, so two texts that are in fact similar are judged dissimilar. Measuring the similarity between two texts accurately therefore also requires semantic analysis.

When text similarity detection is performed, keywords must first be extracted from the text, and similarity is then detected from how frequently the keywords occur. At present, keyword extraction methods fall mainly into two groups: methods based on statistical features (such as TF-IDF) and methods based on a word graph model (such as the PageRank and TextRank algorithms). Methods based on statistical features extract keywords using statistical information about the words in a text. Methods based on a word graph model first construct a language network graph of the text and then analyze the graph to find the words or phrases that play an important role; these are taken as the keywords of the text. However, current keyword extraction methods share a defect: words that occur rarely in a text but are critical to it are easily ignored. In a resume text, for example, words such as "name", "graduation school" and "work experience" are important words that characterize the text even though they occur infrequently. Existing keyword extraction methods often overlook such words and instead select common words that occur repeatedly.
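The effect described above can be reproduced with a small TF-IDF sketch. It assumes a hypothetical corpus made up only of resumes, already tokenized into English stand-in words; the function names and the data are illustrative and not taken from the patent.

```python
import math
from collections import Counter

def tf_idf_scores(docs, doc_index):
    """Return {token: TF-IDF score} for docs[doc_index]; docs is a list of token lists."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    tokens = docs[doc_index]
    counts = Counter(tokens)
    return {
        token: (count / len(tokens)) * math.log(n_docs / doc_freq[token])
        for token, count in counts.items()
    }

def top_keywords(docs, doc_index, top_n=1):
    """Rank tokens by descending score, breaking ties alphabetically."""
    scores = tf_idf_scores(docs, doc_index)
    ranked = sorted(scores, key=lambda token: (-scores[token], token))
    return ranked[:top_n]

# Hypothetical tokenized resumes: "name" and "school" appear once in each.
resumes = [
    ["name", "python", "python", "python", "school"],
    ["name", "java", "java", "school", "sales"],
    ["name", "design", "design", "design", "school"],
]
print(top_keywords(resumes, 0))
```

In this toy corpus the field labels "name" and "school" occur in every document, so their inverse document frequency is zero and they never reach the top of the ranking, even though they are exactly the words that mark a text as a resume.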

In conclusion, the current text similarity detection method does not detect the text similarity based on semantics, and the detection accuracy is not high; and the accuracy of the existing keyword extraction method needs to be improved.

With the rapid development of the internet and big data, text data such as office documents, e-mails, research reports, and laws and regulations has become the main form of enterprise data. How to classify this rapidly growing text data effectively is one of the major challenges enterprises face. When texts are too long, detecting text similarity directly with the methods above is unsuitable, for two main reasons: the many common words in the text interfere with similarity detection and lower its accuracy; and the amount of computation is large, since texts that are too numerous or too long may not fit into the model in a single pass, which reduces efficiency.

Disclosure of Invention

The invention aims to provide a text classification method and system based on entity extraction, which are efficient and more accurate.

The invention adopts the following technical scheme:

The text classification method based on entity extraction comprises:

customizing keywords for each text category, and then executing the following steps:

S100: performing character-based entity labeling of the customized keywords in texts of known categories to obtain training samples;

S200: training the keyword extraction model for entity recognition using the training samples;

the keyword extraction model comprises an ALBERT layer, a BILSTM layer and a CRF layer; when a text is input, the keyword extraction model splits the text into single characters and feeds them to the ALBERT layer; the ALBERT layer represents each character as a vector that incorporates contextual semantic information and passes the vectors to the BILSTM layer; the BILSTM layer calculates the probability that each character belongs to each label; the CRF layer outputs, for each character, the label with the maximum probability value;

S300: extracting keywords from the text to be classified using the trained keyword extraction model;

S400: using a similarity detection model to calculate the similarity between the extracted keywords and the customized keywords of each text category; the similarity detection model is an ALBERT model;

S500: taking the maximum similarity and comparing it with a preset similarity threshold; when the maximum similarity is not less than the similarity threshold, classifying the text to be classified into the text category corresponding to the maximum similarity; otherwise, determining that the text to be classified does not belong to any existing text category; the similarity threshold is an empirical value.

In some embodiments, step S400 further comprises:

S410: converting the extracted keywords and the customized keywords of each text category into word vectors using an ALBERT model;

S420: splicing the word vectors of the extracted keywords into a keyword vector, and splicing the word vectors of the customized keywords of each text category into a customized keyword vector for that category;

S430: calculating the similarity between the keyword vector and the customized keyword vector of each text category.

In some embodiments, the similarity is cosine similarity in step S430.

The text classification system based on entity extraction comprises the following modules:

an entity labeling module, configured to perform character-based entity labeling of the customized keywords in texts of known categories to obtain training samples; keywords are customized for each text category before entity labeling;

an entity recognition training module, configured to train the keyword extraction model for entity recognition using the training samples;

the keyword extraction model comprises an ALBERT layer, a BILSTM layer and a CRF layer; when a text is input, the keyword extraction model splits the text into single characters and feeds them to the ALBERT layer; the ALBERT layer represents each character as a vector that incorporates contextual semantic information and passes the vectors to the BILSTM layer; the BILSTM layer calculates the probability that each character belongs to each label; the CRF layer outputs, for each character, the label with the maximum probability value, and keywords are extracted according to the label of each character;

a keyword extraction module, configured to extract keywords from the text to be classified using the trained keyword extraction model;

a similarity calculation module, configured to calculate, using a similarity detection model, the similarity between the extracted keywords and the customized keywords of each text category; the similarity detection model is an ALBERT model;

a classification module, configured to obtain the maximum similarity and compare it with a preset similarity threshold; when the maximum similarity is not less than the similarity threshold, the text to be classified is classified into the text category corresponding to the maximum similarity; otherwise, the text to be classified does not belong to any existing text category; the similarity threshold is an empirical value.

In some embodiments, the similarity calculation module further comprises the following sub-modules:

a word vector conversion sub-module, configured to convert the extracted keywords and the customized keywords of each text category into word vectors using an ALBERT model;

a splicing sub-module, configured to splice the word vectors of the extracted keywords into a keyword vector and to splice the word vectors of the customized keywords of each text category into a customized keyword vector for that category;

a calculation sub-module, configured to calculate the similarity between the keyword vector and the customized keyword vector of each text category.

Compared with the prior art, the invention has the following characteristics and beneficial effects:

(1) The ALBERT model is applied in the keyword extraction model and keywords are abstracted as entities, so that extracting keywords becomes extracting entities, which can improve the accuracy of keyword extraction. Keywords can be extracted without word segmentation, which avoids the interference that inaccurate segmentation causes in text similarity detection. The invention is therefore highly reusable and more accurate.

(2) When the keyword extraction model is trained, the pre-trained ALBERT model is used, so training requires only a small number of labeled samples, which significantly reduces the manual labeling workload.

(3) The method overcomes the current inability to recognize near-synonyms and synonyms, achieves higher similarity detection accuracy, remains applicable to similarity detection when texts are numerous or very long, and can be applied to the automatic classification of text data in enterprise data assets.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a schematic diagram of the principle of the combined ALBERT + BILSTM + CRF model of the present invention.

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings. It is to be understood that the specific embodiments described are merely a few examples of the invention and not all examples. All other embodiments, which can be derived by a person skilled in the art from the described embodiments without inventive step, are within the scope of protection of the invention.

The invention relates to a text classification method and system based on entity extraction. The underlying idea is as follows: a keyword extraction model is built on the combined ALBERT + BILSTM + CRF model and trained with texts of known categories, and the trained model extracts keywords from the text to be classified; a similarity detection model is built with an ALBERT model and converts the extracted keywords into the corresponding word vectors; the similarity between the word vectors of the text to be classified and the keywords of each known text category is then calculated, and the text to be classified is classified on the basis of these similarities. In specific applications, the text categories include contract texts, patent texts, financial reports, web page content, resume texts, legal texts and the like, and can be customized according to the actual situation.

The keyword extraction model mainly comprises an ALBERT layer, a BILSTM layer and a CRF layer, implemented with an ALBERT model, a BILSTM model and a CRF model respectively, all of which are existing models. The ALBERT model is a Transformer-based bidirectional encoder representation model that can represent characters and words as vectors. The BILSTM model is a bidirectional long short-term memory recurrent network model, used here for classification. The CRF model is a conditional random field, used here to add constraints.

The ALBERT model performs well on entity recognition. The invention introduces the ALBERT model into the keyword extraction model and abstracts keywords as entities to be extracted, so that extracting keywords is extracting entities. Keywords are customized for each type of text. For example, for resume texts, "name", "graduation school", "work experience" and the like can be customized as keywords; for legal texts, "plaintiff", "defendant" and the like can be customized as keywords. The extracted keywords are therefore representative of the different text types and reflect the text category more accurately.

Conventionally, a text must be segmented into words before keywords are extracted, so the accuracy of the segmentation strongly affects the keyword extraction result. The ALBERT model, however, can be trained on characters and is not constrained by word boundaries, so the keyword extraction model needs no word segmentation and avoids the interference that inaccurate segmentation causes in text similarity detection.

In the invention, the detailed flow of extracting the keywords by using the keyword extraction model is as follows:

(1) Perform entity labeling of the keywords that appear in texts of known categories to obtain the training samples.

The keywords used here are the customized keywords defined for each text category; these customized keywords are what the subsequent keyword extraction step extracts. Because the model is trained on characters, entity labeling is also character-based. The labels comprise a non-keyword label, a keyword start-character label and a keyword subsequent-character label. For example, a sentence in a legal text that apparently means "Zhou Hua plans to divorce the respondent" is labeled character by character (each Chinese character glossed in English) as "[Zhou O, Hua O, plan O, and O, answer B-W, debate I-W, person I-W, divorce B-W, marriage I-W]". Here O marks content that does not need to be identified, i.e., the non-keyword label; W denotes a proper-noun entity of the legal text, i.e., a customized keyword; B marks the beginning of an entity, so B-W is the label of the first character of a keyword; I marks the continuation of an entity, so I-W is the label of each subsequent character of a keyword. The characters glossed "answer", "debate" and "person" together form the keyword "respondent" and are labeled B-W, I-W, I-W. Likewise, "divorce" is a proper-noun entity denoted by W, so its first character (glossed "divorce") is labeled B-W and its second character (glossed "marriage") is labeled I-W.

(2) And performing entity recognition training on the keyword extraction model by using the training samples.

(3) And extracting entities, namely keywords, from the text to be classified by using the trained keyword extraction model.
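The character-based labeling of step (1) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `label_characters` is hypothetical, keywords are matched by plain substring search, and the sample sentence is a made-up one in the spirit of the legal-text example rather than the sentence used in the embodiment.

```python
def label_characters(text, keywords, entity="W"):
    """Assign O / B-<entity> / I-<entity> to every character of `text`."""
    labels = ["O"] * len(text)
    # Longer keywords first, so a short keyword never overwrites part of a longer one.
    for keyword in sorted(keywords, key=len, reverse=True):
        if not keyword:
            continue
        start = text.find(keyword)
        while start != -1:
            end = start + len(keyword)
            # Label the span only if no earlier keyword has claimed it.
            if all(label == "O" for label in labels[start:end]):
                labels[start] = "B-" + entity
                for i in range(start + 1, end):
                    labels[i] = "I-" + entity
            start = text.find(keyword, start + 1)
    return list(zip(text, labels))

# Hypothetical sample: the customized keywords are "respondent" and "divorce".
sample = label_characters("周华与答辩人离婚", ["答辩人", "离婚"])
for character, label in sample:
    print(character, label)
```

Each (character, label) pair produced this way corresponds to one position of a training sample for the entity recognition training in step (2).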

Referring to FIG. 2, a schematic diagram of the keyword extraction model is shown. The keyword extraction model comprises an ALBERT layer, a BILSTM layer and a CRF layer. A text is input to the keyword extraction model as single characters W₁, W₂, W₃, …, Wₙ; in this embodiment the maximum length of each piece of input data is 128. The ALBERT layer initializes the characters as vectors. During model training, the ALBERT layer iteratively updates these vectors and finally outputs vector representations that incorporate the contextual semantic information of each character. The vectors E₁, E₂, E₃, …, Eₙ output by the ALBERT layer serve as the input of the BILSTM layer. The BILSTM layer outputs the probability that each character belongs to each tag. The output of the BILSTM layer has the shape [batch_size, num_steps, num_tags], where batch_size is the number of sentences per training batch, num_steps is the length of each training sentence, and num_tags is the number of tags used in labeling. Finally, the CRF layer outputs, for each character, the tag with the maximum probability, denoted TAG₁, TAG₂, TAG₃, …, TAGₙ.

For example, if the tags labeled in the training samples include O, B-W and I-W, 16 sentences are trained per batch, and the maximum length of each sentence is 128, then batch_size = 16 and num_steps = 128; num_tags is the number of labeled tags, that is, num_tags = 5.

Without the CRF layer, if the tag with the maximum probability among the 5 tags were output directly from the BILSTM layer for each character, logically invalid sequences such as [B-W, O, I-W, O, I-W] could appear. For the example sentence above, restoring the text according to the sequence [B-W, O, I-W, O, I-W] would not recover the intended keywords. The CRF layer adds constraints, for example that only "I-W" may follow "B-W" and no other tag may appear there, which ensures that the final prediction conforms to language logic.
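The role of the constraints can be illustrated with a small decoder. A real CRF layer learns transition scores during training; the sketch below swaps in hard constraints and a plain Viterbi search instead, uses only three tags for brevity, and takes made-up per-character scores. The constraint shown (I-W may only follow B-W or I-W) is the usual validity rule for this labeling scheme and is an assumption, as are all function names.

```python
TAGS = ["O", "B-W", "I-W"]

def allowed(prev_tag, tag):
    """I-W may only continue a keyword opened by B-W or I-W (prev_tag None = start)."""
    if tag == "I-W":
        return prev_tag in ("B-W", "I-W")
    return True

def greedy_decode(scores):
    """Pick the highest-scoring tag for each character independently."""
    return [TAGS[max(range(len(TAGS)), key=row.__getitem__)] for row in scores]

def constrained_decode(scores):
    """Viterbi search for the best-scoring tag path that respects `allowed`."""
    if not scores:
        return []
    neg_inf = float("-inf")
    n = len(TAGS)
    best = [[neg_inf] * n for _ in scores]  # best[t][j]: best valid path ending in tag j
    back = [[0] * n for _ in scores]        # back[t][j]: previous tag on that path
    for j, tag in enumerate(TAGS):
        if allowed(None, tag):
            best[0][j] = scores[0][j]
    for t in range(1, len(scores)):
        for j, tag in enumerate(TAGS):
            for i, prev in enumerate(TAGS):
                if best[t - 1][i] == neg_inf or not allowed(prev, tag):
                    continue
                candidate = best[t - 1][i] + scores[t][j]
                if candidate > best[t][j]:
                    best[t][j] = candidate
                    back[t][j] = i
    j = max(range(n), key=lambda k: best[-1][k])
    path = [j]
    for t in range(len(scores) - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    return [TAGS[k] for k in reversed(path)]

# Made-up scores for three characters; columns follow the order of TAGS.
scores = [
    [0.1, 0.8, 0.1],
    [0.6, 0.1, 0.3],
    [0.2, 0.1, 0.7],
]
print(greedy_decode(scores))
print(constrained_decode(scores))
```

Taking the per-character maximum yields a sequence in which I-W follows O, whereas the constrained search returns a sequence that forms one complete keyword.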

The keyword extraction model is trained as follows: in each iteration, the CRF layer outputs a label for each character in the training sample, each output label is compared with the annotated label of that character, and the accuracy is calculated. If the accuracy has not reached the preset accuracy threshold, iterative training continues until it does, at which point training ends. The trained keyword extraction model can assign a label to each character in the text to be classified, and keywords can be extracted based on these labels.
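Two small helpers make the last two points concrete: computing the label accuracy used as the stopping criterion, and turning per-character labels back into keywords. Both are sketches with hypothetical names; how stray I- labels without a preceding B- label are handled (they are ignored here) is an assumption.

```python
def tag_accuracy(predicted, annotated):
    """Fraction of characters whose predicted label equals the annotated label."""
    correct = sum(p == a for p, a in zip(predicted, annotated))
    return correct / len(annotated)

def extract_keywords(characters, tags):
    """Collect each B-... character and the I-... characters that follow it."""
    keywords, current = [], []
    for character, tag in zip(characters, tags):
        if tag.startswith("B-"):
            if current:
                keywords.append("".join(current))
            current = [character]
        elif tag.startswith("I-") and current:
            current.append(character)
        else:
            # An O label, or an I- label with no open keyword, closes the span.
            if current:
                keywords.append("".join(current))
            current = []
    if current:
        keywords.append("".join(current))
    return keywords

tags = ["O", "O", "O", "B-W", "I-W", "I-W", "B-W", "I-W"]
print(extract_keywords("周华与答辩人离婚", tags))
print(tag_accuracy(["O", "B-W", "O"], ["O", "B-W", "I-W"]))
```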

During keyword extraction, the ALBERT model is used to represent characters as vectors. During text similarity calculation, the ALBERT model is used to convert the extracted keywords into the corresponding word vectors.

The similarity detection model adopts an ALBERT model, and the detailed flow of text similarity is as follows:

(4) Convert the extracted keywords and the customized keywords of each text category into word vectors using an ALBERT model.

Suppose the keywords extracted from the text to be classified are ["contract", "Party A", "Party B", ...]. The ALBERT model converts them into the corresponding word vectors: {"contract": [9.79439989e-02, -3.78220007e-02, 3.56911987e-01, ...], "Party A": [3.87089998e-02, -7.86560029e-02, -6.02289997e-02, ...], "Party B": [2.89862990e-01, -9.67399962e-03, -1.52429000e-01, ...]}. Each keyword's word vector has 300 dimensions.

(5) Splice the word vectors of the keywords extracted from the text to be classified into a keyword vector, and splice the word vectors of the customized keywords of each text category into that category's customized keyword vector.

Based on the above example, the word vectors of the keywords ["contract", "Party A", "Party B", ...] extracted from the text to be classified are spliced into [9.79439989e-02, -3.78220007e-02, 3.56911987e-01, ..., 3.87089998e-02, -7.86560029e-02, -6.02289997e-02, ..., 2.89862990e-01, -9.67399962e-03, -1.52429000e-01, ...].

(6) Calculate the cosine similarity between the keyword vector of the text to be classified and the customized keyword vector of each text category, as given by formula (1):

$$\cos\theta = \frac{V1 \cdot V2}{\lVert V1 \rVert \, \lVert V2 \rVert} \tag{1}$$

In formula (1), V1 and V2 are respectively the keyword vector of the text to be classified and a customized keyword vector; cos θ is the cosine similarity of V1 and V2.
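Steps (5) and (6) can be sketched with plain Python lists. The three-dimensional vectors below are made-up stand-ins for the 300-dimensional word vectors, and the function names are hypothetical. The text does not say how keyword vectors of different lengths are compared, so this sketch simply assumes both spliced vectors have the same length and raises an error otherwise.

```python
import math

def splice(word_vectors):
    """Step (5): concatenate the word vectors of several keywords into one vector."""
    return [value for vector in word_vectors for value in vector]

def cosine_similarity(v1, v2):
    """Step (6): cos(theta) = (V1 . V2) / (|V1| |V2|)."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length (assumption of this sketch)")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm1 * norm2)

# Made-up 3-dimensional word vectors for two extracted and two customized keywords.
extracted = splice([[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]])
customized = splice([[0.1, -0.1, 0.3], [0.1, 0.4, -0.1]])
print(cosine_similarity(extracted, customized))
```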

The text classification method based on entity extraction is realized based on the keyword extraction and text similarity detection method, and comprises the following specific steps:

Firstly, keywords are defined for each text category.

Texts of each category contain representative words that reflect the category. For example, "name", "graduation school", "work experience" and the like are representative words of resume texts, and "plaintiff", "defendant" and the like are representative words of legal texts. The representative words of each category are customized as the keywords of that category. Thus "name", "graduation school", "work experience" and the like can be customized as keywords of resume texts, and "plaintiff", "defendant" and the like as keywords of legal texts.

Secondly, the keyword extraction model is trained with the training samples, and the trained model is used to extract the keywords of the text to be classified. The principle and specific implementation of this step are detailed in steps (1) to (3) and are not repeated here.

Thirdly, a similarity detection model is used to detect the similarity between the keywords extracted from the text to be classified and the customized keywords of each text category. The principle and specific implementation of this step are detailed in steps (4) to (6) and are not repeated here.

Fourthly, the maximum similarity is obtained, the maximum similarity is compared with a preset similarity threshold, and when the maximum similarity is not smaller than the similarity threshold, the text to be classified is classified into the text category corresponding to the maximum similarity; otherwise, the text to be classified does not belong to any one of the current existing text categories. The similarity threshold is an empirical value, and can be adjusted for multiple times according to actual application scenarios to obtain an optimal value.
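The decision rule of the fourth step is short enough to state directly in code. The function name `classify`, the category names and the threshold value 0.8 are all hypothetical; the text only says that the threshold is an empirical value.

```python
def classify(similarities, threshold):
    """Return the category with the maximum similarity, or None if it is below the threshold.

    similarities maps each text category to the similarity computed for it.
    """
    if not similarities:
        return None
    best_category = max(similarities, key=similarities.get)
    # "Not less than" the threshold: equality still assigns the category.
    if similarities[best_category] >= threshold:
        return best_category
    return None

print(classify({"contract": 0.91, "resume": 0.40, "legal": 0.55}, threshold=0.8))
print(classify({"contract": 0.50, "resume": 0.40, "legal": 0.55}, threshold=0.8))
```

Returning None corresponds to the case in which the text to be classified does not belong to any existing text category.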

It will be understood by those skilled in the art that all or part of the steps of the above method may be implemented by hardware related to instructions of a program, the program may be stored in a computer readable storage medium, and when executed, the program comprises the following steps: (steps of the method), said storage medium, such as: ROM/RAM, magnetic disk, optical disk, etc.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. The text classification method based on entity extraction is characterized by comprising the following steps:

2. The method for text classification based on entity extraction as claimed in claim 1, wherein:

step S400 further includes:

3. The method of text classification based on entity extraction as claimed in claim 2, characterized by:

in step S430, the similarity is cosine similarity.

4. The text classification system based on entity extraction is characterized by comprising the following steps:

5. The entity extraction based text classification system of claim 4, wherein:

the similarity calculation module further comprises a sub-module: