CN116108160A

CN116108160A - Policy matching system based on NLP text multi-label classification

Info

Publication number: CN116108160A
Application number: CN202310088645.4A
Authority: CN
Inventors: 徐立群; 李正; 郭海涛
Original assignee: Anhui Zhiyuxin Information Technology Co ltd
Current assignee: Anhui Zhiyuxin Information Technology Co ltd
Priority date: 2023-02-09
Filing date: 2023-02-09
Publication date: 2023-05-12

Abstract

The invention provides a policy matching system based on NLP text multi-label classification, comprising a data acquisition module, a label mining module, a text classification module, an enterprise portrait module and a policy recommendation module. The advantages of the invention are that policy text labels can be mined more promptly and quickly, policy texts reach users sooner, and the text classifier performs well.

Description

Policy matching system based on NLP text multi-label classification

Technical Field

The invention relates to the technical field of information acquisition systems, in particular to a policy matching system based on NLP text multi-label classification.

Background

Originally, entrepreneurs had to obtain up-to-date policy information from official government websites, news media and similar channels, which made it difficult to ensure that the information reached them promptly. Later, policy information retrieval systems were developed based on algorithms such as TF-IDF and BM25. These algorithms compute similarity from word-frequency statistics, so their recall is low, it is difficult to find all relevant information, and the results are correspondingly poor. Subsequently, many condition-based and tag-based policy matching schemes emerged. Condition-based matching schemes suffer from problems such as missing enterprise data and an excessive number of policy-condition dimensions; tag-based matching schemes suffer from problems such as tags not being updated in time and mismatches between policy tags and enterprise tags.

The current mainstream workflow for label-based policy-enterprise matching is as follows:

1. policies are read manually, and a policy tag system is formulated according to the needs of enterprises;

2. the policy texts are labeled manually;

3. a text classifier is trained using a text classification algorithm, and newly acquired policies are labeled by the classifier;

4. all policies and labels are stored in a database, and a label retrieval system is built so that users can search by the labels they are interested in.

This technical scheme has the following problems:

1. the policy tag system is formulated entirely by hand, so it is difficult to keep it up to date;

2. the policy texts are labeled entirely by hand, so the labeling cost is very high;

3. if there are too many tags, it is difficult for users to retrieve policies by tag.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a policy matching system based on NLP text multi-label classification, in which policy text labels can be mined more promptly and quickly, policy texts reach users sooner, and the text classifier performs well.

To solve the above technical problems, the technical scheme provided by the invention is as follows: a policy matching system based on NLP text multi-label classification comprises a data acquisition module, a label mining module, a text classification module, an enterprise portrait module and a policy recommendation module;

the data acquisition module acquires a large volume of policy information texts from government information websites in different regions and performs basic data processing on the texts;

the label mining module uses the TextRank algorithm to extract keywords from the processed policy text data; the extracted keywords are then cleaned manually to retain those that users care about and that can be used for recommendation; the retained keywords are grouped and organized into a policy label system, and a mapping between keywords and labels is established for subsequent use; for newly acquired policies, keywords are likewise extracted from the text, which facilitates the mining of new labels;

the text classification module classifies text with two classifiers: a classifier C1 designed according to the mapping between keywords and labels, and a text classifier C2 based on the BERT neural network; the C1 classifier is used to pre-label the policy texts, and the resulting labels are then corrected manually, so that a large amount of high-precision labeled data can be obtained quickly; the C2 classifier adopts a BERT+softmax neural network architecture and is evaluated with macro-F1; for newly mined labels, classification is first performed by the C1 classifier, and once a certain amount of data has accumulated, it is labeled manually and the C2 classifier is retrained and updated, so that the two modules are organically combined;

the enterprise portrait module is used for constructing enterprise portrait labels for the subsequent policy recommendation module;

and the policy recommendation module combines the policy text labels and the enterprise portrait labels to make recommendations.
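The keyword-based classifier C1 described above can be illustrated with a minimal sketch. The source does not specify how C1 applies the keyword-to-label mapping, so simple substring matching is assumed here; the mapping entries and the function names `c1_classify` and `unmapped_keywords` are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical keyword-to-label mapping. The real mapping comes from the
# manually curated label system and is not given in the source.
KEYWORD_TO_LABEL: Dict[str, str] = {
    "tax reduction": "tax incentive",
    "tax exemption": "tax incentive",
    "subsidy": "financial subsidy",
    "talent introduction": "talent policy",
}


def c1_classify(text: str, mapping: Dict[str, str] = KEYWORD_TO_LABEL) -> List[str]:
    """Return every label whose keyword occurs in the text (multi-label)."""
    lowered = text.lower()
    labels: Set[str] = {label for keyword, label in mapping.items() if keyword in lowered}
    return sorted(labels)


def unmapped_keywords(keywords: List[str],
                      mapping: Dict[str, str] = KEYWORD_TO_LABEL) -> List[str]:
    """Keywords that have no label yet, i.e. candidates for newly mined labels."""
    return [keyword for keyword in keywords if keyword not in mapping]


if __name__ == "__main__":
    print(c1_classify("Notice on tax reduction and subsidy for small enterprises"))
    print(unmapped_keywords(["subsidy", "green energy"]))
```

Pre-labels produced this way are only a starting point: as the text states, they are corrected manually before being used to train C2.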

Further, the basic data processing performed on the text by the data acquisition module comprises:

1) Information unrelated to the policy text, including HTML tags, JavaScript code, navigation bar information and the like, is removed using regular expressions, and the text is then deduplicated using the SimHash algorithm;

2) Compressed archive files contained in the website are parsed using intelligent document processing technology, and text information is extracted from Word documents, Excel documents and PDF files;

3) The pictures are deduplicated using pHash (perceptual hash algorithm), and the text in the website pictures is then extracted using OCR technology, preserving useful text information.
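Step 1) above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the regular expressions, the use of character 3-grams hashed with MD5, and the Hamming-distance threshold of 3 are all assumptions, and the function names are hypothetical.

```python
import hashlib
import re
from typing import List


def strip_html(raw: str) -> str:
    """Drop script/style/nav blocks and remaining tags, then collapse whitespace."""
    text = re.sub(r"(?is)<(script|style|nav)\b.*?</\1\s*>", " ", raw)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def simhash(text: str, bits: int = 64) -> int:
    """SimHash fingerprint over character 3-grams (bits must be <= 64 here)."""
    grams = [text[i:i + 3] for i in range(max(len(text) - 2, 1))]
    weights = [0] * bits
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(texts: List[str], max_distance: int = 3) -> List[str]:
    """Keep a text only if its fingerprint is far from every text kept so far."""
    kept: List[str] = []
    fingerprints: List[int] = []
    for text in texts:
        fp = simhash(text)
        if all(hamming(fp, other) > max_distance for other in fingerprints):
            kept.append(text)
            fingerprints.append(fp)
    return kept
```

The pairwise comparison in `deduplicate` is quadratic; for a corpus of the size mentioned in the embodiment, fingerprints would normally be bucketed by bit segments so that only candidates sharing a segment are compared.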

Further, the policy recommendation module combines the policy text labels and the enterprise portrait labels to make recommendations through the following steps:

1) The policy text labels are first converted into vectors using one-hot encoding, and an index, denoted I, is then built over all policy vectors using Faiss (Facebook AI Similarity Search);

2) A relevance score between each policy text label and each enterprise portrait label is determined by expert scoring, and a weight matrix W for enterprise-to-policy label conversion is constructed;

3) For each enterprise, its labels are converted into a vector using one-hot encoding, and the vector is then linearly transformed by the matrix W to obtain the final enterprise vector;

4) The enterprise vector is used to search the index I, and the policy texts with the highest similarity are retrieved and returned to the user as recommendations.

Compared with the prior art, the invention has the advantages that:

1. text labels can be more timely and rapidly mined;

2. the policy text can be more quickly analyzed;

3. the effect of the text classifier is improved.

Drawings

FIG. 1 is a system architecture diagram of the present invention.

FIG. 2 is a flow chart of the operation of the data acquisition module of the present invention.

FIG. 3 is a flowchart of the operation of the tag mining module of the present invention.

FIG. 4 is a flowchart of the operation of the text classification module of the present invention.

FIG. 5 is a flow chart of the operation of the enterprise portrait module of the present invention.

FIG. 6 is a flowchart illustrating the operation of the policy recommendation module of the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples.

Examples

The system comprises a data acquisition module, a label mining module, a text classification module, an enterprise portrait module and a policy recommendation module.

The data acquisition module is mainly used for acquiring a large volume of policy information texts from government information websites in different regions and performing basic data processing on them, as follows:

1. information irrelevant to the policy text, including HTML tags, JavaScript code, navigation bar information and the like, is removed using regular expressions, and the text is then deduplicated using the SimHash algorithm;

2. compressed archive files contained in the website are parsed using intelligent document processing technology, and text information is extracted from Word documents, Excel documents and PDF files;

3. the pictures are deduplicated using pHash (perceptual hash algorithm), and the text in the website pictures is then extracted using OCR technology, preserving useful text information.

After cleaning, about 2 million policy texts remain.
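The picture deduplication in step 3 can be illustrated with one common variant of perceptual hashing: take the low-frequency block of a 2D discrete cosine transform and threshold it at its median. The sketch below assumes the image has already been converted to a square grayscale grid (for example 32 x 32), since decoding and resizing need an image library; the unnormalized DCT, the distance threshold of 5 and all function names are assumptions, not details from the source.

```python
import math
from typing import List


def low_frequency_dct(pixels: List[List[float]], size: int = 8) -> List[List[float]]:
    """Top-left size x size block of an unnormalized 2D DCT-II of a square image."""
    n = len(pixels)
    basis = [[math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
             for u in range(size)]
    # The DCT is separable: transform along rows first, then along columns.
    by_row = [[sum(row[x] * basis[u][x] for x in range(n)) for u in range(size)]
              for row in pixels]
    return [[sum(by_row[y][u] * basis[v][y] for y in range(n)) for u in range(size)]
            for v in range(size)]


def phash(pixels: List[List[float]]) -> int:
    """63-bit fingerprint: low-frequency coefficients (DC term dropped) vs. their median."""
    coeffs = [c for row in low_frequency_dct(pixels) for c in row][1:]
    median = sorted(coeffs)[len(coeffs) // 2]
    return sum(1 << i for i, c in enumerate(coeffs) if c > median)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate_images(images: List[List[List[float]]], max_distance: int = 5) -> List[int]:
    """Return the indices of images whose fingerprint differs from all kept ones."""
    kept: List[int] = []
    fingerprints: List[int] = []
    for i, image in enumerate(images):
        fp = phash(image)
        if all(hamming(fp, other) > max_distance for other in fingerprints):
            kept.append(i)
            fingerprints.append(fp)
    return kept
```

Because the fingerprint depends only on how the coefficients compare with their median, an image and a uniformly brightness-scaled copy of it hash to the same value, which is the kind of near-duplicate this step is meant to catch before OCR.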

Label mining module: for the cleaned policy text data, the TextRank algorithm is used to extract keywords from the text. The extracted keywords are then cleaned manually to retain those that users care about and that can be used for recommendation. Next, the keywords are grouped and organized into a policy label system, and a mapping between keywords and labels is established for subsequent use. For newly acquired policies, keywords can still be extracted from the text, which facilitates the mining of new labels.
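As a rough illustration of the keyword extraction step, the following is a minimal TextRank sketch over a pre-tokenized text: words that co-occur within a sliding window are linked in an undirected graph, and a PageRank-style score is iterated over that graph. The window size, damping factor and iteration count are assumed values, and the input is assumed to be already segmented into words and filtered for stop words (Chinese policy text would need a word segmenter first, which is outside this sketch).

```python
from collections import defaultdict
from typing import Dict, List, Set


def textrank_keywords(tokens: List[str], window: int = 3, top_k: int = 5,
                      damping: float = 0.85, iterations: int = 50) -> List[str]:
    """Rank words by a PageRank-style score on their co-occurrence graph."""
    # Link each word to the (window - 1) words that follow it.
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically for reproducibility.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


if __name__ == "__main__":
    sample = "enterprise tax subsidy enterprise talent subsidy enterprise loan".split()
    print(textrank_keywords(sample, window=2, top_k=2))
```

The ranked words are only candidates; in the described workflow they are still cleaned and grouped manually before they become labels.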

Text classification module: text is classified by two classifiers. One is a classifier C1 designed according to the mapping between keywords and labels; the other is a text classifier C2 based on the currently prevailing BERT neural network. First, the policy texts are pre-labeled with the C1 classifier, and these labels are then corrected manually. In this way, a large amount of high-precision labeled data can be obtained quickly. At the same time, the data are screened according to the pre-assigned labels to ensure that every label category has enough training data. In the end, approximately 250,000 texts were manually calibrated. Once the training data were ready, training of the C2 text classifier began. The classifier adopts a BERT+softmax neural network architecture and is evaluated with macro-F1. The resulting F1 score fluctuated between 88% and 90%. Compared with conventional RNN neural networks, the BERT model achieves good results because of its superior pre-training tasks. To further improve the classifier, the BERT pre-trained model was fine-tuned on the 1.75 million unlabeled policy texts so that it better fits policy text. As a result, the F1 score on the classification task rose to between 93% and 94%. Finally, the remaining policy texts are labeled using the C2 classifier. For newly mined labels, classification is first performed with the C1 classifier; once a certain amount of data has accumulated, it is labeled manually and the C2 classifier is retrained and updated. In this way, the two modules are organically combined.
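For reference, macro-F1 is the unweighted mean of the per-label F1 scores, where each label's F1 is 2TP / (2TP + FP + FN). The sketch below computes it for multi-label predictions represented as label sets; the function name and the convention of scoring a label with no true or predicted instances as 0 are assumptions of this example.

```python
from typing import List, Set


def macro_f1(y_true: List[Set[str]], y_pred: List[Set[str]], labels: List[str]) -> float:
    """Unweighted mean of per-label F1 over multi-label predictions."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # A label that never occurs and is never predicted scores 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    truth = [{"tax"}, {"tax", "talent"}, {"talent"}]
    predicted = [{"tax"}, {"tax"}, {"tax", "talent"}]
    print(macro_f1(truth, predicted, ["tax", "talent"]))
```

Because every label contributes equally regardless of frequency, macro-F1 penalizes a classifier that does well only on common labels, which fits the text's concern that each label category has enough training data.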

Enterprise portrait module: the main function of this module is to construct enterprise portrait labels for the subsequent policy recommendation module. Portrait labels are constructed for companies using rules derived from the business requirements of policy recommendation, such as high and new technology enterprise status, enterprise scale, regular-enterprise status, industry and the like. Through this module, all enterprises can be given a preliminary identification, which can also be used to judge whether the output of the recommendation system is reasonable. However, because the module depends on a large number of enterprise portrait labels, an enterprise's portrait labels may be inaccurate or incomplete, so users can revise and complete their enterprise's portrait labels themselves.

Policy recommendation module: once the policy text labels and the enterprise portrait labels are available, this module combines the two label systems to make recommendations. The specific steps are as follows:

1. the policy text labels are first converted into vectors using one-hot encoding, and an index, denoted I, is then built over all policy vectors using Faiss (Facebook AI Similarity Search);

2. a relevance score between each policy text label and each enterprise portrait label is determined by expert scoring, and a weight matrix W for enterprise-to-policy label conversion is constructed;

3. for each enterprise, its labels are likewise converted into a vector using one-hot encoding, and the vector is then linearly transformed by the matrix W to obtain the final enterprise vector;

4. the enterprise vector is used to search the index I, and the policy texts with the highest similarity are retrieved and returned to the user as recommendations.
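The recommendation steps above can be sketched end to end in plain Python. To keep the example self-contained, a brute-force inner-product search over a dictionary stands in for the Faiss index I; the source does not state which similarity measure is used, so inner product is an assumption. The label vocabularies, the values in W and all function names are hypothetical, and because a policy or enterprise may carry several labels, the encoding is written as a multi-hot vector.

```python
from typing import Dict, List, Sequence

# Hypothetical label vocabularies (not from the source).
POLICY_LABELS = ["tax incentive", "financial subsidy", "talent policy"]
ENTERPRISE_LABELS = ["high-tech enterprise", "small enterprise"]

# W[i][j]: expert-assigned relevance of enterprise label i to policy label j.
# The values are made up for illustration.
W = [
    [0.9, 0.3, 0.8],
    [0.4, 0.9, 0.1],
]


def multi_hot(labels: Sequence[str], vocabulary: List[str]) -> List[float]:
    """1.0 at the position of every label present, 0.0 elsewhere."""
    return [1.0 if entry in labels else 0.0 for entry in vocabulary]


def build_index(policies: Dict[str, Sequence[str]]) -> Dict[str, List[float]]:
    """Stand-in for the index I: policy id -> policy label vector."""
    return {pid: multi_hot(labels, POLICY_LABELS) for pid, labels in policies.items()}


def enterprise_vector(labels: Sequence[str]) -> List[float]:
    """Encode the enterprise labels, then map them into policy-label space with W."""
    e = multi_hot(labels, ENTERPRISE_LABELS)
    return [sum(e[i] * W[i][j] for i in range(len(e))) for j in range(len(POLICY_LABELS))]


def recommend(index: Dict[str, List[float]], query: List[float], top_k: int = 2) -> List[str]:
    """Return the ids of the top_k policies by inner product with the query."""
    def score(vector: List[float]) -> float:
        return sum(a * b for a, b in zip(vector, query))
    return sorted(index, key=lambda pid: (-score(index[pid]), pid))[:top_k]


if __name__ == "__main__":
    index = build_index({
        "P1": ["tax incentive"],
        "P2": ["financial subsidy"],
        "P3": ["talent policy", "tax incentive"],
    })
    print(recommend(index, enterprise_vector(["high-tech enterprise"])))
```

With a real Faiss index, the dictionary and the sort would be replaced by adding the policy vectors to an index and querying it with the enterprise vector; the surrounding encoding and the W transformation stay the same.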

The five modules are combined together to form a complete multi-tag policy matching scheme.

The invention and its embodiments have been described above in a non-limiting manner, and the actual construction is not limited to the embodiments shown in the drawings. In summary, any structure or embodiment similar to this technical solution that a person of ordinary skill in the art, informed by this disclosure, devises without creative effort and without departing from the gist of the invention shall fall within the scope of protection of the invention.

Claims

1. A policy matching system based on NLP text multi-label classification, characterized in that: the system comprises a data acquisition module, a label mining module, a text classification module, an enterprise portrait module and a policy recommendation module;

the label mining module uses the TextRank algorithm to extract keywords from the processed policy text data; the extracted keywords are then cleaned manually to retain those that users care about and that can be used for recommendation; the retained keywords are grouped and organized into a policy label system, and a mapping between keywords and labels is established for subsequent use; for newly acquired policies, keywords are likewise extracted from the text, which facilitates the mining of new labels;

the text classification module classifies text with two classifiers: a classifier C1 designed according to the mapping between keywords and labels, and a text classifier C2 based on the BERT neural network; the C1 classifier is used to pre-label the policy texts, and the resulting labels are then corrected manually, so that a large amount of high-precision labeled data can be obtained quickly; the C2 classifier adopts a BERT+softmax neural network architecture and is evaluated with macro-F1; for newly mined labels, classification is first performed by the C1 classifier, and once a certain amount of data has accumulated, it is labeled manually and the C2 classifier is retrained and updated, so that the two modules are organically combined;

2. The policy matching system based on NLP text multi-label classification according to claim 1, wherein the basic data processing performed on the text by the data acquisition module comprises:

3) The pictures are deduplicated using pHash (perceptual hash algorithm), and the text in the website pictures is then extracted using OCR technology, preserving useful text information.

3. The policy matching system based on NLP text multi-label classification according to claim 1, wherein the policy recommendation module combines the policy text labels and the enterprise portrait labels to make recommendations through the following steps:

1) The policy text labels are first converted into vectors using one-hot encoding, and an index, denoted I, is then built over all policy vectors using Faiss (Facebook AI Similarity Search);