CN116108160A - Policy matching system based on NLP text multi-label classification - Google Patents

Policy matching system based on NLP text multi-label classification

Info

Publication number
CN116108160A
Authority
CN
China
Prior art keywords
text
policy
label
module
enterprise
Prior art date
Legal status
Pending
Application number
CN202310088645.4A
Other languages
Chinese (zh)
Inventor
徐立群
李正
郭海涛
Current Assignee
Anhui Zhiyuxin Information Technology Co ltd
Original Assignee
Anhui Zhiyuxin Information Technology Co ltd
Priority date
2023-02-09
Filing date
2023-02-09
Publication date
2023-05-12
Application filed by Anhui Zhiyuxin Information Technology Co ltd
Priority to CN202310088645.4A
Publication of CN116108160A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a policy matching system based on NLP text multi-label classification, comprising a data acquisition module, a label mining module, a text classification module, an enterprise portrait module and a policy recommendation module. The advantages of the invention are that policy text labels can be mined in a more timely and rapid manner, policy texts reach users more quickly, and the text classifier performs well.

Description

Policy matching system based on NLP text multi-label classification
Technical Field
The invention relates to the technical field of information acquisition systems, in particular to a policy matching system based on NLP text multi-label classification.
Background
Originally, entrepreneurs had to obtain up-to-date policy information from official government websites, news media and similar channels, which made it difficult to ensure that the information reached them promptly. Later, policy information retrieval systems were developed based on algorithms such as TF-IDF and BM25. These algorithms compute similarity from word-frequency statistics, so recall is low, it is difficult to find all relevant information, and the results are correspondingly poor. Subsequently, many condition-based and label-based policy matching schemes emerged. Condition-based matching schemes suffer from missing enterprise data and an excessive number of policy-condition dimensions; label-based matching schemes suffer from labels that are not updated in time and from mismatches between policy labels and enterprise labels.
The current mainstream workflow for label-based policy-enterprise matching is as follows:
1. policies are read manually, and a policy label system is formulated according to the needs of enterprises;
2. the policy texts are labeled manually;
3. a text classifier is trained with a text classification algorithm, and newly acquired policies are labeled by the classifier;
4. all policies and labels are stored in a database, and a label retrieval system is built so that users can search by the labels they are interested in.
This technical scheme has the following problems:
1. the policy label system is formulated entirely by hand, so timely updating is difficult to ensure;
2. the policy texts are labeled entirely by hand, so the labeling cost is very high;
3. if there are too many labels, it is difficult for users to retrieve policies by label.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a policy matching system based on NLP text multi-label classification, with which policy text labels can be mined in a more timely and rapid manner, policy texts reach users more quickly, and the text classifier performs well.
To solve the above technical problems, the invention provides the following technical scheme: a policy matching system based on NLP text multi-label classification, comprising a data acquisition module, a label mining module, a text classification module, an enterprise portrait module and a policy recommendation module;
the data acquisition module acquires a large volume of policy information texts from local government information websites and performs basic data processing on the texts;
the label mining module extracts keywords from the processed policy texts using the TextRank algorithm; the extracted keywords are then cleaned manually to select those that users care about and that can be used for recommendation; the selected keywords are grouped and organized into a policy label system, and a mapping between keywords and labels is established for subsequent use; for newly acquired policies, keywords are likewise extracted from the text, which facilitates the mining of new labels;
the text classification module classifies text in two ways: one is a classifier C1 designed according to the mapping between keywords and labels; the other is a text classifier C2 based on the BERT neural network; the C1 classifier is first used to pre-label the policy texts, and the labels are then corrected manually, so that a large amount of high-precision labeled data can be obtained quickly; the text classifier C2 adopts a BERT+softmax neural network architecture and uses macro-F1 as the evaluation metric; newly mined labels are first classified with the C1 classifier, and once a certain amount of data has accumulated, it is labeled manually and the C2 classifier is retrained and updated, so that the two modules are organically combined;
the enterprise portrait module constructs enterprise portrait labels for use by the subsequent policy recommendation module;
and the policy recommendation module makes recommendations by combining the policy text labels with the enterprise portrait labels.
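By way of non-limiting illustration, a classifier of the C1 kind can be sketched as a lookup over the keyword-to-label mapping. The mapping entries and the function name `c1_classify` below are hypothetical; the disclosure does not give the actual keywords, labels or matching rule, and plain substring matching is assumed here.

```python
# Hypothetical keyword -> label mapping; the real mapping is curated manually.
KEYWORD_TO_LABEL = {
    "tax reduction": "tax incentives",
    "fee reduction": "tax incentives",
    "talent introduction": "talent support",
    "research subsidy": "innovation funding",
}


def c1_classify(text, keyword_to_label=KEYWORD_TO_LABEL):
    """Pre-label a policy text: return every label whose keyword occurs in it.

    Substring matching is an assumption made for this sketch; a text can
    receive several labels, which is what makes the task multi-label.
    """
    lowered = text.lower()
    labels = {label for keyword, label in keyword_to_label.items() if keyword in lowered}
    return sorted(labels)


if __name__ == "__main__":
    print(c1_classify("Notice on tax reduction and research subsidy for small firms"))
```

Labels produced in this way are only pre-labels; as described above, they are corrected manually before being used to train C2.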
Further, the basic data processing performed on the text by the data acquisition module comprises:
1) removing information unrelated to the policy text, including HTML tags, JavaScript code and navigation bar information, using regular expressions, and then deduplicating the text using the simhash algorithm;
2) parsing the compressed archive files contained in the website using intelligent document processing technology, and extracting the text information in Word documents, Excel documents and PDF files;
3) deduplicating the pictures using pHash (a perceptual hash algorithm), and then extracting the text in the website pictures using OCR technology, retaining the useful text information.
Further, the policy recommendation module making recommendations by combining the policy text labels with the enterprise portrait labels comprises:
1) first converting the policy text labels into vectors using one-hot encoding, and then building an index over all policy vectors using Faiss (Facebook AI Similarity Search), denoted I;
2) determining a relevance score between each policy text label and each enterprise portrait label by an expert scoring method, and constructing a weight matrix W for enterprise-to-policy label conversion;
3) for each enterprise, converting its labels into a vector using one-hot encoding, and then applying a linear transformation with the matrix W to obtain the final enterprise vector;
4) searching the index I with the enterprise vector, retrieving the policy texts with the highest similarity, and returning them to the user as recommendations.
Compared with the prior art, the invention has the advantages that:
1. text labels can be more timely and rapidly mined;
2. the policy text can be more quickly analyzed;
3. the effect of the text classifier is improved.
Drawings
FIG. 1 is a system architecture diagram of the present invention.
FIG. 2 is a flow chart of the operation of the data acquisition module of the present invention.
FIG. 3 is a flowchart of the operation of the tag mining module of the present invention.
FIG. 4 is a flowchart of the operation of the text classification module of the present invention.
FIG. 5 is a flowchart of the operation of the enterprise portrait module of the present invention.
FIG. 6 is a flowchart illustrating the operation of the policy recommendation module of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples.
Examples
The system comprises a data acquisition module, a label mining module, a text classification module, an enterprise portrait module and a policy recommendation module.
The data acquisition module mainly acquires a large volume of policy information texts from local government information websites and performs basic data processing on them, comprising the following steps:
1. removing information unrelated to the policy text, including HTML tags, JavaScript code and navigation bar information, using regular expressions, and then deduplicating the text using the simhash algorithm;
2. parsing the compressed archive files contained in the website using intelligent document processing technology, and extracting the text information in Word documents, Excel documents and PDF files;
3. deduplicating the pictures using pHash (a perceptual hash algorithm), and then extracting the text in the website pictures using OCR technology, retaining the useful text information.
After cleaning, about 2 million policy texts remain.
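Step 1 above can be illustrated with a minimal sketch, given here only as an example under stated assumptions. The regular expressions, the 64-bit fingerprint built from MD5, whitespace tokenization (Chinese text would need a word segmenter) and the Hamming-distance threshold of 3 are all choices made for this sketch, not details taken from the embodiment.

```python
import hashlib
import re


def strip_markup(html: str) -> str:
    """Drop script/style blocks and remaining HTML tags, then collapse whitespace."""
    html = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", html)
    html = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", html).strip()


def simhash(text: str) -> int:
    """64-bit SimHash over whitespace tokens (tokenization is an assumption)."""
    weights = [0] * 64
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the token hash
        for i in range(64):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(texts, max_distance=3):
    """Keep a text only if its fingerprint is far from every kept fingerprint."""
    kept, fingerprints = [], []
    for text in texts:
        fp = simhash(text)
        if all(hamming(fp, other) > max_distance for other in fingerprints):
            kept.append(text)
            fingerprints.append(fp)
    return kept
```

The pairwise comparison in `deduplicate` is quadratic and is used only to keep the sketch short; at the scale of millions of texts a bucketed lookup over fingerprint segments would normally be preferred.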
The label mining module extracts keywords from the cleaned policy text data using the TextRank algorithm. The extracted keywords are then cleaned manually to select those that users care about and that can be used for recommendation. Next, the keywords are grouped and organized into a policy label system, and a mapping between keywords and labels is established for subsequent use. For newly acquired policies, keywords can still be extracted from the text, which facilitates the mining of new labels.
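The keyword extraction step can be illustrated with a compact, unweighted TextRank sketch. It assumes the text has already been segmented into tokens and filtered for stop words; the window size, damping factor and iteration count are illustrative defaults rather than values from the embodiment, and the function name `textrank_keywords` is hypothetical.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_k=5, window=2, damping=0.85, iterations=50):
    """Rank tokens by TextRank score on an undirected co-occurrence graph."""
    # Connect each token to the tokens that follow it within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the update: score(v) = (1 - d) + d * sum(score(u) / degree(u)).
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


if __name__ == "__main__":
    sample = "subsidy enterprise subsidy technology subsidy tax subsidy talent".split()
    print(textrank_keywords(sample, top_k=3))
```

The ranked keywords are only candidates; as described above, they are cleaned and grouped manually before they become labels.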
The text classification module classifies text in two ways: one is a classifier C1 designed according to the mapping between keywords and labels; the other is a text classifier C2 based on the currently prevalent BERT neural network. First, the policy texts are pre-labeled with the C1 classifier, and the labels are then corrected manually. In this way, a large amount of high-precision labeled data can be obtained quickly. At the same time, the data is screened according to the pre-assigned labels to ensure that each label category has enough training data. In the end, approximately 250,000 records were manually calibrated. Once the training data was available, training of the C2 text classifier began. The classifier adopts a BERT+softmax neural network architecture and uses macro-F1 as the evaluation metric. The resulting F1 score fluctuated between 88% and 90%. Compared with a conventional RNN, the BERT model achieves good results because of its effective pre-training tasks. To further improve the classifier, the BERT pre-trained model was fine-tuned on the 1.75 million unlabeled policy texts so that it better fits policy text. As a result, the F1 score on the classification task rose to between 93% and 94%. Finally, the remaining policy texts are labeled with the C2 classifier. Newly mined labels are first classified with the C1 classifier; once a certain amount of data has accumulated, it is labeled manually and the C2 classifier is retrained and updated. In this way, the two modules are organically combined.
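For reference, the macro-F1 metric mentioned above averages the per-label F1 scores with equal weight, so rare labels count as much as frequent ones. The following sketch shows one way to compute it for multi-label predictions; the function name `macro_f1` and the sample labels are hypothetical, and the embodiment does not state how its scores were computed.

```python
def macro_f1(true_labels, predicted_labels):
    """Macro-averaged F1 over all labels seen in the truth or the predictions.

    Both arguments are lists of label sets, one set per document.
    """
    labels = sorted(set().union(*true_labels, *predicted_labels))
    f1_scores = []
    for label in labels:
        pairs = list(zip(true_labels, predicted_labels))
        tp = sum(label in t and label in p for t, p in pairs)
        fp = sum(label not in t and label in p for t, p in pairs)
        fn = sum(label in t and label not in p for t, p in pairs)
        denominator = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0


if __name__ == "__main__":
    truth = [{"tax", "talent"}, {"tax"}, {"funding"}]
    predicted = [{"tax"}, {"tax", "talent"}, {"funding"}]
    # "tax" and "funding" are predicted perfectly, "talent" is never matched.
    print(macro_f1(truth, predicted))
```

In this example two labels score 1.0 and one scores 0.0, which illustrates how a single poorly predicted label lowers the macro average regardless of how often it occurs.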
The main function of the enterprise portrait module is to construct enterprise portrait labels for use by the subsequent policy recommendation module. According to the business requirements of policy recommendation, portrait labels are constructed for companies using rules, for example high-tech enterprise identification, enterprise scale identification, regular enterprise identification, industry and the like. Through this module, all enterprises can be given a preliminary identification, which can also be used to judge the reasonableness of the recommendation system. However, because the module relies on a large number of enterprise portrait labels, an enterprise's portrait labels may be inaccurate or incomplete, so users can modify and complete their own enterprise's portrait labels.
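A rule-based portrait builder of this kind might look like the following sketch. Everything specific in it is hypothetical: the record fields (`high_tech_certified`, `employees`, `industry`), the 100-employee threshold and the label names are invented for illustration, since the embodiment does not disclose its actual rules.

```python
def build_portrait_labels(enterprise):
    """Derive portrait labels from an enterprise record using fixed rules."""
    labels = set()
    if enterprise.get("high_tech_certified"):
        labels.add("high-tech enterprise")
    # Hypothetical scale rule: the 100-employee cut-off is illustrative only.
    if enterprise.get("employees", 0) < 100:
        labels.add("small enterprise")
    else:
        labels.add("medium or large enterprise")
    if enterprise.get("industry"):
        labels.add("industry:" + enterprise["industry"])
    return labels


def apply_user_edits(labels, add=(), remove=()):
    """Let a user correct rule-derived labels by adding or removing entries."""
    return (set(labels) | set(add)) - set(remove)


if __name__ == "__main__":
    record = {"high_tech_certified": True, "employees": 40, "industry": "software"}
    derived = build_portrait_labels(record)
    print(sorted(apply_user_edits(derived, add={"exporter"}, remove={"small enterprise"})))
```

Keeping user edits as a separate step mirrors the description above: the rules give every enterprise a preliminary portrait, and users then correct or complete it.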
The policy recommendation module addresses how to combine the two label systems to make recommendations once the policy text labels and the enterprise portrait labels are available. The specific steps are as follows:
first, the policy text labels are converted into vectors using one-hot encoding, and an index over all policy vectors, denoted I, is built using Faiss (Facebook AI Similarity Search);
a relevance score between each policy text label and each enterprise portrait label is determined by an expert scoring method, and a weight matrix W for enterprise-to-policy label conversion is constructed;
for each enterprise, the labels are likewise converted into a vector using one-hot encoding, and the vector is then linearly transformed with the matrix W to obtain the final enterprise vector;
the index I is searched with the enterprise vector, and the policy texts with the highest similarity are retrieved and returned to the user as recommendations.
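The four steps can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the embodiment itself: the label vocabularies, the scores in W and the policy identifiers are made up, inner product is assumed as the similarity measure (the description does not name one), and a brute-force scan stands in for the Faiss index I.

```python
# Hypothetical label vocabularies for the two label systems.
POLICY_LABELS = ["tax incentives", "talent support", "innovation funding"]
ENTERPRISE_LABELS = ["high-tech enterprise", "small enterprise"]

# W[i][j]: expert relevance score between enterprise label i and policy
# label j. The values are invented for this example.
W = [
    [0.2, 0.6, 1.0],  # high-tech enterprise
    [1.0, 0.1, 0.3],  # small enterprise
]


def one_hot(labels, vocabulary):
    """Encode a label set as a 0/1 vector over the vocabulary (multi-label)."""
    return [1.0 if name in labels else 0.0 for name in vocabulary]


def enterprise_vector(labels):
    """Project one-hot enterprise labels into policy-label space using W."""
    e = one_hot(labels, ENTERPRISE_LABELS)
    return [sum(e[i] * W[i][j] for i in range(len(W))) for j in range(len(POLICY_LABELS))]


def search(index, query, top_k=2):
    """Brute-force inner-product search, standing in for the Faiss index I."""
    scored = [(sum(q * p for q, p in zip(query, vec)), pid) for pid, vec in index.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [pid for _, pid in scored[:top_k]]


# Hypothetical policies and their text labels; index_I maps policy id -> vector.
policies = {
    "P1": {"tax incentives"},
    "P2": {"innovation funding", "talent support"},
    "P3": {"talent support"},
}
index_I = {pid: one_hot(labels, POLICY_LABELS) for pid, labels in policies.items()}

if __name__ == "__main__":
    print(search(index_I, enterprise_vector({"high-tech enterprise"})))
```

With Faiss, the same vectors would be stored as float32 arrays in an index such as `faiss.IndexFlatIP` and queried with its `search` method; the brute-force scan above is used only so that the sketch has no dependencies.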
The five modules are combined together to form a complete multi-tag policy matching scheme.
The invention and its embodiments have been described above in a non-limiting manner, and the actual construction is not limited to the embodiments shown in the drawings. In summary, if a person of ordinary skill in the art, informed by this disclosure and without departing from the gist of the invention, devises structures and embodiments similar to this technical solution without creative effort, they shall fall within the scope of protection of the invention.

Claims (3)

1. A policy matching system based on NLP text multi-label classification, characterized in that: the system comprises a data acquisition module, a label mining module, a text classification module, an enterprise portrait module and a policy recommendation module;
the data acquisition module acquires a large volume of policy information texts from local government information websites and performs basic data processing on the texts;
the label mining module extracts keywords from the processed policy texts using the TextRank algorithm; the extracted keywords are then cleaned manually to select those that users care about and that can be used for recommendation; the selected keywords are grouped and organized into a policy label system, and a mapping between keywords and labels is established for subsequent use; for newly acquired policies, keywords are likewise extracted from the text, which facilitates the mining of new labels;
the text classification module classifies text in two ways: one is a classifier C1 designed according to the mapping between keywords and labels; the other is a text classifier C2 based on the BERT neural network; the C1 classifier is first used to pre-label the policy texts, and the labels are then corrected manually, so that a large amount of high-precision labeled data can be obtained quickly; the text classifier C2 adopts a BERT+softmax neural network architecture and uses macro-F1 as the evaluation metric; newly mined labels are first classified with the C1 classifier, and once a certain amount of data has accumulated, it is labeled manually and the C2 classifier is retrained and updated, so that the two modules are organically combined;
the enterprise portrait module constructs enterprise portrait labels for use by the subsequent policy recommendation module;
and the policy recommendation module makes recommendations by combining the policy text labels with the enterprise portrait labels.
2. The policy matching system based on NLP text multi-label classification of claim 1, wherein the basic data processing performed on the text by the data acquisition module comprises:
1) removing information unrelated to the policy text, including HTML tags, JavaScript code and navigation bar information, using regular expressions, and then deduplicating the text using the simhash algorithm;
2) parsing the compressed archive files contained in the website using intelligent document processing technology, and extracting the text information in Word documents, Excel documents and PDF files;
3) deduplicating the pictures using pHash (a perceptual hash algorithm), and then extracting the text in the website pictures using OCR technology, retaining the useful text information.
3. The policy matching system based on NLP text multi-label classification of claim 1, wherein the policy recommendation module making recommendations by combining the policy text labels with the enterprise portrait labels comprises:
1) first converting the policy text labels into vectors using one-hot encoding, and then building an index over all policy vectors using Faiss (Facebook AI Similarity Search), denoted I;
2) determining a relevance score between each policy text label and each enterprise portrait label by an expert scoring method, and constructing a weight matrix W for enterprise-to-policy label conversion;
3) for each enterprise, converting its labels into a vector using one-hot encoding, and then applying a linear transformation with the matrix W to obtain the final enterprise vector;
4) searching the index I with the enterprise vector, retrieving the policy texts with the highest similarity, and returning them to the user as recommendations.
CN202310088645.4A 2023-02-09 2023-02-09 Policy matching system based on NLP text multi-label classification Pending CN116108160A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310088645.4A CN116108160A (en) 2023-02-09 2023-02-09 Policy matching system based on NLP text multi-label classification

Publications (1)

Publication Number Publication Date
CN116108160A (en) 2023-05-12

Family

ID=86259470

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination