CN110609983A - Structured decomposition method for policy file

Info

Publication number: CN110609983A
Application number: CN201910766729.2A
Authority: CN
Inventors: 金耀初; 何卫灵; 刘华; 张宏辉
Original assignee: Guangzhou Liko Technology Co Ltd
Current assignee: Guangzhou Liko Technology Co Ltd
Priority date: 2019-08-19
Filing date: 2019-08-19
Publication date: 2019-12-24
Anticipated expiration: 2039-08-19
Also published as: CN110609983B

Abstract

The invention relates to the technical field of natural language processing, and in particular to a structured decomposition method for policy files, comprising the following steps: step S1: obtaining a corpus set; step S2: preprocessing the corpus; step S3: constructing a discourse structure tree; step S4: constructing a policy condition tree; step S5: constructing a new structure tree from the discourse structure tree and the policy condition tree, and visualizing it. Through corpus preprocessing, part-of-speech analysis and syntactic analysis, the scheme enables policy documents to be understood accurately.

Description

Structured decomposition method for policy file

Technical Field

The invention relates to the technical field of natural language processing, in particular to a structured decomposition method for policy files.

Background

Natural language refers to the languages people use in daily life, such as Chinese, English and French. It evolved with the development of human society rather than being artificially designed, and it is an important tool for human study and life. In general, natural language is a convention of human society, as distinguished from artificial languages such as programming languages.

Natural Language Processing (NLP) refers to the use of computers to process the form, sound and meaning of natural language, that is, to input, output, recognize, analyze, understand and generate characters, words, sentences and discourse. Concrete applications of natural language processing include machine translation, text summarization, text classification, text proofreading, information extraction, speech synthesis and speech recognition. In short, natural language processing uses a computer to handle natural language, and it involves two processes: natural language understanding and natural language generation.

With the development of information technology and the popularization of the internet, big data, cloud computing and artificial intelligence have become hot topics in academia. Natural language processing is one of the most difficult problems in artificial intelligence; realizing information exchange between humans and machines, and intelligently screening and processing massive data, are key technical challenges for artificial intelligence, computer science and linguistics. Because human languages are particular and complex, it is difficult for machines to understand them, and machine understanding of Chinese is considerably more complex than that of English. How to make machines analyze Chinese better has therefore become an unavoidable problem in the field of artificial intelligence.

Currently, data falls into three categories: unstructured data, semi-structured data, and structured data. Structured data is easy to reason about because its entities are clearly separated; semi-structured data has some structure, so extracting entities from it is practical; unstructured data makes entity extraction difficult because its structure is uncertain. An entity generally refers to a noun phrase or verb phrase in text that has a particular meaning or strong reference, typically a person's name, place name, organization name, time, proper noun, and the like. A policy file is unstructured data, and because of its unstructured form the relationships within its content are increasingly complicated, so that it is difficult for a machine to understand and easy for an enterprise or individual to overlook or misread. In policy implementation, the importance of the policy document is self-evident: only when national policy is conveyed accurately, so that people clearly know and fully understand the intent of the policy, the methods and steps for implementing it, and the specific measures involved, will people implement it actively and effectively. Manual interpretation and annotation of policy documents is costly, its efficiency and quality are hard to improve, and it does not support downstream intelligent applications such as intelligent question answering, sentiment analysis and knowledge graph construction. Therefore, a method for accurately understanding policy documents is needed.

Disclosure of Invention

In order to solve the above problems, the present invention provides a structured decomposition method for policy documents, which can accurately understand the policy documents.

The technical scheme adopted by the invention is as follows:

a structured decomposition method of a policy file, comprising:

step S1: obtaining a corpus set;

step S2: preprocessing a corpus;

step S3: constructing a discourse structure tree;

step S4: constructing a policy condition tree;

step S5: constructing a new structure tree from the discourse structure tree and the policy condition tree, and visualizing it.

The present solution is a specific implementation of one branch of natural language processing. Step S1 obtains a corpus set: a corpus item is the basic unit of a corpus and takes the form of text, and the collection of corpus items is the corpus set. Step S2 preprocesses the corpus: noise is removed to obtain the required text content, which is then preliminarily analyzed and labeled so that it is easy for a machine to read and understand, providing the basis for subsequent natural language processing. Step S3 constructs the discourse structure tree: based on the preliminary analysis from the preceding steps, tree nodes are established at the title level and associated with one another. Step S4 constructs the policy condition tree: the analysis is deepened to understand the text content within the tree nodes, nodes at the content level are established according to the policy conditions, and these nodes are associated with one another. Finally, step S5 constructs a new structure tree from the discourse structure tree and the policy condition tree and visualizes it, establishing new associations that make the relationships between policy terms more concrete and easier to understand. Through corpus preprocessing, part-of-speech analysis and syntactic analysis, the scheme enables policy documents to be understood accurately.

Further, the step S1 includes:

step S1.1: selecting a webpage from a government policy website;

step S1.2: defining the webpage as document, and traversing the document to acquire text data;

step S1.3: establishing an element group set according to the acquired text data;

the element group set is: element = (tuple_1, tuple_2, ..., tuple_n), tuple_i = (tag_i, data_i), i = 1, 2, ..., n, where n is the number of element groups, i denotes the element group number, tag_i indicates the html tag in the i-th element group, and data_i indicates the html content in the i-th element group.

Because this scheme processes policy files, the corpus is obtained by crawling government websites. A web page contains not only text but also other information such as image links. The web page is therefore treated as a document and all of its data converted to text; the converted page is traversed to obtain all of its text data, and an element group set, element, is established to store the data. The converted page contains html tags, comments and the like in addition to the text content, and the tags carry information such as text style. Each element group stores the html tag and the html content separately to simplify subsequent reading and parsing.
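As an illustration only, the element group set can be sketched with Python's standard `html.parser` module. The class name `ElementCollector`, the helper `build_element_set`, the `VOID_TAGS` list and the sample page are assumptions made for this sketch and do not come from the patent; note also that this parser skips html comments by default, whereas the text above leaves them for the cleaning step.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (assumed, partial list).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class ElementCollector(HTMLParser):
    """Collects (tag, data) pairs: each text run with its enclosing tag."""

    def __init__(self):
        super().__init__()
        self._stack = []     # currently open tags
        self.elements = []   # the element group set

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed tags.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            tag = self._stack[-1] if self._stack else ""
            self.elements.append((tag, text))


def build_element_set(html):
    """Traverse the document and return the list of (tag_i, data_i) tuples."""
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    return collector.elements


sample = "<h1>Notice</h1><p>Enterprises may apply.<img src='a.png'></p><!-- ad -->"
elements = build_element_set(sample)
```

Keeping the tag next to each text run means later steps can use the tag (for example, a heading tag) without re-parsing the page.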

Further, the step S2 includes:

step S2.1: cleaning the corpus;

step S2.2: performing word segmentation on the cleaned corpus;

step S2.3: performing part-of-speech tagging on the corpus after word segmentation.

After the corpus is obtained, it inevitably contains unneeded information and must be filtered: useless content such as advertisements, irrelevant links and html comments is deleted, and the useful text is extracted. The text is then segmented into words according to meaning, and each word is labeled with its part-of-speech tag.
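The cleaning part of this step might look like the following minimal sketch, which works on (tag, data) element groups. The noise tag list and the noise pattern are hypothetical placeholders; a real deployment would tune them to the target website. Word segmentation and part-of-speech tagging require a Chinese segmenter and are not shown.

```python
import re

# Assumed noise rules: text inside these tags is treated as non-content.
NOISE_TAGS = {"script", "style", "a"}
# Assumed noise phrases, purely illustrative.
NOISE_PATTERN = re.compile(r"advertisement|click here|share", re.IGNORECASE)


def clean(elements):
    """Drop noisy element groups and normalize whitespace in the rest."""
    cleaned = []
    for tag, data in elements:
        if tag in NOISE_TAGS or NOISE_PATTERN.search(data):
            continue
        text = re.sub(r"\s+", " ", data).strip()
        if text:
            cleaned.append((tag, text))
    return cleaned


raw = [
    ("p", "  Enterprises   may apply. "),
    ("a", "Home"),
    ("div", "Advertisement: buy now"),
    ("p", "   "),
]
cleaned = clean(raw)
```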

Further, the tag set used in step S2.3 is that of the People's Daily annotated corpus.

Specifically, the part-of-speech tags are taken from the People's Daily annotated corpus. Because the corpus being processed consists of policy files, the People's Daily tag set is more accurate than other tag sets.

Further, the step S3 includes:

step S3.1: writing a regular expression for describing each level of title style;

step S3.2: establishing a title template set according to the regular expressions;

step S3.3: matching the title template set against the element group set; if the text content in an element group conforms to a regular expression, executing step S3.4, otherwise executing step S3.5;

step S3.4: building a new node on the corresponding layer, wherein the node is named as the text content of which the element group conforms to the regular expression, and the element group corresponding to the text content is stored in the node;

step S3.5: merging the element groups into the nearest node element group;

step S3.6: associating all nodes to form a structure tree;

the node hierarchy of the structure tree is the corresponding title hierarchy, and the association between the nodes is the association between the element groups.

Step S2 cleans the corpus by deleting noise, whereas step S3 cleans it by regular expression matching and at the same time extracts titles and content. First, tag_i in each element group stores an html tag; these include title tags, which carry style information. Regular expressions are therefore written for the style of each title level and matched against the content enclosed by the title tags, so that the corresponding titles can be extracted. The title tag also carries title level information, from which the level of a matched title is determined, and the element group of that title is stored in a node. If no regular expression in the title template set matches the content, the element group of that content is merged into the element group of the nearest node. Finally, the structure tree is built from the extracted information: titles serve as node names, title levels as node levels, and associations between element groups as associations between nodes, which yields the discourse structure tree.
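Steps S3.1 to S3.6 can be sketched roughly as follows. The two title templates (Roman-numeral and Arabic-numeral headings) are hypothetical stand-ins for real title styles, and this sketch matches on text only, whereas the method described above also uses the style and level information carried by the title tags. `Node`, `title_level` and `build_structure_tree` are names introduced for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical title template set: one regular expression per title level.
TITLE_TEMPLATES = [
    re.compile(r"^[IVX]+\.\s"),   # level 1, e.g. "I. Application conditions"
    re.compile(r"^\d+\.\s"),      # level 2, e.g. "1. Enterprise conditions"
]


@dataclass
class Node:
    name: str
    level: int
    elements: list = field(default_factory=list)   # element groups in this node
    children: list = field(default_factory=list)


def title_level(text):
    """Return the 1-based level of the first matching template, else None."""
    for level, pattern in enumerate(TITLE_TEMPLATES, start=1):
        if pattern.match(text):
            return level
    return None


def build_structure_tree(elements):
    root = Node("ROOT", 0)
    path = [root]   # path from the root to the most recently created node
    for tag, data in elements:
        level = title_level(data)
        if level is None:
            # S3.5: not a title, merge into the nearest node's element groups.
            path[-1].elements.append((tag, data))
            continue
        # S3.4: a title, create a node on the corresponding level.
        node = Node(data, level, [(tag, data)])
        while path[-1].level >= level:
            path.pop()
        path[-1].children.append(node)   # S3.6: associate with its parent
        path.append(node)
    return root


sample = [
    ("p", "I. Application conditions"),
    ("p", "Enterprises meeting all conditions may apply."),
    ("p", "1. Enterprise conditions"),
    ("p", "Independent legal status."),
    ("p", "II. Support measures"),
]
tree = build_structure_tree(sample)
```

The `path` list plays the role of "the nearest node": body text is always attached to the last title seen, and a new title closes every open node of the same or a deeper level.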

Further, the step S4 includes:

step S4.1: extracting a text area related to policy terms in the tree nodes;

step S4.2: filtering the text in the text area using a part-of-speech combination template;

step S4.3: performing part-of-speech analysis on the filtered text;

step S4.4: extracting policy terms and conditions according to the analysis result, and constructing a policy condition tree according to the policy terms and conditions;

the tree nodes of the policy condition tree correspond to policy terms and policy conditions, and the associations between tree nodes correspond to associations between policy terms or associations between policy terms and policy conditions.

The preceding steps clean the corpus twice, perform word segmentation and part-of-speech tagging, and give a preliminary analysis of the corpus. Step S4 cleans the corpus a third time to reach a deeper understanding. First, the required text regions are extracted according to policy keywords. Next, the text in each region is filtered with a user-defined part-of-speech template; because the corpus has already been segmented and tagged, the template can act as a filter. Finally, on the basis of word segmentation and part-of-speech tagging, part-of-speech analysis is performed on the filtered text to obtain the associations between words and sentences. The policy condition tree is built from the information obtained by analyzing the node contents: its nodes are policy terms and policy conditions, and the relationships between nodes are the associations between policy terms and policy conditions and among the policy conditions themselves.
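The patent does not spell out the form of the part-of-speech template, so the following is only one plausible reading: a template is a short sequence of tags, and a sentence is kept if its tag sequence contains that sequence as a contiguous run. The templates, the tag letters (`n` noun, `v` verb, `a` adjective, `d` adverb) and the sample sentences are assumptions for this sketch; the input is assumed to be already segmented and tagged.

```python
# Hypothetical part-of-speech combination templates.
POS_TEMPLATES = [("v", "n"), ("a", "n")]


def matches_template(tags, template):
    """True if `template` occurs as a contiguous run inside `tags`."""
    k = len(template)
    return any(tuple(tags[i:i + k]) == template
               for i in range(len(tags) - k + 1))


def filter_sentences(tagged_sentences, templates=POS_TEMPLATES):
    """Keep only sentences whose tag sequence matches at least one template."""
    kept = []
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        if any(matches_template(tags, t) for t in templates):
            kept.append(sentence)
    return kept


sentences = [
    [("enterprises", "n"), ("submit", "v"), ("applications", "n")],
    [("hereby", "d"), ("announced", "v")],
]
kept = filter_sentences(sentences)
```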

Further, the step S4.1 includes:

step S4.11: selecting keywords related to policy terms;

step S4.12: writing a regular expression for describing the policy keywords;

step S4.13: matching texts in the tree nodes by using a regular expression;

step S4.14: a text region associated with the keyword is selected from the text.

Specifically, the method for selecting the text area comprises the following steps: establishing keywords to be selected, wherein the keywords are words related to policies in which people are interested; writing a regular expression for describing the key words, matching texts in the tree nodes by using the regular expression, finding the positions of the key words from the texts, and finally selecting a text area near the key words.
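A minimal sketch of steps S4.11 to S4.14, assuming that "a text area near the keyword" means a fixed character window around each match. The `window` parameter, the function name `keyword_regions` and the sample text are illustrative choices, not taken from the patent.

```python
import re


def keyword_regions(text, keywords, window=30):
    """Return (keyword, surrounding text) for every keyword occurrence."""
    if not keywords:
        return []
    # S4.12: one regular expression describing all policy keywords.
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    regions = []
    # S4.13: match the node text; S4.14: cut out the area around each hit.
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        regions.append((match.group(), text[start:end]))
    return regions


text = ("General provisions. Enterprises that meet the application "
        "conditions may apply for funding.")
regions = keyword_regions(text, ["application conditions"], window=10)
```

A sentence-based region (from the previous to the next sentence delimiter) would be an equally reasonable choice; the fixed window is used here only because it is simpler to show.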

Further, the step S4.3 further includes: performing syntactic analysis on the text.

Specifically, syntactic analysis of the text serves two purposes. On the one hand, it relates the text to its context so that the text is understood more deeply, removes ambiguity between words, and verifies the correctness and completeness of the corresponding treebank construction, so that the visualized structure tree has a lower error rate. On the other hand, besides building the structure tree, it can directly serve other upper-layer applications, such as search engine user log analysis, and other natural language processing tasks such as keyword recognition, information extraction, automatic question answering and machine translation. Furthermore, the syntactic analysis does not use deep learning on a labeled data set; it is trained with a traditional unsupervised learning method, which requires no large-scale manual annotation, avoids errors caused by poor-quality labeled data, and saves considerable manpower and cost.

Further, the syntax analysis is a dependency syntax analysis.

Specifically, dependency grammar reveals the syntactic structure of a language unit by analyzing the dependency relationships between its components. That is, it identifies grammatical components in a sentence such as subject, predicate and object, and attributive, adverbial and complement, and analyzes the relationships among them. By analyzing the syntactic structure of a passage and accurately extracting its backbone information, dependency parsing helps in understanding the meaning of the text. With the syntactic structure available, a text can also be translated word by word and the result then reordered and revised according to that structure.
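To make "extracting backbone information" concrete, the sketch below takes a dependency parse that is already available and reads off the subject, predicate and object. No parser is run here: the arcs are written by hand, the relation labels (`root`, `nsubj`, `obj`, `amod`) follow Universal Dependencies naming rather than anything stated in the patent, and `Arc` and `backbone` are hypothetical names.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    head: int    # index of the head token, -1 for the sentence root
    dep: int     # index of the dependent token
    label: str   # dependency relation


def backbone(tokens, arcs):
    """Return (subjects, predicate, objects) of the sentence's main clause."""
    root = next(a.dep for a in arcs if a.label == "root")
    subjects = [tokens[a.dep] for a in arcs
                if a.head == root and a.label == "nsubj"]
    objects = [tokens[a.dep] for a in arcs
               if a.head == root and a.label == "obj"]
    return subjects, tokens[root], objects


tokens = ["Eligible", "enterprises", "submit", "applications"]
arcs = [
    Arc(-1, 2, "root"),    # "submit" is the predicate
    Arc(2, 1, "nsubj"),    # "enterprises" is its subject
    Arc(1, 0, "amod"),     # "Eligible" modifies "enterprises"
    Arc(2, 3, "obj"),      # "applications" is its object
]
result = backbone(tokens, arcs)
```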

Compared with the prior art, the invention has the beneficial effects that:

(1) The policy document is accurately understood through corpus preprocessing, part-of-speech analysis and syntactic analysis.

(2) The syntactic analysis based on unsupervised learning does not need a large amount of manual data labeling, avoids errors caused by poor quality of a labeled data set, and saves a large amount of manpower and financial resources.

(3) The trained corpus may be used for other upper-level applications of natural language processing.

(4) The visual structure tree also visually shows each policy term and the association thereof, which can be easily understood by people.

Drawings

FIG. 1 is a schematic diagram of node construction according to the present invention;

FIG. 2 is a diagram of a policy condition tree according to the present invention;

FIG. 3 is a schematic diagram of a structure tree according to the present invention.

Detailed Description

The drawings are only for purposes of illustration and are not to be construed as limiting the invention. For a better understanding of the following embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

Examples

The embodiment provides a structured decomposition method of a policy file, which comprises the following steps:

step S1: obtaining a corpus set;

step S2: preprocessing a corpus;

step S3: constructing a discourse structure tree;

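step S4: constructing a policy condition tree;

step S5: constructing a new structure tree from the discourse structure tree and the policy condition tree, and visualizing it.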

This embodiment is a specific implementation of one branch of natural language processing. Step S1 obtains a corpus set: a corpus item is the basic unit of a corpus and takes the form of text, and the collection of corpus items is the corpus set. Step S2 preprocesses the corpus: noise is removed to obtain the required text content, which is then preliminarily analyzed and labeled so that it is easy for a machine to read and understand, providing the basis for subsequent natural language processing. Step S3 constructs the discourse structure tree: based on the preliminary analysis from the preceding steps, tree nodes are established at the title level and associated with one another. Step S4 constructs the policy condition tree: the analysis is deepened to understand the text content within the tree nodes, nodes at the content level are established according to the policy conditions, and these nodes are associated with one another. Finally, step S5 constructs a new structure tree from the discourse structure tree and the policy condition tree and visualizes it, establishing new associations that make the relationships between policy terms more concrete and easier to understand. Through corpus preprocessing, part-of-speech analysis and syntactic analysis, the scheme enables policy documents to be understood accurately.

Further, the step S1 includes:

step S1.1: selecting a webpage from a government policy website;

Because this embodiment processes policy files, the corpus is obtained by crawling government websites. A web page contains not only text but also other information such as image links. The web page is therefore treated as a document and all of its data converted to text; the converted page is traversed to obtain all of its text data, and an element group set, element, is established to store the data. The converted page contains html tags, comments and the like in addition to the text content, and the tags carry information such as text style. Each element group stores the html tag and the html content separately to simplify subsequent reading and parsing.

Further, the step S2 includes:

step S2.1: cleaning the corpus;

step S2.2: performing word segmentation on the cleaned corpus;

Further, the tag set used in step S2.3 is that of the People's Daily annotated corpus.

Further, the step S3 includes:

step S3.3: matching the title template set against the element group set; if the text content in an element group conforms to a regular expression, executing step S3.4, otherwise executing step S3.5;

step S3.5: merging the element groups into the nearest node element group;

step S3.6: associating all nodes to form a structure tree;

Step S2 cleans the corpus by deleting noise, whereas step S3 cleans it by regular expression matching and at the same time extracts titles and content. First, tag_i in each element group stores an html tag; these include title tags, which carry style information. Regular expressions are therefore written for the style of each title level and matched against the content enclosed by the title tags, so that the corresponding titles can be extracted. The title tag also carries title level information, from which the level of a matched title is determined, and the element group of that title is stored in a node. If no regular expression in the title template set matches the content, the element group of that content is merged into the element group of the nearest node. Finally, the structure tree is built from the extracted information: titles serve as node names, title levels as node levels, and associations between element groups as associations between nodes, which yields the discourse structure tree.

Fig. 1 is a schematic diagram of node construction according to the present invention. As shown in fig. 1, the node construction process in this embodiment is as follows. A suitable policy webpage is selected and its text is parsed from the HTML. Regular expressions describing the titles are written; these form the title template set S = {S_1, S_2, ..., S_i, S_j, ..., S_n}. Suppose the current layer is h and the regular expression corresponding to the layer-h title is S_i. If a sentence in the text conforms to S_i, a tree node is built in layer h. If the sentence does not conform to S_i but conforms to another template S_j, it is checked whether S_j is a layer-h style: if so, a tree node is built in that layer; if not, a new layer numbered h+1 is split off, and it is checked whether the title style of layer h+1 is S_j. If so, a tree node is built; if not, the splitting and checking steps are repeated. If the content of the last element group does not match any title style, it is merged into the element group of the nearest node.

Further, the step S4 includes:

step S4.1: extracting a text area related to policy terms in the tree nodes;

step S4.3: performing part-of-speech analysis on the filtered text;

step S4.4: extracting policy terms and conditions according to the analysis result, and constructing a policy condition tree according to the policy terms and conditions;

In this embodiment, a specific example of the association between a policy term in a tree node and the policy conditions it requires is as follows:

Applicants meeting all of the following application conditions may apply:

I. Enterprise conditions:

1. the enterprise's place of industrial and commercial registration, tax collection and administration relationship, and statistical relationship are complete;

2. the enterprise is a unit with independent legal-person status, a sound financial system and independent accounting;

FIG. 2 is a schematic diagram of the policy condition tree of the present invention, showing the result of converting the above associations into a tree.

Further, the step S4.1 includes:

step S4.11: selecting keywords related to policy terms;

step S4.12: writing a regular expression for describing the policy keywords;

step S4.13: matching texts in the tree nodes by using a regular expression;

Further, the step S4.3 further includes: performing syntactic analysis on the text.

Further, the syntax analysis is a dependency syntax analysis.

Fig. 3 is a schematic diagram of the structure tree of the present invention. As shown in the figure, the discourse structure tree is combined with the policy condition tree and then visualized, so that the associations are more specific, clearer and easier to understand.
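The combination and visualization step can be illustrated with a small sketch that hangs a policy condition tree under the matching node of a discourse structure tree and renders the result as indented text. Representing a tree as a `(name, children)` tuple, attaching by node name, and the text rendering are all simplifying assumptions for this example; the patent does not prescribe a data structure or a rendering format.

```python
def attach(structure, title, condition_tree):
    """Return a copy of `structure` with `condition_tree` added under `title`."""
    name, children = structure
    new_children = [attach(child, title, condition_tree) for child in children]
    if name == title:
        new_children.append(condition_tree)
    return (name, new_children)


def render(node, prefix=""):
    """Render a (name, children) tree as a list of indented lines."""
    name, children = node
    lines = [prefix + name]
    for child in children:
        lines.extend(render(child, prefix + "  "))
    return lines


# Hypothetical discourse structure tree (title level).
structure = ("Policy", [("Application conditions", []),
                        ("Support measures", [])])
# Hypothetical policy condition tree (content level).
conditions = ("Enterprise conditions", [("Independent legal status", []),
                                        ("Sound financial system", [])])

merged = attach(structure, "Application conditions", conditions)
print("\n".join(render(merged)))
```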

It should be understood that the above-mentioned embodiments of the present invention are only examples for clearly illustrating the technical solutions of the present invention, and are not intended to limit the specific embodiments of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention claims should be included in the protection scope of the present invention claims.

Claims

1. A method for structured decomposition of policy documents, the method comprising:

step S1: obtaining a corpus set;

step S2: preprocessing a corpus;

step S3: constructing a discourse structure tree;

step S4: constructing a policy condition tree;

2. The method for structured decomposition of policy document according to claim 1, wherein said step S1 includes:

step S1.1: selecting a webpage from a government policy website;

3. The method for structured decomposition of policy document according to claim 1, wherein said step S2 includes:

step S2.1: cleaning the corpus;

step S2.2: performing word segmentation on the cleaned corpus;

4. The method of claim 3, wherein the tag set used in step S2.3 is that of the People's Daily annotated corpus.

5. The method for structured decomposition of policy document according to claim 2, wherein said step S3 includes:

step S3.5: merging the element groups into the nearest node element group;

step S3.6: associating all nodes to form a structure tree;

6. The method for structured decomposition of policy document according to claim 5, wherein said step S4 includes:

step S4.1: extracting a text area related to policy terms in the tree nodes;

step S4.3: performing part-of-speech analysis on the filtered text;

7. The method according to claim 6, wherein the step S4.1 includes:

step S4.11: selecting keywords related to policy terms;

step S4.12: writing a regular expression for describing the policy keywords;

step S4.13: matching texts in the tree nodes by using a regular expression;

8. The method according to claim 6, wherein the step S4.3 further comprises: performing syntactic parsing of the text.

9. The method of claim 8, wherein the parsing is dependency parsing.