CN110609983A - Structured decomposition method for policy file - Google Patents

Publication number: CN110609983A
Authority: CN (China)
Prior art keywords: policy, text, tree, corpus, element group
Legal status: Granted, Active
Application number: CN201910766729.2A
Other languages: Chinese (zh)
Other versions: CN110609983B (en)
Inventors: 金耀初, 何卫灵, 刘华, 张宏辉
Current Assignee: Guangzhou Liko Technology Co Ltd
Original Assignee: Guangzhou Liko Technology Co Ltd
Application filed by: Guangzhou Liko Technology Co Ltd
Priority to: CN201910766729.2A
Events: publication of CN110609983A; application granted; publication of CN110609983B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of natural language processing, and in particular to a structured decomposition method for a policy file, comprising: step S1: obtaining a corpus set; step S2: preprocessing the corpus; step S3: constructing a discourse structure tree; step S4: constructing a policy condition tree; step S5: constructing and visualizing a new structure tree from the discourse structure tree and the policy condition tree. Through corpus preprocessing, part-of-speech analysis, and syntactic analysis, the scheme allows a policy file to be understood accurately.

Description

Structured decomposition method for policy file
Technical Field
The invention relates to the technical field of natural language processing, in particular to a structured decomposition method for policy files.
Background
Natural language is the language people use in daily life, such as Chinese, English, or French. It evolved naturally with the development of human society rather than being designed, and it is an important tool for human study and life. The term is generally used in contrast to artificial languages such as programming languages.
Natural language processing (NLP) refers to the use of a computer to process the form, sound, and meaning of natural language, that is, to input, output, recognize, analyze, understand, and generate characters, words, sentences, and discourse. Concrete forms of natural language processing include machine translation, text summarization, text classification, text proofreading, information extraction, speech synthesis, and speech recognition. In short, natural language processing means handling natural language by computer, and it involves two processes: natural language understanding and natural language generation.
Today, with the development of information technology and the spread of the Internet, big data, cloud computing, and artificial intelligence have become hot topics in academia. Natural language processing is one of the hardest problems in artificial intelligence: how to exchange information between humans and machines, and how to screen and process massive data intelligently, are key technical challenges for artificial intelligence, computer science, and linguistics. Because human language is specific and complex, it is difficult for machines to understand. In natural language processing in particular, machine understanding of Chinese is far more complex than machine understanding of English. How to make machines analyze Chinese better is therefore an unavoidable problem in artificial intelligence.
At present, data falls into three categories: unstructured data, semi-structured data, and structured data. Structured data is easy to reason about because its entities are already separated; semi-structured data has some structure, so extracting entities from it is quite feasible; unstructured data has an uncertain structure, which makes entity extraction difficult. An entity generally refers to a noun phrase or verb phrase in the text that has a specific meaning or strong referential force, typically a person's name, place name, organization name, time, or proper noun. A policy file is unstructured data. Because of this form, the relationships among its contents grow increasingly complicated, so the file is hard for a machine to understand and is easily overlooked or misunderstood by enterprises and individuals. The importance of policy files in policy implementation is self-evident: only when national policy is conveyed accurately can people clearly know and fully understand the intention of the policy, the method and steps for implementing it, and the specific measures involved, and only then can they implement it actively and effectively. Manual interpretation and labeling of policy files is costly, its efficiency and quality are hard to improve, and it does not support downstream artificial-intelligence applications such as intelligent question answering, sentiment analysis, and knowledge graph construction. A method for accurately understanding policy files is therefore needed.
Disclosure of Invention
In order to solve the above problems, the present invention provides a structured decomposition method for policy documents, which can accurately understand the policy documents.
The technical scheme adopted by the invention is as follows:
a structured decomposition method of a policy file, comprising:
step S1: obtaining a corpus set;
step S2: preprocessing a corpus;
step S3: constructing a discourse structure tree;
step S4: constructing a policy condition tree;
step S5: constructing and visualizing a new structure tree according to the discourse structure tree and the policy condition tree.
The present scheme implements one branch of natural language processing. Step S1 obtains the corpus set: a corpus is a piece of language material in text form and is the basic unit of a corpus set, and the collection of corpora is the corpus set. Step S2 preprocesses the corpus: noise is removed to obtain the required text content, which is then preliminarily analyzed and labeled so that it is easy for a machine to read and understand, providing the conditions for subsequent natural language processing applications. Step S3 constructs the discourse structure tree: based on the preliminary analysis from the preceding steps, tree nodes are established at the title level and associated with one another. Step S4 constructs the policy condition tree: the analysis is deepened to interpret the text content within the tree nodes, content-level nodes are established according to the policy conditions, and these nodes are associated with one another. Finally, step S5 constructs and visualizes a new structure tree from the discourse structure tree and the policy condition tree, establishing new associations that make the relationships between policy terms more concrete and easier to understand. Through corpus preprocessing, part-of-speech analysis, and syntactic analysis, the scheme allows the policy file to be understood accurately.
Further, the step S1 includes:
step S1.1: selecting a webpage from a government policy website;
step S1.2: defining the webpage as a document, and traversing the document to acquire text data;
step S1.3: establishing an element group set according to the acquired text data;
the element group set is: $element = (tuple_1, tuple_2, \ldots, tuple_n)$, $tuple_i = \{(tag_i, data_i) \mid i = 1, 2, \ldots, n\}$, where $n$ is the number of element groups, $i$ denotes the element group number, $tag_i$ denotes the html tag in the $i$-th element group, and $data_i$ denotes the html content in the $i$-th element group.
Because the scheme processes policy files, the corpus is obtained by capturing corpus information from a government affairs website. A web page contains not only text but also other information such as picture links. The web page is therefore treated as a document and all of its data is converted to text; the converted page is then traversed to obtain all of its text data, and finally the element group set element is established to store the acquired text data. The converted page contains not only text content but also html tags, comments, and the like, and the tags carry information such as text style. Each element group stores the html content and the html tag separately to make subsequent reading and parsing easier.
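As an illustration of step S1, the following is a minimal sketch of building the element group set with Python's standard `html.parser` module. The class name `ElementGroupCollector`, the helper `build_element_groups`, the skipped and void tag lists, and the sample page are all assumptions made for this example; the patent does not specify an implementation. For brevity the sketch ignores html comments during traversal instead of storing them for later cleaning.

```python
from html.parser import HTMLParser


class ElementGroupCollector(HTMLParser):
    """Collect (tag, data) pairs while traversing an HTML document."""

    SKIPPED = {"script", "style"}               # assumption: never policy text
    VOID = {"br", "hr", "img", "meta", "link", "input"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self._open_tags = []   # stack of currently open tags
        self.element = []      # element group set: [(tag_i, data_i), ...]

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self._open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unbalanced markup.
        if tag in self._open_tags:
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._open_tags:
            return
        tag = self._open_tags[-1]
        if tag not in self.SKIPPED:
            self.element.append((tag, text))


def build_element_groups(html):
    """Traverse the document and return the element group set."""
    collector = ElementGroupCollector()
    collector.feed(html)
    collector.close()
    return collector.element


sample = "<html><body><h1>第一章 总则</h1><p>第一条 为了促进企业发展。</p></body></html>"
print(build_element_groups(sample))
```

Keeping the tag next to the text is what later allows title detection to use both the markup and the wording of each element group.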
Further, the step S2 includes:
step S2.1: cleaning the corpus;
step S2.2: performing word segmentation on the cleaned corpus;
step S2.3: performing part-of-speech tagging on the segmented corpus.
After the corpus is obtained, it inevitably contains unneeded information, so it must be filtered to remove useless content such as advertisements, useless links, and html comments. The useful content text is then extracted and segmented into words according to the meaning of the characters and words, and each character or word is labeled with its part-of-speech tag.
Further, the tag set used in step S2.3 is the People's Daily annotated corpus.
Specifically, the part-of-speech tags are taken from the People's Daily annotated corpus. Because the corpus being processed consists of policy files, this corpus gives more accurate tags than other tag sets.
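The three sub-steps of S2 can be sketched as follows. This is a toy illustration under stated assumptions: the four-word `LEXICON` and its tags are hypothetical stand-ins for tags drawn from an annotated corpus, and forward maximum matching is a simple dictionary-based segmentation technique chosen here for brevity, not necessarily the segmenter used by the invention.

```python
import re

# Hypothetical mini-lexicon: word -> part-of-speech tag (n = noun, v = verb).
# A real system would take its tags from an annotated corpus instead.
LEXICON = {
    "企业": "n",
    "申请": "v",
    "补贴": "n",
    "可以": "v",
}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def clean(raw):
    """Step S2.1: drop HTML comments, remaining tags, and whitespace."""
    text = re.sub(r"<!--.*?-->", "", raw, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", "", text)
    return re.sub(r"\s+", "", text)   # assumption: Chinese text, no spaces needed


def segment(text):
    """Step S2.2: forward maximum matching against the lexicon."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no lexicon word matches.
            if size == 1 or candidate in LEXICON:
                words.append(candidate)
                i += size
                break
    return words


def tag(words):
    """Step S2.3: look up each word's tag; 'x' marks unknown words."""
    return [(word, LEXICON.get(word, "x")) for word in words]


raw = "<p>企业可以申请补贴<!-- 广告 --></p>"
print(tag(segment(clean(raw))))
```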
Further, the step S3 includes:
step S3.1: writing a regular expression for describing each level of title style;
step S3.2: establishing a title template set from the regular expressions;
step S3.3: matching the title template set against the element group set; if the text content in an element group conforms to a regular expression, executing step S3.4, otherwise executing step S3.5;
step S3.4: building a new node on the corresponding layer, wherein the node is named after the text content that conforms to the regular expression, and the element group containing that text content is stored in the node;
step S3.5: merging the element group into the element group of the nearest node;
step S3.6: associating all nodes to form a structure tree;
the node hierarchy of the structure tree is the corresponding title hierarchy, and the association between the nodes is the association between the element groups.
Step S2 cleans the corpus by deleting noise, whereas step S3 cleans it by regular expression matching and at the same time extracts titles and content from it. First, the tag_i in each element group stores the html tag, and these tags include title tags that carry style information. Regular expressions are therefore written for the style of each title level and matched against the content marked by the title tags, so that the corresponding titles can be extracted. Second, a title tag also carries title-level information, from which the level of a matched title is known, and the element group corresponding to the title is stored in the node. If no regular expression in the title template set matches the content, the element group corresponding to that content is merged into the element group of the nearest node. Finally, the structure tree is constructed from the extracted information: the titles serve as node names, the title levels as node levels, and the associations between element groups as the associations between nodes, which yields the discourse structure tree.
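Steps S3.1 to S3.6 can be sketched as below. The two regular expressions in `TITLE_TEMPLATES`, the `Node` class, and the sample element groups are hypothetical; the sketch matches on the text of each element group only and assumes each template corresponds to a fixed title level, whereas the method described above also draws on style information in the html tags.

```python
import re

# Hypothetical title template set: one regular expression per title level.
TITLE_TEMPLATES = [
    re.compile(r"^第[一二三四五六七八九十]+章"),   # level 1, e.g. "第一章 总则"
    re.compile(r"^第[一二三四五六七八九十]+条"),   # level 2, e.g. "第一条 ..."
]


class Node:
    def __init__(self, name, level):
        self.name = name
        self.level = level
        self.groups = []     # element groups stored in or merged into this node
        self.children = []


def title_level(text):
    """Return the 1-based title level that `text` matches, or None."""
    for level, pattern in enumerate(TITLE_TEMPLATES, start=1):
        if pattern.match(text):
            return level
    return None


def build_structure_tree(element):
    root = Node("ROOT", 0)
    path = [root]                                # path[-1] is the nearest node
    for group in element:
        _tag, data = group
        level = title_level(data)
        if level is None:
            path[-1].groups.append(group)        # S3.5: merge into nearest node
            continue
        while path[-1].level >= level:           # climb back to the parent
            path.pop()
        node = Node(data, level)                 # S3.4: new node on that level
        node.groups.append(group)
        path[-1].children.append(node)           # S3.6: associate the nodes
        path.append(node)
    return root


element = [
    ("h1", "第一章 总则"),
    ("p", "第一条 适用范围"),
    ("p", "本办法适用于本市企业。"),
    ("h1", "第二章 申报条件"),
]
tree = build_structure_tree(element)
print([child.name for child in tree.children])
```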
Further, the step S4 includes:
step S4.1: extracting a text area related to policy terms in the tree nodes;
step S4.2: filtering the text in the text area using a part-of-speech combination template;
step S4.3: performing part-of-speech analysis on the filtered text;
step S4.4: extracting policy terms and conditions according to the analysis result, and constructing a policy condition tree according to the policy terms and conditions;
the tree nodes of the policy condition tree correspond to policy terms and policy conditions, and the associations between tree nodes correspond to associations between policy terms or associations between policy terms and policy conditions.
The preceding steps clean the corpus twice, perform word segmentation and part-of-speech tagging, and carry out a preliminary analysis. Step S4 cleans the corpus a third time in order to understand it more deeply. First, the required text regions are extracted according to the policy keywords. Next, the text in each region is filtered with a user-defined part-of-speech template; because the corpus has already been segmented and tagged, the template can act as a filter. Finally, part-of-speech analysis based on the segmentation and tagging is performed on the filtered text to obtain the associations between words and sentences. The policy condition tree is then constructed from the information obtained by analyzing the node contents: the nodes represent policy terms and policy conditions, and the node relationships are the associations between policy terms and policy conditions and between one policy condition and another.
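The part-of-speech template filtering in step S4.2 might look like the following sketch. The template `["n", "v", "n"]`, the tag abbreviations, and the two tagged sentences are invented for illustration; the patent does not list its templates.

```python
# Sentences are lists of (word, tag) pairs, as produced by segmentation and
# part-of-speech tagging. Tags are illustrative: n = noun, v = verb, d = adverb.

def matches_template(tagged, template):
    """True if the tag sequence contains `template` as a contiguous run."""
    tags = [pos for _word, pos in tagged]
    width = len(template)
    return any(tags[i:i + width] == template
               for i in range(len(tags) - width + 1))


def filter_sentences(sentences, templates):
    """Keep sentences whose tags match at least one part-of-speech template."""
    return [s for s in sentences
            if any(matches_template(s, t) for t in templates)]


TEMPLATES = [["n", "v", "n"]]          # hypothetical template: noun-verb-noun

sentences = [
    [("企业", "n"), ("具有", "v"), ("资格", "n")],
    [("特此", "d"), ("通知", "v")],
]
kept = filter_sentences(sentences, TEMPLATES)
print(["".join(word for word, _ in s) for s in kept])
```

Only the first sentence survives, because its tag sequence contains the noun-verb-noun pattern that a condition-like statement is assumed to follow here.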
Further, the step S4.1 includes:
step S4.11: selecting keywords related to policy terms;
step S4.12: writing a regular expression for describing the policy keywords;
step S4.13: matching texts in the tree nodes by using a regular expression;
step S4.14: a text region associated with the keyword is selected from the text.
Specifically, a text region is selected as follows: first, the keywords to be selected are established, namely the policy-related words of interest; next, a regular expression describing the keywords is written and matched against the text in the tree nodes to find the positions of the keywords; finally, the text region around each keyword is selected.
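A minimal sketch of steps S4.11 to S4.14 follows. The keyword list, the fixed character `window` that defines "near the keyword", and the sample text are assumptions; the patent does not state how large the selected region is.

```python
import re

KEYWORDS = ["申报条件", "申请条件"]     # hypothetical policy keywords (S4.11)
KEYWORD_RE = re.compile("|".join(map(re.escape, KEYWORDS)))   # S4.12


def keyword_regions(text, window=20):
    """Return (keyword, region) pairs; region spans `window` chars each side."""
    regions = []
    for match in KEYWORD_RE.finditer(text):                   # S4.13
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        regions.append((match.group(), text[start:end]))      # S4.14
    return regions


text = "本办法自发布之日起施行。申报条件：企业须在本市注册。"
for keyword, region in keyword_regions(text, window=5):
    print(keyword, "->", region)
```

A sentence-boundary rule could replace the fixed window if regions should never cut a sentence in half.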
Further, the step S4.3 further includes: performing syntactic analysis on the text.
Specifically, syntactic analysis of the text serves two purposes. On the one hand, it associates context, allows the text to be understood more deeply, removes ambiguity between words, and verifies the correctness and completeness of the corresponding treebank construction, so that the visualized structure tree has a lower error rate. On the other hand, besides building the structure tree, it can directly serve other upper-layer applications, such as search engine user log analysis, and other natural language processing tasks, such as keyword recognition, information extraction, automatic question answering, and machine translation. Furthermore, the syntactic analysis does not use deep learning on a labeled data set but is trained with traditional unsupervised learning, so it does not require large-scale manual data labeling, avoids errors caused by a poor-quality labeled data set, and saves considerable manpower and money.
Further, the syntactic analysis is dependency syntactic analysis.
Specifically, dependency grammar reveals the syntactic structure of a language unit by analyzing the dependency relationships between its components. That is, it identifies grammatical components such as subject, predicate, and object, and attributive, adverbial, and complement, and analyzes the relationships among them. By analyzing the syntactic structure of a passage and accurately extracting its backbone information, dependency parsing helps in understanding the meaning of the text. Once the syntactic structure has been analyzed, the text can also be translated word by word and the translation then reordered and revised according to that structure.
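To make "extracting backbone information" concrete, the sketch below reads the subject, predicate, and object off a dependency parse. The parse is written by hand for illustration, not produced by a parser, and the relation labels (HED, SBV, VOB, ATT) are an assumed label set; the patent does not name the parser or labels it uses.

```python
# Each arc is (dependent_index, head_index, relation); index 0 is a virtual root.
words = ["<root>", "申报", "企业", "具有", "独立", "法人", "资格"]
arcs = [
    (1, 2, "ATT"),   # 申报 modifies 企业
    (2, 3, "SBV"),   # 企业 is the subject of 具有
    (3, 0, "HED"),   # 具有 is the head of the sentence
    (4, 5, "ATT"),   # 独立 modifies 法人
    (5, 6, "ATT"),   # 法人 modifies 资格
    (6, 3, "VOB"),   # 资格 is the object of 具有
]


def find_dependent(arcs, head, relation):
    """Index of the first word attached to `head` by `relation`, or None."""
    return next((dep for dep, gov, rel in arcs
                 if gov == head and rel == relation), None)


def backbone(words, arcs):
    """Return (subject, predicate, object) read off the dependency arcs."""
    head = find_dependent(arcs, 0, "HED")
    subject = find_dependent(arcs, head, "SBV")
    obj = find_dependent(arcs, head, "VOB")
    pick = lambda i: words[i] if i is not None else None
    return pick(subject), words[head], pick(obj)


print(backbone(words, arcs))
```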
Compared with the prior art, the invention has the beneficial effects that:
(1) the policy document is accurately understood through corpus preprocessing, part of speech analysis and syntactic analysis.
(2) The syntactic analysis based on unsupervised learning does not need a large amount of manual data labeling, avoids errors caused by poor quality of a labeled data set, and saves a large amount of manpower and financial resources.
(3) The trained corpus may be used for other upper-level applications of natural language processing.
(4) The visual structure tree also visually shows each policy term and the association thereof, which can be easily understood by people.
Drawings
FIG. 1 is a schematic diagram of node construction according to the present invention;
FIG. 2 is a diagram of a policy condition tree according to the present invention;
FIG. 3 is a schematic diagram of a structure tree according to the present invention.
Detailed Description
The drawings are only for purposes of illustration and are not to be construed as limiting the invention. For a better understanding of the following embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
Examples
The embodiment provides a structured decomposition method of a policy file, which comprises the following steps:
step S1: obtaining a corpus set;
step S2: preprocessing a corpus;
step S3: constructing a discourse structure tree;
step S4: constructing a policy condition tree;
step S5: constructing and visualizing a new structure tree according to the discourse structure tree and the policy condition tree.
This embodiment implements one branch of natural language processing. Step S1 obtains the corpus set: a corpus is a piece of language material in text form and is the basic unit of a corpus set, and the collection of corpora is the corpus set. Step S2 preprocesses the corpus: noise is removed to obtain the required text content, which is then preliminarily analyzed and labeled so that it is easy for a machine to read and understand, providing the conditions for subsequent natural language processing applications. Step S3 constructs the discourse structure tree: based on the preliminary analysis from the preceding steps, tree nodes are established at the title level and associated with one another. Step S4 constructs the policy condition tree: the analysis is deepened to interpret the text content within the tree nodes, content-level nodes are established according to the policy conditions, and these nodes are associated with one another. Finally, step S5 constructs and visualizes a new structure tree from the discourse structure tree and the policy condition tree, establishing new associations that make the relationships between policy terms more concrete and easier to understand. Through corpus preprocessing, part-of-speech analysis, and syntactic analysis, the scheme allows the policy file to be understood accurately.
Further, the step S1 includes:
step S1.1: selecting a webpage from a government policy website;
step S1.2: defining the webpage as a document, and traversing the document to acquire text data;
step S1.3: establishing an element group set according to the acquired text data;
the element group set is: $element = (tuple_1, tuple_2, \ldots, tuple_n)$, $tuple_i = \{(tag_i, data_i) \mid i = 1, 2, \ldots, n\}$, where $n$ is the number of element groups, $i$ denotes the element group number, $tag_i$ denotes the html tag in the $i$-th element group, and $data_i$ denotes the html content in the $i$-th element group.
Because this embodiment processes policy files, the corpus is obtained by capturing corpus information from a government affairs website. A web page contains not only text but also other information such as picture links. The web page is therefore treated as a document and all of its data is converted to text; the converted page is then traversed to obtain all of its text data, and finally the element group set element is established to store the acquired text data. The converted page contains not only text content but also html tags, comments, and the like, and the tags carry information such as text style. Each element group stores the html content and the html tag separately to make subsequent reading and parsing easier.
Further, the step S2 includes:
step S2.1: cleaning the corpus;
step S2.2: performing word segmentation on the cleaned corpus;
step S2.3: performing part-of-speech tagging on the segmented corpus.
After the corpus is obtained, it inevitably contains unneeded information, so it must be filtered to remove useless content such as advertisements, useless links, and html comments. The useful content text is then extracted and segmented into words according to the meaning of the characters and words, and each character or word is labeled with its part-of-speech tag.
Further, the tag set used in step S2.3 is the People's Daily annotated corpus.
Specifically, the part-of-speech tags are taken from the People's Daily annotated corpus. Because the corpus being processed consists of policy files, this corpus gives more accurate tags than other tag sets.
Further, the step S3 includes:
step S3.1: writing a regular expression for describing each level of title style;
step S3.2: establishing a title template set from the regular expressions;
step S3.3: matching the title template set against the element group set; if the text content in an element group conforms to a regular expression, executing step S3.4, otherwise executing step S3.5;
step S3.4: building a new node on the corresponding layer, wherein the node is named after the text content that conforms to the regular expression, and the element group containing that text content is stored in the node;
step S3.5: merging the element group into the element group of the nearest node;
step S3.6: associating all nodes to form a structure tree;
the node hierarchy of the structure tree is the corresponding title hierarchy, and the association between the nodes is the association between the element groups.
Step S2 cleans the corpus by deleting noise, whereas step S3 cleans it by regular expression matching and at the same time extracts titles and content from it. First, the tag_i in each element group stores the html tag, and these tags include title tags that carry style information. Regular expressions are therefore written for the style of each title level and matched against the content marked by the title tags, so that the corresponding titles can be extracted. Second, a title tag also carries title-level information, from which the level of a matched title is known, and the element group corresponding to the title is stored in the node. If no regular expression in the title template set matches the content, the element group corresponding to that content is merged into the element group of the nearest node. Finally, the structure tree is constructed from the extracted information: the titles serve as node names, the title levels as node levels, and the associations between element groups as the associations between nodes, which yields the discourse structure tree.
Fig. 1 is a schematic diagram of node construction according to the present invention. As shown in Fig. 1, the node construction process in this embodiment is as follows. A suitable policy web page is selected and its text is parsed from the HTML. Regular expressions describing the titles are written, and together they form the title template set $S = \{S_1, S_2, \ldots, S_i, S_j, \ldots, S_n\}$. Suppose the current layer is $h$ and the regular expression corresponding to the titles of layer $h$ is $S_i$. If a sentence in the text conforms to $S_i$, a tree node is built on layer $h$. If the sentence does not conform to $S_i$ but conforms to $S_j$, it is checked whether $S_j$ is a title style of layer $h$: if so, a tree node is built on that layer; if not, a new layer numbered $h+1$ is split off, and it is checked whether the title style corresponding to layer $h+1$ is $S_j$. If so, a tree node is built; if not, the splitting and checking steps are repeated. Finally, if the content of an element group does not match any title style, it is merged into the element group of the nearest node.
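The layer-splitting idea can be sketched as follows, under a simplifying assumption: each title style is bound to a layer the first time it is seen, and layers are numbered in order of first appearance. The template names and patterns in `TEMPLATES` and the sample lines are hypothetical, and the sketch only labels lines with layers; it does not reproduce every check in the process described above.

```python
import re

# Hypothetical title template set S; no layer is fixed in advance.
TEMPLATES = {
    "chinese_numeral": re.compile(r"^[一二三四五六七八九十]+、"),
    "arabic_numeral": re.compile(r"^\d+[.、]"),
}


def assign_layers(lines):
    """Label each line with (layer, text); layer None means body text."""
    layer_of = {}            # template name -> layer number, filled on first use
    labelled = []
    for line in lines:
        name = next((n for n, p in TEMPLATES.items() if p.match(line)), None)
        if name is None:
            labelled.append((None, line))        # merged into nearest node later
            continue
        if name not in layer_of:                 # unseen style: split a new layer
            layer_of[name] = len(layer_of) + 1
        labelled.append((layer_of[name], line))
    return labelled


lines = ["一、企业条件", "1. 具有独立法人资格", "2. 财务制度健全", "二、申报材料"]
for layer, line in assign_layers(lines):
    print(layer, line)
```

Binding styles to layers on first use means the same code handles documents whose top-level titles use different numbering styles.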
Further, the step S4 includes:
step S4.1: extracting a text area related to policy terms in the tree nodes;
step S4.2: filtering the text in the text area using a part-of-speech combination template;
step S4.3: performing part-of-speech analysis on the filtered text;
step S4.4: extracting policy terms and conditions according to the analysis result, and constructing a policy condition tree according to the policy terms and conditions;
the tree nodes of the policy condition tree correspond to policy terms and policy conditions, and the associations between tree nodes correspond to associations between policy terms or associations between policy terms and policy conditions.
The preceding steps clean the corpus twice, perform word segmentation and part-of-speech tagging, and carry out a preliminary analysis. Step S4 cleans the corpus a third time in order to understand it more deeply. First, the required text regions are extracted according to the policy keywords. Next, the text in each region is filtered with a user-defined part-of-speech template; because the corpus has already been segmented and tagged, the template can act as a filter. Finally, part-of-speech analysis based on the segmentation and tagging is performed on the filtered text to obtain the associations between words and sentences. The policy condition tree is then constructed from the information obtained by analyzing the node contents: the nodes represent policy terms and policy conditions, and the node relationships are the associations between policy terms and policy conditions and between one policy condition and another.
In this embodiment, a specific example of the association between a policy term in a tree node and the policy conditions it requires is as follows:
Applicants that meet all of the following application conditions may apply:
I. Conditions for the enterprise:
1. It has complete industrial and commercial registration, tax collection and administration, and statistical relationships;
2. It is a unit with independent legal-person status, a sound financial system, and independent accounting.
FIG. 2 is a schematic diagram of the policy condition tree according to the present invention and shows the result of converting the above association.
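A condition tree of the kind described for FIG. 2 could be assembled from enumerated text as in the sketch below. The nested-dictionary representation, the two regular expressions, and the shortened English condition lines are assumptions for illustration; the sketch relies on layout cues only and omits the part-of-speech and syntactic analysis that the method uses to extract conditions.

```python
import re

ITEM_RE = re.compile(r"^\d+[.、]\s*(.+)$")    # "1. ..." condition items
GROUP_RE = re.compile(r"^(.+?)[:：]$")         # "...:" condition group headings


def build_condition_tree(term, lines):
    """Return {term: {group: [conditions]}} from enumerated policy text."""
    tree = {term: {}}
    current = None
    for raw in lines:
        line = raw.strip()
        item = ITEM_RE.match(line)
        if item and current is not None:
            # A numbered line is a condition under the current group.
            tree[term][current].append(item.group(1).rstrip(";；.。"))
            continue
        group = GROUP_RE.match(line)
        if group:
            current = group.group(1)
            tree[term][current] = []
    return tree


lines = [
    "I. Conditions for the enterprise:",
    "1. Complete registration, tax, and statistical relationships;",
    "2. Independent legal-person status and independent accounting.",
]
tree = build_condition_tree("Application conditions", lines)
print(tree)
```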
Further, the step S4.1 includes:
step S4.11: selecting keywords related to policy terms;
step S4.12: writing a regular expression for describing the policy keywords;
step S4.13: matching texts in the tree nodes by using a regular expression;
step S4.14: a text region associated with the keyword is selected from the text.
Specifically, a text region is selected as follows: first, the keywords to be selected are established, namely the policy-related words of interest; next, a regular expression describing the keywords is written and matched against the text in the tree nodes to find the positions of the keywords; finally, the text region around each keyword is selected.
Further, the step S4.3 further includes: the text is parsed.
Specifically, the syntactic analysis is performed on the text, on one hand, the context can be associated, the text can be understood more deeply, the ambiguity between words is eliminated, the correctness and the integrity of a corresponding tree library construction system are verified, and the information error rate of the visual structure tree is lower. On the other hand, besides building the structure tree, the method can also be directly served for other upper-layer applications, such as search engine user log analysis and other tasks related to natural language processing, such as keyword recognition, information extraction, automatic question answering, machine translation and the like. Furthermore, the syntactic analysis does not adopt a deep learning method based on the labeled data set, but trains on the basis of the traditional unsupervised learning method, does not need a large amount of manual data labeling, avoids errors caused by poor quality of the labeled data set, and saves a large amount of manpower and financial resources.
Further, the syntax analysis is a dependency syntax analysis.
Specifically, the dependency grammar reveals its syntactic structure by analyzing the dependency relationships between components within a language unit. Namely, the grammatical components of ' principal object and ' definite form complement ' in the sentence are analyzed and recognized, and the relation among the components is analyzed. Dependency parsing can help better understand the meaning of text by analyzing the syntactic structure of a segment of speech and accurately extracting its backbone information. Through analyzing the syntactic structure, the word-by-word translation can be carried out, and then the translation result is sorted and modified according to the syntactic structure.
Fig. 3 is a schematic diagram of the structure tree of the present invention. As shown in the figure, the discourse structure tree is visualized after being combined with the policy condition tree, which makes the various associations more concrete, clearer, and easier to understand.
It should be understood that the above embodiments of the present invention are only examples for clearly illustrating the technical solutions of the present invention, and are not intended to limit the specific embodiments of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the claims of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (9)

1. A method for structured decomposition of policy documents, the method comprising:
step S1: obtaining a corpus set;
step S2: preprocessing a corpus;
step S3: constructing a discourse structure tree;
step S4: constructing a policy condition tree;
step S5: constructing a new structure tree from the discourse structure tree and the policy condition tree, and visualizing the new structure tree.
2. The method for structured decomposition of policy document according to claim 1, wherein said step S1 includes:
step S1.1: selecting a webpage from a government policy website;
step S1.2: defining the webpage as document, and traversing the document to acquire text data;
step S1.3: establishing an element group set according to the acquired text data;
the element group set is: $Element = (tuple_1, tuple_2, \ldots, tuple_n)$, with $tuple_i = (tag_i, data_i)$ for $i = 1, 2, \ldots, n$, where $n$ is the number of element groups, $i$ denotes the element group number, $tag_i$ denotes the html tag in the $i$-th element group, and $data_i$ denotes the html content in the $i$-th element group.
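Steps S1.2 and S1.3 can be sketched with Python's standard `html.parser` module, as below. This is a minimal sketch under stated assumptions: the class name `ElementGroupCollector`, the choice to pair each text fragment with its innermost enclosing tag, and the sample page are illustrative and are not taken from the claims.

```python
from html.parser import HTMLParser

class ElementGroupCollector(HTMLParser):
    """Traverse an html document and collect (tag, data) element groups."""

    def __init__(self):
        super().__init__()
        self._stack = []    # currently open tags, innermost last
        self.element = []   # the element group set: [(tag_1, data_1), ...]

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.element.append((self._stack[-1], text))

# Hypothetical policy page.
page = ("<html><body><h1>Notice on Supporting Small Enterprises</h1>"
        "<h2>Article 1 Scope</h2>"
        "<p>This policy applies to registered enterprises.</p>"
        "</body></html>")
collector = ElementGroupCollector()
collector.feed(page)
collector.close()
for tag, data in collector.element:
    print(tag, "|", data)
```

A real page would also need script and style content filtered out, which this sketch does not attempt.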
3. The method for structured decomposition of policy document according to claim 1, wherein said step S2 includes:
step S2.1: cleaning the corpus;
step S2.2: performing word segmentation on the cleaned corpus;
step S2.3: and performing part-of-speech tagging on the corpus after word segmentation.
4. The method of claim 3, wherein the annotation set used in step S2.3 is the People's Daily annotated corpus.
5. The method for structured decomposition of policy document according to claim 2, wherein said step S3 includes:
step S3.1: writing a regular expression for describing each level of title style;
step S3.2: establishing a title template set according to the regular expressions;
step S3.3: matching the title template set against the element group set; if the text content of an element group conforms to a regular expression, executing step S3.4, otherwise executing step S3.5;
step S3.4: creating a new node on the corresponding level, wherein the node is named after the text content of the element group that conforms to the regular expression, and that element group is stored in the node;
step S3.5: merging the element group into the element groups of the nearest node;
step S3.6: associating all nodes to form a structure tree;
the node hierarchy of the structure tree is the corresponding title hierarchy, and the association between the nodes is the association between the element groups.
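Steps S3.1 to S3.6 can be sketched as a single pass over the element group set. The sketch rests on assumptions: the three title patterns in `TITLE_TEMPLATES` (English "Chapter", "Article", and "(1)" styles), the `Node` type, and the reading of "nearest node" as the most recently created node are all illustrative choices, not details given in the claims.

```python
import re
from dataclasses import dataclass, field

# S3.1 / S3.2: one regular expression per title level (assumed title styles).
TITLE_TEMPLATES = [
    (1, re.compile(r"^Chapter\s+\d+")),
    (2, re.compile(r"^Article\s+\d+")),
    (3, re.compile(r"^\(\d+\)")),
]

@dataclass
class Node:
    name: str
    level: int
    groups: list = field(default_factory=list)     # element groups stored in the node
    children: list = field(default_factory=list)

def title_level(text):
    """Return the title level whose template matches text, or None."""
    for level, pattern in TITLE_TEMPLATES:
        if pattern.match(text):
            return level
    return None

def build_structure_tree(element):
    root = Node("ROOT", 0)
    path = [root]                       # path[-1] is the most recently created node
    for tag, data in element:
        level = title_level(data)       # S3.3: match against the templates
        if level is None:
            path[-1].groups.append((tag, data))    # S3.5: merge into nearest node
            continue
        while path[-1].level >= level:  # climb until a shallower node is on top
            path.pop()
        node = Node(data, level, [(tag, data)])    # S3.4: new node on its level
        path[-1].children.append(node)             # S3.6: associate with parent
        path.append(node)
    return root

def render(node, depth=0):
    """Return the tree as indented lines, one node name per line."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines

element = [
    ("h1", "Chapter 1 General Provisions"),
    ("p", "This policy supports small enterprises."),
    ("h2", "Article 1 Scope"),
    ("p", "Registered enterprises may apply."),
    ("h2", "Article 2 Subsidy"),
    ("p", "(1) Up to 30% of project investment."),
    ("h1", "Chapter 2 Procedures"),
]
tree = build_structure_tree(element)
print("\n".join(render(tree)))
```

Matching is done on the text content rather than on the html tag, so a numbered item inside a `p` element still becomes a level 3 node.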
6. The method for structured decomposition of policy document according to claim 5, wherein said step S4 includes:
step S4.1: extracting a text area related to policy terms in the tree nodes;
step S4.2: filtering the text in the text area by using a part-of-speech combination template;
step S4.3: performing part-of-speech analysis on the filtered text;
step S4.4: extracting policy terms and conditions according to the analysis result, and constructing a policy condition tree according to the policy terms and conditions;
the tree nodes of the policy condition tree correspond to policy terms and policy conditions, and the associations between tree nodes correspond to associations between policy terms or associations between policy terms and policy conditions.
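The shape of the policy condition tree can be illustrated with a small data structure. The sketch assumes that steps S4.1 to S4.3 have already produced pairs of a policy term and its conditions; it does not perform the extraction itself. The names `ConditionNode` and `build_condition_tree` and the sample terms are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ConditionNode:
    label: str
    kind: str                                   # "term" or "condition"
    children: list = field(default_factory=list)

def build_condition_tree(root_label, extracted):
    """Build a tree from (policy_term, [policy_conditions]) pairs.

    Edges from the root to a term model term-to-term associations;
    edges from a term to its children model term-to-condition associations.
    """
    root = ConditionNode(root_label, "term")
    for term, conditions in extracted:
        term_node = ConditionNode(term, "term")
        term_node.children = [ConditionNode(c, "condition") for c in conditions]
        root.children.append(term_node)
    return root

def render(node, depth=0):
    """Return the tree as indented lines tagged with the node kind."""
    lines = ["  " * depth + f"[{node.kind}] {node.label}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines

# Hypothetical output of the extraction steps.
extracted = [
    ("equipment subsidy", ["annual revenue above 5 million yuan",
                           "registered for at least 2 years"]),
    ("tax rebate", ["high-tech enterprise certification"]),
]
condition_tree = build_condition_tree("Support measures", extracted)
print("\n".join(render(condition_tree)))
```

Keeping the node kind explicit makes it straightforward to attach a term subtree to the discourse structure tree node from which its text area was taken.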
7. The method according to claim 6, wherein the step S4.1 includes:
step S4.11: selecting keywords related to policy terms;
step S4.12: writing a regular expression for describing the policy keywords;
step S4.13: matching texts in the tree nodes by using a regular expression;
step S4.14: a text region associated with the keyword is selected from the text.
8. The method according to claim 6, wherein the step S4.3 further comprises: performing syntactic analysis on the text.
9. The method of claim 8, wherein the syntactic analysis is dependency parsing.
CN201910766729.2A 2019-08-19 2019-08-19 Structured decomposition method for policy file Active CN110609983B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910766729.2A CN110609983B (en) 2019-08-19 2019-08-19 Structured decomposition method for policy file

Publications (2)

Publication Number Publication Date
CN110609983A true CN110609983A (en) 2019-12-24
CN110609983B CN110609983B (en) 2023-06-09

Family

ID=68890232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910766729.2A Active CN110609983B (en) 2019-08-19 2019-08-19 Structured decomposition method for policy file

Country Status (1)

Country Link
CN (1) CN110609983B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106919606A (en) * 2015-12-28 2017-07-04 航天信息股份有限公司 A kind of method and system that SQL query condition is realized based on tree construction
CN107145479A (en) * 2017-05-04 2017-09-08 北京文因互联科技有限公司 Structure of an article analysis method based on text semantic
CN109493265A (en) * 2018-11-05 2019-03-19 北京奥法科技有限公司 A kind of Policy Interpretation method and Policy Interpretation system based on deep learning
CN109918672A (en) * 2019-03-13 2019-06-21 东华大学 A kind of structuring processing method of the Thyroid ultrasound report based on tree construction

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036150A (en) * 2020-07-07 2020-12-04 远光软件股份有限公司 Electricity price policy term analysis method, storage medium and computer
CN112131385A (en) * 2020-09-15 2020-12-25 天津大学 Structure analysis method of privacy policy
CN112580331A (en) * 2020-12-15 2021-03-30 国家工业信息安全发展研究中心 Method and system for establishing knowledge graph of policy text
CN112632964A (en) * 2020-12-24 2021-04-09 平安科技(深圳)有限公司 NLP-based industry policy information processing method, device, equipment and medium
CN114021574A (en) * 2022-01-05 2022-02-08 杭州实在智能科技有限公司 Intelligent analysis and structuring method and system for policy file
CN115080924A (en) * 2022-07-25 2022-09-20 南开大学 Software license clause extraction method based on natural language understanding
CN115859968A (en) * 2023-02-27 2023-03-28 四川省计算机研究院 Policy granular analysis system based on natural language analysis and machine learning
CN115859968B (en) * 2023-02-27 2023-11-21 四川省计算机研究院 Policy granulation analysis system based on natural language analysis and machine learning

Also Published As

Publication number Publication date
CN110609983B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN110609983B (en) Structured decomposition method for policy file
Velardi et al. A taxonomy learning method and its application to characterize a scientific web community
CN111078875B (en) Method for extracting question-answer pairs from semi-structured document based on machine learning
CN109947921B (en) Intelligent question-answering system based on natural language processing
CN111259631B (en) Referee document structuring method and referee document structuring device
CN106570171A (en) Semantics-based sci-tech information processing method and system
CN111061882A (en) Knowledge graph construction method
Navigli et al. From Glossaries to Ontologies: Extracting Semantic Structure from Textual Definitions.
CN110609998A (en) Data extraction method of electronic document information, electronic equipment and storage medium
Mitkov et al. Coreference and anaphora: developing annotating tools, annotated resources and annotation strategies
Kmail et al. An automatic online recruitment system based on exploiting multiple semantic resources and concept-relatedness measures
CN112541337B (en) Document template automatic generation method and system based on recurrent neural network language model
Alex et al. Digitised historical text: Does it have to be mediOCRe?.
Itani et al. Corpora for sentiment analysis of Arabic text in social media
Kutter Corpus analysis
CN113377916B (en) Extraction method of main relations in multiple relations facing legal text
CN113312922A (en) Improved chapter-level triple information extraction method
CN113159969A (en) Financial long text rechecking system
CN112711666A (en) Futures label extraction method and device
McEnery et al. Corpus annotation and reference resolution
CN115759037A (en) Intelligent auditing frame and auditing method for building construction scheme
CN112488593B (en) Auxiliary bid evaluation system and method for bidding
Suriyachay et al. Thai named entity tagged corpus annotation scheme and self verification
CN111859887A (en) Scientific and technological news automatic writing system based on deep learning
KR101126186B1 (en) Apparatus and Method for disambiguation of morphologically ambiguous Korean verbs, and Recording medium thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant