CN113239018A - Policy data screening model and policy data processing method - Google Patents
Policy data screening model and policy data processing method
- Publication number
- CN113239018A (application CN202110613433.4A)
- Authority
- CN
- China
- Prior art keywords
- policy
- data
- policy data
- keywords
- paragraph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention belongs to the technical field of data screening and relates in particular to a policy data screening model and a policy data processing method. The policy data processing method comprises the following steps: constructing a policy data collection database; collecting basic data according to the policy data collection database; screening core data from the basic data according to the policy data screening model; and constructing policy data from the core data. Policy data published on the network is thereby collected and organized, so that a user can learn about multiple policies in a single visit, which saves time.
Description
Technical Field
The invention belongs to the technical field of data screening, and particularly relates to a policy data screening model and a policy data processing method.
Background
With the development of network technology, policies that were traditionally issued through paper documents and periodicals are now published online first. However, the websites that publish policies are numerous and varied, so anyone who needs to learn about several policies must visit several websites, which is time-consuming and laborious.
Therefore, it is necessary to design a new policy data screening model and a new policy data processing method based on the above technical problems.
Disclosure of Invention
The invention aims to provide a policy data screening model and a policy data processing method.
In order to solve the above technical problem, the present invention provides a policy data screening model, which includes:
where H(C_i) is the context co-occurrence entropy value of keyword C_i, a keyword identified in each paragraph of the paragraph set according to the policy-category keywords, and the co-occurrence term denotes the number of co-occurrences of another word C_j with the word C_i.
On the other hand, the invention also provides a policy data processing method, which comprises the following steps:
constructing a policy data collection database;
collecting basic data according to the policy data collection database;
screening core data from the basic data according to the policy data screening model;
constructing policy data from the core data.
Further, the method for constructing the policy data collection database comprises the following steps:
collecting the web addresses (URLs) of websites that publish policy information, and storing each URL in a database to form the policy data collection database.
Further, the method for collecting basic data according to the policy data collection database comprises the following steps:
using a web crawler to acquire all original data from every website in the policy data collection database, and screening the original data to obtain the basic data.
Further, the method for screening core data in basic data according to the policy data screening model comprises the following steps:
dividing the basic data into a paragraph set, identifying the keywords of each paragraph in the paragraph set according to the policy-category keywords, and identifying the core word among those keywords, i.e.:
dividing a paragraph into n words to form a word set C, and identifying the keywords in the word set C;
for each keyword C_i in the word set C, calculating the number of co-occurrences of C_i with each of the remaining words in the word set C;
obtaining the context co-occurrence entropy value of keyword C_i:
where H(C_i) is the context co-occurrence entropy value of keyword C_i, a keyword identified in each paragraph of the paragraph set according to the policy-category keywords, and the co-occurrence term denotes the number of co-occurrences of another word C_j with the word C_i;
after the context co-occurrence entropy values of all keywords have been obtained, comparing them, the keyword with the largest context co-occurrence entropy value being the core word;
determining the policy category corresponding to the core word according to the policy-category keywords, the content of the paragraph containing the core word corresponding to that policy category, and thereby determining the policy category to which each paragraph belongs.
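The entropy formula itself is not legible in this text, so the following worked example is only a sketch under a stated assumption: that the context co-occurrence entropy is a Shannon entropy over normalized co-occurrence counts. The symbol n_{ij} and the base-2 logarithm are illustrative choices, not taken from the source.

```latex
% Assumed form only: the source formula is not legible here.
% n_{ij} (hypothetical notation): number of co-occurrences of word C_j with keyword C_i.
\[
p_{ij} = \frac{n_{ij}}{\sum_{k \neq i} n_{ik}}, \qquad
H(C_i) = -\sum_{j \neq i} p_{ij} \log_2 p_{ij}
\]
% Worked example: keyword C_1 co-occurs with three other words 2, 1 and 1 times.
\[
p = \left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4}\right), \qquad
H(C_1) = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{4}\cdot 2 = 1.5 \text{ bits}
\]
% A keyword C_2 that co-occurs with only one other word has p = (1), so H(C_2) = 0.
```

Under this assumed form, C_1 would be chosen as the core word over C_2, because its context is spread over more distinct words.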
Further, the method for constructing policy data according to the core data comprises the following steps:
assigning each paragraph to its corresponding policy category, according to the policy category to which the paragraph belongs, to construct the policy data.
The advantages of the invention are as follows: a policy data collection database is constructed; basic data is collected according to the policy data collection database; core data is screened from the basic data according to the policy data screening model; and policy data is constructed from the core data. Policy data published on the network is thereby collected and organized, so that a user can learn about multiple policies in a single visit, which saves time.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of screening models according to policy data in accordance with the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
This embodiment 1 provides a policy data screening model, including:
where H(C_i) is the context co-occurrence entropy value of keyword C_i, a keyword identified in each paragraph of the paragraph set according to the policy-category keywords, and the co-occurrence term denotes the number of co-occurrences of another word C_j with the word C_i.
Example 2
On the basis of embodiment 1, this embodiment 2 further provides a policy data processing method, including: constructing a policy data collection database; collecting basic data according to the policy data collection database; screening core data from the basic data according to the policy data screening model; and constructing policy data from the core data. Policy data published on the network is thereby collected and organized, so that a user can learn about multiple policies in a single visit, which saves time.
In this embodiment, the policy data screening model is the policy data screening model of embodiment 1.
In this embodiment, the method for constructing the policy data collection database includes: collecting the web addresses (URLs) of websites that publish policy information, and storing each URL in a database to form the policy data collection database.
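A minimal sketch of such a collection database, assuming SQLite as the store (the source only says "a database"). The table name `policy_sites`, the function names, and the example URLs are hypothetical.

```python
import sqlite3


def build_collection_db(urls, path=":memory:"):
    """Store policy-site URLs in a one-table SQLite database (assumed schema)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS policy_sites (url TEXT PRIMARY KEY)")
    # INSERT OR IGNORE skips URLs that are already stored.
    conn.executemany(
        "INSERT OR IGNORE INTO policy_sites (url) VALUES (?)",
        [(u,) for u in urls],
    )
    conn.commit()
    return conn


def list_sites(conn):
    """Return every stored URL in alphabetical order."""
    rows = conn.execute("SELECT url FROM policy_sites ORDER BY url")
    return [row[0] for row in rows]


# Hypothetical example URLs; the duplicate is stored only once.
conn = build_collection_db([
    "https://policy.example.org/a",
    "https://policy.example.com/b",
    "https://policy.example.org/a",
])
sites = list_sites(conn)
```

Using the URL as the primary key is one simple way to keep the list free of duplicates as new sites are added.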
In this embodiment, the method for collecting basic data according to the policy data collection database includes: using a web crawler to acquire all original data from every website in the policy data collection database, and screening the original data to obtain the basic data. Different crawling tools, such as Requests and Selenium, can be used to cope with the anti-crawler measures of different websites. Tools such as BeautifulSoup and Selenium can then be used to select which parts of the original data on a website to keep and to remove HTML (HyperText Markup Language) tags, CSS (Cascading Style Sheets) styles, and the like, yielding the basic data, i.e. the policy-bearing content published on each website.
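The screening step can be sketched as follows. This example swaps in Python's standard-library `html.parser` in place of BeautifulSoup so that it runs without third-party packages, and it starts from an HTML string rather than a live fetch. The class name, function name, tag sets, and sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, dropping tags and <script>/<style> content."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "tr"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")  # block-level tags start a new paragraph

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_paragraphs(html):
    """Turn original HTML into a list of clean text paragraphs (basic data)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse internal whitespace and drop empty lines.
    return [" ".join(line.split()) for line in text.split("\n") if line.strip()]


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>Pension subsidies rise.</p><script>var x=1;</script>"
    "<p>Tax <b>relief</b> for startups.</p></body></html>"
)
paragraphs = html_to_paragraphs(sample)
```

In a real crawler the HTML string would come from a fetch step, and sites that render content with JavaScript would need a browser-driven tool such as Selenium, as the text notes.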
In this embodiment, the method for screening core data from the basic data according to the policy data screening model includes: dividing the basic data into a paragraph set, identifying the keywords of each paragraph in the paragraph set according to the policy-category keywords, and identifying the core word among those keywords. That is, a paragraph is divided into n words to form a word set C, and the keywords in the word set C are identified; for each keyword C_i in the word set C, the number of co-occurrences of C_i with each of the remaining words in the word set C is calculated; and the context co-occurrence entropy value of keyword C_i is obtained:
where H(C_i) is the context co-occurrence entropy value of keyword C_i, a keyword identified in each paragraph of the paragraph set according to the policy-category keywords, and the co-occurrence term denotes the number of co-occurrences of another word C_j with the word C_i. After the context co-occurrence entropy values of all keywords have been obtained, they are compared, and the keyword with the largest value is the core word. If only one keyword appears in the paragraph, that keyword is the core word. If several keywords in a paragraph share the maximum context co-occurrence entropy value, the paragraph has several core words and is assigned to several policy categories at once when its policy category is determined. The policy category corresponding to each core word is determined from the policy-category keywords; the content of the paragraph containing the core word corresponds to that policy category, which in turn determines the policy category to which each paragraph belongs.
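The core-word selection can be sketched in Python. Two things here are assumptions rather than the source's definitions: co-occurrence is counted within a sliding window of `window` words on each side of the keyword, and the entropy is a base-2 Shannon entropy over the normalized counts. All function names and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_counts(words, keyword, window=2):
    """Count words appearing within `window` positions of each keyword occurrence."""
    counts = Counter()
    for i, word in enumerate(words):
        if word != keyword:
            continue
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if words[j] != keyword:
                counts[words[j]] += 1
    return counts


def context_entropy(words, keyword, window=2):
    """Shannon entropy (bits) of the keyword's co-occurrence distribution (assumed form)."""
    counts = cooccurrence_counts(words, keyword, window)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def core_words(words, keywords, window=2):
    """Return the keyword(s) with the largest entropy; ties yield several core words."""
    present = [k for k in keywords if k in words]
    if not present:
        return []
    scores = {k: context_entropy(words, k, window) for k in present}
    best = max(scores.values())
    return [k for k in present if math.isclose(scores[k], best)]


paragraph = "the pension subsidy raises pension income while tax relief stays".split()
selected = core_words(paragraph, ["pension", "tax"])  # "pension" has the more varied context
tied = core_words("a pension b x y c tax d".split(), ["pension", "tax"], window=1)
```

A paragraph with a single keyword returns that keyword, and a tie returns every tied keyword, matching the two special cases described above. Chinese text would first need a word segmenter to produce the `words` list.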
In this embodiment, the method for constructing policy data from the core data includes: assigning each paragraph to its corresponding policy category, according to the policy category to which the paragraph belongs, to construct the policy data. Under the directory of each policy category, the paragraphs collected from the various websites whose keywords correspond to that category are gathered together, so that a user can learn about multiple policies in a single visit, which saves time.
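A minimal sketch of this grouping step, assuming each paragraph arrives with the core words already identified for it and that a keyword-to-category mapping is available. The function name, the mapping, and the sample paragraphs are hypothetical.

```python
from collections import defaultdict


def build_policy_data(paragraph_core_words, keyword_to_category):
    """Group paragraphs under every policy category matched by their core words."""
    catalogue = defaultdict(list)
    for paragraph, cores in paragraph_core_words:
        # A paragraph with several core words can land in several categories.
        categories = {keyword_to_category[w] for w in cores if w in keyword_to_category}
        for category in sorted(categories):
            catalogue[category].append(paragraph)
    return dict(catalogue)


keyword_to_category = {"pension": "elderly care", "tax": "taxation"}
policy_data = build_policy_data(
    [
        ("Pension subsidies rise.", ["pension"]),
        ("Tax relief for startups.", ["tax"]),
        ("Pension income is tax exempt.", ["pension", "tax"]),
    ],
    keyword_to_category,
)
```

Paragraphs whose core words match no category are left out of the result, which is one possible design choice; they could instead be kept under a catch-all category.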
In this embodiment, the division of policy categories and the extraction of policy-category keywords can be set according to the policy area to be collected. For example, when data on elderly-care (pension) policies needs to be collected, the policy categories and keywords related to elderly care can be defined so that those policies are collected accurately.
In summary, the invention constructs a policy data collection database, collects basic data according to the policy data collection database, screens core data from the basic data according to the policy data screening model, and constructs policy data from the core data. Policy data published on the network is thereby collected and organized, so that a user can learn about multiple policies in a single visit, which saves time.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented as software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention may be embodied as a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In light of the foregoing description of the preferred embodiment of the present invention, many modifications and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The technical scope of the present invention is not limited to the content of the specification, and must be determined according to the scope of the claims.
Claims (6)
2. A method of processing policy data, comprising:
constructing a policy data collection database;
collecting basic data according to the policy data collection database;
screening core data from the basic data according to the policy data screening model;
constructing policy data from the core data.
3. The policy data processing method of claim 2 wherein,
the method for constructing the policy data collection database comprises the following steps:
collecting the web addresses (URLs) of websites that publish policy information, and storing each URL in a database to form the policy data collection database.
4. The policy data processing method according to claim 3,
the method for collecting basic data according to the policy data collection database comprises the following steps:
using a web crawler to acquire all original data from every website in the policy data collection database, and screening the original data to obtain the basic data.
5. The policy data processing method according to claim 4,
the method for screening core data in basic data according to the policy data screening model comprises the following steps:
dividing the basic data into a paragraph set, identifying the keywords of each paragraph in the paragraph set according to the policy-category keywords, and identifying the core word among those keywords, i.e.:
dividing a paragraph into n words to form a word set C, and identifying the keywords in the word set C;
for each keyword C_i in the word set C, calculating the number of co-occurrences of C_i with each of the remaining words in the word set C;
obtaining the context co-occurrence entropy value of keyword C_i:
where H(C_i) is the context co-occurrence entropy value of keyword C_i, a keyword identified in each paragraph of the paragraph set according to the policy-category keywords, and the co-occurrence term denotes the number of co-occurrences of another word C_j with the word C_i;
after the context co-occurrence entropy values of all keywords have been obtained, comparing them, the keyword with the largest context co-occurrence entropy value being the core word;
determining the policy category corresponding to the core word according to the policy-category keywords, the content of the paragraph containing the core word corresponding to that policy category, and thereby determining the policy category to which each paragraph belongs.
6. The policy data processing method according to claim 5,
the method for constructing policy data according to core data comprises the following steps:
assigning each paragraph to its corresponding policy category, according to the policy category to which the paragraph belongs, to construct the policy data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110613433.4A | 2021-06-02 | 2021-06-02 | Policy data screening model and policy data processing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110613433.4A | 2021-06-02 | 2021-06-02 | Policy data screening model and policy data processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113239018A | 2021-08-10 |
Family
ID=77136352
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110613433.4A (Pending) | Policy data screening model and policy data processing method | 2021-06-02 | 2021-06-02 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113239018A (en) |
- 2021-06-02: CN application CN202110613433.4A filed (publication CN113239018A), status: Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||