CN113239018A - Policy data screening model and policy data processing method - Google Patents

Policy data screening model and policy data processing method

Info

Publication number
CN113239018A
Authority
CN
China
Prior art keywords
policy
data
policy data
keywords
paragraph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110613433.4A
Other languages
Chinese (zh)
Inventor
卢剑伟
于世著
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Ciyanglin Information Technology Co ltd
Original Assignee
Changzhou Ciyanglin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Ciyanglin Information Technology Co ltd
Priority to CN202110613433.4A
Publication of CN113239018A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention belongs to the technical field of data screening and relates in particular to a policy data screening model and a policy data processing method. The policy data processing method comprises: constructing a policy data collection database; collecting basic data according to the policy data collection database; screening core data from the basic data according to the policy data screening model; and constructing policy data from the core data. Policy data published on the network is thereby collected and organized, so that a user can review multiple policies in a single visit, which saves time.

Description

Policy data screening model and policy data processing method
Technical Field
The invention belongs to the technical field of data screening, and particularly relates to a policy data screening model and a policy data processing method.
Background
With the development of network technology, policies that were traditionally issued through paper documents and periodicals are now published online first. However, policies are published on many different websites, so anyone who needs to follow several policies must visit several websites, which is time-consuming and laborious.
Therefore, a new policy data screening model and a new policy data processing method need to be designed to address this technical problem.
Disclosure of Invention
The invention aims to provide a policy data screening model and a policy data processing method.
In order to solve the above technical problem, the present invention provides a policy data screening model, which includes:
Figure BDA0003096964930000011
where H(C_i) is the context co-occurrence entropy value of a keyword C_i, the keywords of each paragraph in the paragraph set being identified according to the policy-category keywords; and
Figure BDA0003096964930000012
is the number of co-occurrences of another word C_j with the word C_i.
In another aspect, the invention also provides a policy data processing method, which comprises the following steps:
constructing a policy data collection database;
collecting basic data according to the policy data collection database;
screening core data from the basic data according to the policy data screening model; and
constructing policy data from the core data.
Further, the method for constructing the policy data collection database comprises:
collecting the web addresses (URLs) of websites that publish policy information, and storing each address in a database to form the policy data collection database.
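The database-construction step can be pictured with a minimal sketch. The patent does not name a database engine or schema, so SQLite, the table name `policy_sites`, and the example URLs below are all assumptions made for illustration.

```python
import sqlite3


def build_collection_db(urls, path=":memory:"):
    """Store policy-site URLs in a one-table SQLite database (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS policy_sites ("
        "id INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL)"
    )
    # INSERT OR IGNORE skips URLs that are already stored.
    conn.executemany(
        "INSERT OR IGNORE INTO policy_sites (url) VALUES (?)",
        [(u,) for u in urls],
    )
    conn.commit()
    return conn


def list_sites(conn):
    """Return every stored URL in insertion order."""
    rows = conn.execute("SELECT url FROM policy_sites ORDER BY id")
    return [row[0] for row in rows]


# Placeholder URLs, not real policy sites.
conn = build_collection_db([
    "https://example.gov/policies",
    "https://example.org/notices",
    "https://example.gov/policies",  # duplicate, ignored
])
```

A crawler would then iterate over `list_sites(conn)` to decide which sites to visit.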
Further, the method for collecting basic data according to the policy data collection database comprises:
using a web crawler to acquire all raw data from every website in the policy data collection database, and screening the raw data to obtain the basic data.
Further, the method for screening core data from the basic data according to the policy data screening model comprises:
dividing the basic data into a paragraph set, identifying the keywords of each paragraph in the paragraph set according to the policy-category keywords, and identifying the core word among those keywords, namely:
dividing a paragraph into n words to form a word set C, and identifying the keywords in the word set C;
for each keyword C_i in the word set C, counting the number of co-occurrences of the keyword C_i with each of the remaining words in the word set C;
obtaining the context co-occurrence entropy value of the keyword C_i:
Figure BDA0003096964930000021
where H(C_i) is the context co-occurrence entropy value of a keyword C_i, the keywords of each paragraph in the paragraph set being identified according to the policy-category keywords; and
Figure BDA0003096964930000022
is the number of co-occurrences of another word C_j with the word C_i;
after the context co-occurrence entropy values of all keywords are obtained, comparing them, the keyword with the largest context co-occurrence entropy value being the core word;
determining the policy category corresponding to the core word according to the policy-category keywords, the content of the paragraph containing the core word corresponding to that policy category, and thereby determining the policy category to which each paragraph belongs.
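The screening steps can be sketched in Python under two explicit assumptions, because the formula survives only as an image reference in this text. First, the entropy is taken to be the Shannon entropy of the normalized co-occurrence counts. Second, two words are taken to "co-occur" when they fall within a fixed token window, since the text does not define co-occurrence. The function names and the `window` parameter are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, keyword, window=2):
    """Count how often each other word lies within `window` tokens of `keyword`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if tokens[j] != keyword:
                counts[tokens[j]] += 1
    return counts


def context_entropy(tokens, keyword, window=2):
    """Shannon entropy (bits) of the keyword's co-occurrence distribution (assumed form)."""
    counts = cooccurrence_counts(tokens, keyword, window)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def core_words(tokens, keywords, window=2):
    """Return the keyword(s) present in the paragraph with the largest entropy."""
    present = [k for k in keywords if k in tokens]
    if not present:
        return []
    scores = {k: context_entropy(tokens, k, window) for k in present}
    best = max(scores.values())
    # Equal maxima all count as core words.
    return [k for k in present if math.isclose(scores[k], best)]


paragraph = ["housing", "pension", "subsidy", "for", "elderly"]
print(core_words(paragraph, ["housing", "pension"]))
```

In this toy paragraph, "pension" has three distinct neighbours inside the window and "housing" only two, so "pension" scores higher under the assumed definition.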
Further, the method for constructing policy data from the core data comprises:
assigning each paragraph to the policy category to which it belongs, thereby constructing the policy data.
The beneficial effects of the invention are as follows: a policy data collection database is constructed; basic data is collected according to the policy data collection database; core data is screened from the basic data according to the policy data screening model; and policy data is constructed from the core data. Policy data published on the network is thereby collected and organized, so that a user can review multiple policies in a single visit, which saves time.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of screening according to the policy data screening model of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
This embodiment 1 provides a policy data screening model, including:
Figure BDA0003096964930000041
where H(C_i) is the context co-occurrence entropy value of a keyword C_i, the keywords of each paragraph in the paragraph set being identified according to the policy-category keywords; and
Figure BDA0003096964930000042
is the number of co-occurrences of another word C_j with the word C_i.
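Because the formula is available here only as an image reference, the following is an assumed reading rather than the patent's confirmed expression. It treats the context co-occurrence entropy as the Shannon entropy of the normalized co-occurrence counts. The symbol n_{ij} for the co-occurrence count of C_j with C_i and the base-2 logarithm are notation introduced for this sketch.

```latex
% Assumed form (not confirmed by the source text):
% n_{ij} = number of co-occurrences of word C_j with keyword C_i
\[
  p_{ij} = \frac{n_{ij}}{\sum_{k \neq i} n_{ik}}, \qquad
  H(C_i) = -\sum_{j \neq i} p_{ij} \log_2 p_{ij}
\]
% Worked example: a keyword co-occurring with three other words
% with counts n = (2, 1, 1), so p = (1/2, 1/4, 1/4):
\[
  H(C_i) = \tfrac12 \log_2 2 + \tfrac14 \log_2 4 + \tfrac14 \log_2 4
         = 0.5 + 0.5 + 0.5 = 1.5 \text{ bits}
\]
```

Under this reading, a keyword whose co-occurrences are spread over many different context words scores higher than one that appears with only a few.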
Example 2
On the basis of embodiment 1, this embodiment 2 further provides a policy data processing method, including: constructing a policy data collection database; collecting basic data according to the policy data collection database; screening core data from the basic data according to the policy data screening model; and constructing policy data from the core data. Policy data published on the network is thereby collected and organized, so that a user can review multiple policies in a single visit, which saves time.
In this embodiment, the policy data screening model is the policy data screening model of embodiment 1.
In this embodiment, the method for constructing the policy data collection database includes: collecting the web addresses (URLs) of websites that publish policy information, and storing each address in a database to form the policy data collection database.
In this embodiment, the method for collecting basic data according to the policy data collection database includes: using a web crawler to acquire all raw data from every website in the policy data collection database, and screening the raw data to obtain the basic data. Different crawling tools, such as Requests and Selenium, can be used to cope with the anti-crawler measures of different websites. Tools such as BeautifulSoup and Selenium are used to select which parts of the raw data on a website to keep and to remove HTML (HyperText Markup Language) tags, CSS (Cascading Style Sheets) styles and the like, yielding the basic data, that is, the policy-bearing content published on each website.
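The cleaning half of this step can be sketched as follows. To keep the example runnable offline and free of third-party dependencies, it substitutes Python's standard-library `html.parser` for the BeautifulSoup and Selenium tools named in the text, and it leaves out page fetching entirely. The class name, function name, and sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text nodes, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped element.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text_chunks(html):
    """Return the text nodes of an HTML string with tags and CSS removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


raw = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>Pension subsidy notice.</p><p>Housing loan rules.</p>"
    "</body></html>"
)
chunks = visible_text_chunks(raw)
```

Each chunk is one text node, so paragraphs containing inline markup such as bold text would arrive in several pieces. A fuller version would merge them at block-level tags.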
In this embodiment, the method for screening core data from the basic data according to the policy data screening model includes: dividing the basic data into a paragraph set, identifying the keywords of each paragraph in the paragraph set according to the policy-category keywords, and identifying the core word among those keywords, namely: dividing a paragraph into n words to form a word set C, and identifying the keywords in the word set C; for each keyword C_i in the word set C, counting the number of co-occurrences of the keyword C_i with each of the remaining words in the word set C; and obtaining the context co-occurrence entropy value of the keyword C_i:
Figure BDA0003096964930000051
where H(C_i) is the context co-occurrence entropy value of a keyword C_i, the keywords of each paragraph in the paragraph set being identified according to the policy-category keywords; and
Figure BDA0003096964930000052
is the number of co-occurrences of another word C_j with the word C_i. After the context co-occurrence entropy values of all keywords are obtained, they are compared, and the keyword with the largest context co-occurrence entropy value is the core word. If only one keyword appears in the paragraph, that keyword is the core word. If several keywords in the paragraph share the maximum context co-occurrence entropy value, the paragraph has several core words and is assigned to several policy categories at once when its policy category is determined. The policy category corresponding to the core word is determined according to the policy-category keywords; the content of the paragraph containing the core word corresponds to that policy category, and the policy category to which each paragraph belongs is thereby determined.
In this embodiment, the method for constructing policy data from the core data includes: assigning each paragraph to the policy category to which it belongs, thereby constructing the policy data. Under the directory of each policy category, the paragraphs collected from the various websites that contain keywords of that policy category are gathered together, so that a user can review multiple policies in a single visit, which saves time.
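A minimal sketch of this grouping step follows, assuming each paragraph has already been labelled with one or more policy categories. The function name, the input shape, and the sample paragraphs are hypothetical.

```python
from collections import defaultdict


def build_policy_data(classified):
    """Group (paragraph, categories) pairs into {category: [paragraphs]}."""
    catalogue = defaultdict(list)
    for paragraph, categories in classified:
        # A paragraph with several core words is filed under every matching category.
        for category in categories:
            catalogue[category].append(paragraph)
    return dict(catalogue)


classified = [
    ("Pension subsidy notice.", ["elderly_care"]),
    ("Housing loan rules.", ["housing"]),
    ("Rent support for retirees.", ["elderly_care", "housing"]),
]
policy_data = build_policy_data(classified)
```

The resulting dictionary plays the role of the per-category directory: one lookup by category returns every collected paragraph for that policy area.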
In this embodiment, the policy categories and the keywords of each policy category can be set according to the policy area to be collected. For example, when data on elderly-care policy needs to be collected, elderly-care policy categories and related keywords can be defined so that elderly-care policies are collected accurately.
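One way to picture such a configuration is a table mapping each policy category to its keywords. The category names and keywords below are invented for illustration, and exact token matching is an assumption, since the text does not say how keywords are matched.

```python
# Hypothetical category-to-keyword configuration; adjust to the policy area of interest.
CATEGORY_KEYWORDS = {
    "elderly_care": {"pension", "nursing"},
    "housing": {"housing", "rent"},
}


def _lookup(category_keywords):
    """Invert the configuration into {keyword: category}."""
    return {kw: cat for cat, kws in category_keywords.items() for kw in kws}


def keywords_in(tokens, category_keywords=CATEGORY_KEYWORDS):
    """Return {keyword: category} for every configured keyword found in tokens."""
    lookup = _lookup(category_keywords)
    return {tok: lookup[tok] for tok in tokens if tok in lookup}


def categories_for(core_words, category_keywords=CATEGORY_KEYWORDS):
    """Map core words to the sorted list of policy categories they indicate."""
    lookup = _lookup(category_keywords)
    return sorted({lookup[w] for w in core_words if w in lookup})


found = keywords_in("the pension and rent rules".split())
```

Switching the collection target to a different policy area then only requires editing `CATEGORY_KEYWORDS`.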
In summary, the invention constructs a policy data collection database, collects basic data according to the policy data collection database, screens core data from the basic data according to the policy data screening model, and constructs policy data from the core data. Policy data published on the network is thereby collected and organized, so that a user can review multiple policies in a single visit, which saves time.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In light of the foregoing description of the preferred embodiment of the present invention, many modifications and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The technical scope of the present invention is not limited to the content of the specification, and must be determined according to the scope of the claims.

Claims (6)

1. A policy data screening model, comprising:
Figure FDA0003096964920000011
where H(C_i) is the context co-occurrence entropy value of a keyword C_i, the keywords of each paragraph in the paragraph set being identified according to the policy-category keywords; and
Figure FDA0003096964920000012
is the number of co-occurrences of another word C_j with the word C_i.
2. A method of processing policy data, comprising:
constructing a policy data collection database;
collecting basic data according to the policy data collection database;
screening core data from the basic data according to the policy data screening model; and
constructing policy data from the core data.
3. The policy data processing method according to claim 2, wherein
the method for constructing the policy data collection database comprises:
collecting the web addresses (URLs) of websites that publish policy information, and storing each address in a database to form the policy data collection database.
4. The policy data processing method according to claim 3, wherein
the method for collecting basic data according to the policy data collection database comprises:
using a web crawler to acquire all raw data from every website in the policy data collection database, and screening the raw data to obtain the basic data.
5. The policy data processing method according to claim 4, wherein
the method for screening core data from the basic data according to the policy data screening model comprises:
dividing the basic data into a paragraph set, identifying the keywords of each paragraph in the paragraph set according to the policy-category keywords, and identifying the core word among those keywords, namely:
dividing a paragraph into n words to form a word set C, and identifying the keywords in the word set C;
for each keyword C_i in the word set C, counting the number of co-occurrences of the keyword C_i with each of the remaining words in the word set C;
obtaining the context co-occurrence entropy value of the keyword C_i:
Figure FDA0003096964920000021
where H(C_i) is the context co-occurrence entropy value of a keyword C_i, the keywords of each paragraph in the paragraph set being identified according to the policy-category keywords; and
Figure FDA0003096964920000022
is the number of co-occurrences of another word C_j with the word C_i;
after the context co-occurrence entropy values of all keywords are obtained, comparing them, the keyword with the largest context co-occurrence entropy value being the core word;
determining the policy category corresponding to the core word according to the policy-category keywords, the content of the paragraph containing the core word corresponding to that policy category, and thereby determining the policy category to which each paragraph belongs.
6. The policy data processing method according to claim 5, wherein
the method for constructing policy data from the core data comprises:
assigning each paragraph to the policy category to which it belongs, thereby constructing the policy data.
CN202110613433.4A 2021-06-02 2021-06-02 Policy data screening model and policy data processing method Pending CN113239018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110613433.4A CN113239018A (en) 2021-06-02 2021-06-02 Policy data screening model and policy data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110613433.4A CN113239018A (en) 2021-06-02 2021-06-02 Policy data screening model and policy data processing method

Publications (1)

Publication Number Publication Date
CN113239018A (en) 2021-08-10

Family

ID=77136352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110613433.4A Pending CN113239018A (en) 2021-06-02 2021-06-02 Policy data screening model and policy data processing method

Country Status (1)

Country Link
CN (1) CN113239018A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination