CN116431815B - Intelligent management system for public village data - Google Patents

Publication number
CN116431815B
Authority
CN
China
Prior art keywords
information
village
words
data
decision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310685694.6A
Other languages
Chinese (zh)
Other versions
CN116431815A (en
Inventor
赵斌
张敏
张问银
高一龙
王雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Linyi University
Original Assignee
Linyi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Linyi University
Priority to CN202310685694.6A
Publication of CN116431815A
Application granted
Publication of CN116431815B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Human Resources & Organizations (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Evolutionary Biology (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of digital data processing, and in particular to an intelligent management system for public village data. The system comprises a village data disclosure platform database, an entry grabbing module, a data processing module and an information management module. The data processing module further comprises a keyword labeling module, a keyword decision index module, a public information vector processing module, a phrase induction module, a phrase weight module and a decision preference ratio module. First, the invention prevents words that occur with the same frequency in different pieces of public village information from reducing the accuracy of classifying the village information as a whole, which benefits the subsequent measurement of similarity between pieces of village information. Second, a decision preference ratio is constructed from the public information vectors, which prevents words that occur in only a small amount of special public village information from affecting the classification accuracy of all village data. The attribute classification sequence is obtained from the decision preference ratio and supplies the attribute selection sequence for the subsequent decision tree classification, improving the classification precision and management efficiency of the village data.

Description

Intelligent management system for public village data
Technical Field
The invention relates to the technical field of digital data processing, in particular to an intelligent management system for public village data.
Background
Village data helps in collecting and managing village information, which improves the quality and efficiency of village management work and supports the construction of digital villages. Public village data is mainly divided into a government affairs section, a service section, a financial section and the like, and the content of each section differs from region to region.
Public village data is varied in type and complex in content, which easily leads to problems such as incomplete disclosure, outdated formats, content that outside residents cannot understand, and uncoordinated management. The public village data therefore needs to be classified and managed reasonably so that villagers can quickly find the information they want. Current data classification methods include content-based classification, classification based on data form, and support vector machine (SVM) classification. Content-based classification tends to produce incomplete and unclear classes. Classification based on data form is easily affected by factors such as data update frequency, storage mode and data volume, so its precision is difficult to control. SVM classification handles multi-class problems poorly and consumes excessive computing resources and time when the data volume is large.
Disclosure of Invention
To solve the above problems, the invention provides an intelligent management system for public village data, comprising a village data disclosure platform database, an entry grabbing module, a data processing module and an information management module.
The entry grabbing module uses crawler technology to acquire all historical public information from the village data disclosure platform database, counts the number of pieces of village information, and forms a historical corpus.
The data processing module further comprises a keyword labeling module, a keyword decision index module, a public information vector processing module, a phrase induction module, a phrase weight module and a decision preference ratio module.
The information management module classifies the acquired public village data with a decision tree classification algorithm, using the attribute classification sequence obtained by the data processing module as the attribute selection sequence. In the classification result, sections are set according to the content of the village information, and a public information vector is obtained for each piece of village information in each section. Each section is divided into a certain number of sub-sections based on the similarity between the public information vectors, and each classification result is placed into the corresponding sub-section. After new information is uploaded to the village data disclosure platform, its public information vector is obtained, its similarity to the public information vectors of the existing village information in each sub-section is calculated, and the new information is assigned to the sub-section with the greatest similarity.
The decision tree classification algorithm comprises the following steps:
Step 1, collecting data: collect and sort the public village data, ensuring that each data sample contains a classification label and attribute values;
Step 2, preparing data: preprocess the data, including missing value handling, data standardization and discretization;
Step 3, feature selection: according to the attribute classification sequence obtained by the data processing module, select the attributes most suitable for classification as the attribute selection sequence of the decision tree;
Step 4, constructing the decision tree: construct a decision tree using the selected attribute sequence, and calculate the entropy of the data set as the initial uncertainty measure; for each attribute, calculate the information gain or the information gain ratio, and the Gini index reduction, of the data set, and select the attribute with the maximum information gain or gain ratio and the maximum Gini index reduction as the splitting attribute of the current node; create a branch node for the splitting attribute and a child node for each value of the attribute; repeat the above steps recursively for each child node until all samples in a node belong to the same class or no attributes remain for splitting.
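The split criteria named in Step 4 can be illustrated with a minimal Python sketch. It assumes discrete attribute values held in dictionaries; the attribute names `has_amount` and `has_phone`, the class labels and the helper names are hypothetical and do not come from the patent. Ties are resolved in favour of the attribute that appears first in the supplied attribute sequence, which is one plausible way to use the attribute selection sequence, not necessarily the patented one.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def split_by(rows, labels, attr):
    """Group the class labels by the value each row takes for attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    return groups

def information_gain(rows, labels, attr):
    """Entropy of the data set minus the weighted entropy after splitting on attr."""
    total = len(labels)
    groups = split_by(rows, labels, attr)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, ordered_attrs):
    """Attribute with the largest information gain; max() keeps the earliest on ties."""
    return max(ordered_attrs, key=lambda a: information_gain(rows, labels, a))

# Hypothetical toy data: four pieces of village information with two attributes.
rows = [
    {"has_amount": "yes", "has_phone": "yes"},
    {"has_amount": "yes", "has_phone": "no"},
    {"has_amount": "no", "has_phone": "yes"},
    {"has_amount": "no", "has_phone": "no"},
]
labels = ["finance", "finance", "service", "service"]
root = best_attribute(rows, labels, ["has_phone", "has_amount"])
```

In this toy data set `has_amount` separates the two classes completely, so it is chosen as the root even though it comes second in the sequence.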
Further, the crawler used in the entry grabbing module performs the following steps:
Step 1, determining the target website: determine the target website from which information is to be captured, and capture webpage data in accordance with its permissions and license agreement;
Step 2, selecting a crawler tool;
Step 3, analyzing the webpage structure: carefully analyze the webpage structure of the target website, including HTML tags, CSS selectors and XPath, to determine where the information is located and how to acquire it;
Step 4, writing the crawler code: initiating an HTTP request: use the crawler tool to send an HTTP request and obtain the HTML content of the target webpage; parsing the HTML: parse the HTML content with an HTML parsing library or an XPath parser; locating the target information: according to the analyzed webpage structure, locate the HTML elements containing the target information with CSS selectors and XPath; extracting information: extract the required information from the located HTML elements; storing information: store the extracted information in an appropriate data structure;
Step 5, setting crawler parameters: set the crawler's request headers, proxies and request frequency as required, to ensure the compliance and efficiency of the crawling process;
Step 6, handling anti-crawling mechanisms: proxy IPs, request header disguise and delayed requests are used to avoid anti-crawling measures;
Step 7, data cleaning and processing: the acquired raw data contains noise and useless information, so regular expressions and string processing functions are used to clean and process the data and extract accurate, useful information.
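The request, parse, extract and clean steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the crawler of the patent: the page layout (`<div class="notice">` elements), the `USER_AGENT` string and all function names are hypothetical, and the sketch only sets a request header and a fixed delay rather than using proxies.

```python
import re
import time
import urllib.request
from html.parser import HTMLParser

# Hypothetical identification string sent with every request.
USER_AGENT = "village-data-crawler/0.1"

def fetch(url, delay_seconds=1.0):
    """Send one HTTP GET request with an explicit User-Agent, then pause."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    time.sleep(delay_seconds)  # fixed delay to limit the request frequency
    return html

class NoticeParser(HTMLParser):
    """Collect the text inside every <div class="notice"> (assumed page layout)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth of <div> inside the current notice
        self.notices = []

    def handle_starttag(self, tag, attrs):
        if self.depth and tag == "div":
            self.depth += 1
        elif not self.depth and tag == "div" and ("class", "notice") in attrs:
            self.depth = 1
            self.notices.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.notices[-1] += data

def clean(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def extract_notices(html):
    """Return the cleaned, non-empty text of each notice element."""
    parser = NoticeParser()
    parser.feed(html)
    return [clean(t) for t in parser.notices if clean(t)]

sample = (
    '<html><body><div class="nav">Home</div>'
    '<div class="notice">\n  Land compensation:   <b>1000</b> yuan per mu\n</div>'
    '<div class="notice"> Consultation phone </div></body></html>'
)
notices = extract_notices(sample)
```

`fetch` is defined but not called here, so the parsing and cleaning steps can be tried on the inline `sample` string without network access.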
Further, the keyword labeling module processes the words in each piece of village information with the TF-IDF algorithm to obtain the TF-IDF value of each word in each piece of village information.
Further, the keyword decision index module takes the historical corpus as the input of a Word2Vec model and outputs a fixed-length embedded vector for each word.
Further, the keyword decision index module obtains the importance of the different keywords in each piece of village information by using the context information of the keywords, and constructs a data vector for each word from the obtained TF-IDF value and embedded vector; the data vector of word c consists of the TF-IDF value of word c and the embedded vector of word c.
An information decision index F is constructed to characterize the ability of each word to determine the content of the village information in which it appears, and the information decision index of word c in village information i is calculated.
The calculation involves the following quantities: the information importance of word c; the numbers of first-class and second-class words in village information i, where a denotes the a-th word of the first class and b denotes the b-th word of the second class; the normalized Google distances from word c to word a and from word c to word b; the information decision index of word c; the number m of words in village information i; the data vectors of word c and word j and the cosine similarity between them; the TF-IDF values of word c and word j; and a parameter adjusting factor.
Further, the keyword decision index module acquires a segmentation threshold value by using an Otsu algorithm, classifies words with TF-IDF values larger than the segmentation threshold value into a first class, and classifies words with TF-IDF values smaller than the segmentation threshold value into a second class.
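A minimal Python sketch of this thresholding step follows. It applies Otsu's criterion (maximizing the between-class variance) directly to the sorted TF-IDF values rather than to a binned histogram. The function names and the TF-IDF values are hypothetical, and words exactly equal to the threshold are placed in the second class, a case the text leaves unspecified.

```python
def otsu_threshold(values):
    """Return the cut point that maximises the between-class variance."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    best_threshold, best_score = ordered[0], -1.0
    left_sum = 0.0
    for k in range(1, n):
        left_sum += ordered[k - 1]
        if ordered[k] == ordered[k - 1]:
            continue  # equal values cannot be separated by a threshold
        w0, w1 = k / n, (n - k) / n                      # class weights
        mu0, mu1 = left_sum / k, (total - left_sum) / (n - k)  # class means
        score = w0 * w1 * (mu0 - mu1) ** 2               # between-class variance
        if score > best_score:
            best_score = score
            best_threshold = (ordered[k - 1] + ordered[k]) / 2
    return best_threshold

def split_words(tfidf):
    """Split {word: tf-idf} into (threshold, first class, second class)."""
    threshold = otsu_threshold(list(tfidf.values()))
    first = [w for w, v in tfidf.items() if v > threshold]
    second = [w for w, v in tfidf.items() if v <= threshold]  # assumption for ties
    return threshold, first, second

# Hypothetical TF-IDF values for one piece of village information.
tfidf = {"compensation": 0.30, "area": 0.27, "land": 0.25, "phone": 0.03, "notice": 0.02}
threshold, first_class, second_class = split_words(tfidf)
```

With these values the cut falls between 0.03 and 0.25, so the three frequent content words form the first class and the two incidental words form the second class.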
Further, the public information vector processing module acquires the information decision index of each keyword and divides the words in each piece of village information into four classes according to the information decision index. For village information i, the information decision indexes F of the m words are clustered with the k-means algorithm, and the resulting clusters are labeled, in descending order of the mean information decision index of the words in the cluster, as marker words, core words, common words and sparse words.
Further, the public information vector processing module takes the cluster center of each cluster as the first element of the corresponding public information vector and orders the subsequent elements by ascending distance to the cluster center; if two elements are at the same distance from the cluster center, the word with the larger information decision index is placed first. The public information vectors corresponding to the marker words, core words, common words and sparse words in village information i are recorded separately, and the length of each vector is the number of words in the corresponding cluster.
Further, for any two pieces of village information, the phrase induction module calculates the similarity between the corresponding public information vectors, judges from that similarity whether the two pieces of village information should be assigned to the same class, and adds the information decision indexes of the village information belonging to the same class to obtain the information decision index value of the phrase formed by the words.
Further, the phrase weight module extracts at least two words from each of the public information vectors of the marker words, core words, common words and sparse words of each piece of village information to form phrases, records the phrases extracted from these four public information vectors as marker phrases, core phrases, common phrases and sparse phrases respectively, and sets different decision weights for different phrases.
The phrase decision weight is calculated from the following quantities: the decision ratio of the words in village information i; the number of elements in the public information vector corresponding to those words in village information i; the number m of words in village information i; the information decision index of word c; X, which is one of marker word, core word, common word and sparse word; the length of the phrase, which equals the number of words in the phrase; and the number of pieces of village information, among the N pieces, whose words contain the phrase.
Further, the decision preference ratio module constructs a decision preference ratio that characterizes the degree of influence of the different classes of words on the classification of village information.
The decision preference ratio is calculated from the following quantities: the division index of a phrase; the decision weight of the phrase; the phrase decision value of the phrase in village information i; the distribution variance of the phrase decision values of the village data containing the phrase; the phrase decision value of the x-th phrase of a given length among the X words of village information i; the distribution variance of the phrase decision values of the village data containing the x-th phrase across all X-word public information vectors; the decision weight of the x-th phrase; the number of word classes, which takes the empirical value 4; and the length of the public information vector of the words. A phrase contains at least two words, up to a stated maximum.
The beneficial effects of the invention are as follows. An intelligent management system for public village data is provided. The information decision index of each word is constructed from the word's embedded vector and TF-IDF value in the village information. The information decision index takes into account both the criticality of words of various frequencies in the village information and the similarity between their context information, so that words occurring with the same frequency in public village information do not reduce the accuracy of classifying the village information as a whole, which benefits the subsequent measurement of similarity between pieces of village information. Second, a decision preference ratio is constructed from the public information vectors. The decision preference ratio takes into account the ability of phrases of different lengths, in the different word classes, to partition the village data, so that words occurring in only a small amount of special public village information do not affect the classification precision of all village data. The attribute classification sequence is obtained from the decision preference ratio and supplies the attribute selection sequence for the subsequent decision tree classification, improving the classification precision and management efficiency of the village data.
Drawings
Fig. 1 is a schematic block diagram of the intelligent management system for public village data according to the present invention.
Detailed Description
The present invention will be described in detail with reference to examples.
Example:
An intelligent management system for public village data comprises a village data disclosure platform database, an entry grabbing module, a data processing module and an information management module, as shown in Fig. 1.
The entry grabbing module uses crawler technology to acquire all historical public information from the village data disclosure platform database, counts the number of pieces of village information, and forms a historical corpus.
The crawler used in the entry grabbing module performs the following steps:
Step 1, determining the target website: determine the target website from which information is to be captured, and capture webpage data in accordance with its permissions and license agreement;
Step 2, selecting a crawler tool;
Step 3, analyzing the webpage structure: carefully analyze the webpage structure of the target website, including HTML tags, CSS selectors and XPath, to determine where the information is located and how to acquire it;
Step 4, writing the crawler code: initiating an HTTP request: use the crawler tool to send an HTTP request and obtain the HTML content of the target webpage; parsing the HTML: parse the HTML content with an HTML parsing library or an XPath parser; locating the target information: according to the analyzed webpage structure, locate the HTML elements containing the target information with CSS selectors and XPath; extracting information: extract the required information from the located HTML elements; storing information: store the extracted information in an appropriate data structure;
Step 5, setting crawler parameters: set the crawler's request headers, proxies and request frequency as required, to ensure the compliance and efficiency of the crawling process;
Step 6, handling anti-crawling mechanisms: proxy IPs, request header disguise and delayed requests are used to avoid anti-crawling measures;
Step 7, data cleaning and processing: the acquired raw data contains noise and useless information, so regular expressions and string processing functions are used to clean and process the data and extract accurate, useful information.
All historical public information is acquired from the village data disclosure platform database using crawler technology, and the number of pieces of village information, N, is counted. Crawling is a well-known technique, so the specific process is not described further. Next, the number of words in each piece of village information is counted, and the data length is taken from the word count; for example, if a piece of land compensation information contains 1000 words, its data length is 1000. The corpus composed of the words in the N pieces of village information is recorded as the historical corpus.
At this point, all public village information and the historical corpus have been obtained.
The data processing module further comprises a keyword labeling module, a keyword decision index module, a public information vector processing module, a phrase induction module, a phrase weight module and a decision preference ratio module.
The keyword labeling module processes the words in each piece of village information with the TF-IDF algorithm to obtain the TF-IDF value of each word in each piece of village information.
Keywords occur with different frequencies in different types of public village information. For example, in land acquisition compensation information, words such as the acquired land area and the compensation amount per mu occur frequently, while words such as the consultation telephone number occur only once or a few times at the end of the information; in village assistance information, words such as poverty and compensation occur frequently. The invention therefore processes the words in each piece of village information with the TF-IDF algorithm to obtain the TF-IDF value of each word; the larger a word's TF-IDF value, the more likely the word is a keyword of the corresponding village information.
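As an illustration of the TF-IDF step, the following Python sketch uses one common variant: term frequency as the count divided by the document length, and inverse document frequency as the natural logarithm of N over the document frequency. The patent does not state which variant or smoothing it uses, so this choice is an assumption. The documents are assumed to be already segmented into word lists, and the example words are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {word: tf-idf} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical, already segmented pieces of village information.
documents = [
    ["land", "compensation", "area", "compensation"],
    ["poverty", "compensation", "relief"],
    ["phone", "land", "notice"],
]
scores = tf_idf(documents)
```

In the first document, `area` occurs only there and so scores higher than `land`, which also appears in the third document. A word that appears in every document receives a score of 0 under this variant.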
The keyword decision index module takes the historical corpus as input of a Word2Vec model and outputs embedded vectors with fixed lengths corresponding to each Word.
In public village information, words with the same frequency occur in village information belonging to different sections. To classify a relatively large amount of village information accurately, the invention therefore uses the context information of the keywords in each piece of village information to obtain the importance of the different keywords. For example, in agricultural assistance information and village assistance information, related words such as assistance, compensation amount and compensation mode all occur frequently, so the TF-IDF values of these words may well be equal even though the words belong to different types of village information. The invention therefore considers the context information of each word when classifying the information: the historical corpus is used as the input of a Word2Vec model, and the Word2Vec model outputs a fixed-length embedded vector for each word.
The keyword decision index module obtains the importance of the different keywords in each piece of village information by using the context information of the keywords, and constructs a data vector for each word from the obtained TF-IDF value and embedded vector; the data vector of word c consists of the TF-IDF value of word c and the embedded vector of word c.
Further, a segmentation threshold is obtained with the Otsu algorithm; words with TF-IDF values larger than the segmentation threshold are assigned to the first class, and words with TF-IDF values smaller than the segmentation threshold are assigned to the second class.
Based on the above analysis, an information decision index F is constructed to characterize the ability of each word to determine the content of the village information in which it appears, and the information decision index of word c in village information i is calculated.
The calculation involves the information importance of word c and the numbers of first-class and second-class words in village information i, where a denotes the a-th word of the first class and b denotes the b-th word of the second class. The words are classified as follows: the TF-IDF values of all words in village information i are obtained, a segmentation threshold is obtained with the Otsu algorithm, words with TF-IDF values larger than the threshold are assigned to the first class, and words with TF-IDF values smaller than the threshold are assigned to the second class. The calculation also uses the normalized Google distances from word c to word a and from word c to word b; the normalized Google distance is a known technique, and its specific process is not described further.
The remaining quantities are the information decision index of word c; the number m of words in village information i; the data vectors of word c and word j and the cosine similarity between them; the TF-IDF values of word c and word j; and a parameter adjusting factor, whose function is to prevent the denominator from being 0 and whose value is 0.001.
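Two of the building blocks named above, the normalized Google distance and the cosine similarity of data vectors, can be sketched in Python as follows. The sketch uses the standard definition of the normalized Google distance with document counts taken from a small corpus of token lists. Counting over the historical corpus, returning infinity when two words never co-occur, and forming the data vector by prepending the TF-IDF value to the embedded vector are all assumptions made for illustration. The sketch does not reproduce the information decision index itself.

```python
import math

def normalized_google_distance(word_x, word_y, documents):
    """NGD from document counts: f(x), f(y), f(x, y) over N documents."""
    doc_sets = [set(tokens) for tokens in documents]
    n = len(doc_sets)
    fx = sum(word_x in d for d in doc_sets)
    fy = sum(word_y in d for d in doc_sets)
    fxy = sum(word_x in d and word_y in d for d in doc_sets)
    if fxy == 0:
        return math.inf  # the words never co-occur (assumed convention)
    log_fx, log_fy = math.log(fx), math.log(fy)
    denominator = math.log(n) - min(log_fx, log_fy)
    if denominator == 0:
        return 0.0  # both words occur in every document
    return (max(log_fx, log_fy) - math.log(fxy)) / denominator

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def data_vector(tfidf_value, embedding):
    """Assumed layout: the TF-IDF value followed by the embedded vector."""
    return [tfidf_value] + list(embedding)

# Hypothetical, already segmented historical corpus.
corpus = [
    ["land", "compensation", "area"],
    ["land", "compensation"],
    ["poverty", "relief", "compensation"],
    ["phone", "notice"],
]
ngd_close = normalized_google_distance("land", "compensation", corpus)
ngd_far = normalized_google_distance("land", "phone", corpus)
```

Words that always appear together have a distance of 0, and the distance grows as co-occurrence becomes rarer, which matches the way the text uses the distance to judge whether word c tends to appear with first-class or second-class words.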
The information decision index reflects the ability of each word in the village information to determine the content of the village information in which it appears. The more often word c co-occurs with word a of the first class and the less often it co-occurs with word b of the second class, the smaller the normalized Google distance from word c to word a and the larger the normalized Google distance from word c to word b, and hence the larger the information importance of word c: the more often word c appears together with words that have large TF-IDF values, the more likely word c is a keyword of village information i. The larger the difference between the TF-IDF values of word c and word j and the larger the difference between their embedded vectors, the smaller the similarity of their data vectors; and the larger the difference between their TF-IDF values, the greater the difference in criticality between word c and word j in village information i. The larger the information decision index of word c, the more similar the semantic information of word c is to that of the keywords, that is, to the semantic information of the village information. The information decision index takes into account both the criticality of words of various frequencies in the village information and the similarity between their context information; it prevents words occurring with the same frequency in public village information from reducing the accuracy of classifying the village information as a whole, and it benefits the measurement of similarity between pieces of village information.
The public information vector processing module obtains the information decision index of each word and divides the words in each piece of village information into four types accordingly. For village information i, it clusters the information decision indexes F of the m words with the k-means algorithm and, in descending order of the mean information decision index of the words in each cluster, labels the clusters as marker words, core words, common words, and sparse words.
The public information vector processing module takes the cluster center of each cluster as the first element of the corresponding public information vector and orders the remaining elements by increasing distance from the cluster center. If two elements are equally distant from the cluster center, the word with the larger information decision index comes first. This yields four public information vectors for village information i, one each for the marker words, core words, common words, and sparse words; the length of each vector is the number of words in the corresponding cluster.
The public information vector corresponding to each piece of village information is obtained from the information decision index of each word. The information decision index of each word is obtained, and the words in each piece of village information are divided into four types accordingly. For village information i, the information decision indexes F of the m words are clustered with the k-means algorithm, with k set to 4, to obtain the classification result of the words in village information i; in descending order of the mean information decision index of the words in each cluster, the clusters are labeled as marker words, core words, common words, and sparse words. The cluster center of each cluster is taken as the first element of the corresponding public information vector, and the remaining elements are ordered by increasing distance from the cluster center. If two elements are equally distant from the cluster center, the word with the larger information decision index comes first. For example, if word c and word a are equally distant from the cluster center but the information decision index of word c is greater than that of word a, word c is placed before word a. The public information vectors corresponding to the marker words, core words, common words, and sparse words in village information i are recorded separately; the length of each vector is the number of words in the corresponding cluster.
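The clustering and ordering steps can be sketched in plain Python. This is a hedged illustration under stated assumptions: the one-dimensional k-means uses a simple rank-based initialization (the source does not say how centers are initialized), at least one word is present, and the "cluster center" placed first in each vector is taken to be the word closest to the numeric center, since each vector's length equals the number of words in its cluster. All names are hypothetical.

```python
LABELS = ["marker", "core", "common", "sparse"]


def kmeans_1d(values, k=4, iterations=100):
    # Lloyd's algorithm on scalars; initial centers are picked at evenly spaced ranks.
    ordered = sorted(values)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def public_information_vectors(index_by_word, k=4):
    # index_by_word maps each word to its information decision index F.
    centers = kmeans_1d(list(index_by_word.values()), k)
    clusters = {i: [] for i in range(k)}
    for word, f in index_by_word.items():
        nearest = min(range(k), key=lambda i: abs(f - centers[i]))
        clusters[nearest].append(word)

    def mean_index(i):
        words = clusters[i]
        if not words:
            return float("-inf")
        return sum(index_by_word[w] for w in words) / len(words)

    # Clusters with a larger mean F get the earlier label (marker first).
    ranked = sorted(range(k), key=mean_index, reverse=True)
    vectors = {}
    for label, i in zip(LABELS, ranked):
        # Closest to the center first; ties go to the larger F.
        vectors[label] = sorted(
            clusters[i],
            key=lambda w: (abs(index_by_word[w] - centers[i]), -index_by_word[w]),
        )
    return vectors
```

For example, with two words whose indexes are 8.0 and 7.0 in the same cluster, both lie 0.5 from the center 7.5, so the tie-break places the word with index 8.0 first.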
Thus, the public information vector corresponding to each piece of village information is obtained.
After the public information vector corresponding to each piece of village information is obtained, the similarity between the public information vectors of any two pieces of village information is calculated and used to judge whether the two pieces should be assigned to the same class. Pieces of village information that belong to the same category necessarily share a certain number of semantically similar words. The information decision index value of a phrase is obtained by adding the information decision indexes of its words and is recorded as the phrase decision value D; the larger the phrase decision value of a phrase in a piece of village information, the more strongly that village information is related to the phrase.
Two or more words are randomly extracted from each public information vector of each piece of village information, such as the marker word public information vector and the core word public information vector, to form phrases. A phrase extracted from the marker word public information vector is recorded as a marker phrase, and a phrase extracted from the core word public information vector is recorded as a core phrase. Because marker words and core words differ in how semantically similar they are to the village information, the invention sets different decision weights for them. The decision weight of a marker phrase is calculated as follows:
In the formula, the quantities are as follows: the decision ratio of the marker words in village information i; the number of elements in the public information vector corresponding to the marker words in village information i; m, the number of words in village information i; the information decision index of word c; the length of the marker phrase, which equals the number of words in the marker phrase; and, among the N pieces of village information, the number of pieces whose marker words contain the marker phrase. The larger the corresponding value, the larger the decision weight of the marker phrase.
Based on the above analysis, a decision preference ratio is constructed to represent the degree of influence of different words on the classification of village information. The decision preference ratio of the marker words is calculated as follows:
In the formula, the quantities are as follows: the division index of the marker phrase; the decision weight of the marker phrase; the phrase decision value of the marker phrase in village information i; the distribution variance of the phrase decision values over all village data whose marker word public vectors contain the marker phrase; the phrase decision value of the phrase of the given length formed from the k-th class of words in village information i; the distribution variance of the phrase decision values over all village data whose public vectors contain the k-th class phrase; the decision weight of the k-th class phrase; and the number of word categories, which takes the empirical value 4. The larger the division index, the stronger the data classification capability of the marker phrase.
The phrase length is bounded by the length of the marker word public vector: a marker phrase contains at least two words and at most as many words as that vector has elements.
The decision preference ratio reflects the degree of influence of different classes of words on the classification of the village information. A marker phrase occurs with different frequencies in different pieces of village information; the larger the differences between its phrase decision values across the village information, the larger the distribution variance of its phrase decision values. Conversely, the more uniform the frequency of the k-th class phrase across different pieces of village information, the smaller the differences between its phrase decision values and the smaller the distribution variance of those values. In that case the division index of the marker phrase is larger, its data classification capability is stronger, and the village data should rely more heavily on the marker phrase for classification. If marker phrases of various lengths all classify village data strongly, the decision preference ratio is larger and the marker words segment the village data more effectively. The decision preference ratio thus considers how well phrases of different lengths, drawn from the different word classes, segment village data, which prevents words that appear in only a few special pieces of public village information from degrading the classification accuracy over all village data.
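Two of the quantities used above, the phrase decision value D and its distribution variance across village information, can be sketched as follows. The sketch does not reproduce the decision preference ratio itself. It assumes that each piece of village information is represented as a hypothetical mapping from word to information decision index, and that the population variance is meant; the source does not state which variance is used.

```python
from statistics import pvariance


def phrase_decision_value(phrase, index_by_word):
    # D: sum of the information decision indexes of the words in the phrase.
    return sum(index_by_word[word] for word in phrase)


def decision_value_variance(phrase, documents):
    # Variance of D over the documents that contain every word of the phrase.
    # Assumption: population variance; returns 0.0 if no document qualifies.
    values = [
        phrase_decision_value(phrase, doc)
        for doc in documents
        if all(word in doc for word in phrase)
    ]
    if not values:
        return 0.0
    return pvariance(values)
```

A phrase whose decision value varies widely across documents yields a large variance, which the text above associates with stronger discrimination between pieces of village information.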
The decision preference ratios of the four types of words, namely marker words, core words, common words, and sparse words, are calculated separately and sorted in descending order; the sorted result is recorded as the attribute classification sequence.
The attribute classification sequence obtained by the above steps is used as the attribute selection sequence of the decision tree, and the acquired village public data is classified with a decision tree classification algorithm.
Sections are set according to the village information content in the classification result of the village data. For example, if 7 classes are obtained, 7 sections are set in the management system and displayed on the front page of the village data disclosure platform. Further, for each section, the public information vector of each piece of village information is obtained, the section is divided into a certain number of sub-sections based on the similarity between the public information vectors, and each classification result is placed into the corresponding sub-section. After entering the village data disclosure platform, each villager can go to the section and sub-section that match their needs and the content they want to learn about. After new information is uploaded to the village data disclosure platform, its public information vector is obtained, its similarity to the public information vectors of the existing village information in each sub-section is calculated, and the new information is assigned to the sub-section with the greatest similarity.
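The assignment of new information to a sub-section can be sketched as below. The source does not specify the similarity measure or how similarities to several existing items are combined, so this sketch assumes cosine similarity over equal-length numeric vectors and the mean similarity per sub-section. The names `assign_subsection` and `subsections` are hypothetical.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_subsection(new_vector, subsections):
    # subsections maps a sub-section name to the vectors of its existing items.
    def mean_similarity(name):
        vectors = subsections[name]
        return sum(cosine(new_vector, v) for v in vectors) / len(vectors)

    candidates = [name for name, vectors in subsections.items() if vectors]
    return max(candidates, key=mean_similarity)
```

Empty sub-sections are skipped because they offer nothing to compare against; a real system would need a separate rule for them.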
The decision tree classification algorithm comprises the following steps:
step 1, collecting data: collecting and sorting village public data, and ensuring that each data sample contains a classification label and an attribute value;
step 2, preparing data: preprocessing data, including missing value processing, data standardization and discretization;
step 3, feature selection: selecting the attribute most suitable for classification as an attribute selection sequence of the decision tree according to the attribute classification sequence obtained by the data processing module;
step 4, constructing a decision tree: constructing a decision tree by using the selected attribute sequence, and calculating the entropy of the data set as an initial uncertainty measure; for each attribute, calculating the information gain, the information gain ratio, or the Gini index reduction of the data set, and selecting the attribute with the largest information gain, gain ratio, or Gini index reduction as the splitting attribute of the current node; creating a branch node from the splitting attribute, and creating child nodes according to the values of the attribute; for each child node, repeating the above steps recursively until the samples in the node all belong to the same class or no attributes remain for splitting.
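The split criteria named in step 4 are standard and can be sketched as follows. This is an illustrative sketch, not the patented implementation: rows are hypothetical dictionaries of discrete attribute values, and only the selection of one splitting attribute is shown rather than the full recursive tree construction.

```python
import math
from collections import Counter


def entropy(labels):
    # Shannon entropy (base 2) of a list of class labels.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gini(labels):
    # Gini index of a list of class labels.
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    # Entropy of the whole set minus the weighted entropy after splitting on attribute.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    # max() keeps the first maximum, so ties follow the order of `attributes`.
    return max(attributes, key=lambda a: information_gain(rows, labels, a))
```

Passing `attributes` in the order of the attribute classification sequence makes that sequence the tie-breaker, which is one possible way to combine it with the gain criterion.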
The process implemented by the whole system can be summarized as follows:
Step one: historical public information is obtained from the village data public platform database, and a historical corpus is built from the historical village data.
Step two: the information decision index of each word is constructed from the embedded vector and the TF-IDF value of the word in the village information; the classification result of the words in each piece of village information is obtained based on the information decision index; the decision preference ratio is constructed based on the public information vectors; and the attribute classification sequence is obtained based on the decision preference ratio.
Step three: the attribute classification sequence is used as the attribute selection sequence of the decision tree, the village data classification result is obtained with the decision tree classification algorithm, and the village public data is managed effectively according to that classification result.
The foregoing examples are intended to provide a better understanding of the present invention to those skilled in the art and are not intended to limit the present invention.

Claims (10)

1. An intelligent management system for village public data, comprising a village data public platform database, an entry grabbing module, a data processing module, and an information management module, characterized in that:
the entry grabbing module acquires all historical public information from a village data public platform database by utilizing a crawler technology, counts the quantity of all village information, and forms a historical corpus;
the data processing module further comprises a keyword labeling module, a keyword decision index module, a public information vector processing module, a phrase induction module, a phrase weight module and a decision preference ratio module;
the information management module takes the attribute classification sequence obtained by the data processing module as the attribute selection sequence for decision tree classification and classifies the obtained village public data with a decision tree classification algorithm; sets sections according to the village information content in the classification result of the village data; for each section, obtains the public information vector of each piece of village information, divides the section into a certain number of sub-sections based on the similarity between the public information vectors, and puts each classification result into the corresponding sub-section; and, after new information is uploaded to the village data disclosure platform, obtains the public information vector of the new information, calculates its similarity to the public information vectors of the existing village information in each sub-section, and assigns the new information to the sub-section with the greatest similarity;
the decision tree classification algorithm comprises the following steps:
step 1, collecting data: collecting and sorting village public data, and ensuring that each data sample contains a classification label and an attribute value;
step 2, preparing data: preprocessing data, including missing value processing, data standardization and discretization;
step 3, feature selection: selecting the attribute most suitable for classification as an attribute selection sequence of the decision tree according to the attribute classification sequence obtained by the data processing module;
step 4, constructing a decision tree: constructing a decision tree by using the selected attribute sequence, and calculating the entropy of the data set as an initial uncertainty measure; for each attribute, calculating the information gain, the information gain ratio, or the Gini index reduction of the data set, and selecting the attribute with the largest information gain, gain ratio, or Gini index reduction as the splitting attribute of the current node; creating a branch node from the splitting attribute, and creating child nodes according to the values of the attribute; for each child node, repeating the above steps recursively until the samples in the node all belong to the same class or no attributes remain for splitting.
2. The intelligent village public data management system according to claim 1, wherein: the crawler adopted in the entry grabbing module performs the following steps:
step 1, determining a target website: determining a target website to be information-captured, and capturing webpage data according to the authority and the permission protocol;
step 2, selecting a crawler tool;
step 3, analyzing the webpage structure: carefully analyzing the webpage structure of the target website, including HTML tags, CSS selectors and XPath, so as to determine the position of the information and how to acquire it;
step 4, writing crawler codes: initiating an HTTP request: using a crawler tool to send an HTTP request and acquiring the HTML content of a target webpage; parsing HTML: analyzing the HTML content by using an HTML analysis library or an XPath analyzer, and extracting target information; positioning target information: according to the analyzed webpage structure, positioning HTML elements where the target information is located by using a CSS selector and XPath; extracting information: extracting required information from the positioned HTML elements; storing information: storing the extracted information in an appropriate data structure;
step 5, setting crawler parameters: setting the request headers, proxies and request frequency of the crawler as required, so as to ensure the compliance and efficiency of the crawling process;
step 6, handling anti-crawling mechanisms: using proxy IPs, request header disguising and delayed requests to avoid anti-crawling measures;
step 7, data cleaning and processing: the acquired original data contains noise and useless information, and the regular expression and the character string processing function are used for cleaning and processing the data so as to extract accurate and useful information.
3. The intelligent village public data management system according to claim 1, wherein:
and the keyword labeling module processes words in each piece of village information by utilizing a TF-IDF algorithm to acquire a TF-IDF value corresponding to each word in each piece of village information.
4. The intelligent village public data management system according to claim 1, wherein:
the keyword decision index module takes the historical corpus as input of a Word2Vec model and outputs an embedded vector with fixed length corresponding to each Word.
5. The intelligent village public data management system according to claim 4, wherein:
the keyword decision index module acquires the importance degree of different keywords in each piece of village information by utilizing the context information of the keywords in each piece of village information, and constructs a data vector by utilizing the acquired TF-IDF value and the embedded vector, and records as follows:
where the two components are, respectively, the TF-IDF value and the embedded vector of word c;
constructing an information decision index F for characterizing the ability of each word to determine the content of the village information in which the word is located, the information decision index of word c in village information i being calculated as follows:
in the formula, the quantities are as follows: the information importance of word c; the numbers of words of the first class and of the second class in village information i; a, the a-th word of the first class; b, the b-th word of the second class; the normalized Google distances between word c and word a and between word c and word b; the information decision index of word c; m, the number of words in village information i; the data vectors of word c and word j; the cosine similarity between these two data vectors; the TF-IDF values of word c and word j; and the parameter adjusting factor.
6. The intelligent village public data management system according to claim 5, wherein: the keyword decision index module acquires a segmentation threshold value by using an Otsu algorithm, classifies words with TF-IDF values larger than the segmentation threshold value into a first class, and classifies words with TF-IDF values smaller than the segmentation threshold value into a second class.
7. The intelligent village public data management system according to claim 1, wherein: the public information vector processing module obtains the information decision index of each word and divides the words in each piece of village information into four types accordingly; for village information i, it clusters the information decision indexes F of the m words with the k-means algorithm and, in descending order of the mean information decision index of the words in each cluster, labels the clusters as marker words, core words, common words, and sparse words.
8. The intelligent village public data management system according to claim 7, wherein:
the public information vector processing module takes the cluster center of each cluster as the first element of the corresponding public information vector and orders the remaining elements by increasing distance from the cluster center; if two elements are equally distant from the cluster center, the word with the larger information decision index comes first; the public information vectors corresponding to the marker words, core words, common words, and sparse words in village information i are recorded as four separate vectors, each carrying an index that identifies its word type (marker, core, common, or sparse), and the length of each vector is the number of words in the corresponding cluster.
9. The intelligent village public data management system according to claim 1, wherein:
the phrase weight module extracts at least two words from each of the public information vectors of the marker words, core words, common words, and sparse words corresponding to each piece of village information to form phrases, which are recorded as marker phrases, core phrases, common phrases, and sparse phrases respectively, and sets different decision weights for different phrases; the decision weight of a phrase is calculated as follows:
in the formula, the quantities are as follows: the decision ratio of the words of type X in village information i; the number of elements in the public information vector corresponding to the words of type X in village information i; m, the number of words in village information i; the information decision index of word c; X, one of marker word, core word, common word, and sparse word; the length of the phrase, which equals the number of words in the phrase; and, among the N pieces of village information, the number of pieces whose words contain the phrase.
10. The intelligent village public data management system according to claim 1, wherein:
the decision preference ratio module constructs a decision preference ratio for characterizing the degree of influence of different kinds of words on the classification of village information, the decision preference ratio being calculated as follows:
in the formula, the quantities are as follows: the division index of the phrase; the decision weight of the phrase; the phrase decision value of the phrase in village information i; the distribution variance of the phrase decision values over all village data whose public vectors contain the phrase; the phrase decision value of the phrase of the given length formed from the x-th class of words in village information i; the distribution variance of the phrase decision values over all village data whose X-word public vectors contain the x-th class phrase; the decision weight of the x-th class phrase; the number of word categories, which takes the empirical value 4; and the length of the public vector of the words, meaning that a phrase contains at least two words and at most as many words as that vector has elements.
CN202310685694.6A 2023-06-12 2023-06-12 Intelligent management system for public village data Active CN116431815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310685694.6A CN116431815B (en) 2023-06-12 2023-06-12 Intelligent management system for public village data

Publications (2)

Publication Number Publication Date
CN116431815A CN116431815A (en) 2023-07-14
CN116431815B true CN116431815B (en) 2023-08-22

Family

ID=87092938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310685694.6A Active CN116431815B (en) 2023-06-12 2023-06-12 Intelligent management system for public village data

Country Status (1)

Country Link
CN (1) CN116431815B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744954A (en) * 2014-01-06 2014-04-23 同济大学 Word relevancy network model establishing method and establishing device thereof
CN106096748A (en) * 2016-04-28 2016-11-09 武汉宝钢华中贸易有限公司 Entrucking forecast model in man-hour based on cluster analysis and decision Tree algorithms
CN106909626A (en) * 2017-01-22 2017-06-30 四川用联信息技术有限公司 Improved Decision Tree Algorithm realizes search engine optimization technology
CN110796331A (en) * 2019-09-11 2020-02-14 国网浙江省电力有限公司杭州供电公司 Power business collaborative classification method and system based on C4.5 decision tree algorithm
CN114764463A (en) * 2021-01-13 2022-07-19 上海交通大学 Internet public opinion event automatic early warning system based on event propagation characteristics
KR20220126468A (en) * 2021-03-09 2022-09-16 한국원자력 통제기술원 System for collecting and managing data of denial list and method thereof
CN115732078A (en) * 2022-11-17 2023-03-03 吾征智能技术(北京)有限公司 Pain disease distinguishing and classifying method and device based on multivariate decision tree model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9317594B2 (en) * 2012-12-27 2016-04-19 Sas Institute Inc. Social community identification for automatic document classification
CN105022754B (en) * 2014-04-29 2020-05-12 腾讯科技(深圳)有限公司 Object classification method and device based on social network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xinjuan Zhu ; .SEO Keyword Analysis and Its Application in Website Editing System.2012 8th International Conference on Wireless Communications, Networking and Mobile Computing.2013,全文. *

Also Published As

Publication number Publication date
CN116431815A (en) 2023-07-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant