CN109710936A - A cross-level government document bulletin topic analysis method - Google Patents
A cross-level government document bulletin topic analysis method
- Publication number
- CN109710936A (application number CN201811613793.9A)
- Authority
- CN
- China
- Prior art keywords
- theme
- document
- subject
- government
- level
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Landscapes
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a cross-level government document bulletin topic analysis method comprising the following steps: (1) performing text preprocessing on published government document bulletin data; (2) constructing a space-line-based dynamic hierarchical probabilistic topic model from the cross-level government document bulletin data; (3) constructing a timeline-based dynamic time-series probabilistic topic model from the government document bulletin data of a single level; (4) sampling the hyperparameters and latent variables of the models, the sampled quantities including the topic distributions, the word distributions, and the topic assigned to each word; (5) performing topic evolution analysis according to the distributions estimated by the models. The invention analyzes, along the space line, how the topics of government document bulletins evolve across administrative levels and, along the timeline, how the topics of bulletins at the same level evolve over time. It combines the two lines to jointly discover and analyze the matters revealed by the topic evolution of government document bulletins, providing assistance and support for efficient government supervision and decision-making.
Description
Technical field
The present invention relates to a cross-level government document bulletin topic analysis method.
Background art
With the arrival of the big data era, government departments have produced a large volume of government document bulletin data. Government departments increasingly value the use of big data technology and text information processing to mine and analyze the associations among government data. Cross-level government document bulletin data, spanning the national, provincial, and municipal levels, conceals deep-rooted, complex relationships and latent associations.
At present, there is still a lack of methods that analyze government document bulletin data and assist government units in decision-making. Probabilistic topic modeling, which is both a big data technique and an effective means of text information processing, can discover and mine the latent topics in the document bulletins of governments at all levels, together with their evolution, along both the space line and the timeline. To some extent it can meet the government's need for scientific, fine-grained regulation.
Summary of the invention
To solve the above technical problem, the present invention provides a cross-level government document bulletin topic analysis method. The method analyzes, along the space line, how the topics of government document bulletins evolve across levels and, along the timeline, how the topics of bulletins at the same level evolve over time. It combines the two lines to jointly discover and analyze the matters revealed by the topic evolution of government document bulletins, providing assistance and support for efficient government supervision and decision-making.
The present invention is achieved by the following technical solution.
The cross-level government document bulletin topic analysis method provided by the invention comprises the following steps:
(1) performing text preprocessing on published government document bulletin data;
(2) constructing a space-line-based dynamic hierarchical probabilistic topic model from the cross-level government document bulletin data;
(3) constructing a timeline-based dynamic time-series probabilistic topic model from the government document bulletin data of a single level;
(4) sampling the hyperparameters and latent variables of the models, the sampled quantities including the topic distributions, the word distributions, and the topic assigned to each word;
(5) performing topic evolution analysis according to the distributions estimated by the models.
The government document bulletin data comprises national-level, provincial-level, municipal-level, and district/county-level document bulletins.
In step (4), the hyperparameters and latent variables of the models are sampled using the Blocked Gibbs method.
Step (4) specifically comprises the following steps:
(4.1) initializing the Markov chain parameters, including the variance parameters {δ², σ², a²} of the normal distributions and the topic set z_{d,i} of all words in each document;
(4.2) sampling the logistic-normal prior parameter α_i of the topic distribution;
(4.3) sampling the topic distribution η_{d,i}, not normalized by softmax, of each document in a single level or time slice;
(4.4) sampling the topic-word distribution φ_{k,i}, not normalized by softmax, over all documents in a single level or time slice;
(4.5) given η_{d,i} and φ_{k,i}, sampling the topic value z_{d,n,i} of each word w_{d,n} in a document.
The topic evolution analysis performed in step (5) according to the distributions estimated by the models comprises:
(5.1) topic evolution analysis based on the space line;
(5.2) topic evolution analysis based on the timeline;
(5.3) joint topic evolution analysis based on the space line and the timeline.
The topic evolution analysis based on the space line means first analyzing, under the same topic, how the four levels (nation, province, city, district/county) carry out the topic from each level to the one below it, the mutual dependence and independence of topics, and the sub-topic information that each topic emphasizes, and then analyzing the proportion of each topic at each level in the document bulletins issued by the government.
The topic evolution analysis based on the timeline means analyzing, under the same topic, how a given level carries out a given topic in different time slices, and analyzing the topic proportions of the government document bulletins of that level in different time slices.
The joint topic evolution analysis based on the space line and the timeline means using the timeline-based dynamic time-series probabilistic topic models of the different levels to study, across time slices, how a topic is carried out from the nation to the province, to the city, and to the district/county, that is, how the space-line evolution changes along the timeline, and thereby analyzing the topic information of government document bulletins.
The beneficial effects of the present invention are as follows: the method analyzes, along the space line, how the topics of government document bulletins evolve across levels and, along the timeline, how the topics of bulletins at the same level evolve over time, and it combines the two lines to jointly discover and analyze the matters revealed by the topic evolution of government document bulletins, providing assistance and support for efficient government supervision and decision-making.
Description of the drawings
Fig. 1 is an execution flowchart of one embodiment of the present invention;
Fig. 2 is a diagram of the space-line-based dynamic hierarchical probabilistic topic model of the present invention;
Fig. 3 is a diagram of the timeline-based dynamic time-series probabilistic topic model of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is further described below, but the claimed scope is not limited to what is described.
In general, the cross-level government document bulletin topic analysis method provided by the invention comprises the following steps:
(1) performing text preprocessing on the published national-level, provincial-level, municipal-level, and district/county-level document bulletin data;
(2) using the four-level (nation-province-city-district/county) government document bulletin data to construct the space-line-based dynamic hierarchical probabilistic topic model;
(3) using the government document bulletin data of each single level, already divided into time slices, to construct the timeline-based dynamic time-series probabilistic topic model;
(4) from the space-line-based dynamic hierarchical probabilistic topic model, obtaining the topic-level-word distribution of the national, provincial, municipal, and district/county bulletin data and the document-topic distribution of each level;
(5) from the timeline-based dynamic time-series probabilistic topic model, obtaining, within a single level, the document-topic distribution of each time slice and the topic-time-word distribution;
(6) according to these distributions, discovering how government document bulletin topics evolve across levels, how the words of each level evolve under a specific topic, how the topics of a single level evolve over time, and how the words of each time slice evolve under a specific topic.
The text preprocessing in step (1) includes word segmentation of the government document bulletin data of all levels and removal of stop words, low-frequency words, numbers, symbols, and so on, retaining only word-level text that meets the specification. In addition, the government document bulletin data of each level must be divided according to a specified time-slice size.
As shown in Fig. 2, the space-line-based dynamic hierarchical probabilistic topic model constructed in step (2) performs topic modeling across the four levels nation-province-city-district/county.
As shown in Fig. 3, the timeline-based dynamic time-series probabilistic topic model constructed in step (3) performs topic modeling for each of the national, provincial, municipal, and district/county levels individually.
Embodiment 1
As shown in Fig. 1, the method comprises the following steps:
Step S1 is performed first to obtain the government document bulletin data of each level (nation, province, city, district/county). In this embodiment, document bulletin data for 2013-2018 was crawled from general office A, from the open-government disclosures of a certain province, from those of that province's capital city, and from those of a district under that capital city.
Next, step S2 is performed: text preprocessing is applied to all government document bulletins. Word segmentation is done with the precise mode of the jieba (结巴) segmenter. Then, using the Python development language, stop words, numbers, English characters, symbols, and low-frequency words are removed. After preprocessing, redundant characters other than Chinese words have been eliminated, so that the government document bulletin data becomes concise, tidy, and clear, which saves computing resources and facilitates further computation and analysis.
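The S2 cleaning pass can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's actual code: the function name `preprocess`, the `min_count` threshold, and the choice to pass the tokenizer in as a parameter are all hypothetical. With jieba installed, `jieba.lcut` (whose default is precise mode) could be supplied as `tokenize`.

```python
import re
from collections import Counter

# Matches tokens made only of CJK unified ideographs, so digits,
# Latin letters, and punctuation are filtered out.
CHINESE_ONLY = re.compile(r"[\u4e00-\u9fff]+")

def preprocess(docs, tokenize, stop_words, min_count=2):
    """Tokenize each document, then drop stop words, non-Chinese tokens, and rare words."""
    tokenized = []
    for text in docs:
        tokens = [t for t in tokenize(text)
                  if CHINESE_ONLY.fullmatch(t) and t not in stop_words]
        tokenized.append(tokens)
    # Corpus-wide frequencies decide which words count as low-frequency.
    freq = Counter(t for tokens in tokenized for t in tokens)
    return [[t for t in tokens if freq[t] >= min_count] for tokens in tokenized]
```

Passing the tokenizer as an argument keeps the sketch independent of any particular segmenter, so the filtering logic can be checked on its own.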
After preprocessing of the government document bulletins is complete, step S3(a) is performed: the text is divided by level into four level sub-corpora, completing the level corpus division. Step S3(b) is then performed: the text data of each of the national, provincial, municipal, and district/county levels is divided into time slices of one quarter (3 months), giving 20 time-slice sub-corpora in total.
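The S3 division into level sub-corpora and quarterly time slices can be sketched as below. The record layout `(level, publication_date, tokens)` and the function names are assumptions made for illustration; the text does not specify how the bulletins are stored.

```python
from collections import defaultdict
from datetime import date

def quarter_index(d, start_year):
    """Zero-based index of the quarter containing date d, counted from Q1 of start_year."""
    return (d.year - start_year) * 4 + (d.month - 1) // 3

def split_corpus(records, start_year):
    """Group (level, publication_date, tokens) records into corpus[level][slice] lists."""
    corpus = defaultdict(lambda: defaultdict(list))
    for level, pub_date, tokens in records:
        corpus[level][quarter_index(pub_date, start_year)].append(tokens)
    return corpus
```

Indexing the result by level alone gives the S3(a) sub-corpora, and indexing by level and then slice gives the S3(b) sub-corpora.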
Step S4(a) is performed. For the four level sub-corpora divided in S3(a), the space-line-based dynamic hierarchical probabilistic topic model is established. First, the model assumes that the hierarchy is necessarily ordered as national (Nation), provincial (Province), municipal (Municipality), district/county (District). The order of documents between levels may not be exchanged, but documents within a level are unordered and exchangeable. Second, a latent Dirichlet allocation (LDA) topic model is applied to each level sub-corpus to mine document topic information, yielding the dynamic evolution of topics across levels. Third, any two adjacent levels in the model have a latent hierarchical association and are not independent of each other: the word distribution of a topic in the bulletins of the lower level is influenced by the word distribution of that topic in the bulletins of the level above.
Step S4(b) is performed. For the sub-corpus of each level, divided into quarterly time slices in step S3(b), the timeline-based dynamic time-series probabilistic topic model for single-level government document bulletins is established. First, the model assumes that the order of documents between time slices may not be exchanged, but documents within a time slice are unordered. Second, LDA is applied to each time-slice sub-corpus for topic mining, yielding the dynamic evolution of topics over time. Third, any two adjacent time slices in the model have a latent temporal association and are not independent of each other: the word distribution of a topic in the bulletins of the later time slice is influenced by the distribution of that topic in the bulletins of the preceding time slice. Fourth, the national, provincial, municipal, and district/county levels are each divided into quarterly time slices, and the model is run on each.
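One common way to encode the dependence between adjacent levels or time slices, used in dynamic topic models generally, is a Gaussian random walk on the unnormalized word scores of a topic. The sketch below assumes that form; the function name, the single variance `sigma`, and the four-slice example are hypothetical, and the text does not say which of its variance parameters governs this chain.

```python
import random

def random_walk_chain(n_slices, dim, sigma, rng):
    """Unnormalized parameter vectors where slice i is centred on slice i-1."""
    chain = [[rng.gauss(0.0, sigma) for _ in range(dim)]]
    for _ in range(1, n_slices):
        prev = chain[-1]
        # Each coordinate drifts from its value in the preceding level or time slice.
        chain.append([rng.gauss(p, sigma) for p in prev])
    return chain

rng = random.Random(0)
# Hypothetical example: one topic over 4 ordered levels (nation, province,
# city, district/county) with a 6-word vocabulary.
phi_k = random_walk_chain(n_slices=4, dim=6, sigma=0.1, rng=rng)
```

A small `sigma` keeps a topic's word scores close between neighbours, which is what makes the sequence of per-slice topics readable as an evolution rather than as unrelated topics.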
The symbols used in the present invention are explained first. The symbol descriptions are given in Table 1:
Table 1: Model symbol description
Within the space-line and timeline sub-corpora, the document generation process for a single level or a single time slice i is shown in Table 2:
Table 2: Document generation process of the model
From the model diagrams and the generation process, in each level or time slice i, the topic distribution of a document d is expressed as follows:
The topic-word distribution β of a level or time slice i is expressed as follows:
Given the observable data W, the probability density function of the posterior distribution is:
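The formulas themselves are not reproduced in this text. Because the sampling steps describe η and φ as quantities not yet normalized by softmax, a natural reading is that the document-topic and topic-word distributions are obtained by applying softmax to them. The sketch below illustrates that reading only; the numeric values are hypothetical.

```python
import math

def softmax(scores):
    """Map unnormalized real scores to a probability vector."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

eta_d = [0.2, -1.0, 1.5]        # hypothetical unnormalized topic scores of one document
theta_d = softmax(eta_d)        # document-topic proportions
phi_k = [0.0, 2.0, -0.5, 0.3]   # hypothetical unnormalized word scores of one topic
beta_k = softmax(phi_k)         # topic-word probabilities
```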
Based on S4(a) and S4(b), step S5 of the invention is performed: Blocked Gibbs sampling. In this step, the Blocked Gibbs sampling method is used to sample the topics in the models. The parameters to be sampled are the hyperparameter α_i, the latent variables φ_{k,i} and η_{d,i}, and the topic label z_{d,n,i} of each word. The Gibbs sampling process comprises the following steps:
(1) Initialize the model parameters. The model parameters to be initialized include the prior variance parameters {δ², σ², a²} of the logistic normal distribution and the topic set z_{d,i} of the words in each document.
(2) Sample α_i from a normal distribution, where
(3) With α_i and z_{d,i} known, sample the topic distribution η_{d,k} of each document by the following formula:
where ε_iter is a constant with value ε_iter = a × (b + iter)^(-c); N_{d,k,i} denotes the total number of words assigned topic k in document d of the single level or single time slice i; N_{d,i} denotes the total number of words in document d of the single level or single time slice i; and ξ_iter is a value sampled from the normal distribution N(ξ_iter | 0, ε_iter).
(4) Sample the topic-word distribution φ_{k,i} by the following formula:
where N_{k,v,i} is the number of words with word identifier v assigned topic k in the single level or time slice i, and N_{k,i} denotes the total number of words assigned topic k in the single level or time slice i.
(5) Given η_{d,i} and φ_{k,i}, sample the topic value z_{d,n,i} of word w_{d,n} in the document by the following formula:
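Steps (3) and (5) can be sketched as follows. The update formulas are not reproduced in this text, so this is an assumption-laden illustration rather than the patent's method: the decaying step size ε_iter and the noise ξ_iter ~ N(0, ε_iter) suggest a Langevin-style gradient step, and the code assumes that form with a² taken as the prior variance of η around α. All function names are hypothetical.

```python
import math
import random

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def step_size(a, b, c, it):
    """Decaying step size: epsilon_iter = a * (b + iter) ** (-c)."""
    return a * (b + it) ** (-c)

def langevin_update_eta(eta, alpha, topic_counts, a_var, eps, rng):
    """One Langevin-style step on a document's unnormalized topic scores (assumed form)."""
    n_d = sum(topic_counts)          # N_{d,i}: words in the document
    theta = softmax(eta)
    new_eta = []
    for k in range(len(eta)):
        # Gradient of the log-posterior: Gaussian prior term plus
        # observed topic count N_{d,k,i} minus its expectation.
        grad = -(eta[k] - alpha[k]) / a_var + topic_counts[k] - n_d * theta[k]
        noise = rng.gauss(0.0, math.sqrt(eps))   # xi ~ N(0, eps); eps is a variance
        new_eta.append(eta[k] + 0.5 * eps * grad + noise)
    return new_eta

def sample_topic(word_id, eta, beta, rng):
    """Draw z for one word with p(z = k) proportional to exp(eta[k]) * beta[k][word_id]."""
    theta = softmax(eta)
    weights = [theta[k] * beta[k][word_id] for k in range(len(eta))]
    return rng.choices(range(len(eta)), weights=weights, k=1)[0]
```

Here `beta[k]` is the softmax-normalized word distribution of topic k. A full sweep would apply `langevin_update_eta` to every document, an analogous update to each φ_{k,i}, and then `sample_topic` to every word.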
After the Markov chain of the Blocked Gibbs sampler has stabilized, step S6 of the invention is performed: four distributions are computed and analyzed. For the national, provincial, municipal, and district/county bulletins, these are the word distribution of each level under each topic and the document-topic distribution of each level; for a single-level corpus, they are the word distribution of each time slice under each topic and the document-topic distribution of each time slice.
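Once the per-level or per-slice word distributions of a topic are available, one simple way to inspect how its vocabulary shifts is to list the most probable words in each slice. The sketch below assumes the distributions are already normalized; the function names and the choice of a top-n summary are illustrative, not taken from the text.

```python
def top_words(beta_k, vocab, n=3):
    """Return the n most probable words of one topic in one level or time slice."""
    ranked = sorted(range(len(vocab)), key=lambda v: beta_k[v], reverse=True)
    return [vocab[v] for v in ranked[:n]]

def word_evolution(beta_by_slice, vocab, n=3):
    """List the top words of a topic for each level (or time slice), in order."""
    return [top_words(beta_k, vocab, n) for beta_k in beta_by_slice]
```

For example, calling `word_evolution` on a topic's four per-level distributions yields one short word list per level, which can be read side by side from nation down to district/county.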
Step S7 of the invention is performed: topic evolution analysis. Based on the government document bulletins, the space-line-based dynamic hierarchical topic model is used to analyze topic evolution across the four levels nation-province-city-district/county. This includes analyzing, under each topic, how the four levels carry out the topic from each level to the one below it, the mutual dependence and independence of the levels, and the sub-topic information each topic emphasizes, as well as the proportion of each topic at each level in the bulletins issued by the government and the priority topics that the bulletins focus on. The timeline-based dynamic time-series probabilistic topic model is used to analyze the topic evolution of each level in each period. This includes analyzing, for a single level and the same topic, how that level carries out the topic in different time slices, and the proportion of government documents issued on each topic in different time slices. In addition, using the timeline models of the four levels, one can further study comprehensively, across time slices, how a topic is carried out from the nation to the province, to the city, and to the district/county, in order to analyze the topic information of government document bulletins and gain a full picture of how they are issued.
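The topic proportions used in this analysis can be computed in several ways; the text does not fix one. The sketch below assumes the simplest, averaging the document-topic distributions of the documents in each level or time slice, with hypothetical function names. A length-weighted average would be an equally plausible alternative.

```python
def topic_proportions(thetas):
    """Average document-topic distributions to get the topic shares of one slice."""
    n_topics = len(thetas[0])
    return [sum(theta[k] for theta in thetas) / len(thetas) for k in range(n_topics)]

def topic_trend(thetas_by_slice, k):
    """Share of topic k in each level or time slice, in order."""
    return [topic_proportions(thetas)[k] for thetas in thetas_by_slice]
```

Applied to the time slices of one level, `topic_trend` gives the timeline view of a topic; applied to the four levels within one time slice, it gives the space-line view; repeating the latter for each time slice gives the joint view.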
Claims (8)
1. A cross-level government document bulletin topic analysis method, characterized by comprising the following steps:
(1) performing text preprocessing on published government document bulletin data;
(2) constructing a space-line-based dynamic hierarchical probabilistic topic model from the cross-level government document bulletin data;
(3) constructing a timeline-based dynamic time-series probabilistic topic model from the government document bulletin data of a single level;
(4) sampling the hyperparameters and latent variables of the models, the sampled quantities including the topic distributions, the word distributions, and the topic assigned to each word;
(5) performing topic evolution analysis according to the distributions estimated by the models.
2. The cross-level government document bulletin topic analysis method according to claim 1, characterized in that the government document bulletin data comprises national-level, provincial-level, municipal-level, and district/county-level document bulletins.
3. The cross-level government document bulletin topic analysis method according to claim 1, characterized in that in step (4) the hyperparameters and latent variables of the models are sampled using the Blocked Gibbs method.
4. The cross-level government document bulletin topic analysis method according to claim 1, characterized in that step (4) specifically comprises the following steps:
(4.1) initializing the Markov chain parameters, including the variance parameters {δ², σ², a²} of the normal distributions and the topic set z_{d,i} of all words in each document;
(4.2) sampling the logistic-normal prior parameter α_i of the topic distribution;
(4.3) sampling the topic distribution η_{d,i}, not normalized by softmax, of each document in a single level or time slice;
(4.4) sampling the topic-word distribution φ_{k,i}, not normalized by softmax, over all documents in a single level or time slice;
(4.5) given η_{d,i} and φ_{k,i}, sampling the topic value z_{d,n,i} of each word w_{d,n} in a document.
5. The cross-level government document bulletin topic analysis method according to claim 1, characterized in that the topic evolution analysis performed in step (5) according to the distributions estimated by the models comprises:
(5.1) topic evolution analysis based on the space line;
(5.2) topic evolution analysis based on the timeline;
(5.3) joint topic evolution analysis based on the space line and the timeline.
6. The cross-level government document bulletin topic analysis method according to claim 5, characterized in that the topic evolution analysis based on the space line means first analyzing, under the same topic, how the four levels (nation, province, city, district/county) carry out the topic from each level to the one below it, the mutual dependence and independence of topics, and the sub-topic information that each topic emphasizes, and then analyzing the proportion of each topic at each level in the document bulletins issued by the government.
7. The cross-level government document bulletin topic analysis method according to claim 5, characterized in that the topic evolution analysis based on the timeline means analyzing, under the same topic, how a given level carries out a given topic in different time slices, and analyzing the topic proportions of the government document bulletins of that level in different time slices.
8. The cross-level government document bulletin topic analysis method according to claim 5, characterized in that the joint topic evolution analysis based on the space line and the timeline means using the timeline-based dynamic time-series probabilistic topic models of the different levels to study, across time slices, how a topic is carried out from the nation to the province, to the city, and to the district/county, that is, how the space-line evolution changes along the timeline, and thereby analyzing the topic information of government document bulletins.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811613793.9A CN109710936A (en) | 2018-12-27 | 2018-12-27 | A kind of cross-layer grade government document bulletin subject analysis method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811613793.9A CN109710936A (en) | 2018-12-27 | 2018-12-27 | A kind of cross-layer grade government document bulletin subject analysis method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109710936A true CN109710936A (en) | 2019-05-03 |
Family
ID=66257846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811613793.9A Withdrawn CN109710936A (en) | 2018-12-27 | 2018-12-27 | A kind of cross-layer grade government document bulletin subject analysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109710936A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111782814A (en) * | 2020-07-17 | 2020-10-16 | 安徽大学 | Analysis method for patent technology subject content and heat evolution |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662960A (en) * | 2012-03-08 | 2012-09-12 | 浙江大学 | On-line supervised theme-modeling and evolution-analyzing method |
CN103984681A (en) * | 2014-03-31 | 2014-08-13 | 同济大学 | News event evolution analysis method based on time sequence distribution information and topic model |
CN104199974A (en) * | 2013-09-22 | 2014-12-10 | 中科嘉速(北京)并行软件有限公司 | Microblog-oriented dynamic topic detection and evolution tracking method |
CN106021222A (en) * | 2016-05-09 | 2016-10-12 | 浙江农林大学 | Analysis method and device for scientific research literature theme evolution |
Non-Patent Citations (4)
Title |
---|
唐晓波等: "基于潜在狄利克雷分配模型的微博主题演化分析", 《情报学报》 * |
崔凯等: "一种基于LDA的在线主题演化挖掘模型", 《计算机科学》 * |
张永安等: "基于R语言的区域技术创新政策量化分析", 《情报杂志》 * |
李永忠等: "基于LDA的国内电子政务研究主题演化及可视化分析", 《现代情报》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111782814A (en) * | 2020-07-17 | 2020-10-16 | 安徽大学 | Analysis method for patent technology subject content and heat evolution |
CN111782814B (en) * | 2020-07-17 | 2023-11-10 | 安徽大学 | Analysis method for patent technical subject matter and heat evolution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20190503 |