CN109710936A - Cross-level government document bulletin topic analysis method

Info

Publication number: CN109710936A
Application number: CN201811613793.9A
Authority: CN
Inventors: 闫盈盈; 王进; 阚丹会; 丁剑飞; 曹扬
Original assignee: Division Big Data Research Institute Co Ltd
Current assignee: Division Big Data Research Institute Co Ltd
Priority date: 2018-12-27
Filing date: 2018-12-27
Publication date: 2019-05-03

Abstract

The present invention provides a cross-level government document bulletin topic analysis method, comprising the following steps: (1) performing text preprocessing on published government document bulletin data; (2) constructing a dynamic hierarchical probabilistic topic model based on the spatial line from cross-level government document bulletin data; (3) constructing a dynamic time-series probabilistic topic model based on the timeline from the government document bulletin data of a single level; (4) sampling the hyperparameters and hidden variables of the models, where the sampled quantities include the topic distributions, the word distributions, and the topic assigned to each word; (5) performing topic evolution analysis according to the distributions estimated by the models. Along the spatial line, the invention analyzes how the topics of government document bulletins evolve across levels; along the timeline, it analyzes how the topics of bulletins at the same level evolve over time; and it combines the two lines to jointly discover and analyze the matters revealed by topic evolution in government document bulletins, providing assistance and support for efficient government supervision and decision-making.

Description

Cross-level government document bulletin topic analysis method

Technical field

The present invention relates to a cross-level government document bulletin topic analysis method.

Background art

With the arrival of the big data era, government departments have produced large volumes of government document bulletin data. They increasingly value the use of big data technology and text information processing to mine and analyze associations within government data. Behind the cross-level government document bulletin data at the national, provincial, and city levels lie deep-rooted, complex relationships and latent associations.

At present, there is still no method that analyzes government document bulletin data and assists government bodies in decision-making. Probabilistic topic modeling, one of the big data techniques and an effective approach to text information processing, can discover and mine the latent topics in the bulletins of governments at all levels, together with their evolution, along both the spatial line and the timeline. To a certain extent it can meet the need for scientific and fine-grained government management.

Summary of the invention

To solve the above technical problem, the present invention provides a cross-level government document bulletin topic analysis method. Along the spatial line, the method analyzes how the topics of government document bulletins evolve across levels; along the timeline, it analyzes how the topics of bulletins at the same level evolve over time; and it combines the two lines to jointly discover and analyze the matters revealed by topic evolution in government document bulletins, providing assistance and support for efficient government supervision and decision-making.

The present invention is achieved by the following technical solution.

The cross-level government document bulletin topic analysis method provided by the present invention comprises the following steps:

(1) performing text preprocessing on published government document bulletin data;

(2) constructing a dynamic hierarchical probabilistic topic model based on the spatial line from cross-level government document bulletin data;

(3) constructing a dynamic time-series probabilistic topic model based on the timeline from the government document bulletin data of a single level;

(4) sampling the hyperparameters and hidden variables of the models, where the sampled quantities include the topic distributions, the word distributions, and the topic assigned to each word;

(5) performing topic evolution analysis according to the distributions estimated by the models.

The government document bulletin data includes national-level, provincial-level, city-level, and district/county-level document bulletins.

In step (4), the hyperparameters and hidden variables of the models are sampled using the Blocked Gibbs method.

Step (4) specifically comprises the following steps:

(4.1) initializing the Markov chain parameters, including the variance parameters {δ², σ², a²} of the normal distributions and the topic set z_d,i of all words in every document;

(4.2) sampling the logistic-normal prior parameter α_i of the topic distribution;

(4.3) sampling the topic distribution η_d,i, not yet normalized by softmax, of every document in a single level or time slice;

(4.4) sampling the topic-word distribution φ_k,i, not yet normalized by softmax, of all documents in a single level or time slice;

(4.5) given η_d,i and φ_k,i, sampling the topic value z_d,n,i of each word w_d,n in the document.

The topic evolution analysis according to the distributions estimated by the models in step (5) comprises:

(5.1) topic evolution analysis based on the spatial line;

(5.2) topic evolution analysis based on the timeline;

(5.3) joint topic evolution analysis based on the spatial line and the timeline.

The topic evolution analysis based on the spatial line means first analyzing, under the same topic, how the four levels (nation, province, city, district/county) implement the topic from each level to the one below it, the dependence and independence among topics, and the sub-topic information each topic emphasizes, and then analyzing the proportion of each topic at each level in the document bulletins published by the government.

The topic evolution analysis based on the timeline means analyzing, under the same topic, how a given level implements a given topic in different time slices, and analyzing the topic proportions of the government document bulletins of that level in different time slices.

The joint topic evolution analysis based on the spatial line and the timeline means using the timeline-based dynamic time-series probabilistic topic models of the different levels to study, across time slices, how a topic is implemented from nation to province, to city, and to district/county, that is, how the spatial-line evolution changes along the timeline, and thereby analyzing the topic information of government document bulletins.

The beneficial effects of the present invention are as follows: along the spatial line, it analyzes how the topics of government document bulletins evolve across levels; along the timeline, it analyzes how the topics of bulletins at the same level evolve over time; and it combines the two lines to jointly discover and analyze the matters revealed by topic evolution in government document bulletins, providing assistance and support for efficient government supervision and decision-making.

Description of the drawings

Fig. 1 is the execution flowchart of one embodiment of the present invention;

Fig. 2 is the diagram of the dynamic hierarchical probabilistic topic model based on the spatial line in the present invention;

Fig. 3 is the diagram of the dynamic time-series probabilistic topic model based on the timeline in the present invention.

Specific embodiments

The technical solution of the present invention is further described below, but the claimed scope is not limited to what is described.

In general, the cross-level government document bulletin topic analysis method provided by the present invention comprises the following steps:

(1) performing text preprocessing on the published national-level, provincial-level, city-level, and district/county-level document bulletin data;

(2) constructing the dynamic hierarchical probabilistic topic model based on the spatial line from the four-level nation-province-city-district/county government document bulletin data;

(3) constructing the dynamic time-series probabilistic topic model based on the timeline from the government document bulletin data of each single level after it has been divided into time slices;

(4) obtaining, from the dynamic hierarchical probabilistic topic model of the spatial line, the topic-level-word distribution of the national, provincial, city, and district/county government document bulletin data and the document-topic distribution of each level;

(5) obtaining, from the dynamic time-series probabilistic topic model of the timeline, the document-topic distribution and the topic-time-word distribution of each time slice within a single level;

(6) using these distributions to discover how the topics of government document bulletins evolve across levels, how the words of each level evolve under a specific topic, how the topics of a single level evolve over time, and how the words of each time slice evolve under a specific topic.

The text preprocessing in step (1) includes word segmentation of the government document bulletin data of all levels and removal of stop words, low-frequency words, digits, symbols, and the like, so that only the word information that meets the specification is retained. In addition, the government document bulletin data of each level must be divided according to a specific time-slice size.

As shown in Fig. 2, the dynamic hierarchical probabilistic topic model based on the spatial line constructed in step (2) performs topic modeling over the four levels nation-province-city-district/county.

As shown in Fig. 3, the dynamic time-series probabilistic topic model based on the timeline constructed in step (3) performs topic modeling for each of the nation, province, city, and district/county levels.

Embodiment 1

As shown in Fig. 1, the embodiment comprises the following steps:

Step S1 is executed first to obtain the government document bulletin data of each level (nation, province, city, district/county). This embodiment crawled the document bulletin data published from 2013 to 2018 by a general office at the national level, the open-government document bulletin data of a certain province, that of the capital city of the province, and that of a certain district under the provincial capital.

Step S2 is then executed to preprocess the text of all government document bulletins. Word segmentation is completed using the precise mode of the jieba segmenter, and the Python language is then used to remove stop words, digits, English characters, symbols, low-frequency words, and so on. After preprocessing, all redundant characters other than Chinese words have been eliminated, so the government document bulletin data becomes concise, tidy, and clear, which saves computing resources and facilitates further calculation and analysis.
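The preprocessing in step S2 can be sketched as follows. This is a minimal illustration and not the embodiment's actual code: the stop-word set, the `min_count` threshold, and the function names are hypothetical, and a whitespace tokenizer stands in for the segmenter so that the sketch runs without third-party packages. For real Chinese text, a jieba-based tokenizer in precise mode (for example `lambda t: jieba.lcut(t, cut_all=False)`) would be passed in instead.

```python
import re
from collections import Counter

# Hypothetical stop-word set; a real run would load a full Chinese stop-word list.
STOP_WORDS = {"的", "了", "和"}


def whitespace_tokenize(text):
    """Placeholder tokenizer (assumption): splits on whitespace only."""
    return text.split()


def preprocess(docs, tokenize=whitespace_tokenize, stop_words=STOP_WORDS, min_count=2):
    """Tokenize, keep only Chinese words, drop stop words and low-frequency words."""
    chinese_only = re.compile(r"^[\u4e00-\u9fff]+$")
    tokenized = []
    for text in docs:
        # Digits, English characters, and symbols fail the Chinese-only pattern.
        tokens = [t for t in tokenize(text)
                  if chinese_only.match(t) and t not in stop_words]
        tokenized.append(tokens)
    # Count words over the whole corpus, then remove those below the threshold.
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]


docs = ["政府 公告 2018 的 数据 abc", "政府 数据 ， 公开 了"]
cleaned = preprocess(docs)
```

Keeping the tokenizer as a parameter separates the segmentation step from the filtering steps, which matches the order described in S2.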

After the preprocessing of the government document bulletins is complete, step S3(a) is executed: the existing text is divided by level into four level sub-corpora, completing the level corpus division. Step S3(b) is then executed: the text data of each of the nation, province, city, and district/county levels is divided using a quarter (3 months) as the time slice, yielding 20 time-slice sub-corpora in total.
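The level and quarterly divisions of steps S3(a) and S3(b) can be illustrated with a short sketch. The record layout `(level, date, text)`, the function names, and the sample records are assumptions made for illustration only.

```python
from datetime import date


def quarter_slice(published, start_year):
    """Return the zero-based quarterly time-slice index of a publication date."""
    quarter = (published.month - 1) // 3
    return (published.year - start_year) * 4 + quarter


def split_by_level_and_slice(records, start_year):
    """Group (level, date, text) records into corpus[level][slice_index]."""
    corpus = {}
    for level, published, text in records:
        idx = quarter_slice(published, start_year)
        corpus.setdefault(level, {}).setdefault(idx, []).append(text)
    return corpus


# Hypothetical records; real data would come from the crawled bulletins.
records = [
    ("nation", date(2013, 2, 1), "doc a"),
    ("nation", date(2013, 5, 20), "doc b"),
    ("province", date(2014, 1, 3), "doc c"),
]
corpus = split_by_level_and_slice(records, start_year=2013)
```

The outer key gives the level sub-corpus used by the spatial-line model, and the inner key gives the time-slice sub-corpus used by the timeline model of that level.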

Step S4(a) is executed. For the four level sub-corpora divided in S3(a), the dynamic hierarchical probabilistic topic model based on the spatial line is established. First, the model assumes that the levels are ordered as national (Nation), provincial (Province), city (Municipality), and district/county (District). The order of documents between levels cannot be exchanged, but the documents within each level are unordered and exchangeable. Second, the latent Dirichlet allocation (LDA) topic model is applied to the sub-corpus of each level to mine document topic information and obtain the dynamic evolution of topics across levels. Third, any two adjacent levels in the model are linked by latent level information and are not independent of each other: under a given topic, the word distribution of the bulletins at the lower level is influenced by the word distribution of the bulletins at the level above it.

Step S4(b) is executed. For the sub-corpora of each level divided by quarterly time slice in step S3(b), the dynamic time-series probabilistic topic model based on the timeline of a single level's government document bulletins is established. First, the model assumes that the order of documents between time slices cannot be exchanged, but the documents within a time slice are unordered. Second, LDA is applied to the sub-corpus of each time slice to mine topics and obtain the dynamic evolution of topics over time. Third, any two adjacent time slices in the model are linked by latent temporal information and are not independent of each other: under a given topic, the word distribution of the bulletins in the later time slice is influenced by the topic information distribution of the bulletins in the preceding time slice. Fourth, the national, provincial, city, and district/county levels are each divided by quarterly time slice, and the model is run for each.
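The dependence between adjacent levels or adjacent time slices can be illustrated with a small simulation. This is a sketch under a stated assumption: the link is modeled as a Gaussian random walk on the unnormalized topic-word parameters, with standard deviation `sigma`, as is common in dynamic topic models. The text above does not spell out the exact form of the link, and all function names here are hypothetical.

```python
import math
import random


def softmax(values):
    """Map unnormalized real values to a probability vector."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def evolve_topic_words(phi_prev, sigma, rng):
    """Draw the next slice's parameters around the previous ones (assumed link)."""
    return [rng.gauss(p, sigma) for p in phi_prev]


def simulate_chain(num_slices, vocab_size, sigma, seed=0):
    """Simulate one topic's word distribution across ordered levels or time slices."""
    rng = random.Random(seed)
    phi = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
    chain = [softmax(phi)]
    for _ in range(num_slices - 1):
        phi = evolve_topic_words(phi, sigma, rng)
        chain.append(softmax(phi))
    return chain


# Four ordered levels: nation -> province -> city -> district/county.
chain = simulate_chain(num_slices=4, vocab_size=5, sigma=0.1)
```

A small `sigma` keeps each level's word distribution close to that of the level above it, while a large `sigma` lets the levels drift apart; the same sketch applies to consecutive quarters in the timeline model.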

The symbols used in the present invention are explained first. The symbol descriptions are given in Table 1:

Table 1: Model symbol descriptions

In the spatial-line and timeline sub-corpora, the document generation process of a single level or single time slice i is shown in Table 2:

Table 2: Document generation process of the model

From the model diagram and the generation process, within each level or time slice i, the topic distribution of a document d is expressed as follows:

The topic-word distribution β of a level or time slice i is expressed as follows:

Given the observable data W, the probability density function of the posterior distribution is:
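Assuming the standard logistic-normal construction suggested by steps (4.3) and (4.4), in which η_d,i and φ_k,i are unnormalized and are mapped to probabilities by softmax, one plausible form of the document-topic and topic-word distributions is sketched below. The symbol θ for the normalized document-topic distribution, the topic count K, and the vocabulary size V are notational assumptions introduced for this illustration.

```latex
\theta_{d,i,k} = \frac{\exp(\eta_{d,i,k})}{\sum_{j=1}^{K} \exp(\eta_{d,i,j})},
\qquad
\beta_{k,i,v} = \frac{\exp(\phi_{k,i,v})}{\sum_{u=1}^{V} \exp(\phi_{k,i,u})}
```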

Based on S4(a) and S4(b), step S5 of the present invention is executed to perform Blocked Gibbs sampling. In this step, the Blocked Gibbs sampling method is used to sample the topics in the model. The parameters to be sampled include the hyperparameter α_i, the hidden variables φ_k,i and η_d,i, and the topic assignment z_d,n,i of each word. The Gibbs sampling process comprises the following steps:

(1) Initialize the model parameters. The parameters to be initialized include the prior variance parameters {δ², σ², a²} of the logistic-normal distribution and the topic set z_d,i of the words in every document.

(2) Sample α_i from a normal distribution, where

(3) With α_i and z_d,i known, sample the topic distribution η_d,k of every document by the following formula:

where ε_iter is a constant with value ε_iter = a × (b + iter)^(-c); N_d,k,i is the total number of words in document d assigned to topic k in a single level or single time slice i; N_d,i is the total number of words in document d in a single level or single time slice i; and ξ_iter is a value sampled from the normal distribution N(ξ_iter | 0, ε_iter).

(4) Sample the topic-word distribution φ_k,i by the following formula:

where N_k,v,i is the number of words with word identifier v assigned to topic k in a single level or time slice i, and N_k,i is the total number of words assigned to topic k in a single level or time slice i.

(5) Given η_d,i and φ_k,i, sample the topic value z_d,n,i of each word w_d,n in the document by the following formula:
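Two pieces of the sampling procedure can be sketched in code: the step-size schedule ε_iter = a × (b + iter)^(-c) with its noise draw from step (3), and the per-word topic draw of step (5). The sketch rests on two assumptions: that ε_iter acts as the variance of the noise distribution, and that the topic probability of a word is proportional to exp(η_d,k) times the softmax-normalized φ_k at that word. The function names and the numeric values of a, b, and c are illustrative only.

```python
import math
import random


def step_size(iteration, a, b, c):
    """Step size eps_iter = a * (b + iter)^(-c)."""
    return a * (b + iteration) ** (-c)


def draw_noise(iteration, a, b, c, rng):
    """Draw xi_iter from N(0, eps_iter); eps_iter is treated as the variance (assumption)."""
    eps = step_size(iteration, a, b, c)
    return rng.gauss(0.0, math.sqrt(eps))


def sample_topic(eta_d, phi, word, rng):
    """Sample a topic for one word, assuming
    p(z = k) is proportional to exp(eta_d[k]) * softmax(phi[k])[word]."""
    weights = []
    for k, eta_k in enumerate(eta_d):
        # Log of the softmax normalizer for topic k, computed stably.
        m = max(phi[k])
        log_norm = m + math.log(sum(math.exp(v - m) for v in phi[k]))
        weights.append(math.exp(eta_k + phi[k][word] - log_norm))
    return rng.choices(range(len(eta_d)), weights=weights, k=1)[0]


rng = random.Random(0)
eps_schedule = [step_size(it, a=0.5, b=100.0, c=0.6) for it in range(3)]
z = sample_topic(eta_d=[0.2, -0.1],
                 phi=[[0.0, 1.0, 0.5], [2.0, 0.0, 0.0]],
                 word=1, rng=rng)
```

With c > 0 the step size shrinks as the iteration count grows, so the injected noise becomes smaller as the chain progresses.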

After the Markov chain of the Blocked Gibbs sampling has stabilized, step S6 of the present invention is executed to compute four distributions: for the national, provincial, city, and district/county government document bulletins, the word distribution of each level under each topic and the topic distribution of the documents of each level; and, for the corpus of a single level, the word distribution of each time slice under each topic and the topic distribution of the documents of each time slice.

Step S7 of the present invention is executed to perform topic evolution analysis. Based on the government document bulletins, the spatial-line dynamic hierarchical topic model is used to analyze the topic evolution across the four levels nation-province-city-district/county. This includes analyzing, under each topic, how the four levels implement the topic from each level to the one below it, the dependence and independence among levels, the sub-topic information each topic emphasizes, the proportion of each topic at each level in the bulletins published by the government, and the priority topics on which the bulletins focus. The dynamic time-series probabilistic topic model of the timeline is used to analyze the topic evolution of each level in each period. This includes analyzing, for a single level and under the same topic, how the level implements the topic in different time slices and the proportion of government documents issued on each topic in different time slices. In addition, the timeline models of the four levels can be combined to study, across time slices, how topics are implemented from nation to province, to city, and to district/county, so as to further analyze the topic information of government document bulletins and gain a full understanding of how they are published.
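The topic-proportion part of the evolution analysis can be sketched as follows. The sketch assumes that the share of a topic within one level or time slice is the average of the normalized document-topic distributions of that slice; the averaging rule, the function names, and the sample numbers are illustrative assumptions rather than the method's stated definition.

```python
def topic_proportions(doc_topic):
    """Average document-topic distributions to get each topic's share in one slice."""
    num_docs = len(doc_topic)
    num_topics = len(doc_topic[0])
    return [sum(doc[k] for doc in doc_topic) / num_docs for k in range(num_topics)]


def topic_trend(slices, topic):
    """Track one topic's share across ordered levels or time slices."""
    return [topic_proportions(doc_topic)[topic] for doc_topic in slices]


# Hypothetical document-topic distributions for two consecutive quarters.
q1 = [[0.8, 0.2], [0.6, 0.4]]
q2 = [[0.3, 0.7], [0.5, 0.5]]
trend = topic_trend([q1, q2], topic=0)
```

Passing the four levels in order instead of consecutive quarters gives the spatial-line view of the same topic, and running the timeline view once per level gives the joint view.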

Claims

1. A cross-level government document bulletin topic analysis method, characterized by comprising the following steps:

(3) constructing a dynamic time-series probabilistic topic model based on the timeline from the government document bulletin data of a single level;

(4) sampling the hyperparameters and hidden variables of the models, where the sampled quantities include the topic distributions, the word distributions, and the topic assigned to each word;

2. The cross-level government document bulletin topic analysis method according to claim 1, characterized in that: the government document bulletin data includes national-level, provincial-level, city-level, and district/county-level document bulletins.

3. The cross-level government document bulletin topic analysis method according to claim 1, characterized in that: in step (4), the hyperparameters and hidden variables of the models are sampled using the Blocked Gibbs method.

4. The cross-level government document bulletin topic analysis method according to claim 1, characterized in that step (4) specifically comprises the following steps:

(4.1) initializing the Markov chain parameters, including the variance parameters {δ², σ², a²} of the normal distributions and the topic set z_d,i of all words in every document;

(4.3) sampling the topic distribution η_d,i, not yet normalized by softmax, of every document in a single level or time slice;

(4.4) sampling the topic-word distribution φ_k,i, not yet normalized by softmax, of all documents in a single level or time slice;

5. The cross-level government document bulletin topic analysis method according to claim 1, characterized in that the topic evolution analysis according to the distributions estimated by the models in step (5) comprises:

(5.1) topic evolution analysis based on the spatial line;

(5.2) topic evolution analysis based on the timeline;

6. The cross-level government document bulletin topic analysis method according to claim 5, characterized in that: the topic evolution analysis based on the spatial line means first analyzing, under the same topic, how the four levels (nation, province, city, district/county) implement the topic from each level to the one below it, the dependence and independence among topics, and the sub-topic information each topic emphasizes, and then analyzing the proportion of each topic at each level in the document bulletins published by the government.

7. The cross-level government document bulletin topic analysis method according to claim 5, characterized in that: the topic evolution analysis based on the timeline means analyzing, under the same topic, how a given level implements a given topic in different time slices, and analyzing the topic proportions of the government document bulletins of that level in different time slices.

8. The cross-level government document bulletin topic analysis method according to claim 5, characterized in that: the joint topic evolution analysis based on the spatial line and the timeline means using the timeline-based dynamic time-series probabilistic topic models of the different levels to study, across time slices, how a topic is implemented from nation to province, to city, and to district/county, that is, how the spatial-line evolution changes along the timeline, and thereby analyzing the topic information of government document bulletins.