CN109710936A - Cross-level government document announcement topic analysis method - Google Patents

Cross-level government document announcement topic analysis method

Info

Publication number
CN109710936A
Authority
CN
China
Prior art keywords
theme
document
subject
government
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201811613793.9A
Other languages
Chinese (zh)
Inventor
闫盈盈
王进
阚丹会
丁剑飞
曹扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Division Big Data Research Institute Co Ltd
Original Assignee
Division Big Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Division Big Data Research Institute Co Ltd filed Critical Division Big Data Research Institute Co Ltd

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a topic analysis method for cross-level government document announcements, comprising the following steps: (1) preprocessing the text of published government document announcement data; (2) constructing a spatial-line dynamic hierarchical probabilistic topic model from the cross-level government document announcement data; (3) constructing a timeline dynamic temporal probabilistic topic model from the government document announcement data of a single level; (4) sampling the hyperparameters and latent variables of the models, the sampled quantities including the topic distributions, the word distributions, and the topic assigned to each word; (5) performing topic evolution analysis on the distributions estimated by the models. Along the spatial line, the invention analyzes how the topics of government document announcements evolve across administrative levels; along the timeline, it analyzes how the topics of announcements at a single level evolve over time. Combining the two lines makes it possible to discover and analyze the matters revealed by topic evolution in the announcements, providing assistance and support for efficient government supervision and decision-making.

Description

Cross-level government document announcement topic analysis method
Technical field
The present invention relates to a topic analysis method for cross-level government document announcements.
Background
With the arrival of the big data era, government departments produce large volumes of government document announcement data, and they increasingly value the use of big data technology and text information processing to mine and analyze associations within government data. Government document announcements spanning the national, provincial, and municipal levels conceal deep-rooted, complex relationships and latent associations.
At present there is still no method that analyzes government document announcement data to assist government bodies in decision-making. Probabilistic topic modeling, as a big data technique and an effective means of text information processing, can discover and mine latent topics and their evolution in the announcements of governments at every level along both the spatial line and the timeline, and can to some extent meet the government's need for scientific, fine-grained management.
Summary of the invention
To solve the above technical problem, the present invention provides a topic analysis method for cross-level government document announcements. Along the spatial line, the method analyzes how the topics of government document announcements evolve across administrative levels; along the timeline, it analyzes how the topics of announcements at a single level evolve over time. Combining the two lines makes it possible to discover and analyze the matters revealed by topic evolution in the announcements, providing assistance and support for efficient government supervision and decision-making.
The present invention is achieved by the following technical solution.
The cross-level government document announcement topic analysis method provided by the invention comprises the following steps:
(1) preprocessing the text of published government document announcement data;
(2) constructing a spatial-line dynamic hierarchical probabilistic topic model from the cross-level government document announcement data;
(3) constructing a timeline dynamic temporal probabilistic topic model from the government document announcement data of a single level;
(4) sampling the hyperparameters and latent variables of the models, the sampled quantities including the topic distributions, the word distributions, and the topic assigned to each word;
(5) performing topic evolution analysis on the distributions estimated by the models.
The government document announcement data comprises national-level, provincial-level, municipal-level, and district/county-level document announcements.
In step (4), the hyperparameters and latent variables of the models are sampled using the Blocked Gibbs method.
Step (4) specifically comprises the following steps:
(4.1) initializing the Markov chain parameters, including the variance parameters {δ², σ², a²} of the normal distributions and the topic assignments z_{d,i} of all words in every document;
(4.2) sampling the logistic-normal prior parameter α_i of the topic distribution;
(4.3) sampling the topic distribution η_{d,i}, not yet normalized by softmax, of every document in a single level or time slice;
(4.4) sampling the topic-word distribution φ_{k,i}, not yet normalized by softmax, over all documents in a single level or time slice;
(4.5) given η_{d,i} and φ_{k,i}, sampling the topic assignment z_{d,n,i} of each word w_{d,n} in the document.
In step (5), the topic evolution analysis performed on the distributions estimated by the models comprises:
(5.1) topic evolution analysis along the spatial line;
(5.2) topic evolution analysis along the timeline;
(5.3) joint topic evolution analysis along the spatial line and the timeline.
The spatial-line topic evolution analysis first analyzes, for a given topic, how the four levels (nation, province, city, district/county) carry out the topic from each level down to the next, the dependence and independence of the topic between levels, and the sub-topic information that each topic emphasizes; it then analyzes the proportion of each topic at each level in the announcements issued by the government.
The timeline topic evolution analysis analyzes, for a given topic, how a given level carries out that topic in different time slices, and the proportion of each topic in that level's government document announcements in different time slices.
The joint topic evolution analysis along the spatial line and the timeline uses the timeline dynamic temporal probabilistic topic models of the different levels to study, across time slices, how a topic is carried out from nation to province, to city, and to district/county, that is, how the spatial-line evolution changes over time, and thereby analyzes the topic information of government document announcements.
The beneficial effects of the present invention are as follows: along the spatial line, the method analyzes how the topics of government document announcements evolve across administrative levels; along the timeline, it analyzes how the topics of announcements at a single level evolve over time; and combining the two lines makes it possible to discover and analyze the matters revealed by topic evolution in the announcements, providing assistance and support for efficient government supervision and decision-making.
Brief description of the drawings
Fig. 1 is an execution flow chart of one embodiment of the present invention;
Fig. 2 is a diagram of the spatial-line dynamic hierarchical probabilistic topic model of the present invention;
Fig. 3 is a diagram of the timeline dynamic temporal probabilistic topic model of the present invention.
Detailed description
The technical solution of the present invention is described further below, but the claimed scope is not limited to what is described.
In general, the cross-level government document announcement topic analysis method provided by the invention comprises the following steps:
(1) preprocessing the text of the published national-level, provincial-level, municipal-level, and district/county-level document announcement data;
(2) constructing the spatial-line dynamic hierarchical probabilistic topic model from the four-level (nation, province, city, district/county) government document announcement data;
(3) constructing the timeline dynamic temporal probabilistic topic model from the government document announcement data of each single level after it has been divided into time slices;
(4) obtaining, from the spatial-line dynamic hierarchical probabilistic topic model, the topic-level-word distributions of the national, provincial, municipal, and district/county announcement data and the document-topic distributions of each level;
(5) obtaining, from the timeline dynamic temporal probabilistic topic model, the document-topic distributions and the topic-time-word distributions of each time slice within a single level;
(6) using these distributions to discover how the topics of government document announcements evolve across levels, how the words of a specific topic evolve across levels, how the topics of a single level evolve over time, and how the words of a specific topic evolve across time slices.
The text preprocessing in step (1) includes word segmentation of the announcement data of all levels and removal of stop words, low-frequency words, digits, symbols, and the like, retaining only the word information that meets the specification. In addition, the announcement data of each level must be divided according to a specified time-slice size.
As shown in Fig. 2, the spatial-line dynamic hierarchical probabilistic topic model constructed in step (2) performs topic modeling over the four levels nation, province, city, and district/county.
As shown in Fig. 3, the timeline dynamic temporal probabilistic topic model constructed in step (3) performs topic modeling for each of the levels nation, province, city, and district/county.
Embodiment 1
As shown in Fig. 1, the embodiment comprises the following steps.
Step S1 is performed first: the government document announcement data of each level (nation, province, city, district/county) is obtained. For the period 2013 to 2018, this embodiment crawled the document announcement data of general office A, the document announcement data published under open government affairs by a certain province, by the capital city of that province, and by a certain district of that capital city.
Step S2 is then performed: text preprocessing is applied to all government document announcements. Word segmentation uses the precise mode of the jieba segmenter; stop words, digits, English characters, symbols, and low-frequency words are then removed using Python. After preprocessing, redundant characters other than Chinese words have been eliminated, so the announcement data is concise, tidy, and clean, which saves computing resources and simplifies further computation and analysis.
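The cleaning operations named in step S2 can be sketched as follows. This is a minimal illustration that assumes the documents have already been segmented into token lists; the helper names, the Chinese-character filter, and the `min_count` threshold are hypothetical choices made for the example, not details given in the patent.

```python
import re
from collections import Counter

# Hypothetical filter: keep only tokens made entirely of Chinese characters,
# which drops digits, English characters, and symbols in one pass.
CHINESE_TOKEN = re.compile(r"[\u4e00-\u9fff]+")


def clean_tokens(tokens, stop_words):
    """Drop stop words and every token that is not purely Chinese."""
    return [t for t in tokens if CHINESE_TOKEN.fullmatch(t) and t not in stop_words]


def drop_low_frequency(docs, min_count=2):
    """Remove words whose corpus-wide count is below min_count (assumed threshold)."""
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]


def preprocess(segmented_docs, stop_words, min_count=2):
    """segmented_docs: list of token lists, e.g. the output of a word segmenter."""
    cleaned = [clean_tokens(doc, stop_words) for doc in segmented_docs]
    return drop_low_frequency(cleaned, min_count)


docs = [
    ["政府", "关于", "2018", "环保", "的", "通知", "!"],
    ["环保", "政策", "ABC", "通知", "的"],
]
result = preprocess(docs, stop_words={"的", "关于"})
```

Low-frequency words are removed last so that the counts are taken over the already cleaned vocabulary.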
After preprocessing is complete, step S3(a) is performed: the text is divided by level into four level sub-corpora, completing the level corpus division. Step S3(b) is then performed: the text data of each level (nation, province, city, district/county) is divided with one quarter (3 months) as the time slice, giving 20 time-slice sub-corpora in total.
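The two divisions in S3(a) and S3(b) amount to grouping documents by level and by quarter. A minimal sketch, assuming each record carries a level label and a publication date; the record layout, the level labels, and the function names are illustrative assumptions.

```python
from collections import defaultdict
from datetime import date

LEVELS = ("nation", "province", "city", "district")  # fixed order of the hierarchy


def quarter_index(day, start_year):
    """0-based index of the quarter containing `day`, counted from Q1 of start_year."""
    return (day.year - start_year) * 4 + (day.month - 1) // 3


def partition(records, start_year):
    """Group (level, publication_date, tokens) records by level and by quarter."""
    by_level = defaultdict(list)
    by_level_and_slice = defaultdict(lambda: defaultdict(list))
    for level, day, tokens in records:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        by_level[level].append(tokens)
        by_level_and_slice[level][quarter_index(day, start_year)].append(tokens)
    return by_level, by_level_and_slice


records = [
    ("nation", date(2013, 2, 1), ["环保", "通知"]),
    ("nation", date(2013, 5, 20), ["教育", "意见"]),
    ("city", date(2014, 12, 31), ["环保", "方案"]),
]
by_level, by_slice = partition(records, start_year=2013)
```

`by_level` feeds the spatial-line model and `by_slice[level]` feeds the timeline model of that level.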
Step S4(a) is performed. A spatial-line dynamic hierarchical probabilistic topic model is built on the four level sub-corpora produced in S3(a). First, the model requires the hierarchy to be ordered as national (Nation), provincial (Province), municipal (Municipality), and district/county (District). Documents may not be exchanged between levels, but documents within a level are unordered and exchangeable. Second, document topic information is mined from each level sub-corpus with a latent Dirichlet allocation (LDA) topic model, yielding the dynamic evolution of topics across levels. Third, any two adjacent levels in the model are linked by latent hierarchical information and are not independent of each other: the word distribution of a topic at the lower level is influenced by the word distribution of the same topic in the announcement data of the level above it.
Step S4(b) is performed. For the sub-corpora of each level divided into quarterly time slices in S3(b), a dynamic temporal probabilistic topic model is built along the timeline of a single level's government document announcements. First, the model requires that documents may not be exchanged between time slices, but documents within a time slice are unordered. Second, topics are mined from each time-slice sub-corpus with LDA, yielding the dynamic evolution of topics over time. Third, any two adjacent time slices in the model are linked by latent temporal information and are not independent of each other: the word distribution of a topic in the later time slice is influenced by the distribution of that topic in the announcements of the preceding time slice. Fourth, the national, provincial, municipal, and district/county levels are each divided into quarterly time slices, and the model is run on each.
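The dependence between adjacent levels or time slices is described here only qualitatively. One common way to encode such a dependence in dynamic topic models is a Gaussian random walk, in which the unnormalized topic-word vector of slice i is drawn around that of slice i-1. The sketch below assumes that form; the function name, the variance value, and the vocabulary size are illustrative and are not taken from the patent.

```python
import random


def random_walk_chain(num_slices, vocab_size, sigma, rng):
    """Draw phi_0 ~ Normal(0, sigma^2 I) and phi_i ~ Normal(phi_{i-1}, sigma^2 I).

    Assumed form of the link between adjacent levels or time slices.
    """
    chain = [[rng.gauss(0.0, sigma) for _ in range(vocab_size)]]
    for _ in range(1, num_slices):
        previous = chain[-1]
        chain.append([rng.gauss(mean, sigma) for mean in previous])
    return chain


rng = random.Random(0)
# The four entries could stand for nation, province, city, district (spatial line)
# or for four consecutive quarters of one level (timeline).
phi = random_walk_chain(num_slices=4, vocab_size=5, sigma=0.1, rng=rng)
```

A small `sigma` keeps a topic's word weights close to those of the preceding level or quarter, which is the sense in which adjacent slices are "not independent".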
The symbols used in the present invention are explained first. The symbols are described in Table 1:
Table 1: Model symbol description
In the spatial-line and timeline sub-corpora, the document generation process for a single level or a single time slice i is shown in Table 2:
Table 2: Document generation process of the model
From the model diagram and the generation process, in each level or time slice i the topic distribution of a document d is expressed as follows:
The topic-word distribution β of a level or time slice i is expressed as follows:
Given the observed data W, the probability density function of the posterior distribution is:
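The tables and formulas that this passage points to are not reproduced in this text. Steps (4.3) and (4.4) describe η and φ as not yet normalized by softmax, which suggests that the topic distribution of a document and the topic-word distributions are obtained by applying softmax to them. The following sketch assumes exactly that and nothing more; the generation loop follows the usual topic-model pattern and is an assumption, not the content of Table 2.

```python
import math
import random


def softmax(values):
    """Map unnormalized real values to probabilities (numerically stable)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def generate_document(eta_d, phi, length, rng):
    """Assumed generative process for one document in one level or time slice.

    eta_d: unnormalized topic weights of the document, one per topic
    phi:   unnormalized word weights, phi[k][v] for topic k and word id v
    """
    theta = softmax(eta_d)                # document-topic distribution
    beta = [softmax(row) for row in phi]  # topic-word distributions
    words, topics = [], []
    for _ in range(length):
        k = rng.choices(range(len(theta)), weights=theta)[0]
        v = rng.choices(range(len(beta[k])), weights=beta[k])[0]
        topics.append(k)
        words.append(v)
    return words, topics


rng = random.Random(42)
words, topics = generate_document(
    eta_d=[2.0, 0.0],
    phi=[[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]],
    length=8,
    rng=rng,
)
```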
Based on S4(a) and S4(b), step S5 of the invention is performed: Blocked Gibbs sampling. In this step, the Blocked Gibbs sampling method is used to sample the topics in the model. The parameters to be sampled are the hyperparameter α_i, the latent variables φ_{k,i} and η_{d,i}, and the topic assignment z_{d,n,i} of each word. The Gibbs sampling procedure comprises the following steps:
(1) Initialize the model parameters. The parameters to initialize are the prior variance parameters {δ², σ², a²} of the logistic normal distributions and the topic assignments z_{d,i} of the words in every document.
(2) Sample α_i from a normal distribution, where
(3) With α_i and z_{d,i} known, sample the topic distribution η_{d,k} of every document by the following formula:
where ε_iter is a constant with value ε_iter = a × (b + iter)^(−c); N_{d,k,i} is the total number of words in document d assigned to topic k within a single level or a single time slice i; N_{d,i} is the total number of words in document d within a single level or a single time slice i; and ξ_iter is a value sampled from the normal distribution N(ξ_iter | 0, ε_iter).
(4) Sample the topic-word distribution φ_{k,i} by the following formula:
where N_{k,v,i} is the number of words with word identifier v assigned to topic k within a single level or time slice i, and N_{k,i} is the total number of words assigned to topic k within a single level or time slice i.
(5) Given η_{d,i} and φ_{k,i}, sample the topic assignment z_{d,n,i} of each word w_{d,n} in the document by the following formula:
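Two pieces of the procedure above can be illustrated without the missing formulas. The first is the decaying constant ε_iter = a × (b + iter)^(−c) and the noise ξ_iter drawn from a zero-mean normal distribution; the sketch reads ε_iter as the variance of that distribution, which is an assumption. The second is the topic draw in step (5); the conditional used below, p(z = k) proportional to softmax(η_d)[k] × softmax(φ_k)[w], is the usual form for softmax-parameterized topic models and is assumed here rather than taken from the patent. The values of a, b, and c are arbitrary.

```python
import math
import random


def step_size(iteration, a, b, c):
    """Decaying constant eps_iter = a * (b + iter) ** (-c)."""
    return a * (b + iteration) ** (-c)


def langevin_noise(iteration, a, b, c, rng):
    """Draw xi_iter from Normal(0, eps_iter), reading eps_iter as a variance (assumption)."""
    return rng.gauss(0.0, math.sqrt(step_size(iteration, a, b, c)))


def softmax(values):
    """Map unnormalized real values to probabilities (numerically stable)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def topic_posterior(word_id, eta_d, phi):
    """Assumed conditional: p(z = k) ~ softmax(eta_d)[k] * softmax(phi[k])[word_id]."""
    theta = softmax(eta_d)
    weights = [theta[k] * softmax(phi[k])[word_id] for k in range(len(eta_d))]
    total = sum(weights)
    return [w / total for w in weights]


def sample_topic(word_id, eta_d, phi, rng):
    """Draw one topic assignment for a word from the assumed conditional."""
    probs = topic_posterior(word_id, eta_d, phi)
    return rng.choices(range(len(probs)), weights=probs)[0]


rng = random.Random(7)
eps = [step_size(t, a=0.5, b=100.0, c=0.8) for t in range(3)]
xi = langevin_noise(0, a=0.5, b=100.0, c=0.8, rng=rng)
z = sample_topic(word_id=0, eta_d=[0.0, 0.0], phi=[[2.0, 0.0], [0.0, 2.0]], rng=rng)
```

In the toy call, both topics are equally likely a priori, so the draw is driven by which topic gives word 0 the larger weight.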
After the Markov chain of the Blocked Gibbs sampler has stabilized, step S6 of the invention is performed: four distributions are computed and analyzed. For the national, provincial, municipal, and district/county announcements, these are the word distribution of each level under each topic and the topic distribution of the documents of each level; for a single-level corpus, they are the word distribution of each time slice under each topic and the topic distribution of the documents of each time slice.
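One simple way to read distributions off the sampler state is to tabulate the counts N_{d,k,i} and N_{k,v,i} from the current topic assignments and normalize them. The sketch below does this for a single level or time slice; using raw normalized counts as the estimate is an assumption made for illustration, and the patent may instead derive the distributions from η and φ.

```python
def count_statistics(docs, assignments, num_topics, vocab_size):
    """Build N_{d,k} and N_{k,v} for one level or time slice.

    docs[d][n] is the word id of token n in document d;
    assignments[d][n] is the topic currently assigned to that token.
    """
    n_dk = [[0] * num_topics for _ in docs]
    n_kv = [[0] * vocab_size for _ in range(num_topics)]
    for d, (words, topics) in enumerate(zip(docs, assignments)):
        for v, k in zip(words, topics):
            n_dk[d][k] += 1
            n_kv[k][v] += 1
    return n_dk, n_kv


def normalize(row):
    """Turn a row of counts into proportions; an all-zero row stays all zero."""
    total = sum(row)
    return [c / total if total else 0.0 for c in row]


docs = [[0, 1, 1], [2, 2]]
assignments = [[0, 0, 1], [1, 1]]
n_dk, n_kv = count_statistics(docs, assignments, num_topics=2, vocab_size=3)
doc_topic = [normalize(row) for row in n_dk]   # document-topic distributions
topic_word = [normalize(row) for row in n_kv]  # topic-word distributions
```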
Step S7 of the invention is performed: topic evolution analysis. Based on the government document announcements, the spatial-line dynamic hierarchical topic model is used to analyze topic evolution across the four levels nation, province, city, and district/county. This includes analyzing, under each topic, how the four levels carry out the topic from each level down to the next, the dependence and independence between levels, the sub-topic information that each topic emphasizes, the proportion of each topic at each level in the announcements issued by the government, and the priority topics on which the announcements focus. The timeline dynamic temporal probabilistic topic model is used to analyze the topic evolution of each level in each period. This includes analyzing, for a single level and a given topic, how that level carries out the topic in different time slices and the share of government documents issued on each topic in different time slices. In addition, the timeline models of the four levels can be combined to study how a topic is carried out from nation to province, to city, and to district/county across time slices, so as to analyze the topic information of government document announcements further and gain a full picture of how the announcements are issued.
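The quantities compared in S7 can be sketched with two small helpers: the share of each topic within a level or time slice, taken here as the average of the document-topic distributions, and a measure of how far a topic's word distribution moves between adjacent levels or slices. The patent does not name a distance measure; Jensen-Shannon divergence is used below purely as an illustrative choice, and the toy numbers are invented.

```python
import math


def topic_share(doc_topic):
    """Average the document-topic distributions of one level or time slice."""
    num_topics = len(doc_topic[0])
    return [sum(d[k] for d in doc_topic) / len(doc_topic) for k in range(num_topics)]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy document-topic distributions for three consecutive slices (levels or quarters).
slices = [
    [[0.8, 0.2], [0.6, 0.4]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.2, 0.8], [0.4, 0.6]],
]
shares = [topic_share(s) for s in slices]
# Word distributions of one topic at two adjacent levels.
shift = js_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```

Reading `shares` along the list shows the first topic losing weight from slice to slice, while `shift` quantifies how much the topic's vocabulary changes between the two levels.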

Claims (8)

1. A cross-level government document announcement topic analysis method, characterized by comprising the following steps:
(1) preprocessing the text of published government document announcement data;
(2) constructing a spatial-line dynamic hierarchical probabilistic topic model from the cross-level government document announcement data;
(3) constructing a timeline dynamic temporal probabilistic topic model from the government document announcement data of a single level;
(4) sampling the hyperparameters and latent variables of the models, the sampled quantities including the topic distributions, the word distributions, and the topic assigned to each word;
(5) performing topic evolution analysis on the distributions estimated by the models.
2. The cross-level government document announcement topic analysis method according to claim 1, characterized in that the government document announcement data comprises national-level, provincial-level, municipal-level, and district/county-level document announcements.
3. The cross-level government document announcement topic analysis method according to claim 1, characterized in that in step (4) the hyperparameters and latent variables of the models are sampled using the Blocked Gibbs method.
4. The cross-level government document announcement topic analysis method according to claim 1, characterized in that step (4) specifically comprises the following steps:
(4.1) initializing the Markov chain parameters, including the variance parameters {δ², σ², a²} of the normal distributions and the topic assignments z_{d,i} of all words in every document;
(4.2) sampling the logistic-normal prior parameter α_i of the topic distribution;
(4.3) sampling the topic distribution η_{d,i}, not yet normalized by softmax, of every document in a single level or time slice;
(4.4) sampling the topic-word distribution φ_{k,i}, not yet normalized by softmax, over all documents in a single level or time slice;
(4.5) given η_{d,i} and φ_{k,i}, sampling the topic assignment z_{d,n,i} of each word w_{d,n} in the document.
5. The cross-level government document announcement topic analysis method according to claim 1, characterized in that the topic evolution analysis performed in step (5) on the distributions estimated by the models comprises:
(5.1) topic evolution analysis along the spatial line;
(5.2) topic evolution analysis along the timeline;
(5.3) joint topic evolution analysis along the spatial line and the timeline.
6. The cross-level government document announcement topic analysis method according to claim 5, characterized in that the spatial-line topic evolution analysis first analyzes, for a given topic, how the four levels (nation, province, city, district/county) carry out the topic from each level down to the next, the dependence and independence of the topic between levels, and the sub-topic information that each topic emphasizes, and then analyzes the proportion of each topic at each level in the announcements issued by the government.
7. The cross-level government document announcement topic analysis method according to claim 5, characterized in that the timeline topic evolution analysis analyzes, for a given topic, how a given level carries out that topic in different time slices, and the proportion of each topic in that level's government document announcements in different time slices.
8. The cross-level government document announcement topic analysis method according to claim 5, characterized in that the joint topic evolution analysis along the spatial line and the timeline uses the timeline dynamic temporal probabilistic topic models of the different levels to study, across time slices, how a topic is carried out from nation to province, to city, and to district/county, that is, how the spatial-line evolution changes over time, and thereby analyzes the topic information of government document announcements.
CN201811613793.9A 2018-12-27 2018-12-27 Cross-level government document announcement topic analysis method Withdrawn CN109710936A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811613793.9A CN109710936A (en) 2018-12-27 2018-12-27 Cross-level government document announcement topic analysis method

Publications (1)

Publication Number Publication Date
CN109710936A true CN109710936A (en) 2019-05-03

Family

ID=66257846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811613793.9A Withdrawn CN109710936A (en) 2018-12-27 2018-12-27 Cross-level government document announcement topic analysis method

Country Status (1)

Country Link
CN (1) CN109710936A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662960A (en) * 2012-03-08 2012-09-12 浙江大学 On-line supervised theme-modeling and evolution-analyzing method
CN104199974A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Microblog-oriented dynamic topic detection and evolution tracking method
CN103984681A (en) * 2014-03-31 2014-08-13 同济大学 News event evolution analysis method based on time sequence distribution information and topic model
CN106021222A (en) * 2016-05-09 2016-10-12 浙江农林大学 Analysis method and device for scientific research literature theme evolution

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
唐晓波等: "基于潜在狄利克雷分配模型的微博主题演化分析", 《情报学报》 *
崔凯等: "一种基于LDA的在线主题演化挖掘模型", 《计算机科学》 *
张永安等: "基于R语言的区域技术创新政策量化分析", 《情报杂志》 *
李永忠等: "基于LDA的国内电子政务研究主题演化及可视化分析", 《现代情报》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782814A (en) * 2020-07-17 2020-10-16 安徽大学 Analysis method for patent technology subject content and heat evolution
CN111782814B (en) * 2020-07-17 2023-11-10 安徽大学 Analysis method for patent technical subject matter and heat evolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190503