CN108038790B - Situation analysis system with internal and external data fusion - Google Patents

Publication number
CN108038790B
Authority
CN
China
Prior art keywords
data
value
text data
specific field
news
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711200078.8A
Other languages
Chinese (zh)
Other versions
CN108038790A (en)
Inventor
章昭辉
蒋昌俊
王鹏伟
王海建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University filed Critical Donghua University
Priority to CN201711200078.8A priority Critical patent/CN108038790B/en
Publication of CN108038790A publication Critical patent/CN108038790A/en
Application granted granted Critical
Publication of CN108038790B publication Critical patent/CN108038790B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations

Abstract

The invention relates to a situation analysis system with internal and external data fusion, comprising a data acquisition module, a text quantification calculation module, a causal lag analysis module, and a situation prediction module. The invention can predict how the relevant indexes will develop: given current text data such as internet news, it predicts the current trend of index changes. It combines text topic classification, time-difference correlation analysis, and regression prediction for situation analysis in a specific field, which is a new combination of methods. Because the statistical index data of a specific field usually lags behind text data such as internet news, the future development of the field can be predicted from internet text data together with the field's historical statistical indexes, which helps the supervisory departments of the field make well-founded decisions.

Description

Situation analysis system with internal and external data fusion
Technical Field
The invention relates to a system for fusing and analyzing internet big data and internal data in a specific field.
Background
The fusion of multi-source heterogeneous data first requires quantizing unstructured text data. The main methods currently used for text quantization are sentiment analysis, which calculates a sentiment tendency value for news text, and topic models, which cluster texts by topic. Sentiment analysis is used in Shroff G, Agarwal P, Dey L. Enterprise information fusion for real-time business intelligence [C] (Proceedings of the International Conference on Information Fusion. IEEE, 2011: 1-8) and Li J, Xu Z, Xu H, et al. Forecasting oil price trends with sentiment of online news articles [J] (Procedia Computer Science, 2016, 91: 1081-). A further study, an empirical analysis based on data mining technology [J] (Mathematical Statistics and Management, 2016, 35(2): 215-224), proposes clustering news documents by topic with a topic model and quantifies text by the proportion of each topic in the analysis stage. However, that method uses only the short-term trend of the topics of interest, and the topics it obtains are uncertain. For causal lag analysis, time-difference correlation analysis is currently the main method: Zhaoweing, Jiayongfei, Qianwei. Research on a real estate early-warning system [J] (Journal of Tianjin University: Social Science Edition, 1999(4): 277-280) and Jiangxiao, Wangwangjian. A correlation analysis method with aligned time-series data curves [J] (Journal of Software, 2014(9): 2002-2017) use it to screen indexes and make predictions.
These methods cannot directly reflect the development of a specific field, because the field's news data and index data (for example, the health of enterprise operations in a free-trade zone) are structurally inconsistent and the influence between them is lagged.
Disclosure of Invention
The purpose of the invention is to combine internet data with the internal data of a specific field, and thereby assess the situation of the relevant indexes of that field.
To achieve this purpose, the technical solution of the invention provides a situation analysis system with internal and external data fusion, comprising:
a data acquisition module, configured to acquire news text data related to a specific field from the Internet;
a text quantification calculation module, configured to convert unstructured news text data into structured numerical information; during conversion it extracts the topics of the news texts with an improved naive Bayes classification algorithm and obtains the number of texts of each topic in each time period according to the topic and date of each news item, through the following steps:
Step A1: for a given topic set $tc = \{tc_1, tc_2, \ldots, tc_k\}$ of $k$ topics, classify the news text data with the naive Bayes classification algorithm; after obtaining the posterior probability $P$ of each topic category, calculate the category discrimination value dist of the current news text data $b$; if dist < ε, where ε is a preset threshold, classify $b$ into topic category $tc_i$:
Figure BDA0001481017600000021
where $1 \le i, j \le k$, $P(tc_i \mid b)$ denotes the posterior probability that the current news text data $b$ belongs to topic category $tc_i$, and
Figure BDA0001481017600000022
denotes the maximum of the posterior probabilities that $b$ belongs to a topic category;
Step A2: according to the topic category and news date of the news text data, obtain the number of news texts of each topic in each time period, using the time span of the index data;
a causal lag analysis module, configured to calculate the time difference of the causal lag influence between the topic categories and the relevant indexes, that is, to find the lag at which each topic category is most strongly correlated with the relevant indexes;
and a situation prediction module, configured to train a prediction model with the topic categories obtained by the text quantification calculation module, the internet statistical data, and the relevant internal platform data of the specific field, and to calculate the predicted index values for a future period.
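Steps A1 and A2 can be illustrated with a minimal Python sketch. The formula for dist is available here only as an image, so the sketch assumes dist is the gap between the largest posterior and the posterior of the candidate topic; that definition, the function names `assign_topics` and `count_topics_per_period`, and the sample posteriors are all hypothetical. The posteriors are taken as given, so any naive Bayes implementation could supply them.

```python
from collections import Counter

def assign_topics(posteriors, eps):
    """Step A1 (sketch): keep every topic whose posterior is close to the best one.

    Assumption: dist_i = max_j P(tc_j | b) - P(tc_i | b). The patent's own
    definition of dist is not recoverable from this text.
    """
    p_max = max(posteriors.values())
    return [topic for topic, p in posteriors.items() if p_max - p < eps]

def count_topics_per_period(articles, eps):
    """Step A2 (sketch): count news texts per (period, topic).

    `articles` is an iterable of (period, posteriors) pairs, where `period`
    is already bucketed to the time span of the index data (e.g. a month).
    """
    counts = Counter()
    for period, posteriors in articles:
        for topic in assign_topics(posteriors, eps):
            counts[(period, topic)] += 1
    return counts

# Hypothetical posteriors for three news items.
articles = [
    ("2017-03", {"policy": 0.48, "finance": 0.45, "logistics": 0.07}),
    ("2017-03", {"policy": 0.90, "finance": 0.06, "logistics": 0.04}),
    ("2017-04", {"policy": 0.10, "finance": 0.20, "logistics": 0.70}),
]
counts = count_topics_per_period(articles, eps=0.05)
```

Under this assumed definition the first item receives two labels, because its two leading posteriors differ by less than ε, which is what makes the classification multi-label.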
Preferably, the causal lag analysis module is implemented through the following steps:
Step B1: select the situation-related index as the reference variable $y$, and the news topics and internet statistical data $x = \{x_1, x_2, \ldots, x_m\}$ of the same time period as candidate indexes;
Step B2: determine the influence tendency value tw between each topic category and the reference variable according to prior knowledge;
Step B3: calculate
Figure BDA0001481017600000023
Figure BDA0001481017600000024
where $d = 0, \pm 1, \pm 2, \pm 3, \ldots, \pm D$ denotes the number of lead or lag periods; a negative $d$ indicates that the variable $x$ has a lagged influence on the reference variable, and a positive $d$ indicates that $x$ has a leading influence on the reference variable;
Figure BDA0001481017600000025
denotes the correlation coefficient between the quantized value of topic $tc_i$ and the reference variable $y$ at time difference $d$, and $D$ denotes the maximum selectable time difference in the causal lag relationship;
Step B4: according to
Figure BDA0001481017600000026
obtain
Figure BDA0001481017600000027
where
Figure BDA0001481017600000028
is the maximum correlation coefficient between the quantized value of topic $tc_i$ and the reference variable $y$; then, for
Figure BDA0001481017600000031
find the corresponding time difference
Figure BDA0001481017600000032
The situation prediction module takes the topic category values obtained by the text quantification calculation module, the internet statistical data, and the relevant internal platform data of the specific field, inputs them according to the relation given by the time difference
Figure BDA0001481017600000033
obtained by the causal lag analysis, trains a prediction model, and calculates the predicted index values for a future period.
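Steps B3 and B4 can be sketched in Python as a search over time differences. The formulas themselves are available here only as images, so the sketch assumes the ordinary Pearson correlation on the overlapping samples, assumes that a positive $d$ pairs $x_t$ with $y_{t+d}$ (x leads y), and assumes that tw acts as a sign (+1 or -1) applied to the coefficient before taking the maximum. The names `lagged_correlation` and `best_lag` and the synthetic series are hypothetical.

```python
import numpy as np

def lagged_correlation(x, y, d):
    """Pearson correlation between x[t] and y[t + d] on the overlapping samples.

    Assumed convention: d > 0 means x leads y by d periods, d < 0 means x lags y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    if d >= 0:
        xs, ys = x[:n - d], y[d:]
    else:
        xs, ys = x[-d:], y[:n + d]
    return float(np.corrcoef(xs, ys)[0, 1])

def best_lag(x, y, max_lag, tw=1):
    """Step B4 (sketch): return the time difference with the largest tw * r_d.

    `tw` is treated as +1 or -1, the expected direction of influence; this
    reading of the influence tendency value is an assumption.
    """
    scores = {d: tw * lagged_correlation(x, y, d)
              for d in range(-max_lag, max_lag + 1)}
    d_star = max(scores, key=scores.get)
    return d_star, scores[d_star]

# Synthetic check: y repeats x two periods later, so x leads y by 2.
rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = np.roll(x, 2)
d_star, r_star = best_lag(x, y, max_lag=4)
```

Restricting each coefficient to the overlapping samples keeps the comparison simple, at the cost of using fewer points for larger time differences.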
The invention can predict how the relevant indexes will develop: given current text data such as internet news, it predicts the current trend of index changes. It combines text topic classification, time-difference correlation analysis, and regression prediction for situation analysis in a specific field, which is a new combination of methods. Because the statistical index data of a specific field usually lags behind text data such as internet news, the future development of the field can be predicted from internet text data together with the field's historical statistical indexes, which helps the supervisory departments of the field make well-founded decisions.
Drawings
FIG. 1 is a block diagram of the system of the present invention;
FIG. 2 is a flow chart of text quantization calculation;
FIG. 3 is a flow chart of causal hysteresis analysis.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
As shown in FIG. 1, the present invention provides a situation analysis system with internal and external data fusion, comprising:
a data acquisition module, configured to acquire from the Internet information on the enterprises and policies relevant to a specific field, together with news text data such as news and bulletins related to those enterprises and policies;
and a text quantification calculation module, configured to convert the unstructured news text data into structured numerical information (the number of texts of each topic in each time period). Based on domain knowledge obtained through communication with the specific field, the invention defines a topic set $tc = \{tc_1, tc_2, \ldots, tc_k\}$ of $k$ topics, extracts the topics of the news text data with a multi-label classification algorithm based on the naive Bayes model, and obtains the number of texts of each topic in each time period according to the topic and date of each news item, through the following steps:
Step A1: classify the news text data by topic with the naive Bayes multi-label classification algorithm; after obtaining the posterior probability $P$ of each topic category, calculate the category discrimination value dist of the current news text data $b$; if dist < ε, where ε is a preset threshold, classify $b$ into topic category $tc_i$:
Figure BDA0001481017600000041
where $1 \le i, j \le k$, $P(tc_i \mid b)$ denotes the posterior probability that the current news text data $b$ belongs to topic category $tc_i$, and
Figure BDA0001481017600000042
denotes the maximum of the posterior probabilities that $b$ belongs to a topic category;
Step A2: according to the topic category and news date of the news text data, obtain the number of news texts of each topic in each time period, using the time span of the index data;
a causal lag analysis module, configured to calculate the time difference $d$ of the causal lag influence between the topic categories and the relevant indexes, that is, to find the lag at which each topic category is most strongly correlated with the relevant indexes, through the following steps:
Step B1: select the situation-related index as the reference variable $y$, and the news topics and internet statistical data $x = \{x_1, x_2, \ldots, x_m\}$ of the same time period as candidate indexes;
Step B2: determine the influence tendency value tw between each topic category and the reference variable according to prior knowledge;
Step B3: calculate
Figure BDA0001481017600000043
Figure BDA0001481017600000044
where $d = 0, \pm 1, \pm 2, \pm 3, \ldots, \pm D$ denotes the number of lead or lag periods; a negative $d$ indicates that the variable $x$ has a lagged influence on the reference variable, and a positive $d$ indicates that $x$ has a leading influence on the reference variable;
Figure BDA0001481017600000045
denotes the correlation coefficient between the quantized value of topic $tc_i$ and the reference variable $y$ at time difference $d$, and $D$ denotes the maximum selectable time difference in the causal lag relationship;
Step B4: according to
Figure BDA0001481017600000046
obtain
Figure BDA0001481017600000047
where
Figure BDA0001481017600000048
is the maximum correlation coefficient between the quantized value of topic $tc_i$ and the reference variable $y$; then, for
Figure BDA0001481017600000049
find the corresponding time difference
Figure BDA00014810176000000410
and a situation prediction module, which takes the topic category values obtained by the text quantification calculation module, the internet statistical data, and the relevant internal platform data of the specific field, inputs them according to the relation given by the time difference
Figure BDA00014810176000000411
obtained by the causal lag analysis, trains a prediction model, and calculates the predicted index values for a future period.
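The lag-aligned input of the situation prediction module can be sketched as follows. The text does not specify the prediction model beyond naming regression prediction, so ordinary least squares is used here purely as a stand-in; the sketch also assumes every feature leads the index by at least one period, so that the next value can be predicted from data already observed. The names `align_with_lags`, `fit_linear`, `predict_next` and the synthetic data are hypothetical.

```python
import numpy as np

def align_with_lags(features, y, lags):
    """Build a design matrix whose row for period t holds feature j at t - lags[j].

    Only non-negative lags (feature leads the index) are handled in this sketch.
    """
    features = np.asarray(features, dtype=float)
    y = np.asarray(y, dtype=float)
    start = max(lags)
    cols = [features[start - lag:len(y) - lag, j] for j, lag in enumerate(lags)]
    X = np.column_stack(cols + [np.ones(len(y) - start)])  # last column: intercept
    return X, y[start:]

def fit_linear(X, target):
    """Least-squares fit; a stand-in for whatever regression model is chosen."""
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

def predict_next(features, coef, lags):
    """Predict the index for the period right after the last observed one.

    Requires every lag to be at least 1 so the needed inputs already exist.
    """
    features = np.asarray(features, dtype=float)
    T = features.shape[0]
    row = [features[T - lag, j] for j, lag in enumerate(lags)] + [1.0]
    return float(np.dot(row, coef))

# Synthetic data: the index depends on topic count 0 one period earlier
# and on topic count 1 two periods earlier.
rng = np.random.default_rng(1)
feats = rng.normal(size=(30, 2))
y = np.zeros(30)
for t in range(2, 30):
    y[t] = 2.0 * feats[t - 1, 0] + 3.0 * feats[t - 2, 1] + 1.0

lags = [1, 2]  # time differences, e.g. as found by the causal lag analysis
X, target = align_with_lags(feats, y, lags)
coef = fit_linear(X, target)
forecast = predict_next(feats, coef, lags)
```

Shifting each input by its own time difference is what lets a single regression combine topics that act on the index with different delays.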

Claims (2)

1. A situation analysis system with internal and external data fusion, comprising:
a data acquisition module, configured to acquire news text data related to a specific field from the Internet;
a text quantification calculation module, configured to convert unstructured news text data into structured numerical information, wherein during conversion the text quantification calculation module extracts the topics of the news texts with a multi-label classification algorithm based on the naive Bayes model, through the following steps:
Step A1: for a given topic set $tc = \{tc_1, tc_2, \ldots, tc_k\}$ of $k$ topics, classify the news text data with the naive Bayes multi-label classification algorithm; after obtaining the posterior probability $P$ of each topic category, calculate the category discrimination value dist of the current news text data $b$; if dist < ε, where ε is a preset threshold, classify $b$ into topic category $tc_i$:
Figure FDA0003124122510000011
where $P(tc_i \mid b)$ denotes the posterior probability that the current news text data $b$ belongs to topic category $tc_i$, and
Figure FDA0003124122510000012
denotes the maximum of the posterior probabilities that $b$ belongs to a topic category;
Step A2: count the news text data according to its topic category and document date, using the time span of the index data;
a causal lag analysis module, configured to calculate the time difference of the causal lag influence between the topic categories and the relevant indexes, that is, to find the lag at which each topic category is most strongly correlated with the relevant indexes;
and a situation prediction module, configured to take the topic categories obtained by the text quantification calculation module, the internet statistical data, and the relevant internal platform data of the specific field, input them according to the relation given by the time difference $d$ obtained by the causal lag analysis, train a prediction model, and calculate the predicted index values for a future period.
2. The situation analysis system with internal and external data fusion according to claim 1, wherein the causal lag analysis module is implemented through the following steps:
Step B1: select the situation-related index as the reference variable $y = \{y_1, y_2, \ldots, y_m\}$, and the news topics and internet statistical data $x = \{x_1, x_2, \ldots, x_m\}$ of the same time period as candidate indexes;
Step B2: determine the influence tendency value tw between each topic category and the reference variable according to prior knowledge;
Step B3: calculate
Figure FDA0003124122510000013
Figure FDA0003124122510000014
where $d = 0, \pm 1, \pm 2, \pm 3, \ldots, \pm D$ denotes the number of lead or lag periods; a negative $d$ indicates that the variable $x$ has a lagged influence on the reference variable, and a positive $d$ indicates that $x$ has a leading influence on the reference variable;
Figure FDA0003124122510000021
denotes the correlation coefficient between the quantized value of topic $tc_i$ and the reference variable $y$ at time difference $d$, and $D$ denotes the maximum selectable time difference in the causal lag relationship;
Step B4: according to
Figure FDA0003124122510000022
obtain
Figure FDA0003124122510000023
where
Figure FDA0003124122510000024
is the maximum correlation coefficient between the quantized value of topic $tc_i$ and the reference variable $y$; then, for
Figure FDA0003124122510000025
find the corresponding time difference
Figure FDA0003124122510000026
the situation prediction module takes the topic category values obtained by the text quantification calculation module, the internet statistical data, and the relevant internal platform data of the specific field, inputs them according to the relation given by the time difference
Figure FDA0003124122510000027
obtained by the causal lag analysis, trains a prediction model, and calculates the predicted index values for a future period.
CN201711200078.8A 2017-11-24 2017-11-24 Situation analysis system with internal and external data fusion Active CN108038790B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711200078.8A CN108038790B (en) 2017-11-24 2017-11-24 Situation analysis system with internal and external data fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711200078.8A CN108038790B (en) 2017-11-24 2017-11-24 Situation analysis system with internal and external data fusion

Publications (2)

Publication Number Publication Date
CN108038790A CN108038790A (en) 2018-05-15
CN108038790B true CN108038790B (en) 2021-10-15

Family

ID=62093910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711200078.8A Active CN108038790B (en) 2017-11-24 2017-11-24 Situation analysis system with internal and external data fusion

Country Status (1)

Country Link
CN (1) CN108038790B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110990777B (en) * 2019-07-03 2022-03-18 北京市应急管理科学技术研究院 Data relevance analysis method and system and readable storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
US6873325B1 (en) * 1999-06-30 2005-03-29 Bayes Information Technology, Ltd. Visualization method and visualization system
CN101297337A (en) * 2005-09-29 2008-10-29 微软公司 Methods for predicting destinations from partial trajectories employing open-and closed-world modeling methods
CN105224608A (en) * 2015-09-06 2016-01-06 华南理工大学 The hot news Forecasting Methodology analyzed based on microblog data and system

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20070130585A1 (en) * 2005-12-05 2007-06-07 Perret Pierre A Virtual Store Management Method and System for Operating an Interactive Audio/Video Entertainment System According to Viewers Tastes and Preferences

Also Published As

Publication number Publication date
CN108038790A (en) 2018-05-15

Similar Documents

Publication Publication Date Title
CN107633265B (en) Data processing method and device for optimizing credit evaluation model
Zhang et al. Enhancing stock market prediction with extended coupled hidden Markov model over multi-sourced data
WO2018040068A1 (en) Knowledge graph-based semantic analysis system and method
CN108108743B (en) Abnormal user identification method and device for identifying abnormal user
CN108647249B (en) Public opinion data prediction method, device, terminal and storage medium
Perdana et al. Combining likes-retweet analysis and naive bayes classifier within twitter for sentiment analysis
CN111435463A (en) Data processing method and related equipment and system
WO2021103401A1 (en) Data object classification method and apparatus, computer device and storage medium
Wu et al. Predicting the hate: A Gstm model based on Covid-19 hate speech datasets
CN111160959A (en) User click conversion estimation method and device
Katevas et al. Practical processing of mobile sensor data for continual deep learning predictions
Zhang et al. Emotional component analysis and forecast public opinion on micro-blog posts based on maximum entropy model
Baranowski et al. Social welfare in the light of topic modelling
Darena et al. Machine learning-based analysis of the association between online texts and stock price movements
CN108038790B (en) Situation analysis system with internal and external data fusion
Wlodarczyk et al. Current trends in predictive analytics of big data
CN107644042B (en) Software program click rate pre-estimation sorting method and server
CN110428102B (en) HC-TC-LDA-based major event trend prediction method
CN109871889B (en) Public psychological assessment method under emergency
Harvey et al. Machine Learning-Based Models for Assessing Impacts Before, During and After Hurricane Florence
CN112256884A (en) Knowledge graph-based data asset library access method and device
CN116596662A (en) Risk early warning method and device based on enterprise public opinion information, electronic equipment and medium
CN116756688A (en) Public opinion risk discovery method based on multi-mode fusion algorithm
Gutsche Automatic weak signal detection and forecasting
Гавриленко et al. Тhe task of analyzing publications to build a forecast for changes in cryptocurrency rates

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant