CN108038790B - Situation analysis system with internal and external data fusion - Google Patents
- Publication number
- CN108038790B (application number CN201711200078.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- value
- text data
- specific field
- news
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
Abstract
The invention relates to a situation analysis system with internal and external data fusion, comprising a data acquisition module, a text quantification calculation module, a causal lag analysis module, and a situation prediction module. The invention can effectively predict how the relevant indexes will develop: given current text data such as internet news, it predicts the trend of the index. Combining text topic classification, time difference correlation analysis, and regression prediction for situation analysis in a specific field is a new methodological contribution. Because the statistical index data of a specific field usually lags behind text data such as internet news, the future situation of the field can be well predicted from internet text data together with the field's historical statistical indexes, which helps the supervisory department of the field make scientific decisions.
Description
Technical Field
The invention relates to a system for fusing and analyzing internet big data and internal data in a specific field.
Background
The fusion of multi-source heterogeneous data first requires quantifying unstructured text data. Two methods are mainly used for text quantification at present: sentiment analysis, which calculates a sentiment tendency value for each news text, and topic models, which cluster the texts by topic. Shroff G, Agarwal P, Dey L. Enterprise information fusion for real-time business intelligence [C] (Proceedings of the International Conference on Information Fusion. IEEE, 2011: 1-8) and Li J, Xu Z, Xu H, et al. Forecasting oil price trends with sentiment of online news articles [J] (Procedia Computer Science, 2016, 91: 1081-) are examples of the sentiment analysis approach. "An empirical analysis based on data mining technology" [J] (Mathematical Statistics and Management, 2016, 35(2): 215-224) proposes clustering news documents by topic with a topic model, and quantifies the text by the proportion of each topic in the analysis stage. However, this method only uses short-term changes in the trend of the topics of interest, and the topics it obtains are uncertain. For causal lag analysis, the time difference correlation analysis method is mainly used at present: Zhaoweing, Jiayongfei, Qianwei. Research on a real estate early warning system [J] (Journal of Tianjin University: Social Science Edition, 1999(4): 277-280) and Jiangxiao, Wangwangjian. A correlation analysis method with aligned time series data curves [J] (Journal of Software, 2014(9): 2002-2017) use it to screen indexes and make predictions.
However, news data and index data of a specific field (for example, the health of business operations in a free trade zone) differ in structure, and the influence of one on the other is subject to a time lag, so the above methods cannot directly reflect how the specific field is developing.
Disclosure of Invention
The purpose of the invention is to combine internet data with the internal data of a specific field, so as to assess the situation of the relevant indexes of that field.
In order to achieve the above object, a technical solution of the present invention is to provide an internal and external data fusion situation analysis system, which is characterized by comprising:
the data acquisition module is used for acquiring news text data related to a specific field from the Internet;
the text quantification calculation module is used for converting unstructured news text data into structured numerical information; during conversion it extracts the topics of the news texts by using an improved naive Bayes classification algorithm and obtains the number of texts on each topic in each time period according to the date of each news item, and comprises the following steps:
step A1, for a given topic set $tc = \{tc_1, tc_2, \ldots, tc_k\}$ of k topics, classifying the news text data by using a naive Bayes classification algorithm; after the posterior probability P of each topic category is obtained during classification, calculating a category discrimination value dist for the current news text data b from $P(tc_i \mid b)$ and $\max_{1 \le j \le k} P(tc_j \mid b)$, and, if dist < ε, where ε is a preset threshold, classifying the current news text data b into the topic category $tc_i$;
in the formula, $1 \le i, j \le k$; $P(tc_i \mid b)$ denotes the posterior probability that the current news text data b belongs to the topic category $tc_i$, and $\max_{1 \le j \le k} P(tc_j \mid b)$ denotes the maximum posterior probability of b over the topic categories;
step A2, obtaining the number of news texts on each topic in each time period according to the topic category and news date of the news text data and the time span of the index data;
the causal lag analysis module is used for calculating and determining the time difference of the causal lag influence between the topic categories and the relevant indexes, that is, finding the lag at which the correlation between each topic category and the relevant index is greatest;
and the situation prediction module is used for training a prediction model by using the topic categories obtained by the text quantification calculation module, the internet statistical data and the related internal data of the specific field platform, and calculating the predicted index value for a future period of time.
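The two steps of the text quantification calculation module can be sketched as follows. This is a minimal illustration, not the patented algorithm: the posterior probabilities are taken as given (any naive Bayes classifier could supply them), the form `dist = max posterior - P(tc_i | b)` is an assumption because the text does not reproduce the dist formula, and the function names, the monthly period, and the toy data are all hypothetical.

```python
from collections import Counter
from datetime import date

def assign_topics(posteriors, epsilon):
    """Return every topic whose posterior is within epsilon of the best one.

    ASSUMPTION: dist_i = max_j P(tc_j | b) - P(tc_i | b). The source only
    indicates that dist depends on these two quantities.
    """
    best = max(posteriors.values())
    return [topic for topic, p in posteriors.items() if best - p < epsilon]

def count_topics_per_period(docs, epsilon):
    """Count, per (year, month) period, how many texts carry each topic.

    docs: iterable of (news_date, posteriors) pairs.
    """
    counts = Counter()
    for news_date, posteriors in docs:
        # A monthly period is an assumption; use the span of the index data.
        period = (news_date.year, news_date.month)
        for topic in assign_topics(posteriors, epsilon):
            counts[(period, topic)] += 1
    return counts

# Toy posteriors for three news texts (hypothetical values).
docs = [
    (date(2017, 1, 5), {"policy": 0.55, "trade": 0.40, "finance": 0.05}),
    (date(2017, 1, 20), {"policy": 0.10, "trade": 0.85, "finance": 0.05}),
    (date(2017, 2, 3), {"policy": 0.48, "trade": 0.04, "finance": 0.48}),
]
counts = count_topics_per_period(docs, epsilon=0.2)
```

Because a text is kept for every topic that passes the threshold, one news item can contribute to several topic counts, which is what makes the classification multi-label.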
Preferably, the causal lag analysis module is implemented by the following steps:
step B1, selecting the situation-related index as the reference variable y, and the news topics and internet statistical data $x = \{x_1, x_2, \ldots, x_m\}$ of the same time period as candidate indexes;
step B2, determining an influence tendency value tw between each topic category and the reference variable according to prior knowledge;
step B3, calculating, for each time difference d, the correlation coefficient between the quantized value of topic $tc_i$ and the reference variable y, wherein $d = 0, \pm 1, \pm 2, \pm 3, \ldots, \pm D$ denotes the number of lead or lag periods: a negative d indicates that the x variable has a lagging influence on the reference variable, and a positive d indicates that the x variable has a leading influence on the reference variable; D denotes the maximum selectable time difference in the causal lag relationship;
step B4, taking the maximum of these correlation coefficients over d as the maximum correlation coefficient between the quantized value of topic $tc_i$ and the reference variable y, and then finding the corresponding time difference d.
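For orientation, the time difference correlation coefficient used in the causal lag analysis steps above is commonly written in the form below. This is the textbook form, given here as an assumption rather than as the formula of the invention: the symbols $r_d$, $\bar{x}$, $\bar{y}$, $d^{*}$ and the time index t are not taken from the source, the pairing of $x_t$ with $y_{t+d}$ is chosen so that a positive d means x leads y as stated above, and the influence tendency value tw is left out because the text does not say how it enters the calculation. The sums run over every t for which both $x_t$ and $y_{t+d}$ are observed, and the means are taken over the same range.

```latex
r_d = \frac{\sum_{t} (x_t - \bar{x})(y_{t+d} - \bar{y})}
           {\sqrt{\sum_{t} (x_t - \bar{x})^2 \, \sum_{t} (y_{t+d} - \bar{y})^2}},
\qquad d = 0, \pm 1, \pm 2, \ldots, \pm D

d^{*} = \arg\max_{-D \le d \le D} r_d
```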
The situation prediction module takes as input the topic category values obtained by the text quantification calculation module, the internet statistical data and the related internal platform data of the specific field, aligned according to the time difference obtained by the causal lag analysis, trains a prediction model, and calculates the predicted index value for a future period of time.
The invention can effectively predict how the relevant indexes will develop: given current text data such as internet news, it predicts the trend of the index. Combining text topic classification, time difference correlation analysis, and regression prediction for situation analysis in a specific field is a new methodological contribution. Because the statistical index data of a specific field usually lags behind text data such as internet news, the future situation of the field can be well predicted from internet text data together with the field's historical statistical indexes, which helps the supervisory department of the field make scientific decisions.
Drawings
FIG. 1 is a block diagram of the system of the present invention;
FIG. 2 is a flow chart of text quantization calculation;
FIG. 3 is a flow chart of causal lag analysis.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
As shown in fig. 1, the present invention provides a situation analysis system with internal and external data fusion, comprising:
the data acquisition module is used for acquiring, from the Internet, news text data (such as news and announcements) about the enterprises and policies related to a specific field;
the text quantification calculation module is used for converting the unstructured news text data into structured numerical information (the number of texts on each topic in each time period). Based on domain knowledge obtained through communication with the specific field, a topic set $tc = \{tc_1, tc_2, \ldots, tc_k\}$ consisting of k topics is given; the topics of the news text data are extracted by using a multi-label classification algorithm based on a naive Bayes model, and the number of texts on each topic in each time period is obtained according to the date of each news item, comprising the following steps:
step A1, classifying the news text data by topic with the naive Bayes multi-label classification algorithm; after the posterior probability P of each topic category is obtained during classification, calculating a category discrimination value dist for the current news text data b from $P(tc_i \mid b)$ and $\max_{1 \le j \le k} P(tc_j \mid b)$, and, if dist < ε, where ε is a preset threshold, classifying the current news text data b into the topic category $tc_i$;
in the formula, $1 \le i, j \le k$; $P(tc_i \mid b)$ denotes the posterior probability that the current news text data b belongs to the topic category $tc_i$, and $\max_{1 \le j \le k} P(tc_j \mid b)$ denotes the maximum posterior probability of b over the topic categories;
step A2, obtaining the number of news texts on each topic in each time period according to the topic category and news date of the news text data and the time span of the index data;
the causal lag analysis module is used for calculating and determining the time difference d of the causal lag influence between the topic categories and the relevant indexes, that is, finding the lag at which the correlation between each topic category and the relevant index is greatest, and comprises the following steps:
step B1, selecting the situation-related index as the reference variable y, and the news topics and internet statistical data $x = \{x_1, x_2, \ldots, x_m\}$ of the same time period as candidate indexes;
step B2, determining an influence tendency value tw between each topic category and the reference variable according to prior knowledge;
step B3, calculating, for each time difference d, the correlation coefficient between the quantized value of topic $tc_i$ and the reference variable y, wherein $d = 0, \pm 1, \pm 2, \pm 3, \ldots, \pm D$ denotes the number of lead or lag periods: a negative d indicates that the x variable has a lagging influence on the reference variable, and a positive d indicates that the x variable has a leading influence on the reference variable; D denotes the maximum selectable time difference in the causal lag relationship;
step B4, taking the maximum of these correlation coefficients over d as the maximum correlation coefficient between the quantized value of topic $tc_i$ and the reference variable y, and then finding the corresponding time difference d.
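The lag scan described in the causal lag analysis steps can be sketched in a few lines. This is a minimal sketch under stated assumptions: it uses the plain Pearson coefficient between x[t] and y[t + d], so that a positive d means x leads y; it ignores the influence tendency value tw, whose role the text does not spell out; and the function names and toy series are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / sqrt(vx * vy)

def lagged_corr(x, y, d):
    """Correlate x[t] with y[t + d] over the range where both exist."""
    if d >= 0:
        return pearson(x[: len(x) - d], y[d:])
    return pearson(x[-d:], y[: len(y) + d])

def best_time_difference(x, y, max_d):
    """Scan d = -max_d..max_d and return (d, r) with the largest coefficient."""
    scores = {d: lagged_corr(x, y, d) for d in range(-max_d, max_d + 1)}
    best_d = max(scores, key=scores.get)
    return best_d, scores[best_d]

# Toy data: the index y repeats the topic count x two periods later.
x = [3, 7, 2, 9, 4, 8, 1, 6, 5, 10, 2, 7]
y = [0, 0] + x[:-2]
d_star, r_star = best_time_difference(x, y, max_d=3)
```

On this toy series the scan recovers a time difference of two periods. With real data the overlapping range shrinks as |d| grows, so the maximum time difference D should stay small relative to the series length.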
The situation prediction module takes as input the topic category values obtained by the text quantification calculation module, the internet statistical data and the related internal platform data of the specific field, aligned according to the time difference obtained by the causal lag analysis, trains a prediction model, and calculates the predicted index value for a future period of time.
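How the time difference feeds the prediction step can be illustrated with a deliberately small model. The text does not specify the prediction model beyond a regression method, so ordinary least squares on a single lagged topic series is an assumption made here for illustration; the function names and the toy series are hypothetical.

```python
def fit_lagged_line(x, y, d):
    """Fit y[t + d] ~ a * x[t] + b by ordinary least squares.

    Requires d >= 1, i.e. the topic series x leads the index y by d periods.
    """
    xs, ys = x[: len(x) - d], y[d:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(xs, ys))
    sxx = sum((u - mx) ** 2 for u in xs)
    a = sxy / sxx
    b = my - a * mx
    return a, b

def forecast(x, a, b, d):
    """Predict the next d index values from the last d observed topic counts."""
    return [a * u + b for u in x[-d:]]

# Toy data: the index follows 2 * x + 1 with a delay of two periods.
x = [3, 7, 2, 9, 4, 8, 1, 6]
y = [0, 0] + [2 * u + 1 for u in x[:-2]]
a, b = fit_lagged_line(x, y, d=2)
future = forecast(x, a, b, d=2)
```

Because the topic counts lead the index by d periods, the most recent d topic counts already determine the inputs for the next d index values, which is what allows a forecast for a future period of time.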
Claims (2)
1. A situational analysis system for fusion of internal and external data, comprising:
the data acquisition module is used for acquiring news text data related to a specific field from the Internet;
the text quantification calculation module is used for converting unstructured news text data into structured numerical information, and during conversion, the text quantification calculation module extracts topics of news texts by using a multi-label classification algorithm based on a naive Bayes model, and comprises the following steps of:
step A1, for a given topic set $tc = \{tc_1, tc_2, \ldots, tc_k\}$ of k topics, classifying the news text data by using a naive Bayes multi-label classification algorithm; after the posterior probability P of each topic category is obtained during classification, calculating a category discrimination value dist for the current news text data b from $P(tc_i \mid b)$ and $\max_{1 \le j \le k} P(tc_j \mid b)$, and, if dist < ε, where ε is a preset threshold, classifying the current news text data b into the topic category $tc_i$;
in the formula, $P(tc_i \mid b)$ denotes the posterior probability that the current news text data b belongs to the topic category $tc_i$, and $\max_{1 \le j \le k} P(tc_j \mid b)$ denotes the maximum posterior probability of b over the topic categories;
step A2, counting the news text data according to its topic category and document date and the time span of the index data;
the causal lag analysis module is used for calculating and determining the time difference of the causal lag influence between the topic categories and the relevant indexes, that is, finding the lag at which the correlation between each topic category and the relevant index is greatest;
and the situation prediction module is used for taking as input, aligned according to the time difference d obtained by the causal lag analysis, the topic categories obtained by the text quantification calculation module, the internet statistical data and the related internal data of the specific field platform, training a prediction model, and calculating the predicted index value for a future period of time.
2. A system for situational analysis with fusion of internal and external data according to claim 1 wherein said causal lag analysis module is implemented by:
step B1, selecting the situation-related index as the reference variable $y = \{y_1, y_2, \ldots, y_m\}$, and the news topics and internet statistical data $x = \{x_1, x_2, \ldots, x_m\}$ of the same time period as candidate indexes;
step B2, determining an influence tendency value tw between each topic category and the reference variable according to prior knowledge;
step B3, calculating, for each time difference d, the correlation coefficient between the quantized value of topic $tc_i$ and the reference variable y, wherein $d = 0, \pm 1, \pm 2, \pm 3, \ldots, \pm D$ denotes the number of lead or lag periods: a negative d indicates that the x variable has a lagging influence on the reference variable, and a positive d indicates that the x variable has a leading influence on the reference variable; D denotes the maximum selectable time difference in the causal lag relationship;
step B4, taking the maximum of these correlation coefficients over d as the maximum correlation coefficient between the quantized value of topic $tc_i$ and the reference variable y, and then finding the corresponding time difference d;
the situation prediction module takes as input the topic category values obtained by the text quantification calculation module, the internet statistical data and the related internal platform data of the specific field, aligned according to the time difference obtained by the causal lag analysis, trains a prediction model, and calculates the predicted index value for a future period of time.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711200078.8A CN108038790B (en) | 2017-11-24 | 2017-11-24 | Situation analysis system with internal and external data fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108038790A (en) | 2018-05-15 |
CN108038790B (en) | 2021-10-15 |
Family
ID=62093910
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711200078.8A Active CN108038790B (en) | 2017-11-24 | 2017-11-24 | Situation analysis system with internal and external data fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108038790B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110990777B (en) * | 2019-07-03 | 2022-03-18 | 北京市应急管理科学技术研究院 | Data relevance analysis method and system and readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6873325B1 (en) * | 1999-06-30 | 2005-03-29 | Bayes Information Technology, Ltd. | Visualization method and visualization system |
CN101297337A (en) * | 2005-09-29 | 2008-10-29 | 微软公司 | Methods for predicting destinations from partial trajectories employing open-and closed-world modeling methods |
CN105224608A (en) * | 2015-09-06 | 2016-01-06 | 华南理工大学 | The hot news Forecasting Methodology analyzed based on microblog data and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070130585A1 (en) * | 2005-12-05 | 2007-06-07 | Perret Pierre A | Virtual Store Management Method and System for Operating an Interactive Audio/Video Entertainment System According to Viewers Tastes and Preferences |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||