CN115115159A - TF-IDF and fuzzy Bayesian network-based risk prediction method - Google Patents
- Publication number
- CN115115159A (CN202111030602.8A)
- Authority
- CN
- China
- Prior art keywords
- emergency
- fuzzy
- bayesian network
- idf
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0635—Risk analysis of enterprise or organisation activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
- G06Q50/265—Personal security, identity or safety
Abstract
The invention discloses a risk prediction method based on TF-IDF and a fuzzy Bayesian network, relating to the technical fields of information retrieval, data mining, and emergency prediction and evaluation. The method comprises: acquiring public opinion data on an emergency using web crawler technology; extracting the influence factors of the emergency using TF-IDF text analysis, and constructing an emergency index system in combination with an emergency life cycle evolution model; determining the topology of a fuzzy Bayesian network from the index system, and obtaining the prior probability distribution of its parent nodes using fuzzy theory, natural language variables, and a defuzzification method; and finally predicting the risk probability of the emergency through fuzzy Bayesian network inference, thereby providing scientific and reasonable early-warning support for relevant departments formulating emergency plans.
Description
Technical Field
The invention relates to the fields of information retrieval, data mining and emergency prediction and evaluation, in particular to a risk prediction method based on TF-IDF and a fuzzy Bayesian network.
Background
An emergency is an event that occurs suddenly and causes casualties, property loss, and similar harm. Using the model of the invention, the risk that the situation of an emergency will change is predicted by inference, providing scientific and reasonable theoretical support for relevant departments formulating emergency plans and thereby giving early warning of the emergency.
At present, research on emergencies is mainly based on neural networks, the analytic hierarchy process, fuzzy comprehensive evaluation, and Bayesian networks. Event risk assessment methods mainly comprise the subjective expert evaluation method; the semi-qualitative, semi-quantitative analytic hierarchy process and fuzzy comprehensive evaluation method; and the quantitative neural network and Bayesian network methods. The expert evaluation method can give a quantitative evaluation from the experience of domain experts when data are insufficient, but the experts' subjective judgment strongly influences the result. The analytic hierarchy process is prone to data inconsistency when the research model is complex. Compared with the expert evaluation method and the analytic hierarchy process, the fuzzy comprehensive evaluation method handles fuzzy, complex, uncertain data better, but it places higher demands on the overlapping information in the index system. The neural network method learns well and suits complex models, but its results are not interpretable, it converges slowly, and it is prone to getting stuck in local minima. The Bayesian network method performs causal inference on node relationships according to conditional probabilities, has a clear network structure and an interpretable process, and copes well with complex index systems and incomplete data.
Emergencies are characterized by incomplete data, dynamically changing scenes, and heavy overlap among event index information. When conventional methods are used to predict and evaluate emergency risk, they suffer from strong subjectivity, poor interpretability, and poor adaptability, and therefore cannot meet the needs of emergency prediction in real scenarios.
Disclosure of Invention
To address the inadequacy of prior-art prediction of emergency escalation risk, the invention builds on existing research to provide a risk prediction method based on TF-IDF and a fuzzy Bayesian network, offering a new approach to studying the evolution risk of emergency situations. The method involves data acquisition, text analysis, establishment of an emergency index system, a life cycle evolution model, and inference-based prediction.
The invention provides a risk prediction method based on TF-IDF and a fuzzy Bayesian network, which specifically comprises the following steps:
Step 1: Data acquisition. Using web crawler technology, the Octopus data collector collects original and forwarded posts about the emergency from mainstream media platforms such as Twitter and Facebook, together with event-related data such as the number of forwards, comments, and likes for each post. Daily public opinion data are collected starting from the day the emergency occurs.
Step 2: Text analysis is performed on the daily public opinion data collected in step 1 to extract feature items and determine the feature factors of the emergency. This mainly comprises the following steps:
S1) Domain experts who study the emergency define keyword indicators and use them to preliminarily screen keywords from the post and comment content of the emergency. The keyword indicators include:
1) Words that appear repeatedly in the text data.
2) Names of people, times, and locations in the text data.
3) Emotional words in the text data that express an attitude.
4) Words in the text data that reflect the decision-making behavior of relevant departments, and the names of those departments.
S2) The TF-IDF algorithm is applied to the daily public opinion data collected in step 1 to extract feature items, and the keywords preliminarily screened in step S1) are compared with the feature items obtained by the TF-IDF algorithm to determine the feature factors of the emergency. The TF-IDF algorithm includes:
1) TF (term frequency) in TF-IDF is the word frequency, that is, how often a given word appears in a document. To reduce the error caused by differences in document length, the word frequency is normalized as:

$$\mathrm{tf}_i = \frac{n_{i,d}}{\sum_{k=1}^{n} n_{k,d}}$$

where $\mathrm{tf}_i$ represents the normalized value for word i; $n_{i,d}$ represents the total number of times word i appears in document d; $\sum_{k=1}^{n} n_{k,d}$ represents the number of all words in document d; and $n$ represents the total number of words over which the sum runs.
2) IDF (inverse document frequency) in TF-IDF is the inverse document frequency. When few documents in the emergency corpus contain word i, word i is effective at distinguishing document categories. The emergency corpus here is the Chinese emergency corpus of the Semantic Intelligence Laboratory of Shanghai University. News reports on 5 types of emergencies (earthquake, fire, traffic accident, and so on) were collected from the Internet as raw corpora; the raw corpora then underwent text preprocessing, text analysis, event labeling, and consistency checking, and the labeling results were stored in the corpus.
The inverse document frequency of word i is computed as:

$$\mathrm{idf}_i = \log \frac{|D|}{1 + |\{j : i \in d_j\}|}$$

where $|D|$ represents the total number of documents in the corpus, $|\{j : i \in d_j\}|$ denotes the number of documents containing word i, and $i \in d_j$ indicates that word i belongs to the j-th document $d_j$. To handle the case where word i does not appear in the corpus, that is, $|\{j : i \in d_j\}| = 0$, 1 is added to the denominator.
3) The TF-IDF value measures how well a word distinguishes categories and is expressed as:

$$\mathrm{tf\_idf}_i = \mathrm{tf}_i \times \mathrm{idf}_i$$
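The three quantities above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the toy corpus is invented, the documents are assumed to be already tokenized, and the natural logarithm is an assumption because the text does not state the base.

```python
import math
from collections import Counter

def tf(word, doc_tokens):
    """Normalized term frequency: occurrences of `word` over all tokens in the document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[word] / len(doc_tokens)

def idf(word, corpus):
    """Inverse document frequency with 1 added to the denominator, as described above.
    Natural log is an assumption; the source does not give the base."""
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (1 + containing))

def tf_idf(word, doc_tokens, corpus):
    return tf(word, doc_tokens) * idf(word, corpus)

# Hypothetical, already-tokenized corpus (not from the patent).
corpus = [
    ["fire", "rescue", "fire", "alarm"],
    ["traffic", "accident", "rescue"],
    ["earthquake", "rescue", "shelter"],
    ["flood", "shelter", "warning"],
]
doc = corpus[0]
scores = {word: tf_idf(word, doc, corpus) for word in set(doc)}
```

In this toy corpus "rescue" appears in three of the four documents, so the added 1 drives its IDF to zero and it drops out as a feature item, while "fire" keeps a positive score.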
Step 3: Based on the feature factors of the emergency extracted in step 2 and a review of how earlier events of the same type developed, an emergency life cycle evolution model is determined, dividing the life cycle of the emergency into 5 stages, from stage 1 to stage 5.
Step 4: From the characteristics of each stage of the emergency life cycle evolution model constructed in step 3 and the feature factors of the emergency extracted in step 2, the influence factors that drive changes in the emergency situation are determined, and through detailed analysis a three-level index system is established from eight different perspectives. From top to bottom, the three-level index system represents: the first-level index point, which is the predicted risk of the emergency; the second-level index points, which are the eight perspectives driving changes in the emergency situation; and the third-level index points, which are the influence factors of the emergency.
Step 5: A fuzzy Bayesian network topology for the emergency is determined from the three-level index system established in step 4. The topology comprises parent nodes, which are the basic units; intermediate nodes, which link the parent nodes to the target node; and the target node, which carries the final inference result of the network. The third-level index points of the three-level index system serve as the parent nodes of the topology, the second-level index points serve as the link nodes (that is, the intermediate nodes), and the first-level index point serves as the target node. Bayesian network inference rests on Bayesian theory, which computes the posterior probability, that is, the probability of the target node, by inference from the prior probability distribution of the parent nodes and the conditional probabilities. The joint probability is the probability that two events occur together, denoted $P(A, B)$ and called the joint probability of A and B. For any set of random variables $X = \{x^{t_1}, x^{t_2}, \ldots, x^{t_{n'}}\}$, the joint probability distribution is expressed as:

$$P(x^{t_1}, x^{t_2}, \ldots, x^{t_{n'}}) = \prod_{j'=1}^{n'} P\left(x^{t_{j'}} \mid \mathrm{Pa}(x^{t_{j'}})\right)$$

where $\mathrm{Pa}(x^{t_{j'}})$ denotes the set of parent nodes of the random variable $x$ in period $t_{j'}$, $t_{n'}$ is the time length, taken as 1 day, and $j' = 1, 2, \ldots, n'$. When one node is a parent node of another, the joint probability distribution is expressed by a further formula (the symbols and formula are not reproduced in the source text).

The conditional probability is the probability that event A occurs given that event B has occurred, denoted $P(A \mid B)$ and called the probability of A under condition B. It is calculated as $P(A \mid B) = P(A, B) / P(B)$. For any set of random variables $X$, the conditional probability of the random variable $x^{t_{j'}}$ is expressed as:

$$P\left(x^{t_{j'}} \mid \mathrm{Pa}(x^{t_{j'}})\right) = \frac{P\left(x^{t_{j'}}, \mathrm{Pa}(x^{t_{j'}})\right)}{P\left(\mathrm{Pa}(x^{t_{j'}})\right)}$$

The prior probability is the probability that event A occurs, $P(A)$, or that event B occurs, $P(B)$. The posterior probability is the probability that event A occurs after event B has occurred, called the posterior probability of A and expressed as:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
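As a worked illustration of how the joint, conditional, prior, and posterior probabilities relate, the following sketch applies Bayes' rule to two binary events. The numbers and the interpretation of A and B are hypothetical and are not taken from the patent.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) for a binary event A using Bayes' rule."""
    joint_ab = p_b_given_a * prior_a                          # P(A, B) = P(B | A) P(A)
    evidence = joint_ab + p_b_given_not_a * (1.0 - prior_a)   # P(B) by total probability
    if evidence == 0.0:
        raise ValueError("P(B) is zero; the posterior is undefined")
    return joint_ab / evidence                                # P(A | B) = P(A, B) / P(B)

# Hypothetical values: A = "situation escalates", B = "comment volume spikes".
p_escalation = posterior(prior_a=0.2, p_b_given_a=0.9, p_b_given_not_a=0.3)
```

With these assumed inputs the evidence raises the probability of A from the prior 0.2 to roughly 0.43, which is the kind of update the network performs at every node.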
Step 6: Linguistic evaluation-level variables are introduced, and the domain experts who study the emergency use them to evaluate the parent nodes that represent the influence factors of the emergency in the fuzzy Bayesian network topology constructed in step 5. According to fuzzy theory, the natural language variables are determined as follows: seven natural language variables, "very high (VH)", "high (H)", "fairly high (FH)", "medium (M)", "fairly low (FL)", "low (L)", and "very low (VL)", are used as indicators of how strongly the domain experts rate each influence factor of the emergency.
Step 7: Using the natural language variables determined in step 6, the domain experts who study the emergency evaluate and screen the influence factors of the emergency anonymously over M rounds according to the Delphi method, after which the domain expert evaluation results are determined; here M = 5.
Step 8: The domain expert evaluation results from step 7 are quantified using a defuzzification method. A fuzzy number is a concept from fuzzy set theory: it represents a fuzzy set A on the domain U such that $U(x) \in [0, 1]$, where $U(x)$ is called the membership degree; $U(x)$ is the membership function of any random variable x, also called the fuzzy function of the random variable x. When $U(x)$ satisfies the following expression, A is called a triangular fuzzy number:

$$U(x) = \begin{cases} \dfrac{x - a}{b - a}, & a \le x \le b \\ \dfrac{c - x}{c - b}, & b < x \le c \\ 0, & \text{otherwise} \end{cases}$$

where $a \le b \le c$ are the lower bound, the most likely value, and the upper bound of A.
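A triangular membership function is easy to state in code. The sketch below uses the standard three-parameter form with hypothetical parameter names `a`, `b`, `c` (lower bound, peak, upper bound); the patent's own symbols are not legible in the source, so these names are an assumption.

```python
def triangular_membership(x, a, b, c):
    """Membership degree of x in the triangular fuzzy number (a, b, c), with a <= b <= c."""
    if not a <= b <= c:
        raise ValueError("expected a <= b <= c")
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0                  # peak; also covers degenerate cases such as a == b
    if x < b:
        return (x - a) / (b - a)    # rising left edge
    return (c - x) / (c - b)        # falling right edge

# Hypothetical fuzzy number for a "medium" rating.
medium = (0.2, 0.5, 0.8)
degree_at_peak = triangular_membership(0.5, *medium)
degree_halfway = triangular_membership(0.35, *medium)
```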
The domain expert evaluation results are defuzzified using the integral value method to obtain a quantized value of the fuzzy language. The fuzzy probabilities are first averaged over the experts:

$$\tilde{P}_{i'} = \frac{A_{i'1} \oplus A_{i'2} \oplus \cdots \oplus A_{i'm}}{m}, \quad i' = 1, 2, \ldots, n'$$

where $\tilde{P}_{i'}$ represents the fuzzy probability of the occurrence of the i'-th event, $A_{i'k}$ represents the fuzzy value of the k-th expert's evaluation of the i'-th event, $m$ is the number of experts, and $n'$ represents the number of events. The integral value method then yields a crisp probability through the formula:

$$I(p) = \varepsilon\,\mu_r(p) + (1 - \varepsilon)\,\mu_l(p)$$

where $p$ represents the fuzzy probability, $I$ represents the defuzzified value, and $\varepsilon$ represents the optimism coefficient. Setting $\varepsilon = 0$ gives the lower bound of the fuzzy probability and $\varepsilon = 1$ gives the upper bound, and a fixed value of $\varepsilon$ is selected. $\mu_l(p)$ and $\mu_r(p)$ represent the integral values of the left and right membership functions, respectively. Their λ-cut expressions are as follows:

$$\mu_l(p) = \frac{1}{2}\left[\sum_{\lambda=0.1}^{1} l_\lambda(p)\,\Delta\lambda + \sum_{\lambda=0}^{0.9} l_\lambda(p)\,\Delta\lambda\right]$$

$$\mu_r(p) = \frac{1}{2}\left[\sum_{\lambda=0.1}^{1} u_\lambda(p)\,\Delta\lambda + \sum_{\lambda=0}^{0.9} u_\lambda(p)\,\Delta\lambda\right]$$

where $l_\lambda(p)$ represents the lower bound of the λ-cut and $u_\lambda(p)$ represents the upper bound of the λ-cut; λ takes the values 0, 0.1, 0.2, …, 1, and Δλ = 0.1.
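The λ-cut summation can be sketched as follows for a triangular fuzzy number. This is an illustration under stated assumptions rather than the patent's exact procedure: the total integral value is assumed to take the commonly used weighted form I(p) = ε·μ_r(p) + (1 − ε)·μ_l(p), the left and right integrals are approximated by averaging the two Riemann sums over λ = 0, 0.1, …, 1, and the default ε = 0.5 is a hypothetical choice.

```python
def lambda_cut(tri, lam):
    """Lower and upper bounds of the lambda-cut of a triangular fuzzy number (a, b, c)."""
    a, b, c = tri
    return a + lam * (b - a), c - lam * (c - b)

def integral_value(tri, epsilon=0.5, steps=10):
    """Defuzzify `tri` as epsilon * mu_r + (1 - epsilon) * mu_l.
    epsilon=0.5 is an assumed default, not a value taken from the source."""
    delta = 1.0 / steps                                   # delta-lambda = 0.1 when steps = 10
    lams = [k * delta for k in range(steps + 1)]          # 0, 0.1, ..., 1
    lowers = [lambda_cut(tri, lam)[0] for lam in lams]
    uppers = [lambda_cut(tri, lam)[1] for lam in lams]
    # Average of the sum over lambda = 0.1..1 and the sum over lambda = 0..0.9.
    mu_l = 0.5 * (sum(lowers[1:]) + sum(lowers[:-1])) * delta
    mu_r = 0.5 * (sum(uppers[1:]) + sum(uppers[:-1])) * delta
    return epsilon * mu_r + (1.0 - epsilon) * mu_l

crisp = integral_value((0.2, 0.5, 0.8))
```

For a triangular number the cut bounds are linear in λ, so the averaged sums are exact and the result reduces to (a + 2b + c) / 4 when ε = 0.5.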
Step 9: From the calculation result I(p) of step 8, the prior probability distribution of the parent nodes in the fuzzy Bayesian network topology for the emergency is obtained. The prior probability distribution is input into the GeNIe software, which performs inference with the risk of the emergency occurring as the target node, thereby yielding the risk level of the emergency.
The invention establishes a risk prediction method based on TF-IDF and a fuzzy Bayesian network. It analyzes online public opinion data about the emergency using the TF-IDF natural language text analysis technique, extracts feature factors, determines the target information discussed in online public opinion, and establishes an index system. Using fuzzy theory and the properties of Bayesian networks, it addresses the difficulties that arise in real emergency handling: incomplete data, dynamically changing scenes, and heavy overlap among event index information. Because a Bayesian network supports bidirectional causal inference, the method can both infer the posterior probability of the outcome and reason backward to the probability of each influence factor, thereby identifying the key indicators that drive changes in the emergency situation. Relevant departments can then take targeted emergency measures based on the predicted results, helping to prevent the emergency situation from evolving further.
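Before defuzzified values can be entered as a node's prior, they need to sum to 1 across the node's states. One plausible way to do this is simple normalization, sketched below; the state names and values are hypothetical, and the patent does not spell out the exact normalization it uses, so this is an assumption.

```python
def to_prior_distribution(defuzzified):
    """Scale non-negative defuzzified values so they sum to 1 across a node's states."""
    total = sum(defuzzified.values())
    if total <= 0:
        raise ValueError("defuzzified values must have a positive sum")
    return {state: value / total for state, value in defuzzified.items()}

# Hypothetical defuzzified values I(p) for the two states of one parent node.
prior = to_prior_distribution({"occurs": 0.3, "does_not_occur": 0.9})
```

The resulting dictionary is the kind of per-node prior that would then be typed into an inference tool such as GeNIe.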
Drawings
FIG. 1 is a flow chart of a technical route of the present invention;
FIG. 2 is a data crawling flow diagram of the present invention.
Detailed Description
To facilitate understanding of the technical content of the present invention and understanding of the use of the fuzzy Bayesian network model for predicting the risk of the emergency event, the present invention is further explained below with reference to the accompanying drawings.
The invention provides an emergency risk prediction method based on TF-IDF and a fuzzy Bayesian network. The development of online public opinion about the emergency is analyzed from the perspective of text analysis, the characteristic influence factors of the event are extracted, and an emergency index system is established. The domain expert evaluation results are quantified by combining fuzzy theory with a defuzzification technique. Bayesian network inference uses a directed acyclic graph to handle probabilistic problems influenced by multiple factors, represents through probabilities the uncertainty caused by conditional dependence among variables, and performs inference on uncertain knowledge and information, thereby achieving risk prediction for the emergency. The specific implementation steps of the technical route are shown in FIG. 1.
The method comprises the following specific implementation steps:
Step 1: In the data acquisition stage, online public opinion data about the emergency are collected with the Octopus data collector, which converts unstructured web page data into structured data and stores it in forms such as a database or Excel. Cloud-based collection enables accurate and efficient large-scale data acquisition, lowering the cost of obtaining information and improving efficiency, which makes it effective for collecting emergency public opinion data. The data collected from the social media platform include related post content, news reports, and the posts that netizens forwarded, commented on, and liked. As shown in FIG. 2, the data acquisition process first inputs the URLs related to the emergency into the data collector and defines the crawling rules, so that the post content and the forward, comment, and like data for the emergency are crawled accurately. Finally, the crawled data are stored and the data acquisition process ends.
Step 2: Text analysis is performed on the online public opinion data obtained in step 1 to determine the feature factors of the emergency. This mainly comprises the following steps:
S1) Domain experts who study the emergency define keyword index points and use them to preliminarily summarize the emergency keywords in the post content and comments acquired in step 1. The index points include:
1) Words that appear repeatedly in the text data.
2) Names of people, times, and locations in the text data.
3) Emotional words in the text data that express an attitude.
4) Words in the text data that reflect the decision-making behavior of relevant departments, and the names of those departments.
S2) The TF-IDF algorithm is applied to the online public opinion data of the emergency to extract feature items, and the emergency keywords summarized in step S1) are compared and consolidated with the feature items obtained by the TF-IDF algorithm to determine the feature factors of the emergency. The TF-IDF algorithm includes:
1) TF (term frequency) in TF-IDF is the word frequency, that is, how often a given word appears in a document. To reduce the error caused by differences in document length, the word frequency is normalized as:

$$\mathrm{tf}_i = \frac{n_{i,d}}{\sum_{k=1}^{n} n_{k,d}}$$

where $\mathrm{tf}_i$ represents the normalized value for word i; $n_{i,d}$ represents the total number of times word i appears in document d; $\sum_{k=1}^{n} n_{k,d}$ represents the number of all words in document d; and $n$ represents the total number of words over which the sum runs.
2) IDF (inverse document frequency) in TF-IDF is the inverse document frequency. When few documents in the emergency corpus contain word i, word i is effective at distinguishing document categories. The emergency corpus here is the Chinese emergency corpus of the Semantic Intelligence Laboratory of Shanghai University. News reports on 5 types of emergencies (earthquake, fire, traffic accident, and so on) were collected from the Internet as raw corpora; the raw corpora then underwent text preprocessing, text analysis, event labeling, and consistency checking, and the labeling results were stored in the corpus.
The inverse document frequency of word i is computed as:

$$\mathrm{idf}_i = \log \frac{|D|}{1 + |\{j : i \in d_j\}|}$$

where $|D|$ represents the total number of documents in the corpus, $|\{j : i \in d_j\}|$ denotes the number of documents containing word i, and $i \in d_j$ indicates that word i belongs to the j-th document $d_j$ in the corpus. To handle the case where word i does not appear in the corpus, that is, $|\{j : i \in d_j\}| = 0$, 1 is added to the denominator.
3) The TF-IDF value measures how well a word distinguishes categories and is expressed as:

$$\mathrm{tf\_idf}_i = \mathrm{tf}_i \times \mathrm{idf}_i$$
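The comparison in step S2) between expert-screened keywords and TF-IDF feature items could be organized as below. The patent does not state the comparison rule, so treating the overlap as the confirmed feature factors is an assumption, and the function name, `top_k` cutoff, scores, and keywords are all hypothetical.

```python
def select_feature_factors(expert_keywords, tfidf_scores, top_k=3):
    """Compare expert keywords with the top-k TF-IDF terms.
    Returns (confirmed, tfidf_only, expert_only) so that experts can review the differences."""
    # Rank by descending score; break ties alphabetically for a stable order.
    ranked = sorted(tfidf_scores, key=lambda word: (-tfidf_scores[word], word))
    top_terms = ranked[:top_k]
    expert = set(expert_keywords)
    confirmed = [word for word in top_terms if word in expert]
    tfidf_only = [word for word in top_terms if word not in expert]
    expert_only = sorted(expert - set(top_terms))
    return confirmed, tfidf_only, expert_only

# Hypothetical inputs.
feature_scores = {"fire": 0.35, "rescue": 0.0, "alarm": 0.17, "evacuation": 0.21, "today": 0.02}
keywords = ["fire", "evacuation", "fire department"]
confirmed, tfidf_only, expert_only = select_feature_factors(keywords, feature_scores)
```

Terms found by only one side are returned separately rather than discarded, since a domain expert may still want to accept or reject them by hand.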
Step 3: Based on the feature factors of the emergency extracted in step 2 and a review of how earlier events of the same type developed, an emergency life cycle evolution model is determined, dividing the life cycle into 5 stages, from stage 1 to stage 5.
Step 4: From the characteristics of each stage of the emergency life cycle evolution model constructed in step 3 and the feature factors of the emergency extracted in step 2, the influence factors that drive changes in the emergency situation are determined, then refined and analyzed from eight different perspectives to establish a three-level index system. From top to bottom, the three-level index system represents: the first-level index point, which is the predicted risk of the emergency; the second-level index points, which are the eight perspectives driving changes in the emergency situation; and the third-level index points, which are the influence factors of the emergency.
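A three-level index system of this kind can be held as a nested mapping and flattened into the directed edges of a network. The sketch below is illustrative only: the indicator names are invented, only two of the eight perspectives are shown, and the edge direction (from third-level influence factors up to the first-level risk) is an assumption consistent with treating the third-level points as parent nodes.

```python
# Hypothetical index system: first level -> second level -> third-level indicators.
index_system = {
    "emergency_risk": {
        "public_attention": ["forward_count", "comment_count"],
        "official_response": ["response_speed"],
    }
}

def build_topology(system):
    """Return (edges, roles): directed edges child-to-parent-level and a role label per node."""
    edges, roles = [], {}
    for target, second_level in system.items():
        roles[target] = "target"
        for intermediate, third_level in second_level.items():
            roles[intermediate] = "intermediate"
            edges.append((intermediate, target))
            for parent in third_level:
                roles[parent] = "parent"
                edges.append((parent, intermediate))
    return edges, roles

edges, roles = build_topology(index_system)
```

Keeping the roles alongside the edges makes it straightforward to attach expert-derived priors to the parent nodes only.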
Step 5: A fuzzy Bayesian network topology for the emergency is determined from the three-level index system established in step 4. The topology comprises parent nodes, which are the basic units; intermediate nodes, which link the parent nodes to the target node; and the target node, which carries the final inference result of the network. The third-level index points of the three-level index system serve as the parent nodes of the topology, the second-level index points serve as the link nodes (that is, the intermediate nodes), and the first-level index point serves as the target node. Bayesian network inference rests on Bayesian theory. A Bayesian network (BN), also called a belief network, is represented by a directed acyclic graph: the nodes represent random variables, the connecting lines between nodes represent dependency relationships, and the arrows represent the causal relationship between two random variables. A BN can perform inference from causes to results and can also perform reverse diagnosis from results to causes. The basic unit of the BN structure is called a parent node, also called an evidence node, and the intermediate nodes, called link nodes, connect the parent nodes to the target node. Bayesian theory computes the posterior probability, that is, the probability of the target node, by inference from the prior probability distribution of the parent nodes and the conditional probabilities. The joint probability is the probability that two events occur together, denoted $P(A, B)$ and called the joint probability of A and B. For any set of random variables $X = \{x^{t_1}, x^{t_2}, \ldots, x^{t_{n'}}\}$, the joint probability distribution is expressed as:

$$P(x^{t_1}, x^{t_2}, \ldots, x^{t_{n'}}) = \prod_{j'=1}^{n'} P\left(x^{t_{j'}} \mid \mathrm{Pa}(x^{t_{j'}})\right)$$

where $\mathrm{Pa}(x^{t_{j'}})$ denotes the set of parent nodes of the random variable $x$ in period $t_{j'}$, and $t_{n'}$ is the time length. When one node is a parent node of another, the joint probability distribution is expressed by a further formula (the symbols and formula are not reproduced in the source text).

The conditional probability is the probability that event A occurs given that event B has occurred, denoted $P(A \mid B)$ and called the probability of A under condition B. It is calculated as $P(A \mid B) = P(A, B) / P(B)$. For any set of random variables $X$, the conditional probability of the random variable $x^{t_{j'}}$ is expressed as:

$$P\left(x^{t_{j'}} \mid \mathrm{Pa}(x^{t_{j'}})\right) = \frac{P\left(x^{t_{j'}}, \mathrm{Pa}(x^{t_{j'}})\right)}{P\left(\mathrm{Pa}(x^{t_{j'}})\right)}$$

The prior probability is the probability that event A occurs, $P(A)$, or that event B occurs, $P(B)$. The posterior probability is the probability of event A recalculated after event B has occurred, called the posterior probability of A and expressed as:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
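Both directions of inference, from causes to the result and from the result back to a cause, can be demonstrated on a tiny network by brute-force enumeration of the joint distribution. The sketch below is not GeNIe and not the patent's network: the four binary nodes (two parent nodes F1 and F2, one intermediate node S, one target node R) and every probability in it are hypothetical.

```python
from itertools import product

NODES = ("F1", "F2", "S", "R")
prior = {"F1": 0.3, "F2": 0.6}                                    # P(F1=1), P(F2=1)
cpt_s = {(0, 0): 0.05, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}     # P(S=1 | F1, F2)
cpt_r = {0: 0.1, 1: 0.8}                                          # P(R=1 | S)

def bernoulli(p_one, value):
    return p_one if value == 1 else 1.0 - p_one

def joint(f1, f2, s, r):
    """Joint probability as the product of each node's probability given its parents."""
    return (bernoulli(prior["F1"], f1) * bernoulli(prior["F2"], f2)
            * bernoulli(cpt_s[(f1, f2)], s) * bernoulli(cpt_r[s], r))

def probability(query, evidence=None):
    """P(query | evidence) by summing the joint over all consistent assignments."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for values in product((0, 1), repeat=len(NODES)):
        world = dict(zip(NODES, values))
        if any(world[name] != val for name, val in evidence.items()):
            continue
        p = joint(*values)
        denominator += p
        if all(world[name] == val for name, val in query.items()):
            numerator += p
    return numerator / denominator

risk = probability({"R": 1})                  # forward inference: causes to result
diagnosis = probability({"F1": 1}, {"R": 1})  # reverse diagnosis: result to cause
```

Enumeration grows exponentially with the number of nodes, so it only suits small illustrations; observing the risk (R = 1) raises the belief in F1 above its prior, which is the reverse-diagnosis behavior described above.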
Step 6: since the influence factors of the emergency are difficult to provide exact expression, language evaluation level description variables are introduced. And (4) the domain expert for researching the emergency evaluates the parent node which represents the influence factor of the emergency in the fuzzy Bayesian network topology structure about the emergency and is constructed in the step (5) according to the language evaluation level description variable. Determining natural language variables as follows according to fuzzy theory knowledge: seven natural language variables of "Very High (VH)", "high (H)", "high (FH)", "medium (M)", "low (FL)", "low (L)", and "Very Low (VL)" are used as indexes indicating the degree tendency of the field experts to evaluate the impact factors of the emergency.
Step 7: Using the natural language variables determined in step 6, the domain experts who study the emergency give their evaluations in fuzzy language. Following the Delphi method, the experts evaluate and screen the influence factors of the emergency anonymously over M rounds, after which the domain expert evaluation result is determined; here M = 5.
Step 8: Quantify the domain expert evaluation result determined in step 7 by defuzzification. A fuzzy number is a concept from fuzzy set theory: it represents a fuzzy set A on the domain U whose membership degree satisfies $\mu_A(x) \in [0, 1]$, where $\mu_A(x)$ is the membership function (also called the fuzzy function) of any random variable $x$. When $\mu_A(x)$ satisfies the following expression, A is called a triangular fuzzy number.
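For reference, the standard textbook form of a triangular membership function is sketched below. The parameter names $a < b < c$ (left foot, peak, right foot) and the symbol $\mu_A$ are assumptions made for this sketch and may differ from the notation used by the invention:

```latex
\mu_A(x) =
\begin{cases}
\dfrac{x - a}{b - a}, & a \le x \le b \\[6pt]
\dfrac{c - x}{c - b}, & b < x \le c \\[6pt]
0, & \text{otherwise}
\end{cases}
```

Under this form the membership degree rises linearly from 0 at $a$ to 1 at $b$ and falls back to 0 at $c$, so the triple $(a, b, c)$ fully describes the fuzzy number.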
The domain expert evaluation results are defuzzified with the integral value method to obtain quantized values of the fuzzy language. These values are averaged and normalized to determine the prior probability distribution of the parent nodes, which the Bayesian network uses to infer the probability of the target node. In fuzzy theory, the cut set (λ-cut) represents the conversion from fuzzy to crisp. To quantify the domain expert evaluation results, a table relating the natural language variables determined in step 6 to fuzzy numbers and λ-cuts is established (Table 1), with cut parameter λ. In the fuzzy probability averaging, the averaged value represents the fuzzy probability that the i′-th event occurs, $A_{i'k}$ is the fuzzy value that the k-th expert assigns to the i′-th event, and n′ is the number of events. Defuzzification by the integral value method yields a crisp probability, where $p$ is the fuzzy probability, $I$ is the defuzzified value, and the remaining parameters are the optimism coefficient and the lower and upper bounds of the fuzzy probability; $\mu_l(p)$ and $\mu_r(p)$ are the integral values of the left and right membership functions, respectively. The λ-cut expressions of $\mu_l(p)$ and $\mu_r(p)$ are as follows:
where the two bounds are the lower bound and the upper bound of the λ-cut, respectively, $\lambda = 0, 0.1, 0.2, \ldots, 1$, and $\Delta\lambda = 0.1$.
Table 1. Quantitative relationship between natural language variables, fuzzy numbers, and λ-cuts
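The quantification in step 8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the invention's exact formulas: the fuzzy numbers in `FUZZY_SCALE` are hypothetical placeholders (they are not the values of Table 1), the weighting follows the commonly cited integral value form in which the optimism coefficient `alpha` weights the right integral, `alpha = 0.5` is an assumed default, and the two-state normalization (State0 = 1 - State1) is likewise an assumption. The λ grid 0, 0.1, …, 1 with Δλ = 0.1 matches the text.

```python
# Hypothetical triangular fuzzy numbers (a, b, c) for the seven linguistic
# variables. These are illustrative placeholders, NOT the values of Table 1.
FUZZY_SCALE = {
    "VL": (0.0, 0.1, 0.2), "L": (0.1, 0.2, 0.3), "FL": (0.2, 0.35, 0.5),
    "M": (0.4, 0.5, 0.6), "FH": (0.5, 0.65, 0.8), "H": (0.7, 0.8, 0.9),
    "VH": (0.8, 0.9, 1.0),
}


def lambda_cut(tfn, lam):
    """Return the (lower, upper) bounds of the lambda-cut of a triangular number."""
    a, b, c = tfn
    return a + (b - a) * lam, c - (c - b) * lam


def integral_value(tfn, alpha=0.5, steps=10):
    """Defuzzify with the integral value method on a lambda grid of width 1/steps.

    alpha is the optimism coefficient (assumed to weight the right integral).
    """
    d = 1.0 / steps
    lams = [k * d for k in range(steps + 1)]
    lower = [lambda_cut(tfn, lam)[0] for lam in lams]
    upper = [lambda_cut(tfn, lam)[1] for lam in lams]
    # Average of the sums over lambda = d..1 and lambda = 0..1-d (trapezoid rule).
    mu_l = 0.5 * (sum(lower[1:]) + sum(lower[:-1])) * d
    mu_r = 0.5 * (sum(upper[1:]) + sum(upper[:-1])) * d
    return alpha * mu_r + (1 - alpha) * mu_l


def average_fuzzy(ratings):
    """Average the experts' fuzzy numbers component-wise for one event."""
    tfns = [FUZZY_SCALE[r] for r in ratings]
    return tuple(sum(t[i] for t in tfns) / len(tfns) for i in range(3))


def prior_from_ratings(ratings, alpha=0.5):
    """Turn one parent node's expert ratings into a two-state prior (assumed form)."""
    state1 = integral_value(average_fuzzy(ratings), alpha)
    return {"State1": state1, "State0": 1.0 - state1}


# Five hypothetical expert ratings for one influence factor.
prior = prior_from_ratings(["H", "VH", "FH", "H", "M"])
```

Because the λ-cut bounds of a triangular number are linear in λ, the trapezoid sum is exact here, and averaging the fuzzy numbers before defuzzifying gives the same result as averaging the defuzzified values.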
Step 9: The domain experts who study the emergency evaluate the parent nodes, which represent the influence factors of the emergency in the fuzzy Bayesian network, and the quantization results I(p) of all parent nodes are computed with the rule for quantifying fuzzy language given in step 8 (Table 2). Each node value represents the probability level at which the corresponding influence factor affects the target node and serves as an initial value for the inference. State1 and State0 denote the probability that the event "occurs" or "does not occur" under the influence factor; the invention takes State1 as the risk probability that the emergency occurs. The node values are entered into the GeNIe software to infer the probability of the target node, which is then compared with the probability ranges of the risk levels in Table 3 to obtain the risk level of the emergency. In the verification example, the risk probability of the target node is 76%, which falls in the 75%-100% range of Table 3, so the emergency is rated a Class I risk. When a given emergency has occurred, State1 of the target node is set to 100%, and the key characteristic factors that influence the occurrence of the emergency are identified by reverse inference.
Table 2. Prior probability distribution of parent nodes
Table 3. Risk levels
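The forward and reverse inference of step 9 can be illustrated without GeNIe by brute-force enumeration over a tiny network. The sketch below is only an assumption-laden stand-in for the GeNIe computation: the node names `F1`, `F2` (parent nodes), `M` (intermediate node) and `R` (target node) are hypothetical, and every prior and conditional probability is an invented placeholder rather than a value from Table 2. Only the 75%-100% range for Class I is taken from the text.

```python
from itertools import product

# Hypothetical priors P(State1) of two parent nodes (not values from Table 2).
priors = {"F1": 0.73, "F2": 0.40}
# Hypothetical conditional table P(M = 1 | F1, F2) for the intermediate node.
cpt_m = {(1, 1): 0.9, (1, 0): 0.7, (0, 1): 0.6, (0, 0): 0.1}
# Hypothetical conditional table P(R = 1 | M) for the target node.
cpt_r = {1: 0.85, 0: 0.15}

NAMES = ("F1", "F2", "M", "R")


def bernoulli(p_one, value):
    """Probability of a binary state given the probability of state 1."""
    return p_one if value == 1 else 1.0 - p_one


def joint(f1, f2, m, r):
    """Joint probability as the product of each node given its parents."""
    return (bernoulli(priors["F1"], f1) * bernoulli(priors["F2"], f2)
            * bernoulli(cpt_m[(f1, f2)], m) * bernoulli(cpt_r[m], r))


def probability(query, evidence=None):
    """P(query | evidence) by full enumeration; both map node name -> 0 or 1."""
    evidence = evidence or {}
    num = den = 0.0
    for values in product((0, 1), repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue  # world contradicts the evidence
        p = joint(*values)
        den += p
        if all(world[k] == v for k, v in query.items()):
            num += p
    return num / den


# Forward inference: risk probability of the target node.
risk = probability({"R": 1})
# Reverse inference: set the target to State1 and re-evaluate each parent.
diagnosis = {f: probability({f: 1}, {"R": 1}) for f in ("F1", "F2")}
# Class I range as stated in the text (75%-100%).
is_class_one = 0.75 <= risk <= 1.0
```

In the reverse step, the parent whose posterior rises most above its prior is the candidate key factor. Full enumeration grows exponentially with the number of nodes, so it only suits small illustrative networks.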
The embodiments described above are only a part of the embodiments of the present invention, and not all of them. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Claims (5)
1. A risk prediction method based on TF-IDF and fuzzy Bayesian network is characterized by comprising the following steps:
step 1: data acquisition: in the data acquisition stage, using a data collector and web crawler technology to collect from social media platforms the posts of netizens about the emergency, news reports, and the forwarding counts, comment counts, and like counts of those posts, thereby comprehensively acquiring the network public opinion data of the emergency;
step 2: performing text analysis on the network public opinion data of the emergency obtained in the step 1 to determine characteristic factors of the emergency;
step 3: determining a life cycle evolution model of the emergency from the characteristic factors of the emergency extracted in step 2 and by consulting the development characteristics of previous events of the same type, and dividing the life cycle of the emergency into 5 stages, from stage 1 to stage 5;
step 4: determining the influence factors that drive changes in the situation of the emergency from the characteristics of each stage of the life cycle evolution model constructed in step 3 and the characteristic factors of the emergency obtained in step 2, analyzing the influence factors in detail, and establishing a three-level index system from eight different angles, wherein, from top to bottom, the first-level index point represents the risk prediction of the emergency, the second-level index points represent the eight different angles that drive changes in the situation of the emergency, and the third-level index points represent the influence factors of the emergency;
step 5: determining the fuzzy Bayesian network topology for the emergency from the three-level index system established in step 4, wherein the topology comprises parent nodes as basic units, intermediate nodes linking the parent nodes to the target node, and a target node carrying the final network inference result; the third-level index points of the three-level index system serve as the parent nodes of the topology, the second-level index points as the intermediate nodes, and the first-level index point as the target node;
step 6: introducing linguistic evaluation-level variables, with which domain experts who study the emergency evaluate the parent nodes that represent the influence factors of the emergency in the fuzzy Bayesian network topology constructed in step 5, the natural language variables being determined according to fuzzy theory as follows: seven natural language variables, "very high (VH)", "high (H)", "fairly high (FH)", "medium (M)", "fairly low (FL)", "low (L)", and "very low (VL)", indicate the degree to which the domain experts rate the influence factors of the emergency;
step 7: using the natural language variables determined in step 6, the domain experts who study the emergency give their evaluations in fuzzy language: following the Delphi method, the domain experts evaluate and screen the influence factors of the emergency anonymously over M rounds, and the domain expert evaluation result is determined;
step 8: defuzzifying the domain expert evaluation result determined in step 7 with the integral value method to obtain quantized values of the fuzzy language, averaging and normalizing the quantized values, and determining the prior probability distribution of the parent nodes in the fuzzy Bayesian network topology for the emergency, the prior probability distribution being used to infer the probability of the target node of that topology, specifically:
in fuzzy theory the cut set (λ-cut) represents the conversion from fuzzy to crisp; to quantify the expert evaluation result, a table relating the natural language variables determined in step 6 to fuzzy numbers and λ-cuts is established, with cut parameter λ; the fuzzy probabilities are averaged for $i' = 1, 2, \ldots, n'$, where the averaged value represents the fuzzy probability that the i′-th event occurs, $A_{i'k}$ represents the fuzzy value that the k-th expert assigns to the i′-th event, and n′ represents the number of events; defuzzification by the integral value method yields a crisp probability, wherein $p$ represents the fuzzy probability, $I(p)$ represents the defuzzified value, the remaining coefficient is the optimism coefficient, and $\mu_l(p)$, $\mu_r(p)$ are the integral values of the left and right membership functions, respectively; the λ-cut expressions of $\mu_l(p)$ and $\mu_r(p)$ are as follows:
wherein the two bounds are the lower bound and the upper bound of the λ-cut, respectively; $\lambda = 0, 0.1, 0.2, \ldots, 1$; and $\Delta\lambda = 0.1$;
step 9: obtaining the prior probability distribution of the parent nodes in the fuzzy Bayesian network topology for the emergency from the calculation result I(p) of step 8, inputting the prior probability distribution into the GeNIe software to infer the probability of the target node, i.e. the risk that the emergency occurs, and thereby obtaining the risk level of the emergency.
2. The method according to claim 1, wherein the step 2 specifically comprises:
step 2.1: domain experts who study the emergency determine the index points for keyword extraction and preliminarily summarize the emergency keywords in the post content and comments obtained in step 1, wherein the index points comprise: words that recur in the text data; names of persons, times, and places in the text data; words expressing attitude or emotion in the text data; and words in the text data that reflect the decision-making behavior and the names of relevant departments;
step 2.2: performing feature extraction on the network public opinion data of the emergency with the TF-IDF algorithm to obtain feature items, comparing and consolidating the emergency keywords summarized in step 2.1 with the feature items obtained by the TF-IDF algorithm, and determining the characteristic factors of the emergency, wherein the TF-IDF algorithm comprises the following steps:
1) TF (term frequency) in TF-IDF is the frequency with which a word appears in a document; to reduce errors caused by differences in document length, the normalized term frequency is expressed as:
$$tf_i = \frac{N_{i,d}}{\sum_{k=1}^{n} N_{k,d}}$$
wherein $tf_i$ is the normalized value for word i; $N_{i,d}$ is the total number of times word i appears in document d; $\sum_{k=1}^{n} N_{k,d}$ is the number of all words in document d; and n is the total number of words;
2) IDF (inverse document frequency) in TF-IDF is the inverse document frequency; when few documents in the emergency corpus contain word i, word i distinguishes document categories well; the inverse document frequency of word i is calculated as:
$$idf_i = \log \frac{|D|}{|\{j : i \in d_j\}|}$$
wherein $|D|$ is the total number of documents in the emergency corpus, $|\{j : i \in d_j\}|$ is the number of documents containing word i, and $i \in d_j$ means that word i belongs to the j-th document $d_j$ of the emergency corpus;
3) the TF-IDF value represents how well a word distinguishes categories and is expressed as:
$$tf\_idf_i = tf_i \times idf_i.$$
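The three TF-IDF steps above can be sketched in Python as follows. This is a minimal illustration, assuming each document has already been segmented into word tokens (the segmentation itself is outside this sketch) and using the unsmoothed logarithmic form of the inverse document frequency; the function name `tf_idf` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Compute TF-IDF weights.

    corpus: list of documents, each a list of word tokens.
    Returns one dict per document mapping word -> tf-idf value.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus:
        counts = Counter(doc)          # occurrences of each word in the document
        total = sum(counts.values())   # number of all words in the document
        scores.append({
            # normalized term frequency times inverse document frequency
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical pre-tokenized documents.
docs = [
    ["flood", "rescue", "flood", "city"],
    ["rescue", "team", "city"],
    ["market", "city", "price"],
]
weights = tf_idf(docs)
```

A word that occurs in every document, such as "city" here, receives a weight of zero, while a word concentrated in one document, such as "flood", scores highest there. That matches the stated purpose of favoring words that distinguish document categories.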
3. The TF-IDF and fuzzy Bayesian network based risk prediction method of claim 2, wherein said data collector is an Octopus data collector.
4. The TF-IDF and fuzzy Bayesian network based risk prediction method according to claim 3, wherein M = 5 in said step 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111030602.8A CN115115159A (en) | 2021-09-03 | 2021-09-03 | TF-IDF and fuzzy Bayesian network-based risk prediction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111030602.8A CN115115159A (en) | 2021-09-03 | 2021-09-03 | TF-IDF and fuzzy Bayesian network-based risk prediction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115115159A true CN115115159A (en) | 2022-09-27 |
Family
ID=83325303
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111030602.8A Pending CN115115159A (en) | 2021-09-03 | 2021-09-03 | TF-IDF and fuzzy Bayesian network-based risk prediction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115115159A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116488912A (en) * | 2023-04-27 | 2023-07-25 | 徐州医科大学 | Network traffic monitoring method and system based on mutation model finite state |
CN116962080A (en) * | 2023-09-19 | 2023-10-27 | 中孚信息股份有限公司 | Alarm filtering method, system and medium based on network node risk assessment |
CN116962080B (en) * | 2023-09-19 | 2023-12-15 | 中孚信息股份有限公司 | Alarm filtering method, system and medium based on network node risk assessment |
CN117371876A (en) * | 2023-12-07 | 2024-01-09 | 深圳品阔信息技术有限公司 | Index data analysis method and system based on keywords |
CN117371876B (en) * | 2023-12-07 | 2024-04-02 | 深圳品阔信息技术有限公司 | Index data analysis method and system based on keywords |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||