CN115115159A - TF-IDF and fuzzy Bayesian network-based risk prediction method - Google Patents

TF-IDF and fuzzy Bayesian network-based risk prediction method

Info

Publication number
CN115115159A
CN115115159A CN202111030602.8A CN202111030602A CN115115159A CN 115115159 A CN115115159 A CN 115115159A CN 202111030602 A CN202111030602 A CN 202111030602A CN 115115159 A CN115115159 A CN 115115159A
Authority
CN
China
Prior art keywords
emergency
fuzzy
bayesian network
idf
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111030602.8A
Other languages
Chinese (zh)
Inventor
康昭
赵晓翠
田玲
惠孛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202111030602.8A
Publication of CN115115159A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services
    • G06Q50/265 Personal security, identity or safety

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • General Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Development Economics (AREA)
  • Artificial Intelligence (AREA)
  • Educational Administration (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Probability & Statistics with Applications (AREA)
  • Game Theory and Decision Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a risk prediction method based on TF-IDF and a fuzzy Bayesian network, in the technical fields of information retrieval, data mining, and emergency prediction and evaluation. The method comprises: acquiring public opinion data on an emergency using web crawler technology; extracting the influence factors of the emergency using TF-IDF text analysis, and constructing an emergency index system in combination with an emergency life cycle evolution model; determining the fuzzy Bayesian network topology from the index system, and obtaining the prior probability distribution of its parent nodes using fuzzy theory, natural language variables and a defuzzification method; and finally predicting the risk probability of the emergency through fuzzy Bayesian network inference, providing scientific and reasonable early warning support for relevant departments formulating emergency plans.

Description

TF-IDF and fuzzy Bayesian network-based risk prediction method
Technical Field
The invention relates to the fields of information retrieval, data mining and emergency prediction and evaluation, in particular to a risk prediction method based on TF-IDF and a fuzzy Bayesian network.
Background
An emergency is an event that occurs suddenly and causes casualties, property loss and similar harm. The model of the invention infers and predicts the risk that an emergency's situation will change, giving relevant departments scientific and reasonable theoretical support for formulating emergency plans and thereby providing early warning.
At present, research on emergencies relies mainly on neural networks, the analytic hierarchy process, fuzzy comprehensive evaluation and Bayesian networks. Event risk assessment methods comprise the subjective expert evaluation method; the semi-qualitative, semi-quantitative analytic hierarchy process and fuzzy comprehensive evaluation method; and the quantitative neural network and Bayesian network methods. The expert evaluation method can give a quantitative evaluation from domain experts' experience when data are insufficient, but the experts' subjectivity strongly influences the result. The analytic hierarchy process is prone to data inconsistency when the research model is complex. Compared with these two, the fuzzy comprehensive evaluation method handles fuzzy, complex and uncertain data better, but places higher demands on the cross information in the index system. The neural network method learns well and suits complex models, but its results are not interpretable, it converges slowly, and it is prone to local minima. The Bayesian network method performs causal reasoning over node relations using conditional probabilities, has a clear network structure and an interpretable inference process, and copes well with complex index systems and incomplete data.
Emergencies are characterised by incomplete data, dynamically changing scenes and heavy overlap among event index information. When existing methods are used to predict and evaluate emergency risk, they suffer from strong subjectivity, lack of interpretability and poor adaptability, and therefore cannot meet the needs of prediction in real emergency scenarios.
Disclosure of Invention
To address the inadequate prediction of emergency escalation risk in the prior art, the invention builds on existing research to provide a risk prediction method based on TF-IDF and a fuzzy Bayesian network, offering a new way to study the risk of an emergency's situation evolving. The method involves data acquisition, text analysis, construction of an emergency index system, a life cycle evolution model, inference-based prediction and related techniques.
The invention provides a risk prediction method based on TF-IDF and a fuzzy Bayesian network, which specifically comprises the following steps:
Step 1: Data acquisition. Using web crawler technology, the Octopus data collector gathers original posts and forwarded posts about the emergency from mainstream media platforms such as Twitter and Facebook, together with event-related data such as the forward, comment and like counts of those posts. Public opinion data are collected daily, starting from the day the emergency occurs.
Step 2: Perform text analysis on the daily public opinion data collected in step 1, extract feature items and determine the characteristic factors of the emergency. This mainly comprises the following steps:
S1) Domain experts studying the emergency determine the keyword indicators and preliminarily screen keywords from the post and comment content about the emergency. The keyword indicators include:
1) Words that appear repeatedly in the text data.
2) Names of people, times and locations in the text data.
3) Emotional vocabulary that expresses attitudes in the text data.
4) Vocabulary in the text data that reflects the decisions of relevant departments, and the names of those departments.
S2) Apply the TF-IDF algorithm to the daily public opinion data collected in step 1 to obtain feature items, then compare the keywords preliminarily screened in step S1) with those feature items to determine the characteristic factors of the emergency. The TF-IDF algorithm includes:
1) TF (term frequency) in TF-IDF is the frequency with which a word appears in a document. To reduce errors caused by differences in document length, the term frequency is normalized as:
$tf_i = \frac{n_{i,d}}{\sum_{k=1}^{n} n_{k,d}}$
where $tf_i$ is the normalized value for word i; $n_{i,d}$ is the total number of times word i appears in document d; $\sum_{k=1}^{n} n_{k,d}$ is the number of all words in document d; and n is the total number of words.
2) IDF (inverse document frequency) in TF-IDF is the inverse document frequency. The fewer documents in the emergency corpus that contain word i, the better word i distinguishes document categories. The emergency corpus here is the Chinese emergency corpus of the Semantic Intelligence Laboratory of Shanghai University. News reports on 5 types of emergency (earthquake, fire, traffic accident and so on) were collected from the Internet as raw corpora, which then underwent text preprocessing, text analysis, event annotation and consistency checking; the annotation results were finally stored in the corpus.
The inverse document frequency of word i is computed as:
$idf_i = \log \frac{|D|}{1 + |\{j : i \in d_j\}|}$
where $|D|$ is the total number of documents in the corpus, $|\{j : i \in d_j\}|$ is the number of documents containing word i, and $i \in d_j$ means that word i belongs to the j-th document $d_j$. To avoid a zero denominator when word i does not appear in the corpus, that is, when $|\{j : i \in d_j\}| = 0$, the denominator is increased by 1.
3) The TF-IDF value measures how well a word distinguishes categories and is computed as:
$tf\_idf_i = tf_i \times idf_i$
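The TF, IDF and TF-IDF definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `tf_idf`, the toy corpus and the assumption that documents are already tokenized into word lists are all hypothetical, and the natural logarithm is assumed because the text does not state a base.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Return {word: tf-idf score} for one tokenized document.

    doc_tokens: list of words in document d (assumed pre-tokenized).
    corpus: list of tokenized documents, each a list of words.
    """
    counts = Counter(doc_tokens)      # n_{i,d}: occurrences of word i in d
    total = sum(counts.values())      # number of all words in d
    n_docs = len(corpus)              # |D|
    scores = {}
    for word, n in counts.items():
        tf = n / total                                # normalized term frequency
        df = sum(1 for d in corpus if word in d)      # documents containing word i
        idf = math.log(n_docs / (1 + df))             # denominator increased by 1
        scores[word] = tf * idf
    return scores


# Hypothetical toy corpus, for illustration only.
corpus = [
    ["fire", "rescue", "fire"],
    ["earthquake", "rescue"],
    ["traffic", "accident", "rescue"],
    ["earthquake", "aftershock"],
]
scores = tf_idf(corpus[0], corpus)
```

In this toy corpus "fire" occurs in only one document and scores higher than "rescue", which occurs in three. Note that with the denominator increased by 1, a word that appears in every document receives a slightly negative IDF.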
and step 3: and (3) determining an emergency life cycle evolution model by extracting the characteristic factors of the emergency and looking up the development characteristics of the previous events of the same type in the step (2), and dividing the life cycle of the emergency into 5 stages from stage 1 to stage 5.
And 4, step 4: determining influence factors driving the situation change of the emergency through the characteristics of each stage of the emergency life cycle evolution model constructed in the step 3 and the characteristic factors of the emergency extracted in the step 2, and establishing a three-stage index system from eight different angles through detailed analysis. The three-level index system sequentially represents from top to bottom: the first-level index points are used for predicting the risk of the emergency, the second-level index points are used for driving eight different angles of the situation change of the emergency, and the third-level index points represent influence factors of the emergency.
And 5: and (4) determining a fuzzy Bayesian network topological structure related to the emergency according to the three-level index system established in the step (4), wherein the fuzzy Bayesian network topological structure comprises a basic unit father node, intermediate nodes for linking the father node and the target nodes, and the target nodes of the final network reasoning result. The three-level index points in the three-level index system are used as father nodes of the fuzzy Bayesian network topological structure, the two-level index points are used as link nodes (namely intermediate nodes) of the fuzzy Bayesian network topological structure, and the one-level index points are used as target nodes of the fuzzy Bayesian network topological structure. The Bayesian network inference technology principle is an important result based on a Bayesian theory, and the Bayesian theory is to calculate a posterior probability, namely a target node probability based on prior probability distribution and conditional probability inference of a father node. The joint probability represents the probability of two events occurring together, and represents that p is (A, B), namely the joint probability of A and B, and is combined with any random variable set
Figure RE-GDA0003808112430000031
Figure RE-GDA0003808112430000032
The joint probability distribution is expressed as:
Figure RE-GDA0003808112430000033
where
Figure RE-GDA0003808112430000034
denotes the set of parent nodes of the random variable x in time period t_{j′}; the time length t_{n′} is 1 day, and j′ = 1, 2, …, n′. When
Figure RE-GDA0003808112430000035
is a parent node of
Figure RE-GDA0003808112430000036
the joint probability distribution is expressed as:
Figure RE-GDA0003808112430000037
The conditional probability is the probability that event A occurs given that event B has occurred, written p(A | B) and called the probability of A under condition B. It is calculated as
$p(A \mid B) = \frac{p(A, B)}{p(B)}$
For any set of random variables
Figure RE-GDA0003808112430000039
Figure RE-GDA00038081124300000310
the conditional probability of the random variable
Figure RE-GDA00038081124300000311
is expressed as:
Figure RE-GDA00038081124300000312
The prior probability is the probability that event A occurs, p(A), or that event B occurs, p(B). The posterior probability is the probability that event A occurs after event B has occurred, called the posterior probability of A and expressed as
$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$
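As a small numeric illustration of the prior, joint, conditional and posterior probabilities defined above, the following Python sketch applies Bayes' rule to two binary events. The function name `posterior` and all probability values are hypothetical and chosen only for illustration; they do not come from the patent.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return p(A | B) for binary events A and B using Bayes' rule."""
    joint_a_b = p_b_given_a * prior_a                  # p(A, B)
    joint_not_a_b = p_b_given_not_a * (1 - prior_a)    # p(not A, B)
    p_b = joint_a_b + joint_not_a_b                    # p(B) by total probability
    return joint_a_b / p_b                             # p(A | B) = p(A, B) / p(B)


# Hypothetical values: p(A) = 0.2, p(B | A) = 0.9, p(B | not A) = 0.3.
p_a_given_b = posterior(0.2, 0.9, 0.3)
```

Observing B raises the probability of A above its prior of 0.2 here, because B is three times as likely when A holds as when it does not.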
Step 6: Introduce linguistic evaluation-level variables. Domain experts studying the emergency use these variables to evaluate the parent nodes, which represent the influence factors of the emergency in the fuzzy Bayesian network topology built in step 5. Following fuzzy theory, seven natural language variables are defined: "very high (VH)", "high (H)", "fairly high (FH)", "medium (M)", "fairly low (FL)", "low (L)" and "very low (VL)". They indicate the degree to which the domain experts rate each influence factor of the emergency.
Step 7: Using the natural language variables defined in step 6, the domain experts studying the emergency carry out M rounds of anonymous evaluation and screening of the influence factors according to the Delphi method, which yields the domain expert evaluation result. Here M = 5.
Step 8: Quantify the domain expert evaluation result from step 7 using a defuzzification method. A fuzzy number is a concept from fuzzy set theory: it denotes a fuzzy set A on the domain U that satisfies U(x) ∈ [0,1], where U(x) is called the degree of membership; U(x) is the membership function of any random variable x, also called the fuzzy function of x. When U(x) satisfies the following expression, A is called a triangular fuzzy number.
Figure RE-GDA0003808112430000042
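The expression referred to above is not recoverable from this text. As a point of reference only, the standard membership function of a triangular fuzzy number is given below; the parameters a < b < c (left endpoint, peak and right endpoint) are assumed symbols that do not appear in the original.

```latex
U(x) =
\begin{cases}
0, & x < a \\
\dfrac{x - a}{b - a}, & a \le x \le b \\
\dfrac{c - x}{c - b}, & b < x \le c \\
0, & x > c
\end{cases}
```

Under this form the membership rises linearly from 0 at a to 1 at b and falls linearly back to 0 at c.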
The domain expert evaluation result is defuzzified by the integral value method to obtain the quantized value of the fuzzy language. Fuzzy probability averaging:
Figure RE-GDA0003808112430000043
gives the fuzzy probability of occurrence of the i′-th event, where A_{i′k} is the fuzzy value of the k-th expert's evaluation of the i′-th event and n′ is the number of events. Defuzzification by the integral value method yields the crisp probability through the following formula:
Figure RE-GDA0003808112430000044
where p denotes the fuzzy probability, I denotes the defuzzified value,
Figure RE-GDA0003808112430000045
denotes the optimism coefficient,
Figure RE-GDA0003808112430000046
denotes the lower bound of the fuzzy probability, and
Figure RE-GDA0003808112430000047
denotes the upper bound of the fuzzy probability; the value selected here is
Figure RE-GDA0003808112430000048
μ_l(p) and μ_r(p) denote the integral values of the left and right membership functions, respectively. The λ-cut expressions of μ_l(p) and μ_r(p) are as follows:
Figure RE-GDA0003808112430000049
Figure RE-GDA0003808112430000051
where
Figure RE-GDA0003808112430000052
denotes the lower bound of the λ-cut;
Figure RE-GDA0003808112430000053
denotes the upper bound of the λ-cut; λ takes the values 0, 0.1, 0.2, …, 1, and Δλ = 0.1.
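The formulas for this step are not recoverable from the text, so the following Python sketch is only an assumed reading of it, not the patent's formula. It averages several experts' triangular fuzzy numbers component by component, then defuzzifies with an integral value of the form I = (1 − α)·I_L + α·I_R, where I_L and I_R integrate the lower and upper λ-cut bounds over λ = 0, 0.1, …, 1. The function names, the triangular parameters (a, b, c), the default optimism coefficient α = 0.5 and the example ratings are all hypothetical.

```python
def average_fuzzy(ratings):
    """Component-wise mean of triangular fuzzy numbers (a, b, c) from several experts."""
    k = len(ratings)
    return tuple(sum(r[i] for r in ratings) / k for i in range(3))


def integral_value(tri, alpha=0.5, steps=10):
    """Defuzzify a triangular fuzzy number (a, b, c) using lambda-cuts.

    alpha is the optimism coefficient (assumed default 0.5);
    steps=10 gives lambda = 0, 0.1, ..., 1 with a step of 0.1.
    """
    a, b, c = tri
    d = 1.0 / steps
    lambdas = [i * d for i in range(steps + 1)]
    lower = [a + (b - a) * lam for lam in lambdas]   # lower bound of each lambda-cut
    upper = [c - (c - b) * lam for lam in lambdas]   # upper bound of each lambda-cut
    # Trapezoidal rule over lambda in [0, 1] for the left and right integrals.
    i_left = sum((lower[i] + lower[i + 1]) / 2 * d for i in range(steps))
    i_right = sum((upper[i] + upper[i + 1]) / 2 * d for i in range(steps))
    return (1 - alpha) * i_left + alpha * i_right


# Hypothetical ratings of one influence factor by two experts.
mean_rating = average_fuzzy([(0.1, 0.3, 0.5), (0.3, 0.5, 0.7)])
crisp = integral_value(mean_rating)
```

For a triangular number the λ-cut bounds are linear in λ, so the trapezoidal sums reduce to I_L = (a + b)/2 and I_R = (b + c)/2. A larger α weights the upper bound more heavily and therefore gives a more optimistic (larger) crisp value.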
Step 9: From the result I(p) computed in step 8, obtain the prior probability distribution of the parent nodes in the fuzzy Bayesian network topology for the emergency, and input it into the GeNIe software to infer the probability of the target node, the risk of the emergency occurring, and thereby obtain the risk level of the emergency.
The invention establishes a risk prediction method based on TF-IDF and a fuzzy Bayesian network. It analyses the online public opinion data of an emergency with the TF-IDF text analysis technique, extracts characteristic factors, identifies the topics under discussion in online public opinion and builds an index system. Fuzzy theory and the properties of Bayesian networks are used to address problems that make emergencies hard to handle in practice, such as incomplete data, dynamically changing scenes and heavy overlap among event index information. Because a Bayesian network supports bidirectional causal reasoning, it can both infer the posterior probability of the outcome and reason backwards to the probability of each influence factor, identifying the key indices that drive changes in the emergency's situation. Relevant departments can then take targeted emergency measures based on the predicted result to prevent the situation from evolving further.
Drawings
FIG. 1 is a flow chart of a technical route of the present invention;
FIG. 2 is a data crawling flow diagram of the present invention.
Detailed Description
To facilitate understanding of the technical content of the present invention and understanding of the use of the fuzzy Bayesian network model for predicting the risk of the emergency event, the present invention is further explained below with reference to the accompanying drawings.
The invention provides an emergency risk prediction method based on TF-IDF and a fuzzy Bayesian network. It analyses how online public opinion about the emergency develops from the perspective of text analysis, extracts the characteristic influence factors of the event and establishes an emergency index system; it quantifies the domain expert evaluation results by combining fuzzy theory with defuzzification; and it applies Bayesian network inference, which uses a directed acyclic graph to handle probabilistic problems influenced by multiple factors, treats the uncertainty caused by conditional dependence among variables probabilistically, and performs inference over uncertain knowledge and information, thereby predicting the risk of the emergency. The specific implementation steps of the technical route are shown in FIG. 1.
The method comprises the following specific implementation steps:
Step 1: Data acquisition. The online public opinion data of the emergency are collected with the Octopus data collector, which converts unstructured web page data into structured data and stores it in various forms such as a database or Excel. Cloud-based collection makes large-scale data acquisition accurate and efficient, lowering the cost of obtaining information and improving efficiency, which helps greatly in collecting emergency public opinion data. The data collected include posts about the emergency on social media platforms, news reports, and the forwards, comments and likes of posts by Internet users. As shown in FIG. 2, the data acquisition process first inputs the URLs relevant to the emergency into the data collector and defines the crawling rules, so that the post content and the forward, comment and like data for the emergency are crawled accurately. Finally the crawled data are stored and the acquisition process ends.
Step 2: Perform text analysis on the online public opinion data of the emergency obtained in step 1 to determine the characteristic factors of the emergency. This mainly comprises the following steps:
S1) Domain experts studying the emergency determine the keyword index points and preliminarily summarize the emergency keywords in the post content and comments acquired in step 1. The index points include:
1) Words that appear repeatedly in the text data.
2) Names of people, times and locations in the text data.
3) Emotional vocabulary that expresses attitudes in the text data.
4) Vocabulary in the text data that reflects the decisions of relevant departments, and the names of those departments.
S2) Apply the TF-IDF algorithm to the online public opinion data of the emergency to obtain feature items, then compare and consolidate the emergency keywords summarized in step S1) with those feature items to determine the characteristic factors of the emergency. The TF-IDF algorithm includes:
1) TF (term frequency) in TF-IDF is the frequency with which a word appears in a document. To reduce errors caused by differences in document length, the term frequency is normalized as:
$tf_i = \frac{n_{i,d}}{\sum_{k=1}^{n} n_{k,d}}$
where $tf_i$ is the normalized value for word i; $n_{i,d}$ is the total number of times word i appears in document d; $\sum_{k=1}^{n} n_{k,d}$ is the number of all words in document d; and n is the total number of words.
2) IDF (inverse document frequency) in TF-IDF is the inverse document frequency. The fewer documents in the emergency corpus that contain word i, the better word i distinguishes document categories. The emergency corpus here is the Chinese emergency corpus of the Semantic Intelligence Laboratory of Shanghai University. News reports on 5 types of emergency (earthquake, fire, traffic accident and so on) were collected from the Internet as raw corpora, which then underwent text preprocessing, text analysis, event annotation and consistency checking; the annotation results were finally stored in the corpus.
The inverse document frequency of word i is computed as:
$idf_i = \log \frac{|D|}{1 + |\{j : i \in d_j\}|}$
where $|D|$ is the total number of documents in the corpus, $|\{j : i \in d_j\}|$ is the number of documents containing word i, and $i \in d_j$ means that word i belongs to the j-th document $d_j$ in the corpus. To avoid a zero denominator when word i does not appear in the corpus, that is, when $|\{j : i \in d_j\}| = 0$, the denominator is increased by 1.
3) The TF-IDF value measures how well a word distinguishes categories and is computed as:
$tf\_idf_i = tf_i \times idf_i$
and step 3: and (3) determining an emergency life cycle evolution model by extracting the characteristic factors of the emergency and looking up the development characteristics of the previous events of the same type in the step (2), and dividing the life cycle into 5 stages from stage 1 to stage 5.
And 4, step 4: determining influence factors driving the situation change of the emergency through the characteristics of each stage of the emergency life cycle evolution model constructed in the step 3 and the characteristic factors of the emergency extracted in the step 2, and refining and analyzing the influence factors from eight different angles to establish a three-stage index system. The three-level index system sequentially represents from top to bottom: the first-level index points are used for predicting the risk of the emergency, the second-level index points are used for representing eight different angles for driving the situation change of the emergency, and the third-level index points represent the influence factors of the emergency.
And 5: and determining a fuzzy Bayesian network topological structure related to the emergency according to the three-level index system established in the step 4, wherein the fuzzy Bayesian network topological structure comprises basic unit father nodes, intermediate nodes for linking the father nodes and the target nodes, and target nodes of a final network reasoning result, the three-level index points in the three-level index system are used as the father nodes of the fuzzy Bayesian network topological structure, the two-level index points are used as the link nodes (namely the intermediate nodes) of the fuzzy Bayesian network topological structure, and the one-level index points are used as the target nodes of the fuzzy Bayesian network topological structure. The Bayesian Network reasoning technical principle is an important result based on Bayesian theory, wherein a Bayesian Network (BN) is also called a belief Network or a directed acyclic graph, nodes in the graph represent random variables, connecting lines between the nodes represent existing dependency relationships, arrows represent causal relationships between the two random variables, and the BN Network can calculate reasoning from reasons to results and can also perform a reverse diagnosis process from the results to the reasons. The basic unit of the BN structure is called a parent node, also called an evidence node, and the intermediate nodes are called link nodes for connecting the parent node and the target node. Bayes theory calculates posterior probability, namely target node probability, based on prior probability distribution and conditional probability inference of father nodes. Joint probability means that two events are sent togetherThe probability of generation, where p is (A, B), is called the joint probability of A and B, e.g. for any set of random variables
Figure RE-GDA0003808112430000072
Figure RE-GDA0003808112430000073
The joint probability distribution is expressed as:
Figure RE-GDA0003808112430000074
where
Figure RE-GDA0003808112430000081
denotes the set of parent nodes of the random variable x in time period t_{j′}, and t_{n′} is the time length. When
Figure RE-GDA0003808112430000082
is a parent node of
Figure RE-GDA0003808112430000083
the joint probability distribution is expressed as:
Figure RE-GDA0003808112430000084
The conditional probability is the probability that event A occurs given that event B has occurred, written p(A | B) and called the probability of A under condition B. It is calculated as
$p(A \mid B) = \frac{p(A, B)}{p(B)}$
For any set of random variables
Figure RE-GDA0003808112430000086
Figure RE-GDA0003808112430000087
the conditional probability of the random variable
Figure RE-GDA0003808112430000088
is expressed as:
Figure RE-GDA0003808112430000089
The prior probability is the probability that event A occurs, p(A), or that event B occurs, p(B). The posterior probability is the probability of event A recalculated after event B has occurred, called the posterior probability of A and expressed as
$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$
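The cause-to-effect and effect-to-cause reasoning described in this step can be illustrated by exact enumeration over a very small network. The sketch below is an assumption-laden toy, not the patent's network: it has two binary parent nodes (`x1`, `x2`) feeding one target node directly, it omits the intermediate layer, and every prior and conditional probability value is hypothetical.

```python
from itertools import product

# Hypothetical prior probabilities p(node = 1) for two parent nodes.
prior = {"x1": 0.3, "x2": 0.6}
# Hypothetical conditional probability table: p(target = 1 | x1, x2).
cpt = {(0, 0): 0.05, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}


def p_parents(x1, x2):
    """Joint probability of the two independent parent nodes taking these states."""
    p1 = prior["x1"] if x1 else 1 - prior["x1"]
    p2 = prior["x2"] if x2 else 1 - prior["x2"]
    return p1 * p2


def p_target():
    """Forward (cause-to-effect) inference: p(target = 1)."""
    return sum(p_parents(a, b) * cpt[(a, b)] for a, b in product((0, 1), repeat=2))


def p_x1_given_target():
    """Diagnostic (effect-to-cause) inference: p(x1 = 1 | target = 1)."""
    numerator = sum(p_parents(1, b) * cpt[(1, b)] for b in (0, 1))
    return numerator / p_target()


risk = p_target()
diagnosis = p_x1_given_target()
```

Here the forward pass gives the probability of the target node, and the diagnostic pass shows how observing the target raises the probability of `x1` above its prior. A tool such as GeNIe performs the same kind of computation on larger networks.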
Step 6: Because the influence factors of an emergency are difficult to express precisely, linguistic evaluation-level variables are introduced. Domain experts studying the emergency use these variables to evaluate the parent nodes, which represent the influence factors of the emergency in the fuzzy Bayesian network topology built in step 5. Following fuzzy theory, seven natural language variables are defined: "very high (VH)", "high (H)", "fairly high (FH)", "medium (M)", "fairly low (FL)", "low (L)" and "very low (VL)". They indicate the degree to which the domain experts rate each influence factor of the emergency.
Step 7: The domain experts studying the emergency use the natural language variables defined in step 6 to produce fuzzy linguistic ratings. After the experts carry out M rounds of anonymous evaluation and screening of the influence factors according to the Delphi method, the domain expert evaluation result is determined. Here M = 5.
Step 8: Quantify the domain expert evaluation result determined in step 7 using a defuzzification method. A fuzzy number is a concept from fuzzy set theory: it denotes a fuzzy set A on the domain U that satisfies U(x) ∈ [0,1], where U(x) is called the degree of membership; U(x) is the membership function of any random variable x, also called the fuzzy function of x. When U(x) satisfies the following expression, A is called a triangular fuzzy number.
Figure RE-GDA0003808112430000091
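As an illustration of a triangular fuzzy number, the following Python sketch evaluates a triangular membership function. It assumes the commonly used piecewise-linear definition with parameters (a, b, c), where a < b < c; the exact expression in the formula image is not reproduced in this text, so the parameter names and this form are assumptions.

```python
def triangular_membership(x: float, a: float, b: float, c: float) -> float:
    """Membership degree of x in the triangular fuzzy number (a, b, c).

    Assumed form: rises linearly from 0 at a to 1 at b,
    falls linearly back to 0 at c, and is 0 elsewhere.
    """
    if not a < b < c:
        raise ValueError("expected a < b < c")
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy number (0.0, 0.5, 1.0).
for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(x, triangular_membership(x, 0.0, 0.5, 1.0))
```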
The domain expert evaluation result is defuzzified by the integral value method to obtain quantized values of the fuzzy language; these values are averaged and normalized to determine the prior probability distribution of the parent nodes of the Bayesian network, which the network then uses to infer the probability of the target node. In fuzzy theory, the cut set (λ-cut) represents the conversion from fuzzy to crisp. To quantify the domain expert evaluation result, a table of the quantitative relationship among the natural language variables determined in step 6, the fuzzy numbers, and the cut sets is established, as shown in Table 1; the cut-set parameter is λ. The fuzzy probability is averaged as
Figure RE-GDA0003808112430000092
where the averaged value represents the fuzzy probability of occurrence of the i′-th event, A_i′k represents the fuzzy value of the k-th expert's evaluation of the i′-th event, and n′ represents the number of events. Defuzzification by the integral value method gives the crisp probability by the following formula:
Figure RE-GDA0003808112430000093
where p represents the fuzzy probability, I represents the defuzzified value,
Figure RE-GDA0003808112430000094
represents the optimism coefficient,
Figure RE-GDA0003808112430000095
represents the lower bound of the fuzzy probability, and
Figure RE-GDA0003808112430000096
represents the upper bound of the fuzzy probability; here the chosen value is
Figure RE-GDA0003808112430000097
μ_l(p) and μ_r(p) represent the integral values of the left and right membership functions, respectively. The λ-cut expressions of μ_l(p) and μ_r(p) are as follows:
Figure RE-GDA0003808112430000098
Figure RE-GDA0003808112430000099
where
Figure RE-GDA00038081124300000910
represents the lower bound of the λ-cut, and
Figure RE-GDA00038081124300000911
represents the upper bound of the λ-cut; λ = 0, 0.1, 0.2, …, 1 and Δλ = 0.1.
Table 1 Quantitative relationships
Figure RE-GDA00038081124300000912
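The averaging and integral value defuzzification of step 8 can be sketched in Python as follows. This is a sketch under stated assumptions, not the patent's exact formulas, which are only available as images: expert evaluations are taken to be triangular fuzzy numbers (a, b, c), averaging is done component-wise, the left and right integral values are approximated by a trapezoidal sum over the λ-cut bounds with Δλ = 0.1, and the default optimism coefficient `alpha=0.5` is a hypothetical choice. All function names and sample values are illustrative.

```python
from typing import Sequence, Tuple

Tri = Tuple[float, float, float]  # triangular fuzzy number (a, b, c), assumed form


def average_fuzzy(evaluations: Sequence[Tri]) -> Tri:
    """Component-wise mean of the experts' triangular fuzzy numbers."""
    m = len(evaluations)
    if m == 0:
        raise ValueError("at least one expert evaluation is required")
    a, b, c = (sum(e[i] for e in evaluations) / m for i in range(3))
    return (a, b, c)


def lambda_cut(tri: Tri, lam: float) -> Tuple[float, float]:
    """Lower and upper bounds of the lambda-cut of a triangular fuzzy number."""
    a, b, c = tri
    return a + (b - a) * lam, c - (c - b) * lam


def integral_value(tri: Tri, alpha: float = 0.5, steps: int = 10) -> float:
    """Defuzzified value I = alpha * mu_r + (1 - alpha) * mu_l.

    mu_l and mu_r are trapezoidal sums of the lambda-cut lower and upper
    bounds over lambda = 0, 1/steps, ..., 1 (delta lambda = 1/steps).
    """
    d = 1.0 / steps
    cuts = [lambda_cut(tri, k * d) for k in range(steps + 1)]
    lows = [lo for lo, _ in cuts]
    highs = [hi for _, hi in cuts]
    mu_l = 0.5 * (sum(lows[1:]) + sum(lows[:-1])) * d
    mu_r = 0.5 * (sum(highs[1:]) + sum(highs[:-1])) * d
    return alpha * mu_r + (1.0 - alpha) * mu_l


# Two hypothetical expert ratings of one influence factor.
ratings = [(0.0, 0.2, 0.4), (0.2, 0.4, 0.6)]
print(integral_value(average_fuzzy(ratings)))
```

Under these assumptions, `alpha=0` returns the pessimistic (left) integral value and `alpha=1` the optimistic (right) one, so the coefficient interpolates between the lower and upper bounds of the defuzzified probability.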
Step 9: Following the rule for quantifying fuzzy language in step 8, the domain experts who study the emergency evaluate the parent nodes that represent the influence factors of the emergency in the fuzzy Bayesian network, and the quantization results I(p) of all parent nodes are calculated, as shown in Table 2. Each node value represents the probability level with which an influence factor of the emergency affects the target node and serves as an initial value for inference. State1 and State0 represent the probability that an event "occurs" or "does not occur" under the influence factor, and the present invention treats State1 as the risk probability that the emergency occurs. The node values are input into GeNIe software to infer the probability of the target node, and the result is compared with the probability ranges of the risk levels in Table 3 to obtain the risk level of the emergency. The example verification shows that the risk probability of the target node is 76%, which falls in the 75%-100% range of Table 3, indicating that the emergency is a Class I risk. When an emergency has occurred, the value of the target node State1 is set to 100%, and the key characteristic factors influencing its occurrence are determined by reverse reasoning.
Table 2 Prior probability distribution of parent nodes
Figure RE-GDA0003808112430000101
Table 3 Risk levels
Figure RE-GDA0003808112430000102
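The forward inference and reverse reasoning of step 9 are performed in GeNIe in the described method. The following Python sketch only illustrates the underlying calculation by brute-force enumeration on a toy network in which binary parent nodes point directly at one binary target node. The network shape, the prior values, and the conditional probability table are all hypothetical and are not taken from Table 2 or from the patent's network.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def _weight(priors: Sequence[float], states: Tuple[int, ...]) -> float:
    """Joint prior probability of one combination of parent states."""
    w = 1.0
    for p, s in zip(priors, states):
        w *= p if s == 1 else 1.0 - p
    return w


def target_probability(priors: Sequence[float],
                       cpt: Dict[Tuple[int, ...], float]) -> float:
    """Forward inference: P(target = State1) by summing over parent states.

    priors[i] is P(parent i = State1); cpt maps each tuple of parent
    states to P(target = State1 | parent states).
    """
    return sum(_weight(priors, s) * cpt[s]
               for s in product((0, 1), repeat=len(priors)))


def parent_posteriors(priors: Sequence[float],
                      cpt: Dict[Tuple[int, ...], float]) -> List[float]:
    """Reverse reasoning: P(parent i = State1 | target = State1) for each i."""
    p_target = target_probability(priors, cpt)
    result = []
    for i in range(len(priors)):
        joint = sum(_weight(priors, s) * cpt[s]
                    for s in product((0, 1), repeat=len(priors)) if s[i] == 1)
        result.append(joint / p_target)
    return result


# Hypothetical priors for two influence factors and a hypothetical CPT.
priors = [0.6, 0.3]
cpt = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}
print(target_probability(priors, cpt))
print(parent_posteriors(priors, cpt))
```

In this toy setting, the parent whose posterior rises most once the target is fixed at State1 plays the role of a key characteristic factor identified by reverse reasoning.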
The embodiments described above are only a part of the embodiments of the present invention, and not all of them. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Claims (5)

1. A risk prediction method based on TF-IDF and fuzzy Bayesian network is characterized by comprising the following steps:
step 1: data acquisition: in the data acquisition stage, a data collector uses web crawler technology to collect, from social media platforms, netizens' posts about the emergency, news reports, and the forward counts, comment counts, and like counts of the posts, so as to comprehensively acquire the network public opinion data of the emergency;
step 2: performing text analysis on the network public opinion data of the emergency obtained in the step 1 to determine characteristic factors of the emergency;
step 3: determining an emergency life-cycle evolution model from the characteristic factors of the emergency extracted in step 2 and from the development characteristics of previous events of the same type, and dividing the life cycle of the emergency into 5 stages, from stage 1 to stage 5;
step 4: determining the influence factors that drive the situation change of the emergency from the characteristics of each stage of the emergency life-cycle evolution model constructed in step 3 and the characteristic factors of the emergency obtained in step 2, and analyzing the influence factors in detail to establish a three-level index system from eight different angles, wherein the three-level index system is, from top to bottom: the first-level index point, which represents the risk prediction of the emergency; the second-level index points, which represent the eight different angles driving the situation change of the emergency; and the third-level index points, which represent the influence factors of the emergency;
step 5: determining a fuzzy Bayesian network topology about the emergency according to the three-level index system established in step 4, wherein the topology comprises parent nodes as basic units, intermediate nodes linking the parent nodes, and a target node giving the final network inference result; the third-level index points of the three-level index system serve as the parent nodes of the fuzzy Bayesian network topology, the second-level index points serve as the intermediate nodes, and the first-level index point serves as the target node;
step 6: introducing linguistic evaluation level variables, with which domain experts who study the emergency evaluate the parent nodes that represent the influence factors of the emergency in the fuzzy Bayesian network topology constructed in step 5, and determining the natural language variables according to fuzzy theory as follows: seven natural language variables, "Very High (VH)", "High (H)", "Fairly High (FH)", "Medium (M)", "Fairly Low (FL)", "Low (L)", and "Very Low (VL)", serve as indexes of the degree to which the domain experts rate the influence factors of the emergency;
step 7: using the natural language variables determined in step 6, the domain experts who study the emergency give their evaluations in fuzzy language: following the Delphi method, the domain experts carry out M rounds of anonymous evaluation and screening of the influence factors of the emergency, and the domain expert evaluation result is determined;
step 8: defuzzifying the domain expert evaluation result determined in step 7 by the integral value method to obtain quantized values of the fuzzy language, averaging and normalizing the quantized values, and determining the prior probability distribution of the parent nodes in the fuzzy Bayesian network topology about the emergency, the prior probability distribution being used to infer the probability of the target node of the fuzzy Bayesian network topology about the emergency, specifically:
the cut set (λ-cut) in fuzzy theory represents the conversion from fuzzy to crisp; to quantify the expert evaluation result, a table of the quantitative relationship among the natural language variables determined in step 6, the fuzzy numbers, and the cut sets is established, the cut-set parameter is λ, and the fuzzy probability is averaged as
Figure RE-FDA0003808112420000021
Figure RE-FDA0003808112420000022
i′ = 1, 2, …, n′, where
Figure RE-FDA0003808112420000023
represents the fuzzy probability of occurrence of the i′-th event, A_i′k represents the fuzzy value of the k-th expert's evaluation of the i′-th event, and n′ represents the number of events; defuzzification by the integral value method gives the crisp probability by the following formula:
Figure RE-FDA0003808112420000024
wherein p represents the fuzzy probability, I(p) represents the defuzzified value,
Figure RE-FDA0003808112420000025
represents the optimism coefficient,
Figure RE-FDA0003808112420000026
μ_l(p) and μ_r(p) represent the integral values of the left and right membership functions, respectively, and the λ-cut expressions of μ_l(p) and μ_r(p) are as follows:
Figure RE-FDA0003808112420000027
Figure RE-FDA0003808112420000028
wherein
Figure RE-FDA0003808112420000029
represents the lower bound of the λ-cut, and
Figure RE-FDA00038081124200000210
represents the upper bound of the λ-cut; λ = 0, 0.1, 0.2, …, 1; Δλ = 0.1;
step 9: obtaining the prior probability distribution of the parent nodes in the fuzzy Bayesian network topology about the emergency from the calculation result I(p) of step 8, inputting the prior probability distribution into GeNIe software to infer the probability of the target node, namely the risk of the emergency occurring, and thereby obtaining the risk level of the emergency.
2. The method according to claim 1, wherein the step 2 specifically comprises:
step 2.1: the domain experts who study the emergency determine the index points for extracting keywords and preliminarily summarize the emergency keywords in the post content and comments obtained in step 1, wherein the index points comprise: words that appear repeatedly in the text data; person names, times, and places in the text data; words expressing attitude or emotion in the text data; and words in the text data that reflect the decision-making behavior of relevant departments or name those departments;
step 2.2: performing feature extraction on the network public opinion data of the emergency with the TF-IDF algorithm to obtain feature items, comparing and summarizing the emergency keywords summarized in step 2.1 and the feature items obtained by the TF-IDF algorithm, and determining the characteristic factors of the emergency, wherein the TF-IDF algorithm comprises the following steps:
1) TF (term frequency) in TF-IDF represents the word frequency, represents the frequency of a certain word appearing in a document, and in order to reduce the error of the result caused by the difference of the word number of the document, the word frequency normalization is represented as:
Figure RE-FDA0003808112420000031
wherein tf_i represents the normalized word frequency of word i; N_i,d represents the total number of times word i appears in document d;
Figure RE-FDA0003808112420000032
representing the number of all words in the document d, and n representing the total number of words;
2) IDF (Inverse Document Frequency) in TF-IDF represents the inverse document frequency; when few documents in the emergency corpus contain word i, word i distinguishes document categories well, and the inverse document frequency of word i is calculated as:
Figure RE-FDA0003808112420000033
wherein |D| represents the total number of documents in the emergency corpus, |{j : i ∈ d_j}| represents the number of documents containing word i, and i ∈ d_j means that word i belongs to the j-th document d_j in the emergency corpus;
3) The TF-IDF value represents the effect of distinguishing categories, and the TF-IDF value represents:
tf_idf_i = tf_i × idf_i
3. The TF-IDF and fuzzy Bayesian network based risk prediction method of claim 2, wherein said data collector is an Octopus data collector.
4. The TF-IDF and fuzzy Bayesian network based risk prediction method according to claim 3, wherein M = 5 in said step 7.
5. The TF-IDF and fuzzy Bayesian network based risk prediction method of claim 4, wherein in said step 8
Figure RE-FDA0003808112420000034
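The TF-IDF computation recited in claim 2 can be illustrated with a minimal Python sketch. It assumes the common forms tf_i = N_i,d / (number of words in d) and idf_i = log(|D| / |{j : i ∈ d_j}|) with the natural logarithm and no smoothing term; the formula images are not reproduced in this text, so the logarithm base and the absence of smoothing are assumptions. The corpus and function names are hypothetical, and documents are assumed to be already tokenized.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tf(document: Sequence[str]) -> Dict[str, float]:
    """Normalized term frequency: count of each word over the document length."""
    counts = Counter(document)
    total = len(document)
    return {word: count / total for word, count in counts.items()}


def idf(corpus: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Inverse document frequency: log(|D| / number of documents containing the word)."""
    n_docs = len(corpus)
    doc_freq: Counter = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each word once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tf_idf(corpus: Sequence[Sequence[str]]) -> List[Dict[str, float]]:
    """Per-document TF-IDF scores: tf_i * idf_i for every word in each document."""
    idf_values = idf(corpus)
    return [{word: value * idf_values[word] for word, value in tf(doc).items()}
            for doc in corpus]


# Hypothetical tokenized posts about an emergency.
corpus = [["flood", "rescue", "flood"], ["rescue", "report"]]
for scores in tf_idf(corpus):
    print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

A word that occurs in every document, such as "rescue" here, receives a score of zero under this unsmoothed form, which matches the stated idea that words contained in few documents distinguish categories best.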
CN202111030602.8A 2021-09-03 2021-09-03 TF-IDF and fuzzy Bayesian network-based risk prediction method Pending CN115115159A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111030602.8A CN115115159A (en) 2021-09-03 2021-09-03 TF-IDF and fuzzy Bayesian network-based risk prediction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111030602.8A CN115115159A (en) 2021-09-03 2021-09-03 TF-IDF and fuzzy Bayesian network-based risk prediction method

Publications (1)

Publication Number Publication Date
CN115115159A true CN115115159A (en) 2022-09-27

Family

ID=83325303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111030602.8A Pending CN115115159A (en) 2021-09-03 2021-09-03 TF-IDF and fuzzy Bayesian network-based risk prediction method

Country Status (1)

Country Link
CN (1) CN115115159A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination