CN113010572B

CN113010572B - Public digital life scene rule model prediction early warning method based on deep Bayesian network

Info

Publication number: CN113010572B
Application number: CN202110292515.3A
Authority: CN
Inventors: 马汉杰; 董慧; 许永恩; 刘烈宏; 李柏睿
Original assignee: Hangzhou Maquan Information Technology Co ltd
Current assignee: Hangzhou Maquan Information Technology Co ltd
Priority date: 2021-03-18
Filing date: 2021-03-18
Publication date: 2023-04-18
Anticipated expiration: 2041-03-18
Also published as: CN113010572A

Abstract

The invention discloses a prediction and early warning method for rule models of public digital life scenes based on a deep Bayesian network. Multi-source heterogeneous data from key scenes in public digital life are analyzed and extracted to generate a feature library of information elements and behavior elements. This feature library is combined with user digital portraits to construct a personalized rule mechanism, so that predictions and early warnings can be issued accurately and in time for different key life scenes, providing strong support for early intervention. The method can be applied to public safety and health early warning, mental health early warning, campus cheating event early warning, and the like.

Description

Public digital life scene rule model prediction early warning method based on deep Bayesian network

Technical Field

The invention belongs to the technical field of big data analysis, and particularly relates to a public digital life scene rule model prediction early warning method based on a deep Bayesian network.

Background

With the rapid iteration of Internet technologies such as cloud computing and big data and the continuous improvement of living standards, public demand for services such as basic education, public health, public transportation, and elderly care keeps growing. Government departments at all levels are also exploring innovative public service models for the "Internet+" era, promoting the digitization of public life and making daily life more convenient. In public digital life, problems in certain key life scenes, such as economic disputes and fires, seriously harm the interests of the public and social stability; predicting and warning of such events in advance can avoid major losses. For other life scenes, such as route planning and intelligent recommendation, accurate analysis and prediction bring great convenience and improve people's sense of well-being. How to issue timely and accurate predictions and early warnings for different key life scenes, and thereby provide strong support for early intervention, is therefore a problem that urgently needs to be solved.

Existing prediction and early warning technologies still analyze and predict people's behavior from features of a single dimension or only a few dimensions, and therefore suffer from incomplete feature analysis and low prediction accuracy.

Chinese patent publication No. CN106709606A provides a personalized scene prediction method and apparatus. It first obtains the geographic location information of a user from a location service, where the geographic location information includes time-associated POI information. It then performs cluster analysis on all of the user's geographic location information within a preset period to obtain a sequence of lifestyle trajectory vectors, and constructs a Markov transition matrix from that sequence. Finally, it obtains the user's current scene and looks up the corresponding predicted scene in the Markov transition matrix.

Chinese patent publication No. CN107967578A provides a public safety big data early warning platform for a smart city, comprising an early warning system, a communication module, a cloud data platform, and an information receiving terminal. The early warning system comprises a natural disaster early warning system, an accident disaster early warning system, a public health event early warning system, and a social safety event early warning system. The information receiving terminal is a PC terminal or a mobile terminal, each of which displays early warning information through an early warning application program interface.

Chinese patent publication No. CN109711613A provides an early warning method and system based on a personnel relationship model and an event association model. The method extracts model information data from public safety big data and filters it; performs statistical analysis on the model information data according to personnel identity data, and extracts the people who have reported events multiple times to create a personnel relationship model; extracts semantic elements from the model information data according to event data, and extracts the events that have been reported multiple times to create an event relationship model; sets a personnel early warning threshold according to the number of times one person reports an event; sets an event early warning threshold according to the number of times multiple people report the same event; and issues warnings for the people and events that exceed the thresholds.

In summary, a high-quality early warning system should issue accurate and timely predictions and warnings for different key life scenes. To do so, it should fuse the multi-dimensional attributes of users, break through the limitation of single-dimension analysis, associate the attributes across dimensions, and apply a processing method suited to the characteristics of each dimension, making the early warning more timely and accurate.

Disclosure of Invention

In view of the above, the invention provides a prediction and early warning method for rule models of public digital life scenes based on a deep Bayesian network, which can issue accurate and timely predictions and early warnings for different key life scenes and provide strong support for early intervention.

A public digital life scene rule model prediction early warning method based on a deep Bayesian network comprises the following steps:

(1) Obtaining massive multi-source heterogeneous data through three access channels, namely the Internet of Things, application terminals, and business systems, and establishing a database;

(2) Layering the database and constructing subject libraries for five basic elements: people, enterprises, places, events, and things;

(3) Processing the multi-source heterogeneous data with a batch-stream real-time big data processing technology;

(4) Combining the five basic element subject libraries with a specific application scene to construct five dimensions of the user digital portrait for that scene: demographic attributes, life attributes, social attributes, consumption characteristics, and psychological attributes;

(5) Constructing the user digital portrait from the processed multi-source heterogeneous data by mining the data and deriving user labels;

(6) For a specific application scene, training a deep Bayesian network with the user digital portrait information to obtain an event risk prediction model for that scene, and then using the model to predict and warn of risks in a target event.

Further, the multi-source heterogeneous data in step (1) include structured data and unstructured data. The structured data include basic data, such as house and address information, and extended data, such as vehicle entry and exit information and Internet of Things sensing information. The unstructured data include life event information collected by personnel, as well as video surveillance data, audio data, and image data collected by devices such as cameras.

Further, the batch-stream real-time big data processing technology in step (3) comprises five functional modules: data acquisition, data loading, data bus, data analysis, and business service. The data acquisition module accesses stream data in real time through Internet of Things acquisition and application-side acquisition. The data loading module loads historical offline data and accesses stream data from the business system. The data bus module places all kinds of data into designated channels for transmission in a uniform format. The data analysis module extracts and processes real-time data and pushes the resulting data onward. When a real-time query request from a business system is received, the data analysis module uses its internal analysis and processing model to compute the corresponding index over the complete big data set in real time, evaluates the index, and returns the result to the business system through the business service module.
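The data bus idea, placing records from different sources into designated channels in one uniform format, can be illustrated with a minimal in-memory sketch. The `BusMessage` schema, the `DataBus` class, and the channel names are hypothetical and are not defined in the patent; a real deployment would presumably rely on a distributed message queue.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from queue import Queue
from typing import Any, Dict


@dataclass
class BusMessage:
    """Uniform envelope (hypothetical schema) for any record placed on the bus."""
    source: str               # e.g. "iot", "app", "business_system"
    channel: str              # designated channel the record is routed to
    payload: Dict[str, Any]   # source-specific content
    ts: float = field(default_factory=time.time)


class DataBus:
    """Routes messages into per-channel queues, serialized in one format (JSON)."""

    def __init__(self) -> None:
        self.channels: Dict[str, Queue] = {}

    def publish(self, msg: BusMessage) -> None:
        queue = self.channels.setdefault(msg.channel, Queue())
        queue.put(json.dumps(asdict(msg), ensure_ascii=False))

    def consume(self, channel: str) -> Dict[str, Any]:
        return json.loads(self.channels[channel].get_nowait())


bus = DataBus()
bus.publish(BusMessage("iot", "gate_events", {"gate": "north", "person_id": "s001"}))
bus.publish(BusMessage("business_system", "card_swipes", {"person_id": "s001", "amount": 12.5}))
event = bus.consume("gate_events")
```

Because every consumer sees the same envelope fields, the analysis module can treat stream data and loaded offline data alike.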

Further, the demographic attributes in step (4) describe the user's basic social characteristics and help each key life application scene understand the user's basic situation (specifically including name, gender, grade and major, student ID, dormitory number, height, age, marital status, contact details, occupation, and the like). The life attributes describe the user's living conditions, including the range of daily activity (canteens, teaching buildings, dormitory buildings, shopping malls, bus stations, railway stations, and the like) and travel modes (bicycles, shared bicycles, electric vehicles, buses, private cars, and the like), so that accurate services can be provided later. The social attributes describe the user's social graph, family members, circle of friends, and interests (specifically including roommates, classmates, students, teachers, degree of closeness, a liking for the library, and the like); this information usually represents the user's social network, and it allows the user to be understood as completely as possible so that personalized services can be provided. The consumption characteristics describe the user's main consumption habits and preferences (including car ownership, shopping types, purchase cycles, brand preferences, and the like); they are used to mine potential users of related consumption services and to recommend related products and services according to each user's consumption characteristics, improving the recommendation conversion rate. The psychological attributes focus on the user's psychological condition (such as character, ability, temperament, values, emotions, and thinking); the psychological condition is obtained through anonymous questionnaires or by clustering similar users, and corresponding psychological services or closer attention are provided accordingly.

Further, in step (5), non-video data and video data in the multi-source heterogeneous data are handled by two different user label construction modes: one based on mining the original data and one based on video structuring technology. For non-video data, the label construction mode based on original data mining fuses five methods: natural language processing, user intention recognition, association rules, cluster analysis, and trajectory similarity. Where data for a particular dimension of a particular user are missing, a collaborative filtering algorithm completes the features from the analysis of other, similar users, ensuring that the user's labels are complete. For video data, the label construction mode based on video structuring technology integrates three methods: target detection, OpenCV + CNN emotion recognition, and GaitSet gait recognition.
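As a rough illustration of completing a missing label dimension from similar users, the sketch below uses user-based collaborative filtering with cosine similarity over the labels two users share. The label names, the numeric scores, and the `complete` helper are hypothetical; the patent does not specify the similarity measure or the weighting scheme.

```python
from math import sqrt
from typing import Dict, Optional

Profile = Dict[str, Optional[float]]  # label -> score, None when missing


def cosine(a: Profile, b: Profile) -> float:
    """Cosine similarity over the labels that both users have."""
    shared = [k for k in a if a[k] is not None and b.get(k) is not None]
    if not shared:
        return 0.0
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = sqrt(sum(a[k] ** 2 for k in shared))
    norm_b = sqrt(sum(b[k] ** 2 for k in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def complete(target: str, profiles: Dict[str, Profile]) -> Profile:
    """Fill each missing label with a similarity-weighted average of other users."""
    me = profiles[target]
    filled = dict(me)
    for label, value in me.items():
        if value is not None:
            continue
        num = den = 0.0
        for other, prof in profiles.items():
            if other == target or prof.get(label) is None:
                continue
            weight = cosine(me, prof)
            if weight > 0:
                num += weight * prof[label]
                den += weight
        if den > 0:
            filled[label] = num / den
    return filled


# Hypothetical label scores in [0, 1]
profiles = {
    "s001": {"library_visits": 0.9, "canteen_spend": 0.4, "sports": None},
    "s002": {"library_visits": 0.8, "canteen_spend": 0.5, "sports": 0.2},
    "s003": {"library_visits": 0.1, "canteen_spend": 0.9, "sports": 0.9},
}
result = complete("s001", profiles)
```

In this toy data the filled `sports` score for `s001` lands closer to that of `s002`, the more similar user.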

Furthermore, the natural language processing first uses the TF-IDF algorithm to compute the similarity between texts, then uses a fastText classifier to classify the texts according to that similarity. Finally, word vectors are extracted from the texts with Word2Vec, fused into sentence vectors with an LSTM, and fed into a pre-trained recurrent neural network or recursive neural network to predict and analyze the emotion expressed by similar texts.
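The first stage, TF-IDF similarity between texts, can be sketched in plain Python as follows. The smoothed IDF formula and the sample sentences are assumptions chosen for illustration; the later fastText, Word2Vec, and LSTM stages are not shown.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vector per tokenized document (smoothed IDF variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical, already tokenized texts
docs = [
    "canteen food is great today".split(),
    "canteen food was great yesterday".split(),
    "library closes early on friday".split(),
]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])  # texts sharing several terms
sim_far = cosine(vecs[0], vecs[2])    # texts sharing no terms
```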

Furthermore, user intention recognition judges the user's behavioral intention from the user's search records or from previously derived user labels. In a specific implementation, the data are vectorized with the TF-IDF algorithm; features are selected by term frequency, the chi-square statistic, and mutual information; and the behavioral intention is finally judged by a pre-trained CART (Classification and Regression Trees) decision tree, a random forest consisting of multiple decision trees, logistic regression, or a Bayesian model.
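Of the feature selection criteria named above, the chi-square statistic is easy to show concretely. The sketch below scores each term against one intention class using the standard 2x2 contingency formula; the search records and intention labels are invented for illustration.

```python
from typing import Dict, List, Tuple

Sample = Tuple[List[str], str]  # (tokens of a search record, intention label)


def chi_square(term: str, samples: List[Sample], label: str) -> float:
    """Chi-square statistic of the 2x2 table (term present?) x (label matches?)."""
    a = b = c = d = 0
    for tokens, y in samples:
        has_term = term in tokens
        if has_term and y == label:
            a += 1      # term present, target class
        elif has_term:
            b += 1      # term present, other class
        elif y == label:
            c += 1      # term absent, target class
        else:
            d += 1      # term absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Hypothetical search records with intention labels
samples: List[Sample] = [
    (["buy", "train", "ticket"], "travel"),
    (["train", "timetable"], "travel"),
    (["buy", "textbook"], "study"),
    (["library", "textbook"], "study"),
]
vocab = sorted({t for tokens, _ in samples for t in tokens})
scores: Dict[str, float] = {t: chi_square(t, samples, "travel") for t in vocab}
```

Here `train` separates the two classes perfectly and scores highest, while `buy` appears equally in both classes and scores zero, so it would be dropped before training the classifier.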

Furthermore, association rules are used to discover associations among data that appear irregular on the surface, so as to find the regularities and development trends in the data; in a specific implementation, the Apriori algorithm or the FP-Growth algorithm is used. Cluster analysis groups similar data into one class, in principle maximizing the similarity within each class; as an unsupervised algorithm, clustering is suitable for analyzing high-dimensional data. Trajectory similarity analyzes behavior trajectories in both the time domain and the space domain, mines the user's daily behavior patterns and preferences from historical trajectories, and labels them.
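A compact sketch of Apriori-style frequent itemset mining is given below, including the join and prune steps and the confidence of one derived rule. The transactions (places a student visited in a day) and the support threshold are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def apriori(transactions: List[Set[str]], min_support: float) -> Dict[FrozenSet[str], float]:
    """Return every itemset whose support is at least min_support."""
    n = len(transactions)

    def support(itemset: FrozenSet[str]) -> float:
        return sum(1 for t in transactions if itemset <= t) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent: Dict[FrozenSet[str], float] = {}
    k = 1
    while current:
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        keys = list(level)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in keys for b in keys if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical daily visit records
transactions = [
    {"canteen", "library"},
    {"canteen", "library", "gym"},
    {"canteen", "supermarket"},
    {"library", "gym"},
]
frequent = apriori(transactions, min_support=0.5)

# Confidence of the rule gym -> library
confidence = frequent[frozenset({"gym", "library"})] / frequent[frozenset({"gym"})]
```

With these four records, every visit that includes the gym also includes the library, so the rule gym -> library has confidence 1.0.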

Further, the OpenCV + CNN emotion recognition detects the expression state of faces in video images. In a specific implementation, the face is first detected and located, facial expression features are then extracted, and a pre-trained convolutional neural network (CNN) finally classifies the facial expression.

Further, the GaitSet gait recognition detects the walking posture of a person in video images. In a specific implementation, the images are first input into a convolutional neural network (CNN) to extract features; multiple pooling modes are then combined to aggregate the features into one feature vector, and Horizontal Pyramid Pooling (HPP) is applied to make the features more discriminative. A two-stream method is used in the prediction computation, that is, two channels are adopted: an RGB image channel that models spatial information, and an optical-flow channel in which an RNN models temporal information. The two channels are trained jointly and their information is fused. Finally, the features are input into the trained model to perform gait recognition.

Further, the training and prediction process of the deep Bayesian network in step (6) is as follows. First, the user digital portrait information in the specific application scene is analyzed to obtain the information elements and behavior elements related to an event, the association relationships among the event's elements are identified, and a feature sample library is established from these information elements and behavior elements. Then the feature samples are combined with expert opinions (which serve as the ground truth) to determine the prior probabilities of the network nodes, that is, the initial evidence of the risk probability. The feature samples and the initial evidence are input into the network structure, and the conditional probability distributions of the non-root nodes are inferred with the EM (expectation-maximization) algorithm. Finally, based on Bayes' rule, the prior probabilities and conditional probabilities are converted into a posterior probability, which is the predicted probability that the target event risk occurs.
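The final step, turning a prior and conditional probabilities into a posterior, can be sketched for the simplest possible structure: one risk node that is the sole parent of every evidence node. This structure, the node names, and all probabilities are hypothetical; the patent's networks (figs. 9 and 10) may have more layers and different dependencies.

```python
from typing import Dict


def posterior(prior: float,
              cpt: Dict[str, Dict[bool, float]],
              evidence: Dict[str, bool]) -> float:
    """P(risk=True | evidence), assuming evidence nodes are independent given risk.

    cpt[node][True]  = P(node=True | risk=True)
    cpt[node][False] = P(node=True | risk=False)
    """
    like_true = prior
    like_false = 1.0 - prior
    for node, observed in evidence.items():
        p_given_risk = cpt[node][True]
        p_given_no_risk = cpt[node][False]
        like_true *= p_given_risk if observed else 1.0 - p_given_risk
        like_false *= p_given_no_risk if observed else 1.0 - p_given_no_risk
    return like_true / (like_true + like_false)


# Hypothetical prior (from samples plus expert opinion) and conditional tables
prior_risk = 0.1
cpt = {
    "late_return": {True: 0.8, False: 0.2},
    "skips_meals": {True: 0.5, False: 0.1},
}
p_both = posterior(prior_risk, cpt, {"late_return": True, "skips_meals": True})
p_none = posterior(prior_risk, cpt, {})
```

With no evidence the posterior equals the prior; observing both behaviors raises the risk estimate from 0.1 to roughly 0.69 in this toy setting.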

The prediction and early warning method of the invention analyzes and extracts multi-source heterogeneous data from key scenes in public digital life, generates a feature library of information elements and behavior elements, and combines it with user digital portraits to construct a personalized rule mechanism. It can thus issue timely and accurate predictions and early warnings for different key life scenes and provide strong support for early intervention.

Drawings

Fig. 1 is a schematic flow diagram of the public digital life scene rule model prediction and early warning method of the invention.

Fig. 2 is a schematic diagram of the basic element subject libraries for public digital life data.

Fig. 3 is a schematic diagram of the data processing flow of the batch-stream real-time big data processing module of the invention.

Fig. 4 is a diagram of the user portrait construction framework of the invention.

Fig. 5 is a schematic diagram of the personalized feature model construction framework of the invention.

Fig. 6 is a schematic diagram of the technical route of the event anomaly prediction and early warning model of the invention.

Fig. 7 is a schematic diagram of the risk assessment process for various events in the invention.

Fig. 8 is a schematic diagram of the technical route of the public safety early warning of the invention.

Fig. 9 is a diagram of the Bayesian network structure of the invention.

Fig. 10 (a) is a diagram of the Bayesian network for class social interaction of the invention.

Fig. 10 (b) is a diagram of the Bayesian network for gender-specific social interaction of the invention.

Detailed Description

In order to more specifically describe the present invention, the following detailed description is provided for the technical solution of the present invention with reference to the accompanying drawings and the specific embodiments.

The overall process of the invention is shown in fig. 1. It can be applied to scenes such as campuses, residential communities, industrial parks, and rural areas. The following describes the public digital life scene rule model and the prediction and early warning method based on a deep Bayesian network, taking a campus scene as a specific example. The specific process is as follows:

(1) Access multi-source heterogeneous data. Multi-source heterogeneous data have two main characteristics: first, the data come from multiple sources, such as images captured by cameras, pedestrian gates, and vehicle gates, and system data provided by government departments; second, the data types and forms are complex, that is, heterogeneous. The data fall into two categories, structured and unstructured. Structured data take basic information such as houses and addresses as the basic data, with extended data that include face data, vehicle entry and exit data, and Internet of Things sensing data. Unstructured data include life event information collected by personnel, as well as video surveillance data, audio data, and image data collected by devices such as cameras. In the campus scene, this embodiment accesses massive multi-source heterogeneous data from Internet of Things devices such as cameras, pedestrian gates, and vehicle gates; from mobile sources such as WeChat, Weibo, and GPS; and from business systems, including campus one-card data, student registration data, access records, consumption records, and campus Wi-Fi access logs.

(2) Construct the basic element subject libraries. The data are decomposed by dimension to construct subject libraries for five basic elements: people, enterprises, places, events, and things, as shown in fig. 2. In the campus scene, "people" can be refined into students, faculty and staff, parents, visitors, and so on; "enterprises" into supermarkets, canteens, print shops, eyeglass shops, and so on; "events" into student entry and exit records, stranger access records, infectious disease conditions, and so on; and "places" into libraries, canteens, teaching buildings, and so on.

(3) Data processing. This embodiment combines a batch big data computing framework with a streaming big data computing framework to build a batch-stream real-time big data processing module, so that massive data files can be processed in parallel and in real time.

The data processing flow of the batch-stream real-time big data processing module is shown in fig. 3. Internally, the module is divided into sub-modules for data acquisition, data loading, data bus, data analysis, and business service. The data acquisition sub-module accesses stream data in real time through Internet of Things acquisition, application-side acquisition, and similar means. The data loading sub-module loads historical offline data and accesses stream data from specific business systems. The data bus sub-module places all kinds of data into designated channels for transmission in a uniform format. The data analysis sub-module extracts and processes real-time data and pushes the resulting data onward. When the module receives a real-time query request from a business system, it uses the analysis and processing model in the data analysis sub-module to compute the corresponding index over the complete big data set in real time, evaluates the index, and returns the result to the business system through the business service sub-module.

(4) Construct the dimensions of the user portrait. By closely combining the data in the basic element subject libraries with the campus scene, as shown in fig. 4, this embodiment constructs five dimensions of the user portrait for the campus scene: demographic attributes, life attributes, social attributes, consumption characteristics, and psychological attributes. Specifically:

The demographic attributes describe the user's basic social characteristics and help each key life application scene understand the user's basic situation. They specifically include name, gender, grade and major, student ID, dormitory number, height, age, marital status, contact details, occupation, and the like.

The life attributes describe the user's living conditions, so that accurate services can be provided later. They specifically include the range of daily activity and the travel modes. The range of daily activity includes canteens, teaching buildings, dormitory buildings, shopping malls, bus stations, railway stations, and the like; the travel modes include bicycles, shared bicycles, electric vehicles, buses, private cars, and the like.

The social attributes describe the user's social graph, family members, circle of friends, interests, and the like. This information usually represents the user's social network, and it allows the user to be understood as completely as possible so that personalized services can be provided. The attributes specifically include roommates, classmates, students, teachers, degree of closeness, a liking for the library, and the like.

The consumption characteristics describe the user's main consumption habits and preferences. They are used to mine potential users of related consumption services and to recommend related products and services according to each user's consumption characteristics, which improves the conversion rate. They include car ownership, shopping types, purchase cycles, brand preferences, and the like.

The psychological attributes focus on the user's psychological condition, such as character, ability, temperament, values, emotions, and thinking. The psychological condition is obtained through anonymous questionnaires or by clustering similar users, and corresponding psychological services or closer attention are provided accordingly.

(5) Construct the user digital portrait. Depending on whether the data are non-video data or video data, two modes of constructing user portrait labels are proposed, as shown in fig. 4: label construction based on mining the original data, and label construction based on video structuring technology.

For non-video data, data mining algorithms, including natural language processing (NLP), clustering, classification, and association rule algorithms, are applied to the data of the five element subject libraries for comprehensive analysis and computation. The differences in behavior patterns among user groups are mined, and users are labeled accordingly.

Non-video data cannot directly reveal detailed information about a user's movements, such as behavior patterns and clothing. To address this, this embodiment adopts a video structuring technology that combines traditional algorithms with deep learning algorithms.

Video structuring extracts key information at different levels from video using algorithms from fields such as video image processing and text analysis, attaches a corresponding semantic description to each level of key information, and finally stores the key video image information together with its semantic information in structured form through a standardized video description, so that the key information can be conveniently recorded and retrieved. It mainly involves target detection, behavior recognition, emotion recognition, and similar technologies, so that the information in a video image can be expressed effectively and a corresponding descriptive sentence, that is, a text label, can be generated for each image. For attributes that are difficult to determine because the data are insufficient, this embodiment completes them from the corresponding attributes of similar users through a collaborative filtering algorithm.

This embodiment constructs student portraits mainly from the perspective of the student, with rich and varied labels such as "top student", "struggling student", "sports enthusiast", "diligent", and "extroverted".

(6) Construct the deep Bayesian network rule model based on event features. First, the user digital portrait information in the campus scene is analyzed to obtain the information elements and behavior elements related to an event, which support construction of the event feature model, as shown in fig. 5. The information elements include time information, place information, trajectory information, person information, academic performance, and the like; the behavior elements include purchasing, traveling, communicating, staying, and the like. For each type of key life scene, ontology analysis of the events extracts, as far as possible, the information elements and behavior elements specific to that type of event across virtual space, physical space, and even the space of thought. By analyzing many similar events, the common features and common behaviors of that event type are generalized into a feature library of information elements and behavior elements specific to it, which supports risk prediction and early warning analysis for campus life scenes.

(7) Prediction and early warning analysis. The behavior of the subjects of various events shows abnormal characteristics at different stages: on the one hand, it is abnormal compared with the behavior of most ordinary people; on the other hand, it is abnormal compared with the subject's own daily behavior. The data of the target subject in virtual and physical space are analyzed, including basic information, communication behavior, network behavior, economic behavior, consumption traces, accommodation traces, and the like. As shown in fig. 6, this embodiment analyzes the behavior habits of the target subject, compares the subject's actual behavior with its own daily behavior and with the behavior of ordinary people, and mines the differences. A deep Bayesian network then makes a comprehensive judgment, identifies abnormal behavior, and supports the perception of abnormal events.
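The two comparisons described above, against the subject's own history and against peers, can be sketched with simple z-scores. This is only one possible way to derive abnormality evidence before it is passed to the Bayesian network; the z-score choice, the threshold of 2, and the spending figures are assumptions, not part of the patent.

```python
from statistics import mean, pstdev
from typing import Sequence


def z_score(value: float, reference: Sequence[float]) -> float:
    """Standardized deviation of value from a reference sample (0 if no spread)."""
    spread = pstdev(reference)
    return 0.0 if spread == 0 else (value - mean(reference)) / spread


# Hypothetical daily canteen spending
own_history = [10.0, 12.0, 11.0, 13.0, 9.0]   # the subject's recent days
peers_today = [8.0, 15.0, 12.0, 10.0, 20.0]   # other students today
today = 2.0                                   # the subject's spending today

z_self = z_score(today, own_history)   # abnormal relative to own habits?
z_peer = z_score(today, peers_today)   # abnormal relative to ordinary people?

THRESHOLD = 2.0  # assumed cut-off
abnormal = abs(z_self) > THRESHOLD and abs(z_peer) > THRESHOLD
```

A boolean such as `abnormal` could then serve as one observed evidence node in the network.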

In constructing the deep Bayesian network rule model, attention is focused on several kinds of events that occur with high probability and have serious adverse effects, such as public safety and health anomalies, campus deception events, and mental health anomalies. Prediction and early warning analysis is carried out with a deep Bayesian network. The basic principle is as follows: given the prior probability and the form of the conditional probability density, the conditional probability density function is estimated by statistical learning from samples to address the uncertainty of the various event risks, and Bayes' rule is then used to convert it into the posterior probability.

A deep Bayesian network describes the probabilistic relationships among pieces of uncertain knowledge. It combines classical probability theory with graph theory, so it has both the solid mathematical foundation of probability theory and the intuitive visual expression of graph theory. Once the state of any node in the network is determined, the network can use Bayes' rule to reason forward or backward and obtain the posterior probability of any other node. This is the key mechanism by which the deep Bayesian network supports a prediction and early warning system.
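The reasoning mechanism described above can be restated compactly in standard Bayesian network notation. The symbols are introduced here for illustration and do not appear in the patent: $X_1,\dots,X_n$ are the network nodes, $\mathrm{Pa}(X_i)$ is the parent set of $X_i$, $R$ is the risk node, and $E = e$ is the observed evidence.

$$
P(X_1,\dots,X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr),
\qquad
P(R = r \mid E = e) = \frac{P(E = e \mid R = r)\,P(R = r)}{\sum_{r'} P(E = e \mid R = r')\,P(R = r')}
$$

The first identity is the graph-based factorization of the joint distribution; the second is Bayes' rule, which converts the prior $P(R)$ and the conditional probabilities into the posterior risk probability.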

Constructing the prediction and early warning model based on the deep Bayesian network comprises four steps, as shown in fig. 7. (1) Based on the feature library of information elements and behavior elements of the event, identify the association relationships among the event elements and construct the structure of the deep Bayesian network. (2) Combine historical sample data with expert opinions to determine the prior probabilities of the network nodes, that is, the initial evidence of the risk probability. (3) Input the sample data and the initial evidence into the network structure model, and infer the conditional probability distributions of the non-root nodes with a parameter learning algorithm. Because events occur dynamically and with uncertainty, the sample data often contain hidden variables that cannot be observed; this embodiment therefore performs parameter learning with the expectation-maximization (EM) algorithm, an iterative algorithm that handles samples with missing values, so that the model parameters approach the maximum likelihood estimate over multiple iterations and the conditional probability distributions are finally obtained. (4) Based on Bayes' rule, convert the prior probabilities and conditional probabilities into the posterior probability, that is, the risk probability of the target event in the model.
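Step (3) can be illustrated with a small EM sketch for a network in which a binary risk node is the parent of several binary feature nodes and the risk value is missing for some samples. The structure, the initial values, the add-one smoothing (which keeps estimates strictly between 0 and 1 and makes this a smoothed rather than a pure maximum likelihood estimate), and the data are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Record = Tuple[Optional[int], Sequence[int]]  # (risk label or None, binary features)


def em(records: List[Record], n_features: int, iters: int = 50):
    """Estimate pi = P(risk=1) and theta[z][j] = P(x_j=1 | risk=z)."""
    pi = 0.5
    theta = [[0.4] * n_features, [0.6] * n_features]  # asymmetric start
    for _ in range(iters):
        # E-step: expected risk value for each record (observed labels are kept).
        resp = []
        for z, x in records:
            if z is not None:
                resp.append(float(z))
                continue
            like1, like0 = pi, 1.0 - pi
            for j, xj in enumerate(x):
                like1 *= theta[1][j] if xj else 1.0 - theta[1][j]
                like0 *= theta[0][j] if xj else 1.0 - theta[0][j]
            resp.append(like1 / (like1 + like0))
        # M-step: re-estimate parameters from expected counts, with add-one smoothing.
        w1 = sum(resp)
        w0 = len(records) - w1
        pi = (w1 + 1.0) / (len(records) + 2.0)
        for j in range(n_features):
            s1 = sum(r * x[j] for r, (_, x) in zip(resp, records))
            s0 = sum((1.0 - r) * x[j] for r, (_, x) in zip(resp, records))
            theta[1][j] = (s1 + 1.0) / (w1 + 2.0)
            theta[0][j] = (s0 + 1.0) / (w0 + 2.0)
    return pi, theta


# Hypothetical samples: four labeled by experts, four with the risk value missing
records: List[Record] = [
    (1, (1, 1)), (1, (1, 0)), (0, (0, 0)), (0, (0, 1)),
    (None, (1, 1)), (None, (0, 0)), (None, (1, 1)), (None, (0, 0)),
]
pi, theta = em(records, n_features=2)
```

The learned `theta` tables play the role of the conditional probability distributions of the non-root nodes, and `pi` plays the role of the root node's probability.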

Using the prediction and early warning model based on the deep Bayesian network, the abnormality early warning module for the campus scene displays students who may be abnormal according to the big-data judgment of the background model, and presents the key factors causing the abnormality through a graph model. This helps education supervisors manage students in a timely and effective manner. The module covers three categories, namely public safety and health abnormality, psychological health abnormality and event abnormality, and correspondingly performs public safety and health early warning, psychological health early warning and campus fraud event early warning.

Example 1 Public safety and health early warning

1.1 technical route

Traditional infectious disease outbreak risk prediction mainly comprises four aspects: (1) selecting the infection types and regions of interest; (2) selecting the pathological, environmental and climatic factors related to the onset of the infectious disease; (3) selecting a suitable model to build an outbreak risk evaluation model; and (4) predicting the epidemic probability of the infectious disease under various conditions and verifying the accuracy of the model. This embodiment adapts that procedure; the specific technical route is shown in fig. 8.

Outbreak levels are graded mainly with the moving percentile method, and the selected risk factors mainly comprise meteorological, economic and population density factors. Building the Bayesian model comprises four steps: data discretization, Bayesian structure learning, parameter learning and network verification. When the verification result is unsatisfactory, the procedure returns to structure learning and the Bayesian network structure is rebuilt. Finally, an uncertainty analysis is carried out on the methods used, covering the uncertainty of data processing, of panel data cluster analysis, of the moving percentile method in grading outbreak levels, and of the process of building the Bayesian-network-based early warning model.

1.2 clustering algorithm based on spatio-temporal panel model

Panel data, also called time-series cross-section data, refers to sample data obtained by observing multiple cross-sections repeatedly along a time series. Panel data therefore carry both time-series and cross-sectional characteristics, that is, characteristics in both the temporal and spatial dimensions.

A general linear panel data regression model is:

$$y_{it} = X_{it}\beta + \mu_i + \varepsilon_{it}$$

where $i \in [1, 2, \ldots, N]$ indexes the N different spatial individuals, $t \in [1, 2, \ldots, T]$ indexes time, $y_{it}$ is the observed value of the dependent variable, $X_{it}$ is a K-dimensional row vector of explanatory variables, $\beta$ is a K-dimensional column vector of coefficients, $\mu_i$ is the individual effect of the spatial unit, and $\varepsilon_{it}$ is a random error term.

If a phenomenon or attribute of one spatial unit is highly similar to that of another spatial unit, the two units are spatially correlated. According to the number of indexes, spatial panel data are divided into single-index and multi-index spatial panel data. Single-index panel data are represented by a two-dimensional table or matrix:

$$X = \big(X_i(t)\big), \quad i = 1, \ldots, N, \quad t = 1, \ldots, T$$

where N is the total number of samples, X is the characteristic index of each sample, T is the time length, and $X_i(t)$ is the index value of the i-th sample at time t.

Because real situations are complex, the object of study in practice is usually multi-index panel data. Its structure is more complex than that of traditional panel data: its temporal and spatial characteristics are usually represented by a three-dimensional table and can sometimes be written in matrix form.

Let the overall sample X comprise N samples, each with P characteristic indexes, and let T be the time length. The multi-index panel sample X then consists of the values $X_{ij}(t)$, with $1 \le i \le N$, $1 \le j \le P$ and $1 \le t \le T$.

The overall sample X actually contains data in three dimensions: space (the N samples), time and multiple indexes. It can be reduced along the spatial dimension, that is, expressed as a group of spatial samples by unfolding the three-dimensional table into two-dimensional tables over space: $X^S = [X_1, \ldots, X_i, \ldots, X_N]^T$. The matrix of one spatial sample $X_i$ of X has entries $X_{ij}(t)$, where $1 \le i \le N$, $1 \le j \le P$, $1 \le t \le T$, and $X_{ij}(t)$ is the value of the j-th index of the i-th sample at time t.

Along the index dimension, the sample X can be expressed as a group of indexes, that is, the three-dimensional table is unfolded into two-dimensional tables in index order: $X^V = [X^1, \ldots, X^j, \ldots, X^P]$. The matrix of one index $X^j$ of X collects the values $X_{ij}(t)$ over all samples i and times t.

Along the time dimension, the sample X can be expressed as a group of ordered samples, that is, the three-dimensional table is unfolded chronologically into two-dimensional tables:

$$X^O = [X(1), \ldots, X(t), \ldots, X(T)]$$

The matrix of one ordered sample $X(t)$ of X collects the values $X_{ij}(t)$ over all samples i and indexes j.

The main numerical characteristics of the panel data are:

(1) mean of jth index at time t:

(2) mean of jth index:

(3) variance of jth index at time t:

(4) variance of jth index:
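The four numerical characteristics listed above can be computed directly from the three-dimensional panel. The following is a minimal sketch, assuming the panel is stored as nested lists indexed `panel[i][j][t]` (sample, index, time) and assuming the population variance, which divides by the number of values; both the storage layout and the normalization are assumptions, and the function names are illustrative.

```python
from statistics import mean, pvariance


def index_mean_at(panel, j, t):
    """(1) Mean of index j at time t, taken over all samples."""
    return mean([sample[j][t] for sample in panel])


def index_mean(panel, j):
    """(2) Mean of index j, taken over all samples and all times."""
    return mean([value for sample in panel for value in sample[j]])


def index_var_at(panel, j, t):
    """(3) Variance of index j at time t (population variance, an assumption)."""
    return pvariance([sample[j][t] for sample in panel])


def index_var(panel, j):
    """(4) Variance of index j over all samples and times (population variance)."""
    return pvariance([value for sample in panel for value in sample[j]])


# Toy panel: N = 2 samples, P = 1 index, T = 2 time points.
panel = [
    [[1.0, 3.0]],
    [[3.0, 5.0]],
]
print(index_mean_at(panel, 0, 0), index_mean(panel, 0))
print(index_var_at(panel, 0, 0), index_var(panel, 0))
```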

Compared with traditional time series and cross-sectional data, spatio-temporal panel data allow the situation in a future period to be predicted more accurately and more quickly. Combined with a Bayesian network, they can improve the accuracy of prediction and early warning in domains with uncertainty.

1.3 Bayesian network-based space-time early warning algorithm

An infectious disease early warning model based on the Bayesian network is built from existing knowledge. It mainly comprises data preprocessing, construction of the Bayesian network for outbreak risk, calculation of the outbreak risk probability, and network verification. Constructing the Bayesian network is the crucial step and determines whether the early warning model succeeds. Once the network structure that best fits the actual incidence is found, the joint probability distribution of the nodes is calculated, and the outbreak risk of the infectious disease is predicted from it.

An infectious disease is usually caused not by a single factor but by many related epidemiological, economic, meteorological or environmental factors acting together. When these factors cannot all be acquired, only part of the data is available, so the factors most relevant to the outbreak and spread of the disease are identified and analyzed. Because the Bayesian model can only process categorical, discrete data, influencing factors that are continuous variables must first be discretized. The equal-width method is used: the number of intervals is specified, and the value range is divided into that many sub-intervals of equal width to obtain the discretization result.
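The equal-width discretization described above can be sketched as follows. This is an illustrative sketch only; the function name and the sample temperature values are hypothetical, and the maximum value is assigned to the last interval so that every value receives a valid interval number.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to an interval number 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: all values identical, so a single interval suffices.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum would land in interval n_bins, so clamp it to the last one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical average temperatures discretized into three levels.
temperatures = [0, 5, 10, 15, 20, 30]
levels = equal_width_bins(temperatures, 3)
print(levels)
```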

Next, the network structure is learned with an algorithm based on independence tests, which mainly comprises the following steps:

(1) Initialize the graph structure G = <V, E>, where V = {all attribute fields of the data set} is the node set, E = { } is the edge set, and the ordered sets S and R are both empty;

(2) For each node pair $(v_i, v_j)$ with $v_i, v_j \in V$ and $i \ne j$, calculate the mutual information $I(v_i, v_j)$; add the pairs whose mutual information exceeds a fixed threshold to S in descending order of $I$;

(3) Mark and remove the first node pair in S, and add the corresponding edge to the edge set E;

(4) Take the first node pair remaining in S; if no connecting path exists between its two nodes, add the pair to the edge set E, otherwise put the pair into the set R;

(5) Repeat (4) until S is empty;

(6) Mark the first node pair in R;

(7) Take out the node pair and perform a conditional independence test on it; if the two nodes are still dependent, add the pair to the edge set E;

(8) Repeat (7) until R is empty;

(9) For each edge in E, if another path besides this edge connects its two nodes, temporarily delete the edge from E and use a conditional independence test to check whether the two nodes are conditionally independent; if they are, delete the edge permanently, otherwise add it back to E.
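Steps (1) to (5) above can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the full algorithm: it uses the empirical mutual information of discrete columns, the names `mutual_information`, `connected` and `draft_structure` and the toy data are hypothetical, and the conditional independence tests of steps (6) to (9) are omitted.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def connected(edges, a, b):
    """True if an undirected path joins a and b through the given edges."""
    seen, stack = {a}, [a]
    while stack:
        node = stack.pop()
        if node == b:
            return True
        for u, v in edges:
            nxt = v if u == node else u if v == node else None
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def draft_structure(data, threshold):
    """Steps (1)-(5): rank pairs by mutual information, link unconnected pairs."""
    scored = [(mutual_information(data[a], data[b]), a, b)
              for a, b in combinations(sorted(data), 2)]
    S = sorted((s for s in scored if s[0] > threshold),
               key=lambda s: s[0], reverse=True)
    E, R = [], []
    for _, a, b in S:
        # Pairs that are already joined by a path are deferred to R for later tests.
        (R if connected(E, a, b) else E).append((a, b))
    return E, R


# Hypothetical discretized columns: B copies A, C is weakly related to both.
data = {
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 0, 0, 1, 1, 1, 1],
    "C": [0, 0, 0, 1, 1, 1, 1, 0],
}
edges, deferred = draft_structure(data, threshold=0.05)
print(edges, deferred)
```

The pairs left in `deferred` correspond to the set R, which the remaining steps examine with conditional independence tests.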

Friedman proved theoretically that the learning algorithm based on independence tests preserves the semantic properties of the network, and the algorithm has achieved good results in practical applications. As shown in fig. 9, a Bayesian network is a graphical structure in which each variable is a node and carries information expressed by one or more probability distributions. A variable with no arcs attached does not depend on any other variable; a variable with a parent or child node has an associated probability distribution that captures the dependency.

1.4 infectious disease outbreak risk probability estimation

Once the structure of the Bayesian-network-based early warning model has been constructed, the next step is to calculate the conditional probability table of each node in the network structure. This example mainly uses the Bayesian method to learn the parameters of the Bayesian network. The method assumes that all variables in the data set are discrete with no missing values and that the nodes in the network are independent of each other. It mainly comprises the following steps:

(1) First, define a variable set N and a data set D, where N contains n variables $X_1, \ldots, X_n$ and each variable $X_i$ has $r_i$ possible values, i.e. $v_{i1}, \ldots, v_{ir_i}$.

The data set D has m records and records the infectious disease outbreak risk levels; each record in D contains the values of all variables in N. A Bayesian network structure $B_G$ is defined that contains all the variables in N.

(2) In the structure $B_G$, each node $X_i$ has a set of parent nodes $\pi_i$. Let $w_{ij}$ denote the j-th ($j = 1, 2, \ldots, q_i$) value combination of $\pi_i$, and let $N_{ijk}$ denote the number of records in D in which variable $X_i$ takes the value $v_{ik}$ while its parent set $\pi_i$ takes the value $w_{ij}$. Then

$$N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$$

(3) Define the network conditional probability $\theta_{ijk}$ as the conditional probability $P(X_i = v_{ik} \mid \pi_i = w_{ij})$, that is, the probability that $X_i$ takes the value $v_{ik}$, $k \in [1, r_i]$, when its parent set $\pi_i$ takes the value $w_{ij}$.

(4) Given the data set D and the Bayesian network structure $B_G$, the expected value of $\theta_{ijk}$ is calculated as:

The variance of $\theta_{ijk}$ is calculated as:
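One common closed form for these two quantities arises when each conditional distribution is given a uniform Dirichlet prior, so that the posterior of $\theta_{ij\cdot}$ is Dirichlet with parameters $N_{ijk} + 1$. The uniform prior is an assumption made here for illustration. Under it, $E[\theta_{ijk} \mid D, B_G] = (N_{ijk} + 1) / (N_{ij} + r_i)$ and $Var[\theta_{ijk} \mid D, B_G] = (N_{ijk} + 1)(N_{ij} + r_i - N_{ijk} - 1) / ((N_{ij} + r_i)^2 (N_{ij} + r_i + 1))$. A minimal sketch with a hypothetical function name and hypothetical counts:

```python
def theta_posterior(n_ijk, n_ij, r_i):
    """Posterior mean and variance of theta_ijk.

    Assumes a uniform Dirichlet prior (an assumption of this sketch), so the
    posterior is Dirichlet with parameter n_ijk + 1 for value k and total
    concentration n_ij + r_i.
    """
    alpha_k = n_ijk + 1
    alpha_0 = n_ij + r_i
    mean = alpha_k / alpha_0
    var = alpha_k * (alpha_0 - alpha_k) / (alpha_0 ** 2 * (alpha_0 + 1))
    return mean, var


# Hypothetical counts: X_i has r_i = 2 risk levels; with the parents in state
# w_ij there are N_ij = 8 records, of which N_ijk = 3 have X_i = v_ik.
mean, var = theta_posterior(n_ijk=3, n_ij=8, r_i=2)
print(mean, var)
```

With no records at all (`n_ij = 0`) the estimate falls back to `1 / r_i`, which reflects the uniform prior.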

In parameter learning, it is usually necessary to calculate $P(N_1 \mid N_2)$ to infer the probability of an event, where $N_1$ and $N_2$ are two different sets of variables: $N_1$ represents the infectious disease outbreak risk level, and $N_2$ represents the environmental, climate and economic factor variables associated with the outbreak. In other words, the task is to calculate the probability of each outbreak risk level given the values of the associated factor variables. If $N_2$ is known, the expected value $E[P(N_1 \mid N_2)]$ depends only on the likelihood of $N_1$. Given the data set D and the Bayesian network structure $B_G$, $E[P(N_1 \mid N_2)]$ is calculated as follows:

$$E[P(N_1 \mid N_2) \mid D, B_G] = P(N_1 \mid N_2, D, B_G)$$

where $P(N_1 \mid N_2, D, B_G)$ can be calculated with Bayes' formula and the iterated sum-product formula of the Bayesian network. The same method also yields a probability estimate for every node (variable) in the network, and each estimate is the expected value of the corresponding probability.

1.5 introduction of related data

(1) Etiology indexes: data such as the virus detection rate and the severe-case and death rates, which generally need to be provided by professional organizations.

(2) Demographic indexes: the population density of the susceptible population (total number of susceptible people / area), which can be adjusted by region according to the population flow of the specific area.

(3) Meteorological indexes: indexes such as sunshine days, temperature difference, average temperature and average wind speed. The data mainly come from the China Meteorological Data Sharing Service Network and are obtained by inverse distance weighted interpolation from the data of 756 stations nationwide.

(4) Economic indexes: the economy reflects the development of a region and also affects the prevalence and spread of disease to some extent. This example mainly considers the urbanization level (urban population / total population) as the economic index; the data come from the Chinese economic statistical database.

1.6 spatial aggregative predictor indices

The incidence of hand-foot-and-mouth disease varies by region and by month, so spatial clustering detection is required. Considering both the incidence rate S and the severe-case rate Q, this embodiment applies the clustering method for multi-index spatial panel data, carried out in the SPSS analysis software, and takes the following three kinds of information into account:

(1) The incidence and severe-case rate data themselves, that is, the actual condition of hand-foot-and-mouth disease (level indexes).

(2) The change of the incidence and severe-case rates over time, that is, the increment indexes.

(3) The rate of change of the increments of the incidence and severe-case rates, that is, the increment change rate indexes.

The time series of the level indexes, increment indexes and increment change rate indexes of the incidence and severe-case rates are considered together. The main formulas are as follows:

single level indicators, i.e. the data itself S and Q, i.e.:

incremental indicators, namely:

the incremental rate of change indicator, i.e.:

The Euclidean distance between regions is then calculated on these indexes and hierarchical clustering is performed to obtain areas with similar risk levels; the disease risk level of each area is calculated from the meteorological indexes and population flow conditions.
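The distance that drives the clustering can be sketched as follows. Two assumptions are made purely for illustration: the increment index is taken as the first difference of a series, and the increment change rate index as the difference of successive increments. The region names and rate values are hypothetical, and the hierarchical clustering step itself (performed in SPSS in this embodiment) is not reproduced.

```python
from math import dist


def increments(series):
    """First difference of a time series (assumed form of the increment index)."""
    return [b - a for a, b in zip(series, series[1:])]


def features(S, Q):
    """Concatenate level, increment and increment-change series for S and Q."""
    out = []
    for series in (S, Q):
        d1 = increments(series)  # increment index
        d2 = increments(d1)      # change of the increment (assumed form)
        out += list(series) + d1 + d2
    return out


def region_distance(a, b):
    """Euclidean distance between two regions, each given as a pair (S, Q)."""
    return dist(features(*a), features(*b))


# Hypothetical monthly incidence S and severe-case rate Q for three regions.
regions = {
    "A": ([1.0, 2.0, 4.0], [0.1, 0.1, 0.2]),
    "B": ([1.0, 2.0, 4.0], [0.1, 0.1, 0.2]),
    "C": ([5.0, 5.0, 5.0], [0.3, 0.3, 0.3]),
}
print(region_distance(regions["A"], regions["B"]))
print(region_distance(regions["A"], regions["C"]))
```

Regions separated by a small distance would be merged first by the hierarchical clustering.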

Example 2 Mental health early warning

An online questionnaire can be used to screen students for depression: students' self-assessment data can be collected online with the Patient Health Questionnaire depression scale (PHQ-9). However, this approach is time-consuming and labor-intensive, lacks timeliness and reliability, and yields data of limited quality and quantity. Research by psychologists shows that real-time depression screening using data from social media such as WeChat and microblogs is feasible and accurate.

This example therefore takes the characteristics of students into account and uses social media data to construct student word clouds. It then combines campus card data, internet data, mobile terminal data, access records, consumption records, video monitoring, GPS (global positioning system) data and campus Wi-Fi access logs to obtain spatio-temporal information and analyze the students' behavior trajectories. Student portraits and information and behavior elements are constructed from the word clouds and behavior trajectories.

Finally, early warning is performed with the deep Bayesian model using the students' social network, word cloud, and information and behavior elements. Students whose early warning value exceeds a threshold are displayed as objects of attention for the school, so that psychological or behavioral abnormalities can be found in advance and counseling and preventive work can be carried out.

2.1 building word clouds

1) Emotion dictionary construction

Based on an existing, relatively complete general emotion dictionary, an emotion dictionary related to depression is constructed and divided into a positive dictionary and a negative dictionary.

Posts from the depression super-topic on the microblog platform and their contents are crawled as the candidate negative dictionary, and randomly crawled microblog contents serve as the candidate positive dictionary. Both candidate dictionaries are then cleaned while retaining emoticon characters, which improves the ability to analyze microblog emoticons and internet buzzwords. The cleaned data are compared with the entries of the emotion dictionary using the TF-IDF algorithm, and words with high similarity are added to the corresponding dictionary.

For the text part, the student's registered basic information is first retrieved, and the student's microblog and WeChat friend circle (Moments) contents are crawled. The data are then preprocessed: microblog topics, friend circle advertisements and links are removed, and pictures are placed in a picture library. Finally, the microblog and friend circle text is segmented with a word segmentation technique from natural language processing and compared with the emotion dictionary using the TF-IDF algorithm, in order to refine the negative and positive dictionaries.
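The TF-IDF comparison mentioned above can be sketched as a cosine similarity between TF-IDF vectors. This is a minimal sketch under stated assumptions: the texts are already segmented into tokens, a smoothed inverse document frequency is used (one of several common variants), and the function names and English sample tokens are hypothetical stand-ins for the crawled content.

```python
from collections import Counter
from math import log, sqrt


def tfidf_vectors(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Term frequency times smoothed inverse document frequency.
            term: (count / len(doc)) * (log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical tokenized posts: two negative in tone, one positive.
docs = [
    ["sad", "tired", "alone"],
    ["sad", "alone", "night"],
    ["happy", "sunny", "picnic"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Candidate words or texts whose similarity to dictionary entries exceeds a chosen cut-off would then be added to the corresponding dictionary; the cut-off value is not specified here.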

2) Text sentiment analysis based on LSTM

This embodiment uses the open-source Word2Vec framework to represent words as high-dimensional vectors, so that words with similar meanings are placed close together. Words with similar meanings can then be found by Euclidean distance or cosine similarity, which addresses the problem of one meaning being expressed by several different words.

The word vectors of a segmented sentence are combined into a matrix. A recurrent neural network (RNN) or a recursive neural network encodes this matrix input into a lower-dimensional one-dimensional vector while retaining most of the useful information, and text emotion analysis is carried out in combination with the emotion dictionary.

3) Image emotion analysis

The data in the picture library are labeled manually as negative or positive, and the image classification model VGGNet from computer vision is then trained on them to obtain a picture emotion classification model.

In the embodiment, an emotion dictionary and a picture library are divided into a training set and a testing set according to the proportion of 7.

Based on the method, sentiment analysis is carried out on the student friend circle and the microblog content by combining the sentiment dictionary and the picture library, and word cloud is constructed.

4) Emotion value calculation method

For the word cloud of the student, the example calculates the emotion values of a friend circle and a microblog of the student by using a weighted average method:

wherein: n is a radical of _p 、N _n Number of words, wp, representing positive and negative respectively _i 、wp _j Weights representing positive and negative words, M _p 、M _n Number of words, wp, representing positive and negative respectively _a 、wp _b Representing the weight of the active and passive vocabulary, respectively.

2.2 student trajectories

The movement trajectories of students or teaching staff are analyzed from their access records, consumption records and video monitoring, together with data such as mobile terminal GPS, campus Wi-Fi access logs and campus cards. Trajectory similarity is calculated from the Hausdorff distance; in general, the higher the similarity, the closer the relationship. The trajectory sequences of users are compared pairwise to obtain an intimacy value between users, and density clustering with an intimacy threshold of 0.4 then groups users into several user groups with social relations. The groups are labeled and a student digital portrait is constructed that represents the students' behavior patterns, such as behavior habits, lifestyle, consumption level, network behavior and learning state.

The similarity measure between trajectories is the basis of trajectory data mining and querying. For any two trajectories $T_a$ and $T_b$, let $Dist(T_a, T_b)$ be the distance between them. A distance of 0 means the two trajectories are identical, and a larger distance means lower similarity. The closest-pair distance (CPD) measures the distance between two trajectories as the minimum distance between their position points. The CPD between $T_a$ and $T_b$ is calculated as follows:

$$CPD(T_a, T_b) = \min_{loc \in T_a,\ loc' \in T_b} dist(loc, loc')$$

where $dist(loc, loc')$ is the Euclidean distance between two position points $loc$ and $loc'$.
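Both trajectory distances named in this section can be computed by brute force over the position points. The following is a minimal sketch, assuming each trajectory is a list of planar (x, y) points and ignoring timestamps; the sample coordinates are hypothetical.

```python
from math import dist


def cpd(ta, tb):
    """Closest-pair distance: minimum point-to-point distance between two tracks."""
    return min(dist(p, q) for p in ta for q in tb)


def hausdorff(ta, tb):
    """Symmetric Hausdorff distance between two tracks."""
    def directed(a, b):
        # Largest distance from a point of a to its nearest point in b.
        return max(min(dist(p, q) for q in b) for p in a)
    return max(directed(ta, tb), directed(tb, ta))


# Two hypothetical tracks that start side by side and then diverge.
track_a = [(0, 0), (1, 0), (2, 0)]
track_b = [(0, 1), (1, 1), (2, 3)]
print(cpd(track_a, track_b), hausdorff(track_a, track_b))
```

CPD reports how close the two tracks ever come, whereas the Hausdorff distance is dominated by the point where they are furthest apart, so tracks that merely cross have a small CPD but can still have a large Hausdorff distance.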

2.3 social networking

Students are the nodes of the network. A connection is established between two nodes when the trajectory similarity between the two students exceeds 0.5, and the weight of the connection is that trajectory similarity. The resulting social network of all students is shown in fig. 10 (a). Each node represents a student, the shade and color of a node represent the student's class, and the size of a node reflects its degree, that is, the number of nodes connected to it. Note that the figure shows the topology of the student social network, not a mapping of the student vectors onto a two-dimensional plane. Most students are clearly distributed in clusters by class, but there are also a number of isolated students. The distribution of node sizes shows large differences in individual sociability: there are large nodes at the centers of clusters and small isolated nodes that are hard to find. Fig. 10 (b) shows the social network distinguished by student gender. The social circles of male and female students are largely separate, and apart from campus couples, male and female students mostly cluster separately.
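The network construction described above can be sketched with a plain adjacency dictionary. The 0.5 threshold comes from the text; the function names, student identifiers and similarity values are hypothetical.

```python
def build_social_network(similarity, threshold=0.5):
    """similarity: {(a, b): value}. Returns {node: {neighbour: weight}}."""
    graph = {}
    for (a, b), s in similarity.items():
        # Register both students even if no connection is created for them.
        graph.setdefault(a, {})
        graph.setdefault(b, {})
        if s > threshold:
            graph[a][b] = s
            graph[b][a] = s
    return graph


def degrees(graph):
    """Degree of each node, i.e. the number of students connected to it."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


# Hypothetical pairwise trajectory similarities between four students.
similarity = {
    ("s1", "s2"): 0.8,
    ("s1", "s3"): 0.6,
    ("s2", "s3"): 0.3,
    ("s1", "s4"): 0.1,
    ("s2", "s4"): 0.2,
    ("s3", "s4"): 0.0,
}
network = build_social_network(similarity)
degree = degrees(network)
isolated = [node for node, d in degree.items() if d == 0]
print(degree, isolated)
```

Nodes with degree zero correspond to the isolated students visible in the figure.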

Combined with common sense, fig. 10 (a) and fig. 10 (b) indirectly verify the accuracy of the student vector calculation. The student social network reveals how isolated each student is, and in this example the calculated degree of isolation feeds into the mental health early warning based on the deep Bayesian network.

2.4 mental health early warning method

A deep Bayesian network is established with reference to the risk probability estimation method of section 1.4. Different weights are set for the word cloud emotion, social network and user portrait constructed for each student, and these weighted quantities are used as input features to train the deep Bayesian network. The mental health early warning value lies between 0 and 1, and a warning is issued when it exceeds 0.6.

Example 3 Campus fraud early warning

The behavior elements and information elements of past campus fraud events are obtained and analyzed with the methods of sections 2.1 to 2.3. Combined with student information, the personality, consumption, behavior habits, learning state and psychological condition of the students involved are analyzed, and a user portrait of students involved in fraud is constructed.

A deep Bayesian network is constructed with reference to the risk probability estimation method of section 1.4. Feature vectors are built from the student user portraits and the features of their behavior and information elements, and a campus fraud early warning model is obtained by training. An alarm is raised when the risk value exceeds 0.5; the students concerned receive corresponding attention and, if necessary, psychological counseling, home visits or disciplinary measures.

The foregoing description of the embodiments is provided so that a person of ordinary skill in the art can understand and apply the invention. Those skilled in the art can readily make various modifications to the above embodiments and apply the general principles described here to other embodiments without inventive effort. Therefore, the invention is not limited to the above embodiments, and improvements and modifications made by those skilled in the art based on this disclosure fall within the protection scope of the invention.

Claims

1. A public digital life scene rule model prediction early warning method based on a deep Bayesian network comprises the following steps:

(2) Layering the database, and constructing a subject database of five basic elements, namely people, enterprises, places, matters and things;

the batch-flow type big data real-time processing technology comprises five functional modules of data acquisition, data loading, a data bus, data analysis and business service, wherein the data acquisition module is responsible for accessing streaming data in real time in a mode of internet of things acquisition and application end acquisition; the data loading module is responsible for loading historical offline data and access stream data from the service system; the data bus module is responsible for putting various data into an appointed channel for transmission according to a uniform format; the data analysis module is responsible for extracting and processing real-time data and pushing product data; when a real-time query request sent by a service system is received, the data analysis module can utilize an internal analysis processing model to calculate a corresponding index on a complete big data set in real time and judge the index, and the result is fed back to the service system through the service module;

the population attributes are used for describing the basic characteristic information of the user social level and helping each key life application scene to know the basic situation of the user; the life attributes are used for knowing the life conditions of the user, including the life activity range and the travel mode, so that accurate services can be provided for the user in the following process; the social attributes are used for describing a social graph, family members, a friend circle and interests of the user, the information usually represents a social relationship network of the user, and the user can be known as completely as possible through the social information so as to provide personalized services for the user; the consumption characteristics are used for describing main consumption habits and consumption preferences of the users, mining potential users of related consumption services, recommending related products and services according to the consumption characteristics of the users and improving the recommendation conversion rate; the psychological attributes are used for paying attention to the psychological condition information of the user, acquiring the psychological condition of the user through anonymous questionnaire survey or a similar user clustering mode, and providing corresponding psychological service or paying attention to the psychological condition;

aiming at non-video data and video data in multi-source heterogeneous data, a user tag construction mode based on original data mining and a user tag construction mode based on a video structuring technology are respectively adopted; for non-video data, five methods of natural language processing, user intention identification, association rules, cluster analysis and track similarity are fused in a user tag construction mode based on original data mining; for the condition that specific dimension data of a specific user is missing, the completeness of a user label is ensured by using a collaborative filtering algorithm through the analysis completion characteristics of other similar users; for video data, a user label construction mode based on a video structuring technology integrates three methods of target detection, openCV + CNN emotion recognition and GaitSet gait recognition;

the natural language processing process adopts the TF-IDF algorithm to calculate the similarity between texts, then adopts a fastText classifier to classify the texts according to the similarity, and finally adopts Word2Vec to extract the word vectors in the texts and uses an LSTM to fuse the word vectors into sentence vectors, which are input into a pre-trained recurrent neural network or recursive neural network, thereby predicting and analyzing the emotion expressed by similar texts;

the user intention identification is to judge the behavior intention of the user according to the search record of the user or the analyzed user label, particularly, a TF-IDF algorithm is adopted to carry out vectorization on data in the implementation process, the characteristic selection is carried out by utilizing a word frequency, chi-square and mutual information mode, and finally, a pre-trained decision tree CART, a random forest containing a plurality of decision trees, a logistic regression or Bayesian model is adopted to judge the behavior intention of the user;

the association rule is used for discovering associations among seemingly irregular data so as to find the regularities and development trends in the data, and an Apriori algorithm or an FP-Growth algorithm is adopted in the specific implementation; the cluster analysis is used for grouping similar data into one class such that, in principle, the similarity within each class is maximal, and clustering, as an unsupervised algorithm, is suitable for analyzing high-dimensional data; the track similarity is used for analyzing behavior tracks in the time domain and the space domain, mining the user's daily behavior rules and preferences from historical behavior tracks, and labeling them;

the OpenCV + CNN emotion recognition is used for detecting the facial expression state in a video image: the face is first detected and located, facial expression features are then extracted, and finally a pre-trained convolutional neural network (CNN) classifies the facial expression; the GaitSet gait recognition is used for detecting the walking posture of a person in a video image: the images are first input into a convolutional neural network (CNN) to extract features, a multi-feature pooling scheme then aggregates the features of the images into one feature vector, and Horizontal Pyramid Pooling is adopted to make the features more discriminative; a two-stream method is adopted in prediction, comprising two channels: an RGB image channel for modeling spatial information and an optical flow channel on which an RNN models temporal information; the two channels are jointly trained and their information is fused, and finally the features are input into the trained model to realize gait recognition;

(6) Aiming at a specific application scene, training a deep Bayesian network by using user digital portrait information to obtain an event risk prediction model under the scene, and then predicting and early warning risks existing in a target event by using the model, specifically:

firstly, analyzing the user digital portrait information in the specific application scene, acquiring the information elements and behavior elements related to the event, identifying the association relations among the event elements, and establishing a feature sample library based on the information elements and behavior elements of the event; then combining the feature samples with expert opinions to determine the prior probabilities of the network nodes, namely the initial evidence of the risk probability; inputting the feature samples and the initial evidence into the network structure, and inferring the conditional probability distributions of the non-root nodes in the network by using the EM (expectation-maximization) algorithm; and finally, based on Bayes' rule, converting the prior probability and the conditional probability into a posterior probability, namely the probability prediction result of the risk of the target event.

2. The public digital life scene rule model prediction early warning method as claimed in claim 1, wherein: the multi-source heterogeneous data in the step (1) comprises structured data and unstructured data, the structured data comprises basic data including basic information such as houses and addresses and extended data including vehicle access information and internet of things perception information, and the unstructured data comprises life event information acquired by personnel and video monitoring data, audio data and image data acquired by equipment such as a camera.