CN112582074B - Bi-LSTM and TF-IDF based COVID-19 epidemic prediction and analysis method


Info

Publication number: CN112582074B
Application number: CN202011236359.0A
Authority: CN (China)
Prior art keywords: information, patient, document, item, name
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN112582074A
Inventors: 刘晓夏, 吕颖达
Original assignee: Jilin University
Current assignee: Jilin University
Events: application filed by Jilin University; priority to CN202011236359.0A; publication of CN112582074A; application granted; publication of CN112582074B

Classifications

    • G16H 50/80: ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics, e.g. flu
    • G06F 16/215: Improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 18/2411: Classification techniques relating to the classification model, based on the proximity to a decision surface, e.g. support vector machines
    • G06F 40/295: Named entity recognition
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks

Abstract

The invention provides a COVID-19 epidemic prediction and analysis method based on Bi-LSTM and TF-IDF, which comprises the following steps: P1, collecting patient information according to the category of the target area; P2, identifying and extracting the key information of the patients with a Bi-LSTM model; P3, calculating weight coefficients for the extracted information with a TF-IDF model; P4, classifying the information with a multi-class SVM; and P5, screening the information to form a patient path map or an epidemic spread relation tree and to predict the epidemic origin or patient zero. The invention combines artificial intelligence and natural language processing, adopts a processing strategy suited to the conditions of the region, dynamically establishes a prediction model, and repeatedly adjusts the weight coefficients of the information using a machine-learning classification model and the actual situation, so that the prediction process better matches objective scientific laws and the real state of the epidemic. The method thus analyzes the source of the epidemic in the target region and, while containing its spread, helps to prevent outbreaks or recurrences at the source.

Description

Bi-LSTM and TF-IDF based COVID-19 epidemic prediction and analysis method
Technical Field
The invention relates to the fields of natural language processing (NLP) and deep learning, and in particular to a COVID-19 epidemic prediction and analysis method based on Bi-LSTM and TF-IDF.
Background Art
Because the number of patients is large and the specific itinerary of each patient over a period of time must be compiled, large volumes of text have to be extracted and classified with computer technologies such as natural language processing and deep learning.
Natural language processing (NLP) is an important branch of artificial intelligence (AI). It sits at the intersection of linguistics, computer science and artificial intelligence and aims to enable communication between humans and computers through natural language, i.e. to build computer systems that can understand, process and analyze natural language. Modern natural language processing often faces huge amounts of text, so the corresponding functions are usually realized with artificial neural networks from machine learning and deep learning. The natural language processing technique mainly used in the present invention is information extraction.
Information extraction means extracting specific events or facts from natural language text so that massive content can be classified, extracted and reconstructed automatically. Text data is composed of specific units such as sentences, paragraphs and chapters, while text information is composed of smaller units such as words, phrases, sentences and paragraphs, or combinations of them. Extracting noun phrases, person names or place names from text data is therefore text information extraction, and the extracted information can be of many kinds, for example the time, place and key persons of a news item, or the product name, development time and performance indicators in a technical document.
The first step in information extraction is to detect the entities in the text, i.e. named entity recognition (NER). A named entity generally refers to an entity with a specific meaning or strong referential force in the text, typically a person name, place name, organization name, time expression or proper noun. An NER system extracts these entities from unstructured input text.
Long short-term memory (LSTM) is a special kind of recurrent neural network (RNN) whose structure is designed to avoid the long-term dependency problem. Bidirectional long short-term memory networks (Bi-LSTM) are very popular and widely used in named entity recognition tasks.
The Bi-LSTM model is a bidirectional LSTM, i.e. it combines a forward LSTM with a backward LSTM. The model used here has a three-layer structure. The first layer is a representation (look-up) layer, which represents each sentence as character vectors and word vectors. The second layer is the Bi-LSTM layer: the character and word vectors are fed into it, and it outputs a score for every label of every character of the sentence; these label scores correspond to the emission probabilities of mapping each character to each label. The third layer is a conditional random field (CRF) layer. A CRF predicts the conditional distribution of a set of output random variables given a set of input random variables. Its advantage is that it uses the labels already assigned to earlier positions when labeling the current position, which suits the named entity recognition task well. The invention uses the BIO tag set: B-PER and I-PER denote the first and subsequent characters of a person name, B-LOC and I-LOC the first and subsequent characters of a place name, B-ORG and I-ORG the first and subsequent characters of an organization name, and O denotes a character that is not part of any named entity. The CRF layer takes the output of the Bi-LSTM layer, namely the emission probability matrix, together with a transition probability matrix as the parameters of the original CRF model, and finally yields the probability of each label sequence; from this the category of each word or character is obtained, and named entities are recognized and extracted. The specific process is: first, segment the text to be processed with the Bi-LSTM model; then obtain the field labels to be recognized and label the segmentation result; then extract the segments carrying the labels; finally, combine the extracted segments into the named entities of the required fields.
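To make the BIO labeling and extraction step concrete, the following minimal sketch (plain Python, purely illustrative and not part of the patent text) assembles named entities from a character sequence that has already been tagged with the BIO scheme described above:

# Minimal sketch: assemble named entities from characters tagged with the
# BIO scheme (B-PER/I-PER, B-LOC/I-LOC, B-ORG/I-ORG, O).
def collect_entities(chars, tags):
    """Return a list of (entity_text, entity_type) pairs."""
    entities, current, current_type = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):            # first character of a new entity
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current.append(ch)              # continuation of the same entity
        else:                               # "O" or an inconsistent tag ends the entity
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities

# Example: a short sentence tagged character by character
print(collect_entities(list("刘某去了北京"),
                       ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]))
# -> [('刘某', 'PER'), ('北京', 'LOC')]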
A support vector machine (SVM) is a generalized linear classifier that classifies data by supervised learning; its decision boundary is the maximum-margin hyperplane solved from the training samples. The basic idea of SVM learning is to find the separating hyperplane that correctly divides the training set and has the largest geometric margin. A separating hyperplane can be written as w·x + b = 0; for a linearly separable data set there are infinitely many such hyperplanes, but the one with the largest geometric margin is unique, so the data can be classified effectively. The SVM measures empirical risk with the hinge loss function and adds a regularization term to the optimization problem to control structural risk, so it is a sparse and robust classifier whose classification results are usually accurate and reliable. The standard SVM is designed for binary classification and cannot handle multi-class problems directly, but multi-class classification can be achieved by constructing several decision boundaries in an orderly way; the two common schemes are one-versus-rest (one-vs-all) and one-versus-one, and the invention realizes the multi-class SVM with the one-versus-one method. The one-versus-one SVM is a voting method: a decision boundary is built for every pair of the m classes, giving m(m-1)/2 boundaries in total, and a sample is assigned to the class that receives the most votes over all decision boundaries.
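As an illustration of the one-versus-one strategy, the following sketch uses scikit-learn's SVC, which trains one binary SVM per pair of classes and classifies by voting; the toy features and labels are assumptions, not data from the patent:

# Sketch: one-versus-one multi-class SVM, assuming scikit-learn is available.
from sklearn.svm import SVC

X = [[0.9, 1], [0.8, 1], [0.2, 0], [0.1, 0], [0.5, 1], [0.6, 0]]  # toy feature rows
y = [1, 1, 3, 3, 2, 2]                                            # e.g. risk levels 1/2/3

clf = SVC(kernel="linear", decision_function_shape="ovo")  # one-vs-one decision functions
clf.fit(X, y)
print(clf.predict([[0.7, 1]]))  # predicted level for an unlabeled sample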
TF-IDF (term frequency - inverse document frequency) is a term-weighting model in which TF is the term frequency and IDF is the inverse document frequency. TF-IDF is a weighted statistical method for information retrieval and a retrieval model widely used in practical scenarios such as search engines. It can assess how important a word is to a document in a corpus, because the importance of a word grows in proportion to the number of times it appears in the document but falls in inverse proportion to how often it appears across the corpus; TF-IDF = TF × IDF. Used in reverse, the algorithm can therefore identify words or phrases with low weight coefficients: such words appear frequently throughout the corpus and represent what the documents of the corpus have in common. The invention therefore uses the TF-IDF model for a first secondary screening and integration of the large amount of initially extracted information.
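The weighting itself is a short computation. The following sketch (plain Python, illustrative only) computes TF × IDF with a base-10 logarithm and also shows the reverse mapping w -> 1 - w that the description applies later when the model is used in reverse:

# Sketch: TF-IDF weight of a term in one document of a corpus (base-10 logarithm),
# plus the reverse scaling w -> 1 - w applied later in the description.
import math

def tf_idf(term, document, corpus):
    tf = document.count(term) / len(document)                  # term frequency
    containing = sum(1 for doc in corpus if term in doc)       # document frequency
    idf = math.log10(len(corpus) / containing) if containing else 0.0
    return tf * idf

corpus = [["hospital", "train", "hotel"], ["hotel", "market"], ["hospital", "hotel"]]
w = tf_idf("hotel", corpus[0], corpus)
print(w, 1 - w)   # a term found in every document gets weight 0, reverse-scaled to 1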
For a country or a larger area, what is wanted is the origin of the epidemic, i.e. a relatively specific place, together with a map of epidemic hot spots, so that the source and the key areas of that country or region can be found and the epidemic can be controlled and prevented more effectively. Using a Bi-LSTM + CRF model from the NLP field, the places each patient visited before onset are recorded in order in a document, forming a personal itinerary information document; the personal itinerary documents of all patients who fell ill in the same region (country, area, etc.) are then combined into a regional itinerary corpus. A TF-IDF model is applied to obtain the TF-IDF weight coefficient of each relevant place in each personal itinerary document; this coefficient is the initial weight of the place. Because the TF-IDF model is used in reverse, the initial weight coefficients are then inverted by the following rule: the original weight coefficient w is replaced by its value 1 - w under the mapping f(x) = 1 - x. In this way, the more likely a place is to be the origin of the epidemic, the larger its weight coefficient, and vice versa. A first round of adjustment of the place weights is then needed: places visited closer to the date of onset, i.e. later in the order, receive a larger weighting.
For a relatively small area, such as a city, district or county, what is wanted is not only the origin of the epidemic, i.e. a relatively specific place, but also the "patient zero" of the epidemic and an "epidemic spread tree", i.e. a dendrogram showing the person-to-person transmission of the novel coronavirus. The procedure for finding the origin is the same as for a country or larger area described above and is not repeated here. The first identified case is not necessarily the patient zero of the current transmission chain. Compared with finding the origin, finding the patient zero of the epidemic in a smaller area is a more effective strategy for rapidly controlling the epidemic and preventing its spread. Using a Bi-LSTM + CRF model from the NLP field, the people each patient contacted before onset, as well as the family members or cohabitants with whom the patient is in close contact (such as a wife/husband, parents, children or roommates), are recorded in order in a document, forming a personal relationship information document; the personal relationship documents of all patients who fell ill in the same region are then combined into a regional relationship corpus. A TF-IDF model is applied to obtain the TF-IDF weight coefficient of each relevant person in each personal relationship document, and the weight coefficients are then inverted by the rule that the original weight coefficient w is replaced by its value 1 - w under the mapping f(x) = 1 - x, so that the more likely a person is to be patient zero, the larger the corresponding weight coefficient, and vice versa. The resulting TF-IDF weight coefficient is the initial weight of each name. A first round of adjustment of the name weights is then needed: the weights of people who were in close contact with the patient near the patient's date of onset receive a corresponding weighting. An adapted multi-class SVM classifier is then applied, with the degree of close contact with the patient as the classification criterion, to divide the people in the personal relationship documents into three levels: severe, moderate and mild close contacts. The initial weight coefficients are then weighted according to the close-contact level. Next, the n names with the highest weight coefficients in the whole regional relationship corpus are marked, and the selected names and their weight coefficients are stored in a secondary integration relationship document; if duplicates occur in this process, their weights are added together and recorded as a single entry. At the same time, the information of people who are severe close contacts of the selected people is stored and updated in the personal relationship information documents.
For a country or a larger area, the adapted multi-class SVM classifier is applied to classify all the place information in the secondary itinerary document set into three types: low-risk, medium-risk and high-risk places. A low-risk place is usually a sparsely populated, open outdoor area; a medium-risk place is usually a densely populated outdoor or indoor area where a mask can be worn; a high-risk place is usually a densely populated indoor area where a mask cannot be worn, such as a restaurant or a cinema. The weight coefficients of low-, medium- and high-risk places are decreased or increased accordingly, so that the final result better matches objective laws and the actual situation. The n places with the highest weight coefficients in each personal itinerary document are then marked. The selected places and their weight coefficients are stored in secondary personal itinerary documents, one secondary document per personal itinerary document, and all secondary personal itinerary documents are stored in the secondary itinerary document set. For each secondary personal itinerary document in the set, the places recorded in the document are plotted and annotated on a map with computer graphics techniques in the order in which they appear, and a path is drawn by connecting the points in sequence. After all place information has been plotted on the map, the places that appear on more different paths are extracted and their weight coefficients are adjusted according to the number of paths passing through them: the more paths pass through a place, the more its weight coefficient is increased, and the fewer paths pass through it, the less it is increased. At this point the place weights have undergone three rounds of adjustment: the first adjusts the TF-IDF weights according to the order in which places were visited, the second adjusts them according to the risk-level rating, and the third adjusts individual places according to the number of passing paths. The places are then sorted in descending order of weight coefficient, and the place with the highest weight is the origin of the epidemic predicted by the invention; the top n places with high weight coefficients should also be investigated as a priority, making the detection, prevention and control of the epidemic more scientific and effective.
For a relatively small area, the weight coefficients stored in the secondary integration relationship document, which have already undergone a round of adjustment according to the degree of close contact, are sorted in descending order; names whose weight coefficients are equal or differ by less than a threshold ε are placed on the same layer of the dendrogram, and the connections of the dendrogram are completed by consulting the close-contact documents, yielding the epidemic spread tree.
Disclosure of Invention
The invention aims to provide a COVID-19 epidemic prediction and analysis method based on Bi-LSTM and TF-IDF, which determines the origin of an epidemic and its patient zero by analyzing the itinerary information and social relations of patients, obtains a patient path map and an epidemic spread tree, traces the origin of the epidemic and the initial virus carrier of the target country/region, and thereby suppresses the spread of the epidemic at its root, making epidemic control and prevention more scientific and effective. At the same time, the combined information about where the epidemic started makes it possible to stop a new outbreak as early as possible.
In order to solve the problems in the related art, the invention provides a COVID-19 epidemic prediction and analysis method based on Bi-LSTM and TF-IDF, which comprises the following steps:
Part_1: collect patient information (path of movement / close contacts) according to the expected result (epidemic origin / patient zero) or the category of the target area (country or larger area / relatively smaller area);
Part_2: identify and extract the collected patient information with a Bi-LSTM model;
Part_3: calculate initial weight coefficients for the extracted information with a TF-IDF model and adjust the weights accordingly;
Part_4: classify the places or names with a multi-class SVM classifier and adjust the weights according to the classification;
Part_5: screen the information, form a patient path map / epidemic spread relation tree, and predict the epidemic origin / patient zero.
A COVID-19 epidemic prediction and analysis method based on Bi-LSTM and TF-IDF uses the following data structures:
the data structure of the personal relationship information document Person is defined as follows:
Data item 1, item_1: number of the document in the corpus
Data item 2, item_2: name of the patient corresponding to the document
Data item 3, item_3: close contact information set people processed by the Bi-LSTM model
Data item 4, item_4: weight coefficient set weight corresponding to the close contacts
Data item 5, item_5: order information order of the close contacts
Data item 6, item_6: close contact level label of the close contacts
The data structure of the personal itinerary information document Route is defined as follows:
Data item 1, item_1: number of the document in the corpus
Data item 2, item_2: name of the patient corresponding to the document
Data item 3, item_3: passing site information set sites processed by the Bi-LSTM model
Data item 4, item_4: weight coefficient set weight corresponding to the passing sites
Data item 5, item_5: order information order of the passing sites
Data item 6, item_6: risk level label of the passing sites
The data structure of the secondary personal itinerary information document IntegratedRoute is defined as follows:
Data item 1, item_1: number of the document in the corpus
Data item 2, item_2: name of the patient corresponding to the document
Data item 3, item_3: selected passing site information set sites
Data item 4, item_4: weight coefficient set weight corresponding to the selected passing sites
Data item 5, item_5: order information order of the selected passing sites
The data structure of the secondary integration relationship document IntegratedPerson is defined as follows:
Data item 1, item_1: number of selected persons count
Data item 2, item_2: number set number of the personal relationship information documents of the selected persons in the corpus
Data item 3, item_3: weight coefficient set weight corresponding to the selected persons
The data structure of the place feature data set RouteFeatureSet input to the multi-class SVM classifier is defined as follows:
Data item 1, item_1: site name site
Data item 2, item_2: population density density of the area where the site is located
Data item 3, item_3: whether the site is in the open air, TF1
Data item 4, item_4: whether the site requires a mask to be worn at all times (i.e. it never needs to be taken off), TF2
Data item 5, item_5: daily average flow of people at the site, flow
Data item 6, item_6: risk level label of the site
The data structure of the relationship feature data set PersonFeatureSet input to the multi-class SVM classifier is defined as follows:
Data item 1, item_1: name of the patient to be analyzed
Data item 2, item_2: whether the contact lives with the patient, TF1
Data item 3, item_3: number of times the contact shared meals with the patient, count1
Data item 4, item_4: whether the contact had very close contact with the patient (e.g. riding in the same car or other close behavior), TF2
Data item 5, item_5: number of times the contact met the patient, count2
Data item 6, item_6: close contact level label of the contact
A COVID-19 epidemic prediction and analysis method based on Bi-LSTM and TF-IDF uses the following functions:
multi-classification SVM, defined as follows:
The specific process is as follows: the SVM from machine learning is adapted into a multi-class SVM with the one-versus-one method, i.e. several binary SVMs are combined to build the multi-class SVM. During training, a binary SVM is fitted for each pair of classes; when classifying, an unknown sample is assigned to the class with the largest value of the classification function, i.e. the most votes.
Input and aim: the manually labeled named entity (name/place) feature data set, used to train the multi-class SVM, and the remaining, unlabeled named entity (name/place) feature data set, which is to be classified by the trained multi-class SVM.
Output and result: all named entities are labeled, i.e. each has a corresponding level/grade in label.
The Bi-LSTM + CRF model, defined as follows:
The specific process is as follows: the Bi-LSTM + CRF model is an artificial neural network with a three-layer structure, consisting of a look-up layer, a Bi-LSTM layer and a CRF layer. Input of the Bi-LSTM layer: vectors of character embeddings obtained through random initialization.
Output of the Bi-LSTM layer: the predicted score of each label for each character. For example, the Bi-LSTM layer may output 1.5 (B-Person), 0.9 (I-Person), 0.1 (B-Organization), 0.08 (I-Organization), 0.05 (O).
Input of the CRF layer: the predicted score of each label.
Output of the CRF layer: the label of each unit (character).
TF-IDF model, defined as follows:
The specific process is as follows: TF-IDF weight = TF × IDF, where TF is the frequency with which a term appears in a document and IDF is the base-10 logarithm of the total number of documents divided by the number of documents containing the term.
Input: the document containing the term and the corpus containing the term.
Output: the TF-IDF weight of the term.
A COVID-19 epidemic prediction and analysis method based on Bi-LSTM and TF-IDF uses the following procedure.
The system process Task specifically comprises:
Task{
task_1: collect patient information (path of movement / close contacts) according to the expected result (epidemic origin / patient zero) or the category of the target area (country or larger area / relatively smaller area),
task_2: preprocess the massive original text information with a Bi-LSTM + CRF model,
task_3: obtain the initial weight coefficients of the names/places in the prediction model with a TF-IDF model,
task_4: grade the names/places with the multi-class SVM and adjust the weight coefficients according to the assigned grade,
task_5: screen the name/place information once according to the weight coefficients,
task_6: plot the screened place information on a map and draw each personal path in order of arrival to form an epidemic map; draw the screened relationship information into an epidemic spread tree according to the weight coefficients and the close-contact relations,
task_7: adjust the place information according to the geometric characteristics of the paths (their intersections),
task_8: the place with the highest weight coefficient is the predicted epidemic origin; the person whose name has the highest weight coefficient is the predicted patient zero,
task_i: instructions and processes reserved for the user,
}
Task_1 mainly collects the patients' itinerary information / interpersonal relationship information and provides the data for the subsequent processing and operations.
Task_2 applies the Bi-LSTM + CRF model to label and extract the key information (names/places) from the massive collected text and stores it in the corresponding documents.
Task_3 applies the TF-IDF model to obtain the term frequency TF and the inverse document frequency IDF of the different terms in their corpus, and thus the TF-IDF coefficient, which is used as the initial weight coefficient of the name/place in the prediction model.
Task_4 grades the names/places. First, according to the feature data set, part of the close contacts / passing places are manually labeled with the close-contact level label / place risk level label, and the multi-class SVM is trained with the labeled data. The remaining, unclassified names/places are then fed into the trained multi-class SVM for grading. There are three close-contact levels label with respect to the patient, namely severe, moderate and mild, and three risk levels label for places, namely low, medium and high. According to the grade assigned by the multi-class SVM, a round of adjustment is applied to the initial weight coefficients of the name/place information, i.e. the weight coefficients are increased or decreased by the corresponding proportion.
Task_5 selects, for the name information, the n names with the largest weight coefficients, and for the place information, the m places with the largest weight coefficients among the places each person passed through; n and m depend on the situation and may also be set by the user.
Task_6 integrates the name/place information according to the current weight coefficients: for the name information, the epidemic spread tree is drawn according to the relative sizes of the weight coefficients; for the place information, each personal itinerary is drawn in the order in which the places were reached, and when all personal itineraries are plotted on the map, the epidemic map is obtained.
Task_7 counts, for the place information, the path intersections in the epidemic map; the weight coefficient is increased more for places where more paths intersect and less for places where fewer paths intersect.
Task_8 obtains the final result, namely the epidemic origin or patient zero, from the weight coefficients at this point, i.e. the final weight coefficients, so that the target area can carry out more scientific and effective epidemic prevention and control, stop the spread of the COVID-19 epidemic in time, and even stop a newly emerging outbreak at its start.
Task_i is an execution instruction and process reserved by the system for the user, to meet the user's need for extended functions.
A COVID-19 epidemic prediction and analysis method based on Bi-LSTM and TF-IDF is characterized by comprising the following steps:
Part_1: collect patient information (path of movement / close contacts) according to the expected result (epidemic origin / patient zero) or the category of the target area (country or larger area / relatively smaller area), specifically:
Generally, for a country or a larger region, e.g. the world, a continent or a country, what is wanted is the origin of the epidemic in that region, while for a relatively smaller region, e.g. a city, district or county, what is wanted is the patient zero of that region. The patient information to be collected therefore differs with the category of region: itinerary information is collected for larger regions and interpersonal relationship information for smaller regions.
Assume the collected information is typically text of the following form: Liu, male, 61 years old, retired employee. Usual residence: Peace Street, Anseris Town, Hulunbuir City, Inner Mongolia. On April 21 the patient was diagnosed with COVID-19. On April 5 at 21:00 he travelled with his wife, daughter and son (a party of four) from Hailar to Harbin by train K928 (car 11, seat 55). On April 6 at 7:00 he accompanied his wife from Harbin railway station to a large hospital in Harbin for her admission. At noon he ate at a home-style restaurant near the hospital, rested at the Yingmei Hotel from 14:00, and in the evening bought food at a stone-pot pork restaurant and at Yuliangfang before returning to the Yingmei Hotel for the night. On April 7 he visited his wife in the hospital, ate at a spicy-noodle and steamed-bun shop at noon, and returned to the hotel at 16:00. On April 8 he attended to his wife in the hospital and, after buying food at the stone-pot pork restaurant at 16:00, returned to the Yingmei Hotel for the night.
This information can be collected manually from the official websites of the target area, or a Python crawler can be used to download the patient information, yielding massive patient information that includes both interpersonal relationship information and itinerary information. The information that must be fed into the prediction model is buried in these redundant texts, which therefore need to be cleaned and extracted so that they can be processed and analyzed with efficient algorithms and models.
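As a sketch of the automated collection path (assuming the Python requests library; the URL below is a placeholder, not a real endpoint):

# Sketch: downloading published case-report pages for later text extraction.
import requests

def fetch_reports(urls):
    texts = []
    for url in urls:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        texts.append(resp.text)        # raw page text, cleaned in later steps
    return texts

# pages = fetch_reports(["https://example.org/case-report-1"])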
This concludes the description of feature Part_1.
Part_2: identify and extract the collected patient information with a Bi-LSTM + CRF model, specifically:
The information of different patients is stored in different documents, and for each document the Bi-LSTM + CRF model performs one pass of information recognition and extraction. The Bi-LSTM + CRF model uses the BIO tag set: B-PER and I-PER denote the first and subsequent characters of a person name, B-LOC and I-LOC the first and subsequent characters of a place name, B-ORG and I-ORG the first and subsequent characters of an organization name, and O denotes a character that is not part of any named entity. For example:
(Table omitted: an example sentence annotated character by character with BIO labels.)
The specific processing of one document with the Bi-LSTM + CRF model is described below as an example:
The first layer of the Bi-LSTM + CRF model is the look-up layer. A piece of text containing n characters is written as X = (x_1, x_2, ..., x_n), where x_i is the id of the i-th character of the text in the dictionary, so a one-hot vector can be obtained for each character whose dimension is the size of the dictionary. Each character x_i of the text is then mapped from its one-hot vector to a character embedding with a randomly initialized embedding matrix.
The second layer of the Bi-LSTM + CRF model is the Bi-LSTM layer. The character embedding sequence X of the text is fed to the Bi-LSTM one time step at a time. The forward LSTM outputs the hidden state sequence (hf_1, hf_2, ..., hf_n) and the backward LSTM outputs the hidden state sequence (hb_1, hb_2, ..., hb_n); splicing them position by position, h_i = [hf_i ; hb_i], yields the complete hidden state sequence h = (h_1, h_2, ..., h_n). A linear layer then maps each hidden state vector to k dimensions, where k is the number of labels in the tag set, giving the automatically extracted sentence features, written as the matrix P = (p_1, p_2, ..., p_n). Each component p_ij of p_i can be regarded as the score for classifying character x_i into the j-th label. However, this method cannot make use of the labels already assigned to earlier positions when labeling the current position, so a CRF layer is attached for the labeling.
The third layer of the Bi-LSTM + CRF model is the CRF layer, whose main function is sequence labeling. The parameter of the CRF layer is a matrix A of size (k + 2) × (k + 2), where A_ij is the transition score from the i-th label to the j-th label, so the labels already assigned to earlier positions can be used when labeling the current position; 2 is added because a start state is prepended to the text and an end state is appended to it.
For a label sequence y = (y_1, y_2, ..., y_n) whose length equals the length of the text, the score that the Bi-LSTM + CRF model assigns to text X with labels y is score(X, y) = Σ_{i=1..n} P_{i, y_i} + Σ_{i=1..n+1} A_{y_{i-1}, y_i}, i.e. the sum of the emission scores P output by the Bi-LSTM layer and the transition scores A of the CRF layer along the sequence.
Thus the CRF layer outputs the label of each character, and the desired patient information (names/places) can be extracted by traversing the labels of all characters. The extracted patient information (names/places) is stored in the corresponding documents: name information is stored in item_3 of the personal relationship information document Person, i.e. the close contact information set people processed by the Bi-LSTM model, and place information is stored in item_3 of the personal itinerary information document Route, i.e. the passing site information set sites processed by the Bi-LSTM model. This completes the preprocessing of the patient information, i.e. its recognition and extraction.
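The look-up layer and Bi-LSTM layer of this architecture can be sketched as follows (an illustrative PyTorch module with arbitrary sizes; the CRF layer and the training procedure are omitted):

# Sketch of the look-up layer and Bi-LSTM layer producing per-label emission
# scores; the CRF layer and training are omitted. Assumes PyTorch is installed.
import torch
import torch.nn as nn

class BiLSTMEmitter(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # look-up layer
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)                 # Bi-LSTM layer
        self.fc = nn.Linear(2 * hidden_dim, num_tags)             # map to k label scores

    def forward(self, char_ids):            # char_ids: (batch, seq_len) dictionary ids
        h, _ = self.bilstm(self.embed(char_ids))
        return self.fc(h)                   # emission matrix P: (batch, seq_len, num_tags)

model = BiLSTMEmitter(vocab_size=3000, num_tags=7)   # 7 = B/I for PER, LOC, ORG plus O
scores = model(torch.randint(0, 3000, (1, 20)))      # one sentence of 20 characters
print(scores.shape)                                  # torch.Size([1, 20, 7])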
This concludes the description of feature Part_2.
Part_3: calculate the initial weight coefficients of the extracted information with a TF-IDF model and adjust the weight coefficients, specifically:
The two parts above have completed the collection and preprocessing of the information, so the weight coefficients can now be set and adjusted, i.e. a TF-IDF-based weight calculation is performed on all documents in the target region corpus. The TF-IDF weight coefficient of any term in a document is calculated as follows:
Term frequency (TF) is the frequency with which a given term appears in the document, calculated as tf_{i,j} = n_{i,j} / Σ_k n_{k,j}, where n_{i,j} is the number of times term t_i appears in document d_j and Σ_k n_{k,j} is the total number of terms in document d_j.
Inverse document frequency (IDF) measures the general importance of a term. The IDF of a particular term is the base-10 logarithm of the total number of documents in the corpus divided by the number of documents containing that term: idf_i = log10( |D| / |{ j : t_i ∈ d_j }| ), where |D| is the total number of documents in the corpus and |{ j : t_i ∈ d_j }| is the number of documents containing the term.
The TF-IDF coefficient is the product of the term frequency (TF) and the inverse document frequency (IDF), i.e. TF-IDF = TF × IDF. This yields the TF-IDF coefficient of a term in a document, that is, the initial weight coefficient of the corresponding information (name/place). The same operation is then performed for every term of every document in the target region corpus, and the initial weight coefficient of each piece of information (name/place) is obtained and stored in the corresponding document.
Since the invention uses the TF-IDF model in reverse, the obtained initial weight coefficients are then inverted. The specific operation is: the original weight coefficient w is replaced by its value 1 - w under the mapping f(x) = 1 - x.
The invention also adjusts the weight coefficient of each piece of information (name/place) according to the time order. For the interpersonal relationship information, the weight of a person who was in close contact with the patient at a later time is increased more, and the weight of a person who was in close contact at an earlier time is increased less. The concrete method is: sort all names from the latest contact time to the earliest; add the initial value Δω to the weight coefficient of the first name in the ranking; let Δω then decrease uniformly towards zero, and add the current value of Δω to the weight coefficient of each subsequent name. When the weights of all names have been adjusted and Δω has fallen to 0, the first round of weight adjustment is complete.
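One way to read this first-round, time-ordered adjustment is the following sketch (illustrative Python; the initial value of Δω and the uniform step are assumptions, since the description does not fix them):

# Sketch of the time-ordered weight adjustment: names are sorted from the latest
# contact to the earliest, the full increment is added to the first name, and the
# increment decreases uniformly to zero afterwards.
def adjust_by_recency(weights, order, delta_omega=0.3):
    """weights: dict name -> weight; order: names sorted from latest to earliest contact."""
    n = len(order)
    for i, name in enumerate(order):
        current = delta_omega * (1 - i / (n - 1)) if n > 1 else delta_omega
        weights[name] += current          # later contacts receive a larger increment
    return weights

w = {"A": 0.6, "B": 0.5, "C": 0.4}
print(adjust_by_recency(w, ["A", "B", "C"]))   # A gets +0.3, B +0.15, C +0.0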
This completes the preprocessing and basic initialization, and the documents are now in the following state. For a personal relationship information document Person, the number of the document in the corpus and the name of the patient are obtained in the information collection stage. After processing with the Bi-LSTM + CRF model, the close contact information set people and the order information of the patient's close contacts are obtained. After the TF-IDF coefficients are calculated with the TF-IDF model, the weight coefficient set weight of the close contacts is obtained, and the close contact level label is set to 0, i.e. its initial value. The format of the personal relationship information document Person is:
Person={name,people[n],order[n],weight[n],0}
where n is the number of close contacts to which the patient corresponds.
For the personal itinerary information document Route, in the information collection phase, the number of the document in the corpus and the name of the patient corresponding to the document can be obtained. After Bi-LSTM + CRF model processing, a passing point information set sites and a sequential information order of passing points can be obtained. After the TF-IDF coefficient is calculated by adopting a TF-IDF model, the weight coefficient set weight corresponding to each passing place can be obtained, and the risk degree class of the passing place is set to be 0, namely the initial value. The format of the personal itinerary information document Route is:
Route={name,sites[n],order[n],weight[n],0}
where n is the total number of sites in the patient pathway.
This concludes the description of feature Part_3.
Part_4: classify the names or places with a multi-class SVM classifier and adjust the weights accordingly, specifically:
The name information, i.e. the close contact information, has three levels: severe close contact, corresponding to close contact level label = 1; moderate close contact, corresponding to label = 2; and mild close contact, corresponding to label = 3. The adapted multi-class SVM performs this close-contact classification by analyzing four features: whether the contact lives with the patient, the number of times the contact shared meals with the patient, whether the contact had very close contact with the patient (e.g. riding in the same car or other close behavior), and the number of times the contact met the patient. The place information, i.e. the epidemic risk level of the passing places, also has three levels: high-risk places, corresponding to risk level label = 1; medium-risk places, corresponding to label = 2; and low-risk places, corresponding to label = 3. The adapted multi-class SVM performs this risk classification by analyzing four features: the population density of the area where the place is located, whether the place is in the open air, whether the place requires a mask to be worn at all times, and the daily average flow of people at the place.
First, all the collected and preprocessed patient information (names/places) is analyzed. For the interpersonal relationship information, the feature values to be obtained are: whether the contact lives with the patient, TF1 (TF1 = 1 for yes, TF1 = 0 for no); the number of times the contact shared meals with the patient, count1; whether the contact had very close contact with the patient, for example riding in the same car or other close behavior, TF2 (TF2 = 1 for yes, TF2 = 0 for no); and the number of times the contact met the patient, count2. For the passing place information, the feature values to be obtained are: the population density density of the area where the place is located; whether the place is in the open air, TF1 (TF1 = 1 for yes, TF1 = 0 for no); whether the place requires a mask to be worn at all times, TF2 (TF2 = 1 for yes, TF2 = 0 for no); and the daily average flow of people at the place, flow.
Then part of the information is labeled manually according to the obtained feature values. The specific method is: for the interpersonal relationship information, if the authorities have made a clear official statement, i.e. whether someone is a severe, moderate or mild close contact, the close contact level label is marked according to that statement; if there is no clear official statement, the label is marked according to the current criteria. The format of the relationship feature data set PersonFeatureSet input to the multi-class SVM classifier is:
PersonFeatureSet={name,TF1,count1,TF2,count2,0}
Here label is 0, the initial value, because the grading has not yet been carried out. The passing place information is processed in the same way, so the description is not repeated. The format of the place feature data set RouteFeatureSet input to the multi-class SVM classifier is:
RouteFeatureSet={name,density,TF1,TF2,flow,0}
Here label is likewise 0, the initial value, because the grading has not yet been carried out.
Finally, the manually labeled part of the information is fed into the multi-class SVM to train it. The information that has not been manually labeled is then fed into the trained multi-class SVM and graded. At this point the label of the personal relationship information document Person is updated, and its format is:
Person={name,people[n],order[n],weight[n],label}
wherein n is the number of close contacts of the patient, label = 1 denotes severe close contact, label = 2 moderate close contact, and label = 3 mild close contact; the personal itinerary information document Route is updated likewise, in the format:
Route={name,sites[n],order[n],weight[n],label}
where n is the total number of places on the patient's path, label = 1 denotes a high-risk place, label = 2 a medium-risk place, and label = 3 a low-risk place. This completes the grading of the information (names/places). The weight coefficients are then adjusted according to the result obtained in this step.
For the name information, if a close contact is graded as a severe close contact of the patient, the weight coefficient of that name is increased by Δω1_PersonWeight; if graded as a moderate close contact, it is increased by Δω2_PersonWeight; and if graded as a mild close contact, it is increased by Δω3_PersonWeight. The increments satisfy Δω1_PersonWeight > Δω2_PersonWeight > Δω3_PersonWeight.
For the place information, if a place is graded as a high-risk place, the weight coefficient of that place is increased by Δω1_RouteWeight; if graded as a medium-risk place, it is increased by Δω2_RouteWeight; and if graded as a low-risk place, it is increased by Δω3_RouteWeight. The increments satisfy Δω1_RouteWeight > Δω2_RouteWeight > Δω3_RouteWeight.
This completes the second round of adjustment of the weight coefficients, which now conform increasingly well to objective scientific laws and the actual epidemic situation.
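A compact sketch of this Part_4 workflow, with scikit-learn as the classifier and assumed increment values (the description only fixes their ordering), could look like this:

# Sketch of Part_4: train a multi-class SVM on a manually labeled subset of
# PersonFeatureSet rows, grade the rest, then raise weights by level.
# The increment values are assumptions; only their ordering is prescribed.
from sklearn.svm import SVC

# feature rows: [TF1 same household, count1 shared meals, TF2 very close contact, count2 meetings]
labelled_X = [[1, 5, 1, 9], [0, 2, 0, 4], [0, 0, 0, 1]]
labelled_y = [1, 2, 3]                      # 1 severe, 2 moderate, 3 mild close contact
unlabelled_X = [[1, 3, 1, 6], [0, 0, 0, 2]]

clf = SVC(kernel="linear", decision_function_shape="ovo").fit(labelled_X, labelled_y)
levels = clf.predict(unlabelled_X)

increments = {1: 0.3, 2: 0.2, 3: 0.1}       # assumed Δω1 > Δω2 > Δω3
weights = [0.55, 0.40]                      # current weights of the two graded contacts
weights = [w + increments[int(lv)] for w, lv in zip(weights, levels)]
print(levels, weights)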
This concludes the description of feature Part_4.
Part_5: screen the information, form the patient path map / epidemic spread relation tree, and predict the epidemic origin / patient zero, specifically:
First, the massive, redundant information is screened once. For the name information, the n close contacts with the highest weight coefficients in the whole regional relationship corpus are marked and extracted, and the selected names and their weight coefficients are stored in the secondary integration relationship document. If the screened close contacts contain duplicates, their weight coefficients are added together and recorded as a single entry. At the same time, the information of people who are severe close contacts of the selected people is stored and updated in the personal relationship information documents; that is, the personal relationship information document Person is also simplified, keeping only the entries of close contacts with label = 1. At this point every item of the secondary integration relationship document is available, namely the number of selected persons count, the number set number of the personal relationship information documents of the selected persons in the corpus, and the weight coefficient set weight of the selected persons, so the format of the secondary integration relationship document IntegratedPerson is:
IntegratedPerson={count,number[count],weight[count]}。
For the place information, the m places with the highest weight coefficients in each personal itinerary information document are marked and extracted. The screened places and their weight coefficients are stored in secondary personal itinerary information documents, one secondary document per personal itinerary document. At this point every item of the secondary personal itinerary information document is available, namely the number of the document in the corpus, the name of the patient, the selected passing site information set sites, the weight coefficient set weight of the selected sites, and the order information order of the selected sites, so the format of the secondary personal itinerary information document IntegratedRoute is:
IntegratedRoute={number,name,sites[m],weight[m],order[m]}。
Next, prediction and analysis begin from the extracted information. For the name information, the weight coefficients of the names in the secondary integration relationship document are sorted in descending order, names whose weight coefficients are equal or differ by less than ε are placed on the same layer of the dendrogram, and the connections between the nodes of the dendrogram are drawn by consulting the close-contact documents, yielding the epidemic spread tree. The name with the highest weight coefficient in the secondary integration relationship document is the predicted patient zero.
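As an illustration of the layering rule (ε and the weights below are arbitrary assumptions), names can be grouped into layers of the spread tree as follows:

# Sketch: descending sort by weight, then group names whose weights differ by
# less than epsilon into the same layer of the propagation tree.
def layer_names(weights, epsilon=0.05):
    """weights: dict name -> final weight; returns a list of layers (top layer first)."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    layers, current = [], [ranked[0]]
    for name, w in ranked[1:]:
        if current[-1][1] - w < epsilon:     # close enough: same layer
            current.append((name, w))
        else:
            layers.append(current)
            current = [(name, w)]
    layers.append(current)
    return layers

w = {"Zhang": 0.92, "Li": 0.90, "Wang": 0.71, "Zhao": 0.70}
print(layer_names(w))   # Zhang and Li share the top layer; the top name is the predicted patient zero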
For the location information, for each secondary personal travel information document in the secondary travel information document set, the places are plotted and marked on a map with computer graphics techniques in the order in which the location information is recorded in the document, and the marks are connected in that order to draw a path. When all the location information has been marked on the map, the epidemic map of the target area is obtained. Each place is then weighted according to the number of paths that intersect it. The specific process is as follows: first, the number of paths intersecting each Site is counted and stored in the path information set Path (Path is a two-dimensional tuple recording the mapping between a Site and the number of paths intersecting it); then the weight of each place is adjusted accordingly: the sum total of all path counts in Path is computed, and the increment is
Δω_{i,j} = Path[i][j] / total,
where Path[i][j] is the number of paths passing through place i. Finally, the corresponding increment Δω_{i,j} is added to the weight coefficient of each place i, until every place on the "epidemic map" crossed by 2 or more paths has been processed. The weight coefficients of all current places are then counted, and the place with the highest weight coefficient is the predicted epidemic origin.
This concludes the description of feature Part_5.
The invention discloses a Bi-LSTM and TF-IDF based new crown epidemic prediction and analysis method, which has the following beneficial effects:
(1) According to the size of the region and the actual requirements, different processing and analysis means are applied, and different epidemic-guidance opinions are put forward.
Common evaluation and prediction models usually apply the same processing to different target areas. Although this is simpler and more universal, the results obtained are often too general and special situations are easily overlooked, so the prediction and analysis results of such models are not good. The present invention divides the target area into large regions (such as the whole world, Asia, China, a province, etc.) and small regions (such as a city, district or county), extracts and analyzes different patient information (interpersonal contact relations / places) for each, and obtains different results (epidemic origin / patient zero) through model prediction. The results obtained are therefore more targeted, and more scientific and effective.
(2) The method applies a Bi-LSTM model of an artificial neural network in the field of natural language processing to extract key information from massive texts.
The commonly collected information on COVID-19 patients consists of massive descriptive texts, and because the number of infected people is large and the epidemic spreads quickly, manually extracting the key information of every patient is obviously unrealistic. The invention therefore adopts the Bi-LSTM model, an artificial neural network used in natural language processing, to mark and extract the key patient information, namely interpersonal relations and passing places; a large amount of key information is extracted quickly and accurately, so the result of the prediction model better conforms to objective rules and the actual situation.
(3) The method not only predicts the epidemic origin and patient zero of the target area, but also plays a warning and guiding role for further epidemic prevention and control.
By analyzing the collected patient information, the invention can predict the epidemic origin and patient zero of the target area, i.e. the source information of the epidemic. Once the source information is obtained, the spreading pattern and spreading situation of the epidemic in the target area can be better investigated, which is of great significance for epidemic prevention and control and allows the prevention and control work to be carried out more effectively. Through the identified epidemic origin, similar places can be thoroughly inspected and managed, the epidemic can even be stopped in its cradle, and a warning and guiding role can be played for further epidemic prevention and control.
Drawings
FIG. 1 is a schematic diagram of the main process of the Bi-LSTM and TF-IDF based new crown epidemic prediction and analysis method.
Detailed Description
The following detailed description of the embodiments of the present invention is provided in connection with the accompanying drawings and examples, which are provided for illustration of the present invention and are not intended to limit the scope of the present invention.
Example 1: the process of identifying and extracting patient information in the predictive model.
Taking as an example the process of identifying and extracting the passing-point information of one patient from a text segment, the patient information is as follows: Ms. Liu. On January 23 at 22:00 she checked in at the Shenyang Sun Lion Wanli Hotel. On January 24 at 19:00 she arrived at a relative's home in the Lesong area of Xiangfang District, Harbin, and then returned home. On January 25 at 9:00 she walked to Jia le Fu (Lesong store) for shopping, and at 13:00 walked to the Kitchen Cabinet restaurant (Happiness Road store) for a meal. On January 26 at 15:00 she drove a private car to the Baoyu Tianyi Lanshan residential district in Daowai District. On January 27 at 14:00 she drove a private car to the Baoyu Tianyi Lanshan district. On January 28 at 17:00 she drove a private car to the Sanba Restaurant (Donglai Street store) for a meal, at 19:00 drove to the Baoyu Tianyi Lanshan district, and at 23:00 returned home. On January 30 at 15:00 she drove a private car to the Baoyu Tianyi Lanshan district. On the afternoon of January 31 she drove a private car to the Baoyu Tianyi Lanshan district and returned home at 23:00. On February 1 at about 17:20 she took a private car to Taiping Airport and took flight CZ6203 from Harbin to Beijing, then travelled by car to her home in Chaoyang. From February 1 to 5 she shopped at the nearby Hualian supermarket.
The text is labeled with the BIO tagging set, where B-PER and I-PER denote the first and non-first characters of a person name, B-LOC and I-LOC denote the first and non-first characters of a place name, B-ORG and I-ORG denote the first and non-first characters of an organization name, and O denotes that the character is not part of a named entity. The specific process of tagging named entities is presented below. First, the Bi-LSTM layer of the Bi-LSTM + CRF model scores the label of each character of a sentence in the text; taking one sentence of the text as an example, two named entities, "Jia le Fu Lesong store" and "Kitchen Cabinet Happiness Road store", can be extracted, as follows:
(Image in the original: the example sentence with its character-level BIO tag sequence, from which the two named entities above are extracted.)
Taking the named entity "Jia le Fu Lesong store" as an example, the probability scores assigned by the Bi-LSTM layer to each character for the three labels B, I and O are shown in Table 1 below:
TABLE 1
Character       B     I     O
Jia (家)       0.5   0.1   0.4
Le (乐)        0.2   0.5   0.3
Fu (福)        0.1   0.5   0.4
Le (乐)        0.1   0.6   0.3
Song (松)      0.1   0.5   0.4
Dian (店)      0.1   0.6   0.3
Then, the CRF layer of the Bi-LSTM + CRF model further constrains and corrects the predicted labels, reducing the possibility of illegal label sequences, so that a relatively accurate label is obtained for each character or word of the text segment. The required key information, in this example the location information, can then be extracted by traversing the tag set corresponding to the text segment. By repeating the above operation for every sentence of the whole text segment, all passing-point information of the patient is extracted. This information is stored in the passing-point information set sites, processed by the Bi-LSTM model, of the personal travel information document Route, which realizes the process of identifying and extracting the patient's passing-point information with the Bi-LSTM + CRF model in the prediction model.
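A minimal sketch of this final traversal step, i.e. walking over the tag sequence output by the CRF layer and collecting the B-LOC/I-LOC spans (the helper function, the example sentence and its tags are illustrative, not taken from the patent):

def extract_entities(tokens, tags, prefix="LOC"):
    # Collect maximal spans tagged B-<prefix> followed by I-<prefix>.
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-" + prefix:
            if current:
                entities.append("".join(current))
            current = [token]
        elif tag == "I-" + prefix and current:
            current.append(token)
        else:
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities

# One sentence of the text and the tags predicted by the Bi-LSTM + CRF model
tokens = list("1月25日9时步行到家乐福乐松店购物")
tags = ["O"] * 10 + ["B-LOC"] + ["I-LOC"] * 5 + ["O", "O"]
print(extract_entities(tokens, tags))   # -> ['家乐福乐松店']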
Example 2: and applying the TF-IDF model to acquire initial weight coefficients of the information in the prediction model.
Assuming the personal travel information documents Route of all patients in the target area have been obtained, they are integrated into a regional travel information corpus. The contents of the passing-point information set sites of the personal travel information document Route numbered 1 in the corpus are: Shenyang Sun Lion Wanli Hotel, Jia le Fu Lesong store, Kitchen Cabinet Happiness Road store, Baoyu Tianyi Lanshan district, Sanba Restaurant Donglai Street store, Baoyu Tianyi Lanshan district, Baoyu Tianyi Lanshan district. Taking the entry "Baoyu Tianyi Lanshan district" as an example, the entry appears 3 times in the document, and the total number of entries in the personal travel information document is 7, so the term frequency of the entry is TF = 3/7 = 0.43. Given that the corpus contains 200 personal travel information documents and the entry "Baoyu Tianyi Lanshan district" appears in 20 of them, the inverse document frequency of the entry is IDF = lg(200/20) = 1. The final TF-IDF weight coefficient is therefore TF-IDF = TF × IDF = 0.43 × 1 = 0.43, and this value is taken as the initial weight coefficient of the place "Baoyu Tianyi Lanshan district".
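The same calculation written out in Python (lg denotes the base-10 logarithm; the numbers are those of this example):

import math

tf = 3 / 7                      # the entry appears 3 times among 7 entries
idf = math.log10(200 / 20)      # 200 documents, 20 of which contain the entry
tf_idf = tf * idf
print(round(tf, 2), idf, round(tf_idf, 2))   # 0.43 1.0 0.43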
Example 3: and classifying the place information by using an SVM multi-classification classifier, and adjusting corresponding weight coefficients.
For the location information, i.e. the passing places, the epidemic risk degree has three levels: high-risk places, with corresponding risk degree level label = 1 in the personal travel information document Route; medium-risk places, with label = 2; and low-risk places, with label = 3. The four features that are input into the multi-classification SVM for analysis are the population density of the area where the place is located, whether the place is open-air, TF1 (TF1 = 1 means yes, TF1 = 0 means no), whether the place requires wearing a mask (without removal), TF2 (TF2 = 1 means yes, TF2 = 0 means no), and the daily average traffic of the place.
First, some of the location information is selected and labeled manually. For example, for the place "Tianan Seafood Market", the population density = 151, whether open-air TF1 = 0, whether a mask must be worn TF2 = 1, and the daily average traffic = 900; according to these feature values the risk level of the place is manually labeled as a medium-risk place. For the place "Hospital Hotel", the population density = 202, TF1 = 0, TF2 = 0, and the daily average traffic = 1300; according to these feature values the risk level is manually labeled as a high-risk place. For the place "Jiamei Supermarket", the population density = 50, TF1 = 0, TF2 = 1, and the daily average traffic = 150; according to these feature values the risk level is manually labeled as a low-risk place.
Following this process, some of the places are labeled manually, and the feature matrix of these manually labeled places is input into the multi-classification SVM to train it. The feature matrices of the remaining, unlabeled places are then input into the trained multi-classification SVM for grading, which realizes the classification of the location information with the SVM multi-class classifier.
The weight coefficient of each place is then adjusted according to the risk level obtained from the multi-classification SVM. For high-risk places the weight coefficient increment is Δω, for medium-risk places it is 0.125Δω, and for low-risk places it is -0.125Δω. This completes the adjustment of the weight coefficients according to the results of the multi-classification SVM.
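A minimal sketch of this step with scikit-learn (SVC performs one-versus-one multi-class classification internally, matching the one-to-one construction described in this method; the feature rows, the unlabeled sample and the base increment Δω are illustrative assumptions):

from sklearn.svm import SVC

# Features: [population density, open-air TF1, mask required TF2, daily traffic]
X_train = [[151, 0, 1,  900],   # "Tianan Seafood Market" -> medium risk (label 2)
           [202, 0, 0, 1300],   # "Hospital Hotel"        -> high risk   (label 1)
           [ 50, 0, 1,  150]]   # "Jiamei Supermarket"    -> low risk    (label 3)
y_train = [2, 1, 3]

clf = SVC(decision_function_shape="ovo")   # one-vs-one multi-class SVM
clf.fit(X_train, y_train)

label = clf.predict([[120, 0, 1, 700]])[0]         # an unlabeled place
delta = 0.1                                         # base increment Δω (assumed)
increment = {1: delta, 2: 0.125 * delta, 3: -0.125 * delta}[label]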
Example 4: drawing an epidemic situation map and predicting the origin of the epidemic situation.
The python language is selected, and map visualization is realized with the help of third-party libraries such as basemap and geopandas.
For each secondary personal travel information document in the secondary travel information document set, the places are visualized on the map with python in the order in which the location information is recorded in the document, and the places are connected in that order to draw a path, i.e. a connecting line that follows the map is drawn between two nodes that are adjacent in the order. When all the location information has been marked on the map, the epidemic map of the target area is obtained. With the help of third-party libraries such as basemap and geopandas, places with different weight coefficients and epidemic risk degrees can be shown in different colors on the map, and when the mouse hovers over a place its weight coefficient in the prediction model is displayed automatically.
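Since the exact basemap/geopandas calls are not specified here, a minimal matplotlib stand-in that plots the places of one secondary personal travel information document in visiting order and connects them into a path could look as follows (the coordinates are illustrative):

import matplotlib.pyplot as plt

# (longitude, latitude) of each selected place, already sorted by "order"
path = [(126.63, 45.74), (126.66, 45.72), (126.69, 45.75), (126.65, 45.77)]
lons, lats = zip(*path)

plt.plot(lons, lats, "-o")               # connect consecutive places into a path
for i, (x, y) in enumerate(path):
    plt.annotate(str(i + 1), (x, y))     # label the visiting order
plt.title("Epidemic map (one patient path)")
plt.show()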
Then each place is weighted according to the number of paths that intersect it. For example, to adjust the weight of the place "Tianan Market": the number of paths intersecting this place is 50 and the total number of paths is 500, so the weight coefficient increment for the place "Tianan Market" is
Δω = Path[i][j] / total = 50/500 = 0.1.
Finally, the corresponding increment Δω_{i,j} is added to the weight coefficient of each place i, until every place on the "epidemic map" crossed by 2 or more paths has been processed. The weight coefficients of all current places are then counted; assuming the weight coefficient of "First Farmer Market" is now 0.8, that of "Tianan Market" is 0.5, and that of "Tongfu Restaurant" is 0.6, the place with the highest weight coefficient, "First Farmer Market", is the predicted "epidemic origin".
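A short sketch of this final weighting step, using the increment Δω_{i,j} = Path[i][j]/total described above (the path counts and initial weights are illustrative, not the figures of this example):

def adjust_by_paths(weights, path_counts):
    # weights: {place: current weight coefficient}
    # path_counts: {place: number of paths that cross the place}
    total = sum(path_counts.values())           # sum of all counts stored in Path
    for place, count in path_counts.items():
        if count >= 2:                          # only places crossed by >= 2 paths
            weights[place] += count / total     # increment = Path[i][j] / total
    return max(weights, key=weights.get)        # predicted epidemic origin

weights = {"First Farmer Market": 0.7, "Tianan Market": 0.4, "Tongfu Restaurant": 0.5}
paths   = {"First Farmer Market": 50,  "Tianan Market": 50,  "Tongfu Restaurant": 50}
print(adjust_by_paths(weights, paths))          # -> First Farmer Market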
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (1)

1. A new crown epidemic situation prediction and analysis method based on Bi-LSTM and TF-IDF is characterized by comprising the following steps:
part _1: depending on the desired result: the categories of the origin of epidemic situation, the zero patient and the target area, the information of the patients is collected,
part _2: the Bi-LSTM + CRF model is used for identifying and extracting the acquired patient information,
part _3: calculating initial weight coefficient of the extracted information by using TF-IDF model, adjusting corresponding weight,
part _4: the multi-classification SVM classifier is utilized to classify the places and the names of people, the corresponding weight is optimized,
part _5: screening information, forming a patient path map and an epidemic situation propagation relation tree, and predicting an epidemic situation origin and a zero-number patient;
a new crown epidemic situation prediction and analysis method based on Bi-LSTM and TF-IDF uses the data structure, functions and processes as follows:
(1) The data structure of the Person relationship information document Person is defined as follows
Data item 1, Item_1: the number of the document in the corpus,
Data item 2, Item_2: the name of the patient to which the document corresponds,
Data item 3, Item_3: the close contact information set scope processed by the Bi-LSTM model,
Data item 4, Item_4: the weight coefficient set weight corresponding to the close contacts,
Data item 5, Item_5: the order information order of the close contacts,
Data item 6, Item_6: the close contact degree level label of the close contacts;
(2) The data structure of the personal travel information document Route is defined as follows
Data item 1, Item_1: the number of the document in the corpus,
Data item 2, Item_2: the name of the patient to which the document corresponds,
Data item 3, Item_3: the passing-point information set sites processed by the Bi-LSTM model,
Data item 4, Item_4: the weight coefficient set weight corresponding to the passing points,
Data item 5, Item_5: the order information order of the passing points,
Data item 6, Item_6: the risk degree level label of the passing points;
(3) The data structure of the secondary personal travel information document IntegratedRoute is defined as follows
Data item 1, Item_1: the number of the document in the corpus,
Data item 2, Item_2: the name of the patient to which the document corresponds,
Data item 3, Item_3: the selected passing-point information set sites,
Data item 4, Item_4: the weight coefficient set weight corresponding to the selected passing points,
Data item 5, Item_5: the order information order of the selected passing points;
(4) The data structure of the secondary integration relation document IntegratedPerson is defined as follows
Data item 1, Item_1: the number count of the selected persons,
Data item 2, Item_2: the set number of corpus numbers of the personal relationship information documents corresponding to the selected persons,
Data item 3, Item_3: the weight coefficient set weight corresponding to the selected persons;
(5) The data structure of the location feature data set RouteFeatureSet input into the multi-classification SVM classifier is defined as follows
Data item 1, Item_1: the place name site,
Data item 2, Item_2: the population density of the area where the place is located,
Data item 3, Item_3: whether the place is open-air, TF1,
Data item 4, Item_4: whether the place requires wearing a mask, TF2,
Data item 5, Item_5: the daily average traffic flow of the place,
Data item 6, Item_6: the risk degree level label of the place;
(6) The data structure of the relation feature data set PersonFeatureSet input into the multi-classification SVM classifier is defined as follows
Data item 1, Item_1: the name of the patient currently being analyzed, name,
Data item 2, Item_2: whether the contact and the patient are in a co-living relationship, TF1,
Data item 3, Item_3: the number of times the contact has dined with the patient, count1,
Data item 4, Item_4: whether the contact has had over-close contact with the patient, TF2,
Data item 5, Item_5: the number of times the contact has met the patient, count2,
Data item 6, Item_6: the close contact degree level label of the contact;
(7) The multi-classification SVM is defined as follows
The specific process comprises the following steps: the SVM of machine learning is transformed into a multi-classification SVM by the one-versus-one method, i.e. several binary SVMs are combined to construct the multi-classification SVM; during training, samples with the same feature category are put into one class and samples with the other features into another class; during classification, an unknown sample is assigned to the class with the largest classification-function value,
Input and aim: the feature data sets of the named entities (person names / places) that have been labeled manually are used to train the multi-classification SVM; the feature data sets of the remaining, unlabeled named entities need to be graded by the trained multi-classification SVM,
Output and result: all named entities are labeled, i.e. the corresponding level is recorded in label;
(8) The Bi-LSTM + CRF model is defined as follows
The specific process comprises the following steps: the Bi-LSTM + CRF model is an artificial neural network with a three-layer structure and is divided into a representation layer, a Bi-LSTM layer and a CRF layer,
input of the Bi-LSTM layer: a vector formed by Character Embedding and obtained by random initialization,
output of the Bi-LSTM layer: the predicted score for each of the tags is,
inputs to the CRF layer: the predicted score for each of the tags is,
Output of the CRF layer: the label of each word;
(9) TF-IDF model, defined as follows
The specific process comprises the following steps: TF-IDF weight = TF × IDF, where TF denotes the frequency of occurrence of a term in a document, IDF is the logarithm to base 10 of the quotient of the total number of files divided by the number of files containing the term,
inputting: the document where the entry is located and the corpus where the entry is located,
and (3) outputting: TF-IDF weights for the entries;
(10) The system procedure Task is defined as follows
Task{
Task _1: according to the desired result: including the origin of the epidemic, the patient zero, and the category of the target area: patient information was collected from different levels of regions: including the action path and the person in close contact,
task _2: preprocessing massive original text information by applying a Bi-LSTM + CRF model,
task _3: calculating by using a TF-IDF model to obtain initial weight coefficients of the names and the places in the prediction model,
task _4: the multi-classification SVM is used for carrying out corresponding grade division on the names and the places of the people, the weight coefficient is adjusted and optimized according to the grade of the division,
task _5: the name and location information is screened in one round according to the weight coefficient,
task _6: marking the screened place information in a map, drawing a personal path according to the arrival sequence to form an epidemic situation map, drawing the screened disease relation information into an epidemic situation propagation tree according to the size and the close contact relation of the weight coefficient,
task _7: according to the geometric characteristics and intersection condition of the path, the location information is adjusted and optimized by the weight coefficient,
task _8: the place with the highest current weight coefficient is the predicted epidemic situation starting place, the person corresponding to the name with the highest current weight coefficient is the predicted zero patient,
Task_i: instructions and processes reserved for the user,
}
The method proceeds as follows: Task_1 acquires the travel information and interpersonal-relationship information of the patients; Task_2 applies the Bi-LSTM + CRF model to mark and extract the key information from the collected massive text information and stores it in the corresponding documents; Task_3 applies the TF-IDF model to calculate the Term Frequency (TF) and the Inverse Document Frequency (IDF) of the different entries in the corpus where they are located, thereby obtaining the TF-IDF coefficient, which serves as the initial weight coefficient of the person names and places in the prediction model; Task_4 grades the person names and place names: first, a part of the close contacts and passing places is labeled manually with the close contact degree level label with the patient and the risk degree level label of the place, according to the feature data sets, and the labeled data are used to train the multi-classification SVM; then the remaining, unclassified person names and places are input into the trained multi-classification SVM for grading; the close contact degree levels label with the patient are of three kinds, namely severe, moderate and mild; the risk degree levels label of the places are of three kinds, namely low, medium and high; meanwhile, according to the levels assigned by the multi-classification SVM, the initial weight coefficients of the name information and the place information undergo a first round of optimization and adjustment, the weight coefficients being increased or decreased in the corresponding proportions; Task_5 selects, for the name information, the n persons with the largest weight coefficients, and, for the place information, the m places with the largest weight coefficients among the individual passing places, where n and m are initialized by the user; Task_6 integrates and optimizes the name and place information according to the current weight coefficients: the name information is drawn into an epidemic propagation tree according to the magnitude relation of the weight coefficients, and, for the place information, the individual travel paths are drawn in the order in which the places were reached, so that when all individual travel paths have been marked on the map, the drawing of the epidemic map is completed; Task_7 counts, for the place information, the path intersections in the epidemic map: the more paths intersect a place, the more its weight coefficient is increased, and the fewer paths intersect a place, the less its weight coefficient is increased; Task_8 obtains the final result, namely the epidemic origin and patient zero, from the current (final) weight coefficients, so that the target area can carry out more scientific and effective epidemic prevention and control work and prevent the spread of the new crown epidemic in time; Task_i is an execution instruction and process reserved by the system for the user, to meet the user's need for extended functions;
the data structure, the function and the process used by the system are described;
Part_1: according to the desired result, i.e. the epidemic origin, patient zero and the category of the target area, the patient information, namely the specific action route and the close contacts, is collected:
The collected information is in text form; the information is gathered from the official website of the target area, the patient information is downloaded with a python tool, and the regional patient information, comprising interpersonal-relationship information and travel information, is obtained; the information to be input into the prediction model is stored as text information and needs to be cleaned and extracted;
the description of the characteristic Part _1 is finished;
part _2: identifying and extracting the acquired patient information by using a Bi-LSTM + CRF model, which specifically comprises the following steps:
The information of different patients is stored in different documents, and the Bi-LSTM + CRF model performs one round of information identification and extraction on each document; in the Bi-LSTM + CRF model the BIO tagging set is adopted, in which B-PER and I-PER denote the first and non-first characters of a person name, B-LOC and I-LOC denote the first and non-first characters of a place name, B-ORG and I-ORG denote the first and non-first characters of an organization name, and O denotes that the character does not belong to any named entity;
For one round of information identification and extraction on a document based on the Bi-LSTM + CRF model, the specific process is described as follows:
The first layer of the Bi-LSTM + CRF model is the Look-up layer; a text segment containing n words is recorded as X = (x_1, x_2, ..., x_n), where x_i denotes the id of the i-th word of the text in the dictionary, so that the one-hot vector of each word is obtained, its dimension being the size of the dictionary; using a randomly initialized embedding matrix, each word x_i of the text is mapped from a one-hot vector to a Character Embedding;
The second layer of the Bi-LSTM + CRF model is the Bi-LSTM layer; the Character Embedding sequence X of the text is taken as the input of each time step of the Bi-LSTM; the hidden state sequence output by the forward LSTM,
(→h_1, →h_2, ..., →h_n),
and the hidden state sequence output by the backward LSTM,
(←h_1, ←h_2, ..., ←h_n),
are concatenated position-wise,
h_t = [→h_t ; ←h_t],
so that the complete hidden state sequence h = (h_1, h_2, ..., h_n) is obtained; a linear layer is then added to map the hidden state vectors from n dimensions to k dimensions, where k is the number of labels in the tag set, giving the automatically extracted sentence features, recorded as the matrix P = (p_1, p_2, ..., p_n); each element p_ij of p_i is regarded as the score for classifying the word x_i into the j-th label; a CRF layer is then attached for labeling;
The third layer of the Bi-LSTM + CRF model is the CRF layer, whose function is sequence labeling; the parameter of the CRF layer is a (k+2) × (k+2) matrix A, where A_ij denotes the transition score from the i-th label to the j-th label, so that the labels already assigned can be used when labeling a new position; 2 is added because a start state is added at the head of the sentence and an end state is added at its tail;
Assuming a tag sequence y = (y_1, y_2, ..., y_n) whose length equals that of the text, the score assigned by the Bi-LSTM + CRF model to the tag sequence y of the text X equals
score(X, y) = Σ_{i=1}^{n} P_{i, y_i} + Σ_{i=1}^{n+1} A_{y_{i-1}, y_i},
i.e. the sum of the emission scores of the Bi-LSTM layer and the transition scores of the CRF layer;
Thus, the CRF layer outputs the label corresponding to each word; then, by traversing the labels of all words, the patient information, namely person names and places, is extracted; the extracted patient information is stored in the corresponding documents, i.e. the person-name information is stored in Item_3 of the Person relationship information document Person, the close contact information set scope processed by the Bi-LSTM model, and the place information is stored in Item_3 of the personal travel information document Route, the passing-point information set sites processed by the Bi-LSTM model; the preprocessing of the patient information is thereby completed, realizing the identification and extraction of the patient information;
This concludes the description of feature Part_2;
part _3: calculating by using a TF-IDF model to obtain an initial weight coefficient of the extracted information, and adjusting and optimizing the weight coefficient, specifically:
after the information is collected and preprocessed, the setting and adjustment of the weight coefficient are realized, namely, the weight calculation based on the TF-IDF model is carried out on all the documents in the target region corpus; the TF-IDF weight coefficients for any entry in a document are calculated as:
The term frequency (TF) is the frequency with which a given entry appears in the document, and is given by
TF_{i,j} = n_{i,j} / Σ_k n_{k,j},
where n_{i,j} denotes the number of times the entry appears in the document and Σ_k n_{k,j} denotes the total number of entries in the document;
The Inverse Document Frequency (IDF) is a measure of the general importance of an entry; the IDF of a specific entry is obtained by dividing the total number of documents in the corpus by the number of documents containing the entry and taking the base-10 logarithm of the quotient:
IDF_i = lg( |D| / |{ j : t_i ∈ d_j }| ),
where |D| denotes the total number of documents in the corpus and |{ j : t_i ∈ d_j }| denotes the number of documents containing the entry;
the TF-IDF coefficient is the product of the word frequency (TF) and the Inverse Document Frequency (IDF), i.e. TF-IDF = TF × IDF;
calculating TF-IDF coefficients of entries in the document to obtain initial weight coefficients of the information; then, performing the operation on each entry of each document in the target regional information corpus to obtain an initial weight coefficient of each information, and storing the initial weight coefficient in the corresponding document;
Because the TF-IDF model is used in the reverse sense here, the obtained initial weight coefficients need to be inversely scaled; the specific operation is: the original weight coefficient w is replaced by its value 1 - w under the mapping f(x) = 1 - x;
The weight coefficients corresponding to the information are then adjusted and optimized according to the time order; for the interpersonal-relationship information, the later a person came into close contact with the patient, the more that person's weight coefficient is increased, and the earlier the contact, the less it is increased; the specific implementation is: all the person-name information is arranged from late to early in time order, the initial value Δω is added to the weight coefficient of the first-ranked name, Δω is then decreased uniformly until it reaches zero, and the current Δω is added to the weight coefficient of each subsequent name; when the weight coefficients of all the person-name information have been adjusted and the value of Δω has decreased to 0, the first round of adjustment and optimization of the weight coefficients is achieved;
the information preprocessing and basic initialization process is completed, and the current document states are as follows: for a Person relationship information document Person, in an information acquisition stage, acquiring the number of the document in a corpus and the name of a patient corresponding to the document; after Bi-LSTM + CRF model processing, obtaining the close contact person information set scope of the patient and the order information order of the close contact person; after a TF-IDF model is adopted to calculate TF-IDF coefficients, a weight coefficient set weight corresponding to each close contact Person is obtained, the close contact degree class of the close contact Person is initialized to 0, and the personal relationship information document Person is in a format as follows:
Person = {name, scope[n], order[n], weight[n], 0}, where n is the number of close contacts corresponding to the patient;
for a personal travel information document Route, acquiring the number of the document in a corpus and the name of a patient corresponding to the document in an information acquisition stage; after Bi-LSTM + CRF model processing, obtaining a passing point information set sites and a sequential information order of passing points; after a TF-IDF model is adopted to calculate TF-IDF coefficients, a weight coefficient set weight corresponding to each passing place is obtained, the risk degree class of the passing place is initialized to be 0, and the format of a personal journey information document Route is as follows:
Route = {name, sites[n], order[n], weight[n], 0}, where n is the total number of places on the patient's route;
the description of the characteristic Part _3 is finished;
part _4: classifying names and places by using a multi-classification SVM classifier, and performing corresponding weight adjustment optimization according to the names and the places, specifically comprising the following steps of:
For the person-name information, i.e. the close contact information, there are three levels: heavy close contact, with corresponding close contact degree level label = 1; medium close contact, with label = 2; and light close contact, with label = 3; the transformed multi-classification SVM grades the close contact degree by analyzing four features: whether the contact and the patient live together, the number of times the contact has dined with the patient, whether the contact has had over-close contact with the patient, and the number of times the contact has met the patient; for the location information, i.e. the epidemic risk degree of the passing places, there are three levels: high-risk places, with corresponding risk degree level label = 1; medium-risk places, with label = 2; and low-risk places, with label = 3; the transformed multi-classification SVM grades the risk degree by analyzing four features: the population density of the area where the place is located, whether the place is open-air, whether the place requires wearing a mask, and the daily average traffic of the place;
firstly, analyzing all collected patient information which is subjected to primary processing; for the interpersonal relationship information, the characteristic values to be acquired are: the contact person and the patient are in a co-living relationship condition TF1, wherein TF1=1 indicates yes, TF1=0 indicates no, the number of times the contact person and the patient have a meal together is counted, the contact person and the patient have an excessive contact condition TF2, wherein TF2=1 indicates yes, TF2=0 indicates no, and the number of times the contact person and the patient meet is counted 2; for the information of the passing points, the characteristic values to be acquired are: the population density of the area of the place, the open air condition TF1 of the place, wherein TF1=1 indicates yes, TF1=0 indicates no, the place requires to wear a mask condition TF2, wherein TF2=1 indicates yes, TF2=0 indicates no, and the daily average people flow of the place is flow;
Then, according to the obtained feature values, a part of the information is labeled manually; the specific method is: for the interpersonal-relationship information, if the authorities have made an explicit formal statement, i.e. the person is declared to be in heavy, medium or light close contact, the close contact degree level label is marked according to that statement; if there is no explicit formal statement, the close contact degree level label is marked according to the classification standards for label = 1, label = 2 and label = 3; the format of the relation feature data set PersonFeatureSet input into the multi-classification SVM classifier is:
PersonFeatureSet = {name, TF1, count1, TF2, count2, 0}, where the label value is initialized to 0, indicating that the level has not yet been assigned; the passing-point information is processed in the same way; the format of the location feature data set RouteFeatureSet input into the multi-classification SVM classifier is:
RouteFeatureSet = {name, density, TF1, TF2, flow, 0}, where the label value is initialized to 0, indicating that the level has not yet been assigned;
finally, inputting the part of information which is manually marked into the multi-classification SVM, and training the multi-classification SVM; inputting information which is not marked manually into the trained multi-classification SVM, and carrying out grade division on the information; and updating the label of the Person relationship information document Person, wherein the format is as follows:
Person = {name, scope[n], order[n], weight[n], label}, where n is the number of close contacts corresponding to the patient, label = 1 corresponds to heavy close contact, label = 2 to medium close contact, and label = 3 to light close contact; the personal travel information document Route is updated, in the format:
Route = {name, sites[n], order[n], weight[n], label}, where n is the total number of places on the patient's route, label = 1 corresponds to a high-risk place, label = 2 to a medium-risk place, and label = 3 to a low-risk place; the grading of the information is thereby completed; then, the weight coefficients are adjusted and optimized according to the results recorded in Person and Route;
For the person-name information, if a close contact is determined to be in heavy close contact with the patient, the weight coefficient corresponding to that name is increased by Δω1_PersonWeight; if the close contact is determined to be in medium close contact with the patient, the weight coefficient corresponding to that name is increased by Δω2_PersonWeight; if the close contact is determined to be in light close contact with the patient, the weight coefficient corresponding to that name is increased by Δω3_PersonWeight; the relationship between the increments is Δω1_PersonWeight > Δω2_PersonWeight > Δω3_PersonWeight;
For the location information, if a place is determined to be a high-risk place, the weight coefficient corresponding to that place is increased by Δω1_RouteWeight; if the place is determined to be a medium-risk place, the weight coefficient is increased by Δω2_RouteWeight; if the place is determined to be a low-risk place, the weight coefficient is increased by Δω3_RouteWeight; the relationship between the increments is Δω1_RouteWeight > Δω2_RouteWeight > Δω3_RouteWeight;
the description of the characteristic Part _4 is finished;
part _5: screening information and forming a patient path map, an epidemic situation propagation relation tree, predicting an epidemic situation place and a zero-number patient, which specifically comprises the following steps:
firstly, screening acquired information for one time; for the name information, marking and extracting the names of the close contacts corresponding to n personal names with higher weight coefficients in the whole regional relation corpus, and storing the selected names of the close contacts and the corresponding weight coefficients in a secondary integration relation document; if the screened closely contacted persons have repetition in the process, adding the corresponding weight coefficients and recording as one item; meanwhile, the information of the Person who is in a heavy close contact relationship with the selected Person is stored, and the Person is updated in the personal relationship information document, namely, the Person relationship information document Person is simplified, and the information corresponding to the close contact Person with label =1 is reserved; obtaining each item of content in the second-level integration relation document, wherein the item is the number count of the selected personnel, the number set number of the personal relation information document corresponding to the selected personnel in the corpus, and the weight coefficient set weight corresponding to the selected personnel, and then the format of the second-level integration relation document Integrated person is as follows:
IntegratedPerson={count,number[count],weight[count]};
for the place information, marking and extracting m places with high weight coefficients in each individual travel information document; storing the screened places and the corresponding weight coefficients in secondary personal travel information documents, namely, each personal travel information document corresponds to one secondary personal travel information document; obtaining each item of content in the secondary personal travel information document, namely the number of the document in the corpus, the name of the patient corresponding to the document, the selected passing point information set site, the weight coefficient set weight corresponding to the selected passing point and the sequence information order of the selected passing point, wherein the format of the secondary personal travel information document integrated route is as follows:
IntegratedRoute={number,name,sites[m],weight[m],order[m]};
Prediction and analysis are then performed on the extracted information; for the person-name information, the weight coefficients corresponding to the names in the secondary integration relation document are arranged in descending order, the names whose weight coefficients are equal or differ by less than a given value ε are placed on the same layer of the tree diagram, and the connection relations of the nodes in the tree diagram are drawn by retrieving the close-contact documents, so that the epidemic propagation tree is obtained; the person whose name corresponds to the highest weight coefficient in the secondary integration relation document is the predicted "patient zero";
For the location information, for each secondary personal travel information document in the secondary travel information document set, points are plotted and marked on a map with computer graphics techniques in the order in which the location information is recorded in the document, and the marks are connected in that order to draw a path; when all the location information has been marked on the map, the epidemic map of the target area is obtained; then the number of paths intersecting each place is counted and the place is weighted accordingly; the specific process is: first, the number of paths intersecting each Site is counted and stored in the path information set Path, a two-dimensional tuple recording the mapping between a Site and the number of paths intersecting it; then the weight of each place is adjusted accordingly: the sum total of all path counts in Path is obtained, and the increment is
Δω_{i,j} = Path[i][j] / total,
where Path[i][j] is the number of paths passing through place i; finally, the corresponding increment Δω_{i,j} is added to the weight coefficient of each place i, until every place on the "epidemic map" crossed by 2 or more paths has been processed; the weight coefficients of all current places are then counted, and the place with the highest weight coefficient is the predicted epidemic origin;
This concludes the description of feature Part_5.
CN202011236359.0A 2020-11-02 2020-11-02 Bi-LSTM and TF-IDF based new crown epidemic situation prediction and analysis method Active CN112582074B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011236359.0A CN112582074B (en) 2020-11-02 2020-11-02 Bi-LSTM and TF-IDF based new crown epidemic situation prediction and analysis method

Publications (2)

Publication Number Publication Date
CN112582074A CN112582074A (en) 2021-03-30
CN112582074B true CN112582074B (en) 2022-10-18

Family

ID=75120357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011236359.0A Active CN112582074B (en) 2020-11-02 2020-11-02 Bi-LSTM and TF-IDF based new crown epidemic situation prediction and analysis method

Country Status (1)

Country Link
CN (1) CN112582074B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894309A (en) * 2009-11-05 2010-11-24 南京医科大学 Epidemic situation predicting and early warning method of infectious diseases
EP3136257A2 (en) * 2015-08-27 2017-03-01 Xerox Corporation Document-specific gazetteers for named entity recognition
CN107301246A (en) * 2017-07-14 2017-10-27 河北工业大学 Chinese Text Categorization based on ultra-deep convolutional neural networks structural model
CN107894975A (en) * 2017-10-12 2018-04-10 北京知道未来信息技术有限公司 A kind of segmenting method based on Bi LSTM
CN109472026A (en) * 2018-10-31 2019-03-15 北京国信云服科技有限公司 Accurate emotion information extracting methods a kind of while for multiple name entities
CN109448781A (en) * 2018-11-06 2019-03-08 云南大学 A kind of prediction technique of influenza antigen variation
CN109656918A (en) * 2019-01-04 2019-04-19 平安科技(深圳)有限公司 Prediction technique, device, equipment and the readable storage medium storing program for executing of epidemic disease disease index
CN110085327A (en) * 2019-04-01 2019-08-02 东莞理工学院 Multichannel LSTM neural network Influenza epidemic situation prediction technique based on attention mechanism
CN111798991A (en) * 2020-07-09 2020-10-20 重庆邮电大学 LSTM-based method for predicting population situation of new coronary pneumonia epidemic situation
CN111739657A (en) * 2020-07-20 2020-10-02 北京梦天门科技股份有限公司 Epidemic infected person prediction method and system based on knowledge graph

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Risk prediction and factors risk analysis based on IFOA-GRNN and apriori algorithms: Application of artificial intelligence in accident prevention; Xuecai Xie, et al.; Process Safety and Environmental Protection; 2019-02-28; Vol. 122, pp. 169-184 *
Research and Implementation of Chinese Named Entity Recognition in Medical Texts Based on Lattice LSTM; Zhang Xiaotian; China Masters' Theses Full-text Database, Medicine and Health Sciences; 2020-01-15; pp. E054-89 *

Also Published As

Publication number Publication date
CN112582074A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
US9460391B2 (en) Methods and systems for knowledge discovery
CN107315738B (en) A kind of innovation degree appraisal procedure of text information
Afzaal et al. Fuzzy aspect based opinion classification system for mining tourist reviews
CN106997382A (en) Innovation intention label automatic marking method and system based on big data
CN107862027A (en) Retrieve intension recognizing method, device, electronic equipment and readable storage medium storing program for executing
CN107688870B (en) Text stream input-based hierarchical factor visualization analysis method and device for deep neural network
Kumar et al. Exploration of sentiment analysis and legitimate artistry for opinion mining
CN110705247B (en) Based on x2-C text similarity calculation method
CN111680225B (en) WeChat financial message analysis method and system based on machine learning
CN110888991B (en) Sectional type semantic annotation method under weak annotation environment
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN108681548A (en) A kind of lawyer's information processing method and system
CN108681977B (en) Lawyer information processing method and system
US20140089246A1 (en) Methods and systems for knowledge discovery
Ribeiro et al. Discovering IMRaD structure with different classifiers
Bramantoro et al. Classification of divorce causes during the COVID-19 pandemic using convolutional neural networks
CN116629258B (en) Structured analysis method and system for judicial document based on complex information item data
CN108614860A (en) A kind of lawyer's information processing method and system
KR20230163983A (en) Similar patent extraction methods using neural network model and device for the method
CN112582074B (en) Bi-LSTM and TF-IDF based new crown epidemic situation prediction and analysis method
Wang et al. Discriminant mutual information for text feature selection
Park et al. Estimating comic content from the book cover information using fine-tuned VGG model for comic search
Gao et al. Statistics and Analysis of Targeted Poverty Alleviation Information Integrated with Big Data Mining Algorithm
US20200226159A1 (en) System and method of generating reading lists
Zhang et al. Personalized Recommendation Method of Online Education Resources for Tourism Majors Based on Machine Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant