CN112527961A - Automatic extraction method for emergency response level of emergency plan and responsibility of administrative unit

Info

Publication number: CN112527961A
Application number: CN202011498662.8A
Authority: CN
Inventors: 朱安安; 邱彦林; 陈尚武
Original assignee: Hangzhou Xujian Science And Technology Co ltd
Current assignee: Hangzhou Xujian Science And Technology Co ltd
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2021-03-19
Anticipated expiration: 2040-12-18
Also published as: CN112527961B

Abstract

The invention relates to a method for automatically extracting the emergency response levels of an emergency plan and the responsibilities of administrative units, comprising the following steps: S1: preprocess the emergency plan, split its text content by directory title, and store the content in a database according to directory title level; S2: label the directory titles processed in step S1 with classification categories to form a labeled data set; train on the labeled data set, performing word segmentation, vectorization and classification; S3: extract the key information; S4: de-duplicate and splice the extracted administrative unit names, output the responsibilities of each administrative unit, and standardize the extracted entities related to trigger conditions; S5: obtain the response levels and corresponding trigger conditions, and output the analysis result. Based on a templated emergency plan, the method extracts the emergency response levels in the plan and the trigger conditions corresponding to each level, and standardizes them.

Description

Automatic extraction method for emergency response level of emergency plan and responsibility of administrative unit

Technical Field

The invention belongs to the field of data processing, and particularly relates to an automatic extraction method for emergency response grade of an emergency plan and responsibility of an administrative unit.

Background

An emergency plan is a plan for emergency management, command and rescue in the event of an emergency such as a natural disaster, a serious accident, environmental pollution or man-made destruction. It is usually a comprehensive accident emergency plan that describes in detail who does what, when and how before, during and after an accident, together with emergency response procedures compiled for the accident situations that may occur at each facility and location on site. The emergency response procedures cover all foreseeable hazardous conditions and specify the responsibilities of the personnel involved in the emergency.

Most current emergency plans are stored as paper files or electronic documents. The quality of the writing is uneven and the content varies widely. In addition, existing emergency plan digitization systems usually only convert the plan into a template; they do not extract and standardize the trigger conditions corresponding to the emergency response levels in the plan, the responsibilities of the relevant functional departments, and so on. When an emergency occurs, it is therefore difficult to judge which emergency plan applies to the event and which emergency response level is met. Responses are easily delayed, or it is unclear which department is responsible, which seriously reduces the efficiency of emergency command and handling.

Disclosure of Invention

To solve the above problems, the invention provides a method for automatically extracting the emergency response levels and administrative unit responsibilities of an emergency plan. Based on a templated emergency plan, the method extracts the emergency response levels in the plan and the trigger conditions corresponding to each level, and standardizes the extracted levels and trigger conditions. It also extracts the functional units mentioned anywhere in the full text of the plan, together with their scopes of responsibility.

The technical scheme of the invention is as follows:

A method for automatically extracting the emergency response levels and administrative unit responsibilities of an emergency plan comprises the following steps:

S1: preprocess the emergency plan: split the text content of the plan by directory title and store it in a database according to directory title level;

S2: label the directory titles processed in step S1 with classification categories to form a labeled data set; train on the labeled data set, performing word segmentation, vectorization and classification;

S3: extract the key information: extract the names and scopes of responsibility of administrative units from the text content under all directory titles; according to the classification result obtained in step S2, extract the response levels and corresponding trigger conditions from the text classified as describing emergency response levels, early warning levels or event classification; the key information is extracted by combining entity recognition with entity type classification;

S4: de-duplicate and splice the extracted administrative unit names, output the responsibilities of each administrative unit, and standardize the extracted entities related to trigger conditions;

S5: according to the directory title level, obtain the administrative unit names and responsibilities under each level of directory title, obtain the response levels and corresponding trigger conditions, and output the analysis result.

Preferably, step S1 proceeds as follows: the content is split by the directory titles of the plan; for each section, the text content is stored together with its directory title and the parent node of that title; the parent node of a first-level directory title is set to 'root'; and the standardized emergency plan text is stored in a database for further processing.

Preferably, in step S2, classification uses a supervised binary classification model, so labeling the data set only requires marking whether the content under each directory title is 'emergency response' content: '1' if so, '0' otherwise;

the training process in step S2 is: first segment the directory title into words with jieba, then compute term weights with TF-IDF and vectorize the title, and finally classify it with a multinomial naive Bayes classifier.

Preferably, the step of entity identification and entity type classification described in step S3 is as follows:

S3.1: processing text data: in the training stage, entity recognition is performed on each directory title and all text under it; the entity types to be recognized are: number words, emergency response levels, condition trigger words, number-boundary keywords, quantifiers (quantity units) and administrative unit names;

S3.2: building the entity recognition and trigger word classification model: each directory title and all text under it are encoded character by character with one-hot encoding, and the encoded vectors are the input vectors of the model; the vectors are fed into a Bi-LSTM model, which encodes them into a final state vector for each input character, and these final state vectors are cached; the final state vectors are then passed to a CRF model for decoding, which yields the final sequence labeling result; if the result contains Trigger entities, the final state vectors of the characters in each Trigger entity are looked up and their arithmetic mean is taken as the vector of that Trigger entity, which is then fed into a Softmax classifier.

Preferably, during training the loss of the whole model is the sum of the entity recognition loss and the Trigger classification loss, and training yields the final entity recognition and trigger word classification model.

Preferably, de-duplication and splicing in step S4 proceed as follows: using the final state vector of each character output in step S3.2, the vectors of the characters in each entity recognized as ORG are averaged to give the vector of that entity; the vectors of all entities recognized as ORG in the text under the directory are collected and their pairwise cosine similarities are computed; for each entity, the other entity with the highest similarity is taken, and if the cosine similarity between the two is greater than 0.9 they are judged to describe the same administrative unit and are placed in one group; comparing similarities in this way divides the entities into groups, and an entity with no similarity greater than 0.9 forms a group of its own; the longest name in each group is selected as the name of the administrative unit, and the sentences containing any entity in the group are spliced in order and output as the responsibilities of that administrative unit.

Preferably, standardization is applied to the extracted trigger word entities, number word entities, quantifier entities and keyword entities, and each extracted trigger condition must contain a trigger word entity, a number word entity and a quantifier entity at the same time.

Preferably, when several trigger word entities appear in one sentence, the sentence is split again at punctuation marks so that each resulting clause contains only one set of trigger conditions.

Preferably, the quantifiers that may accompany each trigger word are restricted, and trigger conditions are screened by a secondary match between trigger word and quantifier.

Preferably, when trigger conditions are standardized and two number word entities are extracted from one set of trigger conditions, the two number word entities are taken as the number boundaries of the trigger condition.

The invention has the beneficial effects that:

Starting from the text of emergency management plans, the invention extracts and standardizes all plans through a series of text analysis steps such as text classification, entity recognition and entity standardization, so that each plan yields the trigger conditions corresponding to its response levels together with the names and responsibilities of the associated administrative units. When an accident occurs, it is only necessary to know key information such as casualties and economic loss and to query the standardized plan database in order to quickly match the emergency response level to be activated, the administrative units involved and their scopes of responsibility. This enables a fast response, improves the efficiency of accident handling, and reduces injuries and losses.

Drawings

FIG. 1 is a flow chart of the present invention

Detailed Description

The embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

As shown in fig. 1, an automatic extraction method for emergency response level and administrative unit responsibility of an emergency plan includes the following specific steps:

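S1: preprocess the emergency plan: split the text content of the plan by directory title and store it in a database according to directory title level;

S2: label the directory titles processed in step S1 with classification categories to form a labeled data set; train on the labeled data set, performing word segmentation, vectorization and classification;

S3: extract the names and scopes of responsibility of administrative units from the text content under all directory titles; according to the classification result obtained in step S2, extract the response levels and corresponding trigger conditions from the text classified as describing emergency response levels, early warning levels or event classification; the key information is extracted by combining entity recognition with entity type classification;

S4: de-duplicate and splice the extracted administrative unit names, output the responsibilities of each administrative unit, and standardize the extracted entities related to trigger conditions;

S5: according to the directory title level, obtain the administrative unit names and responsibilities under each level of directory title, obtain the response levels and corresponding trigger conditions, and output the analysis result.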

Corresponding to these steps, the method is divided into an emergency plan text preprocessing module, a directory title classification module, a plan entity extraction module, an administrative unit duty standardization module and a response trigger condition standardization module.

As an embodiment of the present invention, in the emergency plan text preprocessing module, the acquired emergency plan text is stored according to directory level. The content is split according to the directory of the plan; for each section, the text content is stored together with its directory title and the parent node of that directory. In particular, the parent node of a first-level directory is set to 'root'. The standardized emergency plan text is then stored in the database for further processing.
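The storage scheme above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes headings are numbered like "1", "1.1", "1.1.1" (the source does not specify a heading format), and it returns an in-memory list where the method would write to a database. The names `Section` and `split_plan` are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed heading format: "1 Title", "1.1 Title", ... (not specified in the source).
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")

@dataclass
class Section:
    number: str      # e.g. "2.1"
    title: str       # directory title
    parent: str      # number of the parent title, or "root" for first-level titles
    level: int       # 1 for first-level titles
    text: str = ""   # body text stored under this title

def split_plan(plan_text: str) -> list[Section]:
    """Split a plan into sections keyed by directory title."""
    sections: list[Section] = []
    for line in plan_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            number, title = match.groups()
            parts = number.split(".")
            parent = ".".join(parts[:-1]) or "root"
            sections.append(Section(number, title, parent, len(parts)))
        elif sections:
            # Body text belongs to the most recent title.
            current = sections[-1]
            current.text = (current.text + "\n" + line).strip()
    return sections

plan = """1 General
Purpose of the plan.
2 Emergency response
2.1 Response levels
More than 10 deaths: level I response.
"""
rows = split_plan(plan)
```

A body line that itself starts with a number would be mistaken for a heading by this simple pattern; a real splitter would need a stricter rule or the document's own outline metadata.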

In one embodiment of the present invention, the directory title classification module uses the classification result to decide whether a directory level describes emergency response level classification, and levels and trigger conditions are extracted only from text whose classification result is 'yes'. This speeds up analysis and, under the constraint of classification, improves the accuracy of the extraction result. Because directory titles usually contain few words and are short texts, a simple classifier achieves good results; the invention therefore uses the naive Bayes classification algorithm to classify directory titles.

The directory titles of all existing emergency plan texts in the database are manually labeled by category. Because the model is a supervised binary classification model, labeling the data set only requires marking whether the content under a directory is 'emergency response' content, which covers emergency response, early warning response, event level and the like: '1' if so, '0' otherwise.

The title text is first segmented into words with jieba; TF-IDF weights are then computed for the segmented words to vectorize the text; finally, the vectorized text is classified with a multinomial naive Bayes classifier.

TF-IDF (term frequency-inverse document frequency) is a statistical method for evaluating how important a word is to one document in a document set or corpus. It consists of two parts, TF and IDF. The main idea is: if a word appears with high frequency (TF) in one document and rarely in others, the word or phrase discriminates well between categories and is suitable for classification. Specifically:

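TF, term frequency, is the frequency with which a word occurs in a document.

Namely:

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of times the entry $t_i$ occurs in document $d_j$ and the denominator is the total number of entries in $d_j$.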

IDF, inverse document frequency: the IDF of a given entry is obtained by dividing the total number of documents by the number of documents that contain the entry and taking the logarithm of the quotient. The fewer the documents containing an entry $t_i$, the larger its IDF and the better the entry distinguishes categories. The IDF is calculated as shown in Formula III:

$$\mathrm{IDF}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \qquad \text{(Formula III)}$$

where $|D|$ is the total number of documents in the corpus and $|\{j : t_i \in d_j\}|$ is the number of documents containing the entry $t_i$. To avoid a zero denominator when an entry does not occur in the corpus, the denominator is usually smoothed to $1 + |\{j : t_i \in d_j\}|$.

Finally, the calculation method of TF-IDF is shown in formula IV:

$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF} \qquad \text{(Formula IV)}$$
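The TF, IDF and TF-IDF definitions above can be checked numerically with a short sketch. It assumes the titles have already been segmented (the jieba step is skipped), uses hypothetical English tokens in place of real directory titles, and applies the smoothed denominator described in the text.

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """TF-IDF per document, with the smoothed denominator 1 + document frequency."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already-segmented directory titles.
titles = [
    ["emergency", "response", "level"],
    ["organization", "and", "duties"],
    ["early", "warning", "response"],
    ["general", "provisions"],
]
weights = tf_idf(titles)
```

In this toy corpus "emergency" occurs in one title and "response" in two, so "emergency" receives the larger weight in the first title. Note that with the smoothed denominator a term occurring in every document gets a slightly negative weight; library implementations typically use a different smoothing variant to avoid this.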

A naive Bayes classifier is a classifier built on Bayes' theorem. In the training stage, the features and classes of the training samples are input; the frequency of each class in the training samples and the conditional probability of each feature given each class are computed, and these probabilities are stored after training. In the prediction stage, the input text is segmented and vectorized, the probability of the text belonging to each category is computed, and the category with the highest probability is selected as the classification result. The naive Bayes formula is shown in Formula V:

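$$P(y_k \mid x) \propto P(y_k) \prod_{i} P(x_i \mid y_k) \qquad \text{(Formula V)}$$

where $x = (x_1, \dots, x_n)$ is the set of entries $x_i$ occurring in the text, $y_k$ is the $k$-th category, and $P(y_k \mid x)$ is the probability that the text belongs to category $y_k$. The proportionality omits the normalizing factor $P(x)$, which is the same for every category and does not affect which category scores highest.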
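A minimal sketch of Formula V as a multinomial naive Bayes classifier over segmented titles follows. Several choices here are assumptions not stated in the source: the product is evaluated in log space, Laplace (add-one) smoothing is applied to the conditional probabilities, raw term counts are used instead of TF-IDF weights, and the training titles are hypothetical.

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Tiny multinomial naive Bayes over token lists (Laplace smoothing assumed)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        self.vocab = {term for doc in docs for term in doc}
        return self

    def log_scores(self, doc):
        n = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            total = sum(self.term_counts[label].values())
            score = math.log(count / n)  # log P(y_k)
            for term in doc:
                if term not in self.vocab:
                    continue  # ignore terms never seen in training
                p = (self.term_counts[label][term] + 1) / (total + len(self.vocab))
                score += math.log(p)  # log P(x_i | y_k)
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)

# Hypothetical labeled titles: 1 = "emergency response" content, 0 = other.
train_titles = [
    ["emergency", "response"], ["early", "warning", "level"],
    ["event", "classification"], ["general", "provisions"],
    ["organization", "duties"], ["supplementary", "provisions"],
]
train_labels = [1, 1, 1, 0, 0, 0]
model = MultinomialNB().fit(train_titles, train_labels)
```

Calling `model.predict(["response", "level"])` selects the category with the highest score, which mirrors the prediction stage described above.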

Given the structure of plan texts, classification proceeds level by level through the directory hierarchy. The first-level directories are classified first; when a first-level directory is classified as '1', that is, its content is 'emergency response' content, its second-level and lower directories are not classified. Otherwise the second-level directories are classified, and so on.

In an embodiment of the present invention, the plan entity extraction module extracts the names and scopes of responsibility of administrative units from the text content under all directories. Based on the classification result of the directory title classification module, it extracts the response levels and corresponding trigger conditions from the text classified as describing 'emergency response level, early warning level, event level' and the like. The key information is extracted mainly by combining Bi-LSTM-CRF entity recognition with entity type classification.
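The level-by-level traversal can be sketched as a recursive walk over the stored parent links. The tuple layout `(number, parent, title)` and the keyword rule `toy_classifier` are hypothetical stand-ins; in the method, the trained title classifier would take the place of `classify`.

```python
def find_response_sections(sections, classify):
    """Return the numbers of titles classified as 1, skipping descendants of a hit."""
    children = {}
    for number, parent, title in sections:
        children.setdefault(parent, []).append((number, title))

    hits = []

    def visit(parent):
        for number, title in children.get(parent, []):
            if classify(title) == 1:
                hits.append(number)   # lower levels are not classified
            else:
                visit(number)         # descend to the next directory level

    visit("root")
    return hits

# Hypothetical keyword rule standing in for the trained classifier.
def toy_classifier(title):
    keywords = ("response", "warning", "event level")
    return int(any(k in title.lower() for k in keywords))

sections = [
    ("1", "root", "General provisions"),
    ("1.1", "1", "Event level classification"),
    ("2", "root", "Emergency response"),
    ("2.1", "2", "Response levels"),
    ("3", "root", "Supplementary provisions"),
]
hits = find_response_sections(sections, toy_classifier)
```

Here section 2.1 is never classified because its parent, section 2, is already a hit, while section 1.1 is reached only because section 1 was classified as '0'.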

Bi-LSTM is a bidirectional long short-term memory network, formed by combining a forward LSTM and a backward LSTM. Both LSTM and Bi-LSTM are often used to model context information in natural language processing tasks.

CRF, the conditional random field, is a discriminative probabilistic graphical model. Given an observation sequence, a CRF models the conditional probability of a label sequence. In this task the observation sequence is the character sequence and the label sequence is the corresponding tag sequence, which has a linear chain structure.

The advantage of Bi-LSTM is that its bidirectional structure learns dependencies within the observation sequence (the input characters), and during training the LSTM automatically extracts features of the observation sequence for the target task (such as entity recognition). Its disadvantage is that it cannot learn the relationships within the label sequence (the output tags). In named entity recognition the tags constrain one another: for example, a B-type tag (marking the beginning of an entity) is never immediately followed by another B-type tag. So although an LSTM removes the need for complicated feature engineering in sequence labeling tasks such as NER, it cannot learn the tagging context.

In contrast, CRF has the advantage of being able to model and learn the characteristics of marker sequences, but has the disadvantage of requiring manual extraction of observed sequence features. It is therefore common to add a CRF layer after the LSTM to obtain the benefits of both.

The main steps of entity identification and entity type classification are as follows:

1. Processing text data: in the training stage, entity recognition is performed on each directory title and all text under it. Six types of entity are recognized: number words (M), emergency response levels (Response), condition trigger words (Trigger), number-boundary keywords (Keyword), quantifiers (Q) and administrative unit names (ORG). The data is labeled as follows: each sentence is split into characters and each character is given a tag according to the BMESO scheme. Under BMESO, every non-entity character is tagged 'O'. For an entity of a given type, a single-character entity is tagged S_<entity type>; otherwise its first character is tagged B_<entity type>, its middle characters M_<entity type> and its last character E_<entity type>. For condition trigger words (Trigger), the trigger conditions that the invention currently supports mainly include death, injury (including serious injury, minor injury, poisoning, disability, adverse reaction, local organ disability, acute severe radiation sickness and the like), missing persons, economic loss (property loss), earthquake magnitude, duration, air quality index and emergency transfer. All of them are tagged uniformly as Trigger in the entity recognition stage, and the specific type of each trigger word is classified after the entities have been extracted.

For example, take the key sentence: "In a sudden public incident with more than 10 casualties, of which more than 3 are deaths or critical cases, the incident is particularly serious and a first-level emergency response is started." The original-language sentence is split into characters (including all punctuation and other symbols). After labeling, the tag corresponding to each character is: "O, O, O, O, O, O, O, B_Trigger, E_Trigger, B_M, E_M, B_Keyword, E_Keyword, O, O, O, O, B_Trigger, E_Trigger, O, O, B_Keyword, E_Keyword, S_M, O, O, O, O, O, O, O, O, O, B_Response, E_Response, O, O, O, O, O, O, O, O".
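The BMESO labeling rule can be sketched as a small function that turns entity spans into per-character tags. The function name `bmeso_tags`, the span format `(start, end, type)` with an exclusive end, and the short sample phrase are illustrative assumptions.

```python
def bmeso_tags(text, entities):
    """Tag each character of `text`; `entities` holds (start, end, type), end exclusive."""
    tags = ["O"] * len(text)          # non-entity characters stay "O"
    for start, end, etype in entities:
        if end - start == 1:
            tags[start] = f"S_{etype}"            # single-character entity
        else:
            tags[start] = f"B_{etype}"            # first character
            for i in range(start + 1, end - 1):
                tags[i] = f"M_{etype}"            # middle characters
            tags[end - 1] = f"E_{etype}"          # last character
    return tags

# Hypothetical phrase meaning "more than 3 deaths", annotated by hand.
sentence = "死亡3人以上"
spans = [(0, 2, "Trigger"), (2, 3, "M"), (3, 4, "Q"), (4, 6, "Keyword")]
tags = bmeso_tags(sentence, spans)
```

Because tagging is per character, the number of tags always equals the number of characters in the sentence, which is why the tag sequence in the example above follows the original text rather than its translation.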

2. Building the entity recognition and trigger word classification model: each directory title and all text under it are encoded character by character with one-hot encoding, and the encoded vectors are the input vectors of the model. The vectors are fed into a Bi-LSTM model, which encodes them into a final state vector for each input character; these final state vectors are cached. The final state vectors are then passed to a CRF model for decoding, which yields the final sequence labeling result. If the result contains Trigger entities, the final state vectors of the characters in each Trigger entity are looked up and their arithmetic mean is taken as the vector of that Trigger entity, which is then fed into a Softmax classifier.

Softmax is a widely used function, especially in multi-class classification. It maps its inputs to real numbers between 0 and 1 and normalizes them so that they sum to 1, so the probabilities over all classes also sum to exactly 1. The Softmax function is defined in Formula VI:

$$S_i = \frac{e^{V_i}}{\sum_{j=1}^{C} e^{V_j}} \qquad \text{(Formula VI)}$$

where $V_i$ is the classifier output for category $i$, $i$ is the category index, and $C$ is the total number of categories; $S_i$ is the ratio of the exponential of the current element to the sum of the exponentials of all elements. Softmax thus converts the multi-class output values into relative probabilities, and in practice the category with the highest probability is selected as the classification result.
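A short numerical sketch of the Softmax definition follows. Subtracting the maximum before exponentiating is a common stability measure added here as an assumption; it does not change the result. The scores are hypothetical classifier outputs.

```python
import math

def softmax(values):
    """Softmax over a list of scores; the max is subtracted to avoid overflow."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]   # hypothetical classifier outputs V_i
probs = softmax(scores)
# Index of the category with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
```

The outputs lie between 0 and 1 and sum to 1, and the ordering of the scores is preserved, so `best` points at the category with the largest input score.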

During training, the loss of the whole model is the sum of the entity recognition loss and the Trigger classification loss; training yields the final entity recognition and trigger word classification model.

In the prediction stage, the text standardized by the emergency plan text preprocessing module is analyzed directory by directory according to the directory hierarchy, and the text is split into sentences at periods. The directory title classification module analyzes the document directory: when the text under a directory is 'emergency response' content, entity recognition is performed on that text and the entities recognized as Trigger are classified; otherwise only entity recognition is performed on the text, in order to extract the administrative units in the document.

In one embodiment of the present invention, the administrative unit duty standardization module processes the entities of category ORG extracted in the entity recognition result of the plan entity extraction module.

First, the administrative unit entity names are de-duplicated. Because the text may refer to an administrative unit by a colloquial name, an abbreviation, an alias and so on, the extracted administrative unit names must be de-duplicated. The specific method is as follows:

When the plan entity extraction module performs entity recognition, the Bi-LSTM model outputs a final state vector for each character. The vectors of the characters in each entity recognized as ORG are averaged to give the vector of that entity. In this way the vectors of all entities recognized as ORG in the text under a directory are obtained, and their pairwise cosine similarities are computed. For each entity, the other entity with the highest similarity is taken; if the cosine similarity between the two is greater than 0.9, they are judged to describe the same administrative unit and are placed in one group. Comparing similarities in this way divides the entities into groups; an entity with no similarity greater than 0.9 forms a group of its own.

Second, after de-duplication, the longest name in each group is selected as the name of the administrative unit, and the sentences containing any entity in the group are spliced in order and output as the responsibilities of that administrative unit.
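The grouping and splicing steps can be sketched as follows. The entity vectors are hypothetical three-dimensional stand-ins for the averaged Bi-LSTM state vectors, the grouping uses a simple union-find (one possible reading of "dividing the entities into groups"), and the helper names are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_units(vectors, threshold=0.9):
    """Link each name to its most similar other name when similarity > threshold."""
    names = list(vectors)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            name = parent[name]
        return name

    for name in names:
        others = [o for o in names if o != name]
        if not others:
            continue
        best = max(others, key=lambda o: cosine(vectors[name], vectors[o]))
        if cosine(vectors[name], vectors[best]) > threshold:
            parent[find(name)] = find(best)   # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())

def unit_duties(vectors, sentences):
    duties = {}
    for group in group_units(vectors):
        canonical = max(group, key=len)       # longest name represents the unit
        related = [s for s in sentences if any(name in s for name in group)]
        duties[canonical] = " ".join(related)  # spliced in original order
    return duties

# Hypothetical entity vectors and sentences.
vectors = {
    "Municipal Fire Brigade": [0.9, 0.1, 0.0],
    "Fire Brigade": [0.88, 0.12, 0.01],
    "Health Bureau": [0.0, 0.2, 0.95],
}
sentences = [
    "The Municipal Fire Brigade leads on-site rescue.",
    "The Health Bureau organizes medical treatment.",
    "The Fire Brigade also supports evacuation.",
]
duties = unit_duties(vectors, sentences)
```

In this toy data the two fire brigade names fall into one group and are reported under the longer name, while the health bureau, which has no similarity above 0.9, forms its own group.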

In one embodiment of the present invention, the response trigger condition standardization module standardizes the extracted entities related to trigger conditions, based on the entity recognition and trigger word classification results of the plan entity extraction module. Standardizing the response trigger conditions mainly means standardizing the extracted trigger word entities, number word entities, quantifier entities and keyword entities.

First, for trigger word entities: given the structure of plan texts, the invention requires every extracted trigger condition to contain a trigger word entity, a number word entity and a quantifier entity at the same time. When several trigger word entities appear in one sentence, the sentence is split again at punctuation marks, subject to this rule, so that each resulting clause contains only one set of trigger conditions.

Second, for quantifiers: to ensure the accuracy of the trigger conditions, the quantifiers that may accompany each classified trigger word are restricted. For example, if the trigger word is classified as 'death', the corresponding quantifier can only be one of 'people, persons, cases' and the like. This secondary match between trigger word and quantifier screens out more accurate trigger conditions.
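The two screening rules can be sketched as follows. The quantifier whitelist is hypothetical apart from the 'death' example given in the text (rendered here with English stand-in tokens), and the punctuation set used for re-splitting is an assumption.

```python
import re

REQUIRED = {"Trigger", "M", "Q"}   # trigger word, number word, quantifier

# Hypothetical whitelist; only the "death" row reflects the example in the text.
ALLOWED_QUANTIFIERS = {
    "death": {"people", "persons", "cases"},
    "economic loss": {"yuan"},     # assumed unit, for illustration only
}

def split_clauses(sentence):
    """Break a sentence again at commas and semicolons (ASCII and full-width)."""
    return [c.strip() for c in re.split(r"[,;，；]", sentence) if c.strip()]

def screen(clause_entities, trigger_type):
    """Keep a clause only if it has all required entities and a matching quantifier.

    clause_entities: list of (text, entity_type) pairs found in one clause.
    """
    types = {etype for _, etype in clause_entities}
    if not REQUIRED <= types:
        return False
    allowed = ALLOWED_QUANTIFIERS.get(trigger_type, set())
    quantifiers = [text for text, etype in clause_entities if etype == "Q"]
    return all(q in allowed for q in quantifiers)

clauses = split_clauses("more than 10 casualties, of which more than 3 deaths")
ok = screen([("deaths", "Trigger"), ("3", "M"), ("people", "Q")], "death")
```

A clause that lacks a number word, or whose quantifier does not fit the trigger type (for example a currency unit with 'death'), is rejected by `screen`.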

For number words and keywords: an emergency response trigger condition is usually a range limit on the number of people, earthquake magnitude, economic loss and so on. Therefore, when trigger conditions are standardized and two number word entities are extracted from one set of trigger conditions, the two are taken as the number boundaries of the trigger condition. For example, if the text is "0 to 3 persons injured, start the fourth-level emergency response", the entity recognition stage extracts trigger word: injury; number words: 0, 3; quantifier: people; and the standardized trigger condition is: number of injured: 0-3 people. When only one number word entity is extracted from a set of trigger conditions but a keyword entity is also present, the number boundary is determined from the keyword. For example, if the text is "fewer than 3 persons injured, start the fourth-level emergency response", the entity recognition stage extracts trigger word: injury; number word: 3; quantifier: people; keyword: below; and the standardized trigger condition is: number of injured: 0-3 people. In particular, when the text gives no upper limit for the number boundary, the value '99999' is used uniformly as the upper limit.
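The boundary rules can be sketched as a small normalization function. The keyword sets are assumptions (the text only shows "below" as an example), the function name `normalize` and its tuple output are illustrative, and whether a boundary is inclusive or exclusive is not specified in the source, so the sketch simply records the numbers.

```python
NO_UPPER_LIMIT = 99999  # value the text uses when no upper bound is given

# Assumed keyword sets; the source only shows "below" as an example.
UPPER_KEYWORDS = {"below", "less than", "under"}
LOWER_KEYWORDS = {"above", "more than", "over"}

def normalize(trigger, numbers, quantifier, keyword=None):
    """Return (trigger, lower, upper, quantifier) for one set of trigger conditions."""
    if len(numbers) == 2:
        # Two number words: they are the two boundaries.
        lower, upper = sorted(numbers)
    elif len(numbers) == 1 and keyword in UPPER_KEYWORDS:
        lower, upper = 0, numbers[0]
    elif len(numbers) == 1 and keyword in LOWER_KEYWORDS:
        lower, upper = numbers[0], NO_UPPER_LIMIT
    else:
        raise ValueError("cannot determine the number boundary")
    return trigger, lower, upper, quantifier

two_numbers = normalize("injury", [0, 3], "people")
with_keyword = normalize("injury", [3], "people", keyword="below")
open_ended = normalize("death", [10], "people", keyword="more than")
```

With conditions stored in this form, matching an incident against the plan database reduces to checking whether a reported number falls between `lower` and `upper` for the relevant trigger type.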

Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions and not to limit them, and the protection scope of the present invention is not limited to them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or substitute equivalents for some of their technical features. Such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and are intended to be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. An automatic extraction method for emergency response grade of an emergency plan and responsibility of an administrative unit is characterized by comprising the following steps:

2. The method for automatically extracting the emergency response level and the administrative responsibility of the emergency plan according to claim 1, wherein the specific process of the step S1 is as follows: splitting the content according to the directory title of the plan, storing the text content in each section of text, simultaneously storing the directory title and the father node of the directory title, defining the father node of the first-level directory title as 'root', and storing the standardized emergency plan text into a database for further processing.

3. The method for automatically extracting emergency response levels and administrative unit responsibilities of emergency plans according to claim 1, wherein in step S2, classification uses a supervised binary classification model, and labeling the data set requires marking whether the content under each directory title is 'emergency response' content: '1' if so, '0' otherwise;

4. The method for automatically extracting emergency response level and administrative responsibility of emergency plans according to claim 1, wherein the step of entity identification and entity type classification in step S3 is as follows:

5. The method for automatically extracting emergency response levels and administrative unit responsibilities of emergency plans according to claim 4, wherein during training the loss of the whole model is the sum of the entity recognition loss and the Trigger classification loss, and training yields the final entity recognition and trigger word classification model.

6. The method for automatically extracting emergency response levels and administrative unit responsibilities of emergency plans according to claim 4, wherein de-duplication and splicing in step S4 proceed as follows: using the final state vector of each character output in step S3.2, the vectors of the characters in each entity recognized as ORG are averaged to give the vector of that entity; the vectors of all entities recognized as ORG in the text under the directory are collected and their pairwise cosine similarities are computed; for each entity, the other entity with the highest similarity is taken, and if the cosine similarity between the two is greater than 0.9 they are judged to describe the same administrative unit and are placed in one group; comparing similarities in this way divides the entities into groups, and an entity with no similarity greater than 0.9 forms a group of its own; the longest name in each group is selected as the name of the administrative unit, and the sentences containing any entity in the group are spliced in order and output as the responsibilities of that administrative unit.

7. The method of claim 6, wherein standardization is applied to the extracted trigger word entities, number word entities, quantifier entities and keyword entities, and each extracted trigger condition must contain a trigger word entity, a number word entity and a quantifier entity at the same time.

8. The method of claim 7, wherein when several trigger word entities appear in one sentence, the sentence is split again at punctuation marks so that each resulting clause contains only one set of trigger conditions.

9. The method for automatically extracting emergency response levels and administrative unit responsibilities of emergency plans according to claim 7, wherein the quantifiers that may accompany each trigger word are restricted, and trigger conditions are screened by a secondary match between trigger word and quantifier.

10. The method of claim 7, wherein when trigger conditions are standardized and two number word entities are extracted from one set of trigger conditions, the two number word entities are taken as the number boundaries of the trigger condition.