CN111966879B

CN111966879B - Epidemic situation news information extraction method and system

Info

Publication number: CN111966879B
Application number: CN202010824197.6A
Authority: CN
Inventors: 陈佳珊; 黄景浩; 杨坦
Original assignee: South China Normal University
Current assignee: South China Normal University
Priority date: 2020-08-17
Filing date: 2020-08-17
Publication date: 2023-08-08
Anticipated expiration: 2040-08-17
Also published as: CN111966879A

Abstract

The invention provides an epidemic situation news information extraction method which, for the specific scenario of epidemic news webpages, extracts relevant information from the news text, converts it into structured data, and then stores and visually displays the data. The method comprises the following steps: a data crawling step; a data processing step; a path information extraction step; a residence/habitual-residence information extraction step; a traffic boarding information extraction step; and an information output and display step. A webpage is loaded through a crawler tool to acquire the news text; a sentence splicing and text segmentation algorithm is constructed; tools such as named entity recognition and a map API are used in combination with the characteristics of epidemic texts to construct three extraction modules for path information, residence/habitual-residence information and traffic boarding information; and the system is finally deployed as a user-friendly webpage so that users can extract information on their own.

Description

Epidemic situation news information extraction method and system

Technical Field

The invention relates to an internet information collection technology, in particular to an epidemic situation news information extraction method and system.

Background

Internet news webpages are an important source of information, but when faced with massive amounts of webpage content, people often have difficulty quickly identifying and obtaining what they need. In addition, the body text of a news webpage is typically surrounded by unnecessary noise such as advertisement links and scripts, which distracts readers and interferes with obtaining the news text itself. Effective data cleaning means are therefore required to filter this noise out of news webpages and obtain the relevant text information.

Disclosure of Invention

In order to meet the extraction requirement of news text information, the invention provides an epidemic situation news information extraction method and system, which aims at a specific scene of an epidemic situation news webpage to extract relevant information in a news text, convert the relevant information into structured data, and then store and visually display the data; the specific technical content is as follows:

an epidemic situation news information extraction method comprises the following steps:

step 01, data crawling;

simulating and loading a plurality of websites pointing to the news webpage based on a crawler tool so as to acquire the content in the news webpage;

step 02, a data processing step;

judging whether adjacent sentences in the obtained webpage content are continued according to a preset rule, executing a continuing operation on the adjacent sentences to be continued, and traversing all sentences in the webpage content to obtain news texts; dividing the obtained news text into a plurality of long sentence sets, and dividing each long sentence into a plurality of short sentence sets;

step 03, extracting path information;

extracting a plurality of path elements from the sentence set of the news text after the segmentation processing to form path information, wherein the path elements comprise address information, behavior event information, province/city/county information and time information;

step 04, residence/habitual-residence information extraction;

extracting a plurality of residence elements from the sentence sets of the news text after the segmentation processing to form residence/habitual-residence information, wherein the residence elements comprise the residence or habitual-residence information and the province, city and county administrative district information to which the residence/habitual residence belongs;

step 05, extracting traffic boarding information;

extracting a plurality of traffic elements from the sentence sets of the news text after the segmentation processing to form traffic boarding information, wherein the traffic elements comprise traffic tool information, starting point information and end point information;

step 06, information output and display step;

and displaying one or more of the path information, the residence/habitual-residence information and the traffic boarding information in the webpage through dynamic webpage rendering.

In one or more embodiments of the present invention, the operation of step 01 includes:

step 011, disguising the crawler program by adding an appropriate request header, so as to prevent the news website from identifying the crawler program and blocking its IP address; loading a news webpage with the crawler program and waiting for all elements of the webpage to finish loading;

step 012, analyzing the obtained page by using the lxml library;

and step 013, extracting the content of the corresponding html elements by using XPath expressions, wherein the extracted content comprises the webpage body text, and the body text is formed by splicing the content of all <p> tags in the webpage.

In one or more embodiments of the present invention, the operation of step 02 includes:

judging, based on punctuation marks, whether adjacent sentences should be continued, which comprises the following cases:

in case 1, if the character at the end of the sentence is any of the following punctuation marks: a comma, a colon, the left half of a quotation mark, the left half of a bracket, or the left half of a title mark, or is either of the following characters: 'in', 'out', the sentence is continued with the adjacent following sentence;

in case 2, if the character at the head of the sentence is any of the following: a comma, a colon, a semicolon, the right half of a quotation mark, an exclamation mark, a period, a percent sign, a bracket, a title mark, an enumeration comma, a question mark, or "heat", the sentence is continued with the adjacent preceding sentence;

in case 3, if punctuation mark pairs exist in the sentence and the number of left halves is greater than the number of right halves, the sentence is continued with the adjacent following sentence;

in case 4, if punctuation mark pairs exist in the sentence and the number of left halves is less than the number of right halves, the sentence is continued with the adjacent preceding sentence;

in case 5, if a punctuation mark pair exists in the sentence and the two halves of the pair appear in reversed order, the sentence is continued with both the adjacent preceding sentence and the adjacent following sentence;

in case 6, if the character at the end of the preceding sentence is "it" and the character at the head of the following sentence is "middle", the preceding sentence and the following sentence are continued;

wherein the punctuation mark pairs mentioned in cases 3 to 6 include brackets, quotation marks and title marks; the news text is obtained after the above operations;

dividing the obtained news text according to the long sentence segmenters to obtain a plurality of long sentences;

dividing each long sentence according to the short sentence divider to obtain a plurality of short sentences corresponding to the long sentences;

the long sentence segmenters include any of the following punctuation marks: semicolon, period, question mark, exclamation mark, ellipsis; the short sentence segmenter is the comma; and punctuation marks contained within brackets or title marks are not considered during the segmentation process.

In one or more embodiments of the present invention, the operation of step 03 includes:

acquiring path information effective sentences from a sentence set after the news text segmentation, and extracting address information and/or behavior event information from the path information effective sentences;

The path information valid sentence includes any one of the following forms:

1) Dashes or hyphens exist in the sentence, and the number of the dashes or hyphens is more than 2;

2) The sentence contains a preset address trigger word;

if the input sentence is identified as form 1 above, the sentence is divided using the dash or hyphen as the segmenter to obtain a plurality of clauses, and address information and/or behavior event information are then identified and obtained from the clauses;

if the input sentence is identified as form 2 above, the trigger word and the text content following it are extracted to form a route sentence, and address information and/or behavior event information are identified and obtained from the route sentence.
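By way of non-limiting illustration, the two forms of path information valid sentence described above might be distinguished as in the following Python sketch. The trigger words in `ADDRESS_TRIGGERS` are hypothetical samples (the preset trigger word list is not reproduced here), the function name is illustrative, and a threshold of at least two dashes is assumed, following the detailed description.

```python
import re

# A run of ASCII hyphens, en dashes, em dashes or full-width hyphens counts as one dash.
DASH_RUN = re.compile(r"[-\u2013\u2014\uFF0D]+")

# Hypothetical sample trigger words ("arrive at", "go to", "return to").
ADDRESS_TRIGGERS = ("到达", "前往", "返回")


def classify_short_sentence(sentence):
    """Return (form, pieces).

    form is "dash" (pieces = clauses split at the dashes),
    "trigger" (pieces = [route sentence]) or "none" (pieces = []).
    """
    if len(DASH_RUN.findall(sentence)) >= 2:
        clauses = [clause for clause in DASH_RUN.split(sentence) if clause]
        return "dash", clauses
    hits = [sentence.find(trigger) for trigger in ADDRESS_TRIGGERS if trigger in sentence]
    if hits:
        # Only the first trigger word is used when several are present.
        return "trigger", [sentence[min(hits):]]
    return "none", []
```

The clauses or the route sentence returned here would then be passed to the address and behavior event identification; that identification is not part of this sketch.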

a 'city-province' lookup table and a 'county-city' lookup table are preset, and are used respectively to query the provincial administrative district corresponding to an extracted city name and the municipal administrative district corresponding to an extracted county name; the 'county-city' lookup table contains only county-level names that are not shared by more than one county-level district, and county-level names that are not in the lookup table are queried by calling the Baidu API;

and according to the address information, the province/city/county information of the provincial, municipal and county administrative districts in which the address is located is acquired by combining the 'city-province' lookup table, the 'county-city' lookup table and the Baidu API.

the time information in the news text is divided into 3 cases:

case 1, time and place are in the same valid sentence;

in case 2, the time and the place are not in the same valid sentence, that is, the valid sentence contains a place but no time;

in case 3, the time information is incomplete, that is, one or more of the month, day, hour and minute are missing;

the time information of a valid sentence is inferred by taking the long sentence in which the valid sentence is located as the unit of judgment:

if case 1 is determined, the extracted time is matched with the place;

if case 2 is determined: when the valid sentence is the first sentence of the long sentence, the time is set to a null value; otherwise, the preceding sentence is checked for time information, and if it exists, it is used as the time of the valid sentence, otherwise the time is set to a null value;

if case 3 is determined, the nearest preceding short sentence that contains time information is found, and its time information is used to complete the missing items.
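By way of non-limiting illustration, the three time cases might be handled as in the following Python sketch. The data layout (a list of dicts with a "time" entry per short sentence) and the function name are assumptions made for the example. A further assumption is that only the fields coarser than the first field present (for example the month when only the day is given) are copied from the earlier sentence.

```python
FIELDS = ("month", "day", "hour", "minute")


def infer_time(short_sentences, index):
    """Infer the time of short_sentences[index] within one long sentence.

    Each short sentence is a dict whose "time" entry is either None or a
    dict mapping the names in FIELDS to an int or None.
    """
    own = short_sentences[index]["time"]
    if own is None:
        # Case 2: no time in the sentence; look only at the preceding sentence.
        return short_sentences[index - 1]["time"] if index > 0 else None
    first_present = next(
        (i for i, field in enumerate(FIELDS) if own.get(field) is not None),
        len(FIELDS),
    )
    if first_present == 0:
        # Case 1: the time found in the sentence is used as it is.
        return own
    # Case 3: complete the missing leading fields from the nearest earlier time.
    for previous in reversed(short_sentences[:index]):
        donor = previous["time"]
        if donor is not None:
            completed = dict(own)
            for field in FIELDS[:first_present]:
                completed[field] = donor.get(field)
            return completed
    return own
```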

In one or more embodiments of the present invention, the operation of step 04 includes:

presetting a residence/habitual-residence pattern table according to the common forms in which residence/habitual-residence information is expressed;

matching sentences against the patterns in the residence/habitual-residence pattern table in turn, and, if the matching is successful, further excluding sentences having any of the following features:

1) The expressions "hospital" and "isolation", or "hospital" and "treatment", exist in the sentence at the same time;

2) The matched content is not place information and does not contain a residence word, wherein the residence words comprise 'residential community', 'apartment' and 'hotel';

a matching result having any of the above features is invalid and is excluded from being the residence/habitual residence of the patient; the remaining matching results are combined with the 'city-province' lookup table, the 'county-city' lookup table and the Baidu API to obtain the province/city/county information of the residence/habitual residence.
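By way of non-limiting illustration, the two exclusion features might be checked as in the following Python sketch. The Chinese keywords are assumed renderings of "hospital", "isolation", "treatment" and of the three residence words, and the `is_place` flag stands in for the result of place recognition, which is not implemented here.

```python
# Assumed Chinese renderings of the keywords named in the text.
HOSPITAL, ISOLATION, TREATMENT = "医院", "隔离", "治疗"
RESIDENCE_WORDS = ("小区", "公寓", "酒店")


def is_excluded(matched_sentence, is_place):
    """Return True if a pattern match should not be kept as a residence."""
    # Feature 1: "hospital" together with "isolation" or "treatment".
    if HOSPITAL in matched_sentence and (
        ISOLATION in matched_sentence or TREATMENT in matched_sentence
    ):
        return True
    # Feature 2: not a place and no residence word present.
    if not is_place and not any(word in matched_sentence for word in RESIDENCE_WORDS):
        return True
    return False
```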

In one or more embodiments of the present invention, the operation of step 05 includes:

a traffic boarding information comparison table is preset according to a common traffic boarding form;

sequentially matching sentences according to the conditions in the traffic riding information comparison table, and executing if the matching is successful:

S1: splitting the matching result into short sentences, and removing interfering words from the sentences; the interfering words comprise passengers, riders, riding passengers and riding days;

S2: judging whether 'to' exists in the sentence; if not, jumping to S3, otherwise to S5;

S3: if the number of occurrences of the word "ride" (乘) in the sentence is greater than or equal to 2, jumping to S4, otherwise to S5;

S4: if the word "ride" is contained in place information, taking the place information in which it is located as remark information; otherwise, splitting the sentence using the word "ride" as the separator, and inputting the resulting clauses into S5 in turn;

S5: extracting starting point information, end point information and vehicle information from the sentence, wherein the sentence takes any one of the following forms:

1) The symbol "→" exists in the sentence; the sentence is divided using the symbol "→" as the segmenter to obtain a plurality of clauses, and the starting point information and end point information are then identified and obtained from the clauses;

2) The sentence contains the preset traffic trigger word "ride"; the vehicle information following the traffic trigger word is extracted;

S6: presetting a vehicle table, and if the starting point information is null and the vehicle information does not contain any vehicle in the vehicle table, recording the extracted end point information as a remark on the identified behavior event information.
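By way of non-limiting illustration, steps S1, S3 and S4 might be sketched in Python as follows. The traffic trigger word is assumed here to be the Chinese character 乘, the single interfering word is a hypothetical sample, and the check of S2 and the place-name exception of S4 are deliberately left out.

```python
RIDE = "乘"                      # assumed traffic trigger word
INTERFERING_WORDS = ("乘客",)    # hypothetical sample ("passenger")


def split_ride_clauses(short_sentence):
    """Split a short sentence into one clause per ride (S1, S3 and S4 only)."""
    for word in INTERFERING_WORDS:              # S1: remove interfering words
        short_sentence = short_sentence.replace(word, "")
    if short_sentence.count(RIDE) < 2:          # S3: fewer than two trigger words
        return [short_sentence]
    head, *rest = short_sentence.split(RIDE)    # S4: split at the trigger word
    clauses = [RIDE + part for part in rest if part]
    if not clauses:
        return [short_sentence]
    clauses[0] = head + clauses[0]              # keep leading text with the first ride
    return clauses
```

Each returned clause would then be passed to the extraction of S5.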

In one or more embodiments of the present invention, the operation of step 06 includes:

1) The data is returned after the relevant information is extracted, the webpage is subjected to dynamic rendering, and the data is displayed in the webpage; simultaneously providing a downloading service of a form of the extracted data for downloading and storing the extracted information;

2) Generating a path information knowledge graph according to the path information, wherein nodes of the path information knowledge graph are province, city and county administrative areas in the path information, and a user selects any node for unfolding and viewing.
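By way of non-limiting illustration, the node hierarchy of such a path information knowledge graph might be assembled as in the following Python sketch. The record keys ("province", "city", "county", "address") and the nested-dict representation are assumptions made for the example; the rendering of the graph in the webpage is not shown.

```python
def build_region_tree(path_records):
    """Group addresses under province -> city -> county nodes."""
    tree = {}
    for record in path_records:
        cities = tree.setdefault(record["province"], {})
        counties = cities.setdefault(record["city"], {})
        counties.setdefault(record["county"], []).append(record["address"])
    return tree
```

Expanding a node then amounts to listing the keys, or the addresses, stored under it.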

An epidemic news information extraction system, comprising:

the data crawling module is used for carrying out simulated loading on a plurality of websites pointing to the news webpage so as to acquire the content in the news webpage;

the data processing module is used for judging whether adjacent sentences in the obtained webpage content are continued according to a preset rule, executing a continuing operation on the adjacent sentences needing to be continued to obtain news texts, and dividing the obtained news texts;

the path information extraction module is used for extracting a plurality of path elements from the sentence sets of the news text after the segmentation processing to form path information;

the residence/habitual-residence information extraction module is used for extracting a plurality of residence elements from the sentence sets of the news text after the segmentation processing to form residence/habitual-residence information;

the traffic boarding information extraction module is used for extracting a plurality of traffic elements from the sentence sets of the news texts after the segmentation processing to form traffic boarding information;

and the information output display module is used for displaying one or more of the path information, the residence/habitual-residence information and the traffic boarding information in the webpage through dynamic webpage rendering.

The beneficial effects of the invention are as follows: the characteristics of epidemic news webpages are first analyzed; the news text is acquired by loading webpages with Selenium; sentence splicing and text segmentation algorithms are constructed; tools such as named entity recognition and a map API are used in combination with the characteristics of epidemic texts to construct three extraction modules for path information, residence/habitual-residence information and traffic boarding information; and the system is finally deployed as a user-friendly webpage so that users can extract information on their own. The advantages are:

1. For the specific scenario of epidemic news webpages, the text characteristics of epidemic news are analyzed and related rules are designed to extract three kinds of information from epidemic news: patient path information, residence/habitual-residence information, and traffic boarding information. Compared with the original text of the webpage news, this information can be read and queried by the public more quickly, which improves public awareness of when and where cases occurred and supports epidemic protection; for the relevant government departments, the invention saves a large amount of the manpower needed to extract epidemic information, facilitates epidemic statistics, and provides an important basis for infection source analysis.

2. The characteristics of news texts are fully considered, and different rules and matching modes are designed independently for each information module, which improves extraction efficiency and accuracy; existing NLP toolkits are fully utilized to improve the accuracy of address recognition; and the Baidu Maps platform is called to improve the accuracy of address matching.

3. A convenient operation interface is provided, so that users can extract information from news webpages on their own and visualize the location information.

Drawings

Fig. 1 is a flowchart of acquiring province-city-county.

Fig. 2 is a flowchart of acquiring province-city-county (continuation of Fig. 1).

Fig. 3 is a flowchart for extracting residence/habitual-residence information and its administrative districts.

Fig. 4 is a route information extraction flow chart.

Fig. 5 is a valid sentence, address information, behavior event extraction flow.

Fig. 6 is a flow chart of a design of a rule system.

Fig. 7 is a province-city-county information completion flow chart.

Fig. 8 is a time information extraction flowchart.

Fig. 9 is a flowchart for extracting residence/habitual-residence information and administrative districts.

Fig. 10 is a flow chart of traffic boarding information extraction.

FIG. 11 is a user interaction interface screenshot (Web site upload).

Fig. 12 is a user interaction interface screenshot (route information).

Detailed Description

The present application is further described with reference to fig. 1 to 12, as follows:

The invention analyzes epidemic notification information by applying natural language processing methods and constructs an epidemic situation news information extraction system. By means of existing NLP toolkits and the Baidu Maps open platform, and in combination with the text characteristics of epidemic news, related rules are designed and three kinds of information are extracted from epidemic news: patient path information, residence/habitual-residence information, and traffic boarding information. Finally, the system is presented in the form of a website through which users can access it and use the relevant functions.

The specific flow is as follows:

1. data crawling

Because the webpages carrying epidemic notifications use dynamic page technology, the text content cannot be obtained by requesting the webpage directly. Selenium is therefore used to simulate loading of the webpage and obtain all of its content. Selenium runs directly in the browser and simulates the user's operations. The specific crawling method comprises the following steps:

(1) Adding an appropriate request header to avoid IP blocking caused by the website recognizing the crawler program; loading the webpage with Selenium and waiting for all elements of the webpage to finish loading; and retrying a certain number of times if the request fails.

(2) Parsing the obtained page with etree from the lxml library.

(3) Extracting the content of the corresponding html elements by using XPath expressions, the extracted content comprising: webpage title, publishing organization, publishing time and webpage body text. The body text is formed by splicing the contents of all <p> tags.
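By way of non-limiting illustration, the collection of the contents of all <p> tags in step (3) might look like the following Python sketch. So that the example runs without third-party packages, the standard-library `html.parser` module is used here in place of lxml and XPath, and the page source is assumed to have been obtained already (for example from Selenium). The class and function names are illustrative.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside every <p> element of a page."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # greater than 0 while inside a <p> element
        self.paragraphs = []   # one string per <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.paragraphs[-1] += data


def extract_body_text(page_source: str) -> list[str]:
    """Return the non-empty <p> contents of the page, in document order."""
    parser = ParagraphCollector()
    parser.feed(page_source)
    parser.close()
    return [p.strip() for p in parser.paragraphs if p.strip()]
```

The result is a list with one entry per paragraph, so a sentence that the page breaks across two <p> tags appears as two entries; joining such entries is the subject of the data cleansing that follows.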

2. Data processing

(1) Data cleansing

The crawled text is presented in the form of a list, in which a single sentence or paragraph may be broken across entries. After auditing the data, punctuation marks are used as the basis for judging whether adjacent sentences should be joined. The following 6 cases are proposed; hereinafter "|" means "or" and separates the enumerated items. For an input i-th sentence:

Case 1: its last character is one of the punctuation marks or characters [, | : | ( | [ | { | < | at | to], where punctuation marks may be in full-width or half-width form.

Case 2: the first character of the (i+1)-th sentence is one of the punctuation marks [, | ; | 。 | . | : | " | ! | ?], in full-width or half-width form.

Case 3: it contains punctuation mark pairs [() | [] | {} | <> | ""] and the number of left halves is greater than the number of right halves, such as "xx(xx)xx(xx".

Case 4: the (i+1)-th sentence contains punctuation mark pairs [() | [] | {} | <> | ""] and the number of left halves is smaller than the number of right halves, such as "xx)xx(xx)xx".

Case 5: it contains a punctuation mark pair whose halves appear in reversed order, such as "xxx)xxx(xxx".

Case 6: its last character is "it" and the first character of the (i+1)-th sentence is "middle".

If the input i-th sentence satisfies any of the 6 cases, it is joined with the (i+1)-th sentence; otherwise, it does not need to be joined with the next sentence.
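By way of non-limiting illustration, the six cases might be applied to the crawled list as in the following Python sketch. The exact character sets are assumptions: the words glossed "at" and "to" are taken to be 在 and 到, the two characters of case 6 are taken to be 其 and 中, and the lists of punctuation marks are samples. All function names are illustrative.

```python
OPENERS = "（([{<《“"
CLOSERS = "）)]}>》”"
# Assumed characters: 在 / 到 for "at" / "to" (case 1).
END_JOIN = set("，,：:") | set(OPENERS) | {"在", "到"}
START_JOIN = set("，,；;。.：:！!？?") | set(CLOSERS)


def pair_balance(text):
    """Number of opening marks minus number of closing marks."""
    return sum((ch in OPENERS) - (ch in CLOSERS) for ch in text)


def has_reversed_pair(text):
    """Case 5: balanced overall, but a closing mark comes before its opener."""
    depth, went_negative = 0, False
    for ch in text:
        depth += (ch in OPENERS) - (ch in CLOSERS)
        went_negative = went_negative or depth < 0
    return went_negative and depth == 0


def should_join(joined, last, following):
    """joined: text accumulated so far; last: the latest raw fragment in it."""
    return (
        last[-1] in END_JOIN                             # case 1
        or following[0] in START_JOIN                    # case 2
        or pair_balance(joined) > 0                      # case 3
        or pair_balance(following) < 0                   # case 4
        or has_reversed_pair(last)                       # case 5
        or (last[-1] == "其" and following[0] == "中")   # case 6 (assumed characters)
    )


def join_broken_sentences(fragments):
    """Merge list entries that belong to the same sentence."""
    merged, last = [], ""
    for fragment in fragments:
        if not fragment:
            continue
        if merged and should_join(merged[-1], last, fragment):
            merged[-1] += fragment
        else:
            merged.append(fragment)
        last = fragment
    return merged
```

Case 3 is evaluated on the accumulated text so that an opening mark stays "open" across several entries, while cases 1, 5 and 6 look only at the most recent entry.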

(2) Text segmentation

For an input epidemic notification text, the system extracts information sentence by sentence. First, the text is split into a set of long sentences using "; | 。 | ? | ! | … | ……" as segmenters. Each long sentence is then split into a set of short sentences using "," as the segmenter. Punctuation marks contained within symbol pairs such as "()" and title marks are not considered in the segmentation process. When the system extracts information, short sentences are the primary unit and long sentences are auxiliary; the long sentences are mainly used to supplement the information of the short sentences.
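By way of non-limiting illustration, the two-level segmentation might be sketched in Python as follows. The separator sets are assumed full-width and half-width forms of the marks named above (the half-width period is left out so that decimal numbers are not split), and the function names are illustrative.

```python
LONG_SEPARATORS = "；;。？?！!…"
SHORT_SEPARATORS = "，,"
OPENERS = "（(《"
CLOSERS = "）)》"


def split_outside_pairs(text, separators):
    """Split after each separator that is not inside brackets or title marks."""
    pieces, current, depth = [], "", 0
    for ch in text:
        current += ch
        if ch in OPENERS:
            depth += 1
        elif ch in CLOSERS:
            depth = max(depth - 1, 0)
        elif ch in separators and depth == 0:
            if len(current) == 1 and pieces:
                pieces[-1] += ch      # run of separators such as a double ellipsis
            else:
                pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces


def segment(text):
    """Return a list of (long sentence, list of its short sentences)."""
    return [
        (long_sentence, split_outside_pairs(long_sentence, SHORT_SEPARATORS))
        for long_sentence in split_outside_pairs(text, LONG_SEPARATORS)
    ]
```

Keeping each long sentence next to its short sentences makes it straightforward to consult the long sentence when a short sentence lacks information such as the time.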

3. Path information extraction

The path information includes: valid sentence, time, behavior event, address information, provincial administrative district, municipal administrative district, county-level administrative district, and website. A valid sentence is determined by the presence of address information; behavior events and address information are extracted from the valid sentence; the time is inferred from the valid sentence and the sentences preceding it; and the provincial, municipal and county-level administrative districts are inferred from the address information. Their specific meanings are as follows:

valid sentence: a short sentence of the news text that contains address information;

time: time nodes contained in the travel track of the patient;

behavior events: the patient goes to a certain place for purposes such as purchasing, exploring and the like;

address information: a location of concern in the patient's travel route;

provincial administrative district: the province, municipality directly under the central government, autonomous region or special administrative region to which the address information belongs;

municipal administrative district: the prefecture-level city, prefecture, autonomous prefecture or league to which the address information belongs;

county-level administrative district: the municipal district, county-level city, county, autonomous county, banner, autonomous banner, special district or forestry district to which the address information belongs;

website: the source news webpage to which the valid sentence belongs.

In the design of the system, the path information module is formed mainly by extracting and combining three parts of information: valid sentences, address information and behavior events; "province-city-county" matching; and time information.

(1) Valid sentence, address information and behavior event extraction

A valid sentence is a sentence containing address information, so address information must exist in a valid sentence, but a behavior event need not; for example, "arrived at Guangzhou Railway Station" contains no behavior event. Address information refers to places involved in the patient's travel route, not to every address that appears in a sentence. The process of identifying and extracting address information and behavior events is also the process of identifying valid sentences.

In the design of the system, two types of valid sentence are assumed.

Type 1 valid sentence: the number of "-" symbols in the sentence is greater than or equal to 2, for example "the driving around the experimental university of reputable, the southern road of Xingan is northbound - turns right to enter the Wulan to be detected - goes straight to the east two rings and turns left to the driving around the Chinese name of Chinese Max". For such sentences, the items connected by the symbol "-" are considered to be essentially valid address information.

Type 2 valid sentence: the sentence contains an address trigger word and address information, and the address information is introduced by the trigger word. An address trigger word is a word that can introduce address information; for example, "arrive at" in "arrive at Guangzhou Tianhe Passenger Station" is an address trigger word that introduces the target address. Note that in actual news text the trigger word is not necessarily followed by an address; in "arrive and stay", for instance, what follows is not an address. The text following the trigger word therefore needs to be checked by a rule system, and if address information is identified, the short sentence containing the trigger word is a valid sentence.

By browsing epidemic situation notification news, the system constructs a corresponding address trigger word list as follows:

TABLE 1 Address trigger vocabulary

A short sentence is input. If it is identified as a type 1 valid sentence, it is split directly using "-" as the segmenter to obtain a plurality of clauses. All clauses are then input into the rule system in turn, and the address information and behavior events in them are identified.

If the input sentence is not a type 1 valid sentence, it is determined whether it is a type 2 valid sentence. First, it is determined whether a trigger word exists; if not, the sentence is treated as not valid by default. If one exists, the trigger word and the text following it are extracted from the sentence (only the first trigger word is used if there are several); this portion of text is called a route sentence. The address information and behavior events in the route sentence are then identified according to the rule system. Finally, if the address information is null, no address can be obtained from the route sentence and the short sentence is not a valid sentence; otherwise, the short sentence is identified as a valid sentence, and the sentence, address information and behavior event (possibly null) are stored. The specific flow is as follows:

Step 1: Input a route sentence, perform named entity recognition with the foolnltk package, and judge whether any word is recognized as company/location/organization. If so, go to Step 2; otherwise, input the next sentence.

Step 2: Perform word segmentation and part-of-speech tagging on the route sentence.

Step 3: Determine, case by case, which words participate in information extraction. The 1st word is the trigger word and is not discussed.

(1) Case 3-1: the 2nd word is the word 'through', a verb or a preposition → information extraction starts from the 3rd word;

(2) Case 3-2: the 2nd word is a verb and the 3rd word is a preposition → information extraction starts from the 4th word;

(3) Case 3-3: the 2nd word is a conjunction and the 3rd word is a verb → information extraction starts from the 4th word;

(4) Case 3-4: the 2nd word is a verb, the 3rd word is a conjunction and the 4th word is a verb → information extraction starts from the 5th word.

Step 4: sequentially inputting words participating in information extraction, and if the input words are contained in the words obtained by Step 1, directly identifying the words as address information; otherwise, go to Step 5.

Step 5: and carrying out classified discussion on the positions of the words, and if a certain situation is not met, directly entering the next situation to judge.

(1) Case 5-1: the word is not the last word. (1) If the word satisfies "it is a verb + neither the preceding nor the following word is a conjunction + it is not inside a symbol pair such as () + a noun and 2 verbs exist among the following words + the character length between the 2 verbs is not less than 4", the word is identified as address information. (2) If the word satisfies either of 2 conditions: "it is a verb + neither the preceding nor the following word is a conjunction + it is not inside a symbol pair such as () + a noun and 2 verbs exist among the following words + the character length between the 2 verbs is less than 4", or "it is a preposition + the following word is a verb", the word is identified as a word of the behavior event.

(2) Case 5-2: the word is not one of the last 2 words. If the word is 'for' or its part of speech is a verb, the next word is a conjunction, and the word after that is a verb, the word is identified as a word of the behavior event.

(3) Case 5-3: the last 1 word. (1) If the word is a verb, identifying the word as address information; (2) if the word is a non-verb, the word is identified as a word of the behavioral event.

(4) Cases 5-4: none of the above 3 cases is satisfied, the word is identified as address information.

Step 6: Determine the final address information (by concatenating all words identified as address information) and the final behavior event (by concatenating all words identified as behavior event words) of the input route sentence based on the identification results of Step 5.

Step 7: if the input route sentence is not the last sentence, the next sentence is continuously input, and Steps 1-6 are repeated.
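By way of non-limiting illustration, the choice made in Step 3 of where information extraction begins might be sketched in Python as follows. The part-of-speech tags "v" (verb), "p" (preposition) and "c" (conjunction) are an assumed tag set, the `skip_words` parameter stands in for the literal word named in case 3-1, and starting at the 2nd word when no case applies is likewise an assumption.

```python
def extraction_start(tagged_words, skip_words=frozenset()):
    """Return the 0-based index of the first word taking part in extraction.

    tagged_words is a list of (word, tag) pairs whose first entry is the
    trigger word.
    """
    def tag_at(i):
        return tagged_words[i][1] if i < len(tagged_words) else None

    def word_at(i):
        return tagged_words[i][0] if i < len(tagged_words) else None

    # The more specific cases are tested first.
    if tag_at(1) == "v" and tag_at(2) == "c" and tag_at(3) == "v":   # case 3-4
        return 4
    if tag_at(1) == "v" and tag_at(2) == "p":                        # case 3-2
        return 3
    if tag_at(1) == "c" and tag_at(2) == "v":                        # case 3-3
        return 3
    if tag_at(1) in ("v", "p") or word_at(1) in skip_words:          # case 3-1
        return 2
    return 1   # assumed default: start right after the trigger word
```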

(2) "province-city-county" matching

In epidemic notification news, places usually do not appear in the standard four-level form "provincial administrative district + municipal administrative district + county-level administrative district + specific address"; other combinations of the four levels are generally used, so the system needs to standardize the address information. The specific method is as follows:

a "city-province" lookup table and a "county-city" lookup table are prepared. The lookup table refers to that the corresponding province can be queried according to the name of the city. The names of provincial administrative regions and municipal administrative regions do not overlap with each other, so a 'city-provincial' lookup table is directly constructed based on the names of all provincial and municipal administrative regions. The county administrative areas have the same name, and the municipal administrative areas to which the county administrative areas belong cannot be judged. The county-city lookup table constructed by the system only maintains names without the same names, such as Tianhe region, chen county and the like. County level names that are not in the lookup table are queried by invoking hundred api.
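The construction of the two lookup tables can be sketched as below. The function names and the three sample rows are illustrative assumptions; a real table would be built from the full list of administrative divisions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def build_city_province(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """City names are unique, so every (city, province) pair is kept."""
    return {city: province for city, province in pairs}


def build_county_city(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep only county names that occur once; homonymous counties are dropped
    and must be resolved some other way (the text uses a map API for this)."""
    pairs = list(pairs)
    counts = Counter(county for county, _ in pairs)
    return {county: city for county, city in pairs if counts[county] == 1}


# Illustrative sample only: Gulou District (鼓楼区) exists in several cities.
COUNTY_CITY = build_county_city([
    ("天河区", "广州市"),
    ("鼓楼区", "南京市"),
    ("鼓楼区", "福州市"),
])
CITY_PROVINCE = build_city_province([("广州市", "广东省"), ("南京市", "江苏省")])
```

A lookup such as `COUNTY_CITY.get("鼓楼区")` returns `None`, which signals that the name is ambiguous or unknown and that the fallback query is needed.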

In order to improve the query speed and accuracy of the system, the method combines self-designed rules, the named entity recognition function of the foolnltk package, and the Baidu API to extract the province, city and county administrative districts in which the address information is located. The specific implementation process is as follows:

(Note: below, the provincial, municipal and county administrative districts are abbreviated as province, city and county.)

Step1: the first city that appears in the text is recorded as the "a priori city".

Step2: input the address information and judge whether its first 2 characters are "our city" or "this city"; if so, replace the first 2 characters with the a priori city.

Step3: segment the address information into words and extract the words belonging to province, city and county.

Step4: proceed case by case according to which words were extracted:

1) Case 4-1: province, city and county are all non-empty

(1) Province, city and county are connected together in the address information (e.g. "Guangdong Province, Guangzhou City, Tianhe District"): save province, city, county and jump to Step10.

(2) (1) is not satisfied, but the city and county are connected together in the address information and the city belongs to the province (e.g. "Guangdong Provincial Museum in Tianhe District, Guangzhou City"): save province, city, county and jump to Step10.

(3) Not satisfying (1) and (2): jump to Step5.

2) Case 4-2: province and city are non-empty, county is empty

(1) Province and city are connected together in the address information: if address information = province, set county to null; otherwise, call the Baidu API to query the county; save province, city, county and jump to Step10.

(2) Not satisfying (1): jump to Step5.

3) Case 4-3: city and county are not empty, and province is empty

(1) The city and county are connected together in address information: searching for the corresponding province according to the city; save province, city, county and jump to Step10.

(2) Not satisfying (1): jump to Step5.

4) Case 4-4: province and county are not empty, the city is empty, and province and county are connected together in the address information

(1) County is a full name and is not a homonymous county: look up the corresponding city according to the county; save province, city, county and jump to Step10.

(2) County is not a full name, or is a homonymous county: call the Baidu API to query the city; if the queried city does not belong to the province, set the city to null, otherwise keep the queried city; save province, city, county and jump to Step10.

(3) Not satisfying (1) and (2): jump to Step5.

5) Case 4-5: province is non-empty, city and county are empty

(1) Address information = province: set city and county to null; save province, city, county and jump to Step10.

(2) Not satisfying (1): jump to Step5.

6) Case 4-6: the city is not empty, and the province and county are empty

(1) Address information = city: query the corresponding province and set county to null; save province, city, county and jump to Step10.

(2) Not satisfying (1): jump to Step5.

7) Case 4-7: county is not empty, province and city are empty

(1) County is a full name and is not a homonymous county: query the corresponding province and city; save province, city, county and jump to Step10.

(2) Not satisfying (1): jump to Step5.

8) Case 4-8: province, city and county are all null: jump to Step9.

Step5: if the province and the city are both null, jump to Step9. Otherwise, use the foolnltk package to perform named entity recognition on the address information, and judge whether the province or city is recognized as an entity:

1) Identified as an entity: go to Step6

2) Not identified as an entity: set the non-empty province/city to null and go to Step6

Step6: if the province and the city are both non-null, proceed case by case; otherwise go to Step7

1) The city belongs to the province: call the Baidu API to query the county; save province, city, county and jump to Step10.

2) The city does not belong to the province:

(1) City = a priori city: query the province corresponding to the city, and call the Baidu API to query the county; save province, city, county and jump to Step10.

(2) City ≠ a priori city, and province = province corresponding to the a priori city: set the city to null; save province, city, county and jump to Step7.

(3) City ≠ a priori city, and province ≠ province corresponding to the a priori city: set province and city to null; save province, city, county and jump to Step9.

Step7: the province is not null and the city is null (e.g. "Henan Postal Bank"). Call the Baidu API to query the province of the address information.

1) Province = queried province:

(1) Number of characters of the address information > number of characters of the province + 2: call the Baidu API to query the city of the address information

a) The city is not null: call the Baidu API to query the county; save province, city, county and jump to Step10.

b) The city is null: set the county to null; save province, city, county and jump to Step10.

(2) Number of characters of the address information ≤ number of characters of the province + 2: set the city and county to null; save province, city, county and jump to Step10.

2) Province ≠ queried province: set the province to null; save province, city, county and jump to Step9.

Step8: the province is null and the city is not null

1) The city belongs to the same province as the a priori city: look up the corresponding province according to the city, and call the Baidu API to query the county; save province, city, county and jump to Step10.

2) The city does not belong to the same province as the a priori city: set the city to null; save province, city, county and jump to Step9.

Step9: the province and city are both null

1) County is a full name and is not a homonymous county: look up the corresponding province and city; save province, city, county and jump to Step10.

2) County is not a full name / county is null / county is a homonymous county: call the Baidu API to query the city

(1) The city appears in the news text: query the corresponding province according to the city, and call the Baidu API to query the county; save province, city, county and jump to Step10.

(2) The city does not appear anywhere in the news text: set the city to the a priori city, query the corresponding province, and call the Baidu API to query the county; save province, city, county and jump to Step10.

Step10: output the province, city and county administrative districts.

Step11: input the next address information and repeat Step2-11 until all the address information has been traversed.
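The entry points of this flow (Step2 and the case split of Step4) can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, and the two prefix strings are an assumed back-translation of the "our city"/"this city" expressions; the lookups, entity recognition and map API calls of Steps 5-9 are not modelled.

```python
from typing import Optional

# Assumed original forms of "our city" / "this city"; not confirmed by the text.
LOCAL_PREFIXES = ("我市", "本市")

# Which of (province, city, county) are non-empty -> case label from Step4.
STEP4_CASES = {
    (True, True, True): "4-1",
    (True, True, False): "4-2",
    (False, True, True): "4-3",
    (True, False, True): "4-4",  # the text also requires province+county to be connected
    (True, False, False): "4-5",
    (False, True, False): "4-6",
    (False, False, True): "4-7",
    (False, False, False): "4-8",
}


def apply_prior_city(address: str, prior_city: str) -> str:
    """Step2: replace a leading 2-character local reference with the a priori city."""
    if address[:2] in LOCAL_PREFIXES:
        return prior_city + address[2:]
    return address


def step4_case(province: Optional[str], city: Optional[str], county: Optional[str]) -> str:
    """Step4: pick the case from which levels were extracted."""
    return STEP4_CASES[(bool(province), bool(city), bool(county))]


def connected(address: str, *parts: str) -> bool:
    """True if the parts appear back to back, in the given order, in the address."""
    return "".join(parts) in address
```

For example, `connected` distinguishes sub-case (1) of Case 4-1 from sub-case (2): in the first the three levels are adjacent, in the second only city and county are.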

(3) Time information extraction

Time information appears in the news text in 3 cases:

Table 2. Cases of time information in the text

The system takes the long sentence containing the valid sentence as the unit, combines the foolnltk toolkit, designs extraction rules according to the time format, and derives the time information of the valid sentence from them:

1) For case 1, the extracted time and place are matched directly.

2) For case 2, the valid sentence has no place and no time. If the valid sentence is the first clause of the long sentence, set the time to null; if not, check whether the previous clause has time information of the form "x month x day"; if so, use that "x month x day" as the time of the valid sentence, otherwise set the time to null.

3) For case 3, the time information in the valid sentence is incomplete, for example: there is "xx day" but no "xx month", or there is "xx hour (xx minute)" but no "xx month xx day". The solution is as follows: find the nearest preceding short sentence that carries time information (not limited to the same long sentence) and use it to complete the time information. For example: if the valid sentence contains only "11:30" and "February 20" appears in the nearest preceding short sentence, the time information of the valid sentence is completed to "February 20, 11:30".

4. "Residence/usual residence information" extraction

The residence/usual residence information module contains the following information: valid sentence, residence/usual residence, provincial administrative district, municipal administrative district, county administrative district, and web address. The "valid sentence" is determined by the presence of a "residence/usual residence", which in turn is determined by the designed fixed patterns; the province, city and county administrative districts are derived from the residence/usual residence. Their specific meanings are as follows:

Valid sentence: a news sentence or short sentence containing the patient's residence/usual residence information;

Residence/usual residence: the place where the patient lives or usually resides;

Province, city, county administrative district: the province, city and county administrative districts to which the "residence/usual residence" belongs

The residence/usual residence appears in epidemic notification text in 4 classes of patterns, as shown in the following table.

Table 3. Classes and patterns of residence/usual residence

The input short sentences are matched against the first to fourth classes in the table, in that order. If the patterns of one class do not match, the next class is tried, until a match succeeds. In particular, when a sentence:

enters the fourth class: match in the order (1)-(9);

enters the second class ("live in xxx"): the "stay in xxx" case needs to be excluded;

enters the fourth class ("live xxx"): the cases of "hospitalization", "resident" and "residence" need to be excluded.

If the input sentence fails to match in all four classes, it is deemed to contain no residence/usual residence information of the patient; it is not a valid sentence and is not retained. If the input sentence matches successfully, the matched text still does not necessarily represent the patient's residence/usual residence information. Among the successfully matched sentences, those with the following features are rejected:

(1) The matching result contains both "hospital" and "isolation", or both "hospital" and "treatment".

(2) Named entity recognition is performed on the matching result with the foolnltk package; the recognition result is null, or is not null but is not a place entity, and the matching result does not contain words such as "residential community", "apartment" or "hotel" (foolnltk may fail to recognize some place information, e.g. "Ruyi residential community").

If at least one of the above conditions is satisfied, the preliminary result obtained by pattern matching is considered invalid, i.e. not the patient's residence/usual residence information. Finally, the three-level administrative district information of the residence/usual residence is extracted with the "province-city-county" matching algorithm.

The implementation flow of the module is as follows:

Step1: input a short sentence

Step2: perform pattern matching against the first to fourth classes. If the matching result is null, reject the short sentence directly; if it is not null, go to Step3

Step3: if "hospital" and "isolation" both appear, or "hospital" and "treatment" both appear, reject the short sentence directly; otherwise go to Step4

Step4: perform entity recognition on the matching result with the foolnltk package. If the recognition result is null, or is not null but is not a place entity, go to Step5; otherwise jump to Step6

Step5: if the matching result contains words such as "residential community", "apartment" or "hotel", go to Step6; otherwise reject the short sentence directly

Step6: keep the matching result

Step7: following the "province-city-county" matching algorithm, extract the corresponding three-level administrative districts from the matching result
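Steps 1 to 6 of this flow can be sketched as a single filter function. Everything concrete here is an assumption made for illustration: the four regular expressions merely stand in for the classes of Table 3 (which is not reproduced in this text), the Chinese keywords are assumed originals of the translated terms, and `is_place_entity` is a hypothetical callback standing in for the named entity recognizer.

```python
import re
from typing import Callable, Optional

# Hypothetical stand-ins for the four pattern classes, tried in order.
PATTERNS = [
    re.compile(r"现住(?:址)?[为:：]?(.+)"),
    re.compile(r"居住(?:在|于)(.+)"),
    re.compile(r"常住(?:地)?[为:：]?(.+)"),
    re.compile(r"住(?!院)(.+)"),  # the lookahead excludes "hospitalization" (住院)
]
PLACE_KEYWORDS = ("小区", "公寓", "酒店")  # residential community, apartment, hotel


def extract_residence(phrase: str, is_place_entity: Callable[[str], bool]) -> Optional[str]:
    """Return the matched residence text, or None if the phrase is rejected."""
    # Step2: the first class whose pattern matches wins.
    matched = None
    for pattern in PATTERNS:
        found = pattern.search(phrase)
        if found:
            matched = found.group(1)
            break
    if not matched:
        return None
    # Step3: hospital + isolation, or hospital + treatment, is not a residence.
    if "医院" in matched and ("隔离" in matched or "治疗" in matched):
        return None
    # Step4-5: accept a recognized place entity, else fall back to keywords.
    if not is_place_entity(matched) and not any(k in matched for k in PLACE_KEYWORDS):
        return None
    # Step6: keep the matching result.
    return matched
```

Passing the recognizer in as a callback keeps the sketch independent of any particular NER library; the keyword fallback covers places the recognizer misses.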

5. Traffic ride information extraction

The traffic ride information module contains the following information: valid sentence, vehicle information, start, stop, remarks, and web address. The "valid sentence" is determined by the presence of "vehicle information", which is determined by the designed fixed patterns; the "start" and "stop" are extracted from the "valid sentence". Their specific meanings are as follows:

Valid sentence: a news sentence or short sentence containing vehicle information;

Vehicle information: information on the vehicle the patient took, such as flight CA8255, bus No. 89, etc.;

Start: the place where the patient boarded the vehicle;

Stop: the destination to which the patient took the vehicle;

Remarks: information other than "vehicle information, start, stop", e.g. "buying flowers", etc.

The patient's traffic ride information is presented mainly in the following 8 classes of patterns, as shown in the following table.

Table 4. Classes and patterns of traffic ride information

The input short sentences are matched against the first to eighth classes in the table, in that order. If the patterns of one class do not match, the next class is tried, until a match succeeds. If the input sentence fails to match in all eight classes, it is considered to contain no traffic ride information of the patient; it is not a valid sentence and is not retained.

To improve accuracy, constraint rules are added on top of the eight classes of patterns, and the named entity recognition function of the foolnltk package for time-type entities is introduced. The specific analysis and extraction process is as follows:

Step 1: split the text into short sentences, remove words that may interfere (such as "passenger"), and input the sentences in sequence

Step 2: judge whether the sentence contains "to"; if not, jump to Step 3, otherwise go to Step 5

Step 3: count the number of "ride" (乘) characters in the sentence; if the number is greater than or equal to 2, jump to Step 4, otherwise go to Step 5

Step 4: if a "ride" character is inside "(xxx)", take that "(xxx)" as remark information; otherwise, use the "ride" character as a separator to split the sentence, and input the parts to Step 5 in sequence

Step 5: match the first to fourth classes of patterns and extract information. If the "start" is recognized by foolnltk named entity recognition as time-type text, extract the time-type text as remark information, reset "start", "stop" and "vehicle information" to null, and jump to Step 7. Otherwise go to Step 6

Step 6: if the patient's traffic ride information has been extracted, jump to Step 8; otherwise go to Step 7

Step 7: match the sentence against the fifth to eighth classes of patterns

Step 8: set the vehicle table [high-speed rail, train, route, bus, bullet train, aviation, train, subway, license plate, taxi, DiDi, express car, online ride-hailing car, cruise ship, private car, hitch ride, electric car]. If the extracted information satisfies the condition that "start" is null and no element of the vehicle table appears in the "vehicle information", the sentence is recognized as an invalid sentence and rejected. Otherwise, go to Step 9

Step 9: following the "extract address information and behavioral event" operation in section 3.1, process the extracted "stop", update "stop" to the recognized address information, and add the recognized behavioral event (possibly null) to the remark information

Step 10: store the extracted information
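Two of the constraint rules above, the splitting on repeated "ride" characters (Steps 3-4) and the vehicle-table check (Step 8), can be sketched as follows. The vehicle table is an assumed back-translation of part of the list in Step 8, and the function names are hypothetical; pattern matching against Table 4 is not modelled.

```python
import re
from typing import List, Optional, Tuple

RIDE = "乘"
# Assumed Chinese originals for some of the vehicle table entries.
VEHICLE_TABLE = ("高铁", "火车", "公交", "动车", "地铁", "出租车",
                 "滴滴", "网约车", "邮轮", "私家车", "顺风车")
# A parenthesized span (full-width or ASCII brackets) containing the ride character.
PAREN_WITH_RIDE = re.compile(r"[（(][^（()）]*乘[^（()）]*[)）]")


def split_on_ride(sentence: str) -> Tuple[List[str], List[str]]:
    """Steps 3-4: return (clauses to match, remarks). Requires Python 3.7+."""
    if sentence.count(RIDE) < 2:
        return [sentence], []
    # Parenthesized text containing the ride character becomes a remark.
    remarks = PAREN_WITH_RIDE.findall(sentence)
    rest = PAREN_WITH_RIDE.sub("", sentence)
    if rest.count(RIDE) < 2:
        return [rest], remarks
    # Otherwise split just before each ride character.
    parts = [p for p in re.split(r"(?=乘)", rest) if p]
    if RIDE not in parts[0]:  # keep leading text with the first ride clause
        parts = [parts[0] + parts[1]] + parts[2:]
    return parts, remarks


def is_valid_ride(start: Optional[str], vehicle: Optional[str]) -> bool:
    """Step 8: reject only when start is empty AND no known vehicle is named."""
    names_vehicle = bool(vehicle) and any(v in vehicle for v in VEHICLE_TABLE)
    return bool(start) or names_vehicle
```

Splitting before the ride character, rather than on it, keeps the character inside each clause so that the later pattern matching can still see it.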

6. Web site design

After the epidemic information extraction system is designed, it is presented through a website. Three third-party Python modules, flask, streamlit and pyecharts, are used to deploy the user interface of the system. After deployment, users can access the system through the web page and use its functions. The following functions are mainly implemented:

(1) Data input: when only a few web addresses are to be extracted, the user can enter one or more web addresses into the system through a text box; for batch extraction, the user uploads a csv file listing the web addresses to be extracted. The system receives the web addresses and extracts each of them in turn.

(2) Information output: after extracting the information, the system returns the data, and the web page renders it dynamically for display. A download service for the extracted tables is also provided, so that the user can download and save the extracted information.

(3) Knowledge graph: after the system extracts the path information, a new web page is opened to present a knowledge graph of the path information; the nodes of the graph are the province, city and county administrative districts in the path information table. The user can select any node in the graph to view it.

(4) Good interaction and attractive interface: the user can understand the system functions intuitively through the web page and operate it independently.

After the user enters web addresses as text or uploads them in a file, the interface displays the progress of the information extraction. After extraction, the system outputs the results to the web page and provides the download service. When the extraction results of all three parts have been obtained, the system opens a new web page to present a location knowledge graph of the path information, in which provincial, municipal and county administrative districts are the nodes.
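How the nodes and links of such a location graph might be derived from the path information table can be sketched as below. The row layout and the function name are assumptions; the output uses plain dictionaries with `name`, `source` and `target` keys, a shape commonly accepted by graph charting libraries, and the actual rendering call is deliberately left out.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Tuple[Optional[str], Optional[str], Optional[str]]  # (province, city, county)


def build_graph(rows: Iterable[Row]) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Collect unique administrative districts as nodes and parent-child pairs as links."""
    nodes, links = set(), set()
    for province, city, county in rows:
        chain = [level for level in (province, city, county) if level]
        nodes.update(chain)
        links.update(zip(chain, chain[1:]))  # province->city, city->county
    node_list = [{"name": name} for name in sorted(nodes)]
    link_list = [{"source": s, "target": t} for s, t in sorted(links)]
    return node_list, link_list


nodes, links = build_graph([
    ("广东省", "广州市", "天河区"),
    ("广东省", "广州市", None),
])
```

Node names are used as identifiers here, so two homonymous counties in different cities would collapse into one node; a fuller version would need a qualified name for each node.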

The epidemic situation news information extraction system for executing the method comprises the following steps:

The above-mentioned preferred embodiments should be regarded as illustrative examples of embodiments of the present application, and all such technical deductions, substitutions, improvements and the like which are made on the basis of the embodiments of the present application, are considered to be within the scope of protection of the present patent.

Claims

1. The epidemic situation news information extraction method is characterized by comprising the following steps of:

step 01, data crawling;

step 02, a data processing step;

Step 03, extracting path information;

step 04, residence/constant residence information extraction;

step 05, extracting traffic boarding information;

the operation of step 05 comprises the following steps:

s6: setting a traffic tool table, and if the starting point information is null and the traffic tool information does not contain traffic tools in the traffic tool table, remarking the identified behavior event information by the extracted end point information;

step 06, information output and display step;

2. The epidemic news information extraction method according to claim 1, wherein:

the operation of the step 01 comprises the following steps:

step 012, analyzing the obtained page by using the lxml library;

3. The epidemic news information extraction method according to claim 1, wherein:

the operation of step 02 includes:

4. The epidemic news information extraction method according to claim 3, wherein:

the operation of step 02 includes:

5. The epidemic news information extraction method according to claim 1, wherein:

the operation of step 03 includes:

the path information valid sentence includes any one of the following forms:

2) The sentence contains a preset address trigger word;

If the input sentence is identified as the situation 2, extracting the trigger word and the following text content to form a route sentence, and identifying and acquiring address information and/or behavior event information in the route sentence;

6. The epidemic news information extraction method according to claim 5, wherein:

the operation of step 03 includes:

the time information in the news text is divided into 3 cases:

case 1, time and place are in the same valid sentence;

if the situation 1 is judged, matching the extracted time and place;

7. The epidemic news information extraction method according to claim 1, wherein:

the operation of step 04 includes:

8. The epidemic news information extraction method according to claim 1, wherein:

the operation of step 06 includes:

9. An epidemic news information extraction system, applying the epidemic news information extraction method according to any one of claims 1 to 8, characterized by comprising: