CN103886080B - Method for extracting road traffic information from Internet unstructured text - Google Patents

Publication number
CN103886080B
Authority
CN
China
Prior art keywords
information
traffic
sentence
type
feature words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410115332.4A
Other languages
Chinese (zh)
Other versions
CN103886080A (en)
Inventor
陆锋
仇培元
张恒才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Geographic Sciences and Natural Resources of CAS
Original Assignee
Institute of Geographic Sciences and Natural Resources of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Geographic Sciences and Natural Resources of CAS
Priority to CN201410115332.4A
Publication of CN103886080A
Application granted
Publication of CN103886080B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Abstract

The invention provides a method for extracting road traffic information from an Internet unstructured text. The method comprises the steps of defining a data structure of the road traffic information and a description feature word type of the road traffic information, expanding a few manually-established basic extraction modes to obtain an extraction mode bank, generating a feature word type sequence after the input Internet unstructured text is preprocessed, obtaining a matched extraction mode of the input text according to the similarity of the feature word type sequence, utilizing the matching extraction mode for extracting a positioning information element and a type information element of the road traffic information from the Internet unstructured text, utilizing a regular expression and a judgment rule for extracting a time information element from the input text, and obtaining the road traffic information through the combination of the positioning information element, the type information element and the time information element. By the means of the method for extracting the road traffic information from the Internet unstructured text, real-time processing can be carried out on the unstructured text collected from the Internet, the road traffic information can be extracted, and the traffic information collecting means are enriched.

Description

A method for extracting road traffic information from unstructured Internet text
Technical field
The present invention relates to the field of traffic information, and in particular to a method for extracting road traffic information from unstructured Internet text.
Background art
In cities, the steadily growing number of vehicles has made traffic problems increasingly prominent, and public demand for real-time road traffic information has become more urgent. Road traffic information mainly includes road traffic flow, road conditions, traffic restrictions, traffic control, traffic events, traffic weather and road environment information. Existing techniques for acquiring real-time road traffic information, such as fixed sensors (induction loops, video surveillance and microwave detection), floating cars equipped with GPS and wireless communication devices, and signal analysis of mobile communication terminals, are widely used to collect real-time traffic flow information, but they cannot capture road traffic information such as sudden traffic events, temporary traffic control or newly imposed traffic restrictions. Meanwhile, the Internet, as a convenient and efficient information carrier in today's society, has attracted a large number of government agencies, professional information providers and individual users, who publish real-time road traffic information on professional websites, forums and microblog platforms. This information covers many types and is highly timely, and its volume will keep growing with the number of users. The road traffic information contained on the Internet will therefore become an important data source for real-time road traffic information, complementing other traffic information collection techniques and playing an important role in government planning and decision-making and in public travel services.
At present, part of the road traffic information on the Internet exists as unstructured text, which typically describes road traffic events in natural language. However, existing road traffic information systems can only process structured data expressed as two-dimensional tables, so structured road traffic information has to be extracted from unstructured Internet text by information extraction techniques. Existing information extraction techniques do not take the characteristics of road traffic information descriptions into account: they cannot correctly identify road positioning descriptions based on the linear referencing method in unstructured Internet text, and they lack the ability to handle road traffic information elements that are implied or omitted in the text.
Summary of the invention
Technical problem solved by the invention: to overcome the deficiencies of the prior art by providing a method for extracting road traffic information from unstructured Internet text. The method can identify road positioning descriptions based on the linear referencing method in unstructured Internet text and correctly handle road traffic information elements that are implied or omitted in the text description. It can be used in traffic information systems and service platforms to automatically acquire and process the road traffic information contained in unstructured Internet text.
Technical solution of the invention: a method for extracting road traffic information from unstructured Internet text, which processes real-time text data collected from relevant Internet websites and identifies and extracts from it the positioning information element, type information element and time information element that road traffic information should contain, thereby supporting the fusion and publication of road traffic information. The specific steps are as follows:
Step 1. Define the data structure of road traffic information so that road traffic information can be organized and managed as a two-dimensional table. The data structure consists of information elements and the specific element attributes of each information element. The information elements are the positioning information element, the type information element and the time information element. The element attributes of the positioning information element are the centerline road, start road, end road, start direction and end direction; the element attribute of the type information element is the traffic event type; and the element attributes of the time information element are the traffic event start time and the traffic event end time. The road traffic information includes road condition information, road traffic restriction information, road traffic control information, road traffic event information and road environment information;
Step 2. Take the words that play a key role in describing road traffic information as feature words. According to the grammatical function of these words in unstructured Internet text, define the types of feature words used to fill the element attributes of road traffic information, and build a traffic-specific dictionary by feature word type. The feature word types are road name word, auxiliary positioning word, direction description word, preposition, road event word and general word. A general word is any word that does not belong to the road name word, auxiliary positioning word, direction description word, preposition or road event word types. Unstructured Internet text refers to web news, forum posts, blog entries and microblog messages;
Step 3. Based on the data structure of road traffic information defined in step 1 and the feature word types defined in step 2, and considering the grammatical and syntactic structure with which traffic events are described in unstructured Internet text, manually formulate basic extraction patterns, and expand the basic extraction patterns by rules to obtain an extraction pattern library. An extraction pattern consists of two parts: a feature word type sequence and an element attribute sequence. The feature word type sequence is the ordered arrangement of the types of the feature words that people use when describing a traffic event in unstructured Internet text; its function in an extraction pattern is to decide whether a piece of unstructured Internet text can be matched with that pattern. The element attribute sequence has the same length as the feature word type sequence; each of its items is the element attribute of road traffic information that corresponds to the item at the same position in the feature word type sequence. Its function is to guide the computer in mapping the feature words that appear in unstructured Internet text to the corresponding element attributes of road traffic information;
Step 4. Take the collected unstructured Internet text as the input text and preprocess it. Preprocessing includes deleting duplicate information from the input text and performing Chinese word segmentation on it, which yields the word sequence of the input text;
Step 5. Use the traffic-specific dictionary of step 2 to identify the feature words in the word sequence obtained in step 4, record the types of the feature words in the order in which they appear in the input text, and generate the feature word type sequence of the input text. Filter the input text by checking whether the feature word types needed for the element attributes of road traffic information are all present;
Step 6. Split the input text into sentences. According to the resulting sentence set, divide the feature word type sequence of the input text obtained in step 5 into a set of feature word type sequences, one per sentence. Use the dynamic time warping (DTW) distance to measure the similarity between each feature word type sequence in this set and the feature word type sequence of each extraction pattern in the extraction pattern library, and select the extraction pattern with the highest similarity (smallest DTW distance), provided its distance is below a given threshold, as the matching extraction pattern of the sentence;
Step 7. Traverse the sentence set of the input text. If a sentence obtained a matching extraction pattern in step 6, fill the feature words of the sentence into the corresponding element attributes of road traffic information according to the element attribute sequence of the matching pattern, generating the road traffic information for that sentence. After the traversal, check whether the centerline road attribute of the positioning information element and the traffic event type attribute of the type information element are present in each piece of road traffic information; if not, use supplement rules to fill in the missing centerline road attribute or traffic event type attribute. Finally, obtain the set of road traffic information of the input text for which the positioning information element and type information element have been extracted;
Step 8. According to the different ways in which time is expressed in unstructured Internet text, manually formulate a set of regular expressions for extracting the values of the time components year, month, day, hour, minute and second. Using this set of regular expressions together with judgment rules, extract the time component values from the input text and combine them into the traffic event start time attribute and the traffic event end time attribute, obtaining the time information element of the road traffic information;
Step 9. Fill the time information element extracted in step 8 into each piece of road traffic information in the set obtained in step 7, obtaining a set of road traffic information whose information elements are complete.
In step 6, when the DTW distance is used to measure the similarity between the feature word type sequence of each sentence and the feature word type sequence of each extraction pattern in the extraction pattern library, the distance between sequence items is defined as follows:
If c_i = t_j, then d(c_i, t_j) = 0;
If c_i ≠ t_j and t_j is a road name word or road event word, then d(c_i, t_j) = 2;
If c_i ≠ t_j and t_j is an auxiliary positioning word, direction description word, preposition or general word, then d(c_i, t_j) = 1;
where c_i denotes the i-th item of the feature word type sequence of an input text sentence, t_j denotes the j-th item of the feature word type sequence of an extraction pattern, and d(c_i, t_j) denotes the distance between c_i and t_j.
In step 7, the supplement rule for filling in a missing traffic event type attribute of the type information element of road traffic information is:
(1) Let s_i be the sentence corresponding to the road traffic information currently being processed, and set j = i;
(2) Set j = j - 1 and read sentence s_j. If s_j exists, go to step (3); otherwise, go to step (6);
(3) If the feature word type sequence of s_j has the sequential structure "type information element, positioning information element", go to step (4); otherwise, go to step (5);
(4) Assign the traffic event type attribute of s_j to the road traffic information currently being processed; the supplement process ends;
(5) If the feature word type sequence of s_j has the sequential structure "positioning information element, type information element", go to step (6); otherwise, go to step (2);
(6) s_j is unrelated to s_i; set j = i and go to step (7);
(7) Set j = j + 1 and read sentence s_j. If s_j exists, go to step (8); otherwise, the supplement process ends;
(8) If the feature word type sequence of s_j has the sequential structure "positioning information element, type information element", go to step (4); otherwise, go to step (9);
(9) If the feature word type sequence of s_j has the sequential structure "type information element, positioning information element", then s_j is unrelated to s_i and the supplement process ends; otherwise, go to step (7).
In step 7, the supplement rule for filling in a missing centerline road attribute of the positioning information element of road traffic information is:
(1) Let s_i be the sentence corresponding to the road traffic information currently being processed, and set j = i;
(2) Set j = j - 1 and read sentence s_j. If s_j exists, go to step (3); otherwise, the supplement process ends;
(3) If the feature word type sequence of s_j has the sequential structure "type information element, positioning information element" and s_j contains the centerline road attribute, go to step (4); otherwise, go to step (5);
(4) Assign the centerline road attribute of the road traffic information of s_j to the road traffic information currently being processed; the supplement process ends;
(5) If s_j has corresponding road traffic information that lacks the centerline road attribute, go to step (2); otherwise, the supplement process ends.
Compared with the prior art, the advantages of the present invention are as follows. The characteristics of road traffic information descriptions are fully considered during information extraction: the method can identify positioning descriptions based on linear referencing in unstructured Internet text and correctly handle road traffic information elements that are implied or omitted in the text description, thereby extracting road traffic information from unstructured Internet text expressed in natural language. The extraction process does not require extensive manual intervention, which makes it easy for a computer to automatically process unstructured Internet text collected in real time.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the invention;
Fig. 2 is a flowchart of the method for supplementing a missing traffic event type attribute in the type information element of road traffic information;
Fig. 3 is a flowchart of the method for supplementing a missing centerline road attribute in the positioning information element of road traffic information;
Fig. 4 is a flowchart of the method for extracting the time information element of road traffic information.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of the embodiments of the present invention, the embodiments are described in further detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for extracting road traffic information from unstructured Internet text according to an embodiment of the present invention. The method comprises the following steps:
Step 1. Define the data structure of road traffic information so that road traffic information can be organized and managed as a two-dimensional table. The data structure consists of information elements and the specific element attributes of each information element. The types of road traffic information it can express are road condition information, road traffic restriction information, road traffic control information, road traffic event information and road environment information. The specific content is as follows:
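The data structure of step 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the class and field names (`RoadTrafficInfo`, `centerline_road`, `as_row`, `is_valid` and so on) are hypothetical English renderings of the information elements and element attributes named in the summary, and the validity check mirrors the later requirement that the centerline road and traffic event type must both be present.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class PositioningElement:
    centerline_road: Optional[str] = None   # road on which the event lies
    start_road: Optional[str] = None
    end_road: Optional[str] = None
    start_direction: Optional[str] = None
    end_direction: Optional[str] = None


@dataclass
class TypeElement:
    event_type: Optional[str] = None        # traffic event type


@dataclass
class TimeElement:
    start_time: Optional[datetime] = None   # traffic event start time
    end_time: Optional[datetime] = None     # traffic event end time


@dataclass
class RoadTrafficInfo:
    positioning: PositioningElement = field(default_factory=PositioningElement)
    type_info: TypeElement = field(default_factory=TypeElement)
    time: TimeElement = field(default_factory=TimeElement)

    def as_row(self) -> dict:
        """Flatten the three elements into one row of a two-dimensional table."""
        return {**asdict(self.positioning), **asdict(self.type_info), **asdict(self.time)}

    def is_valid(self) -> bool:
        """Assumed validity rule: centerline road and event type are both filled."""
        return (self.positioning.centerline_road is not None
                and self.type_info.event_type is not None)
```

Flattening each record into a single row keeps the result compatible with systems that only accept tabular data, which is the stated motivation for the two-dimensional table form.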
Step 2. Take the words that play a key role in describing road traffic information as feature words. According to the grammatical function of these words in unstructured Internet text, define the types of feature words used to fill the element attributes of road traffic information. The unstructured Internet text considered includes web news, forum posts, blog entries and microblog messages. The defined feature word types and their relationship to the information element types are as follows:
The traffic-specific dictionary built from the feature word types comprises a road name lexicon, an auxiliary positioning lexicon, a direction description lexicon, a preposition lexicon and a road event lexicon. The road name lexicon contains the names of all roads in a specific region. The auxiliary positioning lexicon stores words that are used together with road name words and point to the final positioning address but cannot locate a position on their own. The direction description lexicon stores the various expressions of direction used in traffic events. The preposition lexicon stores connecting words that describe the relation between the start road and the end road, or between the start direction and the end direction. The road event lexicon stores the various descriptions of state information in traffic events.
Step 3. Based on the data structure of road traffic information defined in step 1 and the feature word types defined in step 2, and considering the grammatical and syntactic structure with which traffic events are described in unstructured Internet text, manually formulate basic extraction patterns, and expand the basic extraction patterns by rules to obtain an extraction pattern library, so as to identify and extract the element attributes of road traffic information from unstructured Internet text.
An extraction pattern consists of two parts: a feature word type sequence and an element attribute sequence. The feature word type sequence is the ordered arrangement of the types of the feature words that people use when describing a traffic event in unstructured Internet text; its function in an extraction pattern is to decide whether a piece of unstructured Internet text can be matched with that pattern. The element attribute sequence has the same length as the feature word type sequence; each of its items is the element attribute of road traffic information that corresponds to the item at the same position in the feature word type sequence. Its function is to guide the computer in mapping the feature words that appear in unstructured Internet text to the corresponding element attributes of road traffic information.
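The two-part structure of an extraction pattern can be illustrated with a minimal Python sketch. The names `ExtractionPattern` and `fill_attributes`, the type labels and the example pattern are all hypothetical; the patent's own basic patterns are not reproduced here. The sketch also assumes the simplest case, in which the sentence's feature words line up one-to-one with the pattern.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ExtractionPattern:
    type_sequence: Tuple[str, ...]                  # used for matching
    attribute_sequence: Tuple[Optional[str], ...]   # used for filling

    def __post_init__(self) -> None:
        # The two sequences must have the same length.
        if len(self.type_sequence) != len(self.attribute_sequence):
            raise ValueError("sequences must have equal length")


def fill_attributes(pattern: ExtractionPattern, feature_words: List[str]) -> Dict[str, str]:
    """Map the k-th feature word to the k-th element attribute (None = not stored)."""
    record: Dict[str, str] = {}
    for word, attribute in zip(feature_words, pattern.attribute_sequence):
        if attribute is not None:
            record[attribute] = word
    return record


# Hypothetical pattern: centerline road, start road, preposition, end road, filler, event.
EXAMPLE_PATTERN = ExtractionPattern(
    type_sequence=("road_name", "road_name", "preposition",
                   "road_name", "general", "road_event"),
    attribute_sequence=("centerline_road", "start_road", None,
                        "end_road", None, "event_type"),
)
```

For instance, `fill_attributes(EXAMPLE_PATTERN, ["North Fourth Ring Road", "Bridge A", "to", "Bridge B", "is currently", "congested"])` stores the first road name as the centerline road, the next two as start and end roads, and the last word as the event type.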
The basic extraction patterns used in the present invention are as follows:
The basic extraction patterns are expanded by extension rules to generate the extraction pattern library used for extracting road traffic information from unstructured Internet text. The extension rules are as follows:
An "auxiliary positioning word" may be added after one or more "road name words" in a basic extraction pattern;
"Event word, general word" may be added before a basic extraction pattern, or "general word, event word" may be added after it.
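One possible reading of the two extension rules is sketched below in Python. It expands only the feature word type sequence; how the added items map to element attributes is not specified in this passage and is left out. The function names, the type labels and the choice to enumerate every subset of road name positions are assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Tuple

TypeSeq = Tuple[str, ...]


def expand_with_auxiliary(base: TypeSeq) -> List[TypeSeq]:
    """Rule 1: insert an auxiliary positioning word after any subset of road name words."""
    road_positions = [k for k, t in enumerate(base) if t == "road_name"]
    variants: List[TypeSeq] = []
    for size in range(len(road_positions) + 1):       # size 0 keeps the base pattern
        for chosen in combinations(road_positions, size):
            seq: List[str] = []
            for k, t in enumerate(base):
                seq.append(t)
                if k in chosen:
                    seq.append("auxiliary")
            variants.append(tuple(seq))
    return variants


def expand_with_event(seq: TypeSeq) -> List[TypeSeq]:
    """Rule 2: optionally prepend (event, general) or append (general, event)."""
    return [seq,
            ("road_event", "general") + seq,
            seq + ("general", "road_event")]


def expand(base: TypeSeq) -> List[TypeSeq]:
    """Apply both rules and return the distinct expanded type sequences."""
    library: List[TypeSeq] = []
    for variant in expand_with_auxiliary(base):
        for expanded in expand_with_event(variant):
            if expanded not in library:
                library.append(expanded)
    return library
```

Under this reading, a base sequence with two road name words yields four auxiliary variants, each with three event variants, so twelve type sequences in total.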
Step 4. Take the collected unstructured Internet text as the input text and preprocess it. Preprocessing includes deleting duplicate information from the input text and performing Chinese word segmentation on it, which yields the word sequence of the input text.
Step 5. Use the traffic-specific dictionary of step 2 to identify the feature words in the word sequence obtained in step 4, record the types of the feature words in the order in which they appear in the input text, and generate the feature word type sequence of the input text. Filter the input text by checking whether the feature word types needed for the element attributes of road traffic information are all present.
a. Match each word in the word sequence of the input text against the traffic-specific dictionary. If the match succeeds, record the corresponding feature word type; otherwise, record the word as a general word. Merge consecutive general word types into a single general word type. This yields the feature word type sequence of the input text.
b. If the feature word type sequence of the input text lacks the feature word types related to the positioning information element or the type information element, the input text is considered not to contain complete traffic information elements; extraction fails and no further extraction is performed.
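Sub-steps a and b of step 5 can be sketched in Python as follows. The miniature dictionary, the type labels and the function names are hypothetical. The filter assumes that a road name word is the type required for the positioning information element and a road event word is the type required for the type information element, which this passage does not spell out.

```python
from typing import Dict, List

# Hypothetical miniature traffic-specific dictionary: word -> feature word type.
TRAFFIC_DICTIONARY: Dict[str, str] = {
    "北四环": "road_name",
    "学院桥": "road_name",
    "健翔桥": "road_name",
    "附近": "auxiliary",
    "东向西": "direction",
    "至": "preposition",
    "拥堵": "road_event",
}


def feature_type_sequence(words: List[str], dictionary: Dict[str, str]) -> List[str]:
    """Sub-step a: tag each word, merging runs of general words into one item."""
    sequence: List[str] = []
    for word in words:
        word_type = dictionary.get(word, "general")
        if word_type == "general" and sequence and sequence[-1] == "general":
            continue  # consecutive general words collapse into a single item
        sequence.append(word_type)
    return sequence


def has_required_types(sequence: List[str]) -> bool:
    """Sub-step b (assumed rule): need a road name word and a road event word."""
    return "road_name" in sequence and "road_event" in sequence
```

With the segmented input `["北四环", "学院桥", "至", "健翔桥", "目前", "比较", "拥堵"]`, the two unknown words before the event word collapse into one general item, and the text passes the filter.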
Step 6. Split the input text into sentences, converting it into a sentence set. Compare the feature word type sequence of each sentence with the feature word type sequence of each extraction pattern in the extraction pattern library, and select the extraction pattern with the highest similarity, provided its DTW distance is below a given threshold, as the matching extraction pattern of the sentence. The specific implementation is as follows:
a. Split the input text into sentences and, according to the resulting sentence set, divide the feature word type sequence of the input text into the corresponding set of feature word type sequences. Use the dynamic time warping (DTW) distance to measure the similarity between the feature word type sequence of each sentence and that of each extraction pattern in the extraction pattern library; a smaller DTW distance indicates that two feature word type sequences are more similar. The DTW distance used in the present invention, computed by dynamic programming, is defined as follows:
$$\mathrm{dtw}(c, t) = \frac{\gamma(m, n)}{\mathrm{length}(t)}$$
$$\gamma(i, j) = d(c_i, t_j) + \min\{\gamma(i-1, j),\ \gamma(i-1, j-1),\ \gamma(i, j-1)\}$$
$$\gamma(0, 0) = 0,\quad \gamma(0, \infty) = \gamma(\infty, 0) = \infty,\quad (i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, n)$$
where dtw(c, t) denotes the DTW distance between feature word type sequence c and feature word type sequence t; c = {c_1, c_2, ..., c_i, ..., c_m} denotes the feature word type sequence generated from the input text; t = {t_1, t_2, ..., t_j, ..., t_n} denotes the feature word type sequence of an extraction pattern; length(t) is the length of feature word type sequence t; and d(c_i, t_j) denotes the distance between sequence item c_i and sequence item t_j, defined as:
If c_i = t_j, then d(c_i, t_j) = 0;
If c_i ≠ t_j and t_j is a road name word or road event word, then d(c_i, t_j) = 2;
If c_i ≠ t_j and t_j is an auxiliary positioning word, direction description word, preposition or general word, then d(c_i, t_j) = 1.
b. Select from the extraction pattern library the extraction pattern whose feature word type sequence has the smallest DTW distance to that of the sentence. If this DTW distance is below the given threshold, the extraction pattern is taken as the matching extraction pattern of the sentence. If two or more extraction patterns satisfy these conditions, select the one with the longest feature word type sequence; if two or more still remain, select the first of them. The distance threshold used in the present invention is 0.5.
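The distance computation and pattern selection of step 6 can be sketched in Python as below. The function names and type labels are hypothetical. The boundary condition is read in the usual DTW way, as an assumption: the accumulated cost is 0 at (0, 0) and infinite along the rest of row 0 and column 0. Pattern sequences are assumed to be non-empty.

```python
import math
from typing import List, Optional, Sequence

HEAVY_TYPES = {"road_name", "road_event"}  # a mismatch against these costs 2


def item_distance(c_item: str, t_item: str) -> int:
    """d(c_i, t_j): 0 if equal, otherwise 2 or 1 depending on the pattern item."""
    if c_item == t_item:
        return 0
    return 2 if t_item in HEAVY_TYPES else 1


def dtw_distance(c: Sequence[str], t: Sequence[str]) -> float:
    """Accumulated cost gamma(m, n) divided by the length of the pattern sequence t."""
    m, n = len(c), len(t)
    gamma = [[math.inf] * (n + 1) for _ in range(m + 1)]
    gamma[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            gamma[i][j] = item_distance(c[i - 1], t[j - 1]) + min(
                gamma[i - 1][j], gamma[i - 1][j - 1], gamma[i][j - 1])
    return gamma[m][n] / n


def match_pattern(sentence: Sequence[str],
                  patterns: List[Sequence[str]],
                  threshold: float = 0.5) -> Optional[Sequence[str]]:
    """Smallest distance wins; ties go to the longest pattern, then to the first."""
    best, best_distance = None, math.inf
    for pattern in patterns:
        distance = dtw_distance(sentence, pattern)
        longer_tie = (distance == best_distance and best is not None
                      and len(pattern) > len(best))
        if distance < best_distance or longer_tie:
            best, best_distance = pattern, distance
    return best if best_distance < threshold else None
```

Because the mismatch cost depends only on the pattern item t_j, the distance is not symmetric: a sentence that omits a general word from a pattern is penalised less than one that omits a road name or road event word.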
Step 7. Traverse the sentence set of the input text. If a sentence obtained a matching extraction pattern in step 6, fill the feature words of the sentence into the corresponding element attributes of road traffic information according to the element attribute sequence of the matching pattern, generating the road traffic information for that sentence. After the traversal, check whether the centerline road attribute of the positioning information element and the traffic event type attribute of the type information element are present in each piece of road traffic information; if not, use supplement rules to fill in the missing centerline road attribute or traffic event type attribute.
Fig. 2 is a flowchart of the method for supplementing a missing traffic event type attribute in the type information element of road traffic information according to an embodiment of the present invention. The method comprises the following steps:
(1) Let s_i be the sentence corresponding to the road traffic information currently being processed, and set j = i;
(2) Set j = j - 1 and read sentence s_j. If s_j exists, go to step (3); otherwise, go to step (6);
(3) If the feature word type sequence of s_j has the sequential structure "type information element, positioning information element", go to step (4); otherwise, go to step (5);
(4) Assign the traffic event type attribute of s_j to the road traffic information currently being processed; the supplement process ends;
(5) If the feature word type sequence of s_j has the sequential structure "positioning information element, type information element", go to step (6); otherwise, go to step (2);
(6) s_j is unrelated to s_i; set j = i and go to step (7);
(7) Set j = j + 1 and read sentence s_j. If s_j exists, go to step (8); otherwise, the supplement process ends;
(8) If the feature word type sequence of s_j has the sequential structure "positioning information element, type information element", go to step (4); otherwise, go to step (9);
(9) If the feature word type sequence of s_j has the sequential structure "type information element, positioning information element", then s_j is unrelated to s_i and the supplement process ends; otherwise, go to step (7).
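The nine steps above amount to a backward scan followed by a forward scan, which the following Python sketch makes explicit. The `Sentence` record, its `structure` field and the two structure labels are hypothetical; how a sentence's structure is detected from its feature word type sequence is outside the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional

TYPE_FIRST = "type_then_positioning"       # type element precedes positioning element
POSITION_FIRST = "positioning_then_type"   # positioning element precedes type element


@dataclass
class Sentence:
    structure: Optional[str] = None        # None: neither sequential structure
    event_type: Optional[str] = None


def supplement_event_type(sentences: List[Sentence], i: int) -> Optional[str]:
    """Return an event type borrowed for sentence i, or None if none is found."""
    # Steps (2)-(5): scan backwards for a "type first" sentence.
    j = i - 1
    while j >= 0:
        if sentences[j].structure == TYPE_FIRST:
            return sentences[j].event_type      # step (4)
        if sentences[j].structure == POSITION_FIRST:
            break                               # step (6): unrelated, stop looking back
        j -= 1
    # Steps (7)-(9): scan forwards for a "positioning first" sentence.
    j = i + 1
    while j < len(sentences):
        if sentences[j].structure == POSITION_FIRST:
            return sentences[j].event_type      # step (4)
        if sentences[j].structure == TYPE_FIRST:
            break                               # step (9): unrelated, stop
        j += 1
    return None
```

The asymmetry reflects the two ways an event type can be shared: a sentence that states the type first can govern the locations that follow it, while a sentence that states the location first can close a list of preceding locations.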
Fig. 3 is a flowchart of the method for supplementing a missing centerline road attribute in the positioning information element of road traffic information according to an embodiment of the present invention. The method comprises the following steps:
(1) Let s_i be the sentence corresponding to the road traffic information currently being processed, and set j = i;
(2) Set j = j - 1 and read sentence s_j. If s_j exists, go to step (3); otherwise, the supplement process ends;
(3) If the feature word type sequence of s_j has the sequential structure "type information element, positioning information element" and s_j contains the centerline road attribute, go to step (4); otherwise, go to step (5);
(4) Assign the centerline road attribute of the road traffic information of s_j to the road traffic information currently being processed; the supplement process ends;
(5) If s_j has corresponding road traffic information that lacks the centerline road attribute, go to step (2); otherwise, the supplement process ends.
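The five steps above can be sketched in Python as a single backward scan. The `Sentence` record and its fields (`structure`, `has_info`, `centerline_road`) are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

TYPE_FIRST = "type_then_positioning"  # type element precedes positioning element


@dataclass
class Sentence:
    structure: Optional[str] = None        # sequential structure, if any
    has_info: bool = False                 # road traffic information was extracted
    centerline_road: Optional[str] = None


def supplement_centerline_road(sentences: List[Sentence], i: int) -> Optional[str]:
    """Return a centerline road borrowed for sentence i, or None if none is found."""
    j = i - 1
    while j >= 0:
        s = sentences[j]
        # Step (3): a "type first" sentence that carries a centerline road.
        if s.structure == TYPE_FIRST and s.centerline_road is not None:
            return s.centerline_road       # step (4)
        # Step (5): keep looking back only through sentences that have traffic
        # information but also lack the centerline road.
        if s.has_info and s.centerline_road is None:
            j -= 1
            continue
        break
    return None
```

The scan stops as soon as it meets a sentence that either has no road traffic information or has its own centerline road without the required structure, so a road name is only inherited across an unbroken run of incomplete records.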
If, after the supplement process, the centerline road attribute of the positioning information element or the traffic event type attribute of the type information element is still missing, the road traffic information extracted from that sentence is invalid. Finally, discard the invalid information to obtain the set of road traffic information of the input text for which the positioning information element and type information element have been extracted.
Step 8. According to the different ways in which time is expressed in unstructured Internet text, manually formulate a set of regular expressions for extracting the values of time components such as year, month, day, hour, minute and second. Using this set of regular expressions together with judgment rules, extract the time component values from the input text and combine them into the traffic event start time attribute and the traffic event end time attribute, obtaining the time information element of the road traffic information. The specific regular expressions used by the present invention to extract the time information element are:
Expression No.    Expression content
1    (\\d{1,2}月\\d{1,2}日\\D*\\d{1,2}[::]\\d{1,2})
2    (次日\\d{1,2}[::]\\d{1,2})
3    (至\\d{1,2}[::]\\d{1,2})
4    ((\\d{2,4}年\\d{1,2}月\\d{1,2}日\\D{1,3}\\d{1,2}月\\d{1,2}日))
5    ((\\d{2,4}年\\d{1,2}月\\d{1,2}日\\D{1,3}\\d{2,4}年\\d{1,2}月\\d{1,2}日))
6    ((\\d{4}\\D?\\d{1,2}\\D?\\d{1,2}\\D{1,3}\\d{1,2}\\D\\d{1,2}))
7    (([0-1]?[0-9]|2[0-3])[::]([0-5][0-9]))
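As a rough check of how such expressions behave, the seven patterns can be compiled in Python. This is a sketch under stated assumptions: the literals 年, 月, 日, 次日 and 至 and the non-digit classes `\D` are a reconstruction of the translated table, the patterns are written as raw strings with single backslashes, and the helper name `matching_expressions` is hypothetical.

```python
import re

# Reconstructed patterns keyed by expression number (assumption: separators
# between numbers are non-digit runs, and the colon class accepts both the
# ASCII colon and the full-width colon).
PATTERNS = {
    1: r"(\d{1,2}月\d{1,2}日\D*\d{1,2}[::]\d{1,2})",
    2: r"(次日\d{1,2}[::]\d{1,2})",
    3: r"(至\d{1,2}[::]\d{1,2})",
    4: r"((\d{2,4}年\d{1,2}月\d{1,2}日\D{1,3}\d{1,2}月\d{1,2}日))",
    5: r"((\d{2,4}年\d{1,2}月\d{1,2}日\D{1,3}\d{2,4}年\d{1,2}月\d{1,2}日))",
    6: r"((\d{4}\D?\d{1,2}\D?\d{1,2}\D{1,3}\d{1,2}\D\d{1,2}))",
    7: r"(([0-1]?[0-9]|2[0-3])[::]([0-5][0-9]))",
}
COMPILED = {number: re.compile(p) for number, p in PATTERNS.items()}


def matching_expressions(text: str) -> list:
    """Return the numbers of all expressions that match somewhere in text."""
    return [number for number, rx in COMPILED.items() if rx.search(text)]
```

A sentence such as "3月25日14:30发生事故" matches expressions 1 and 7, so the order in which the expressions are tried determines which reading of the time is used.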
Fig. 4 is a flowchart of the method for extracting the time information element of road traffic information in the embodiment of the present invention. It comprises the following steps:
(1) Match the input text against regular expression 1. If the match succeeds, go to (2); otherwise, go to (7).
(2) Split the match to obtain the month, day, hour, and minute; take the year from the collection time of the input text; combine them into the complete traffic event start time attribute; go to (3).
(3) Match the input text against regular expression 2. If the match succeeds, go to (4); otherwise, go to (5).
(4) Split the match to obtain the hour and minute; take the date from the traffic event start time and add 1 to its day; combine them into the complete traffic event end time attribute. This completes the extraction of the time element attributes of the road traffic information.
(5) Match the input text against regular expression 3. If the match succeeds, go to (6); otherwise, go to (16).
(6) Split the match to obtain the hour and minute; take the date from the traffic event start time; combine them into the complete traffic event end time attribute. This completes the extraction of the time element attributes of the road traffic information.
(7) Match the input text against regular expression 4. If the match succeeds, go to (8); otherwise, go to (9).
(8) Split the match to obtain two dates, which correspond to the traffic event start time and the traffic event end time respectively. Set the hour and minute of the start time to 0:00 and the hour and minute of the end time to 24:00, obtaining the complete traffic event start time attribute and traffic event end time attribute. This completes the extraction of the time element attributes of the road traffic information.
(9) Match the input text against regular expression 5. If the match succeeds, go to (8); otherwise, go to (10).
(10) Match the input text against regular expression 6. If the match succeeds, go to (8); otherwise, go to (11).
(11) Match the input text against regular expression 7. If the match succeeds, go to (12); otherwise, go to (13).
(12) Split the match to obtain the hour and minute; take the date from the information collection time; combine them into the complete traffic event start time attribute; go to (16).
(13) Obtain the information collection time of the input text. If the information collection time exists, go to (14); otherwise, go to (15).
(14) Use the collection time of the input text as the traffic event start time attribute; go to (16).
(15) Extraction of the time information element of the road traffic information fails; the road traffic information set of the input text is an invalid information set, and extraction for this text ends.
(16) Add a preset time threshold to the traffic event start time attribute to obtain the traffic event end time attribute. This completes the extraction of the time element attributes of the road traffic information. In the present invention, the time threshold is 45 minutes.
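The decision flow above might be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: it covers only the branches through expressions 1, 2, 3 and 7 and the fallback steps (13) to (16), and it omits the date-range branch of steps (7) to (10). The patterns are a reconstruction with capture groups added for convenience, the function name `extract_times` is hypothetical, and a missing collection time is treated as failure up front because steps (2) and (12) also rely on it. Hour and minute values are not validated.

```python
import re
from datetime import datetime, timedelta
from typing import Optional, Tuple

THRESHOLD = timedelta(minutes=45)  # default event duration stated in step (16)

RX1 = re.compile(r"(\d{1,2})月(\d{1,2})日\D*(\d{1,2})[::](\d{1,2})")
RX2 = re.compile(r"次日(\d{1,2})[::](\d{1,2})")
RX3 = re.compile(r"至(\d{1,2})[::](\d{1,2})")
RX7 = re.compile(r"([0-1]?[0-9]|2[0-3])[::]([0-5][0-9])")


def extract_times(text: str, collected_at: Optional[datetime]
                  ) -> Optional[Tuple[datetime, datetime]]:
    """Return (start, end) of the traffic event, or None if extraction fails."""
    if collected_at is None:
        return None                                   # step (15)
    m = RX1.search(text)
    if m:                                             # steps (1) and (2)
        month, day, hour, minute = map(int, m.groups())
        start = datetime(collected_at.year, month, day, hour, minute)
        m2 = RX2.search(text)
        if m2:                                        # steps (3) and (4): next day
            h, mi = map(int, m2.groups())
            return start, start.replace(hour=h, minute=mi) + timedelta(days=1)
        m3 = RX3.search(text)
        if m3:                                        # steps (5) and (6): same day
            h, mi = map(int, m3.groups())
            return start, start.replace(hour=h, minute=mi)
        return start, start + THRESHOLD               # step (16)
    m7 = RX7.search(text)
    if m7:                                            # steps (11) and (12)
        h, mi = map(int, m7.groups())
        start = collected_at.replace(hour=h, minute=mi, second=0, microsecond=0)
        return start, start + THRESHOLD               # step (16)
    return collected_at, collected_at + THRESHOLD     # steps (14) and (16)
```

For example, with a collection time of 25 March 2014, the text "3月25日22:00至次日6:00京藏高速封闭" yields a start of 25 March 22:00 and an end of 26 March 06:00.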
Step 9. Fill the time information element extracted in step 8 into each item of road traffic information in the set obtained in step 7, yielding a set of road traffic information whose information elements are complete. This completes the extraction of road traffic information from the input Internet unstructured text.
The extraction method can recognize linear-reference-based location descriptions in Internet unstructured text and correctly handle road traffic information elements that are implied or omitted in the text, thereby extracting road traffic information from Internet unstructured text expressed in natural language. The extraction process does not require extensive manual intervention, which makes it easy for a computer to automatically process Internet unstructured text collected in real time.
The present invention is applicable to application systems such as map website systems, public travel information platforms, and center-service navigation systems. It expands the road traffic information data sources and data types available to such systems, compensates for the shortcomings of existing real-time road traffic data, and improves the quality of road traffic information services.
The embodiments of the present invention have been described in detail above, and specific embodiments have been used herein to explain the invention. The description of the above embodiments is intended only to help in understanding the device and method of the invention. For a person of ordinary skill in the art, both the specific implementation and the scope of application may vary in accordance with the idea of the invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (4)

1. A method for extracting road traffic information from Internet unstructured text, characterized in that it is implemented by the following steps:
Step 1. Define the data structure of road traffic information so that the information can be organized and managed in the form of a two-dimensional table. The data structure consists of information elements and the specific element attributes of each information element. The information elements comprise a location information element, a type information element, and a time information element. The element attributes of the location information element are the centerline road, start road, end road, start direction, and end direction; the element attribute of the type information element is the traffic event type; the element attributes of the time information element are the traffic event start time and the traffic event end time. The road traffic information includes road condition information, road traffic restriction information, road traffic control information, road traffic accident information, and road environment information;
Step 2. Take the words that play a key role in describing road traffic information as feature words. According to the grammatical function these words perform in Internet unstructured text, define the types of feature words used to fill the element attributes of road traffic information, and build a traffic-specific dictionary organized by feature word type. The feature word types are road name words, auxiliary positioning words, direction descriptors, prepositions, road event words, and general words. General words are words that do not belong to any of the types road name word, auxiliary positioning word, direction descriptor, preposition, or road event word. Internet unstructured text refers to web news, forum posts, blog entries, and Twitter messages;
Step 3. Based on the data structure of road traffic information defined in step 1 and the feature word types defined in step 2, and taking into account the grammatical and syntactic structure with which traffic events are described in Internet unstructured text, manually formulate basic extraction patterns, then extend the basic extraction patterns by rules to obtain an extraction pattern library. Each extraction pattern comprises two parts, a feature word type sequence and an element attribute sequence. The feature word type sequence is the ordered arrangement of the types of the feature words that people use when describing a traffic event in Internet unstructured text; within an extraction pattern, its function is to determine whether a piece of Internet unstructured text can be matched with that pattern. The element attribute sequence has the same length as the feature word type sequence; each item in the element attribute sequence is the element attribute of the road traffic information that corresponds to the item at the same position in the feature word type sequence. The function of the element attribute sequence is to direct the computer in mapping the feature words that appear in Internet unstructured text to the corresponding element attributes of the road traffic information;
Step 4. Take the collected Internet unstructured text as the input text and preprocess it. The preprocessing comprises deleting duplicate information from the input text and performing Chinese word segmentation on it, yielding the word sequence of the input text;
Step 5. Use the traffic-specific dictionary of step 2 to identify the feature words that appear in the word sequence obtained in step 4, record the types of the feature words in the order in which the feature words appear in the input text to generate the feature word type sequence of the input text, and filter the input text by checking whether the feature word types needed for the element attributes of road traffic information are all present;
Step 6. Split the input text into sentences. According to the resulting sentence set, divide the feature word type sequence of the input text obtained in step 5 into a set of feature word type sequences corresponding to the sentence set. Use the dynamic time warping (DTW) distance to measure the similarity between each feature word type sequence in this set and the feature word type sequence of each extraction pattern in the extraction pattern library, and select the extraction pattern with the highest similarity (that is, the smallest DTW distance), provided that the distance is below a given threshold, as the matching extraction pattern of the sentence;
Step 7. Traverse the sentence set of the input text. If a sentence in the set obtained a matching extraction pattern in step 6, fill the feature words of the sentence into the corresponding element attributes of road traffic information according to the element attribute sequence of that matching extraction pattern, generating the road traffic information corresponding to the sentence. After the traversal, check whether the centerline road attribute in the location information element and the traffic event type attribute in the type information element of the resulting road traffic information are complete; if not, use supplement rules to fill in the missing centerline road attribute in the location information element or the missing traffic event type attribute in the type information element. Finally, obtain the set of road traffic information for the input text in which the location information element and the type information element have been extracted;
Step 8. According to the different ways in which time is expressed in Internet unstructured text, manually formulate a set of regular expressions for extracting the numerical values of the time elements year, month, day, hour, minute, and second. Using this set of regular expressions together with judgment rules, extract the time element values from the input text and combine them into the traffic event start time attribute and the traffic event end time attribute, obtaining the time information element of the road traffic information;
Step 9. Fill the time information element extracted in step 8 into each item of road traffic information in the set obtained in step 7, yielding a set of road traffic information whose information elements are complete.
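For illustration only, and without limiting the claim, the data structure of step 1 and the attribute filling of steps 3 and 7 might be sketched in Python as follows. All class names, field names, and feature word type labels are hypothetical. The sketch also assumes that the sentence's feature words align one-to-one with the items of the matched pattern, which is a simplification of the DTW-based matching of step 6.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TrafficRecord:
    # Location information element
    center_road: Optional[str] = None
    start_road: Optional[str] = None
    end_road: Optional[str] = None
    start_direction: Optional[str] = None
    end_direction: Optional[str] = None
    # Type information element
    event_type: Optional[str] = None
    # Time information element
    start_time: Optional[str] = None
    end_time: Optional[str] = None


@dataclass
class ExtractionPattern:
    type_sequence: List[str]                 # feature word types, in order
    attribute_sequence: List[Optional[str]]  # same length; None means "not mapped"


def fill_record(words: List[Tuple[str, str]], pattern: ExtractionPattern) -> TrafficRecord:
    """Map (word, type) pairs to record attributes by position in the pattern."""
    record = TrafficRecord()
    for (word, _word_type), attribute in zip(words, pattern.attribute_sequence):
        if attribute is not None:
            setattr(record, attribute, word)
    return record
```

Keeping the two sequences the same length means the mapping from a matched word to an attribute is a simple positional lookup.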
2. The method for extracting road traffic information from Internet unstructured text according to claim 1, characterized in that, in step 6, when the DTW distance is used to measure the similarity between the feature word type sequence of each sentence and the feature word type sequence of each extraction pattern in the extraction pattern library, the distance between sequence items is computed as follows:
If c_i = t_j, let d(c_i, t_j) = 0;
If c_i ≠ t_j and t_j is a road name word or a road event word, let d(c_i, t_j) = 2;
If c_i ≠ t_j and t_j is an auxiliary positioning word, a direction descriptor, a preposition, or a general word, let d(c_i, t_j) = 1;
where c_i denotes the i-th item of the feature word type sequence of a sentence of the input text, t_j denotes the j-th item of the feature word type sequence of the extraction pattern, and d(c_i, t_j) denotes the distance between c_i and t_j.
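As a hedged illustration, the item distance defined above can be plugged into the textbook DTW recurrence. The claim does not state the recurrence, so the standard one (current cost plus the minimum of the three neighboring cells) is assumed here; the type labels and the names `item_distance`, `dtw_distance`, and `best_pattern` are hypothetical.

```python
from typing import List, Optional

# Types whose mismatch costs 2; all other types cost 1 on mismatch.
HEAVY_TYPES = {"road_name", "road_event"}


def item_distance(c: str, t: str) -> int:
    """d(c_i, t_j): 0 if equal, 2 if t is a road name or event word, else 1."""
    if c == t:
        return 0
    return 2 if t in HEAVY_TYPES else 1


def dtw_distance(sentence: List[str], pattern: List[str]) -> float:
    """Standard dynamic time warping distance between two type sequences."""
    n, m = len(sentence), len(pattern)
    inf = float("inf")
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = item_distance(sentence[i - 1], pattern[j - 1])
            table[i][j] = step + min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
    return table[n][m]


def best_pattern(sentence: List[str], patterns: List[List[str]],
                 threshold: float) -> Optional[int]:
    """Index of the closest pattern whose distance is below threshold, else None."""
    distances = [dtw_distance(sentence, p) for p in patterns]
    if not distances:
        return None
    k = min(range(len(distances)), key=distances.__getitem__)
    return k if distances[k] < threshold else None
```

Weighting road name and road event mismatches more heavily penalizes patterns that disagree with the sentence on the words that carry the location and the event type.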
3. The method for extracting road traffic information from Internet unstructured text according to claim 1, characterized in that, in step 7, the supplement rule for filling in a missing traffic event type attribute in the type information element of road traffic information is:
(1) The road traffic information currently being processed corresponds to sentence s_i; set j = i;
(2) Set j = j - 1 and read sentence s_j. If sentence s_j exists, go to step (3); otherwise, go to step (6);
(3) If the feature word type sequence of sentence s_j follows the order "type information element, location information element", go to step (4); otherwise, go to step (5);
(4) Assign the traffic event type attribute corresponding to sentence s_j to the road traffic information currently being processed; the supplement process ends;
(5) If the feature word type sequence of sentence s_j follows the order "location information element, type information element", go to step (6); otherwise, go to step (2);
(6) Sentence s_j is unrelated to sentence s_i; set j = i and go to step (7);
(7) Set j = j + 1 and read sentence s_j. If sentence s_j exists, go to step (8); otherwise, the supplement process ends;
(8) If the feature word type sequence of sentence s_j follows the order "location information element, type information element", go to step (4); otherwise, go to step (9);
(9) If the feature word type sequence of sentence s_j follows the order "type information element, location information element", sentence s_j is unrelated to sentence s_i and the supplement process ends; otherwise, go to step (7).
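The two-direction search of this rule might be sketched as follows, purely as an illustration. The `TypedSentence` class and the labels "type-location" and "location-type" are hypothetical stand-ins for the two sequence orders named in the steps.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TypedSentence:
    # Hypothetical summary of a sentence after pattern matching.
    order: Optional[str]              # "type-location", "location-type", or None
    event_type: Optional[str] = None  # traffic event type attribute, if present


def supplement_event_type(sentences: List[TypedSentence], i: int) -> Optional[str]:
    """Borrow an event type for sentence i from a neighboring sentence, if any."""
    # Steps (2) to (5): scan backwards for a sentence that names the type first.
    j = i - 1
    while j >= 0:
        s = sentences[j]
        if s.order == "type-location":
            return s.event_type       # step (4)
        if s.order == "location-type":
            break                     # step (6): earlier text is unrelated
        j -= 1
    # Steps (7) to (9): scan forwards for a sentence that names the type last.
    j = i + 1
    while j < len(sentences):
        s = sentences[j]
        if s.order == "location-type":
            return s.event_type       # step (4)
        if s.order == "type-location":
            return None               # step (9): later text is unrelated
        j += 1
    return None
```

The intuition is that a type stated before a list of locations applies to the sentences that follow it, while a type stated after a list of locations applies to the sentences that precede it.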
4. The method for extracting road traffic information from Internet unstructured text according to claim 1, characterized in that, in step 7, the supplement rule for filling in a missing centerline road attribute in the location information element of road traffic information is:
(1) The road traffic information currently being processed corresponds to sentence s_i; set j = i;
(2) Set j = j - 1 and read sentence s_j. If sentence s_j exists, go to step (3); otherwise, the supplement process ends;
(3) If the feature word type sequence of sentence s_j follows the order "type information element, location information element" and contains a centerline road attribute, go to step (4); otherwise, go to step (5);
(4) Assign the centerline road attribute of the road traffic information corresponding to sentence s_j to the road traffic information currently being processed; the supplement process ends;
(5) If sentence s_j has corresponding road traffic information and that information lacks the centerline road attribute, go to step (2); otherwise, the supplement process ends.
CN201410115332.4A 2014-03-25 2014-03-25 Method for extracting road traffic information from Internet unstructured text Expired - Fee Related CN103886080B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410115332.4A CN103886080B (en) 2014-03-25 2014-03-25 Method for extracting road traffic information from Internet unstructured text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410115332.4A CN103886080B (en) 2014-03-25 2014-03-25 Method for extracting road traffic information from Internet unstructured text

Publications (2)

Publication Number Publication Date
CN103886080A CN103886080A (en) 2014-06-25
CN103886080B true CN103886080B (en) 2017-01-25

Family

ID=50954972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410115332.4A Expired - Fee Related CN103886080B (en) 2014-03-25 2014-03-25 Method for extracting road traffic information from Internet unstructured text

Country Status (1)

Country Link
CN (1) CN103886080B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170125
