CN106469203A - Method and device for screening incident data - Google Patents

Method and device for screening incident data

Publication number
CN106469203A
Authority
CN
China
Prior art keywords
word
data
denoising
vocabulary
quasi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610796947.7A
Other languages
Chinese (zh)
Other versions
CN106469203B (en
Inventor
刘菲菲
王芳
祝笑舟
常璐
牛珍珍
王程
汤智谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lian Technology Co Ltd
Original Assignee
Beijing Lian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lian Technology Co Ltd filed Critical Beijing Lian Technology Co Ltd
Priority to CN201610796947.7A priority Critical patent/CN106469203B/en
Publication of CN106469203A publication Critical patent/CN106469203A/en
Application granted granted Critical
Publication of CN106469203B publication Critical patent/CN106469203B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a device for screening incident data. The method mainly comprises: building a denoising summary table, a combination vocabulary and a reverse vocabulary; and screening the collected incident-related data according to one or more of the denoising summary table, the combination vocabulary and the reverse vocabulary. Natural language analysis provides the basis for building the incident keyword vocabularies. To ensure data accuracy, the invention screens incident data with multiple vocabularies. To address both the completeness and the accuracy of the data, the invention also uses precision and recall to evaluate the performance of each vocabulary quantitatively, which provides a basis for updating the vocabularies.

Description

Method and device for screening incident data
Technical field
The present invention relates to the field of computer technology applications, and in particular to a method and a device for screening incident data.
Background
At present the world is in a period of frequent incidents. Incidents of all kinds occur often and pose a serious threat to people's lives and property. When an incident occurs, fast and effective emergency decision-making is vital to reducing losses. The lessons of historical incidents are an important reference when emergency decisions are made, so incident data needs to be collected and studied. In addition, the study and analysis of historical incidents is important for preventing and predicting incidents.
However, as Web publishing channels such as microblogs and social networks continue to emerge, the variety and volume of data are growing and accumulating at an unprecedented rate, and the collection of incident-related data faces a severe challenge. In current incident data collection, screening is mostly performed with single keywords. The accuracy of this approach is often unsatisfactory: the results contain a large amount of irrelevant data, which makes the manual workload heavy and hampers the study and analysis of incident data.
Summary of the invention
In view of the above analysis, the present invention aims to provide a method and a device for screening incident data that apply multiple vocabularies to solve the problem of screening incident data in existing network data collection.
The purpose of the present invention is mainly achieved through the following technical solutions:
The invention provides a method for screening incident data, comprising:
building a denoising summary table, a combination vocabulary and a reverse vocabulary;
screening the collected incident-related data according to one or more of the denoising summary table, the combination vocabulary and the reverse vocabulary.
Further, building the denoising summary table, the combination vocabulary and the reverse vocabulary specifically comprises:
preprocessing data using natural language analysis technology, and building the basic denoising summary table and the denoising summary table through quasi-denoising word analysis;
based on the denoising summary table, preprocessing data using natural language analysis technology, performing co-occurrence analysis between the resulting nouns and verbs and the denoising summary table, and building the combination vocabulary through combination word selection;
based on the combination vocabulary, preprocessing data using natural language analysis technology, and building the reverse vocabulary through reverse word analysis.
Further, building the denoising summary table specifically comprises:
According to the basic denoising summary table, historical data is filtered by denoising word matching to obtain training set TD1. The data in TD1 is labeled according to whether it is incident data, which yields the incident data training set TD11 and the non-incident data training set TD10. The open-source Word word segmentation component performs word segmentation, part-of-speech tagging and word frequency statistics on the data in TD11 and TD10; only nouns and verbs are retained, and the retained terms are added to the keyword library. After this preprocessing, the terms obtained from TD11 are sorted by word frequency from high to low and, through quasi-denoising word analysis, words with higher frequency are preferentially selected as denoising words to generate the denoising summary table.
Further, the method also comprises:
Filtering with the denoising summary table: the collected data is filtered by denoising word matching against the denoising summary table. A record that matches is labeled as quasi-incident data; otherwise it is labeled as quasi-non-incident data. This yields the quasi-incident data set D11 and the quasi-non-incident data set D10.
Further, the method also comprises:
Updating the denoising summary table: the open-source Word word segmentation component performs word segmentation and part-of-speech tagging on the quasi-non-incident data set D10, and only nouns and verbs are retained. The terms obtained by preprocessing are checked for duplicates against the terms in the keyword library. If a word is not in the keyword library, it is marked as a new word and added to the keyword library, which is updated; otherwise the process waits for new data. Each new word is examined to determine whether it is an incident feature word, which yields new denoising words. The new denoising words obtained through quasi-denoising word analysis are added to the denoising summary table, which is updated. The newly added denoising words are then used to screen D10 for incident data again, and D11 and D10 are updated.
Further, building the combination vocabulary specifically comprises:
According to the denoising summary table, an appropriate amount of correct historical incident data is selected to generate the incident data training set TD21. The open-source Word word segmentation component performs word segmentation and part-of-speech tagging on TD21, and nouns and verbs are retained. Co-occurrence analysis is performed between these nouns and verbs and the denoising words in the denoising summary table, and the co-occurrence frequencies are counted to obtain a co-occurrence word set, which is also saved in the co-occurrence lexicon. The co-occurrence word set is sorted by co-occurrence frequency from high to low and, in combination with the denoising summary table, co-occurrence words with higher frequency are preferentially selected as combination words to generate the combination vocabulary.
Further, the method also comprises:
Filtering the quasi-incident data set D11 with the combination vocabulary: the records in D11 are matched against the co-occurrence words in the combination vocabulary. A record that matches is labeled as quasi-incident data; otherwise it is labeled as quasi-non-incident data. This yields the quasi-incident data set D21 and the quasi-non-incident data set D20.
Further, the method also comprises:
Updating the combination vocabulary with the quasi-non-incident data set D20: the open-source Word word segmentation component performs word segmentation and part-of-speech tagging on D20, and only nouns and verbs are retained. Co-occurrence analysis is performed between these nouns and verbs and the denoising words in the denoising summary table, and the co-occurrence frequencies are counted. The co-occurrence words obtained are checked for duplicates against the co-occurrence words in the co-occurrence lexicon to determine whether they are new. A new co-occurrence word goes through combination word selection and is added to the co-occurrence lexicon, which is updated; otherwise the process waits for new data. The new co-occurrence words are sorted by co-occurrence frequency from high to low, and those with incident features are selected as newly added combination words.
Further, the method also comprises:
The newly added combination words are added to the combination vocabulary, which is updated. The newly added combination words are then used to screen the quasi-non-incident data set D20 for incident data again, and the quasi-incident data set D21 and the quasi-non-incident data set D20 are updated.
Further, building the reverse vocabulary specifically comprises:
According to the quasi-incident data set D21 and based on the combination vocabulary, historical incident data and noise data are used to generate the incident data training set TD31 and the non-incident data training set TD30. The open-source Word word segmentation component performs word segmentation, part-of-speech tagging and word frequency statistics on TD31 and TD30; only nouns and verbs are retained, and all terms obtained are added to the training lexicon. The nouns and verbs obtained from TD30 are checked for duplicates against the nouns and verbs in TD31, and only the nouns and verbs exclusive to the non-incident data training set TD30 are retained. Words with higher frequency are preferentially selected as reverse words to generate the reverse vocabulary.
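The reverse word selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the training sets have already been segmented into lists of nouns and verbs, and the function name `build_reverse_vocabulary` and the `limit` cut-off are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def build_reverse_vocabulary(
    incident_docs: Iterable[List[str]],
    non_incident_docs: Iterable[List[str]],
    limit: int,
) -> List[str]:
    """Rank terms that occur only in the non-incident training set by frequency."""
    # Every term seen in the incident training set (TD31) is excluded.
    incident_terms = {term for doc in incident_docs for term in doc}
    # Count the remaining terms of the non-incident training set (TD30).
    counts = Counter(
        term
        for doc in non_incident_docs
        for term in doc
        if term not in incident_terms
    )
    # Assumption: "higher frequency first" is modelled as a simple top-N cut.
    return [term for term, _ in counts.most_common(limit)]


# Hypothetical, already segmented training data.
td31 = [["earthquake", "rescue"], ["fire", "evacuation"]]
td30 = [["earthquake", "drill", "drill"], ["movie", "drill"], ["movie"]]
reverse_words = build_reverse_vocabulary(td31, td30, limit=2)
```

In this sketch "earthquake" is dropped because it also appears in the incident training set, so only words that signal noise remain as reverse word candidates.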
Further, the method also comprises:
Filtering the quasi-incident data set D21 with the reverse vocabulary: the records in D21 are matched against the reverse words in the reverse vocabulary. A record that matches is labeled as quasi-non-incident data; otherwise it is labeled as quasi-incident data. This yields the quasi-incident data set D31 and the quasi-non-incident data set D30.
Further, the method also comprises:
Updating the reverse vocabulary based on the quasi-incident data set D31: the open-source Word word segmentation component performs word segmentation, part-of-speech tagging and word frequency statistics on D31, and only nouns and verbs are retained. The nouns and verbs obtained are checked for duplicates against the terms in the training lexicon to determine whether they are new words. A new word goes through reverse word analysis and is added to the training lexicon, which is updated; otherwise the process waits for new data. The new words are sorted by word frequency from high to low, and words with higher frequency are preferentially selected as newly added reverse words. The newly added reverse words are added to the reverse vocabulary, which is updated. The quasi-incident data set D31 is then filtered with the newly added reverse words, and the quasi-incident data set D31 and the quasi-non-incident data set D30 are updated.
Further, the method also comprises:
Evaluating the performance of one or more of the denoising summary table, the combination vocabulary and the reverse vocabulary: the number of correct incidents in the quasi-incident data set is counted to evaluate the accuracy of the data. That number is then combined with the number of correct incidents in the quasi-non-incident data set to evaluate the completeness of the data, which completes the assessment of vocabulary performance.
The present invention also provides a device for screening incident data, comprising:
a vocabulary building unit, configured to build a denoising summary table, a combination vocabulary and a reverse vocabulary;
a data screening unit, configured to screen the collected incident-related data according to one or more of the denoising summary table, the combination vocabulary and the reverse vocabulary.
Further, the vocabulary building unit includes at least one or more of the following modules:
a first building module, configured to preprocess data using natural language analysis technology and to build the basic denoising summary table and the denoising summary table through quasi-denoising word analysis;
a second building module, configured to preprocess data using natural language analysis technology based on the denoising summary table, to perform co-occurrence analysis between the resulting nouns and verbs and the denoising summary table, and to build the combination vocabulary through combination word selection;
a third building module, configured to preprocess data using natural language analysis technology based on the combination vocabulary and to build the reverse vocabulary through reverse word analysis.
Further, the first building module is specifically configured to: filter historical data by denoising word matching according to the basic denoising summary table to obtain training set TD1; label the data in TD1 according to whether it is incident data, which yields the incident data training set TD11 and the non-incident data training set TD10; use the open-source Word word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the data in TD11 and TD10, retaining only nouns and verbs and adding the retained terms to the keyword library; and, after this preprocessing, sort the terms obtained from TD11 by word frequency from high to low and, through quasi-denoising word analysis, preferentially select words with higher frequency as denoising words to generate the denoising summary table.
Further, the second building module is specifically configured to: select, according to the denoising summary table, an appropriate amount of correct historical incident data to generate the incident data training set TD21; use the open-source Word word segmentation component to perform word segmentation and part-of-speech tagging on TD21, retaining nouns and verbs; perform co-occurrence analysis between these nouns and verbs and the denoising words in the denoising summary table and count the co-occurrence frequencies to obtain a co-occurrence word set, which is also saved in the co-occurrence lexicon; and sort the co-occurrence word set by co-occurrence frequency from high to low and, in combination with the denoising summary table, preferentially select co-occurrence words with higher frequency as combination words to generate the combination vocabulary.
Further, the third building module is specifically configured to: generate, according to the quasi-incident data set D21 and based on the combination vocabulary, the incident data training set TD31 and the non-incident data training set TD30 from historical incident data and noise data; use the open-source Word word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on TD31 and TD30, retaining only nouns and verbs and adding all terms obtained to the training lexicon; check the nouns and verbs obtained from TD30 for duplicates against the nouns and verbs in TD31, retaining only the nouns and verbs exclusive to the non-incident data training set TD30; and preferentially select words with higher frequency as reverse words to generate the reverse vocabulary.
The beneficial effects of the present invention are as follows:
The present invention applies artificial-intelligence natural language analysis technology, in particular Chinese word segmentation and part-of-speech tagging, to the building of incident keyword tables. This facilitates the extraction of incident feature words and provides a reference for generating and updating the incident keyword tables.
According to function and role, the incident keyword vocabulary built by the invention is divided into three sub-tables: the denoising summary table, the combination vocabulary and the reverse vocabulary. They filter massive data at multiple levels to improve the accuracy of incident data screening and thereby reduce manual intervention in later data processing.
The invention also defines how recall and precision are calculated, which provides a basis for assessing vocabulary performance in terms of completeness and accuracy.
Brief description of the drawings
The drawings are only for the purpose of illustrating specific embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts.
Fig. 1 is a schematic flowchart of the method according to an embodiment of the invention;
Fig. 2 is a flowchart of implementing the denoising summary table in an embodiment of the invention;
Fig. 3 is a flowchart of implementing the combination vocabulary in an embodiment of the invention;
Fig. 4 is a flowchart of implementing the reverse vocabulary in an embodiment of the invention;
Fig. 5 is a schematic structural diagram of the device according to an embodiment of the invention.
Detailed description
Preferred embodiments of the invention are described in detail below with reference to the accompanying drawings, which form a part of this application and, together with the embodiments of the invention, serve to explain the principles of the invention.
As shown in Fig. 1, which is a schematic flowchart of the method according to an embodiment of the invention, the method mainly includes: the building, application, updating and evaluation of the denoising summary table; the building, application, updating and evaluation of the combination vocabulary; and the building, application, updating and evaluation of the reverse vocabulary.
This embodiment provides a method for screening incident data. For the collected data, the denoising summary table is used to generate quasi-incident data, which is the first screening of incident data. Fig. 2 is a schematic flowchart of building the denoising summary table, comprising steps 101 to 105.
Step 101: Collect information and build the basic denoising summary table
Information is collected from incident classification systems, emergency plans, laws and regulations, and notices about incidents issued by other relevant departments, and the basic denoising summary table is generated from these sources.
Specifically, the open-source Word word segmentation component first performs Chinese word segmentation and part-of-speech tagging on this text data (incident classification systems, emergency plans, laws and regulations, and notices about incidents from other relevant departments), retaining only the verbs and nouns. Incident feature words are then extracted, which implements the quasi-denoising word analysis and generates the basic denoising summary table.
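The "retain only verbs and nouns" part of this preprocessing can be sketched as below. The sketch does not call the Word segmentation component itself; it assumes the segmenter has already produced `(token, tag)` pairs and that noun tags start with `n` and verb tags with `v`, which is a common convention but may not match the actual tag set. The function name is hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumed tag prefixes: "n" for nouns, "v" for verbs. The tag set of the
# segmenter actually used may differ.
KEPT_PREFIXES = ("n", "v")


def keep_nouns_and_verbs(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the tokens whose part-of-speech tag marks a noun or a verb."""
    return [token for token, tag in tagged if tag.startswith(KEPT_PREFIXES)]


# Hypothetical segmenter output for one sentence of an emergency plan.
tagged_sentence = [
    ("earthquake", "n"),
    ("occurred", "v"),
    ("in", "p"),
    ("yesterday", "t"),
    ("rescue", "vn"),
]
terms = keep_nouns_and_verbs(tagged_sentence)
```

Filtering by tag prefix keeps sub-categories such as verbal nouns without listing every tag explicitly.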
Step 102: Obtain the denoising summary table from the basic denoising summary table
The basic denoising summary table is used to analyze historical data and build an incident training set and a non-incident training set. Using natural language analysis technology and quasi-denoising word analysis, the denoising summary table is built. The process mainly includes:
(1) Training set construction
An appropriate amount of historical data is filtered by denoising word matching against the basic denoising summary table obtained in step 101, which yields training set TD1. The data in TD1 is labeled according to whether it is incident data, which yields the incident data training set TD11 and the non-incident data training set TD10.
(2) Data preprocessing
The open-source Word word segmentation component performs word segmentation, part-of-speech tagging and word frequency statistics on the data in training sets TD11 and TD10 from step (1). Only nouns and verbs are retained, and the retained terms are added to the keyword library.
(3) Quasi-denoising word analysis
After this preprocessing, the terms obtained from the incident data training set TD11 are sorted by word frequency from high to low, and words with higher frequency are preferentially selected as denoising words to generate the denoising summary table.
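The frequency ranking in sub-step (3) can be sketched as follows. This is a minimal illustration under assumptions: the documents of TD11 are already reduced to lists of nouns and verbs, and the top-N cut-off `limit` stands in for the selection of "higher frequency" words, which the text does not quantify. The function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def top_terms_by_frequency(documents: Iterable[List[str]], limit: int) -> List[str]:
    """Count term frequency over all documents and return the most frequent terms."""
    counts = Counter(term for doc in documents for term in doc)
    return [term for term, _ in counts.most_common(limit)]


# Hypothetical preprocessed documents from the incident training set TD11.
td11 = [
    ["earthquake", "rescue"],
    ["earthquake", "fire"],
    ["earthquake", "rescue", "report"],
]
candidate_denoising_words = top_terms_by_frequency(td11, limit=2)
```

The result is only a candidate list; deciding which candidates actually characterize incidents is a separate analysis step in the described method.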
Step 103: Filter with the denoising summary table
The collected data is filtered with the denoising summary table obtained in step 102. The rule is: the collected data is matched against the denoising words in the denoising summary table. A record that matches is labeled as quasi-incident data; otherwise it is labeled as quasi-non-incident data. This yields the quasi-incident data set D11 and the quasi-non-incident data set D10.
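The matching rule of step 103 can be sketched as a simple partition. The sketch assumes that "matching" means the record text contains at least one denoising word as a substring; the source does not specify the matching method, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def screen_with_denoising_words(
    records: Iterable[str], denoising_words: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split records into quasi-incident (D11) and quasi-non-incident (D10) sets."""
    words = list(denoising_words)
    quasi_incident: List[str] = []
    quasi_non_incident: List[str] = []
    for text in records:
        # Assumption: a record matches if it contains any denoising word.
        if any(word in text for word in words):
            quasi_incident.append(text)
        else:
            quasi_non_incident.append(text)
    return quasi_incident, quasi_non_incident


# Hypothetical collected records and denoising words.
collected = [
    "earthquake strikes the coastal city",
    "new phone released this week",
    "factory fire injures three workers",
]
d11, d10 = screen_with_denoising_words(collected, ["earthquake", "fire"])
```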
Step 104: Update the denoising summary table
The quasi-non-incident data set D10 obtained in step 103 is used to update the denoising summary table by natural language analysis. The process mainly includes:
(1) Data preprocessing
The open-source Word word segmentation component performs word segmentation and part-of-speech tagging on the quasi-non-incident data set D10, and only nouns and verbs are retained.
(2) New word identification
The terms obtained by the preprocessing above are checked for duplicates against the terms in the keyword library. If a word is not in the keyword library, it is marked as a new word and added to the keyword library, which is updated; otherwise the process waits for new data.
(3) Quasi-denoising word analysis
The new words are examined in order of word frequency from high to low to determine whether they are incident feature words, which yields new denoising words.
(4) Denoising word update
The new denoising words obtained through quasi-denoising word analysis are added to the denoising summary table, which is updated. The newly added denoising words are then used to screen the quasi-non-incident data set D10 for incident data again, and the quasi-incident data set D11 and the quasi-non-incident data set D10 are updated.
In addition, the denoising summary table should also be updated promptly when documents such as notices from relevant departments and related emergency plans refine or update the description and classification of incidents.
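The new word identification in sub-step (2) of step 104 amounts to a duplicate check against the keyword library. A minimal sketch, assuming the keyword library can be modelled as an in-memory set (the source does not say how it is stored) and using a hypothetical function name:

```python
from typing import Iterable, List, Set


def find_new_words(terms: Iterable[str], keyword_library: Set[str]) -> List[str]:
    """Return terms missing from the keyword library and add them to it."""
    new_words: List[str] = []
    for term in terms:
        if term not in keyword_library:
            # The library is updated in place so that a repeated term is
            # reported as new only once.
            keyword_library.add(term)
            new_words.append(term)
    return new_words


# Hypothetical keyword library and terms extracted from D10.
library = {"earthquake"}
new_words = find_new_words(["earthquake", "landslide", "landslide", "flood"], library)
```

The new words would then go through quasi-denoising word analysis before any of them is added to the denoising summary table.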
Step 105: Evaluate the denoising summary table
For the quasi-incident data set D11 and the quasi-non-incident data set D10 generated after the denoising summary table is updated in step 104, the denoising summary table is evaluated using evaluation metrics borrowed from retrieval systems. The calculation is as follows:
(1) Precision:
(2) Recall:
where D1i is any record in the quasi-incident data set D11, D1j is any record in the quasi-non-incident data set D10, and |*| denotes the amount of data.
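The formulas themselves are not reproduced in this text. The sketch below is one reading that is consistent with the description of the evaluation elsewhere in the document: precision is the share of correct incidents within D11, and recall is the share of all correct incidents (in D11 and D10 together) that ended up in D11. The function name and the boolean label lists are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def precision_recall(
    d11_is_incident: Sequence[bool], d10_is_incident: Sequence[bool]
) -> Tuple[float, float]:
    """Compute precision and recall from manual labels of D11 and D10 records."""
    true_in_d11 = sum(d11_is_incident)  # correct incidents kept by the table
    true_in_d10 = sum(d10_is_incident)  # correct incidents the table missed
    precision = true_in_d11 / len(d11_is_incident) if d11_is_incident else 0.0
    total_true = true_in_d11 + true_in_d10
    recall = true_in_d11 / total_true if total_true else 0.0
    return precision, recall


# Hypothetical labels: True means the record is a correct incident.
precision, recall = precision_recall(
    [True, True, True, False],  # records in D11
    [True, True, False],        # records in D10
)
```

Here three of the four D11 records are correct incidents, and two further incidents were left in D10, so the sketch yields a precision of 3/4 and a recall of 3/5.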
Based on the quasi-incident data set D11 obtained above, the combination vocabulary is used for a second screening of incident data to improve the accuracy of the keyword vocabulary. Steps 106 to 109 below describe the process for the combination vocabulary, as shown in Fig. 3.
Step 106: Generate the combination vocabulary
Using natural language analysis, an appropriate amount of historical data is selected as training data to generate the combination vocabulary. The process includes:
(1) Training set construction
According to the denoising summary table above, an appropriate amount of correct historical incident data is selected to generate the incident data training set TD21.
(2) Data preprocessing
The open-source Word word segmentation component performs word segmentation and part-of-speech tagging on the incident data training set TD21, and nouns and verbs are retained.
(3) Co-occurrence analysis
Co-occurrence analysis is performed between the nouns and verbs obtained in step (2) and the denoising words in the denoising summary table, and the co-occurrence frequencies are counted to obtain a co-occurrence word set, which is also saved in the co-occurrence lexicon.
(4) Combination word selection
The co-occurrence word set obtained above is sorted by co-occurrence frequency from high to low and, in combination with the denoising summary table, co-occurrence words with higher frequency are preferentially selected as combination words to generate the combination vocabulary.
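Sub-steps (3) and (4) can be sketched as counting pairs of a denoising word and another term. The sketch assumes document-level co-occurrence (two words co-occur if they appear in the same record); the source does not state the co-occurrence window, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def count_cooccurrences(
    documents: Iterable[List[str]], denoising_words: Set[str]
) -> Counter:
    """Count how often each term appears in the same document as a denoising word."""
    counts: Counter = Counter()
    for doc in documents:
        terms = set(doc)  # count each pair at most once per document
        for anchor in terms & denoising_words:
            for other in terms - {anchor}:
                counts[(anchor, other)] += 1
    return counts


# Hypothetical preprocessed documents from TD21.
td21 = [
    ["earthquake", "casualties", "rescue"],
    ["earthquake", "casualties"],
    ["earthquake", "drill"],
]
pair_counts = count_cooccurrences(td21, {"earthquake"})
# Pairs ranked from the highest co-occurrence frequency to the lowest.
ranked: List[Tuple[Tuple[str, str], int]] = pair_counts.most_common()
```

The highest-ranked pairs are the candidates for combination words; a low-frequency pair such as `("earthquake", "drill")` would be less likely to be selected.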
Step 107: Apply the combination vocabulary
The quasi-incident data set D11 is filtered with the combination vocabulary. The rule is: the records in D11 are matched against the co-occurrence words in the combination vocabulary. A record that matches is labeled as quasi-incident data; otherwise it is labeled as quasi-non-incident data. This yields the quasi-incident data set D21 and the quasi-non-incident data set D20.
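The second screening can be sketched as below. It assumes that each entry of the combination vocabulary is a pair of a denoising word and a co-occurrence word, and that a record matches when it contains both; the source does not spell out the entry format, so this is only one possible interpretation, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def screen_with_combinations(
    records: Iterable[str], combinations: Iterable[Tuple[str, str]]
) -> Tuple[List[str], List[str]]:
    """Split D11 into quasi-incident (D21) and quasi-non-incident (D20) sets."""
    pairs = list(combinations)
    quasi_incident: List[str] = []
    quasi_non_incident: List[str] = []
    for text in records:
        # Assumption: both words of at least one pair must occur in the record.
        if any(first in text and second in text for first, second in pairs):
            quasi_incident.append(text)
        else:
            quasi_non_incident.append(text)
    return quasi_incident, quasi_non_incident


# Hypothetical quasi-incident records and combination vocabulary.
d11_records = [
    "earthquake causes casualties in the county",
    "earthquake drill held at a primary school",
    "fire prompts evacuation of a shopping mall",
]
vocabulary = [("earthquake", "casualties"), ("fire", "evacuation")]
d21, d20 = screen_with_combinations(d11_records, vocabulary)
```

Requiring both words is what lets this stage reject records such as the earthquake drill, which a single keyword would have accepted.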
Step 108: Update the combination vocabulary
Based on natural language analysis technology, the combination vocabulary is updated using the quasi-non-incident data set D20. The specific process includes:
(1) Data preprocessing
The open-source Word segmentation component is used to perform word segmentation and part-of-speech tagging on the quasi-non-incident data set D20; only the nouns and verbs are retained.
(2) Co-occurrence analysis
The nouns and verbs obtained from the data preprocessing above are each analyzed for co-occurrence with the denoising words in the denoising summary table, and the co-occurrence frequencies are counted.
(3) New co-occurrence word identification
The co-occurrence words obtained in step (2) are checked for duplicates against the co-occurrence words in the co-occurrence dictionary to determine whether each is a new co-occurrence word. If it is, it proceeds to combination word selection and is also added to the co-occurrence dictionary, which is updated; otherwise the process waits for new data.
(4) Combination word selection
The new co-occurrence words obtained in step (3) are sorted from high to low by co-occurrence frequency, and those with incident characteristics are selected as newly added combination words.
(5) Combination vocabulary update
The newly added combination words obtained in step (4) are added to the combination vocabulary, which is updated. At the same time, the newly added combination words are used to screen the quasi-non-incident data set D20 for incident data again, updating the quasi-incident data set D21 and the quasi-non-incident data set D20.
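Sub-steps (2) to (4) above can be sketched in a few lines. The sketch assumes the data has already been segmented and reduced to nouns and verbs (one token list per item), counts co-occurrence once per item, and treats a co-occurrence word as a (denoising word, other word) pair. These are illustrative assumptions, and the name `find_new_cooccurrences` is hypothetical. Deciding which candidates actually have incident characteristics is left to a later review step, as the text does not specify an automatic criterion.

```python
from collections import Counter


def find_new_cooccurrences(d20_tokens, denoising_words, cooccurrence_dictionary):
    """Return new co-occurrence pairs ranked by frequency (highest first).

    d20_tokens: list of token lists (nouns and verbs only) from D20.
    denoising_words: set of denoising words.
    cooccurrence_dictionary: set of known pairs; updated in place.
    """
    counts = Counter()
    for tokens in d20_tokens:
        present = set(tokens)  # count each pair at most once per item
        for d in present & denoising_words:
            for w in present - denoising_words:
                counts[(d, w)] += 1

    # Duplicate check against the co-occurrence dictionary.
    new_pairs = {p: c for p, c in counts.items() if p not in cooccurrence_dictionary}
    cooccurrence_dictionary.update(new_pairs.keys())

    # Sort by descending frequency; ties are broken alphabetically.
    return sorted(new_pairs, key=lambda p: (-new_pairs[p], p))
```

The returned list is only a ranked candidate list; the newly added combination words would be chosen from it.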
Step 109: Evaluate the combination vocabulary
For the quasi-incident data set D21 and the quasi-non-incident data set D20 generated after the combination vocabulary is updated, the evaluation indices of retrieval systems are adopted to evaluate the combination vocabulary. The specific calculation is:
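(1) Precision:
$$P = \frac{\left|\{D_{2i} : D_{2i}\ \text{is incident data}\}\right|}{\left|D21\right|}$$
(2) Recall:
$$R = \frac{\left|\{D_{2i} : D_{2i}\ \text{is incident data}\}\right|}{\left|\{D_{2i} : D_{2i}\ \text{is incident data}\}\right| + \left|\{D_{2j} : D_{2j}\ \text{is incident data}\}\right|}$$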
Here, D2i is any item in the quasi-incident data set D21, D2j is any item in the quasi-non-incident data set D20, and | * | denotes the size of a data set.
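As a worked illustration of the two indices, the sketch below follows the usual retrieval-system definitions: precision is the share of items in the quasi-incident set that are genuinely incident data, and recall is the share of all genuine incident data that ended up in the quasi-incident set. It assumes that each item carries a manually verified label (True for genuine incident data); the function name `precision_recall` and the sample labels are hypothetical.

```python
def precision_recall(d21_labels, d20_labels):
    """Compute (precision, recall) for one screening stage.

    d21_labels: booleans for items in the quasi-incident set.
    d20_labels: booleans for items in the quasi-non-incident set.
    True means the item was verified to be genuine incident data.
    """
    hits = sum(d21_labels)    # genuine incidents kept by the vocabulary
    misses = sum(d20_labels)  # genuine incidents filtered out by mistake
    precision = hits / len(d21_labels) if d21_labels else 0.0
    recall = hits / (hits + misses) if (hits + misses) else 0.0
    return precision, recall


if __name__ == "__main__":
    # 3 of 4 kept items are genuine; 2 genuine items were filtered out.
    print(precision_recall([True, True, True, False], [True, True, False]))
```

The same function applies unchanged to D31 and D30 when the reverse vocabulary is evaluated.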
For the quasi-incident data set D21 obtained above, the reverse vocabulary is used to filter out the non-incident data it contains, so as to improve the accuracy with which the keyword vocabulary is applied. As shown in Fig. 4, the specific steps for building the reverse vocabulary in this embodiment are as follows:
Step 110: Build the reverse vocabulary
Based on natural language analysis, an appropriate amount of historical data is selected to generate the reverse vocabulary. The specific rules are:
(1) Training set construction
The quasi-incident data set D21 obtained with the combination vocabulary contains both incident data and noise data. Based on the combination vocabulary, historical incident data and noise data are used to generate the incident data training set TD31 and the non-incident data training set TD30.
(2) Data preprocessing
The open-source Word segmentation component is used to perform word segmentation, part-of-speech tagging and word frequency statistics on the incident data training set TD31 and the non-incident data training set TD30 respectively; only the nouns and verbs are retained, and all resulting terms are added to the training dictionary.
(3) Reverse word analysis
The nouns and verbs of the non-incident data training set TD30 obtained in step (2) are checked for duplicates against the nouns and verbs of the incident data training set TD31. Only the nouns and verbs exclusive to TD30 are retained, and those with higher word frequency are preferentially selected as reverse words, generating the reverse vocabulary.
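The reverse word analysis of sub-step (3) amounts to a set difference followed by a frequency ranking. The sketch below assumes both training sets are already segmented into token lists containing only nouns and verbs, and uses a cut-off `top_k` to stand in for "preferentially selecting words with higher word frequency". The cut-off and the name `build_reverse_vocabulary` are assumptions for illustration only.

```python
from collections import Counter


def build_reverse_vocabulary(td30_tokens, td31_tokens, top_k):
    """Pick the most frequent words that occur only in the non-incident set.

    td30_tokens: token lists from the non-incident training set TD30.
    td31_tokens: token lists from the incident training set TD31.
    top_k: hypothetical cut-off for the number of reverse words.
    """
    incident_words = {w for doc in td31_tokens for w in doc}
    # Word frequency over TD30, skipping any word that also occurs in TD31.
    freq = Counter(w for doc in td30_tokens for w in doc if w not in incident_words)
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]
```

A word such as "drill" that appears often in noise data but never in genuine incident reports would rank first under this scheme.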
Step 111: Apply the reverse vocabulary
Using the reverse vocabulary, the quasi-incident data set D21 obtained with the combination vocabulary is filtered again to obtain the quasi-incident data set D31 and the quasi-non-incident data set D30, where the quasi-incident data set D31 is the incident data output by the system.
Further, the quasi-incident data set D21 is filtered with the reverse vocabulary according to the following rule: each item in D21 is matched against the reverse words in the reverse vocabulary; if the match succeeds, the item is marked as quasi-non-incident data, otherwise it is marked as quasi-incident data. This finally yields the quasi-incident data set D31 and the quasi-non-incident data set D30, where D31 is the incident data that the system obtains using the keyword vocabulary.
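Note that the rule here is inverted relative to the combination vocabulary: a successful match removes the item. A minimal sketch, assuming substring matching and the hypothetical name `screen_with_reverse_vocabulary`:

```python
def screen_with_reverse_vocabulary(d21, reverse_vocabulary):
    """Split D21 into (D31, D30): any reverse-word hit marks an item as noise."""
    d31, d30 = [], []
    for text in d21:
        hit = any(word in text for word in reverse_vocabulary)
        # A match means quasi-non-incident data; no match keeps the item.
        (d30 if hit else d31).append(text)
    return d31, d30
```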
Step 112: Update the reverse vocabulary
The reverse vocabulary is updated based on the quasi-incident data set D31. The specific rules are:
(1) Data preprocessing
The open-source Word segmentation component is used to perform word segmentation, part-of-speech tagging and word frequency statistics on the quasi-incident data set D31; only the nouns and verbs are retained.
(2) New word identification
The nouns and verbs obtained in step (1) are checked for duplicates against the terms in the training dictionary to determine whether each is a new word. If it is, it proceeds to reverse word analysis and is also added to the training dictionary, which is updated; otherwise the process waits for new data.
(3) Reverse word analysis
The new words obtained in step (2) are sorted from high to low by word frequency, and those with higher word frequency are preferentially selected as newly added reverse words.
(4) Reverse vocabulary update
The newly added reverse words obtained in step (3) are added to the reverse vocabulary, which is updated. At the same time, the quasi-incident data set D31 is filtered with the newly added reverse words, updating the quasi-incident data set D31 and the quasi-non-incident data set D30.
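Sub-steps (2) to (4) can be sketched as below, again on pre-segmented token lists. The cut-off `top_k` and the name `update_reverse_vocabulary` are illustrative assumptions; the text only says that higher-frequency new words are preferred, so any real deployment would likely review the candidates before accepting them.

```python
from collections import Counter


def update_reverse_vocabulary(d31_tokens, training_dictionary, reverse_vocabulary, top_k):
    """Add the most frequent previously unseen words to the reverse vocabulary.

    d31_tokens: token lists (nouns and verbs only) from D31.
    training_dictionary: set of known terms; updated in place.
    reverse_vocabulary: list of reverse words; extended in place.
    Returns the newly added reverse words.
    """
    freq = Counter(w for doc in d31_tokens for w in doc)
    # New word identification: terms absent from the training dictionary.
    new_words = {w: c for w, c in freq.items() if w not in training_dictionary}
    training_dictionary.update(new_words.keys())

    ranked = sorted(new_words, key=lambda w: (-new_words[w], w))
    added = [w for w in ranked[:top_k] if w not in reverse_vocabulary]
    reverse_vocabulary.extend(added)
    return added
```

After the update, D31 would be filtered once more with the returned words, moving any matching items into D30.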
Step 113: Evaluate the reverse vocabulary
For the quasi-incident data set D31 and the quasi-non-incident data set D30 generated after the reverse vocabulary is updated in step 112, the evaluation indices of retrieval systems are adopted to evaluate the reverse vocabulary. The specific calculation is:
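(1) Precision:
$$P = \frac{\left|\{D_{3i} : D_{3i}\ \text{is incident data}\}\right|}{\left|D31\right|}$$
(2) Recall:
$$R = \frac{\left|\{D_{3i} : D_{3i}\ \text{is incident data}\}\right|}{\left|\{D_{3i} : D_{3i}\ \text{is incident data}\}\right| + \left|\{D_{3j} : D_{3j}\ \text{is incident data}\}\right|}$$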
Here, D3i is any item in the quasi-incident data set D31, D3j is any item in the quasi-non-incident data set D30, and | * | denotes the size of a data set.
Next, the device according to the embodiment of the present invention is described in detail.
As shown in Fig. 5, which is a structural schematic diagram of the device according to the embodiment of the present invention, the device may specifically include:
a vocabulary construction unit, mainly responsible for building the denoising summary table, the combination vocabulary and the reverse vocabulary;
a data screening unit, mainly responsible for screening the collected incident-related data according to one or more of the denoising summary table, the combination vocabulary and the reverse vocabulary.
The vocabulary construction unit includes at least one or more of the following modules:
A first building module, mainly responsible for preprocessing the data using natural language analysis technology and, through quasi-denoising word analysis, building the basic denoising summary table and the denoising summary table.
Specifically, the first building module filters historical data by denoising word matching according to the basic denoising summary table to obtain the training set TD1; marks the data in the training set TD1 according to whether it is incident data, finally obtaining the incident data training set TD11 and the non-incident data training set TD10; uses the open-source Word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the data in TD11 and TD10 respectively, retaining only the nouns and verbs and adding the retained terms to the keyword database; and, after this preprocessing, sorts the terms obtained from the incident data training set TD11 from high to low by word frequency and, through quasi-denoising word analysis, preferentially selects the words with higher word frequency as denoising words, generating the denoising summary table.
A second building module, mainly responsible for preprocessing the data using natural language analysis technology based on the denoising summary table, performing co-occurrence analysis between the resulting nouns and verbs and the denoising summary table, and building the combination vocabulary through combination word selection.
Specifically, the second building module selects, according to the denoising summary table, an appropriate amount of correct historical incident data to generate the incident data training set TD21; performs word segmentation and part-of-speech tagging on TD21 using the open-source Word segmentation component, retaining the nouns and verbs; performs co-occurrence analysis between each of these nouns and verbs and the denoising words in the denoising summary table and counts the co-occurrence frequencies, obtaining a co-occurrence word set that is also saved to the co-occurrence dictionary; and sorts the co-occurrence word set from high to low by co-occurrence frequency and, in combination with the denoising summary table, preferentially selects the co-occurrence words with higher frequency as combination words, generating the combination vocabulary.
A third building module, mainly responsible for preprocessing the data using natural language analysis technology based on the combination vocabulary, and building the reverse vocabulary through reverse word analysis.
Specifically, the third building module, according to the quasi-incident data set D21 and based on the combination vocabulary, uses historical incident data and noise data to generate the incident data training set TD31 and the non-incident data training set TD30; uses the open-source Word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on TD31 and TD30 respectively, retaining only the nouns and verbs and adding all resulting terms to the training dictionary; and checks the nouns and verbs of the non-incident data training set TD30 obtained from data preprocessing for duplicates against the nouns and verbs of the incident data training set TD31, retaining only the nouns and verbs exclusive to the non-incident data training set TD30 and preferentially selecting those with higher word frequency as reverse words, generating the reverse vocabulary.
The specific implementation of the device according to the embodiment of the present invention has already been described in detail in the method above, so it is not repeated here.
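To make the role of the data screening unit concrete, the sketch below chains the three vocabularies into one cascade. It is an illustrative reading of "one or more of" the vocabularies: a stage whose vocabulary is empty is simply skipped. The class name `DataScreeningUnit`, its fields, and the substring-matching rule are all assumptions, not part of the described device.

```python
from dataclasses import dataclass, field


@dataclass
class DataScreeningUnit:
    """Hypothetical cascade: denoising words, then combination words, then reverse words."""
    denoising_words: set = field(default_factory=set)
    combination_words: list = field(default_factory=list)  # (word, word) pairs
    reverse_words: set = field(default_factory=set)

    def is_incident(self, text):
        # Stage 1: must contain at least one denoising word (if any are configured).
        if self.denoising_words and not any(w in text for w in self.denoising_words):
            return False
        # Stage 2: must contain both words of at least one combination pair.
        if self.combination_words and not any(
            a in text and b in text for a, b in self.combination_words
        ):
            return False
        # Stage 3: any reverse word rejects the item.
        return not any(w in text for w in self.reverse_words)

    def screen(self, collected):
        kept = [t for t in collected if self.is_incident(t)]
        dropped = [t for t in collected if not self.is_incident(t)]
        return kept, dropped
```

With all three vocabularies configured, an item survives only if it passes every stage, which mirrors the D11, D21 and D31 data sets produced by the method.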
In summary, the embodiments of the present invention provide a method and device for screening incident data. Through natural language analysis technology, and based on the results of Chinese word segmentation and part-of-speech tagging, the method provides a basis for building the incident keyword vocabulary. To ensure data accuracy, the present invention divides the incident keyword vocabulary into three sub-tables (the denoising summary table, the combination vocabulary and the reverse vocabulary) and screens incident data using multiple vocabularies. With regard to the comprehensiveness and accuracy of the data, the present invention also introduces precision and recall, which provide a basis for evaluating the performance of each vocabulary and for updating it.
Those skilled in the art will understand that all or part of the flow of the method in the above embodiments can be completed by a computer program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium. The computer-readable storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
The above is only a preferred specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.

Claims (18)

1. A method for screening incident data, characterized by comprising:
building a denoising summary table, a combination vocabulary and a reverse vocabulary;
screening the collected incident-related data according to one or more of the denoising summary table, the combination vocabulary and the reverse vocabulary.
2. The method according to claim 1, characterized in that the process of building the denoising summary table, the combination vocabulary and the reverse vocabulary specifically includes:
preprocessing the data using natural language analysis technology and, through quasi-denoising word analysis, building the basic denoising summary table and the denoising summary table;
based on the denoising summary table, preprocessing the data using natural language analysis technology, performing co-occurrence analysis between the resulting nouns and verbs and the denoising summary table, and building the combination vocabulary through combination word selection;
based on the combination vocabulary, preprocessing the data using natural language analysis technology, and building the reverse vocabulary through reverse word analysis.
3. The method according to claim 1 or 2, characterized in that the process of building the denoising summary table specifically includes:
filtering historical data by denoising word matching according to the basic denoising summary table to obtain the training set TD1; marking the data in the training set TD1 according to whether it is incident data, finally obtaining the incident data training set TD11 and the non-incident data training set TD10; using the open-source Word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the data in the training sets TD11 and TD10 respectively, retaining only the nouns and verbs and adding the retained terms to the keyword database; after this data preprocessing, sorting the terms obtained from the incident data training set TD11 from high to low by word frequency and, through quasi-denoising word analysis, preferentially selecting the words with higher word frequency as denoising words, generating the denoising summary table.
4. The method according to claim 1 or 2, characterized by further comprising:
filtering with the denoising summary table, that is, filtering the collected data by denoising word matching according to the denoising summary table; if the match succeeds, the data is marked as quasi-incident data, otherwise it is marked as quasi-non-incident data, finally obtaining the quasi-incident data set D11 and the quasi-non-incident data set D10.
5. The method according to claim 4, characterized by further comprising:
updating the denoising summary table: using the open-source Word segmentation component to perform word segmentation and part-of-speech tagging on the quasi-non-incident data set D10, retaining only the nouns and verbs; checking the terms obtained from data preprocessing for duplicates against the terms in the keyword database and, if a word does not exist in the keyword database, identifying it as a new word and adding it to the keyword database, which is updated, otherwise waiting for new data; performing quasi-denoising word analysis on the new words to judge whether each is an incident feature word, obtaining new denoising words; adding the new denoising words to the denoising summary table, which is updated, and at the same time using the newly added denoising words to screen the quasi-non-incident data set D10 for incident data again, updating the quasi-incident data set D11 and the quasi-non-incident data set D10.
6. The method according to claim 1 or 2, characterized in that the process of building the combination vocabulary specifically includes:
selecting, according to the denoising summary table, an appropriate amount of correct historical incident data to generate the incident data training set TD21; performing word segmentation and part-of-speech tagging on the incident data training set TD21 using the open-source Word segmentation component, retaining the nouns and verbs; performing co-occurrence analysis between each of these nouns and verbs and the denoising words in the denoising summary table and counting the co-occurrence frequencies, obtaining a co-occurrence word set that is also saved to the co-occurrence dictionary; sorting the co-occurrence word set from high to low by co-occurrence frequency and, in combination with the denoising summary table, preferentially selecting the co-occurrence words with higher frequency as combination words, generating the combination vocabulary.
7. The method according to claim 6, characterized by further comprising:
filtering the quasi-incident data set D11 with the combination vocabulary, that is, matching each item in the quasi-incident data set D11 against the co-occurrence words in the combination vocabulary; if the match succeeds, the item is marked as quasi-incident data, otherwise it is marked as quasi-non-incident data, finally generating the quasi-incident data set D21 and the quasi-non-incident data set D20.
8. The method according to claim 7, characterized by further comprising:
updating the combination vocabulary using the quasi-non-incident data set D20: using the open-source Word segmentation component to perform word segmentation and part-of-speech tagging on the quasi-non-incident data set D20, retaining only the nouns and verbs; performing co-occurrence analysis between each of the nouns and verbs obtained from this data preprocessing and the denoising words in the denoising summary table, and counting the co-occurrence frequencies; checking the resulting co-occurrence words for duplicates against the co-occurrence words in the co-occurrence dictionary to judge whether each is a new co-occurrence word and, if it is, proceeding to combination word selection and adding it to the co-occurrence dictionary, which is updated, otherwise waiting for new data; sorting the new co-occurrence words from high to low by co-occurrence frequency and selecting those with incident characteristics as newly added combination words.
9. The method according to claim 8, characterized by further comprising:
adding the newly added combination words to the combination vocabulary, which is updated, and at the same time using the newly added combination words to screen the quasi-non-incident data set D20 for incident data again, updating the quasi-incident data set D21 and the quasi-non-incident data set D20.
10. The method according to claim 9, characterized in that the process of building the reverse vocabulary specifically includes:
according to the quasi-incident data set D21 and based on the combination vocabulary, using historical incident data and noise data to generate the incident data training set TD31 and the non-incident data training set TD30; using the open-source Word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the incident data training set TD31 and the non-incident data training set TD30 respectively, retaining only the nouns and verbs and adding all resulting terms to the training dictionary; checking the nouns and verbs of the non-incident data training set TD30 obtained from data preprocessing for duplicates against the nouns and verbs of the incident data training set TD31, retaining only the nouns and verbs exclusive to the non-incident data training set TD30, and preferentially selecting those with higher word frequency as reverse words, generating the reverse vocabulary.
11. The method according to claim 10, characterized by further comprising:
filtering the quasi-incident data set D21 with the reverse vocabulary, that is, matching each item in the quasi-incident data set D21 against the reverse words in the reverse vocabulary; if the match succeeds, the item is marked as quasi-non-incident data, otherwise it is marked as quasi-incident data, finally generating the quasi-incident data set D31 and the quasi-non-incident data set D30.
12. The method according to claim 11, characterized by further comprising:
updating the reverse vocabulary based on the quasi-incident data set D31: using the open-source Word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the quasi-incident data set D31, retaining only the nouns and verbs; checking the nouns and verbs obtained from data preprocessing for duplicates against the terms in the training dictionary to judge whether each is a new word and, if it is, proceeding to reverse word analysis and adding it to the training dictionary, which is updated, otherwise waiting for new data; sorting the new words from high to low by word frequency and preferentially selecting those with higher word frequency as newly added reverse words; adding the newly added reverse words to the reverse vocabulary, which is updated, and at the same time filtering the quasi-incident data set D31 with the newly added reverse words, updating the quasi-incident data set D31 and the quasi-non-incident data set D30.
13. The method according to claim 1 or 2, characterized by further comprising:
evaluating the performance of one or more of the denoising summary table, the combination vocabulary and the reverse vocabulary, that is, counting the number of correct incidents in the quasi-incident data set to evaluate the accuracy of the data; then combining the number of correct incidents in the quasi-incident data set with the number of correct incidents in the quasi-non-incident data set to evaluate the comprehensiveness of the data, finally completing the evaluation of vocabulary performance.
14. A device for screening incident data, characterized by comprising:
a vocabulary construction unit, for building a denoising summary table, a combination vocabulary and a reverse vocabulary;
a data screening unit, for screening the collected incident-related data according to one or more of the denoising summary table, the combination vocabulary and the reverse vocabulary.
15. The device according to claim 14, characterized in that the vocabulary construction unit includes at least one or more of the following modules:
a first building module, for preprocessing the data using natural language analysis technology and, through quasi-denoising word analysis, building the basic denoising summary table and the denoising summary table;
a second building module, for preprocessing the data using natural language analysis technology based on the denoising summary table, performing co-occurrence analysis between the resulting nouns and verbs and the denoising summary table, and building the combination vocabulary through combination word selection;
a third building module, for preprocessing the data using natural language analysis technology based on the combination vocabulary, and building the reverse vocabulary through reverse word analysis.
16. The device according to claim 14 or 15, characterized in that the first building module is specifically configured to: filter historical data by denoising word matching according to the basic denoising summary table to obtain the training set TD1; mark the data in the training set TD1 according to whether it is incident data, finally obtaining the incident data training set TD11 and the non-incident data training set TD10; use the open-source Word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the data in the training sets TD11 and TD10 respectively, retaining only the nouns and verbs and adding the retained terms to the keyword database; and, after this data preprocessing, sort the terms obtained from the incident data training set TD11 from high to low by word frequency and, through quasi-denoising word analysis, preferentially select the words with higher word frequency as denoising words, generating the denoising summary table.
17. The device according to claim 14 or 15, characterized in that the second building module is specifically configured to: select, according to the denoising summary table, an appropriate amount of correct historical incident data to generate the incident data training set TD21; perform word segmentation and part-of-speech tagging on the incident data training set TD21 using the open-source Word segmentation component, retaining the nouns and verbs; perform co-occurrence analysis between each of these nouns and verbs and the denoising words in the denoising summary table and count the co-occurrence frequencies, obtaining a co-occurrence word set that is also saved to the co-occurrence dictionary; and sort the co-occurrence word set from high to low by co-occurrence frequency and, in combination with the denoising summary table, preferentially select the co-occurrence words with higher frequency as combination words, generating the combination vocabulary.
18. The device according to claim 14 or 15, characterized in that the third building module is specifically configured to: according to the quasi-incident data set D21 and based on the combination vocabulary, use historical incident data and noise data to generate the incident data training set TD31 and the non-incident data training set TD30; use the open-source Word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the incident data training set TD31 and the non-incident data training set TD30 respectively, retaining only the nouns and verbs and adding all resulting terms to the training dictionary; and check the nouns and verbs of the non-incident data training set TD30 obtained from data preprocessing for duplicates against the nouns and verbs of the incident data training set TD31, retaining only the nouns and verbs exclusive to the non-incident data training set TD30 and preferentially selecting those with higher word frequency as reverse words, generating the reverse vocabulary.
CN201610796947.7A 2016-08-31 2016-08-31 A kind of screening technique and device of incident data Active CN106469203B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610796947.7A CN106469203B (en) 2016-08-31 2016-08-31 A kind of screening technique and device of incident data


Publications (2)

Publication Number Publication Date
CN106469203A true CN106469203A (en) 2017-03-01
CN106469203B CN106469203B (en) 2019-07-23

Family

ID=58230339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610796947.7A Active CN106469203B (en) 2016-08-31 2016-08-31 A kind of screening technique and device of incident data

Country Status (1)

Country Link
CN (1) CN106469203B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681613A (en) * 2018-07-13 2018-10-19 北京酷车易美网络科技有限公司 A kind of vehicle history vehicle condition solution read apparatus

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102937960A (en) * 2012-09-06 2013-02-20 北京邮电大学 Device and method for identifying and evaluating emergency hot topic
CN103577404A (en) * 2012-07-19 2014-02-12 中国人民大学 Microblog-oriented discovery method for new emergencies
CN104573006A (en) * 2015-01-08 2015-04-29 南通大学 Construction method of public health emergent event domain knowledge base
CN104615718A (en) * 2015-02-05 2015-05-13 北京航空航天大学 Hierarchical analysis method for social network emergency


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周智星: "网络舆情监测管理系统设计的研究与应用", 《中国优秀硕士学位论文全文数据库—信息科技辑》 *


Also Published As

Publication number Publication date
CN106469203B (en) 2019-07-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant