CN106469203A - Method and device for screening incident data - Google Patents
- Publication number
- CN106469203A CN201610796947.7A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
- G06F40/253—Grammatical analysis; Style critique
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention provides a method and a device for screening incident data. The method mainly comprises: building a denoising summary table, a combination vocabulary and a reverse vocabulary; and screening the collected incident-related data according to one or more of the denoising summary table, the combination vocabulary and the reverse vocabulary. Through natural language analysis, the invention provides a basis for building incident keyword vocabularies. To ensure data accuracy, the invention screens incident data using multiple vocabularies. With regard to the completeness and accuracy of the data, the invention also applies precision and recall to evaluate the performance of each vocabulary quantitatively, providing a basis for updating the vocabularies.
Description
Technical field
The present invention relates to the field of computer technology applications, and in particular to a method and a device for screening incident data.
Background
At present, the world has entered a period in which incidents of all kinds occur frequently, posing a grave threat to people's lives and property. When an incident occurs, fast and effective emergency decision-making plays a vital role in reducing losses. The experience and lessons of historical incidents are an important reference when formulating emergency decisions, so data on incidents needs to be collected and studied. In addition, the research and analysis of historical incidents is of great significance for preventing and predicting incidents.
However, with the continuous emergence of Web publishing channels such as microblogs and social networks, the variety and scale of data are growing and accumulating at an unprecedented rate, and the collection of incident-related data faces severe challenges. In current incident data collection, screening is mostly performed with a single keyword, and its accuracy is often unsatisfactory: the results contain a large amount of irrelevant data, which makes the manual intervention workload relatively large and greatly inconveniences the research and analysis of incident data.
Summary of the invention
In view of the above analysis, the present invention aims to provide a method and a device for screening incident data that, by applying multiple vocabularies, solve the problem of screening incident data in existing network data collection.
The purpose of the present invention is mainly achieved through the following technical solutions:
The invention provides a method for screening incident data, comprising:
building a denoising summary table, a combination vocabulary and a reverse vocabulary;
screening the collected incident-related data according to one or more of the denoising summary table, the combination vocabulary and the reverse vocabulary.
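The three vocabularies can be read as a cascade of filters. The following is a minimal sketch of that reading, not the patented implementation: it assumes that "matching" means plain substring containment, and it merges all rejected records into one list, whereas the text keeps separate rejected sets for each stage. The names `contains_any` and `screen` are hypothetical.

```python
from typing import Iterable, List, Tuple


def contains_any(text: str, words: Iterable[str]) -> bool:
    """Return True if any of the given words occurs in text (substring match, an assumption)."""
    return any(w in text for w in words)


def screen(
    records: Iterable[str],
    denoising_words: Iterable[str],
    combination_words: Iterable[str],
    reverse_words: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Split records into (quasi_incident, quasi_non_incident).

    A record is kept only if it matches a denoising word, also matches a
    combination word, and matches no reverse word.
    """
    denoising_words = list(denoising_words)
    combination_words = list(combination_words)
    reverse_words = list(reverse_words)
    kept, dropped = [], []
    for text in records:
        ok = (
            contains_any(text, denoising_words)
            and contains_any(text, combination_words)
            and not contains_any(text, reverse_words)
        )
        (kept if ok else dropped).append(text)
    return kept, dropped
```

Each stage only narrows the set kept by the previous one, which is why the later vocabularies can raise accuracy but cannot recover records rejected earlier.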
Further, the process of building the denoising summary table, the combination vocabulary and the reverse vocabulary specifically comprises:
preprocessing data using natural language analysis technology and, through candidate denoising word analysis, building the basic denoising summary table and the denoising summary table;
based on the denoising summary table, preprocessing data using natural language analysis technology, performing co-occurrence analysis between the resulting nouns and verbs and the denoising summary table, and building the combination vocabulary through combination word selection;
based on the combination vocabulary, preprocessing data using natural language analysis technology, and building the reverse vocabulary through reverse word analysis.
Further, the process of building the denoising summary table specifically comprises:
filtering historical data by denoising word matching according to the basic denoising summary table to obtain a training set TD1; labelling the data in the training set TD1 according to whether it is incident data, finally obtaining an incident data training set TD11 and a non-incident data training set TD10; using the open-source Word word-segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the data in the training sets TD11 and TD10 respectively, retaining only the nouns and verbs, and adding the retained terms to a keyword library; after this data preprocessing, sorting the terms obtained from the incident data training set TD11 by word frequency from high to low and, through candidate denoising word analysis, preferentially selecting words with higher frequency as denoising words to generate the denoising summary table.
Further, the method also comprises:
filtering with the denoising summary table, that is, filtering the collected data by denoising word matching according to the denoising summary table; if the match succeeds, the data is labelled as quasi-incident data, otherwise it is labelled as quasi-non-incident data, finally obtaining a quasi-incident data set D11 and a quasi-non-incident data set D10.
Further, the method also comprises:
updating the denoising summary table: using the open-source Word word-segmentation component to perform word segmentation and part-of-speech tagging on the quasi-non-incident data set D10, retaining only the nouns and verbs; checking the terms obtained by data preprocessing for duplicates against the terms in the keyword library and, if a word does not exist in the keyword library, marking it as a new word and adding it to the keyword library, which is thereby updated, otherwise waiting for new data; judging whether each new word obtained by data preprocessing is an incident feature word, thereby obtaining new denoising words; adding the new denoising words obtained through candidate denoising word analysis to the denoising summary table, which is thereby updated, and at the same time using the newly added denoising words to screen the quasi-non-incident data set D10 for incident data again, updating the quasi-incident data set D11 and the quasi-non-incident data set D10.
Further, the process of building the combination vocabulary specifically comprises:
selecting, according to the denoising summary table, an appropriate amount of correct historical incident data to generate an incident data training set TD21; performing word segmentation and part-of-speech tagging on the incident data training set TD21 using the open-source Word word-segmentation component, retaining the nouns and verbs; performing co-occurrence analysis between each of these nouns and verbs and the denoising words in the denoising summary table, and counting the co-occurrence frequency to obtain a co-occurrence word set, which is also saved in a co-occurrence lexicon; sorting the co-occurrence word set by co-occurrence frequency from high to low and, in combination with the denoising summary table, preferentially selecting co-occurrence words with higher frequency as combination words to generate the combination vocabulary.
Further, the method also comprises:
filtering the quasi-incident data set D11 with the combination vocabulary, that is, matching the quasi-incident data set D11 against the co-occurrence words in the combination vocabulary; if the match succeeds, the data is labelled as quasi-incident data, otherwise it is labelled as quasi-non-incident data, finally generating a quasi-incident data set D21 and a quasi-non-incident data set D20.
Further, the method also comprises:
updating the combination vocabulary using the quasi-non-incident data set D20: using the open-source Word word-segmentation component to perform word segmentation and part-of-speech tagging on the quasi-non-incident data set D20, retaining only the nouns and verbs; performing co-occurrence analysis between each of the nouns and verbs obtained by this data preprocessing and the denoising words in the denoising summary table, and counting the co-occurrence frequency; checking the resulting co-occurrence words for duplicates against the co-occurrence words in the co-occurrence lexicon to judge whether each is a new co-occurrence word and, if it is, passing it to combination word selection and adding it to the co-occurrence lexicon, which is thereby updated, otherwise waiting for new data; sorting the new co-occurrence words by co-occurrence frequency from high to low and selecting the co-occurrence words that have incident features as newly added combination words.
Further, the method also comprises:
adding the newly added combination words to the combination vocabulary, which is thereby updated, and at the same time using the newly added combination words to screen the quasi-non-incident data set D20 for incident data again, updating the quasi-incident data set D21 and the quasi-non-incident data set D20.
Further, the process of building the reverse vocabulary specifically comprises:
according to the quasi-incident data set D21 and based on the combination vocabulary, using historical incident data and noise data to generate an incident data training set TD31 and a non-incident data training set TD30; using the open-source Word word-segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the incident data training set TD31 and the non-incident data training set TD30 respectively, retaining only the nouns and verbs, and adding all resulting terms to a training lexicon; checking the nouns and verbs of the non-incident data training set TD30 obtained by data preprocessing for duplicates against the nouns and verbs in the incident data training set TD31, retaining only the nouns and verbs exclusive to the non-incident data training set TD30, and preferentially selecting words with higher frequency as reverse words to generate the reverse vocabulary.
Further, the method also comprises:
filtering the quasi-incident data set D21 with the reverse vocabulary, that is, matching the quasi-incident data set D21 against the reverse words in the reverse vocabulary; if the match succeeds, the data is labelled as quasi-non-incident data, otherwise it is labelled as quasi-incident data, finally generating a quasi-incident data set D31 and a quasi-non-incident data set D30.
Further, the method also comprises:
updating the reverse vocabulary based on the quasi-incident data set D31: using the open-source Word word-segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the quasi-incident data set D31, retaining only the nouns and verbs; checking the nouns and verbs obtained by data preprocessing for duplicates against the terms in the training lexicon to judge whether each is a new word and, if it is, passing it to reverse word analysis and adding it to the training lexicon, which is thereby updated, otherwise waiting for new data; sorting the new words by word frequency from high to low and preferentially selecting words with higher frequency as newly added reverse words; adding the newly added reverse words to the reverse vocabulary, which is thereby updated, and at the same time filtering the quasi-incident data set D31 with the newly added reverse words, updating the quasi-incident data set D31 and the quasi-non-incident data set D30.
Further, the method also comprises:
evaluating the performance of one or more of the denoising summary table, the combination vocabulary and the reverse vocabulary, that is, counting the number of correct incidents in the quasi-incident data set to evaluate the accuracy of the data; then combining the number of correct incidents in the quasi-incident data set with the number of correct incidents in the quasi-non-incident data set to evaluate the completeness of the data, thereby completing the assessment of vocabulary performance.
The present invention also provides a device for screening incident data, comprising:
a vocabulary building unit, for building a denoising summary table, a combination vocabulary and a reverse vocabulary;
a data screening unit, for screening the collected incident-related data according to one or more of the denoising summary table, the combination vocabulary and the reverse vocabulary.
Further, the vocabulary building unit comprises at least one or more of the following modules:
a first building module, for preprocessing data using natural language analysis technology and, through candidate denoising word analysis, building the basic denoising summary table and the denoising summary table;
a second building module, for preprocessing data using natural language analysis technology based on the denoising summary table, performing co-occurrence analysis between the resulting nouns and verbs and the denoising summary table, and building the combination vocabulary through combination word selection;
a third building module, for preprocessing data using natural language analysis technology based on the combination vocabulary, and building the reverse vocabulary through reverse word analysis.
Further, the first building module is specifically configured to: filter historical data by denoising word matching according to the basic denoising summary table to obtain a training set TD1; label the data in the training set TD1 according to whether it is incident data, finally obtaining an incident data training set TD11 and a non-incident data training set TD10; use the open-source Word word-segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the data in the training sets TD11 and TD10 respectively, retaining only the nouns and verbs, and add the retained terms to a keyword library; and, after this data preprocessing, sort the terms obtained from the incident data training set TD11 by word frequency from high to low and, through candidate denoising word analysis, preferentially select words with higher frequency as denoising words to generate the denoising summary table.
Further, the second building module is specifically configured to: select, according to the denoising summary table, an appropriate amount of correct historical incident data to generate an incident data training set TD21; perform word segmentation and part-of-speech tagging on the incident data training set TD21 using the open-source Word word-segmentation component, retaining the nouns and verbs; perform co-occurrence analysis between each of these nouns and verbs and the denoising words in the denoising summary table, and count the co-occurrence frequency to obtain a co-occurrence word set, which is also saved in a co-occurrence lexicon; and sort the co-occurrence word set by co-occurrence frequency from high to low and, in combination with the denoising summary table, preferentially select co-occurrence words with higher frequency as combination words to generate the combination vocabulary.
Further, the third building module is specifically configured to: according to the quasi-incident data set D21 and based on the combination vocabulary, use historical incident data and noise data to generate an incident data training set TD31 and a non-incident data training set TD30; use the open-source Word word-segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the incident data training set TD31 and the non-incident data training set TD30 respectively, retaining only the nouns and verbs, and add all resulting terms to a training lexicon; and check the nouns and verbs of the non-incident data training set TD30 obtained by data preprocessing for duplicates against the nouns and verbs in the incident data training set TD31, retaining only the nouns and verbs exclusive to the non-incident data training set TD30, and preferentially select words with higher frequency as reverse words to generate the reverse vocabulary.
The beneficial effects of the present invention are as follows:
The present invention applies artificial-intelligence natural language analysis technology, in particular Chinese word segmentation and part-of-speech tagging, to the construction of incident keyword tables. This facilitates the extraction of incident feature words and provides a reference for generating and updating incident keyword tables.
According to the function and role of each vocabulary, the incident keyword vocabulary built by the invention is divided into three sub-tables: the denoising summary table, the combination vocabulary and the reverse vocabulary. These are used to filter massive data at multiple levels, improving the accuracy of incident data screening and thereby reducing manual intervention in later data processing.
The invention also defines how recall and precision are calculated, providing a basis for assessing vocabulary performance from the perspectives of completeness and accuracy.
Brief description of the drawings
The drawings serve only to illustrate specific embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts.
Fig. 1 is a schematic flow chart of the method according to an embodiment of the present invention;
Fig. 2 is a flow chart of the implementation of the denoising summary table in an embodiment of the present invention;
Fig. 3 is a flow chart of the implementation of the combination vocabulary in an embodiment of the present invention;
Fig. 4 is a flow chart of the implementation of the reverse vocabulary in an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the device according to an embodiment of the present invention.
Detailed description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, which form a part of this application and, together with the embodiments of the invention, serve to explain the principles of the invention.
Fig. 1 is a schematic flow chart of the method according to an embodiment of the present invention. The method mainly comprises: the building, application, updating and evaluation of the denoising summary table; the building, application, updating and evaluation of the combination vocabulary; and the building, application, updating and evaluation of the reverse vocabulary.
This embodiment provides a method for screening incident data in which the denoising summary table is applied to the collected data to generate quasi-incident data, achieving the first screening of incident data. Fig. 2 is a schematic flow chart of building the denoising summary table, comprising steps 101 to 105.
Step 101: Collect information and build the basic denoising summary table
Information is collected from incident classification systems, emergency plans, laws and regulations, and notices about incidents issued by other relevant departments, and the basic denoising summary table is generated from these source data.
Specifically, the open-source Word word-segmentation component is first used to perform Chinese word segmentation and part-of-speech tagging on text data such as incident classification systems, emergency plans, laws and regulations, and notices about incidents issued by other relevant departments, retaining only the verbs and nouns. Incident feature words are then extracted to carry out candidate denoising word analysis, generating the basic denoising summary table.
Step 102: Obtain the denoising summary table based on the basic denoising summary table
Historical data is analysed with the basic denoising summary table to build an incident training set and a non-incident training set. Natural language analysis technology is then applied and, through candidate denoising word analysis, the denoising summary table is built. The detailed process mainly comprises:
(1) Training set construction
For an appropriate amount of historical data, the data is filtered by denoising word matching according to the basic denoising summary table obtained in step 101, giving the training set TD1. The data in TD1 is labelled according to whether it is incident data, finally giving the incident data training set TD11 and the non-incident data training set TD10.
(2) Data preprocessing
The open-source Word word-segmentation component is used to perform word segmentation, part-of-speech tagging and word frequency statistics on the data in the training sets TD11 and TD10 from step (1) respectively. Only the nouns and verbs are retained, and the retained terms are added to the keyword library.
(3) Candidate denoising word analysis
After the above data preprocessing, the terms obtained from the incident data training set TD11 are sorted by word frequency from high to low, and words with higher frequency are preferentially selected as denoising words to generate the denoising summary table.
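The preprocessing and ranking in steps (2) and (3) can be sketched as follows. This is an illustrative sketch only: it assumes the documents have already been segmented and tagged into `(word, pos)` pairs by some tokenizer, that the tags `"n"` and `"v"` denote nouns and verbs, and that the top `top_k` words are taken as candidates. The text does not state a cut-off, and the final choice of denoising words may involve human judgement. The names `KEPT_POS` and `rank_candidates` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed part-of-speech tag names for nouns and verbs.
KEPT_POS = {"n", "v"}


def rank_candidates(
    tagged_docs: Iterable[Iterable[Tuple[str, str]]], top_k: int
) -> List[str]:
    """Return the top_k most frequent nouns and verbs across the tagged documents."""
    counts = Counter(
        word
        for doc in tagged_docs
        for word, pos in doc
        if pos in KEPT_POS
    )
    return [word for word, _ in counts.most_common(top_k)]
```

Applied to the tagged contents of TD11, the returned list would be the frequency-ordered candidates from which denoising words are chosen.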
Step 103: Filter with the denoising summary table
The collected data is filtered with the denoising summary table obtained in step 102. The specific rule is: the collected data is filtered by denoising word matching against the denoising summary table obtained in step 102; if the match succeeds, the data is labelled as quasi-incident data, otherwise it is labelled as quasi-non-incident data, finally giving the quasi-incident data set D11 and the quasi-non-incident data set D10.
Step 104: Update the denoising summary table
For the quasi-non-incident data set D10 obtained in step 103, the denoising summary table is updated by natural language analysis. The detailed process mainly comprises:
(1) Data preprocessing
The open-source Word word-segmentation component is used to perform word segmentation and part-of-speech tagging on the quasi-non-incident data set D10, retaining only the nouns and verbs.
(2) New word identification
The terms obtained by the above data preprocessing are checked for duplicates against the terms in the keyword library. If a word does not exist in the keyword library, it is marked as a new word and added to the keyword library, which is thereby updated; otherwise the process waits for new data.
(3) Candidate denoising word analysis
The new words obtained by data preprocessing are examined in order of word frequency from high to low to judge whether each is an incident feature word, giving the new denoising words.
(4) Denoising word update
The new denoising words obtained through candidate denoising word analysis are added to the denoising summary table, which is thereby updated. At the same time, the newly added denoising words are used to screen the quasi-non-incident data set D10 for incident data again, updating the quasi-incident data set D11 and the quasi-non-incident data set D10.
In addition, the denoising summary table should also be updated promptly whenever notices from relevant departments, related plans and similar documents refine or update the description and classification of incidents.
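The update loop in step 104 amounts to new-word detection followed by a feature-word judgement. The sketch below is one possible reading under stated assumptions: tokens are already restricted to nouns and verbs, the keyword library and denoising summary table are Python sets, and the judgement "is this an incident feature word" is supplied by the caller as the hypothetical callback `is_incident_feature`, since the text does not say how that judgement is made.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def find_new_words(tokens: Iterable[str], keyword_library: Set[str]) -> List[str]:
    """Return words absent from keyword_library, most frequent first, and add them to it."""
    counts = Counter(t for t in tokens if t not in keyword_library)
    keyword_library.update(counts.keys())
    return [word for word, _ in counts.most_common()]


def update_denoising_table(
    tokens: Iterable[str],
    keyword_library: Set[str],
    denoising_table: Set[str],
    is_incident_feature: Callable[[str], bool],
) -> List[str]:
    """Add new incident feature words to denoising_table and return them."""
    new_words = find_new_words(tokens, keyword_library)
    added = [w for w in new_words if is_incident_feature(w)]
    denoising_table.update(added)
    return added
```

After such an update, the records in D10 would be matched against the returned words so that newly matching records move from D10 to D11.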
Step 105: Assess the denoising summary table
For the quasi-incident data set D11 and the quasi-non-incident data set D10 generated after the denoising summary table is updated in step 104, the evaluation indices used for retrieval systems are borrowed to assess the denoising summary table. The calculation method is:
(1) Precision:
(2) Recall:
where D1i is any record in the quasi-incident data set D11, D1j is any record in the quasi-non-incident data set D10, and |*| denotes the size of the data volume.
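The precision and recall formulas themselves are not reproduced in this text. A plausible reconstruction, assuming the standard retrieval definitions and the earlier description (correct incidents in the quasi-incident set for accuracy, plus correct incidents in the quasi-non-incident set for completeness), would be the following. The symbols P_1 and R_1 are introduced here for illustration and do not appear in the source.

```latex
P_1 = \frac{\left|\{\, D1_i \in D11 : D1_i \text{ is a correct incident} \,\}\right|}{\left| D11 \right|}

R_1 = \frac{\left|\{\, D1_i \in D11 : D1_i \text{ is a correct incident} \,\}\right|}
           {\left|\{\, D1_i \in D11 : D1_i \text{ is a correct incident} \,\}\right|
            + \left|\{\, D1_j \in D10 : D1_j \text{ is a correct incident} \,\}\right|}
```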
Based on the quasi-incident data set D11 obtained above, a second screening of incident data is performed with the combination vocabulary to improve the accuracy of applying the keyword vocabulary. Steps 106 to 109 below describe the building of the combination vocabulary, as shown in Fig. 3.
Step 106: Generate the combination vocabulary
Using natural language analysis, an appropriate amount of historical data is selected as training data to generate the combination vocabulary. The detailed process comprises:
(1) Training set construction
According to the denoising summary table, an appropriate amount of correct historical incident data is selected to generate the incident data training set TD21.
(2) Data preprocessing
Word segmentation and part-of-speech tagging are performed on the incident data training set TD21 using the open-source Word word-segmentation component, retaining the nouns and verbs.
(3) Co-occurrence analysis
Co-occurrence analysis is performed between each of the nouns and verbs obtained in step (2) and the denoising words in the denoising summary table, and the co-occurrence frequency is counted to obtain a co-occurrence word set, which is also saved in the co-occurrence lexicon.
(4) Combination word selection
The co-occurrence word set obtained above is sorted by co-occurrence frequency from high to low and, in combination with the denoising summary table, co-occurrence words with higher frequency are preferentially selected as combination words to generate the combination vocabulary.
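The co-occurrence analysis and combination word selection can be sketched as below. The text does not define the co-occurrence window, so this sketch assumes document-level co-occurrence: a word co-occurs with a denoising word if both appear in the same record. It also assumes each record is already a list of nouns and verbs, and that the `top_k` most frequent co-occurrence words are taken. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def cooccurrence_counts(
    docs: Iterable[Iterable[str]], denoising_words: Set[str]
) -> Counter:
    """Count, for each non-denoising word, the records it shares with any denoising word."""
    counts: Counter = Counter()
    for tokens in docs:
        present = set(tokens)
        if present & denoising_words:
            # Each word is counted at most once per record.
            counts.update(present - denoising_words)
    return counts


def select_combination_words(
    docs: Iterable[Iterable[str]], denoising_words: Set[str], top_k: int
) -> List[str]:
    """Return the top_k words by co-occurrence frequency as candidate combination words."""
    counts = cooccurrence_counts(docs, denoising_words)
    return [word for word, _ in counts.most_common(top_k)]
```

A sentence-level or fixed-width window would be an equally valid reading and would only change how `present` is computed.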
Step 107: Apply the combination vocabulary
The quasi-incident data set D11 is filtered with the combination vocabulary. The specific rule is: the quasi-incident data set D11 is matched against the co-occurrence words in the combination vocabulary; if the match succeeds, the data is labelled as quasi-incident data, otherwise it is labelled as quasi-non-incident data, finally generating the quasi-incident data set D21 and the quasi-non-incident data set D20.
Step 108: Update the combination vocabulary
Based on natural language analysis technology, the combination vocabulary is updated using the quasi-non-incident data set D20. The detailed process comprises:
(1) Data preprocessing
The open-source Word word-segmentation component is used to perform word segmentation and part-of-speech tagging on the quasi-non-incident data set D20, retaining only the nouns and verbs.
(2) Co-occurrence analysis
Co-occurrence analysis is performed between each of the nouns and verbs obtained by the above data preprocessing and the denoising words in the denoising summary table, and the co-occurrence frequency is counted.
(3) New co-occurrence word identification
The co-occurrence words obtained in step (2) are checked for duplicates against the co-occurrence words in the co-occurrence lexicon to judge whether each is a new co-occurrence word. If it is, it is passed to combination word selection and added to the co-occurrence lexicon, which is thereby updated; otherwise the process waits for new data.
(4) Combination word selection
The new co-occurrence words obtained in step (3) are sorted by co-occurrence frequency from high to low, and the co-occurrence words that have incident features are selected as newly added combination words.
(5) Combination vocabulary update
The newly added combination words obtained in step (4) are added to the combination vocabulary, which is thereby updated. At the same time, the newly added combination words are used to screen the quasi-non-incident data set D20 for incident data again, updating the quasi-incident data set D21 and the quasi-non-incident data set D20.
Step 109: Assess the combination vocabulary
For the quasi-incident data set D21 and the quasi-non-incident data set D20 generated after the combination vocabulary is updated, the evaluation indices used for retrieval systems are borrowed to assess the combination vocabulary. The calculation method is:
(1) Precision:
(2) Recall:
where D2i is any record in the quasi-incident data set D21, D2j is any record in the quasi-non-incident data set D20, and |*| denotes the size of the data volume.
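The assessment can be sketched in code as follows. Since the formulas are not reproduced in this text, the sketch assumes the standard retrieval definitions: precision is the share of correct incidents in the quasi-incident set, and recall is the share of all correct incidents (across both sets) that ended up in the quasi-incident set. The ground-truth check `is_true_incident` is a hypothetical callback standing in for manual labelling.

```python
from typing import Callable, Sequence, Tuple


def precision_recall(
    quasi_incident: Sequence[str],
    quasi_non_incident: Sequence[str],
    is_true_incident: Callable[[str], bool],
) -> Tuple[float, float]:
    """Return (precision, recall) for one screening stage."""
    # Correct incidents that the vocabulary kept.
    kept_correct = sum(1 for r in quasi_incident if is_true_incident(r))
    # Correct incidents that the vocabulary wrongly rejected.
    missed_correct = sum(1 for r in quasi_non_incident if is_true_incident(r))
    precision = kept_correct / len(quasi_incident) if quasi_incident else 0.0
    total_correct = kept_correct + missed_correct
    recall = kept_correct / total_correct if total_correct else 0.0
    return precision, recall
```

The same function applies to any stage by passing the matching pair of sets, for example D21 and D20 for the combination vocabulary.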
For the quasi-incident data set D21 obtained above, the reverse vocabulary is used to filter out the non-incident data it still contains, improving the accuracy of applying the keyword vocabulary. As shown in Fig. 4, the specific steps for building the reverse vocabulary in this embodiment are as follows:
Step 110:Set up reverse vocabulary
Based on natural language analysis, select appropriate historical data, generate reverse vocabulary, specific rules are:
(1) training set builds
The quasi- incident data collection D21 that combined vocabulary obtains, has been related to incident data and noise data.Base
In combination vocabulary, using historical incident data and noise data, generate incident data training set TD31 and non-burst
Event data training set TD30.
(2) data prediction
Assembly is increased income respectively to incident data training set TD31 and the training of non-burst event data using Word participle
Collection TD30 carries out participle, part-of-speech tagging and word frequency statisticses, only retains noun therein and verb, and all lexical items obtaining are added
It is added in training dictionary.
(3) reversely word analysis
The noun of non-burst event data training set TD30 that step (2) data prediction is obtained and verb, with burst
Noun in event data training set TD31 and verb carry out duplicate checking, only retain non-burst event data training set TD30 proprietary
Noun and verb, the higher word of prioritizing selection word frequency, as reverse word, generates reverse vocabulary.
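The reverse word analysis in step (3) amounts to a set difference followed by a frequency ranking. The sketch below assumes the two training sets are already segmented into noun and verb tokens; the cut-off `top_k` is a hypothetical parameter, since the text only says that higher-frequency words are preferred.

```python
from collections import Counter

def build_reverse_vocab(td30_docs, td31_docs, top_k=50):
    """Select words exclusive to the non-emergency set TD30, most frequent first."""
    td30_freq = Counter(w for doc in td30_docs for w in doc)
    td31_words = {w for doc in td31_docs for w in doc}
    # Keep only words that never occur in the emergency training set TD31.
    exclusive = {w: c for w, c in td30_freq.items() if w not in td31_words}
    # Rank by frequency, highest first (ties broken alphabetically).
    ranked = sorted(exclusive.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_k]]
```

For example, a word such as "drill" that appears only in non-emergency reports would be kept, while "fire", which appears in both sets, would be discarded.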
Step 111: Apply the reverse vocabulary
Using the reverse vocabulary, the quasi-emergency data set D21 obtained with the combination vocabulary is filtered again, yielding the quasi-emergency data set D31 and the quasi-non-emergency data set D30, where D31 is the emergency event data output by the system.
Further, the specific rule for filtering D21 with the reverse vocabulary is as follows: each item in D21 is matched against the reverse words in the reverse vocabulary. If a match succeeds, the item is marked as quasi-non-emergency data; otherwise it is marked as quasi-emergency data. This finally generates the quasi-emergency data set D31 and the quasi-non-emergency data set D30, where D31 is the emergency event data that the system obtains using the keyword vocabulary.
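The matching rule of step 111 can be sketched as a single pass over D21. Plain substring matching on the raw text is an assumption made here for brevity; matching on segmented tokens would work the same way.

```python
def apply_reverse_vocab(d21, reverse_words):
    """Split D21 into (D31, D30): items that hit any reverse word go to D30."""
    d31, d30 = [], []
    for text in d21:
        if any(word in text for word in reverse_words):
            d30.append(text)   # matched a reverse word: quasi-non-emergency data
        else:
            d31.append(text)   # no match: quasi-emergency data
    return d31, d30
```

Note that the direction of the test is the opposite of the denoising and combination filters, where a successful match marks an item as quasi-emergency data.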
Step 112: Update the reverse vocabulary
The reverse vocabulary is updated based on the quasi-emergency data set D31. The specific rules are:
(1) Data preprocessing
The open-source Word segmentation component is used to perform word segmentation, part-of-speech tagging and word frequency statistics on the quasi-emergency data D31, retaining only nouns and verbs.
(2) New word identification
The nouns and verbs obtained by the preprocessing in step (1) are checked against the terms in the training lexicon to determine whether each is a new word. A new word proceeds to reverse word analysis and is also added to the training lexicon, which is updated; otherwise the process waits for new data.
(3) Reverse word analysis
The new words obtained in step (2) are sorted by word frequency from high to low, and the higher-frequency words are preferentially selected as newly added reverse words.
(4) Reverse vocabulary update
The newly added reverse words obtained in step (3) are added to the reverse vocabulary, which is updated. At the same time, the quasi-emergency data set D31 is filtered with the newly added reverse words, updating the quasi-emergency data set D31 and the quasi-non-emergency data set D30.
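The four update steps can be sketched as one function. This is an illustration under assumptions: D31 is represented as (text, tokens) pairs that are already segmented, and the hypothetical parameter `top_k` decides how many of the most frequent new words become reverse words, a number the text leaves open.

```python
from collections import Counter

def update_reverse_vocab(d31, training_lexicon, reverse_vocab, top_k=1):
    """d31: list of (text, tokens) pairs, tokens being the nouns/verbs of the text.

    Returns (added_words, updated_d31, moved_to_d30).
    """
    # Step (1): word frequencies over the quasi-emergency set.
    freq = Counter(tok for _, tokens in d31 for tok in tokens)
    # Step (2): new words are those absent from the training lexicon.
    new_words = {w: c for w, c in freq.items() if w not in training_lexicon}
    training_lexicon.update(new_words.keys())
    # Step (3): rank new words by frequency (ties broken alphabetically).
    added = sorted(new_words, key=lambda w: (-new_words[w], w))[:top_k]
    # Step (4): extend the reverse vocabulary and re-filter D31.
    reverse_vocab.update(added)
    hit = lambda tokens: any(w in tokens for w in added)
    updated_d31 = [item for item in d31 if not hit(item[1])]
    moved_to_d30 = [item for item in d31 if hit(item[1])]
    return added, updated_d31, moved_to_d30
```

Items in `moved_to_d30` would be appended to the existing quasi-non-emergency set D30.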
Step 113: Reverse vocabulary evaluation
For the quasi-emergency data set D31 and the quasi-non-emergency data set D30 generated after the reverse vocabulary is updated in step 112, the evaluation metrics of retrieval systems are borrowed to evaluate the reverse vocabulary. The calculation is as follows:
(1) Precision:
(2) Recall:
where D3i is any item in the quasi-emergency data set D31, D3j is any item in the quasi-non-emergency data set D30, and |*| denotes the size of a data set.
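As a sketch of how the two metrics could be computed, the function below takes precision as the share of the quasi-emergency set that is genuine emergency event data, and recall as the share of all genuine emergency event data (across both sets) that the quasi-emergency set captures. This reading follows the evaluation described in claim 13 and is an assumption. The labelling function `is_emergency` is hypothetical and stands for a manual or reference annotation.

```python
def precision_recall(quasi_emergency, quasi_non_emergency, is_emergency):
    """Return (precision, recall) for one screening stage."""
    # Correct emergency items that the stage kept.
    kept = sum(1 for x in quasi_emergency if is_emergency(x))
    # Correct emergency items that the stage wrongly filtered out.
    missed = sum(1 for x in quasi_non_emergency if is_emergency(x))
    precision = kept / len(quasi_emergency) if quasi_emergency else 0.0
    recall = kept / (kept + missed) if (kept + missed) else 0.0
    return precision, recall
```

The same function applies to any stage's pair of sets, for example (D11, D10), (D21, D20) or (D31, D30).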
The device according to the embodiment of the present invention is described in detail next.
Fig. 5 is a structural diagram of the device according to the embodiment of the present invention. As shown in Fig. 5, the device may specifically include:
A vocabulary construction unit, mainly responsible for building the denoising vocabulary, the combination vocabulary and the reverse vocabulary;
A data screening unit, mainly responsible for screening the collected emergency-related data according to one or more of the denoising vocabulary, the combination vocabulary and the reverse vocabulary.
The vocabulary construction unit includes at least one or more of the following modules:
A first building module, mainly responsible for preprocessing data using natural language analysis technology and building the basic denoising vocabulary and the denoising vocabulary through quasi-denoising word analysis;
Specifically, the first building module filters historical data by denoising word matching according to the basic denoising vocabulary to obtain a training set TD1; labels the data in TD1 according to whether each item is emergency event data, finally obtaining an emergency data training set TD11 and a non-emergency data training set TD10; uses the open-source Word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the data in TD11 and TD10 separately, retaining only nouns and verbs and adding the retained terms to the keyword lexicon; and, after this preprocessing, sorts the terms obtained from TD11 by word frequency from high to low and, through quasi-denoising word analysis, preferentially selects the higher-frequency words as denoising words to generate the denoising vocabulary.
A second building module, mainly responsible for preprocessing data using natural language analysis technology based on the denoising vocabulary, performing co-occurrence analysis between the resulting nouns and verbs and the denoising vocabulary, and building the combination vocabulary through combination word selection;
Specifically, the second building module selects, according to the denoising vocabulary, an appropriate amount of correct historical emergency event data to generate an emergency data training set TD21; performs word segmentation and part-of-speech tagging on TD21 using the open-source Word segmentation component, retaining nouns and verbs; performs co-occurrence analysis between these nouns and verbs and each denoising word in the denoising vocabulary and counts the co-occurrence frequencies to obtain a set of co-occurrence words, which is also saved to the co-occurrence lexicon; and sorts the set of co-occurrence words by co-occurrence frequency from high to low and, in combination with the denoising vocabulary, preferentially selects the higher-frequency co-occurrence words as combination words to generate the combination vocabulary.
A third building module, mainly responsible for preprocessing data using natural language analysis technology based on the combination vocabulary, and building the reverse vocabulary through reverse word analysis.
Specifically, the third building module, according to the quasi-emergency data set D21 and based on the combination vocabulary, uses historical emergency event data and noise data to generate an emergency data training set TD31 and a non-emergency data training set TD30; uses the open-source Word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on TD31 and TD30 separately, retaining only nouns and verbs and adding all resulting terms to the training lexicon; and compares the nouns and verbs of TD30 obtained by the preprocessing with the nouns and verbs in TD31, retaining only the nouns and verbs exclusive to the non-emergency data training set TD30 and preferentially selecting the higher-frequency words as reverse words to generate the reverse vocabulary.
The specific implementation of the device according to the embodiment of the present invention has been described in detail in the method above and is therefore not repeated here.
In summary, the embodiments of the present invention provide a method and device for screening emergency event data. Using natural language analysis technology, and based on the results of Chinese word segmentation and part-of-speech tagging, the method provides a basis for building the emergency event keyword vocabulary. To ensure data accuracy, the present invention divides the emergency event keyword vocabulary into three sub-vocabularies (the denoising vocabulary, the combination vocabulary and the reverse vocabulary) and screens emergency event data using multiple vocabularies. With regard to the comprehensiveness and accuracy of the data, the present invention also introduces precision and recall, which provide a basis for evaluating the performance of each vocabulary and for updating it.
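The multi-vocabulary screening summarized above can be sketched as three successive filters. The sketch assumes substring matching on raw text and assumes that each combination vocabulary entry is a pair of words that must both occur in an item; neither detail is fixed by the text, and all names are illustrative.

```python
def screen(texts, denoising_words, combination_pairs, reverse_words):
    """Return the final quasi-emergency set D31 after three successive filters."""
    # Stage 1: keep items that mention any denoising word (D11).
    d11 = [t for t in texts if any(w in t for w in denoising_words)]
    # Stage 2: keep items in which both words of some combination pair occur (D21).
    d21 = [t for t in d11 if any(a in t and b in t for a, b in combination_pairs)]
    # Stage 3: drop items that mention any reverse word (D31).
    d31 = [t for t in d21 if not any(w in t for w in reverse_words)]
    return d31
```

Each stage narrows the previous stage's output, so the first two filters trade recall for precision by requiring matches, and the third removes remaining noise by rejecting matches.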
Those skilled in the art will understand that all or part of the flow of the above method embodiments can be implemented by a computer program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium. The computer-readable storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
The above is only a preferred specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.
Claims (18)
1. A method for screening emergency event data, characterized by comprising:
building a denoising vocabulary, a combination vocabulary and a reverse vocabulary;
screening the collected emergency-related data according to one or more of the denoising vocabulary, the combination vocabulary and the reverse vocabulary.
2. The method according to claim 1, characterized in that the process of building the denoising vocabulary, the combination vocabulary and the reverse vocabulary specifically comprises:
preprocessing data using natural language analysis technology, and building a basic denoising vocabulary and the denoising vocabulary through quasi-denoising word analysis;
based on the denoising vocabulary, preprocessing data using natural language analysis technology, performing co-occurrence analysis between the resulting nouns and verbs and the denoising vocabulary, and building the combination vocabulary through combination word selection;
based on the combination vocabulary, preprocessing data using natural language analysis technology, and building the reverse vocabulary through reverse word analysis.
3. The method according to claim 1 or 2, characterized in that the process of building the denoising vocabulary specifically comprises:
filtering historical data by denoising word matching according to the basic denoising vocabulary to obtain a training set TD1; labelling the data in the training set TD1 according to whether each item is emergency event data, finally obtaining an emergency data training set TD11 and a non-emergency data training set TD10; using the open-source Word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the data in the training sets TD11 and TD10 separately, retaining only nouns and verbs, and adding the retained terms to a keyword lexicon; after this preprocessing, sorting the terms obtained from the emergency data training set TD11 by word frequency from high to low and, through quasi-denoising word analysis, preferentially selecting the higher-frequency words as denoising words to generate the denoising vocabulary.
4. The method according to claim 1 or 2, characterized by further comprising:
filtering with the denoising vocabulary, that is, filtering the collected data by denoising word matching according to the denoising vocabulary; if a match succeeds, the item is marked as quasi-emergency data, otherwise it is marked as quasi-non-emergency data, finally obtaining a quasi-emergency data set D11 and a quasi-non-emergency data set D10.
5. The method according to claim 4, characterized by further comprising:
updating the denoising vocabulary: performing word segmentation and part-of-speech tagging on the quasi-non-emergency data set D10 using the open-source Word segmentation component, retaining only nouns and verbs; checking the terms obtained by the preprocessing against the terms in the keyword lexicon, and, if a word is not in the keyword lexicon, identifying it as a new word and adding it to the keyword lexicon, which is updated, otherwise waiting for new data; performing quasi-denoising word analysis on the new words obtained by the preprocessing to determine whether each is an emergency event feature word, thereby obtaining new denoising words; adding the new denoising words to the denoising vocabulary, which is updated, and at the same time using the newly added denoising words to screen the quasi-non-emergency data set D10 again for emergency event data, updating the quasi-emergency data set D11 and the quasi-non-emergency data set D10.
6. The method according to claim 1 or 2, characterized in that the process of building the combination vocabulary specifically comprises:
selecting, according to the denoising vocabulary, an appropriate amount of correct historical emergency event data to generate an emergency data training set TD21; performing word segmentation and part-of-speech tagging on the emergency data training set TD21 using the open-source Word segmentation component, retaining nouns and verbs; performing co-occurrence analysis between these nouns and verbs and each denoising word in the denoising vocabulary and counting the co-occurrence frequencies to obtain a set of co-occurrence words, which is also saved to a co-occurrence lexicon; sorting the set of co-occurrence words by co-occurrence frequency from high to low and, in combination with the denoising vocabulary, preferentially selecting the higher-frequency co-occurrence words as combination words to generate the combination vocabulary.
7. The method according to claim 6, characterized by further comprising:
filtering the quasi-emergency data set D11 with the combination vocabulary, that is, matching each item in the quasi-emergency data set D11 against the co-occurrence words in the combination vocabulary; if a match succeeds, the item is marked as quasi-emergency data, otherwise it is marked as quasi-non-emergency data, finally generating a quasi-emergency data set D21 and a quasi-non-emergency data set D20.
8. The method according to claim 7, characterized by further comprising:
updating the combination vocabulary using the quasi-non-emergency data set D20: performing word segmentation and part-of-speech tagging on the quasi-non-emergency data set D20 using the open-source Word segmentation component, retaining only nouns and verbs; performing co-occurrence analysis between the nouns and verbs obtained by this preprocessing and each denoising word in the denoising vocabulary, and counting the co-occurrence frequencies; checking the resulting co-occurrence words against the co-occurrence words in the co-occurrence lexicon to determine whether each is new, and, if it is a new co-occurrence word, passing it to combination word selection and adding it to the co-occurrence lexicon, which is updated, otherwise waiting for new data; sorting the new co-occurrence words by co-occurrence frequency from high to low and selecting those with emergency event characteristics as newly added combination words.
9. The method according to claim 8, characterized by further comprising:
adding the newly added combination words to the combination vocabulary, which is updated, and at the same time using the newly added combination words to screen the quasi-non-emergency data set D20 again for emergency event data, updating the quasi-emergency data set D21 and the quasi-non-emergency data set D20.
10. The method according to claim 9, characterized in that the process of building the reverse vocabulary specifically comprises:
according to the quasi-emergency data set D21 and based on the combination vocabulary, using historical emergency event data and noise data to generate an emergency data training set TD31 and a non-emergency data training set TD30; using the open-source Word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the emergency data training set TD31 and the non-emergency data training set TD30 separately, retaining only nouns and verbs, and adding all resulting terms to a training lexicon; comparing the nouns and verbs of the non-emergency data training set TD30 obtained by the preprocessing with the nouns and verbs in the emergency data training set TD31, retaining only the nouns and verbs exclusive to the non-emergency data training set TD30, and preferentially selecting the higher-frequency words as reverse words to generate the reverse vocabulary.
11. The method according to claim 10, characterized by further comprising:
filtering the quasi-emergency data set D21 with the reverse vocabulary, that is, matching each item in the quasi-emergency data set D21 against the reverse words in the reverse vocabulary; if a match succeeds, the item is marked as quasi-non-emergency data, otherwise it is marked as quasi-emergency data, finally generating a quasi-emergency data set D31 and a quasi-non-emergency data set D30.
12. The method according to claim 11, characterized by further comprising:
updating the reverse vocabulary based on the quasi-emergency data set D31: performing word segmentation, part-of-speech tagging and word frequency statistics on the quasi-emergency data D31 using the open-source Word segmentation component, retaining only nouns and verbs; checking the nouns and verbs obtained by the preprocessing against the terms in the training lexicon to determine whether each is a new word, and, if it is a new word, passing it to reverse word analysis and adding it to the training lexicon, which is updated, otherwise waiting for new data; sorting the new words by word frequency from high to low and preferentially selecting the higher-frequency words as newly added reverse words; adding the newly added reverse words to the reverse vocabulary, which is updated, and at the same time filtering the quasi-emergency data set D31 with the newly added reverse words, updating the quasi-emergency data set D31 and the quasi-non-emergency data set D30.
13. The method according to claim 1 or 2, characterized by further comprising:
evaluating the performance of one or more of the denoising vocabulary, the combination vocabulary and the reverse vocabulary, that is, counting the number of correct emergency events in the quasi-emergency data set to evaluate the accuracy of the data; then combining the number of correct emergency events in the quasi-emergency data set with the number of correct emergency events in the quasi-non-emergency data set to evaluate the comprehensiveness of the data, thereby completing the evaluation of vocabulary performance.
14. A device for screening emergency event data, characterized by comprising:
a vocabulary construction unit, configured to build a denoising vocabulary, a combination vocabulary and a reverse vocabulary;
a data screening unit, configured to screen the collected emergency-related data according to one or more of the denoising vocabulary, the combination vocabulary and the reverse vocabulary.
15. The device according to claim 14, characterized in that the vocabulary construction unit includes at least one or more of the following modules:
a first building module, configured to preprocess data using natural language analysis technology and to build a basic denoising vocabulary and the denoising vocabulary through quasi-denoising word analysis;
a second building module, configured to preprocess data using natural language analysis technology based on the denoising vocabulary, to perform co-occurrence analysis between the resulting nouns and verbs and the denoising vocabulary, and to build the combination vocabulary through combination word selection;
a third building module, configured to preprocess data using natural language analysis technology based on the combination vocabulary, and to build the reverse vocabulary through reverse word analysis.
16. The device according to claim 14 or 15, characterized in that the first building module is specifically configured to: filter historical data by denoising word matching according to the basic denoising vocabulary to obtain a training set TD1; label the data in the training set TD1 according to whether each item is emergency event data, finally obtaining an emergency data training set TD11 and a non-emergency data training set TD10; use the open-source Word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the data in the training sets TD11 and TD10 separately, retaining only nouns and verbs, and add the retained terms to a keyword lexicon; and, after this preprocessing, sort the terms obtained from the emergency data training set TD11 by word frequency from high to low and, through quasi-denoising word analysis, preferentially select the higher-frequency words as denoising words to generate the denoising vocabulary.
17. The device according to claim 14 or 15, characterized in that the second building module is specifically configured to: select, according to the denoising vocabulary, an appropriate amount of correct historical emergency event data to generate an emergency data training set TD21; perform word segmentation and part-of-speech tagging on the emergency data training set TD21 using the open-source Word segmentation component, retaining nouns and verbs; perform co-occurrence analysis between these nouns and verbs and each denoising word in the denoising vocabulary and count the co-occurrence frequencies to obtain a set of co-occurrence words, which is also saved to a co-occurrence lexicon; and sort the set of co-occurrence words by co-occurrence frequency from high to low and, in combination with the denoising vocabulary, preferentially select the higher-frequency co-occurrence words as combination words to generate the combination vocabulary.
18. The device according to claim 14 or 15, characterized in that the third building module is specifically configured to: according to the quasi-emergency data set D21 and based on the combination vocabulary, use historical emergency event data and noise data to generate an emergency data training set TD31 and a non-emergency data training set TD30; use the open-source Word segmentation component to perform word segmentation, part-of-speech tagging and word frequency statistics on the emergency data training set TD31 and the non-emergency data training set TD30 separately, retaining only nouns and verbs, and add all resulting terms to a training lexicon; and compare the nouns and verbs of the non-emergency data training set TD30 obtained by the preprocessing with the nouns and verbs in the emergency data training set TD31, retaining only the nouns and verbs exclusive to the non-emergency data training set TD30 and preferentially selecting the higher-frequency words as reverse words to generate the reverse vocabulary.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610796947.7A CN106469203B (en) | 2016-08-31 | 2016-08-31 | A kind of screening technique and device of incident data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106469203A true CN106469203A (en) | 2017-03-01 |
CN106469203B CN106469203B (en) | 2019-07-23 |
Family
ID=58230339
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610796947.7A Active CN106469203B (en) | 2016-08-31 | 2016-08-31 | A kind of screening technique and device of incident data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106469203B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108681613A (en) * | 2018-07-13 | 2018-10-19 | 北京酷车易美网络科技有限公司 | A kind of vehicle history vehicle condition solution read apparatus |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102937960A (en) * | 2012-09-06 | 2013-02-20 | 北京邮电大学 | Device and method for identifying and evaluating emergency hot topic |
CN103577404A (en) * | 2012-07-19 | 2014-02-12 | 中国人民大学 | Microblog-oriented discovery method for new emergencies |
CN104573006A (en) * | 2015-01-08 | 2015-04-29 | 南通大学 | Construction method of public health emergent event domain knowledge base |
CN104615718A (en) * | 2015-02-05 | 2015-05-13 | 北京航空航天大学 | Hierarchical analysis method for social network emergency |
Non-Patent Citations (1)
Title |
---|
周智星: "网络舆情监测管理系统设计的研究与应用", 《中国优秀硕士学位论文全文数据库—信息科技辑》 * |
Also Published As
Publication number | Publication date |
---|---|
CN106469203B (en) | 2019-07-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||