CN104462439B - Event recognition method and device - Google Patents

Publication number
CN104462439B
Authority
CN
China
Prior art keywords
word
data
association
degrees
euclidean distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410779142.2A
Other languages
Chinese (zh)
Other versions
CN104462439A (en
Inventor
刘粉香
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201410779142.2A priority Critical patent/CN104462439B/en
Publication of CN104462439A publication Critical patent/CN104462439A/en
Application granted granted Critical
Publication of CN104462439B publication Critical patent/CN104462439B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an event recognition method and device. The method includes: performing word segmentation on text information to obtain a first word and a plurality of second words; obtaining a first multidimensional array of the first word corresponding to the text information and a second multidimensional array of each second word corresponding to the text information; calculating first association-degree data between the first word and each second word using the first multidimensional array and each second multidimensional array; extracting second words according to the first association-degree data to obtain a first associated word set; calculating second association-degree data between each third word in that set and a fourth word in the set of second words; and taking the fourth word corresponding to second association-degree data that satisfies a second preset condition as a fifth word, thereby obtaining an event phrase of the first word. The invention addresses the slow speed and poor accuracy with which the prior art identifies the events associated with a keyword, and improves both the speed and the accuracy of identifying associated events.

Description

Event recognition method and device
Technical field
The present invention relates to the field of data processing, and in particular to an event recognition method and device.
Background
For a topic of interest, given a keyword, the problem to be solved is how to quickly and effectively find the events associated with that keyword and rank them by degree of association. Existing solutions determine associated phrases based on text matching degree or on the co-occurrence probability of the given keyword within sentences, and rank events by attention using word-frequency statistics.
Specifically, determining associated phrases by text matching degree means searching the text for associated words that resemble the characters contained in the given keyword. For example, if the given keyword is "Tiananmen", searching for associated words by text matching degree treats "Di'anmen" as very similar to "Tiananmen", that is, "Di'anmen" is regarded as an associated word of the given keyword "Tiananmen". In fact, the words that usually appear together with "Tiananmen" are "Tiananmen Rostrum", "the Forbidden City", or "Tiananmen Square", not "Di'anmen".
Determining associated phrases by co-occurrence probability means splitting all sentences of the text into minimal keywords (i.e., segmenting them into minimal phrases or single characters) and calculating the probability that any two minimal keywords occur together in each sentence, which gives the co-occurrence probability of those two minimal keywords. Given a preset probability threshold, two words whose co-occurrence probability exceeds the threshold are associated words, and the higher the co-occurrence probability of two words, the stronger their association.
Because existing solutions find associated words by traversal in order to determine associated phrases, they consume considerable computing resources for both calculation and data storage, and processing is slow. In addition, the word-frequency statistics method is not based on natural language processing, so many associated events are missed.
No effective solution has yet been proposed for the problem that the prior art identifies the events associated with a keyword slowly and with poor accuracy.
Summary of the invention
The main object of the present invention is to provide an event recognition method and device that address the slow speed and poor accuracy with which the prior art identifies the events associated with a keyword.
To achieve this object, according to one aspect of the invention, an event recognition method is provided.
The recognition method according to the invention includes: performing word segmentation on pre-acquired text information to obtain a first word and a plurality of second words; obtaining, by a machine learning method, a first multidimensional array of the first word corresponding to the text information and a second multidimensional array of each second word corresponding to the text information; calculating first association-degree data between the first word and each second word using the first multidimensional array and each second multidimensional array; extracting the second words corresponding to first association-degree data that satisfies a first preset condition to obtain a first associated word set; calculating second association-degree data between each third word in the first associated word set and a fourth word in the set of second words, where the set of second words includes the third word and the fourth word; taking the fourth word corresponding to second association-degree data that satisfies a second preset condition as a fifth word; and saving the third word, the fifth word, and the first word that have an association relationship to obtain an event phrase of the first word.
Further, calculating the first association-degree data between the first word and each second word using the first multidimensional array and each second multidimensional array includes: calculating a first Euclidean distance between the first multidimensional array of the first word and the second multidimensional array of each second word to obtain the first association-degree data. Calculating the second association-degree data between each third word in the first associated word set and the fourth word in the set of second words includes: calculating a second Euclidean distance between a third multidimensional array of the third word and a fourth multidimensional array of the fourth word to obtain the second association-degree data.
Further, extracting the second words corresponding to first association-degree data that satisfies the first preset condition to obtain the first associated word set includes: sorting the calculated first Euclidean distances in reverse order to obtain a first sequence, and extracting the second words corresponding to the top N first Euclidean distances in the first sequence to obtain the first associated word set, where N is a natural number; or saving the second words whose first Euclidean distance is not greater than a first preset threshold into the first associated word set.
Further, taking the fourth word corresponding to second association-degree data that satisfies the second preset condition as the fifth word includes: sorting the calculated second Euclidean distances in reverse order to obtain a second sequence, and extracting the fourth words corresponding to the top M second Euclidean distances in the second sequence to obtain the fifth word, where M is a natural number; or taking the fourth words whose second Euclidean distance is not greater than a second preset threshold as the fifth word.
Further, after the third word, the fifth word, and the first word that have an association relationship are saved to obtain the event phrase of the first word, the recognition method also includes: calculating third association-degree data of the fifth word, the third word, and the first word in each event phrase; and sorting the event phrases using the third association-degree data to obtain an event sequence. Calculating the third association-degree data includes: taking the sum of the first Euclidean distance and the second Euclidean distance as the third association-degree data. Sorting the event phrases using the third association-degree data includes: sorting the event phrases by the numerical value of the third association-degree data to obtain the event sequence.
To achieve this object, according to another aspect of the invention, an event recognition device is provided.
The recognition device according to the invention includes: a word segmentation module, for performing word segmentation on pre-acquired text information to obtain a first word and a plurality of second words; an acquisition module, for obtaining, by a machine learning method, a first multidimensional array of the first word corresponding to the text information and a second multidimensional array of each second word corresponding to the text information; a first calculation module, for calculating first association-degree data between the first word and each second word using the first multidimensional array and each second multidimensional array; an extraction module, for extracting the second words corresponding to first association-degree data that satisfies a first preset condition to obtain a first associated word set; a second calculation module, for calculating second association-degree data between each third word in the first associated word set and a fourth word in the set of second words, where the set of second words includes the third word and the fourth word; a first determination module, for taking the fourth word corresponding to second association-degree data that satisfies a second preset condition as a fifth word; and a first saving module, for saving the third word, the fifth word, and the first word that have an association relationship to obtain an event phrase of the first word.
Further, the first calculation module includes a first calculation submodule, for calculating a first Euclidean distance between the first multidimensional array of the first word and the second multidimensional array of each second word to obtain the first association-degree data. The second calculation module includes a second calculation submodule, for calculating a second Euclidean distance between a third multidimensional array of the third word and a fourth multidimensional array of the fourth word to obtain the second association-degree data.
Further, the extraction module includes: a first sorting module, for sorting the calculated first Euclidean distances in reverse order to obtain a first sequence, and a first extraction submodule, for extracting the second words corresponding to the top N first Euclidean distances in the first sequence to obtain the first associated word set, where N is a natural number; or a second saving module, for saving the second words whose first Euclidean distance is not greater than a first preset threshold into the first associated word set.
Further, the first determination module includes: a second sorting module, for sorting the calculated second Euclidean distances in reverse order to obtain a second sequence, and a second extraction submodule, for extracting the fourth words corresponding to the top M second Euclidean distances in the second sequence to obtain the fifth word, where M is a natural number; or a third saving module, for taking the fourth words whose second Euclidean distance is not greater than a second preset threshold as the fifth word.
Further, the recognition device also includes: a third calculation module, for calculating, after the third word, the fifth word, and the first word that have an association relationship are saved to obtain the event phrase of the first word, third association-degree data of the fifth word, the third word, and the first word in each event phrase; and a third sorting module, for sorting the event phrases using the third association-degree data to obtain an event sequence. The third calculation module includes a second determination module, for taking the sum of the first Euclidean distance and the second Euclidean distance as the third association-degree data. The third sorting module includes a sorting submodule, for sorting the event phrases by the numerical value of the third association-degree data to obtain the event sequence.
With the embodiments of the invention, after the pre-acquired text information is segmented into a first word and a plurality of other words, first association-degree data between the first word and each of the other words is calculated to determine the first associated word set of the first word. Second association-degree data between each word in the first associated word set and the other words is then calculated to obtain the associated words of each third word in the first associated word set. Saving the first word, a third word in the first associated word set, and the associated word of that third word (i.e., the fifth word) yields an event phrase of the first word. In other words, once the first associated word set of the first word has been determined, the associated words of each word in that set are determined, and the event phrase of the first word is generated from the first word, a word in the first associated word set, and that word's associated word. The event phrase (e.g., an event associated with a keyword) is obtained without traversing the entire text information, which speeds up obtaining event phrases. The embodiments thereby address the slow speed and poor accuracy with which the prior art identifies the events associated with a keyword, and improve both the speed and the accuracy of identifying associated events.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the event recognition method according to an embodiment of the invention;
Fig. 2 is a schematic diagram of an optional relationship between words according to an embodiment of the invention; and
Fig. 3 is a schematic diagram of the event recognition device according to an embodiment of the invention.
Detailed description of the embodiments
First, some of the nouns and terms that appear in the description of the embodiments of the invention are explained as follows:
Machine learning is a method of converting data into information by extracting rules or patterns from the data; the main machine learning methods are inductive learning and analytical learning. In machine learning, the data is first preprocessed to form features, and a model is then created from those features. The machine learning algorithm analyzes the collected data and assigns weights, thresholds, and other parameters to achieve the learning objective.
To help those skilled in the art better understand the solution of the invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
It should be noted that the terms "first", "second", and so on in the description, the claims, and the drawings are used to distinguish similar objects and do not describe a specific order or sequence. Data so described are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion: a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
It should be noted that, where there is no conflict, the embodiments of this application and the features in them may be combined with one another. The invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 is a flowchart of the event recognition method according to an embodiment of the invention. As shown in Fig. 1, the recognition method includes the following steps:
Step S102: perform word segmentation on the pre-acquired text information to obtain a first word and a plurality of second words.
Step S104: obtain, by a machine learning method, a first multidimensional array of the first word corresponding to the text information and a second multidimensional array of each second word corresponding to the text information.
Step S106: calculate first association-degree data between the first word and each second word using the first multidimensional array and each second multidimensional array.
Step S108: extract the second words corresponding to first association-degree data that satisfies a first preset condition to obtain a first associated word set.
Here, the set of second words includes the third word and the fourth word.
Step S110: calculate second association-degree data between each third word in the first associated word set and a fourth word in the set of second words.
Here, the set of second words includes the third word and the fourth word.
Step S112: take the fourth word corresponding to second association-degree data that satisfies a second preset condition as a fifth word.
Step S114: save the third word, the fifth word, and the first word that have an association relationship to obtain an event phrase of the first word.
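Steps S106 to S114 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation. It assumes the word vectors of step S104 are already available in a dictionary, that a smaller Euclidean distance means a closer association, and that the fourth words are the second words left outside the first associated word set. The names `euclidean`, `nearest`, `event_phrases`, and the toy two-dimensional vectors are all hypothetical.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest(word, candidates, vectors, n):
    """Return the n candidates closest to `word` as (candidate, distance) pairs."""
    scored = [(c, euclidean(vectors[word], vectors[c]))
              for c in candidates if c != word]
    return sorted(scored, key=lambda pair: pair[1])[:n]

def event_phrases(first_word, second_words, vectors, n=2, m=1):
    """Build (first, third, fifth) event phrases for `first_word`."""
    # S106/S108: first association-degree data, then the first associated word set.
    third_words = nearest(first_word, second_words, vectors, n)
    third_set = {w for w, _ in third_words}
    # Assumption: the fourth words are the remaining second words.
    fourth_words = [w for w in second_words if w not in third_set]
    phrases = []
    for third, _ in third_words:
        # S110/S112: second association-degree data, then the fifth words.
        for fifth, _ in nearest(third, fourth_words, vectors, m):
            # S114: keep the associated triple as an event phrase.
            phrases.append((first_word, third, fifth))
    return phrases

# Toy 2-dimensional vectors, purely illustrative.
VECTORS = {"k": (0, 0), "a": (1, 0), "b": (0, 2), "c": (1, 1), "d": (0, 5)}
print(event_phrases("k", ["a", "b", "c", "d"], VECTORS))
```

With these toy vectors, "a" and "c" are the two words closest to "k", and "b" is the closest remaining word to each of them.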
With the embodiments of the invention, after the pre-acquired text information is segmented into a first word and a plurality of other words, first association-degree data between the first word and each of the other words is calculated to determine the first associated word set of the first word. Second association-degree data between each word in the first associated word set and the other words is then calculated to obtain the associated words of each third word in the first associated word set. Saving the first word, a third word in the first associated word set, and the associated word of that third word (i.e., the fifth word) yields an event phrase of the first word. In other words, once the first associated word set of the first word has been determined, the associated words of each word in that set are determined, and the event phrase of the first word is generated from the first word, a word in the first associated word set, and that word's associated word. The event phrase (e.g., an event associated with a keyword) is obtained without traversing the entire text information, which speeds up obtaining event phrases. The embodiments thereby address the slow speed and poor accuracy with which the prior art identifies the events associated with a keyword, and improve both the speed and the accuracy of identifying associated events.
In the above embodiments, the text information may be text obtained from the internet (e.g., a news item or a microblog comment), electronic text obtained by scanning or typing in the content of a paper document, or electronic text entered by a user through a terminal. Optionally, the text information may be organized in paragraphs; for example, one news item or one comment is one paragraph.
It should be further noted that word segmentation of the text information to obtain multiple words can be implemented as follows: the text information is split into multiple words according to preset word combinations.
Specifically, the preset word combinations can be obtained from a term database, and the words in the text information are matched against the preset word combinations in the database. If a word in the text information is identical to a preset word combination, that word is separated out of the text information, which yields multiple words.
Optionally, a word segmentation tool can be used to segment the text information.
For example, if the text information is "today the weather is very good", then after the text information is segmented with a word segmentation tool, the words obtained may be "today", "weather", "very", and "good".
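Matching against preset word combinations can be illustrated with forward maximum matching, a common dictionary-based segmentation technique that the source does not name and that is used here only as a stand-in. The function `segment`, the vocabulary `VOCAB`, and the Chinese sentence (an assumed rendering of the example above) are hypothetical.

```python
def segment(text, vocabulary):
    """Forward maximum matching: at each position take the longest
    vocabulary entry that matches, falling back to a single character."""
    max_len = max(len(word) for word in vocabulary)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocabulary or size == 1:
                words.append(piece)
                i += size
                break
    return words

# Hypothetical preset word combinations, as if loaded from a term database.
VOCAB = {"今天", "天气", "很", "好"}
print(segment("今天天气很好", VOCAB))  # ['今天', '天气', '很', '好']
```

A production system would more likely rely on an existing word segmentation tool, as the text suggests; the sketch only shows the matching idea.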
According to the above embodiments of the invention, calculating the first association-degree data between the first word and each second word using the first multidimensional array and each second multidimensional array may include: calculating a first Euclidean distance between the first multidimensional array of the first word and the second multidimensional array of each second word to obtain the first association-degree data. Calculating the second association-degree data between each third word in the first associated word set and the fourth word in the set of second words may include: calculating a second Euclidean distance between the third multidimensional array of the third word and the fourth multidimensional array of the fourth word to obtain the second association-degree data.
Specifically, the first multidimensional array of the first word corresponding to the text information and the second multidimensional array of each second word corresponding to the text information are obtained. The calculated first Euclidean distance between the first multidimensional array and each second multidimensional array is taken as the first association-degree data, and the calculated second Euclidean distance between the third multidimensional array and the fourth multidimensional array is taken as the second association-degree data.
Further, the Euclidean distance d can be calculated according to the formula $d = \lVert X - Y \rVert_2$, where, when calculating the first Euclidean distance, X is the first attribute array of the first word and Y is the second attribute array of the second word; when calculating the second Euclidean distance, X is the third attribute array of the third word and Y is the fourth attribute array of the fourth word.
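The distance formula can be written out directly. The sketch below assumes the attribute arrays are plain Python sequences of numbers; `euclidean_distance` is a hypothetical helper name, and the two-dimensional inputs are chosen only so the result is easy to check by hand.

```python
import math

def euclidean_distance(x, y):
    """d = ||X - Y||_2 for two attribute arrays of equal length."""
    if len(x) != len(y):
        raise ValueError("attribute arrays must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# sqrt((0 - 4)^2 + (3 - 0)^2) = sqrt(25) = 5.0
print(euclidean_distance([0.0, 3.0], [4.0, 0.0]))  # 5.0
```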
In the above embodiments, a word can be represented as an attribute array using the tool word2vec, which converts words into vector form.
Further, the attribute array of each word corresponding to the text information can be obtained by a machine learning method (e.g., a machine learning program). Optionally, the attribute array in this embodiment may be a 500-dimensional array; using 500 dimensions keeps the terminal running efficiently while keeping the results accurate.
With the above embodiments, an attribute array represents the attributes of a word corresponding to the text information. To obtain the first association-degree data, only the distances between the first word and the second words need to be calculated; to obtain the second association-degree data, only the distances between the third words in the first associated word set and the other words in the set of second words need to be calculated. There is no need to traverse all words in the text information one by one, which saves the space needed to store the words and the text information. When the volume of text information is large, the first and second association-degree data of the first word can still be obtained quickly and accurately.
According to the above embodiments of the invention, extracting the second words corresponding to first association-degree data that satisfies the first preset condition to obtain the first associated word set may include: sorting the calculated first Euclidean distances in reverse order to obtain a first sequence, and extracting the second words corresponding to the top N first Euclidean distances in the first sequence to obtain the first associated word set, where N is a natural number; or saving the second words whose first Euclidean distance is not greater than a first preset threshold into the first associated word set.
Specifically, after the first Euclidean distance between the first attribute array of the first word and the second attribute array of each second word is calculated, the calculated first Euclidean distances can be sorted in reverse order to obtain the first sequence, and each second word corresponding to one of the top N first Euclidean distances in the first sequence is saved into the first associated word set; alternatively, the second words whose first Euclidean distance is not greater than the first preset threshold are saved into the first associated word set.
N and the first preset threshold can be determined according to the acquisition request.
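The two selection rules can be sketched as follows. The sort direction is an assumption: the source says only that the distances are sorted in reverse order, while the threshold rule ("not greater than") and the later ranking rule suggest that smaller distances mean closer association, so this sketch puts the smallest distances first. The function names and the distance values are hypothetical.

```python
def select_top_n(distances, n):
    """Keep the n words with the smallest distance (assumed ordering)."""
    ranked = sorted(distances.items(), key=lambda item: item[1])
    return [word for word, _ in ranked[:n]]

def select_within_threshold(distances, threshold):
    """Keep every word whose distance is not greater than the threshold."""
    return [word for word, d in distances.items() if d <= threshold]

# Hypothetical first Euclidean distances from a keyword to three second words.
first_distances = {"square": 0.8, "rostrum": 0.5, "weather": 3.2}
print(select_top_n(first_distances, 2))               # ['rostrum', 'square']
print(select_within_threshold(first_distances, 1.0))  # ['square', 'rostrum']
```

The same two helpers apply unchanged to the second Euclidean distances when choosing the fifth words, with M and the second preset threshold in place of N and the first.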
With the above embodiments of the invention, words are identified by attribute arrays, and the distance between attribute arrays gives an objective measure of the similarity of words in the text information, which improves the accuracy of the resulting first associated word set. In the above embodiments, the first associated word set can be obtained by simple data processing, which speeds up obtaining the first associated word set of the first word.
In the above embodiments of the invention, taking the fourth word corresponding to second association-degree data that satisfies the second preset condition as the fifth word may include: sorting the calculated second Euclidean distances in reverse order to obtain a second sequence, and extracting the fourth words corresponding to the top M second Euclidean distances in the second sequence to obtain the fifth word, where M is a natural number; or taking the fourth words whose second Euclidean distance is not greater than a second preset threshold as the fifth word.
Specifically, after the second Euclidean distance between the third attribute array of the third word and the fourth attribute array of the fourth word is calculated, the calculated second Euclidean distances can be sorted in reverse order to obtain the second sequence, and each fourth word corresponding to one of the top M second Euclidean distances in the second sequence is taken as a fifth word; alternatively, the fourth words whose second Euclidean distance is not greater than the second preset threshold are taken as fifth words.
M and the second preset threshold can be determined according to the acquisition request.
According to the above embodiments of the invention, after the third word, the fifth word, and the first word that have an association relationship are saved to obtain the event phrase of the first word, the recognition method may also include: calculating third association-degree data of the fifth word, the third word, and the first word in each event phrase; and sorting the event phrases using the third association-degree data to obtain an event sequence. Calculating the third association-degree data may include: taking the sum of the first Euclidean distance and the second Euclidean distance as the third association-degree data. Sorting the event phrases using the third association-degree data may include: sorting the event phrases by the numerical value of the third association-degree data to obtain the event sequence.
Specifically, after the event phrases of the first word are obtained, the sum of the first Euclidean distance between the first word and the third word and the second Euclidean distance between the third word and the fifth word is taken as the third association-degree data, and the event phrases are sorted using the third association-degree data to obtain the event sequence. The attention of an event in the event sequence can be represented by the numerical value of its third association-degree data.
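The ranking step can be sketched as below. It assumes each event phrase is stored together with its first and second Euclidean distances, and it sorts in ascending order so that the phrase with the smallest sum comes first. `rank_event_phrases` and the sample distances are hypothetical.

```python
def rank_event_phrases(phrases):
    """Sort event phrases by third association-degree data (d1 + d2).

    `phrases` holds (first, third, fifth, d1, d2) tuples, where d1 is the
    first Euclidean distance (first word to third word) and d2 is the
    second Euclidean distance (third word to fifth word).
    """
    scored = [((first, third, fifth), d1 + d2)
              for first, third, fifth, d1, d2 in phrases]
    return sorted(scored, key=lambda item: item[1])

# Hypothetical event phrases with their two distances.
events = [
    ("k", "a", "b", 1.0, 2.5),  # third association-degree data: 3.5
    ("k", "c", "b", 1.5, 1.0),  # third association-degree data: 2.5
]
for phrase, score in rank_event_phrases(events):
    print(phrase, score)
```

Here the phrase ("k", "c", "b") is listed first because its sum, 2.5, is the smaller of the two.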
Fig. 2 is a schematic diagram of an optional relationship between words according to an embodiment of the invention.
As shown in Fig. 2, the word set obtained after word segmentation of the text information includes a first word set (which includes the first word of the above embodiments) and a second word set (which includes the second words of the above embodiments). The second word set (not shown) includes the third word set shown in Fig. 2 (i.e., the first associated word set of the above embodiments, which includes the third words) and a fourth word set (which includes the fourth words of the above embodiments). The fourth word set includes the fifth word set shown in Fig. 2 (which includes the fifth words of the above embodiments).
According to the above embodiment of the present invention, the first association degree data between the first word and each second word is calculated; the words whose first association degree data satisfies the first preset condition are taken as third words and form the first associated word set; the second association degree data between the words in the first associated word set and the fourth word set is then calculated, and the words whose second association degree data satisfies the second preset condition are taken as fifth words; the first word, the third word and the fifth word having an association relationship are then saved to obtain the event phrase of the first word. The sum of the first Euclidean distance between the third word and the first word and the second Euclidean distance between the fifth word and the third word in each event phrase is taken as the third association degree data, and the event phrases are sorted according to the numerical value of the third association degree data to obtain the event sequence.
By the above embodiment of the present invention, after the event sequence is obtained, the attention degree of each event in the event sequence can be determined from the numerical value of the third association degree data corresponding to that event; that is, the smaller the value of the third association degree data, the higher the attention degree of the corresponding event.
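As a non-limiting illustration of the sorting described above, the following Python sketch sums the two distances of each event phrase and sorts in ascending order, so that the phrase with the smallest third association degree data (the highest attention degree) comes first. The function name `rank_event_phrases`, the tuple layout and the sample words and distances are hypothetical and are not taken from the embodiment.

```python
def rank_event_phrases(phrases):
    """Rank event phrases by the sum of their two distances.

    `phrases` is a list of (event_phrase, first_distance, second_distance)
    tuples. The result is a list of (event_phrase, third_score) pairs in
    ascending order of score, i.e. highest attention degree first.
    """
    scored = [(phrase, d1 + d2) for phrase, d1, d2 in phrases]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical event phrases (first word, third word, fifth word) with
# made-up first and second Euclidean distances.
sample = [
    (("cup", "final", "ticket"), 1.0, 1.0),
    (("cup", "team", "coach"), 2.0, 1.0),
    (("cup", "final", "coach"), 1.0, 3.5),
]

for phrase, score in rank_event_phrases(sample):
    print(phrase, score)
```

Sorting in ascending order follows the statement that a smaller value corresponds to a higher attention degree.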
In an optional embodiment of the present invention, the associated events of a keyword (i.e., the first word of the above embodiment) and their attention degree may be identified based on word relevance analysis: by constructing a part-of-speech rule model of events and performing word association analysis, the related event phrases of the keyword are obtained and sorted by attention degree. Specifically, this can be achieved by the following steps:
1. Perform word segmentation on the text training sample (i.e., the text information of the above embodiment) to obtain a plurality of event words;
2. Represent each event word with a 500-dimensional array, and obtain the attribute array corresponding to each event word by machine learning;
3. Input one or more keywords, and perform relevance analysis between the keyword and the plurality of event words (i.e., calculate the Euclidean distance between the attribute array of the keyword and the attribute arrays of all event words, which is the first association degree data of the above embodiment) to obtain the relevance word list of the keyword (i.e., the first associated word set of the above embodiment);
4. Perform second-pass and third-pass relevance analysis training on the relevance word list to obtain the relevance phrases of the keyword (i.e., the set of third words and fifth words of the above embodiment), and sort the results of the multiple iterations (i.e., the first association degree data and the second association degree data of the above embodiment) by relevance score (i.e., the third association degree data of the above embodiment) to obtain the event sequence. The keyword and a relevance phrase together form an event (i.e., the event phrase of the above embodiment), which yields the events ranked by attention degree (i.e., the event sequence).
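The four steps above can be sketched, under simplifying assumptions, as the following Python example. Hand-written two-dimensional arrays stand in for the 500-dimensional attribute arrays that the embodiment obtains by machine learning, and only one second pass is performed. The names `identify_events`, `nearest`, `n` and `m`, the sample vocabulary, and the choice to treat every word outside the relevance word list as a candidate fourth word are all assumptions made for illustration.

```python
import math


def dist(x, y):
    """Euclidean distance between two equal-length arrays."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest(word, candidates, vectors, k):
    """Return the k candidates closest to `word` as (candidate, distance)."""
    pairs = [(c, dist(vectors[word], vectors[c])) for c in candidates]
    return sorted(pairs, key=lambda p: (p[1], p[0]))[:k]


def identify_events(keyword, vectors, n=2, m=1):
    """Return ((keyword, third, fifth), score) pairs, smallest score first."""
    others = [w for w in vectors if w != keyword]
    # Step 3: first relevance analysis gives the relevance word list.
    first_list = nearest(keyword, others, vectors, n)
    third_words = {w for w, _ in first_list}
    # Assumption: the fourth words are the remaining second words.
    fourth_words = [w for w in others if w not in third_words]
    events = []
    # Step 4: second pass, then score each phrase by d1 + d2.
    for third, d1 in first_list:
        for fifth, d2 in nearest(third, fourth_words, vectors, m):
            events.append(((keyword, third, fifth), d1 + d2))
    return sorted(events, key=lambda e: e[1])


# Toy attribute arrays (hypothetical values).
vectors = {
    "cup": [0.0, 0.0],
    "final": [3.0, 0.0],
    "team": [0.0, 4.0],
    "ticket": [3.0, 4.0],
    "coach": [0.0, 6.0],
    "rain": [12.0, 5.0],
}

for phrase, score in identify_events("cup", vectors):
    print(phrase, score)
```

With these toy values the keyword "cup" yields two event phrases, and the phrase with the smaller summed distance is listed first.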
By the above embodiment of the present invention, keywords are represented by arrays, which improves calculation speed and accuracy while reducing the operating cost of the data processing machine; attribute arrays are obtained by machine learning and word relevance is calculated with the Euclidean distance, which makes the relevance analysis more accurate; and second-pass and third-pass relevance analysis training by machine learning can identify semantic dependency relations in natural language and avoid missing critical events.
It should be noted that the steps illustrated in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one given here.
Fig. 3 is a schematic diagram of an event recognition device according to the present invention. As shown in Fig. 3, the recognition device may include: a word segmentation module 10, configured to perform word segmentation on text information obtained in advance to obtain a first word and a plurality of second words; an acquisition module 30, configured to obtain, by a machine learning method, a first multidimensional array of the first word corresponding to the text information and a second multidimensional array of each second word corresponding to the text information; a first calculation module 50, configured to calculate first association degree data between the first word and each second word using the first multidimensional array and each second multidimensional array; an extraction module 70, configured to extract the second words corresponding to the first association degree data that satisfies a first preset condition to obtain a first associated word set; a second calculation module 90, configured to calculate second association degree data between each third word in the first associated word set and a fourth word in the set of second words, wherein the set of second words includes the third word and the fourth word; a first determination module 110, configured to determine the fourth word corresponding to the second association degree data that satisfies a second preset condition as a fifth word; and a first saving module 130, configured to save the third word, the fifth word and the first word having an association relationship to obtain the event phrase of the first word.
With the embodiment of the present invention, after the text information obtained in advance is segmented to obtain the first word and a plurality of other words, the first association degree data between the first word and each other word is calculated to determine the first associated word set of the first word; the second association degree data between each word in the first associated word set and the other words is then calculated to obtain the associated words of the third words in the first associated word set; and the first word, the third word in the first associated word set, and the associated word of that third word (i.e., the fifth word) are saved to obtain the event phrase of the first word. In this embodiment, once the first associated word set of the first word has been determined, the associated words of each word in that set are determined, and the event phrase of the first word is generated from the first word, a word in the first associated word set, and the associated word of that word. The whole text information need not be traversed to obtain the event phrase (such as the associated event of a keyword), which improves the speed of obtaining event phrases. The embodiment thus addresses the slow speed and poor accuracy of identifying the associated events of a keyword in the prior art, and improves both the speed and the accuracy of identifying associated events.
In the above embodiment, the text information may be text obtained from the Internet (e.g., a news item or a microblog comment), electronic text obtained by scanning or typing in the content of a paper document, electronic text entered by a user through a terminal, and so on. Optionally, the text information may be organized in paragraphs; for example, one news item or one comment is one paragraph.
It should further be noted that performing word segmentation on the text information to obtain a plurality of words can be realized as follows: the text information is split into a plurality of words according to preset word combinations.
Specifically, the preset word combinations can be obtained from a word database, and the words in the text information are matched against the preset word combinations in the word database; if a word in the text information is identical to a preset word combination, that word is split off from the text information, so that a plurality of words is obtained.
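One common way to match text against a word database is forward maximum matching, sketched below in Python. The embodiment does not name a particular matching strategy, so the choice of forward maximum matching, the function name `segment`, the `max_len` parameter and the three-entry sample lexicon are assumptions made for illustration only.

```python
def segment(text, lexicon, max_len=4):
    """Split `text` into words by forward maximum matching.

    At each position the longest lexicon entry (up to `max_len`
    characters) that matches is split off. A character that starts no
    lexicon entry is emitted as a single-character token.
    """
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: a single character
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words


# Hypothetical preset word combinations from a word database.
lexicon = {"世界杯", "足球", "比赛"}
print(segment("世界杯足球比赛", lexicon))
```

Preferring the longest match keeps multi-character words such as "世界杯" together instead of splitting them into shorter entries.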
Alternatively, the text information may be segmented using a word segmentation tool.
According to the above embodiment of the present invention, the first calculation module includes: a first calculation submodule, configured to calculate a first Euclidean distance between the first multidimensional array of the first word and the second multidimensional array of each second word to obtain the first association degree data. The second calculation module includes: a second calculation submodule, configured to calculate a second Euclidean distance between the third multidimensional array of the third word and the fourth multidimensional array of the fourth word to obtain the second association degree data.
Specifically, the first multidimensional array of the first word corresponding to the text information and the second multidimensional array of each second word corresponding to the text information are obtained; the first Euclidean distance calculated between the first multidimensional array and each second multidimensional array is taken as the first association degree data, and the second Euclidean distance calculated between the third multidimensional array and the fourth multidimensional array is taken as the second association degree data.
Further, the Euclidean distance d may be calculated according to the following formula: $d = \|X - Y\|_2$, where, when the first Euclidean distance is calculated, X is the first attribute array of the first word and Y is the second attribute array of the second word; and when the second Euclidean distance is calculated, X is the third attribute array of the third word and Y is the fourth attribute array of the fourth word.
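For illustration, the distance formula can be written as a short Python function. Three-dimensional arrays are used here in place of the attribute arrays of the embodiment, and the function name `euclidean_distance` and the sample values are hypothetical.

```python
import math


def euclidean_distance(x, y):
    """Return the Euclidean (L2) distance between two attribute arrays."""
    if len(x) != len(y):
        raise ValueError("attribute arrays must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Toy attribute arrays: the differences are (3, 4, 0), so d = 5.0.
first_word_array = [1.0, 2.0, 3.0]
second_word_array = [4.0, 6.0, 3.0]
print(euclidean_distance(first_word_array, second_word_array))  # 5.0
```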
In the above embodiment, words can be represented as attribute arrays using the tool word2vec, which converts words into vector form.
Further, obtaining the attribute array of each word corresponding to the text information can be realized by a machine learning method (e.g., a machine learning program). Optionally, the attribute array in this embodiment may be a 500-dimensional array; using a 500-dimensional array in this embodiment can ensure both the operating efficiency of the terminal and the accuracy of the operation result.
By the above embodiment of the present invention, attribute arrays are used to represent the attributes of words corresponding to the text information. When the first association degree data is obtained, only the distance between the first word and each second word needs to be calculated; when the second association degree data is obtained, only the distance between the third words in the first associated word set and the other words in the set of second words needs to be calculated. All the words in the text information need not be traversed one by one, which saves the space needed to store the words and the text information; when the data volume of the text information is large, the first association degree data and the second association degree data of the first word can still be obtained quickly and accurately.
According to the above embodiment of the present invention, the extraction module may include: a first sorting module, configured to sort the calculated first Euclidean distances in reverse order to obtain a first sequence; and a first extraction submodule, configured to extract the second words corresponding to the top N first Euclidean distances in the first sequence to obtain the first associated word set, where N is a natural number; or a second saving module, configured to save the second words corresponding to the first Euclidean distances not greater than a first preset threshold into the first associated word set.
Specifically, after the first Euclidean distance between the first attribute array of the first word and the second attribute array of each second word is calculated, the calculated first Euclidean distances can be sorted in reverse order to obtain the first sequence, and each second word corresponding to a first Euclidean distance ranked in the top N of the first sequence is saved into the first associated word set; or the second words corresponding to the first Euclidean distances not greater than the first preset threshold are saved into the first associated word set.
N and the first preset threshold may be determined according to the acquisition request.
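The two alternative selection rules can be sketched in Python as follows. The sketch assumes that the sorted sequence places the smallest distances first, which is consistent with the threshold rule (distance not greater than the threshold) and with smaller distances indicating closer association; the text does not spell out the sort direction beyond "reverse order". The function name `select_associated`, its parameters and the sample distances are hypothetical.

```python
def select_associated(distances, top_n=None, threshold=None):
    """Select associated words from a {word: first_distance} mapping.

    With `top_n`, keep the N words with the smallest distances
    (assumption: nearest words rank first). With `threshold`, keep
    every word whose distance is not greater than the threshold.
    """
    if top_n is not None:
        ranked = sorted(distances, key=lambda w: (distances[w], w))
        return ranked[:top_n]
    if threshold is not None:
        return [w for w, d in distances.items() if d <= threshold]
    raise ValueError("either top_n or threshold must be given")


# Hypothetical first Euclidean distances to the first word.
first_distances = {"goal": 0.8, "match": 0.5, "weather": 3.2, "referee": 1.1}

print(select_associated(first_distances, top_n=2))
print(select_associated(first_distances, threshold=1.1))
```

The same function could serve for choosing fifth words from second Euclidean distances with M and the second preset threshold, since the two selection steps have the same shape.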
By the above embodiment of the present invention, words are identified by attribute arrays, and the similarity of words in the text information is represented objectively by the distance between attribute arrays, which improves the accuracy of the resulting first associated word set. In the above embodiment, the first associated word set can be obtained by simple data processing, which improves the speed of obtaining the first associated word set of the first word.
In the above embodiment of the present invention, the first determination module may include: a second sorting module, configured to sort the calculated second Euclidean distances in reverse order to obtain a second sequence; and a second extraction submodule, configured to extract the fourth words corresponding to the top M second Euclidean distances in the second sequence to obtain the fifth word, where M is a natural number; or a third saving module, configured to take the fourth words corresponding to the second Euclidean distances not greater than a second preset threshold as the fifth word.
Specifically, after the second Euclidean distance between the third attribute array of the third word and the fourth attribute array of the fourth word is calculated, the calculated second Euclidean distances can be sorted in reverse order to obtain the second sequence, and each fourth word corresponding to a second Euclidean distance ranked in the top M of the second sequence is taken as a fifth word; or the fourth words corresponding to the second Euclidean distances not greater than the second preset threshold are taken as fifth words.
M and the second preset threshold may be determined according to the acquisition request.
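As a non-limiting sketch of how fifth words and event phrases could be derived, the Python example below computes, for each third word, the second Euclidean distance to every fourth word and keeps the M nearest. It assumes that nearer words rank first. The function names, the `top_m` parameter and the two-dimensional toy arrays are hypothetical.

```python
import math


def dist(x, y):
    """Euclidean distance between two equal-length arrays."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def build_event_phrases(first_word, third_words, fourth_words, vectors, top_m=1):
    """Return (first, third, fifth) triples.

    For each third word, the fourth words are ordered by their second
    Euclidean distance to it, and the `top_m` nearest are taken as
    fifth words (assumption: nearest first).
    """
    phrases = []
    for third in third_words:
        ranked = sorted(
            fourth_words,
            key=lambda w: (dist(vectors[third], vectors[w]), w),
        )
        for fifth in ranked[:top_m]:
            phrases.append((first_word, third, fifth))
    return phrases


# Toy attribute arrays (hypothetical values).
toy_vectors = {
    "cup": [0.0, 0.0],
    "final": [1.0, 0.0],
    "team": [0.0, 2.0],
    "ticket": [1.0, 1.0],
    "coach": [0.0, 3.0],
}

print(build_event_phrases("cup", ["final", "team"], ["ticket", "coach"], toy_vectors))
```

Each resulting triple corresponds to one saved event phrase of the first word.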
According to the above embodiment of the present invention, the recognition device may further include: a third calculation module, configured to calculate, after the third word, the fifth word and the first word having an association relationship are saved to obtain the event phrase of the first word, third association degree data of the fifth word, the third word and the first word in each event phrase; and a third sorting module, configured to sort the event phrases using the third association degree data to obtain an event sequence. The third calculation module includes: a second determination module, configured to take the sum of the first Euclidean distance and the second Euclidean distance as the third association degree data. The third sorting module includes: a sorting submodule, configured to sort the event phrases according to the numerical value of the third association degree data to obtain the event sequence.
Specifically, after the event phrase of the first word is obtained, the sum of the first Euclidean distance between the first word and the third word and the second Euclidean distance between the third word and the fifth word is taken as the third association degree data, and the event phrases are sorted using the third association degree data to obtain the event sequence, wherein the attention degree of an event in the event sequence can be represented by the numerical value of the third association degree data.
The modules provided in this embodiment are used in the same way as the corresponding steps of the method embodiment, and their application scenarios may also be the same. It should of course be noted that the solutions involving the above modules are not limited to the content and scenarios of the above embodiments, and that the modules may run on a computer terminal or a mobile terminal and may be implemented in software or hardware.
As can be seen from the above description, the present invention achieves the following technical effects:
With the embodiment of the present invention, after the text information obtained in advance is segmented to obtain the first word and a plurality of other words, the first association degree data between the first word and each other word is calculated to determine the first associated word set of the first word; the second association degree data between each word in the first associated word set and the other words is then calculated to obtain the associated words of the third words in the first associated word set; and the first word, the third word in the first associated word set, and the associated word of that third word (i.e., the fifth word) are saved to obtain the event phrase of the first word. In this embodiment, once the first associated word set of the first word has been determined, the associated words of each word in that set are determined, and the event phrase of the first word is generated from the first word, a word in the first associated word set, and the associated word of that word. The whole text information need not be traversed to obtain the event phrase (such as the associated event of a keyword), which improves the speed of obtaining event phrases. The embodiment thus addresses the slow speed and poor accuracy of identifying the associated events of a keyword in the prior art, and improves both the speed and the accuracy of identifying associated events.
Obviously, those skilled in the art should understand that the above modules or steps of the present invention can be implemented with a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be fabricated as an individual integrated circuit module, or several of the modules or steps can be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims (10)

  1. A method for recognizing an event, characterized by comprising:
    performing word segmentation on text information obtained in advance to obtain a first word and a plurality of second words;
    obtaining, by a machine learning method, a first multidimensional array of the first word corresponding to the text information and a second multidimensional array of each second word corresponding to the text information;
    calculating first association degree data between the first word and each second word using the first multidimensional array and each second multidimensional array;
    extracting the second words corresponding to the first association degree data that satisfies a first preset condition to obtain a first associated word set;
    calculating second association degree data between each third word in the first associated word set and a fourth word in the set of second words, wherein the set of second words includes the third word and the fourth word;
    taking the fourth word corresponding to the second association degree data that satisfies a second preset condition as a fifth word;
    saving the third word, the fifth word and the first word having an association relationship to obtain an event phrase of the first word.
  2. The recognition method according to claim 1, characterized in that:
    calculating the first association degree data between the first word and each second word using the first multidimensional array and each second multidimensional array comprises: calculating a first Euclidean distance between the first multidimensional array of the first word and the second multidimensional array of each second word to obtain the first association degree data;
    calculating the second association degree data between each third word in the first associated word set and the fourth word in the set of second words comprises: calculating a second Euclidean distance between a third multidimensional array of the third word and a fourth multidimensional array of the fourth word to obtain the second association degree data.
  3. The recognition method according to claim 2, characterized in that extracting the second words corresponding to the first association degree data that satisfies the first preset condition to obtain the first associated word set comprises:
    sorting the calculated first Euclidean distances in reverse order to obtain a first sequence; and extracting the second words corresponding to the top N first Euclidean distances in the first sequence to obtain the first associated word set, wherein N is a natural number; or
    saving the second words corresponding to the first Euclidean distances not greater than a first preset threshold into the first associated word set.
  4. The recognition method according to claim 2, characterized in that taking the fourth word corresponding to the second association degree data that satisfies the second preset condition as the fifth word comprises:
    sorting the calculated second Euclidean distances in reverse order to obtain a second sequence; and extracting the fourth words corresponding to the top M second Euclidean distances in the second sequence to obtain the fifth word, wherein M is a natural number; or
    taking the fourth word corresponding to the second Euclidean distance not greater than a second preset threshold as the fifth word.
  5. The recognition method according to claim 2, characterized in that, after saving the third word, the fifth word and the first word having an association relationship to obtain the event phrase of the first word, the recognition method further comprises:
    calculating third association degree data of the fifth word, the third word and the first word in each event phrase;
    sorting the event phrases using the third association degree data to obtain an event sequence,
    wherein calculating the third association degree data of the fifth word, the third word and the first word in each event phrase comprises: taking the sum of the first Euclidean distance and the second Euclidean distance as the third association degree data;
    sorting the event phrases using the third association degree data to obtain the event sequence comprises: sorting the event phrases according to the numerical value of the third association degree data to obtain the event sequence.
  6. A device for recognizing an event, characterized by comprising:
    a word segmentation module, configured to perform word segmentation on text information obtained in advance to obtain a first word and a plurality of second words;
    an acquisition module, configured to obtain, by a machine learning method, a first multidimensional array of the first word corresponding to the text information and a second multidimensional array of each second word corresponding to the text information;
    a first calculation module, configured to calculate first association degree data between the first word and each second word using the first multidimensional array and each second multidimensional array;
    an extraction module, configured to extract the second words corresponding to the first association degree data that satisfies a first preset condition to obtain a first associated word set;
    a second calculation module, configured to calculate second association degree data between each third word in the first associated word set and a fourth word in the set of second words, wherein the set of second words includes the third word and the fourth word;
    a first determination module, configured to determine the fourth word corresponding to the second association degree data that satisfies a second preset condition as a fifth word;
    a first saving module, configured to save the third word, the fifth word and the first word having an association relationship to obtain an event phrase of the first word.
  7. The recognition device according to claim 6, characterized in that:
    the first calculation module comprises: a first calculation submodule, configured to calculate a first Euclidean distance between the first multidimensional array of the first word and the second multidimensional array of each second word to obtain the first association degree data;
    the second calculation module comprises: a second calculation submodule, configured to calculate a second Euclidean distance between a third multidimensional array of the third word and a fourth multidimensional array of the fourth word to obtain the second association degree data.
  8. The recognition device according to claim 7, characterized in that the extraction module comprises:
    a first sorting module, configured to sort the calculated first Euclidean distances in reverse order to obtain a first sequence; and a first extraction submodule, configured to extract the second words corresponding to the top N first Euclidean distances in the first sequence to obtain the first associated word set, wherein N is a natural number; or
    a second saving module, configured to save the second words corresponding to the first Euclidean distances not greater than a first preset threshold into the first associated word set.
  9. The recognition device according to claim 7, characterized in that the first determination module comprises:
    a second sorting module, configured to sort the calculated second Euclidean distances in reverse order to obtain a second sequence; and a second extraction submodule, configured to extract the fourth words corresponding to the top M second Euclidean distances in the second sequence to obtain the fifth word, wherein M is a natural number; or
    a third saving module, configured to take the fourth word corresponding to the second Euclidean distance not greater than a second preset threshold as the fifth word.
  10. The recognition device according to claim 7, characterized in that the recognition device further comprises:
    a third calculation module, configured to calculate, after the third word, the fifth word and the first word having an association relationship are saved to obtain the event phrase of the first word, third association degree data of the fifth word, the third word and the first word in each event phrase;
    a third sorting module, configured to sort the event phrases using the third association degree data to obtain an event sequence,
    wherein the third calculation module comprises: a second determination module, configured to take the sum of the first Euclidean distance and the second Euclidean distance as the third association degree data;
    the third sorting module comprises: a sorting submodule, configured to sort the event phrases according to the numerical value of the third association degree data to obtain the event sequence.
CN201410779142.2A 2014-12-15 2014-12-15 The recognition methods of event and device Active CN104462439B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410779142.2A CN104462439B (en) 2014-12-15 2014-12-15 The recognition methods of event and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410779142.2A CN104462439B (en) 2014-12-15 2014-12-15 The recognition methods of event and device

Publications (2)

Publication Number Publication Date
CN104462439A CN104462439A (en) 2015-03-25
CN104462439B true CN104462439B (en) 2017-12-19

Family

ID=52908474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410779142.2A Active CN104462439B (en) 2014-12-15 2014-12-15 The recognition methods of event and device

Country Status (1)

Country Link
CN (1) CN104462439B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649334B (en) * 2015-10-29 2020-09-15 北京国双科技有限公司 Processing method and device of associated word set
CN106156299B (en) * 2016-06-29 2019-09-20 北京小米移动软件有限公司 The subject content recognition methods of text information and device
CN109471926A (en) * 2018-10-30 2019-03-15 广东原昇信息科技有限公司 Intelligent word making method based on NLP and company information
CN109885696A (en) * 2019-02-01 2019-06-14 杭州晶一智能科技有限公司 A kind of foreign language word library construction method based on self study

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0227692D0 (en) * 2002-11-27 2003-01-08 Sony Uk Ltd Information storage and retrieval
US6633868B1 (en) * 2000-07-28 2003-10-14 Shermann Loyall Min System and method for context-based document retrieval
CN102063469A (en) * 2010-12-03 2011-05-18 百度在线网络技术(北京)有限公司 Method and device for acquiring relevant keyword message and computer equipment
CN103886063A (en) * 2014-03-18 2014-06-25 国家电网公司 Text retrieval method and device
CN104121907A (en) * 2014-07-30 2014-10-29 杭州电子科技大学 Square root cubature Kalman filter-based aircraft attitude estimation method

Also Published As

Publication number Publication date
CN104462439A (en) 2015-03-25

Similar Documents

Publication Publication Date Title
CN104408191B (en) The acquisition methods and device of the association keyword of keyword
CN104504150B (en) News public sentiment monitoring system
US10997678B2 (en) Systems and methods for image searching of patent-related documents
CN103198057B (en) One kind adds tagged method and apparatus to document automatically
CN104615593B (en) Hot microblog topic automatic testing method and device
CN111291210B (en) Image material library generation method, image material recommendation method and related devices
CN107463658B (en) Text classification method and device
CN107515877A (en) The generation method and device of sensitive theme word set
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
KR102398832B1 (en) Device, method and computer program for deriving response based on knowledge graph
GB2509773A (en) Automatic genre determination of web content
CN104462439B (en) Event recognition method and device
CN106372122B (en) Document classification method and system based on Wiki semantic matching
CN106844482B (en) Search engine-based retrieval information matching method and device
CN106033445A (en) Method and device for obtaining article association degree data
CN104537341A (en) Human face picture information obtaining method and device
CN106776567A (en) Internet big data analysis and extraction method and system
CN106354871A (en) Similarity search method of enterprise names
CN108388556B (en) Method and system for mining homogeneous entity
CN109033212A (en) File classification method based on similarity mode
CN103942274B (en) LDA-based labeling system and method for biomedical images
CN112699232A (en) Text label extraction method, device, equipment and storage medium
CN106372038A (en) Keyword extraction method and device
CN110362673A (en) Method and system for discriminating the content of computer vision papers based on abstract semantic analysis
CN106951511A (en) Text clustering method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Event recognizing method and device

Effective date of registration: 20190531

Granted publication date: 20171219

Pledgee: Shenzhen Black Horse World Investment Consulting Co.,Ltd.

Pledgor: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Registration number: 2019990000503

CP02 Change in the address of a patent holder

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Patentee after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Patentee before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

PP01 Preservation of patent right

Effective date of registration: 20240604

Granted publication date: 20171219