Summary of the Invention
A primary object of the present invention is to provide an event recognition method and device, so as to solve the problem in the prior art that related events of a keyword are identified slowly and with poor accuracy.
To achieve the above object, according to one aspect of the present invention, an event recognition method is provided.
The recognition method according to the present invention includes: performing word segmentation on pre-acquired text information to obtain a first word and a plurality of second words; obtaining, by a machine learning method, a first multi-dimensional array of the first word with respect to the text information and a second multi-dimensional array of each second word with respect to the text information; calculating first association degree data between the first word and each second word using the first multi-dimensional array and each second multi-dimensional array; extracting the second words corresponding to the first association degree data that satisfies a first preset condition, to obtain a first association word set; calculating second association degree data between each third word in the first association word set and each fourth word in the set of second words, wherein the set of second words includes the third words and the fourth words; taking the fourth word corresponding to the second association degree data that satisfies a second preset condition as a fifth word; and saving the third word, the fifth word and the first word that have an association relationship, to obtain an event phrase of the first word.
Further, calculating the first association degree data between the first word and each second word using the first multi-dimensional array and each second multi-dimensional array includes: calculating a first Euclidean distance between the first multi-dimensional array of the first word and the second multi-dimensional array of each second word, to obtain the first association degree data. Calculating the second association degree data between each third word in the first association word set and each fourth word in the set of second words includes: calculating a second Euclidean distance between a third multi-dimensional array of the third word and a fourth multi-dimensional array of the fourth word, to obtain the second association degree data.
Further, extracting the second words corresponding to the first association degree data that satisfies the first preset condition, to obtain the first association word set, includes: sorting the calculated first Euclidean distances in reverse order to obtain a first sequence, and extracting the second words corresponding to the first N first Euclidean distances in the first sequence to obtain the first association word set, wherein N is a natural number; or saving the second words corresponding to first Euclidean distances not greater than a first preset threshold into the first association word set.
Further, taking the fourth word corresponding to the second association degree data that satisfies the second preset condition as the fifth word includes: sorting the calculated second Euclidean distances in reverse order to obtain a second sequence, and extracting the fourth words corresponding to the first M second Euclidean distances in the second sequence to obtain the fifth word, wherein M is a natural number; or taking the fourth word corresponding to a second Euclidean distance not greater than a second preset threshold as the fifth word.
Further, after saving the third word, the fifth word and the first word that have an association relationship to obtain the event phrase of the first word, the recognition method also includes: calculating third association degree data of the fifth word, the third word and the first word in each event phrase; and sorting the event phrases using the third association degree data to obtain an event sequence. Calculating the third association degree data of the fifth word, the third word and the first word in each event phrase includes: taking the sum of the first Euclidean distance and the second Euclidean distance as the third association degree data. Sorting the event phrases using the third association degree data to obtain the event sequence includes: sorting the event phrases according to the numerical value of the third association degree data to obtain the event sequence.
To achieve the above object, according to another aspect of the present invention, an event recognition device is provided.
The recognition device according to the present invention includes: a word segmentation module, configured to perform word segmentation on pre-acquired text information to obtain a first word and a plurality of second words; an acquisition module, configured to obtain, by a machine learning method, a first multi-dimensional array of the first word with respect to the text information and a second multi-dimensional array of each second word with respect to the text information; a first calculation module, configured to calculate first association degree data between the first word and each second word using the first multi-dimensional array and each second multi-dimensional array; an extraction module, configured to extract the second words corresponding to the first association degree data that satisfies a first preset condition, to obtain a first association word set; a second calculation module, configured to calculate second association degree data between each third word in the first association word set and each fourth word in the set of second words, wherein the set of second words includes the third words and the fourth words; a first determining module, configured to take the fourth word corresponding to the second association degree data that satisfies a second preset condition as a fifth word; and a first saving module, configured to save the third word, the fifth word and the first word that have an association relationship, to obtain an event phrase of the first word.
Further, the first calculation module includes: a first calculation sub-module, configured to calculate a first Euclidean distance between the first multi-dimensional array of the first word and the second multi-dimensional array of each second word, to obtain the first association degree data. The second calculation module includes: a second calculation sub-module, configured to calculate a second Euclidean distance between a third multi-dimensional array of the third word and a fourth multi-dimensional array of the fourth word, to obtain the second association degree data.
Further, the extraction module includes: a first sorting module, configured to sort the calculated first Euclidean distances in reverse order to obtain a first sequence, and a first extraction sub-module, configured to extract the second words corresponding to the first N first Euclidean distances in the first sequence to obtain the first association word set, wherein N is a natural number; or a second saving module, configured to save the second words corresponding to first Euclidean distances not greater than a first preset threshold into the first association word set.
Further, the first determining module includes: a second sorting module, configured to sort the calculated second Euclidean distances in reverse order to obtain a second sequence, and a second extraction sub-module, configured to extract the fourth words corresponding to the first M second Euclidean distances in the second sequence to obtain the fifth word, wherein M is a natural number; or a third saving module, configured to take the fourth word corresponding to a second Euclidean distance not greater than a second preset threshold as the fifth word.
Further, the recognition device also includes: a third calculation module, configured to calculate, after the third word, the fifth word and the first word that have an association relationship are saved to obtain the event phrase of the first word, third association degree data of the fifth word, the third word and the first word in each event phrase; and a third sorting module, configured to sort the event phrases using the third association degree data to obtain an event sequence. The third calculation module includes: a second determining module, configured to take the sum of the first Euclidean distance and the second Euclidean distance as the third association degree data. The third sorting module includes: a sorting sub-module, configured to sort the event phrases according to the numerical value of the third association degree data to obtain the event sequence.
With the embodiment of the present invention, after the pre-acquired text information is segmented to obtain the first word and a plurality of other words, the first association degree data between the first word and each of the other words is calculated to determine the first association word set of the first word. The second association degree data between each word in the first association word set and the other words is then calculated to obtain the associated word of each third word in the first association word set, and the first word, the third word in the first association word set and the associated word of that third word (i.e. the fifth word) are saved to obtain the event phrase of the first word. In this embodiment, once the first association word set of the first word has been determined, the associated word of each word in that set is determined, and the event phrase of the first word is generated from the first word, the word in the first association word set and the associated word of that word. The whole text information therefore need not be traversed to obtain the event phrase (for example, a related event of a keyword), which speeds up obtaining the event phrase. The embodiment of the present invention thus solves the problem in the prior art that related events of a keyword are identified slowly and with poor accuracy, and improves both the speed and the accuracy of identifying related events.
Detailed Description of the Embodiments
First, some of the nouns and terms that appear in the description of the embodiments of the present invention are explained as follows:
Machine learning is a method of converting data into information by extracting rules or patterns from the data. The main machine learning methods are inductive learning and analytical learning. In a machine learning process, the data is first preprocessed to form features, and a model is then created from the features; the machine learning algorithm analyzes the collected data and assigns weights, thresholds and other parameters to achieve the learning objective.
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of the present invention are used to distinguish similar objects and are not used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present invention described herein can be implemented in an order other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 is a flowchart of an event recognition method according to an embodiment of the present invention. As shown in Fig. 1, the recognition method includes the following steps:
Step S102: perform word segmentation on pre-acquired text information to obtain a first word and a plurality of second words.
Step S104: obtain, by a machine learning method, a first multi-dimensional array of the first word with respect to the text information and a second multi-dimensional array of each second word with respect to the text information.
Step S106: calculate first association degree data between the first word and each second word using the first multi-dimensional array and each second multi-dimensional array.
Step S108: extract the second words corresponding to the first association degree data that satisfies a first preset condition, to obtain a first association word set.
Step S110: calculate second association degree data between each third word in the first association word set and each fourth word in the set of second words.
Here, the set of second words includes the third words and the fourth words.
Step S112: take the fourth word corresponding to the second association degree data that satisfies a second preset condition as a fifth word.
Step S114: save the third word, the fifth word and the first word that have an association relationship, to obtain an event phrase of the first word.
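Steps S102 to S114 can be sketched end to end as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the function names, the toy two-dimensional vectors in `TOY_VECTORS` and the parameters `n` and `m` are hypothetical; the fourth words are assumed to be the second words that are not in the first association word set; and "top N" is assumed to mean the N smallest Euclidean distances.

```python
import math

# Hypothetical toy vectors standing in for the arrays obtained in step S104.
TOY_VECTORS = {
    "earthquake": (0.0, 0.0),
    "rescue": (1.0, 0.0),
    "aftershock": (0.0, 1.5),
    "team": (2.0, 0.0),
    "magnitude": (0.0, 3.0),
    "weather": (9.0, 9.0),
}


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def event_phrases(first_word, vectors, n=2, m=1):
    """Return (first, third, fifth) event phrases for first_word."""
    second_words = [w for w in vectors if w != first_word]

    # S106: first association degree data (distance to every second word).
    d1 = {w: euclidean(vectors[first_word], vectors[w]) for w in second_words}

    # S108: first association word set (assumed: the n nearest second words).
    third_words = sorted(second_words, key=d1.get)[:n]

    # Assumption: fourth words are the second words outside that set.
    fourth_words = [w for w in second_words if w not in third_words]

    phrases = []
    for third in third_words:
        # S110: second association degree data for this third word.
        d2 = {w: euclidean(vectors[third], vectors[w]) for w in fourth_words}
        # S112: fifth words (assumed: the m nearest fourth words).
        for fifth in sorted(fourth_words, key=d2.get)[:m]:
            # S114: save the associated triple as an event phrase.
            phrases.append((first_word, third, fifth))
    return phrases


print(event_phrases("earthquake", TOY_VECTORS, n=2, m=1))
```

Note that the second stage only compares each third word with the fourth words, so the number of distance computations depends on `n` rather than on every pair of words in the text.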
With the embodiment of the present invention, after the pre-acquired text information is segmented to obtain the first word and a plurality of other words, the first association degree data between the first word and each of the other words is calculated to determine the first association word set of the first word. The second association degree data between each word in the first association word set and the other words is then calculated to obtain the associated word of each third word in the first association word set, and the first word, the third word in the first association word set and the associated word of that third word (i.e. the fifth word) are saved to obtain the event phrase of the first word. In this embodiment, once the first association word set of the first word has been determined, the associated word of each word in that set is determined, and the event phrase of the first word is generated from the first word, the word in the first association word set and the associated word of that word. The whole text information therefore need not be traversed to obtain the event phrase (for example, a related event of a keyword), which speeds up obtaining the event phrase. The embodiment of the present invention thus solves the problem in the prior art that related events of a keyword are identified slowly and with poor accuracy, and improves both the speed and the accuracy of identifying related events.
In the above embodiment, the text information may be text obtained from the Internet (e.g., a news item or a blog comment), electronic text obtained by scanning or typing in the content of a paper document, or electronic text input by a user through a terminal, and so on. Optionally, the text information may be organized in the form of paragraphs; for example, one news item or one comment is one paragraph.
It should further be noted that performing word segmentation on the text information to obtain a plurality of words may be implemented as follows: the text information is split into a plurality of words according to preset word combinations.
Specifically, the preset word combinations may be obtained from a word database, and the words in the text information are matched against the preset word combinations in the word database. If a word in the text information is identical to a preset word combination, that word is split out of the text information, thereby obtaining a plurality of words.
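The dictionary matching described above can be illustrated with forward maximum matching, one common way of splitting text against a word list. The source does not say which matching strategy is used, so this choice, the function name `segment` and the sample lexicon are all assumptions made for illustration.

```python
def segment(text, lexicon):
    """Split text into words by forward maximum matching against lexicon.

    Assumption: at each position the longest lexicon entry wins, and a single
    character is emitted when no lexicon entry matches.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical lexicon; unspaced input mimics text without word delimiters.
lexicon = {"today", "weather", "very", "good"}
print(segment("todayweatherverygood", lexicon))
```

A production word database would normally also carry frequencies or part-of-speech tags to resolve ambiguous splits; this sketch ignores that.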
Optionally, word segmentation may be performed on the text information using a word segmentation tool.
For example, if the text information is "the weather is very good today", then after the text information is segmented using a word segmentation tool, the resulting words may be "today", "weather", "very" and "good".
According to the above embodiment of the present invention, calculating the first association degree data between the first word and each second word using the first multi-dimensional array and each second multi-dimensional array may include: calculating a first Euclidean distance between the first multi-dimensional array of the first word and the second multi-dimensional array of each second word, to obtain the first association degree data. Calculating the second association degree data between each third word in the first association word set and each fourth word in the set of second words may include: calculating a second Euclidean distance between a third multi-dimensional array of the third word and a fourth multi-dimensional array of the fourth word, to obtain the second association degree data.
Specifically, the first multi-dimensional array of the first word with respect to the text information and the second multi-dimensional array of each second word with respect to the text information are obtained; the calculated first Euclidean distance between the first multi-dimensional array and each second multi-dimensional array is taken as the first association degree data, and the calculated second Euclidean distance between the third multi-dimensional array and the fourth multi-dimensional array is taken as the second association degree data.
Further, the Euclidean distance d may be calculated according to the following formula: $d = \lVert X - Y \rVert_2$, where, when calculating the first Euclidean distance, X is the first attribute array of the first word and Y is the second attribute array of the second word; when calculating the second Euclidean distance, X is the third attribute array of the third word and Y is the fourth attribute array of the fourth word.
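For reference, the norm in the distance formula can be written out componentwise. The component symbols $x_i$, $y_i$ and the dimension $n$ do not appear in the source and are introduced here only for illustration, together with a small two-dimensional worked example.

```latex
d = \lVert X - Y \rVert_2 = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},
\qquad X = (x_1, \dots, x_n),\; Y = (y_1, \dots, y_n)

% Worked example with n = 2, X = (0, 3), Y = (4, 0):
d = \sqrt{(0 - 4)^2 + (3 - 0)^2} = \sqrt{16 + 9} = 5
```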
In the above embodiment, a word may be represented as an attribute array using the word2vec tool. word2vec is a tool that converts words into vector form.
Further, obtaining the attribute array of each word with respect to the text information may be implemented by a machine learning method (e.g., a machine learning program). Optionally, the attribute array in this embodiment may be a 500-dimensional array; using a 500-dimensional array in this embodiment can maintain both the operating efficiency of the terminal and the accuracy of the operation result.
By the above embodiment of the present invention, the attributes of a word with respect to the text information are represented by an attribute array. When obtaining the first association degree data, only the distance between the first word and each second word needs to be calculated; when obtaining the second association degree data, only the distance between each third word in the first association word set and the other words in the set of second words needs to be calculated. There is no need to traverse all the words in the text information one by one, which saves the space needed to store the words and the text information. When the amount of text data is large, the first association degree data and the second association degree data of the first word can be obtained quickly and accurately.
According to the above embodiment of the present invention, extracting the second words corresponding to the first association degree data that satisfies the first preset condition, to obtain the first association word set, may include: sorting the calculated first Euclidean distances in reverse order to obtain a first sequence, and extracting the second words corresponding to the first N first Euclidean distances in the first sequence to obtain the first association word set, wherein N is a natural number; or saving the second words corresponding to first Euclidean distances not greater than a first preset threshold into the first association word set.
Specifically, after the first Euclidean distance between the first attribute array of the first word and the second attribute array of each second word is calculated, the calculated first Euclidean distances may be sorted in reverse order to obtain the first sequence, and each second word corresponding to one of the first N first Euclidean distances in the first sequence is saved into the first association word set; or the second words corresponding to first Euclidean distances not greater than the first preset threshold are saved into the first association word set.
Here, N and the first preset threshold may be determined according to the acquisition request.
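The two alternative selection rules can be sketched as one helper. The function name `select_associated` and its parameters are hypothetical. The source describes a reverse-order sort followed by taking the first N entries; this sketch assumes the intent is to keep the N smallest distances, which is consistent with the threshold alternative keeping distances not greater than the threshold.

```python
def select_associated(distances, top_n=None, max_distance=None):
    """Pick associated words from a {word: euclidean_distance} mapping.

    Exactly one rule must be chosen:
      top_n        -- keep the top_n words with the smallest distance (assumed)
      max_distance -- keep every word whose distance is <= max_distance
    """
    if (top_n is None) == (max_distance is None):
        raise ValueError("specify exactly one of top_n or max_distance")

    # Rank words from nearest to farthest.
    ranked = sorted(distances.items(), key=lambda item: item[1])

    if top_n is not None:
        return [word for word, _ in ranked[:top_n]]
    return [word for word, dist in ranked if dist <= max_distance]


# Hypothetical first Euclidean distances for one keyword.
first_distances = {"rescue": 1.0, "aftershock": 1.5, "team": 2.0, "weather": 12.7}
print(select_associated(first_distances, top_n=2))
print(select_associated(first_distances, max_distance=2.0))
```

The same helper could serve the second stage by passing the second Euclidean distances with M and the second preset threshold.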
By the above embodiment of the present invention, words are identified by attribute arrays, and the distance between attribute arrays objectively represents the similarity of words in the text information, which improves the accuracy of the resulting first association word set. In the above embodiment, the first association word set can be obtained by simple data processing, which speeds up obtaining the first association word set of the first word.
In the above embodiment of the present invention, taking the fourth word corresponding to the second association degree data that satisfies the second preset condition as the fifth word may include: sorting the calculated second Euclidean distances in reverse order to obtain a second sequence, and extracting the fourth words corresponding to the first M second Euclidean distances in the second sequence to obtain the fifth word, wherein M is a natural number; or taking the fourth word corresponding to a second Euclidean distance not greater than a second preset threshold as the fifth word.
Specifically, after the second Euclidean distance between the third attribute array of the third word and the fourth attribute array of each fourth word is calculated, the calculated second Euclidean distances may be sorted in reverse order to obtain the second sequence, and each fourth word corresponding to one of the first M second Euclidean distances in the second sequence is taken as a fifth word; or the fourth word corresponding to a second Euclidean distance not greater than the second preset threshold is taken as the fifth word.
Here, M and the second preset threshold may be determined according to the acquisition request.
According to the above embodiment of the present invention, after saving the third word, the fifth word and the first word that have an association relationship to obtain the event phrase of the first word, the recognition method may also include: calculating third association degree data of the fifth word, the third word and the first word in each event phrase; and sorting the event phrases using the third association degree data to obtain an event sequence. Calculating the third association degree data of the fifth word, the third word and the first word in each event phrase may include: taking the sum of the first Euclidean distance and the second Euclidean distance as the third association degree data. Sorting the event phrases using the third association degree data to obtain the event sequence may include: sorting the event phrases according to the numerical value of the third association degree data to obtain the event sequence.
Specifically, after the event phrases of the first word are obtained, the sum of the first Euclidean distance between the first word and the third word and the second Euclidean distance between the third word and the fifth word is taken as the third association degree data, and the event phrases are sorted using the third association degree data to obtain the event sequence, wherein the attention degree of each event in the event sequence can be represented by the numerical value of the third association degree data.
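The scoring and ordering of event phrases can be sketched as below. The function name `rank_event_phrases` and the tuple layout are hypothetical; ascending order is used on the assumption that a smaller summed distance indicates a more closely associated event.

```python
def rank_event_phrases(candidates):
    """Rank event phrases by third association degree data.

    candidates: iterable of (first, third, fifth, d1, d2), where d1 is the
    first Euclidean distance (first word to third word) and d2 is the second
    Euclidean distance (third word to fifth word).
    Returns a list of ((first, third, fifth), score), smallest score first.
    """
    scored = [
        ((first, third, fifth), d1 + d2)  # third association degree data
        for first, third, fifth, d1, d2 in candidates
    ]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical candidates with precomputed distances.
candidates = [
    ("earthquake", "aftershock", "magnitude", 1.5, 1.5),
    ("earthquake", "rescue", "team", 1.0, 1.0),
]
for phrase, score in rank_event_phrases(candidates):
    print(phrase, score)
```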
Fig. 2 is a schematic diagram of an optional relationship between words according to an embodiment of the present invention.
As shown in Fig. 2, the words obtained after word segmentation of the text information include a first word set (which contains the first word of the above embodiment) and a second word set (which contains the second words of the above embodiment). The second word set (not shown) includes the third word set shown in Fig. 2 (i.e. the first association word set of the above embodiment, which contains the third words of the above embodiment) and a fourth word set (which contains the fourth words of the above embodiment). The fourth word set includes the fifth word set shown in Fig. 2 (which contains the fifth words of the above embodiment).
According to the above embodiment of the present invention, the first association degree data between the first word and each second word is calculated, and the words whose first association degree data satisfies the first preset condition are taken as third words to form the first association word set. The second association degree data between the words in the first association word set and the fourth word set is then calculated, and the words whose second association degree data satisfies the second preset condition are taken as fifth words. The first word, the third word and the fifth word that have an association relationship are then saved to obtain the event phrases of the first word. For each event phrase, the sum of the first Euclidean distance between the third word and the first word and the second Euclidean distance between the fifth word and the third word is taken as the third association degree data, and the event phrases are sorted according to the numerical value of the third association degree data to obtain the event sequence.
By the above embodiment of the present invention, after the event sequence is obtained, the attention degree of each event in the event sequence can be determined according to the numerical value of its third association degree data: the smaller the value of the third association degree data, the higher the attention degree of the corresponding event.
In an optional embodiment of the present invention, the related events of a keyword (i.e. the first word in the above embodiment) and their attention degrees may be identified based on a word association analysis technique: by constructing a part-of-speech rule model of events and performing word association analysis, the related event phrases of the keyword are obtained and sorted by attention degree. Specifically, this may be implemented by the following steps:
1. Perform word segmentation on a text training sample (i.e. the text information in the above embodiment) to obtain a plurality of event words.
2. Represent each event word by a 500-dimensional array, obtaining the attribute array of each event word by machine learning.
3. Input one or more keywords and perform association analysis between each keyword and the plurality of event words (that is, calculate the Euclidean distance between the attribute array of the keyword and the attribute arrays of all event words, i.e. the first association degree data in the above embodiment) to obtain the associated word list of the keyword (i.e. the first association word set in the above embodiment).
4. Perform second-round and third-round association analysis training on the associated word list to obtain the associated phrases of the keyword (i.e. the set of third words and fifth words in the above embodiment), and sort the results of the multiple iterations (i.e. the first association degree data and the second association degree data in the above embodiment) by association score (i.e. the third association degree data in the above embodiment) to obtain an event sequence. The keyword and an associated phrase form an event (i.e. the event phrase in the above embodiment), which yields the events and their attention degree ranking (i.e. the event sequence).
By the above embodiment of the present invention, keywords are identified by arrays, which improves calculation speed and accuracy while reducing the running cost of the data processing machine. Attribute arrays are obtained by a machine learning method and word association is calculated by Euclidean distance, which makes the association analysis more accurate. Second-round and third-round association analysis training by machine learning can identify semantic dependencies in natural language and avoids missing key events.
It should be noted that the steps illustrated in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one given here.
Fig. 3 is a schematic diagram of an event recognition device according to the present invention. As shown in Fig. 3, the recognition device may include: a word segmentation module 10, configured to perform word segmentation on pre-acquired text information to obtain a first word and a plurality of second words; an acquisition module 30, configured to obtain, by a machine learning method, a first multi-dimensional array of the first word with respect to the text information and a second multi-dimensional array of each second word with respect to the text information; a first calculation module 50, configured to calculate first association degree data between the first word and each second word using the first multi-dimensional array and each second multi-dimensional array; an extraction module 70, configured to extract the second words corresponding to the first association degree data that satisfies a first preset condition, to obtain a first association word set; a second calculation module 90, configured to calculate second association degree data between each third word in the first association word set and each fourth word in the set of second words, wherein the set of second words includes the third words and the fourth words; a first determining module 110, configured to take the fourth word corresponding to the second association degree data that satisfies a second preset condition as a fifth word; and a first saving module 130, configured to save the third word, the fifth word and the first word that have an association relationship, to obtain an event phrase of the first word.
With the embodiment of the present invention, after the pre-acquired text information is segmented to obtain the first word and a plurality of other words, the first association degree data between the first word and each of the other words is calculated to determine the first association word set of the first word. The second association degree data between each word in the first association word set and the other words is then calculated to obtain the associated word of each third word in the first association word set, and the first word, the third word in the first association word set and the associated word of that third word (i.e. the fifth word) are saved to obtain the event phrase of the first word. In this embodiment, once the first association word set of the first word has been determined, the associated word of each word in that set is determined, and the event phrase of the first word is generated from the first word, the word in the first association word set and the associated word of that word. The whole text information therefore need not be traversed to obtain the event phrase (for example, a related event of a keyword), which speeds up obtaining the event phrase. The embodiment of the present invention thus solves the problem in the prior art that related events of a keyword are identified slowly and with poor accuracy, and improves both the speed and the accuracy of identifying related events.
In the above embodiment, the text information may be text obtained from the Internet (e.g., a news item or a microblog comment), electronic text obtained by scanning or typing in the content of a paper document, or electronic text entered by a user through a terminal, and so on. Optionally, the text information may be organized in paragraphs; for example, one news item or one comment is one paragraph.
It should be further noted that performing word segmentation on the text information to obtain a plurality of words may be implemented as follows: the text information is split into a plurality of words according to preset word combinations.
Specifically, the preset word combinations may be obtained from a word database, and the words in the text information are matched against the preset word combinations in the word database. If a word in the text information is identical to a preset word combination, that word is separated out of the text information, and a plurality of words is thereby obtained.
Optionally, the word segmentation may be performed on the text information using a word segmentation tool.
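The dictionary-matching step described above can be sketched as follows. This is only an illustrative sketch: the text does not specify a matching strategy, so greedy longest-match-first is an assumption made here, and the function name `segment`, its parameters, and the sample vocabulary are hypothetical.

```python
def segment(text, vocabulary, max_len=None):
    """Split text into words by matching against a preset vocabulary.

    Assumption: greedy longest-match-first. The source only says that
    words identical to a preset word combination are separated out.
    """
    if max_len is None:
        max_len = max((len(w) for w in vocabulary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocabulary:
                words.append(candidate)
                i += size
                break
        else:
            # No vocabulary entry starts here: skip one character.
            i += 1
    return words


vocabulary = {"stock", "market", "stockmarket", "falls"}
words = segment("stockmarketfalls", vocabulary)
```

Here the longer entry "stockmarket" is preferred over "stock" followed by "market"; a different matching policy would be equally compatible with the description.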
According to the above embodiment of the present invention, the first computing module includes: a first computing submodule, configured to calculate a first Euclidean distance between the first multidimensional array of the first word and the second multidimensional array of each second word, to obtain the first association degree data. The second computing module includes: a second computing submodule, configured to calculate a second Euclidean distance between a third multidimensional array of the third word and a fourth multidimensional array of the fourth word, to obtain the second association degree data.
Specifically, the first multidimensional array of the first word with respect to the text information and the second multidimensional array of each second word with respect to the text information are obtained. The calculated first Euclidean distance between the first multidimensional array and each second multidimensional array is used as the first association degree data, and the calculated second Euclidean distance between the third multidimensional array and the fourth multidimensional array is used as the second association degree data.
Further, the Euclidean distance d may be calculated according to the following formula: $d = \lVert X - Y \rVert_2$, where, when the first Euclidean distance is calculated, X is the first attribute array of the first word and Y is the second attribute array of the second word; and when the second Euclidean distance is calculated, X is the third attribute array of the third word and Y is the fourth attribute array of the fourth word.
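As a minimal illustration of this distance, the following sketch computes the Euclidean (L2) distance between two attribute arrays represented as plain Python lists. The function name `euclidean_distance` and the sample arrays are hypothetical and chosen only for illustration.

```python
import math


def euclidean_distance(x, y):
    """Euclidean (L2) distance between two attribute arrays of equal length."""
    if len(x) != len(y):
        raise ValueError("attribute arrays must have the same dimension")
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Two-dimensional arrays are used here only to keep the example readable.
d = euclidean_distance([0.0, 0.0], [3.0, 4.0])  # 5.0
```

A smaller distance means the two words are closer in the attribute space, which matches the threshold conditions in the text that keep distances not greater than a preset value.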
In the above embodiment, the tool word2vec may be used to represent words as attribute arrays. word2vec is a tool that converts words into vector form.
Further, obtaining the attribute array of each word with respect to the text information may be implemented by a machine learning method (e.g., a machine learning program). Optionally, the attribute array in this embodiment may be a 500-dimensional array; using a 500-dimensional array here can ensure both the operating efficiency of the terminal and the accuracy of the results.
With the above embodiment of the present invention, attribute arrays are used to represent the attributes of words with respect to the text information. To obtain the first association degree data, only the distance between the first word and each second word needs to be calculated. To obtain the second association degree data, only the distance between each third word in the first associated word set and the other words in the set of second words needs to be calculated. There is no need to traverse all the words in the text information one by one, which saves the space needed to store the words and the text information. When the volume of text information is large, the first association degree data and the second association degree data of the first word can still be obtained quickly and accurately.
According to the above embodiment of the present invention, the extraction module may include: a first sorting module, configured to sort the calculated first Euclidean distances in reverse order to obtain a first sequence; and a first extraction submodule, configured to extract the second words corresponding to the first N first Euclidean distances in the first sequence, to obtain the first associated word set, where N is a natural number. Alternatively, the extraction module may include: a second saving module, configured to save into the first associated word set the second words whose first Euclidean distance is not greater than a first preset threshold.
Specifically, after the first Euclidean distance between the first attribute array of the first word and the second attribute array of each second word is calculated, the calculated first Euclidean distances may be sorted in reverse order to obtain the first sequence, and each second word whose first Euclidean distance ranks among the first N in the first sequence is saved into the first associated word set. Alternatively, the second words whose first Euclidean distance is not greater than the first preset threshold are saved into the first associated word set.
Here, N and the first preset threshold may be determined according to the acquisition request.
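The two selection rules above (first N, or a distance threshold) can be sketched as follows. This is an illustrative sketch under stated assumptions: the sort direction is not made explicit in the text, so the code assumes that the smallest distances rank first, which is consistent with the threshold rule that keeps distances not greater than the preset value. The function name `select_associated` and its parameters are hypothetical.

```python
def select_associated(distances, top_n=None, max_distance=None):
    """Pick associated words from a mapping of word -> Euclidean distance.

    Exactly one rule is applied: the top_n nearest words, or all words
    whose distance does not exceed max_distance.
    Assumption: nearer words (smaller distances) rank first.
    """
    if (top_n is None) == (max_distance is None):
        raise ValueError("give exactly one of top_n or max_distance")
    ranked = sorted(distances.items(), key=lambda item: item[1])
    if top_n is not None:
        return [word for word, _ in ranked[:top_n]]
    return [word for word, dist in ranked if dist <= max_distance]


# Hypothetical first Euclidean distances from the first word to each second word.
first_distances = {"merger": 0.9, "shares": 0.2, "trading": 0.5}
by_rank = select_associated(first_distances, top_n=2)
by_threshold = select_associated(first_distances, max_distance=0.5)
```

The same helper could serve the later step that selects the fifth word, with M and the second preset threshold in place of N and the first preset threshold.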
With the above embodiment of the present invention, words are identified by attribute arrays, and the distance between attribute arrays objectively represents the similarity of words in the text information, which improves the accuracy of the resulting first associated word set. In the above embodiment, the first associated word set can be obtained through simple data processing, which improves the speed of obtaining the first associated word set of the first word.
In the above embodiment of the present invention, the first determining module may include: a second sorting module, configured to sort the calculated second Euclidean distances in reverse order to obtain a second sequence; and a second extraction submodule, configured to extract the fourth words corresponding to the first M second Euclidean distances in the second sequence, to obtain the fifth word, where M is a natural number. Alternatively, the first determining module may include: a third saving module, configured to take as the fifth word the fourth words whose second Euclidean distance is not greater than a second preset threshold.
Specifically, after the second Euclidean distance between the third attribute array of the third word and the fourth attribute array of each fourth word is calculated, the calculated second Euclidean distances may be sorted in reverse order to obtain the second sequence, and each fourth word whose second Euclidean distance ranks among the first M in the second sequence is taken as the fifth word. Alternatively, the fourth words whose second Euclidean distance is not greater than the second preset threshold are taken as the fifth word.
Here, M and the second preset threshold may be determined according to the acquisition request.
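To show how the second stage fits together, the following sketch builds event phrases as (first word, third word, fifth word) triples. It is illustrative only and rests on assumptions: the fourth words are taken to be every second word other than the current third word, the first M rule is used with the smallest distances ranking first, and the names `event_phrases`, `vectors`, and the sample two-dimensional arrays are hypothetical.

```python
import math


def distance(x, y):
    """Euclidean distance between two attribute arrays."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def event_phrases(first_word, related_words, vectors, top_m):
    """Return (first word, third word, fifth word) triples.

    related_words: the first associated word set (the third words).
    vectors: word -> attribute array, for the first word and all second words.
    Assumption: fourth words are all second words except the current third word.
    """
    second_words = [w for w in vectors if w != first_word]
    phrases = []
    for third in related_words:
        candidates = [w for w in second_words if w != third]
        # Nearest fourth words first; the first top_m become fifth words.
        candidates.sort(key=lambda w: distance(vectors[third], vectors[w]))
        for fifth in candidates[:top_m]:
            phrases.append((first_word, third, fifth))
    return phrases


vectors = {
    "bank": [0.0, 0.0],
    "loan": [1.0, 0.0],
    "rate": [1.0, 1.0],
    "storm": [5.0, 5.0],
}
phrases = event_phrases("bank", ["loan"], vectors, top_m=1)
```

Only the third words and their candidate fourth words are compared, so the full text is not traversed again at this stage.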
According to the above embodiment of the present invention, the recognition device may further include: a third computing module, configured to calculate, after the third word, the fifth word, and the first word that have an association relationship have been saved and the event phrases of the first word obtained, third association degree data of the fifth word, the third word, and the first word in each event phrase; and a third sorting module, configured to sort the event phrases using the third association degree data to obtain an event sequence. The third computing module includes: a second determining module, configured to take the sum of the first Euclidean distance and the second Euclidean distance as the third association degree data. The third sorting module includes: a sorting submodule, configured to sort the event phrases according to the numerical magnitude of the third association degree data to obtain the event sequence.
Specifically, after the event phrases of the first word are obtained, the sum of the first Euclidean distance between the first word and the third word and the second Euclidean distance between the third word and the fifth word is taken as the third association degree data, and the event phrases are sorted using the third association degree data to obtain an event sequence. The degree of attention of each event in the event sequence may be represented by the numerical magnitude of its third association degree data.
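The ranking step can be sketched as follows. This is a hedged illustration: the text only says that event phrases are sorted by the numerical magnitude of the third association degree data, so ascending order (smallest summed distance first) is an assumption, and the function name `rank_event_phrases` and the tuple layout are hypothetical.

```python
def rank_event_phrases(phrases):
    """Sort event phrases by their third association degree data.

    phrases: list of (first, third, fifth, d1, d2), where d1 is the first
    Euclidean distance (first word to third word) and d2 is the second
    Euclidean distance (third word to fifth word).
    Assumption: ascending order, so the smallest sum ranks first.
    """
    scored = [((first, third, fifth), d1 + d2)
              for first, third, fifth, d1, d2 in phrases]
    return sorted(scored, key=lambda item: item[1])


phrases = [
    ("bank", "loan", "rate", 2.0, 1.0),
    ("bank", "credit", "default", 0.5, 0.25),
]
event_sequence = rank_event_phrases(phrases)
```

Each entry of the resulting sequence pairs an event phrase with the sum of its two distances, so the value that represents the degree of attention stays available alongside the phrase.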
The modules provided in this embodiment are used in the same way as the corresponding steps of the method embodiment, and their application scenarios may also be the same. It should of course be noted that the solutions involving the above modules are not limited to the content and scenarios of the above embodiment, and that the above modules may run on a terminal or a mobile terminal and may be implemented in software or hardware.
As can be seen from the above description, the present invention achieves the following technical effects:
After the text information obtained in advance is segmented into a first word and a plurality of other words, the first association degree data between the first word and each other word is calculated to determine the first associated word set of the first word. The second association degree data between each word in the first associated word set and the other words is then calculated to obtain the associated words of each third word in that set. The first word, the third word, and the associated word of the third word (i.e., the fifth word) are saved to obtain the event phrase of the first word. Because the event phrase (e.g., an event associated with a keyword) is generated from the first word, the words in the first associated word set, and their associated words, the entire text information does not need to be traversed, which improves the speed of obtaining event phrases. The embodiment of the present invention thus addresses the prior-art problem that recognizing the events associated with a keyword is slow and inaccurate, and improves both the speed and the accuracy of recognizing associated events.
Obviously, those skilled in the art should understand that the above modules or steps of the present invention may be implemented with a general-purpose computing device. They may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they may each be fabricated as a separate integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.
The above are merely preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.