Summary of the Invention
To solve the above technical problem, the present application proposes a method for extracting time words, which addresses the problem that time-word extraction rules are complicated and prone to causing a large number of omissions.
In a first aspect, a method for extracting time words is provided, comprising the following steps:
Obtaining a text from which time words are to be extracted;
Extracting all candidate words in the text, each candidate word having at least one meaning that denotes time;
Determining, for each candidate word, its corresponding semantic region in the text, the semantic region comprising the candidate word and a predetermined number of characters before and after the candidate word;
If the semantic region does not contain a first preset character string corresponding to the candidate word, determining that the candidate word is a time word, and outputting the time word.
With reference to the first aspect, in a first possible implementation of the first aspect, the step of extracting all candidate words in the text comprises:
Extracting prime words from the text;
Determining, for each prime word, its corresponding matching area in the text, the matching area comprising the prime word and a predetermined number of characters before and after the prime word;
Generating candidate words, each candidate word being a word in the matching area that contains the prime word and has at least one meaning that denotes time.
With reference to the first implementation of the first aspect, in a second possible implementation of the first aspect, the step of outputting the time word comprises:
If the time word contains a numeral, judging whether the time word is of a preset exclusion type;
If it is not of the preset exclusion type, converting the time word into a preset format;
Outputting the converted time word.
With reference to the first aspect and the above possible implementations, in a third possible implementation of the first aspect, the step of outputting the time word comprises:
Determining the start and end positions of each time word in the text;
Merging time words whose start and end positions overlap or are adjacent;
Outputting the merged time words.
With reference to the first aspect and the above possible implementations, in a fourth possible implementation of the first aspect, the step of merging time words whose start and end positions overlap or are adjacent comprises:
Judging whether the start and end positions of the current time word overlap or are adjacent to those of the next time word;
If they overlap or are adjacent, updating the current time word and the next time word to the union of the two;
Determining the start and end positions of the updated time word in the text;
If the start and end positions of the updated time word neither overlap nor are adjacent to those of the next time word after it, taking the updated time word as the merged time word.
In a second aspect, a device for extracting time words is provided, comprising:
An acquiring unit, configured to obtain a text from which time words are to be extracted;
A processing unit, configured to extract all candidate words in the text, determine the semantic region corresponding to each candidate word in the text, and determine that a candidate word is a time word when its semantic region does not contain the first preset character string corresponding to that candidate word; wherein each candidate word has at least one meaning that denotes time, and the semantic region comprises the candidate word and a predetermined number of characters before and after the candidate word;
An output unit, configured to output the time word.
With reference to the second aspect, in a first possible implementation of the second aspect, the processing unit is further configured to extract prime words from the text, determine the matching area corresponding to each prime word in the text, and generate candidate words; wherein the matching area comprises the prime word and a predetermined number of characters before and after the prime word, and each candidate word is a word in the matching area that contains the prime word and has at least one meaning that denotes time.
With reference to the first implementation of the second aspect, in a second possible implementation of the second aspect, the processing unit is further configured to judge, when a time word contains a numeral, whether the time word is of a preset exclusion type, and, if it is not, to convert the time word into a preset format; the output unit is further configured to output the converted time word.
With reference to the second aspect and the above possible implementations, in a third possible implementation of the second aspect, the processing unit is further configured to determine the start and end positions of each time word in the text, and to merge time words whose start and end positions overlap or are adjacent; the output unit is further configured to output the merged time words.
With reference to the second aspect and the above possible implementations, in a fourth possible implementation of the second aspect, the processing unit is further configured to: judge whether the start and end positions of the current time word overlap or are adjacent to those of the next time word; when they overlap or are adjacent, update the current time word and the next time word to the union of the two; determine the start and end positions of the updated time word in the text; and, when the start and end positions of the updated time word neither overlap nor are adjacent to those of the next time word after it, take the updated time word as the merged time word.
In the method and device for extracting time words in the above technical solution, a text from which time words are to be extracted is first obtained, and all candidate words are extracted from the text. Each candidate word has at least one meaning that denotes time; that is, a candidate word may or may not be a time word that denotes time in the text. The semantic region corresponding to each candidate word in the text is then determined, and it is judged whether the semantic region contains the first preset character string corresponding to the candidate word. This determines which candidate words are time words in the text and eliminates ambiguity. Finally, the time words are output, completing the extraction of time words from the text.
The method does not extract the correct time words from the text in a single pass. Instead, it first extracts candidate words, then determines the semantic region of each candidate word, and then uses the semantic region and the first preset character string to judge whether the candidate word is a time word in the text, so that the correct time words are extracted. On the one hand, this simplifies the extraction rules and enlarges the set of extracted candidate words, avoiding the situation in which overly complicated extraction rules cause many time words to be missed. On the other hand, disambiguating the candidate words allows the time words in the text to be extracted more accurately, which makes the method particularly suitable for Chinese text, where time words take diverse forms. Applied to Chinese text, the method yields more comprehensive coverage and more diverse forms of extracted time words, while greatly reducing omissions.
Detailed Description
The embodiments of the present application are described in detail below.
Referring to Fig. 1, a first embodiment of the present application provides a method for extracting time words, comprising steps S100 to S400.
S100: Obtaining a text from which time words are to be extracted.
In step S100, the text from which time words are to be extracted may take various forms, such as a Chinese text written in the vernacular or a Chinese text written in classical Chinese; the present application does not limit this.
S200: Extracting all candidate words in the text, each candidate word having at least one meaning that denotes time.
In step S200, each candidate word has at least one meaning that denotes time; that is, besides at least one meaning that denotes time, a candidate word may also have meanings that denote something else. For example, "三号" ("No. 3") may denote a date, or may denote the number of a particular person or thing in a series of persons or things.
All candidate words in the text may be extracted by constructing regular expressions and matching them directly against the text, or in other ways.
In one implementation of extracting candidate words, regular expressions are matched directly against the text from which time words are to be extracted. When the regular expressions are constructed, their literal strings may cover time words in many forms of expression. Examples include: time words that express time in the traditional Chinese calendar system, such as "丁丑" (a sexagenary-cycle designation), "午时" (the wu hour), and "二更" (the second night watch); time words that express time by solar term, such as "Great Cold", "Spring Equinox", and "Summer Solstice"; time words that express time by festival, such as "National Day" and "International Labour Day"; time words that denote an era or dynasty, such as "Tang Dynasty", "Shang and Zhou", "remote antiquity", and "millennium"; time words that denote a fixed interval, such as "every year" and "every day"; and time words that denote a vague period, such as "a moment", "recent years", and "decades".
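The direct-matching implementation described above can be sketched as follows. This is only an illustrative sketch: the pattern table `CATEGORY_PATTERNS`, its category names, and the specific Chinese strings in it are assumptions chosen to mirror the categories listed in the text, not rules given by the application.

```python
import re

# Hypothetical pattern fragments, one per category of time word.
# The concrete strings are illustrative assumptions, not an exhaustive rule set.
CATEGORY_PATTERNS = {
    "traditional_hour": r"[子丑寅卯辰巳午未申酉戌亥]时",
    "night_watch": r"[一二三四五]更",
    "solar_term": r"大寒|春分|夏至",
    "festival": r"国庆节|劳动节",
    "dynasty": r"唐朝|商周",
    "vague_period": r"片刻|近年|数十年",
}

# One combined expression; each category becomes a named group.
CANDIDATE_RE = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in CATEGORY_PATTERNS.items())
)


def extract_candidates(text):
    """Return (candidate, category, start, end) for every match, in text order."""
    return [
        (m.group(0), m.lastgroup, m.start(), m.end())
        for m in CANDIDATE_RE.finditer(text)
    ]


if __name__ == "__main__":
    for item in extract_candidates("春分那天申时,他等了片刻。"):
        print(item)
```

Keeping one pattern fragment per category makes it easy to add forms of expression without rewriting a single large expression; the matches are still only candidates and remain subject to the later disambiguation steps.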
In another implementation of extracting candidate words, referring to Fig. 2, the step of extracting all candidate words in the text may specifically include:
S201: Extracting prime words from the text;
S202: Determining, for each prime word, its corresponding matching area in the text, the matching area comprising the prime word and a predetermined number of characters before and after the prime word;
S203: Generating candidate words, each candidate word being a word in the matching area that contains the prime word and has at least one meaning that denotes time.
In step S201, prime words may be extracted from the text by regular-expression matching. A prime word here is a character that can denote time together with the characters before and after it, such as "更" (night watch), "时" (hour), "鼓" (drum), "刻" (quarter-hour), "年" (year), "月" (month), or "日" (day).
For example, text 1 from which time words are to be extracted reads: "At the shen hour (申时), the governor's mansion was hung with lanterns and streamers and was very lively. As soon as the xu hour (戌时) came, the crowd dispersed at once (即时三刻), leaving only colored lanterns, with a faint trace of smoke still about them, struggling in the wind." Prime word 1 "时", prime word 2 "时", and prime word 3 "刻" are then extracted from text 1.
In step S202, a prime word together with a predetermined number of characters before and after it constitutes the matching area corresponding to that prime word in the text; each prime word has its own matching area in the text.
Continuing the example in S201, suppose it is preset that the 2 characters before "时", the 1 character after it, and the prime word "时" itself form the matching area corresponding to the prime word "时", and that the 3 characters before "刻" and the prime word "刻" itself form the matching area corresponding to the prime word "刻".
Then, in text 1, the matching areas corresponding to the prime words are as marked by square brackets below:
"[At the shen hour (申时),] the governor's mansion was hung with lanterns and streamers and was very lively. As soon as [the xu hour (戌时) came,] the crowd dispersed [at once (即时三刻)], leaving only colored lanterns, with a faint trace of smoke still about them, struggling in the wind."
The bracketed spans are, in order, matching area 1, matching area 2, and matching area 3.
In step S203, a candidate word is a word within the matching area corresponding to a prime word; it contains the prime word and has at least one meaning that denotes time. Candidate words may be generated by searching the matching area for a second preset character string corresponding to the prime word, or by performing semantic analysis on the text in the matching area.
For example, continuing the example in S202, the second preset character strings corresponding to the prime word "时" are preset to include the twelve earthly branches: "子", "丑", "寅", "卯", "辰", "巳", "午", "未", "申", "酉", "戌", and "亥". If the character immediately before the prime word "时" is any one of these second preset character strings, a candidate word of the form "second preset character string + 时" is generated in the matching area. The second preset character strings corresponding to the prime word "刻" are preset to include "一", "二", "三", "1", "2", "3", and so on. If the character immediately before the prime word "刻" is any one of the second preset character strings corresponding to "刻", a candidate word of the form "second preset character string + 刻" is generated in the matching area.
Thus, a search in matching area 1 finds that the character before prime word 1 "时" is "申", so candidate word 1 "申时" (the shen hour) is generated. A search in matching area 2 finds that the character before prime word 2 "时" is "戌", so candidate word 2 "戌时" (the xu hour) is generated. A search in matching area 3 finds that the character before prime word 3 "刻" is "三", so candidate word 3 "三刻" (three quarters) is generated.
As another example, still continuing the example in S202, word segmentation and semantic analysis of the text in matching area 1 find that "申时" (the shen hour) has a meaning that denotes time, so candidate word 1 "申时" is generated. Word segmentation and semantic analysis of the text in matching area 2 find that "戌时" (the xu hour) has a meaning that denotes time, so candidate word 2 "戌时" is generated. For the text "即时三刻" in matching area 3, word segmentation and semantic analysis find that "即时三刻" means "at once" or "right away" and does not denote a specific point in time, so no candidate word is generated from matching area 3.
It should be noted that the candidate words generated may differ depending on the specific method of word segmentation and semantic analysis applied to the matching area. For example, in the preceding example, segmenting the text "即时三刻" in matching area 3 may yield three segments in total: "即时", "三刻", and "即时三刻". Semantic analysis is then performed on each of the three segments, and "三刻" (three quarters) is found to have a meaning that denotes a specific point in time, so candidate word 3 "三刻" can be generated for matching area 3.
In the above method of extracting candidate words, the prime words in the text are extracted first. Because the extraction rules for prime words are simpler than rules that extract candidate words directly or rules that extract the correct time words directly, as many prime words as possible can be extracted from the text. The matching area corresponding to each prime word is then determined, and candidate words are generated from the matching area. This simplifies the rules for extracting candidate words from the text and further reduces omissions caused by complicated extraction rules.
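Steps S201 to S203 can be sketched as below, using the variant that searches for a second preset character string immediately before the prime word. The table `PRIME_WORDS`, the function names, and the sample sentence are hypothetical and introduced only for illustration; the window sizes follow the worked example (2 characters before and 1 after "时", 3 characters before "刻").

```python
# Hypothetical configuration: for each prime word, the size of its matching
# area and the preceding characters (second preset character strings) that
# turn it into a candidate word.
PRIME_WORDS = {
    "时": {"before": 2, "after": 1, "second_preset": set("子丑寅卯辰巳午未申酉戌亥")},
    "刻": {"before": 3, "after": 0, "second_preset": set("一二三123")},
}


def matching_area(text, pos, before, after):
    """Return the [start, end) bounds of the matching area around text[pos]."""
    start = max(0, pos - before)
    end = min(len(text), pos + 1 + after)
    return start, end


def generate_candidates(text):
    """Return (candidate, start, end) for each prime word that yields a candidate."""
    candidates = []
    for pos, ch in enumerate(text):
        cfg = PRIME_WORDS.get(ch)          # S201: is this character a prime word?
        if cfg is None:
            continue
        start, _ = matching_area(text, pos, cfg["before"], cfg["after"])  # S202
        prev = pos - 1
        # S203: the preceding character must lie inside the matching area
        # and be one of the second preset character strings.
        if prev >= start and text[prev] in cfg["second_preset"]:
            candidates.append((text[prev:pos + 1], prev, pos + 1))
    return candidates


if __name__ == "__main__":
    print(generate_candidates("一到戌时,人群三刻便散。有时也来。"))
```

In the sample sentence, the "时" in "有时" produces no candidate because "有" is not a second preset character string, which shows how a very simple prime-word rule can still be narrowed within its matching area.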
Each candidate word extracted in step S200 has at least one meaning that denotes time; that is, a candidate word may or may not denote time in the text, so ambiguity exists. For example, when the character before the candidate word "三号" ("No. 3") in the text is "男" (male), "三号" denotes the number of a particular person in a series of persons or things and does not denote time. As another example, when the character after the candidate word "7.6" in the text is "元" (yuan), "克" (gram), "米" (meter), or the like, "7.6" denotes a quantity and does not denote time. Therefore, after step S200, steps S300 and S400 determine whether each candidate word is a time word that denotes time, resolving the ambiguity so that the time words in the text are extracted accurately.
S300: Determining, for each candidate word, its corresponding semantic region in the text, the semantic region comprising the candidate word and a predetermined number of characters before and after the candidate word.
In step S300, each candidate word corresponds to at least one semantic region in the text.
For example, text 2 from which time words are to be extracted reads: "She was born on 一八九三年八月十五日 (August 15, 1893, written in Chinese numerals), 1893年8月15日 (August 15, 1893); at the time of her 80th birthday (寿辰时), Mr. and Mrs. Zhou held a banquet to celebrate her birthday. At the shen hour (申时), the guests began to arrive at the Zhou residence one after another." The candidate words extracted from text 2 are: candidate word 1 "一八九三年八月十五日", candidate word 2 "1893年8月15日", candidate word 3 "辰时" (the chen hour), and candidate word 4 "申时" (the shen hour).
It is assumed that preceding 1 character and candidate word " period of the day from 7 a. m. to 9 a.m. " of default candidate word " period of the day from 7 a. m. to 9 a.m. " in the text, composition and candidate word
Semantic region corresponding to " period of the day from 7 a. m. to 9 a.m. ";Preceding 1 character and candidate word " period of the day from 3 p.m. to 5 p.m. " of default candidate word " period of the day from 3 p.m. to 5 p.m. " in the text, are formed
Semantic region corresponding with candidate word " period of the day from 3 p.m. to 5 p.m. ";The semantic region of the default candidate word of " X X days month X " form in the text is
4 characters before character " year " start to character " day ".Then determine that each candidate word is divided in the text 2 of time word to be extracted
Not corresponding semantic region is as follows:
"She was born on [一八九三年八月十五日], [1893年8月15日]; at the time of her 80th birthday [寿辰时], Mr. and Mrs. Zhou held a banquet to celebrate her birthday[。申时], the guests began to arrive at the Zhou residence one after another."
The bracketed spans are, in order, semantic region 1, semantic region 2, semantic region 3, and semantic region 4.
S400: If the semantic region does not contain the first preset character string corresponding to the candidate word, determining that the candidate word is a time word, and outputting the time word.
In step S400, the first preset character string is a character string such that, when the candidate word and the string fall within the same semantic region, the candidate word does not denote time. Different candidate words may correspond to different first preset character strings. When the first preset character string corresponding to a candidate word is empty, the candidate word has a unique meaning, which denotes time, and there is no ambiguity.
The first preset character string corresponding to each candidate word may be stored in a corpus in advance. The first preset character strings in the corpus may be accumulated from past experience or generated in other ways.
For example, in one implementation of generating the first preset character string corresponding to a candidate word, a predetermined number of candidate sentences containing the candidate word are first chosen. Selected sentences, in which the candidate word does not denote time, are then filtered out of the candidate sentences. Finally, the first preset character string is extracted from the selected sentences, wherein the first preset character string appears only in the selected sentences and does not appear in the other candidate sentences.
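One way to read this generation procedure is sketched below. Several points are assumptions rather than part of the application: the candidate sentences are taken to be labeled in advance with whether the candidate word denotes time, the candidate strings are taken to be the `window` characters immediately before the candidate word, and the function name and sample sentences are hypothetical.

```python
def derive_first_preset_strings(candidate, labeled_sentences, window=1):
    """Derive first preset character strings for one candidate word.

    labeled_sentences: iterable of (sentence, denotes_time) pairs, where
    denotes_time says whether the candidate word denotes time in that
    sentence (assumed to be labeled beforehand).
    """
    # Selected sentences: the candidate word does not denote time.
    selected = [s for s, denotes_time in labeled_sentences if not denotes_time]
    others = [s for s, denotes_time in labeled_sentences if denotes_time]

    # Collect the characters immediately preceding each occurrence.
    contexts = set()
    for sentence in selected:
        idx = sentence.find(candidate)
        while idx != -1:
            ctx = sentence[max(0, idx - window):idx]
            if ctx:
                contexts.add(ctx)
            idx = sentence.find(candidate, idx + 1)

    # Keep only strings that never appear in the other candidate sentences.
    return {c for c in contexts if not any(c in o for o in others)}


if __name__ == "__main__":
    sentences = [
        ("在她八十寿辰时设宴", False),
        ("他诞辰时天降大雪", False),
        ("辰时出发", True),
        ("到了辰时才起", True),
    ]
    print(derive_first_preset_strings("辰时", sentences))
```

The final filter enforces the stated condition that a first preset character string occurs only in the selected sentences; a context string that also occurs in a sentence where the candidate word does denote time is discarded.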
The first preset character string is compared with the text in the semantic region corresponding to the candidate word. If the semantic region does not contain the first preset character string corresponding to the candidate word, the candidate word is a time word in the text; that is, the candidate word denotes time in the text. If the semantic region contains the first preset character string corresponding to the candidate word, the candidate word is considered not to denote time in the text and is therefore not a time word.
For example, the example in the step of continuing to use S300, presetting the first preset characters string corresponding with candidate word " period of the day from 7 a. m. to 9 a.m. " is
Any one in " longevity ", " birth ";It is " drawing " to preset the first preset characters string corresponding with candidate word " period of the day from 3 p.m. to 5 p.m. ";Default and " X
X days month X " form candidate word corresponding to the first preset characters string for sky.
The first preset characters string corresponding with the candidate word of " X X days month X " form is sky, so candidate word 1 can be determined
" on August 15th, 1 ", candidate word 2 " on August 15th, 1893 " are time words.
The first preset character string corresponding to the candidate word "辰时" is either of "寿" and "诞". Comparison shows that semantic region 3 contains the first preset character string "寿" corresponding to candidate word 3 "辰时". Therefore, in text 2, candidate word 3 "辰时" is not a time word.
The first preset character string corresponding to the candidate word "申时" is "引". Comparison shows that semantic region 4 does not contain the first preset character string "引" corresponding to candidate word 4 "申时". Therefore, in text 2, candidate word 4 "申时" is a time word.
Finally, the time words "一八九三年八月十五日", "1893年8月15日", and "申时" are output.
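The decision in steps S300 and S400 can be sketched as a simple substring check. The table `FIRST_PRESET`, the function names, and the sample sentence are hypothetical; the sketch assumes a semantic region of 1 character before the candidate word and none after, as preset in the example for the two traditional hours.

```python
# Hypothetical table of first preset character strings per candidate word.
# A missing entry means the string is empty, i.e. the candidate is unambiguous.
FIRST_PRESET = {
    "辰时": ("寿", "诞"),
    "申时": ("引",),
}


def semantic_region(text, start, end, before=1, after=0):
    """Return the candidate word plus `before`/`after` surrounding characters."""
    return text[max(0, start - before):min(len(text), end + after)]


def is_time_word(text, word, start, end):
    """S400: a candidate is a time word unless its region contains a first preset string."""
    region = semantic_region(text, start, end)
    return not any(s in region for s in FIRST_PRESET.get(word, ()))


def filter_time_words(text, candidates):
    """Keep only the (word, start, end) candidates judged to be time words."""
    return [c for c in candidates if is_time_word(text, *c)]


if __name__ == "__main__":
    text = "在她八十寿辰时,周氏夫妇设宴。申时,宾客到达。"
    i, j = text.index("辰时"), text.index("申时")
    print(filter_time_words(text, [("辰时", i, i + 2), ("申时", j, j + 2)]))
```

Because the check is a plain containment test on a short region, adding a new ambiguous context only requires adding a string to the table rather than changing any extraction rule.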
In the method for extracting time words in the above technical solution, a text from which time words are to be extracted is first obtained, and all candidate words are extracted from the text. Each candidate word has at least one meaning that denotes time; that is, a candidate word may or may not be a time word that denotes time in the text. The semantic region corresponding to each candidate word in the text is then determined, and it is judged whether the semantic region contains the first preset character string corresponding to the candidate word. This determines whether the candidate word is a time word in the text and eliminates ambiguity. Finally, the time words are output, completing the extraction of time words from the text.
The method of this embodiment does not extract the correct time words from the text in a single pass. Instead, it first extracts candidate words, then determines the semantic region of each candidate word, and then uses the semantic region and the first preset character string to judge whether the candidate word is a time word in the text, so that time words are extracted from the text accurately. On the one hand, this simplifies the extraction rules and enlarges the set of extracted candidate words, avoiding the situation in which overly complicated extraction rules cause many time words to be missed. On the other hand, disambiguating the candidate words allows the time words in the text to be extracted more accurately, which makes the method particularly suitable for Chinese text, where time words take diverse forms. Applied to Chinese text, the method yields more comprehensive coverage and more diverse forms of extracted time words, while greatly reducing omissions.
Optionally, referring to Fig. 3, in step S400 the step of outputting the time word may include:
S411: Judging whether the time word contains a numeral; if it does, performing S412; if it does not, performing S415.
S412: Judging whether the time word is of a preset exclusion type; if it is not, performing S413; if it is, performing S415.
S413: Converting the time word into a preset format.
S414: Outputting the converted time word, and ending.
S415: Outputting the time word, and ending.
In step S411, the numeral contained in the time word may be a numeral written in Chinese characters, an Arabic numeral, or the like; the present application does not limit this. For example, the time word "一八九三年八月十五日" contains numerals written in Chinese characters, while the time word "1893.08.15" contains Arabic numerals.
Steps S411 to S414 serve mainly to unify the format of time words that contain numerals. However, for approximate time words such as "二三年" (two or three years), "几十年" (decades), and "一两天" (a day or two), although they contain Chinese numerals, converting those numerals to Arabic numerals would change their original meaning. For example, converting the Chinese numerals in "二三年" to Arabic numerals gives "23年" (23 years), and the two have different meanings. Time words whose meaning would change under such format conversion are therefore treated as the preset exclusion type, and step S412 excludes them from the time words that contain numerals.
In the step of S413, here presetting at form can be set according to demand by user.For example, can be by all years
The moon is all collectively expressed as the form of " XXXX XX days month XX " day.In another example the time point of time-division can be collectively expressed as
“XX:XX ", as shown in the example in form 1.
Table 1
Time word before conversion | Time word after conversion
August 2017, "No. 1" (2017年8月一号) | August 1, 2017 (2017年8月1日)
1998/7/20 | July 20, 1998 (1998年7月20日)
From June 5, 2007 until October 9, 2008 | June 5, 2007 - October 9, 2008
3:20, written out in words (3点20分) | 3:20
A quarter past 3 (3点一刻) | 3:15
3 o'clock to 5 o'clock (3点到5点) | 3:00-5:00
In this step, time words that contain numerals in ancient Chinese texts, such as "一更" (the first night watch) and "二鼓" (the second drum), may also be converted into a unified preset format, as shown in the examples in Table 2.
Table 2
Through the above steps, the extracted time words that contain numerals can be converted into a unified preset format and then output, so that the output time words have a more standardized format and are easier to use later.
Optionally, in addition, when the time word does not contain a numeral, it may also be judged whether the time word is a preset ancient time word, such as the traditional hours "子时" (the zi hour) and "申时" (the shen hour). If it is a preset ancient time word, the time word is converted into the corresponding preset format; if it is not, the time word is output directly without format conversion. For example, traditional hours such as "子时" (the zi hour) and "丑时" (the chou hour) may be converted into a unified preset format, as shown in the examples in Table 3.
Table 3
Time word before conversion | Time word after conversion
Zi hour (子时) | 0:00-2:00
Chen hour (辰时) | 8:00-10:00
Shen hour (申时) | 16:00-18:00
Xu hour (戌时) | 20:00-22:00
Optionally, referring to Fig. 4, in step S400 the step of outputting the time word may include:
S421: Determining the start and end positions of each time word in the text;
S422: Merging time words whose start and end positions overlap or are adjacent;
S423: Outputting the merged time words.
In step S421, the start and end positions of a time word in the text comprise a start position and an end position. They may be determined by character order. For example, in text 2 of step S300, the start position of the time word "一八九三年八月十五日" is the 5th character and its end position is the 14th character; the start position of the time word "申时" (the shen hour) is the 47th character and its end position is the 48th character. Besides character order, the position of a time word may also be recorded in other ways, for example as positions on an X axis and a Y axis.
The text 2 of time word to be extracted:
In the course of extracting candidate words or prime words, some candidate words or prime words may match several extraction rules, so among the time words finally extracted, some may overlap or be adjacent.
In step S422, time words whose start and end positions overlap or are adjacent fall into three cases.
In the first case, the previous time word and the next time word partially overlap. For example, from the text "8 o'clock in the morning of September 1, 2015", time word 1 "the morning of September 1, 2015" and time word 2 "8 o'clock in the morning" are extracted. From the start and end positions of time word 1 and time word 2 it can be determined that the two overlap at "morning". Time words overlapping in this way can be merged into "8 o'clock in the morning of September 1, 2015".
In the second case, the previous time word contains the next time word, or the next time word contains the previous time word, so the two overlap. For example, from the text "8 o'clock in the morning of September 1, 2015", time word 1 "the morning of September 1, 2015" and time word 2 "morning" are extracted. From the start and end positions of time word 1 and time word 2 it can be determined that time word 1 contains time word 2. Time words overlapping in this way can be merged into "the morning of September 1, 2015".
The third situation, previous time word are adjacent with next time word.For example, in the text " morning on the 1st of September in 2015
Upper 8 points " in, time word 1 " on September 1st, 2015 " and time word 2 " 8 points of morning " are extracted, then passage time word 1 and time word 2
Start-stop position it was determined that time word 1 and time word 2 are adjacent in the text.It is adjacent for such start-stop position
Time word, it can be merged into " 8 points of the morning on the 1st of September in 2015 ".
Before judging whether the positions of two time words in the text overlap or are adjacent, all time words extracted from the text can first be sorted by their start-stop positions in the text. After sorting, it is only necessary to compare the start-stop position of the current time word with that of its next time word to determine whether the two overlap, are adjacent, or are separated. There is no need to compare a time word with every remaining time word one by one to determine whether it overlaps or is adjacent to any of them, which greatly reduces the amount of computation in the merging step. More specifically, referring to Fig. 5, step S422 of merging time words whose start-stop positions overlap or are adjacent may include:
S4221: Judge whether the start-stop position of the current time word overlaps or is adjacent to the start-stop position of the next time word;
S4222: If they overlap or are adjacent, replace the current time word and the next time word with the union of the two;
S4223: Determine the start-stop position of the updated time word in the text;
S4224: Repeat steps S4221-S4223 until the start-stop position of the updated time word neither overlaps nor is adjacent to the start-stop position of the time word that follows it, then take the updated time word as the merged time word.
If the judgment in S4221 finds that the start-stop position of the current time word neither overlaps nor is adjacent to the start-stop position of the next time word, the current time word is output.
With the above method, two or more adjacent or overlapping time words can be merged into a single output.
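The sort-then-compare merging of steps S4221-S4224 can be sketched as follows. This is only an illustration under stated assumptions: time words are represented purely by their 1-based, inclusive (start, end) positions, and the name `merge_spans` and the sample positions are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), 1-based and inclusive


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge spans that overlap or are adjacent, after sorting by position."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1] + 1:
            # Overlapping (start <= previous end) or adjacent
            # (start == previous end + 1): replace the current span
            # with the union of the two.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # Separated from the previous span: start a new merged span.
            merged.append((start, end))
    return merged


print(merge_spans([(1, 11), (10, 13)]))  # partial overlap
print(merge_spans([(1, 11), (10, 11)]))  # containment
print(merge_spans([(1, 9), (10, 13)]))   # adjacency
print(merge_spans([(1, 4), (31, 36)]))   # separated, left unchanged
```

Because the spans are sorted first, each span is compared only with the most recent merged span, so the pass is linear after sorting.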
The time word extraction method of the application is illustrated below by another specific embodiment.
Text 3, from which time words are to be extracted:
6.27 dusk 5 thirty, ran 10 kilometers through the city and back home to building No. 9, taking 44 minutes 33 seconds in total. Have kept up running for 1 year and gained a great deal. Before going to bed, read "The Strange Injustice of Nine Lives" by the Qing Dynasty novelist Wu Jianren, in which it is written: "I reckon that during the day the brothers are seldom at home at the same time; if we go rashly, we cannot seize them at one drum, and would that not be unfortunate!"
After text 3 is obtained, all candidate words in the text are extracted, 8 in total:
Candidate word 1: 6.27
Candidate word 2: dusk
Candidate word 3: 5 thirty
Candidate word 4: No. 9
Candidate word 5: 44 minutes 33 seconds
Candidate word 6: 1 year
Candidate word 7: Qing Dynasty
Candidate word 8: one drum.
It is preset that a candidate word of the "X.X" format in the text (where X is numeric and may stand for one or more digits, likewise below), together with the 1 character following its digits, forms the semantic region corresponding to "X.X"; the first preset character string corresponding to the candidate word "X.X" is any one of "gram", "jin", "meter" and "yuan".
It is preset that the candidate word "No. X" in the text, together with the 1 character following it, forms the semantic region corresponding to "No. X"; the first preset character string corresponding to the candidate word "No. X" is any one of "building", "room" and "shop".
It is preset that the candidate word "one drum" in the text, together with the 2 characters following it, forms the semantic region corresponding to "one drum"; the first preset character string corresponding to the candidate word "one drum" is either of "one plate" and "and catch".
It is preset that each of the candidate words "X thirty", "dusk", "X minutes X seconds", "X year" and "Qing Dynasty" in the text forms, by itself, the semantic region corresponding to that candidate word. The first preset character strings corresponding to "X thirty", "dusk", "X minutes X seconds", "X year" and "Qing Dynasty" are all empty.
The semantic region corresponding to each candidate word in text 3 is determined, as shown in Table 4.
Table 4
No. | Candidate word | Semantic region | First preset character string
1 | 6.27 | 6.27 plus the first character of "dusk" | gram, jin, meter, yuan
2 | dusk | dusk | (empty)
3 | 5 thirty | 5 thirty | (empty)
4 | No. 9 | No. 9 building | building, room, shop
5 | 44 minutes 33 seconds | 44 minutes 33 seconds | (empty)
6 | 1 year | 1 year | (empty)
7 | Qing Dynasty | Qing Dynasty | (empty)
8 | one drum | one drum and catch | one plate, and catch
For each candidate word, judge whether its semantic region contains the corresponding first preset character string. If it does not, the candidate word is a time word in text 3; if it does, the candidate word is not a time word in text 3. Candidate word 4 "No. 9" is rejected because its semantic region contains "building", and candidate word 8 "one drum" is rejected because its semantic region contains "and catch". Thus 6 of the above 8 candidate words are determined to be time words, as follows:
Time word 1: 6.27
Time word 2: dusk
Time word 3: 5 thirty
Time word 4: 44 minutes 33 seconds
Time word 5: 1 year
Time word 6: Qing Dynasty.
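The semantic-region check described above can be sketched in Python as follows. This is a minimal illustration, not the application's implementation: the function name `is_time_word`, its parameters, and the sample text with the "$" exclusion string are all assumptions made for the example.

```python
from typing import Sequence


def is_time_word(text: str, start: int, end: int,
                 chars_before: int, chars_after: int,
                 excluded: Sequence[str]) -> bool:
    """Decide whether the candidate word text[start:end] is a time word.

    `start` and `end` are 0-based, end-exclusive indices. The semantic region
    is the candidate word plus `chars_before` characters before it and
    `chars_after` characters after it. The candidate is rejected if the region
    contains any string in `excluded` (the first preset character strings).
    An empty `excluded` sequence means the candidate is always accepted.
    """
    region = text[max(0, start - chars_before):end + chars_after]
    return not any(s in region for s in excluded)


# Hypothetical sample: the first "6.27" is an amount, the second is a date.
sample = "paid 6.27$ on 6.27"
first = sample.find("6.27")
second = sample.find("6.27", first + 1)
print(is_time_word(sample, first, first + 4, 0, 1, ["$"]))    # region "6.27$"
print(is_time_word(sample, second, second + 4, 0, 1, ["$"]))  # region "6.27"
```

Candidates whose first preset character string is empty should be passed an empty sequence rather than a sequence containing an empty string, since an empty string is contained in every region.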
It is preset that time words of the "X thirty" format are converted into the preset format "X:30"; time words of the "X.X" format are converted into the preset format "month X, day X"; time words of the "X minutes X seconds" format keep their original form without conversion. The preset exclusion types are "XY", "XY months", "XY days", "XY days", where X and Y are numerals, X is smaller than Y by 1, and the numerals here are Chinese numerals.
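One way to test the "X is smaller than Y by 1" condition on Chinese numerals is sketched below. The exact patterns are an assumption: the original-language suffixes are not given in this text, so the suffix characters 月, 日 and 天 (month, day, day) and the function name `is_excluded_type` are hypothetical choices for illustration only.

```python
import re

# Chinese numerals one to ten, in ascending order.
CN_DIGITS = "一二三四五六七八九十"

# Two Chinese numerals followed by an optional suffix (assumed suffixes).
PATTERN = re.compile(r"([一二三四五六七八九十])([一二三四五六七八九十])(月|日|天)?")


def is_excluded_type(word: str) -> bool:
    """Return True if `word` is two consecutive Chinese numerals (X, then X+1),
    optionally followed by one of the assumed suffixes."""
    match = PATTERN.fullmatch(word)
    if match is None:
        return False
    first, second = match.group(1), match.group(2)
    return CN_DIGITS.index(second) - CN_DIGITS.index(first) == 1


print(is_excluded_type("三四天"))  # numerals 3 and 4 are consecutive
print(is_excluded_type("三五天"))  # numerals 3 and 5 are not consecutive
```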
Judge whether each of the above 6 time words contains a numeral. 4 of them do: time word 1 "6.27", time word 3 "5 thirty", time word 4 "44 minutes 33 seconds" and time word 5 "1 year".
For the 4 time words that contain numerals, judge whether each is of a preset exclusion type. Time word 1 "6.27", time word 3 "5 thirty" and time word 4 "44 minutes 33 seconds" are not of a preset exclusion type, whereas time word 5 "1 year" is of a preset exclusion type.
The 3 time words that contain numerals and are not of a preset exclusion type are converted into the preset format, as shown in Table 5.
Table 5
Time word | Before conversion | After conversion
Time word 1 | 6.27 | June 27
Time word 3 | 5 thirty | 5:30
Time word 4 | 44 minutes 33 seconds | 44 minutes 33 seconds
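The conversions in Table 5 can be sketched with regular expressions. This is a hedged illustration that works on the English renderings used in this description ("6.27", "5 thirty"); the function name `to_preset_format` and the month-name output are assumptions, and a real implementation would match the original-language forms instead.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def to_preset_format(word: str) -> str:
    """Convert a time word to the preset format where a rule applies;
    otherwise return it unchanged."""
    # "X.X" format: treat the first number as the month, the second as the day.
    match = re.fullmatch(r"(\d{1,2})\.(\d{1,2})", word)
    if match and 1 <= int(match.group(1)) <= 12:
        return f"{MONTHS[int(match.group(1)) - 1]} {int(match.group(2))}"
    # "X thirty" format: half past X becomes "X:30".
    match = re.fullmatch(r"(\d{1,2}) thirty", word)
    if match:
        return f"{match.group(1)}:30"
    # No rule applies (for example "X minutes X seconds"): keep the original.
    return word


for time_word in ["6.27", "5 thirty", "44 minutes 33 seconds"]:
    print(time_word, "->", to_preset_format(time_word))
```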
Time words that do not contain numerals, and time words that are of a preset exclusion type, remain unchanged. At this point, the time words extracted from text 3 are:
Time word 1: June 27
Time word 2: dusk
Time word 3: 5:30
Time word 4: 44 minutes 33 seconds
Time word 5: 1 year
Time word 6: Qing Dynasty.
The above 6 time words can be output directly, or can be output after going through the merging step.
Determine the start-stop position of each of the 6 time words in text 3. If format conversion has already been performed, the position in text 3 of the time word before conversion is still used as its start-stop position. The results are shown in Table 6.
Table 6
No. | Time word | Start position | End position
1 | 6.27 | 1 | 4
2 | dusk | 5 | 6
3 | 5 thirty | 7 | 9
4 | 44 minutes 33 seconds | 31 | 36
5 | 1 year | 45 | 47
6 | Qing Dynasty | 50 | 51
Judge whether the current time word "6.27" and the next time word "dusk" overlap or are adjacent. The two are adjacent, so their union "6.27 dusk" replaces both "6.27" and "dusk", that is, "6.27 dusk" becomes the new current time word. The start-stop position of the updated time word "6.27 dusk" in text 3 is then determined, as shown in Table 7.
Table 7
No. | Time word | Start position | End position
1 | 6.27 dusk | 1 | 6
2 | 5 thirty | 7 | 9
3 | 44 minutes 33 seconds | 31 | 36
4 | 1 year | 45 | 47
5 | Qing Dynasty | 50 | 51
The judgment is now repeated: do the current time word "6.27 dusk" and the next time word "5 thirty" overlap or are they adjacent? The two are adjacent, so their union "6.27 dusk 5 thirty" replaces both "6.27 dusk" and "5 thirty", that is, "6.27 dusk 5 thirty" becomes the new current time word. The start-stop position of the updated time word "6.27 dusk 5 thirty" in text 3 is determined again, as shown in Table 8.
Table 8
No. | Time word | Start position | End position
1 | 6.27 dusk 5 thirty | 1 | 9
2 | 44 minutes 33 seconds | 31 | 36
3 | 1 year | 45 | 47
4 | Qing Dynasty | 50 | 51
The judgment is repeated once more: do the current time word "6.27 dusk 5 thirty" and the next time word "44 minutes 33 seconds" overlap or are they adjacent? The two neither overlap nor are adjacent, so the current time word "6.27 dusk 5 thirty" is output. The parts that have already been converted into the preset format are still output in the preset format, that is, the output is "June 27 dusk 5:30".
Next, "44 minutes 33 seconds" is taken as the current time word, and it is judged whether it overlaps or is adjacent to the next time word "1 year". The two neither overlap nor are adjacent, so "44 minutes 33 seconds" is output directly.
Similarly, "1 year" is then taken as the current time word, and it is judged whether it overlaps or is adjacent to the next time word "Qing Dynasty". The two neither overlap nor are adjacent, so "1 year" is output directly.
"Qing Dynasty" is the last time word in text 3 and has no next time word, so "Qing Dynasty" is output directly.
The time words finally extracted from text 3 and output are therefore: "June 27 dusk 5:30", "44 minutes 33 seconds", "1 year" and "Qing Dynasty".
Referring to Fig. 6, a second embodiment of the application provides a time word extraction device, including:
Acquiring unit 1, configured to obtain the text from which time words are to be extracted;
Processing unit 2, configured to extract all candidate words in the text, determine the semantic region corresponding to each candidate word in the text, and determine that a candidate word is a time word when its semantic region does not contain the first preset character string corresponding to that candidate word; wherein each candidate word has at least one meaning that characterizes time, and the semantic region includes the candidate word and a predetermined number of characters before and after the candidate word;
Output unit 3, configured to output the time word.
Optionally, processing unit 2 is further configured to extract prime words from the text, determine the matching area corresponding to each prime word in the text, and generate candidate words; wherein the matching area includes the prime word and a predetermined number of characters before and after the prime word, and a candidate word is a word in the matching area that contains the prime word and has at least one meaning that characterizes time.
Optionally, processing unit 2 is further configured to judge, when a time word contains a numeral, whether the time word is of a preset exclusion type, and if it is not, to convert the time word into the preset format; the output unit is further configured to output the time word after format conversion.
Optionally, processing unit 2 is further configured to determine the start-stop position of each time word in the text and to merge time words whose start-stop positions overlap or are adjacent; the output unit is further configured to output the merged time words.
Optionally, processing unit 2 is further configured to judge whether the start-stop position of the current time word overlaps or is adjacent to the start-stop position of the next time word; to replace the current time word and the next time word with their union when they overlap or are adjacent; to determine the start-stop position of the updated time word in the text; and, when the start-stop position of the updated time word neither overlaps nor is adjacent to the start-stop position of the time word that follows it, to take the updated time word as the merged time word.
Identical or similar parts of the embodiments in this specification may be referred to one another. The embodiments of the invention described above do not limit the scope of protection of the invention.