CN107894978A

CN107894978A - Method and device for extracting time words

Info

Publication number: CN107894978A
Application number: CN201711123985.7A
Authority: CN
Inventors: 任宁; 张建军
Original assignee: Beijing Shenzhou Taiyue Software Co Ltd
Current assignee: China Science and Technology (Beijing) Co., Ltd.
Priority date: 2017-11-14
Filing date: 2017-11-14
Publication date: 2018-04-10
Anticipated expiration: 2037-11-14
Also published as: CN107894978B

Abstract

An embodiment of the present invention discloses a method and device for extracting time words. The method comprises the following steps: obtaining a text from which time words are to be extracted; extracting all candidate words in the text, each candidate word having at least one meaning that denotes time; determining the semantic region corresponding to each candidate word in the text, the semantic region comprising the candidate word and a predetermined number of characters before and after it; and, if the semantic region does not contain a first preset character string corresponding to the candidate word, determining that the candidate word is a time word and outputting it. On the one hand, this technical solution simplifies the extraction rules and enlarges the set of candidate words extracted, avoiding the situation in which overly complex extraction rules cause a large number of time words to be missed. On the other hand, disambiguating the candidate words allows the time words in a text to be extracted relatively accurately, which makes the method particularly suitable for Chinese text, where time words take many forms.

Description

Method and device for extracting time words

Technical field

The present invention relates to the technical field of information extraction and processing, and in particular to a method for extracting time words. The invention further relates to a device for extracting time words.

Background art

Information extraction refers to techniques for extracting information points from natural-language text. It aims to give people better tools for acquiring information and so cope with the serious challenge posed by the information explosion. Temporal information is an important component of natural language and an indispensable element in fully understanding natural-language semantics. One important task of information extraction is therefore to extract from text the time words that denote temporal information.

The conventional way to extract time words from text is to build extraction rules and match them against the text. For example, time words such as "December 12, 1999", "half past eight", and "Monday" are extracted in this way.

Analysis shows, however, that in Chinese text, and especially in ancient Chinese text, time words take many forms beyond the conventional ones such as year-month-day and hour-minute-second. To extract the correct time words from such text, complex extraction rules must be built, and complex extraction rules are likely to cause a large number of time words to be missed.

Summary of the invention

To solve the above technical problem, the application proposes a method for extracting time words that addresses the problem that time-word extraction rules are complex and prone to a large number of omissions.

In a first aspect, a method for extracting time words is provided, comprising the following steps:

obtaining a text from which time words are to be extracted;

extracting all candidate words in the text, each candidate word having at least one meaning that denotes time;

determining the semantic region corresponding to each candidate word in the text, the semantic region comprising the candidate word and a predetermined number of characters before and after the candidate word;

if the semantic region does not contain a first preset character string corresponding to the candidate word, determining that the candidate word is a time word, and outputting the time word.

With reference to the first aspect, in a first possible implementation of the first aspect, the step of extracting all candidate words in the text comprises:

extracting prime words from the text;

determining the matching area corresponding to each prime word in the text, the matching area comprising the prime word and a predetermined number of characters before and after the prime word;

generating candidate words, each candidate word being a word in the matching area that contains the prime word and has at least one meaning that denotes time.

With reference to the first implementation of the first aspect, in a second possible implementation of the first aspect, the step of outputting the time word comprises:

if the time word contains a numeral, judging whether the time word is of a preset exclusion type;

if it is not of the preset exclusion type, converting the time word into a preset format;

outputting the time word after format conversion.

With reference to the first aspect and the above possible implementations, in a third possible implementation of the first aspect, the step of outputting the time word comprises:

determining the start and end positions of each time word in the text;

merging time words whose start and end positions overlap or are adjacent;

outputting the merged time words.

With reference to the first aspect and the above possible implementations, in a fourth possible implementation of the first aspect, the step of merging time words whose start and end positions overlap or are adjacent comprises:

judging whether the start and end positions of the current time word overlap with or are adjacent to those of the next time word;

if they overlap or are adjacent, updating the current time word and the next time word to the union of the current time word and the next time word;

determining the start and end positions of the updated time word in the text;

if the start and end positions of the updated time word neither overlap with nor are adjacent to those of the next time word after it, taking the updated time word as the merged time word.
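
The merging steps above can be sketched as a single pass over the time words sorted by position. This is a minimal illustration and not the implementation described by the application: the function name `merge_time_words`, the representation of a time word as a `(start, end)` pair of character offsets with an exclusive end, and the sample sentence are all assumptions made for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets; end is exclusive (assumed convention)


def merge_time_words(text: str, spans: List[Span]) -> List[str]:
    """Merge time words whose spans overlap or touch, then return their text."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent: update the current word to the union of both.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # Neither overlapping nor adjacent: the previous word is final.
            merged.append((start, end))
    return [text[s:e] for s, e in merged]


# Hypothetical input: four time words found separately in one sentence.
text = "2017年8月1日下午开会，9月再议"
spans = [(0, 7), (5, 9), (9, 11), (14, 16)]  # "2017年8月", "8月1日", "下午", "9月"
print(merge_time_words(text, spans))  # ['2017年8月1日下午', '9月']
```

Because the updated word is compared again with the word that follows it, a chain of three or more overlapping or adjacent words collapses into one merged time word.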

In a second aspect, a device for extracting time words is provided, comprising:

an acquiring unit, configured to obtain a text from which time words are to be extracted;

a processing unit, configured to extract all candidate words in the text, determine the semantic region corresponding to each candidate word in the text, and determine that a candidate word is a time word when its semantic region does not contain the first preset character string corresponding to the candidate word; wherein each candidate word has at least one meaning that denotes time, and the semantic region comprises the candidate word and a predetermined number of characters before and after the candidate word;

an output unit, configured to output the time word.

With reference to the second aspect, in a first possible implementation of the second aspect, the processing unit is further configured to extract prime words from the text, determine the matching area corresponding to each prime word in the text, and generate candidate words; wherein the matching area comprises the prime word and a predetermined number of characters before and after the prime word, and each candidate word is a word in the matching area that contains the prime word and has at least one meaning that denotes time.

With reference to the first implementation of the second aspect, in a second possible implementation of the second aspect, the processing unit is further configured to judge, when a time word contains a numeral, whether the time word is of a preset exclusion type and, if it is not, to convert the time word into a preset format; the output unit is further configured to output the time word after format conversion.

With reference to the second aspect and the above possible implementations, in a third possible implementation of the second aspect, the processing unit is further configured to determine the start and end positions of each time word in the text and to merge time words whose start and end positions overlap or are adjacent; the output unit is further configured to output the merged time words.

With reference to the second aspect and the above possible implementations, in a fourth possible implementation of the second aspect, the processing unit is further configured to: judge whether the start and end positions of the current time word overlap with or are adjacent to those of the next time word; when they overlap or are adjacent, update the current time word and the next time word to the union of the two; determine the start and end positions of the updated time word in the text; and, when the start and end positions of the updated time word neither overlap with nor are adjacent to those of the next time word after it, take the updated time word as the merged time word.

In the method and device for extracting time words in the above technical solution, the text from which time words are to be extracted is obtained first, and all candidate words are extracted from it. Each candidate word has at least one meaning that denotes time; that is, a candidate word may or may not denote time in the text. The semantic region corresponding to each candidate word in the text is then determined, and it is judged whether the semantic region contains the first preset character string corresponding to the candidate word. This determines which candidate words are time words in the text and eliminates the ambiguity. Finally the time words are output, completing the process of extracting time words from the text.

The method does not extract the correct time words from the text in a single pass. Instead it first extracts candidate words, then determines the semantic region of each candidate word, and then uses the semantic region and the first preset character string to judge whether the candidate word is a time word in the text, so that the correct time words are extracted. On the one hand, this simplifies the extraction rules and enlarges the set of candidate words extracted, avoiding the situation in which overly complex extraction rules cause a large number of time words to be missed. On the other hand, disambiguating the candidate words allows the time words in the text to be extracted relatively accurately, which makes the method particularly suitable for Chinese text, where time words take many forms. Applied to Chinese text, the method yields time words with more comprehensive coverage and more varied forms, while greatly reducing false extractions.

Brief description of the drawings

To illustrate the technical solutions of the application more clearly, the drawings needed in the embodiments are briefly introduced below. A person of ordinary skill in the art can evidently derive other drawings from these drawings without creative effort.

Fig. 1 is a flowchart of an embodiment of the time-word extraction method of the application;

Fig. 2 is a flowchart of one implementation of step S200 in the embodiment of the time-word extraction method of the application;

Fig. 3 is a flowchart of a first implementation of the step of outputting time words in the embodiment of the time-word extraction method of the application;

Fig. 4 is a flowchart of a second implementation of the step of outputting time words in the embodiment of the time-word extraction method of the application;

Fig. 5 is a flowchart of step S422 in the second implementation of the step of outputting time words in the embodiment of the time-word extraction method of the application;

Fig. 6 is a structural diagram of an embodiment of the time-word extraction device of the application.

Detailed description of the embodiments

Embodiments of the application are described in detail below.

Referring to Fig. 1, a first embodiment of the application provides a method for extracting time words, comprising steps S100 to S400.

S100: Obtain a text from which time words are to be extracted.

In step S100, the text from which time words are to be extracted may take many forms, such as Chinese text in the modern vernacular or Chinese text in classical Chinese; the application does not limit this.

S200: Extract all candidate words in the text, each candidate word having at least one meaning that denotes time.

In step S200, each candidate word has at least one meaning that denotes time; that is, besides at least one meaning that denotes time, a candidate word may also have meanings that denote other things. For example, "No. 3" can denote a date, or it can denote a number within a series of people or things.

All candidate words in the text may be extracted by building regular expressions and matching them directly, or in other ways.

In one implementation of extracting candidate words, regular expressions are matched directly against the text from which time words are to be extracted. When the regular expressions are built, their character strings can cover time words in many forms of expression, for example: "dingchou" (丁丑), "wu hour" (午时, 11 a.m. to 1 p.m.), and "second watch" (二更), which express time in the traditional Chinese calendar; "Great Cold", "Spring Equinox", and "Summer Solstice", which express time by solar term; "National Day" and "International Labour Day", which express time by festival; "Tang dynasty", "Shang and Zhou", "remote antiquity", and "millennium", which denote eras or dynasties; "every year" and "day by day", which denote fixed intervals; and "a moment", "recent years", and "decades", which denote vague periods.
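
Direct regular-expression matching of this kind might look like the following sketch. The pattern fragments, the name `extract_candidates`, and the sample sentence are hypothetical and chosen only to illustrate the categories listed above; the application does not disclose its actual expressions.

```python
import re

# Hypothetical pattern fragments, one per form of expression (assumptions, not from the source).
CANDIDATE_PATTERNS = [
    r"[子丑寅卯辰巳午未申酉戌亥]时",   # traditional two-hour periods, e.g. 午时
    r"[一二三四五]更",                   # night watches, e.g. 二更
    r"大寒|春分|夏至",                   # solar terms
    r"国庆节|劳动节",                    # festivals
    r"唐朝|上古时代",                    # dynasties and eras
    r"每年|近年|数十年",                 # fixed intervals and vague periods
    r"\d{4}年\d{1,2}月\d{1,2}日",        # conventional year-month-day dates
]
CANDIDATE_RE = re.compile("|".join(CANDIDATE_PATTERNS))


def extract_candidates(text):
    """Return (start, end, word) for every candidate word matched in the text."""
    return [(m.start(), m.end(), m.group()) for m in CANDIDATE_RE.finditer(text)]


text = "唐朝某年春分，二更时分，众人于午时散去。"
print([word for _, _, word in extract_candidates(text)])
# ['唐朝', '春分', '二更', '午时']
```

The patterns are deliberately permissive: a match only makes a word a candidate, and the later disambiguation step decides whether it really denotes time.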

In another implementation of extracting candidate words, referring to Fig. 2, the step of extracting all candidate words in the text may specifically comprise:

S201: Extract prime words from the text;

S202: Determine the matching area corresponding to each prime word in the text, the matching area comprising the prime word and a predetermined number of characters before and after the prime word;

S203: Generate candidate words, each candidate word being a word in the matching area that contains the prime word and has at least one meaning that denotes time.

In step S201, prime words can be extracted from the text by regular-expression matching. A prime word here can be a character such as "watch" (更), "hour" (时), "drum" (鼓), "quarter" (刻), "year" (年), "month" (月), or "day" (日) that can denote time together with the characters before and after it.

For example, text 1 from which time words are to be extracted reads: "At the shen hour (3 p.m. to 5 p.m.), the governor's mansion was hung with lanterns and streamers and was very lively. As soon as the xu hour (7 p.m. to 9 p.m.) came, the crowd dispersed in no time, leaving only lanterns with a few lingering traces of fireworks struggling in the wind." Prime word 1 "hour", prime word 2 "hour", and prime word 3 "quarter" are then extracted from text 1.

In step S202, a prime word together with a predetermined number of characters before and after it constitutes the matching area corresponding to that prime word in the text. Each prime word has its own matching area in the text.

Continuing the example in S201, suppose the matching area of the prime word "hour" is preset as the 2 characters before it in the text, the 1 character after it, and the prime word itself, and the matching area of the prime word "quarter" is preset as the 3 characters before it in the text and the prime word itself.

The matching area corresponding to each prime word in text 1 is then as follows, marked with square brackets:

[At the shen hour,] the governor's mansion was hung with lanterns and streamers and was very lively. As soon as [the xu hour came,] the crowd dispersed [in no time], leaving only lanterns with a few lingering traces of fireworks struggling in the wind.

The three bracketed spans are, in order, matching area 1, matching area 2, and matching area 3.

In step S203, a candidate word is a word in the matching area corresponding to a prime word; it contains the prime word and has at least one meaning that denotes time. Candidate words can be generated by searching the matching area for a second preset character string corresponding to the prime word, or by performing semantic analysis on the text in the matching area.

For example, continuing the example in S202, the second preset character strings corresponding to the prime word "hour" (时) are preset as the twelve Earthly Branches: zi (子), chou (丑), yin (寅), mao (卯), chen (辰), si (巳), wu (午), wei (未), shen (申), you (酉), xu (戌), and hai (亥). If the character immediately before the prime word "hour" is any one of these second preset character strings, a candidate word of the form "second preset character string + hour" is generated in the matching area. The second preset character strings corresponding to the prime word "quarter" (刻) are preset to include the Chinese numerals for one, two, and three, the digits "1", "2", and "3", and so on. If the character immediately before the prime word "quarter" is any one of the second preset character strings corresponding to "quarter", a candidate word of the form "second preset character string + quarter" is generated in the matching area.

Thus, searching matching area 1 shows that the character before prime word 1 "hour" is "shen", so candidate word 1 "shen hour" is generated. Searching matching area 2 shows that the character before prime word 2 "hour" is "xu", so candidate word 2 "xu hour" is generated. Searching matching area 3 shows that the character before prime word 3 "quarter" is "three", so candidate word 3 "three quarters" is generated.
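
Steps S201 to S203, in the variant that looks up a second preset character string, can be sketched as follows. The names `PRIME_WORD_RE`, `MATCH_WINDOW`, `SECOND_PRESET`, and `generate_candidates` are hypothetical, and the Chinese sentence is an illustrative reconstruction in the spirit of text 1 rather than the original wording.

```python
import re

# S201: prime words are single characters that may denote time in context.
PRIME_WORD_RE = re.compile(r"[时刻]")

# S202: (characters before, characters after) that form each prime word's matching area.
MATCH_WINDOW = {"时": (2, 1), "刻": (3, 0)}

# S203: second preset character strings per prime word.
SECOND_PRESET = {
    "时": set("子丑寅卯辰巳午未申酉戌亥"),  # the twelve Earthly Branches
    "刻": set("一二三123"),
}


def generate_candidates(text):
    """Return (start, end, word) for each candidate word built around a prime word."""
    candidates = []
    for m in PRIME_WORD_RE.finditer(text):
        prime, pos = m.group(), m.start()
        before, after = MATCH_WINDOW[prime]
        area_start = max(0, pos - before)
        area = text[area_start:pos + 1 + after]  # the matching area
        offset = pos - area_start                # index of the prime word inside the area
        # A candidate is generated when the character before the prime word,
        # inside the matching area, is a second preset character string.
        if offset >= 1 and area[offset - 1] in SECOND_PRESET[prime]:
            candidates.append((pos - 1, pos + 1, area[offset - 1:offset + 1]))
    return candidates


text = "申时，州牧府张灯结彩。一到戌时，人烟一时三刻便散去。"
print([word for _, _, word in generate_candidates(text)])
# ['申时', '戌时', '三刻']
```

Note that "三刻" is still produced here even though the surrounding idiom means "in no time"; the sketch only generates candidates and leaves such ambiguity to later steps.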

As another example, again continuing the example in S202, word segmentation and semantic analysis of the text "At the shen hour," in matching area 1 show that "shen hour" has a meaning that denotes time, so candidate word 1 "shen hour" is generated. Word segmentation and semantic analysis of the text "the xu hour came," in matching area 2 show that "xu hour" has a meaning that denotes time, so candidate word 2 "xu hour" is generated. For the text in matching area 3, an idiom that reads literally "one hour and three quarters" (一时三刻), word segmentation and semantic analysis show that the phrase as a whole means "at once" or "immediately" and does not denote a specific point in time, so no candidate word is generated from matching area 3.

Note that the candidate words generated may differ depending on the specific word segmentation and semantic analysis applied to the matching area. In the preceding example, for instance, segmenting the text in matching area 3 may yield three segments: "for a moment" (一时), "three quarters" (三刻), and the whole idiom (一时三刻). Semantic analysis is then performed on each of the three segments. "Three quarters" is now found to have a meaning that denotes a specific point in time, so candidate word 3 "three quarters" can be generated for matching area 3.

The above method of extracting candidate words first extracts the prime words in the text. Because the rules for extracting prime words are simpler than rules that extract candidate words directly or rules that extract the correct time words directly, as many prime words as possible can be extracted from the text. The matching area corresponding to each prime word is then determined, and candidate words are generated from the matching area. This simplifies the rules for extracting candidate words from the text and further reduces the omissions caused by complex extraction rules.

Each candidate word extracted in step S200 has at least one meaning that denotes time; that is, a candidate word may or may not denote time in the text, so it is ambiguous. For example, when the character before the candidate word "No. 3" in the text is "male" (forming "male No. 3"), "No. 3" denotes a number within a series of people or things rather than a time. As another example, when the character after the candidate word "7.6" in the text is "yuan", "gram", "metre", or the like, "7.6" denotes a quantity rather than a time. Therefore, after step S200, steps S300 and S400 determine whether each candidate word is a time word that denotes time and resolve the ambiguity, so that the time words in the text are extracted accurately.

S300: Determine the semantic region corresponding to each candidate word in the text, the semantic region comprising the candidate word and a predetermined number of characters before and after the candidate word.

In step S300, each candidate word corresponds to at least one semantic region in the text.

For example, text 2 from which time words are to be extracted gives a date of birth twice, first written in Chinese numerals and then as "August 15, 1893", and continues: "On her 80th birthday, Mr. and Mrs. Zhou held a banquet to celebrate it. At the shen hour, the guests arrived at the Zhou residence one after another." The candidate words extracted from text 2 are: candidate word 1, the date written in Chinese numerals; candidate word 2 "August 15, 1893"; candidate word 3 "chen hour" (辰时, 7 a.m. to 9 a.m.), whose characters occur inside the phrase "on her birthday" (寿辰时); and candidate word 4 "shen hour" (申时, 3 p.m. to 5 p.m.).

Suppose the semantic region of the candidate word "chen hour" is preset as the 1 character before it in the text plus the candidate word itself; the semantic region of the candidate word "shen hour" is preset as the 1 character before it in the text plus the candidate word itself; and the semantic region of a candidate word of the form "year X, month X, day X" is preset as the span from 4 characters before the character "year" through the character "day". The semantic region corresponding to each candidate word in text 2 is then as follows:

She was born on [the date in Chinese numerals], [August 15, 1893]. On her 80th [birthday], Mr. and Mrs. Zhou held a banquet to celebrate it[. At the shen hour], the guests arrived at the Zhou residence one after another.

The four bracketed spans are, in order, semantic region 1, semantic region 2, semantic region 3, and semantic region 4.

S400: If the semantic region does not contain the first preset character string corresponding to the candidate word, determine that the candidate word is a time word, and output the time word.

In step S400, the first preset character string is a character string whose presence in the same semantic region as the candidate word indicates that the candidate word does not denote time. That is, when a candidate word and its corresponding first preset character string fall within the same semantic region, the candidate word does not denote time. Different candidate words may correspond to different first preset character strings. When the first preset character string corresponding to a candidate word is empty, the candidate word has only the meaning that denotes time and is unambiguous.

The first preset character string corresponding to each candidate word can be stored in a corpus in advance. The first preset character strings in the corpus can be obtained from accumulated past experience, or generated in other ways.

For example, in one implementation of generating the first preset character string corresponding to a candidate word, a predetermined number of candidate sentences containing the candidate word are first chosen. Selected sentences, in which the candidate word does not denote time, are then screened out from the candidate sentences. Finally the first preset character string is extracted from the selected sentences, where the first preset character string appears only in the selected sentences and not in the other candidate sentences.
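
One way to read this generation procedure is as a set difference between the contexts of the selected sentences and the contexts of the remaining candidate sentences. The sketch below makes several assumptions that the source does not state: the sentences are already labeled (for instance by hand) as time or non-time uses, only the `window` characters immediately before the candidate word are considered, and the name `derive_first_preset` and all sample sentences are hypothetical.

```python
def derive_first_preset(candidate, sentences, window=1):
    """Derive first preset character strings for one candidate word.

    sentences: list of (sentence, is_time) pairs; is_time is False for the
    "selected" sentences in which the candidate word does not denote time.
    """
    def contexts(sentence):
        # Collect the characters found just before each occurrence of the candidate.
        found = set()
        start = sentence.find(candidate)
        while start != -1:
            found.update(sentence[max(0, start - window):start])
            start = sentence.find(candidate, start + 1)
        return found

    selected, others = set(), set()
    for sentence, is_time in sentences:
        (others if is_time else selected).update(contexts(sentence))
    # Keep only strings that occur in selected sentences and in no other candidate sentence.
    return selected - others


# Hypothetical labeled candidate sentences for the candidate word 辰时.
sentences = [
    ("在她八十寿辰时，周氏夫妇设宴。", False),  # 寿辰 + 时: "on her birthday"
    ("先生诞辰时，众人前来祝贺。", False),      # 诞辰 + 时: "on his birthday"
    ("次日辰时，队伍出发。", True),
    ("到辰时，雨才停。", True),
]
print(sorted(derive_first_preset("辰时", sentences)))  # ['寿', '诞']
```

A fixed one-character window is the simplest possible choice; a practical system would probably need longer contexts and a frequency threshold to avoid keeping characters that are absent from the time-denoting sentences only by chance.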

The first preset character string is compared with the text of the semantic region corresponding to the candidate word. If the semantic region does not contain the first preset character string corresponding to the candidate word, the candidate word is a time word in the text from which time words are to be extracted; that is, the candidate word denotes time in that text. If the semantic region contains the first preset character string corresponding to the candidate word, the candidate word is considered not to denote time in the text and is therefore not a time word.

For example, continuing the example in step S300, the first preset character string corresponding to the candidate word "chen hour" is preset as either of "longevity" and "birth"; the first preset character string corresponding to the candidate word "shen hour" is preset as "drawing"; and the first preset character string corresponding to candidate words of the form "year X, month X, day X" is preset as empty.

Because the first preset character string corresponding to candidate words of the form "year X, month X, day X" is empty, candidate word 1 (the date written in Chinese numerals) and candidate word 2 "August 15, 1893" can be determined to be time words.

The first preset character string corresponding to the candidate word "chen hour" is either of "longevity" and "birth". Comparison shows that semantic region 3 contains the first preset character string "longevity" corresponding to candidate word 3 "chen hour". Therefore, in text 2, candidate word 3 "chen hour" is not a time word.

The first preset character string corresponding to the candidate word "shen hour" is "drawing". Comparison shows that semantic region 4 does not contain the first preset character string "drawing" corresponding to candidate word 4 "shen hour". Therefore, in text 2, candidate word 4 "shen hour" is a time word.

Finally, the time words are output: the date written in Chinese numerals, "August 15, 1893", and "shen hour".
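
Steps S300 and S400 on this example can be sketched as below. The Chinese sentence is an illustrative reconstruction of text 2 (with only the Arabic-numeral date), and the concrete characters used for the first preset character strings ("寿" and "诞" for "chen hour", "引" for "shen hour") are assumptions about the original wording; `FIRST_PRESET`, `SEMANTIC_WINDOW`, and the function names are hypothetical.

```python
# First preset character strings per candidate word (assumed characters; empty means unambiguous).
FIRST_PRESET = {"辰时": {"寿", "诞"}, "申时": {"引"}}

# (characters before, characters after) that extend a candidate into its semantic region.
SEMANTIC_WINDOW = {"辰时": (1, 0), "申时": (1, 0)}


def semantic_region(text, start, end, before, after):
    """S300: the candidate word plus a preset number of characters around it."""
    return text[max(0, start - before):min(len(text), end + after)]


def filter_time_words(text, candidates):
    """S400: keep candidates whose semantic region holds no first preset character string."""
    time_words = []
    for start, end, word in candidates:
        before, after = SEMANTIC_WINDOW.get(word, (0, 0))
        region = semantic_region(text, start, end, before, after)
        blockers = FIRST_PRESET.get(word, set())
        if not any(blocker in region for blocker in blockers):
            time_words.append(word)
    return time_words


text = "她生于1893年8月15日，在她八十寿辰时，周氏夫妇设宴为她祝寿。申时，宾客陆续到达周氏宅邸。"
words = ["1893年8月15日", "辰时", "申时"]
candidates = [(text.find(w), text.find(w) + len(w), w) for w in words]
print(filter_time_words(text, candidates))  # ['1893年8月15日', '申时']
```

The candidate "辰时" is rejected because its semantic region "寿辰时" contains "寿", while the date has no first preset character string and passes unconditionally.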

In the method for extracting time words in the above technical solution, the text from which time words are to be extracted is obtained first, and all candidate words are extracted from it. Each candidate word has at least one meaning that denotes time; that is, a candidate word may or may not denote time in the text. The semantic region corresponding to each candidate word in the text is then determined, and it is judged whether the semantic region contains the first preset character string corresponding to the candidate word. This determines whether the candidate word is a time word in the text and eliminates the ambiguity. Finally the time words are output, completing the process of extracting time words from the text.

The method of this embodiment does not extract the correct time words from the text in a single pass. Instead it first extracts candidate words, then determines the semantic region of each candidate word, and then uses the semantic region and the first preset character string to judge whether the candidate word is a time word in the text, so that time words are extracted accurately. On the one hand, this simplifies the extraction rules and enlarges the set of candidate words extracted, avoiding the situation in which overly complex extraction rules cause a large number of time words to be missed. On the other hand, disambiguating the candidate words allows the time words in the text to be extracted relatively accurately, which makes the method particularly suitable for Chinese text, where time words take many forms. Applied to Chinese text, the method yields time words with more comprehensive coverage and more varied forms, while greatly reducing false extractions.

Alternatively, Fig. 3 is refer to, in the step of S400, the step of exporting the time word can include：

S411：Judge, whether comprising numeral in time word, if comprising numeral, to perform S412, if not comprising numeral, hold Row S415.

S412：Judge whether time word is default exclusion type, if not default exclusion type, is then performed S413；If default exclusion type, then perform S415.

S413：Time word is converted into preset format.

S414：The time word after format transformation is exported, is terminated.

S415：Output time word, terminate.

In the step of S411, the numeral that time word includes can be the numeral that Chinese represents, can also Arabic numerals Deng the application is without limitation.For example, time word contains the number of Chinese expression in " on August 15th, 1 " Word；In another example contain Arabic numerals in time word " 1893.08.15 ".

Steps S411 to S414 above serve mainly to unify the format of time words that contain numerals. However, time words such as "two or three years", "decades", and "one or two days" contain Chinese numerals, yet converting those numerals to Arabic numerals changes their original meaning. For example, converting the Chinese numerals in "two or three years" to Arabic numerals gives "23 years", and the meaning of the two differs. Time words whose meaning would change under such a format conversion are therefore treated as the preset exclusion type, and step S412 removes them from the time words containing numerals that are to be converted.
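One possible check for this exclusion type is sketched below. It assumes that the excluded words are exactly three characters long: two consecutive Chinese numerals followed by one of a few time units. Both the unit list and the length restriction are assumptions made for illustration.

```python
# Chinese numeral characters and their values (两 is a colloquial form of 2).
CN_DIGITS = {"一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = ("年", "月", "日", "天")  # year, month, day, day (assumed unit list)


def is_approximate(word):
    """True for words such as 二三年, where the second numeral is the first plus one."""
    if len(word) != 3 or word[2] not in UNITS:
        return False
    x, y = CN_DIGITS.get(word[0]), CN_DIGITS.get(word[1])
    return x is not None and y is not None and y - x == 1


print(is_approximate("二三年"), is_approximate("二十年"))
```

A word such as 二十年 ("twenty years") is not excluded under this sketch, because 十 is not treated as a single-digit numeral and the number it forms can be converted without changing the meaning.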

In step S413, the preset format can be set by the user as required. For example, all year-month-day dates can be uniformly expressed in the form "year XXXX, month XX, day XX". As another example, clock times in hours and minutes can be uniformly expressed as "XX:XX", as shown in the examples in Table 1.

Table 1

Time word before conversion	Time word after conversion
August 2017, the 1st	August 1, 2017
1998/7/20	July 20, 1998
From June 5, 2007 to October 9, 2008	June 5, 2007 - October 9, 2008
3: 20	3:20
A quarter past 3	3:15
3 o'clock to 5 o'clock	3:00-5:00

In this step, time words containing numerals in ancient Chinese texts, such as "first watch" and "second drum", can also be converted to a unified preset format, as shown in the examples in Table 2.

Table 2

With the above steps, extracted time words that contain numerals can be converted to a unified preset format before being output, so that the output time words have a more standard form and are easier to use later.

Optionally, when a time word contains no numeral, it can further be judged whether the time word is a preset ancient time word, such as a traditional two-hour period like "the period from 11 p.m. to 1 a.m." or "the period from 3 p.m. to 5 p.m.". If it is a preset ancient time word, it is converted to the corresponding preset format; if it is not, the time word is output directly without format conversion. For example, traditional periods such as "the period from 11 p.m. to 1 a.m." and "the period from 1 a.m. to 3 a.m." can be converted to a unified preset format, as shown in the examples in Table 3.

Table 3

Time word before conversion	Time word after conversion
The period from 11 p.m. to 1 a.m.	0:00-2:00
The period from 7 a.m. to 9 a.m.	8:00-10:00
The period from 3 p.m. to 5 p.m.	16:00-18:00
The period from 7 p.m. to 9 p.m.	20:00-22:00

Optionally, referring to Fig. 4, in step S400 the step of outputting the time word may include:

S421: Determine the start-stop position of each time word in the text;

S422: Merge time words whose start-stop positions overlap or are adjacent;

S423: Output the merged time words.

In step S421, the start-stop position of a time word in the text comprises a start position and an end position. The start-stop position can be expressed by character index. For example, in text 2 used in step S300, the date time word written in Chinese numerals starts at the 5th character and ends at the 14th character, and the time word for the period from 3 p.m. to 5 p.m. starts at the 47th character and ends at the 48th character. Besides character index, other ways of recording the position of a time word can also be used, such as X-axis and Y-axis coordinates.
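Recording positions by character index could look like the following sketch, which uses 1-based, inclusive positions to match the "5th character to 14th character" style above. The function name and the toy input are hypothetical.

```python
import re


def find_spans(text, words):
    """Return (word, start, end) tuples with 1-based, inclusive character positions."""
    spans = []
    for word in words:
        for m in re.finditer(re.escape(word), text):
            # m.start() is 0-based; m.end() is exclusive, so it equals the
            # 1-based inclusive end position.
            spans.append((word, m.start() + 1, m.end()))
    # Sort by position so that later steps can walk the spans in text order.
    return sorted(spans, key=lambda s: (s[1], s[2]))


print(find_spans("6.27 dusk", ["dusk", "6.27"]))
```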

The text 2 of time word to be extracted：

While candidate words or prime words are being extracted, some of them may match several extraction rules, so some of the finally extracted time words may overlap or be adjacent. In step S422, time words whose start-stop positions overlap or are adjacent fall into three cases.

In the first case, the previous time word and the next time word partly overlap. For example, from the text "September 1, 2015 morning 8 o'clock", time word 1 "September 1, 2015 morning" and time word 2 "morning 8 o'clock" are extracted. From the start-stop positions of time word 1 and time word 2 it can be determined that the "morning" in time word 1 overlaps the "morning" in time word 2. Time words whose start-stop positions overlap in this way can be merged into "September 1, 2015 morning 8 o'clock".

In the second case, the previous time word contains the next time word, or the next time word contains the previous one, so the two overlap. For example, from the text "September 1, 2015 morning 8 o'clock", time word 1 "September 1, 2015 morning" and time word 2 "morning" are extracted. From their start-stop positions it can be determined that time word 1 contains time word 2. Time words whose start-stop positions overlap in this way can be merged into "September 1, 2015 morning".

In the third case, the previous time word is adjacent to the next time word. For example, from the text "September 1, 2015 morning 8 o'clock", time word 1 "September 1, 2015" and time word 2 "morning 8 o'clock" are extracted. From their start-stop positions it can be determined that time word 1 and time word 2 are adjacent in the text. Time words whose start-stop positions are adjacent in this way can be merged into "September 1, 2015 morning 8 o'clock".
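The three cases can be told apart from the start-stop positions alone. A minimal sketch follows, assuming 1-based inclusive `(start, end)` pairs; the label strings are chosen here only for illustration.

```python
def relation(a, b):
    """Classify two (start, end) spans given as 1-based, inclusive positions."""
    # Order the spans so that the first one starts no later than the second.
    (a_start, a_end), (b_start, b_end) = sorted([a, b])
    if b_start > a_end + 1:
        return "separate"          # at least one character lies between them
    if b_start == a_end + 1:
        return "adjacent"          # third case
    if b_end <= a_end or b_start == a_start:
        return "contains"          # second case: one span lies inside the other
    return "partial overlap"       # first case


print(relation((1, 11), (10, 12)))
```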

Before judging whether the positions of two time words in the text overlap or are adjacent, all time words extracted from the text can first be sorted by their start-stop positions in the text. It is then only necessary to compare the start-stop position of the current time word with that of the next time word to determine whether the two overlap, are adjacent, or are separated. There is no need to compare the position of one time word with those of all the remaining time words one by one before it can be determined whether that time word overlaps or is adjacent to any of them, which greatly reduces the computation required by the merging step. More specifically, referring to Fig. 5, step S422 of merging time words whose start-stop positions overlap or are adjacent may include:

S4221: Judge whether the start-stop position of the current time word overlaps or is adjacent to the start-stop position of the next time word;

S4222: If they overlap or are adjacent, replace the current time word and the next time word with the union of the two;

S4223: Determine the start-stop position of the updated time word in the text;

S4224: Repeat steps S4221 to S4223 until the start-stop position of the updated time word neither overlaps nor is adjacent to the start-stop position of the next time word after it, then take the updated time word as the merged time word.

If the judgment in S4221 finds that the start-stop position of the current time word neither overlaps nor is adjacent to that of the next time word, the current time word is output.

With the above method, two or more adjacent or overlapping time words can be merged into one for output.
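As a rough illustration of the sort-then-compare loop, the sketch below merges 1-based inclusive spans in a single pass. It is a common interval-merging formulation that should give the same result as repeating S4221 to S4223, not necessarily the exact control flow of Fig. 5; the function names and sample spans are illustrative.

```python
def merge_spans(spans):
    """Merge overlapping or adjacent (start, end) spans, 1-based and inclusive."""
    merged = []
    for start, end in sorted(spans):
        # After sorting, only the most recently merged span can touch this one.
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def merged_words(text, spans):
    """Cut the merged time words out of the text."""
    return [text[s - 1:e] for s, e in merge_spans(spans)]


print(merge_spans([(1, 4), (5, 6), (7, 9), (31, 36), (45, 47), (50, 51)]))
```

Because the spans are sorted first, each span is compared only with its predecessor, so the cost is dominated by the sort rather than by pairwise comparison of all time words.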

The time word extraction method of the present application is illustrated below by another specific embodiment.

Text 3, from which time words are to be extracted:

6.27 dusk 5 thirty, ran 10 kilometers through the city with long strides and returned home to No. 9 building, taking 44 minutes 33 seconds in total. Have kept up running for 1 year now and gained a great deal. Before going to bed, read 《The Strange Injustice of Nine Lives》 by the Qing Dynasty novelist Wu Jianren, in which it is written: "By my reckoning, the brothers have seldom been at home at the same time these days; if we go rashly and cannot capture them at one drum, would that not be unfortunate!"

After text 3 is obtained, all candidate words in the text are extracted, 8 in total:

Candidate word 1: 6.27

Candidate word 2: dusk

Candidate word 3: 5 thirty

Candidate word 4: No. 9

Candidate word 5: 44 minutes 33 seconds

Candidate word 6: 1 year

Candidate word 7: Qing Dynasty

Candidate word 8: One drum.

It is preset that a candidate word of the form "X.X" in the text (where X is a numeral and may stand for one or more digits; the same applies below), together with the 1 character following it, forms the semantic region corresponding to "X.X". The first preset character string corresponding to candidate word "X.X" is any one of "gram", "jin", "meter", "yuan".

It is preset that the candidate word "No. X" in the text, together with the 1 character following it, forms the semantic region corresponding to candidate word "No. X". The first preset character string corresponding to candidate word "No. X" is any one of "building", "room", "shop".

It is preset that the candidate word "one drum" in the text, together with the 2 characters following it, forms the semantic region corresponding to candidate word "one drum". The first preset character string corresponding to candidate word "one drum" is either of "plate" and "and capture".

It is preset that each of the candidate words "X thirty", "dusk", "X minutes X seconds", "X year" and "Qing Dynasty" in the text forms, by itself, the semantic region corresponding to that candidate word. The first preset character strings corresponding to "X thirty", "dusk", "X minutes X seconds", "X year" and "Qing Dynasty" are all empty.

The semantic region corresponding to each candidate word in text 3 is determined, as shown in Table 4.

Table 4

No.	Candidate word	Semantic region	First preset character string
1	6.27	6.27 plus the first character of "dusk"	gram, jin, meter, yuan
2	dusk	dusk	(none)
3	5 thirty	5 thirty	(none)
4	No. 9	No. 9 building	building, room, shop
5	44 minutes 33 seconds	44 minutes 33 seconds	(none)
6	1 year	1 year	(none)
7	Qing Dynasty	Qing Dynasty	(none)
8	One drum	One drum and capture	plate, and capture

For each candidate word, judge whether its semantic region contains the corresponding first preset character string. If it does not, the candidate word is a time word in text 3; if it does, the candidate word is not a time word in text 3. In this way, 6 of the above 8 candidate words are determined to be time words, as follows:

Time word 1: 6.27

Time word 2: dusk

Time word 3: 5 thirty

Time word 4: 44 minutes 33 seconds

Time word 5: 1 year

Time word 6: Qing Dynasty.

It is preset that time words of the form "X thirty" are converted to the preset format "X:30"; time words of the form "X.X" are converted to the preset format "month X, day X"; and time words of the form "X minutes X seconds" keep their original form without conversion. The preset exclusion types are "XY years", "XY months" and "XY days" (the last in two written variants), where X and Y are numerals, X is smaller than Y by 1, and the numerals here are Chinese numerals.

Judge whether the above 6 time words contain numerals. 4 of them contain numerals, namely: time word 1 "6.27", time word 3 "5 thirty", time word 4 "44 minutes 33 seconds", and time word 5 "1 year".

For the 4 time words that contain numerals, judge whether each is of the preset exclusion type. Time word 1 "6.27", time word 3 "5 thirty" and time word 4 "44 minutes 33 seconds" are not of the preset exclusion type, whereas time word 5 "1 year" (written with two consecutive Chinese numerals, literally "one or two years") is of the preset exclusion type.

The 3 numeral-containing time words that are not of the preset exclusion type are converted to the preset format, as shown in Table 5.

Table 5

Time word	Time word before conversion	Time word after conversion
Time word 1	6.27	June 27
Time word 3	5 thirty	5:30
Time word 4	44 minutes 33 seconds	44 minutes 33 seconds
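The conversions in Table 5 can be approximated with two regular-expression rules. This sketch works on the English renderings used in this translation ("5 thirty", "6.27"), which is an assumption; the original method operates on Chinese text.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def convert(word):
    """Convert 'X thirty' to 'X:30' and 'X.X' to a month-day form; keep anything else."""
    m = re.fullmatch(r"(\d{1,2}) thirty", word)
    if m:
        return f"{m.group(1)}:30"
    m = re.fullmatch(r"(\d{1,2})\.(\d{1,2})", word)
    if m and 1 <= int(m.group(1)) <= 12:
        return f"{MONTHS[int(m.group(1)) - 1]} {int(m.group(2))}"
    # Forms such as "44 minutes 33 seconds" are left unchanged.
    return word


for w in ("6.27", "5 thirty", "44 minutes 33 seconds"):
    print(w, "->", convert(w))
```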

Time words that contain no numeral, and time words of the preset exclusion type, are kept unchanged. At this point, the time words extracted from text 3 are:

Time word 1: June 27

Time word 2: dusk

Time word 3: 5:30

Time word 4: 44 minutes 33 seconds

Time word 5: 1 year

Time word 6: Qing Dynasty.

The above 6 time words can be output as they are, or they can be output after the merging step has been applied to them.

Determine the start-stop position of each of the 6 time words in text 3 (positions count characters of the original Chinese text). If a time word has already gone through format conversion, the position in text 3 of the time word before conversion is still used as its start-stop position. The results are shown in Table 6.

Table 6

No.	Time word	Start position	End position
1	6.27	1	4
2	dusk	5	6
3	5 thirty	7	9
4	44 minutes 33 seconds	31	36
5	1 year	45	47
6	Qing Dynasty	50	51

Judge whether the current time word "6.27" and the next time word "dusk" overlap or are adjacent. The two are adjacent, so their union "6.27 dusk" replaces "6.27" and "dusk"; that is, "6.27 dusk" becomes the new current time word. The start-stop position in text 3 of the updated time word "6.27 dusk" is then determined, as shown in Table 7.

Table 7

No.	Time word	Start position	End position
1	6.27 dusk	1	6
2	5 thirty	7	9
3	44 minutes 33 seconds	31	36
4	1 year	45	47
5	Qing Dynasty	50	51

The loop then judges whether the current time word "6.27 dusk" and the next time word "5 thirty" overlap or are adjacent. The two are adjacent, so their union "6.27 dusk 5 thirty" replaces "6.27 dusk" and "5 thirty"; that is, "6.27 dusk 5 thirty" becomes the new current time word. The start-stop position in text 3 of the updated time word "6.27 dusk 5 thirty" is determined again, as shown in Table 8.

Table 8

No.	Time word	Start position	End position
1	6.27 dusk 5 thirty	1	9
2	44 minutes 33 seconds	31	36
3	1 year	45	47
4	Qing Dynasty	50	51

The loop then judges whether the current time word "6.27 dusk 5 thirty" and the next time word "44 minutes 33 seconds" overlap or are adjacent. The two neither overlap nor are adjacent, so the current time word "6.27 dusk 5 thirty" is output. The parts that have already been converted to the preset format are still output in that format, so the output is "June 27 dusk 5:30".

Then "44 minutes 33 seconds" is taken as the current time word, and it is judged whether it overlaps or is adjacent to the next time word "1 year". The two neither overlap nor are adjacent, so "44 minutes 33 seconds" is output directly.

Similarly, "1 year" is then taken as the current time word, and it is judged whether it overlaps or is adjacent to the next time word "Qing Dynasty". The two neither overlap nor are adjacent, so "1 year" is output directly.

"Qing Dynasty" is the last time word in text 3 and has no next time word, so "Qing Dynasty" is output directly.

The time words finally extracted from text 3 and output are therefore: "June 27 dusk 5:30", "44 minutes 33 seconds", "1 year", "Qing Dynasty".

Referring to Fig. 6, a second embodiment of the present application provides a time word extraction device, comprising:

Acquiring unit 1, configured to obtain a text from which time words are to be extracted;

Processing unit 2, configured to extract all candidate words in the text, determine the semantic region corresponding to each candidate word in the text, and determine that a candidate word is a time word when its semantic region does not contain the first preset character string corresponding to that candidate word; wherein each candidate word has at least one sense that denotes time, and the semantic region comprises the candidate word and a predetermined number of characters before and after the candidate word;

Output unit 3, configured to output the time word.

Optionally, processing unit 2 is further configured to extract prime words from the text, determine the matching region corresponding to each prime word in the text, and generate candidate words; wherein the matching region comprises the prime word and a predetermined number of characters before and after the prime word, and a candidate word is a word in the matching region that contains the prime word and has at least one sense that denotes time.

Optionally, processing unit 2 is further configured to judge, when a time word contains a numeral, whether the time word is of a preset exclusion type, and, if it is not, to convert the time word to a preset format; the output unit is further configured to output the converted time word.

Optionally, processing unit 2 is further configured to determine the start-stop position of each time word in the text and to merge time words whose start-stop positions overlap or are adjacent; the output unit is further configured to output the merged time words.

Optionally, processing unit 2 is further configured to judge whether the start-stop position of the current time word overlaps or is adjacent to that of the next time word; to replace the current time word and the next time word with their union when they overlap or are adjacent; to determine the start-stop position of the updated time word in the text; and, when the start-stop position of the updated time word neither overlaps nor is adjacent to that of the next time word after it, to take the updated time word as the merged time word.

For parts that are the same or similar across the embodiments in this specification, the embodiments may be referred to one another. The embodiments of the invention described above do not limit the scope of protection of the invention.

Claims

1. A method for extracting time words, characterized by comprising the following steps:

Obtaining a text from which time words are to be extracted;

Extracting all candidate words in the text, each candidate word having at least one sense that denotes time;

Determining the semantic region corresponding to each candidate word in the text, the semantic region comprising the candidate word and a predetermined number of characters before and after the candidate word;

If the semantic region does not contain the first preset character string corresponding to the candidate word, determining that the candidate word is a time word, and outputting the time word.

2. The method for extracting time words according to claim 1, characterized in that the step of extracting all candidate words in the text comprises:

Extracting prime words from the text;

Determining the matching region corresponding to each prime word in the text, the matching region comprising the prime word and a predetermined number of characters before and after the prime word;

Generating candidate words, a candidate word being a word in the matching region that contains the prime word and has at least one sense that denotes time.

3. The method for extracting time words according to claim 1, characterized in that the step of outputting the time word comprises:

Outputting the time word after format conversion.

4. The method for extracting time words according to claim 1, characterized in that the step of outputting the time word comprises:

Determining the start-stop position of each time word in the text;

Merging time words whose start-stop positions overlap or are adjacent;

Outputting the merged time words.

5. The method for extracting time words according to claim 4, characterized in that the step of merging time words whose start-stop positions overlap or are adjacent comprises:

If they overlap or are adjacent, replacing the current time word and the next time word with the union of the current time word and the next time word;

If the start-stop position of the updated time word neither overlaps nor is adjacent to the start-stop position of the next time word after it, taking the updated time word as the merged time word.

6. A time word extraction device, characterized by comprising:

An acquiring unit, configured to obtain a text from which time words are to be extracted;

A processing unit, configured to extract all candidate words in the text, determine the semantic region corresponding to each candidate word in the text, and determine that a candidate word is a time word when the semantic region does not contain the first preset character string corresponding to the candidate word; wherein each candidate word has at least one sense that denotes time, and the semantic region comprises the candidate word and a predetermined number of characters before and after the candidate word;

An output unit, configured to output the time word.

7. The time word extraction device according to claim 6, characterized in that the processing unit is further configured to extract prime words from the text, determine the matching region corresponding to each prime word in the text, and generate candidate words; wherein the matching region comprises the prime word and a predetermined number of characters before and after the prime word, and a candidate word is a word in the matching region that contains the prime word and has at least one sense that denotes time.

8. The time word extraction device according to claim 6, characterized in that the processing unit is further configured to judge, when a time word contains a numeral, whether the time word is of a preset exclusion type, and, if it is not, to convert the time word to a preset format; the output unit is further configured to output the converted time word.

9. The time word extraction device according to claim 6, characterized in that the processing unit is further configured to determine the start-stop position of each time word in the text and to merge time words whose start-stop positions overlap or are adjacent; the output unit is further configured to output the merged time words.

10. The time word extraction device according to claim 9, characterized in that the processing unit is further configured to judge whether the start-stop position of the current time word overlaps or is adjacent to that of the next time word; to replace the current time word and the next time word with the union of the two when they overlap or are adjacent; to determine the start-stop position of the updated time word in the text; and, when the start-stop position of the updated time word neither overlaps nor is adjacent to that of the next time word after it, to take the updated time word as the merged time word.