Disclosure of Invention
In order to solve the above technical problems, the present application provides a time word extraction method that addresses two problems: time word extraction rules are complex, and time words are easily omitted.
In a first aspect, a method for extracting time words is provided, which includes the following steps:
acquiring a text of a time word to be extracted;
extracting all candidate words in the text, wherein each candidate word at least has one semantic meaning for representing time;
determining semantic regions corresponding to the candidate words in the text respectively, wherein the semantic regions comprise the candidate words and a preset number of characters before and after the candidate words;
and if the semantic region does not contain a first preset character string corresponding to the candidate word, determining the candidate word as a time word, and outputting the time word.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the step of extracting all candidate words in the text includes:
extracting original words from the text;
determining a matching area corresponding to each original word in the text, wherein the matching area comprises the original words and a predetermined number of characters before and after the original words;
and generating a candidate word, wherein the candidate word is a word which contains the original word in the matching area and has at least one semantic meaning for representing time.
With reference to the first implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the step of outputting the time word includes:
if the time words contain numbers, judging whether the time words are preset exclusion types or not;
if the time word is not the preset exclusion type, converting the time word into a preset format;
and outputting the time words after the format conversion.
With reference to the first aspect and the foregoing possible implementation manners, in a third possible implementation manner of the first aspect, the step of outputting the time word includes:
determining a start-stop position of each time word in the text;
merging the time words with overlapped or adjacent start and stop positions;
and outputting the merged time words.
With reference to the first aspect and the foregoing possible implementations, in a fourth possible implementation of the first aspect, the step of merging time words with overlapping or adjacent start-stop positions includes:
judging whether the start-stop position of the current time word is overlapped or adjacent to the start-stop position of the next time word;
if the time words are overlapped or adjacent, updating the current time word and the next time word into a union of the current time word and the next time word;
determining the starting and ending positions of the updated time words in the text;
and if the start-stop position of the updated time word neither overlaps nor is adjacent to the start-stop position of the next time word, taking the updated time word as the merged time word.
In a second aspect, a time word extracting apparatus is provided, including:
the acquisition unit is used for acquiring a text of a time word to be extracted;
the processing unit is used for extracting all candidate words in the text, determining semantic areas corresponding to the candidate words in the text respectively, and determining the candidate words as time words under the condition that the semantic areas do not contain first preset character strings corresponding to the candidate words; each candidate word at least has a semantic meaning for representing time, and the semantic area comprises the candidate words and a preset number of characters before and after the candidate words;
and the output unit is used for outputting the time words.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the processing unit is further configured to extract original words from the text, determine matching regions corresponding to the original words in the text, and generate candidate words; the matching area comprises an original word and a predetermined number of characters before and after the original word, and the candidate word is a word which contains the original word in the matching area and at least has one semantic meaning for representing time.
With reference to the first implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the processing unit is further configured to determine whether the time word is a preset exclusion type when the time word includes a number, and if the time word is not the preset exclusion type, convert the time word into a preset format; the output unit is also used for outputting the time words after format conversion.
With reference to the second aspect and the foregoing possible implementations, in a third possible implementation of the second aspect, the processing unit is further configured to determine a start-stop position of each time word in the text, and merge time words with overlapping or adjacent start-stop positions; the output unit is further used for outputting the merged time words.
With reference to the second aspect and the foregoing possible implementation manners, in a fourth possible implementation manner of the second aspect, the processing unit is further configured to determine whether a start-stop position of a current time word overlaps or is adjacent to a start-stop position of a next time word, update the current time word and the next time word to a union of the current time word and the next time word under the overlapping or adjacent condition, determine a start-stop position of the updated time word in the text, and take the updated time word as the merged time word under the condition that the start-stop position of the updated time word does not overlap and is not adjacent to the start-stop position of the next time word.
According to the method and the device for extracting the time words in the technical scheme, firstly, the text of the time words to be extracted is obtained, and all candidate words are extracted from the text. Each candidate word has at least one semantic meaning for representing time, that is, the candidate word may or may not be a time word representing time in the text. And then determining semantic areas corresponding to the candidate words in the text respectively, and judging whether the semantic areas contain first preset character strings corresponding to the candidate words, so as to determine whether the candidate words are time words or not in the text, and eliminating ambiguity. And finally, outputting the time words to finish the process of extracting the time words from the text.
The method does not directly extract the accurate time words from the text at one time, but extracts the candidate words firstly, then determines the semantic area of the candidate words, and then judges whether the candidate words are the time words in the text by utilizing the semantic area and the first preset character string, thereby extracting the accurate time words from the text. Therefore, on one hand, the extraction rule can be simplified, the number of extracted candidate words can be increased, and the condition that a large number of time words are omitted due to the fact that the extraction rule is too complex is avoided; on the other hand, the candidate words are disambiguated, so that the time words in the text can be extracted more accurately, and the method is particularly suitable for Chinese texts with diversified time word expression forms. The extraction method of the time words is applied to extraction of the time words of the Chinese text, so that the extracted time words can be covered more comprehensively and have more diversified forms, and meanwhile, the missing quantity is also greatly reduced.
Detailed Description
The following provides a detailed description of the embodiments of the present application.
Referring to fig. 1, in a first embodiment of the present application, a method for extracting time words is provided, which includes steps S100 to S400.
S100: acquiring a text of the time word to be extracted.
In step S100, the text of the time word to be extracted may be a Chinese text written in vernacular Chinese, a Chinese text written in classical Chinese, or the like, which is not limited in the present application.
S200: extracting all candidate words in the text, wherein each candidate word has at least one semantic meaning for representing time.
In step S200, each candidate word has at least one semantic meaning for representing time; that is, besides a meaning that represents time, the candidate word may also have other meanings. For example, "No. 3" may indicate a certain date, or the number of a certain person or thing in a series of persons or things.
All candidate words in the text may be extracted by direct matching against a constructed regular expression, or in other ways.
In one implementation of extracting candidate words, a regular expression is matched directly against the text of the time word to be extracted. When the regular expression is constructed, its pattern strings may cover multiple forms of time words. For example: time words such as "chou hour" (丑时), "wu hour" (午时) and "second watch" (二更), which represent time in the manner of the heavenly stems and earthly branches; time words such as "Major Cold", "Spring Equinox" and "Summer Solstice", which represent time by solar terms; time words such as "the festival" and "Labor Day", which represent time by festivals; time words such as "Tang dynasty", "Shang and Zhou" (商周), "remote antiquity" and "millennium years", which represent an era or a dynasty; time words such as "yearly" and "day-by-day", which represent time periods at fixed intervals; and time words such as "chronological", "escape" and "decades", which represent fuzzy time periods.
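The direct-matching implementation can be sketched as follows. This is only a minimal illustration: the pattern fragments, the name `CANDIDATE_PATTERN` and the function `extract_candidates` are assumptions made for the example, and the handful of alternatives stands in for a much larger rule set.

```python
import re

# Hypothetical pattern fragments; each alternative covers one surface form
# of a time word. A real rule set would be far larger.
CANDIDATE_PATTERN = re.compile(
    "|".join([
        r"[子丑寅卯辰巳午未申酉戌亥]时",   # earthly-branch hours
        r"[一二三四五]更",                   # night watches
        r"大寒|春分|夏至",                   # solar terms
        r"劳动节",                           # festivals
        r"唐朝|清代",                        # dynasties
        r"\d{4}年\d{1,2}月\d{1,2}日",        # numeric dates
    ])
)


def extract_candidates(text):
    """Return (candidate, start, end) for every match; end is exclusive."""
    return [(m.group(), m.start(), m.end())
            for m in CANDIDATE_PATTERN.finditer(text)]
```

Each match is only a candidate at this stage; whether it actually represents time is decided later from its semantic area.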
In another implementation manner of extracting candidate words, referring to fig. 2, the step of extracting all candidate words in the text may specifically include:
S201: extracting original words from the text;
S202: determining a matching area corresponding to each original word in the text, wherein the matching area comprises the original word and a predetermined number of characters before and after the original word;
S203: generating a candidate word, wherein the candidate word is a word which contains the original word in the matching area and has at least one semantic meaning for representing time.
In step S201, the original words may be extracted from the text by regular expression matching. The original words here may be words such as "watch" (更), "time" (时), "drum" (鼓), "moment" (刻), "year", "month" and "day", which can represent time together with the characters before and after them.
For example, text 1 of the time word to be extracted is: "in Shen, state Munfu festoon, cheerful. When the user likes the dead, the smoke of the user is scattered immediately, and the colored lamps with little smoke and fire are reserved in the wind." Original word 1 "time" (时), original word 2 "time" (时), and original word 3 "moment" (刻) are extracted from text 1.
In step S202, an original word and a predetermined number of characters before and after the original word form a corresponding matching region of the original word in the text, and each original word has a corresponding matching region in the text.
Following the example in S201, suppose it is preset that the matching area corresponding to the original word "time" consists of the original word together with the 2 characters before it and the 1 character after it in the text, and that the matching area corresponding to the original word "moment" consists of the original word together with the 3 characters before it in the text.
In the text 1 of the time word to be extracted, the matching area corresponding to each original word is as follows:
[Text 1 with three bracketed spans: matching region 1 around original word 1, matching region 2 around original word 2, and matching region 3 around original word 3.]
In step S203, the candidate word is a word in the matching area corresponding to the original word, and the candidate word includes the original word and also has at least one semantic meaning for representing time. The step of generating the candidate word may be to search whether a second preset character string corresponding to the original word exists in the matching area, so as to generate the candidate word; semantic analysis may also be performed on the text in the matching region to generate candidate words.
For example, following the example in S202, the second preset character strings corresponding to the original word "time" (时) are preset to be the twelve earthly branches: "zi" (子), "chou" (丑), "yin" (寅), "mao" (卯), "chen" (辰), "si" (巳), "wu" (午), "wei" (未), "shen" (申), "you" (酉), "xu" (戌) and "hai" (亥). If the character before the original word "time" is any one of these second preset character strings, a candidate word of the form "second preset character string + time" is generated in the matching area. The second preset character strings corresponding to the original word "moment" (刻) are preset to include "one", "two", "three", "1", "2", "3", and so on. If the character before the original word "moment" is any one of the second preset character strings corresponding to "moment", a candidate word of the form "second preset character string + moment" is generated in the matching area.
Therefore, by searching in matching area 1, the character before original word 1 "time" is found to be "shen" (申), and candidate word 1 "shen hour" (申时) is generated. By searching in matching area 2, the character before original word 2 "time" is likewise found among the second preset character strings, and candidate word 2 is generated. By searching in matching area 3, the character before original word 3 "moment" is found to be "three", and candidate word 3 "three moments" (三刻) is generated.
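The string-lookup variant of S201 to S203 can be sketched roughly as below. The window sizes follow the example above, but the dictionaries `WINDOW` and `SECOND_PRESET`, the function name `generate_candidates` and the sample sentence used to exercise it are assumptions made for illustration, not the text of the application.

```python
import re

# Assumed matching-area sizes per original word: (characters before, characters after).
WINDOW = {"时": (2, 1), "刻": (3, 0)}
# Assumed second preset character strings per original word.
SECOND_PRESET = {
    "时": set("子丑寅卯辰巳午未申酉戌亥"),
    "刻": set("一二三123"),
}
ORIGINAL_WORD = re.compile("[时刻]")


def generate_candidates(text):
    """Return (candidate, start, end) tuples; end is exclusive."""
    candidates = []
    for m in ORIGINAL_WORD.finditer(text):            # S201: original words
        word, pos = m.group(), m.start()
        before, after = WINDOW[word]
        region_start = max(0, pos - before)
        region = text[region_start:pos + 1 + after]   # S202: matching area
        offset = pos - region_start                   # index of the word inside the area
        prev = region[offset - 1] if offset > 0 else ""
        if prev in SECOND_PRESET[word]:               # S203: lookup of second preset strings
            candidates.append((prev + word, pos - 1, pos + 1))
    return candidates
```

Because the rule for original words is a single character class, it is hard for it to miss an occurrence; the filtering burden moves to the per-word preset strings.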
For another example, still continuing the example in S202, word segmentation and semantic analysis are performed on the text in matching area 1; "shen hour" (申时) is found to have a semantic meaning representing time, and candidate word 1 "shen hour" is generated. Word segmentation and semantic analysis are likewise performed on the text in matching area 2; if that text has a semantic meaning representing time, candidate word 2 is generated. For the text in matching area 3, "one time, three moments" (一时三刻), word segmentation and semantic analysis find that the expression as a whole means "at once, immediately" and does not represent a specific time point, so matching area 3 may generate no corresponding candidate word.
It should be noted that different specific ways of performing word segmentation and semantic analysis on a matching area may produce different candidate words. For example, in the foregoing example, word segmentation of the text in matching area 3 may yield three segments, one of which is "three moments" (三刻). Semantic analysis is then performed on the three segments separately, and "three moments" is found to have a semantic meaning representing a specific time point, so candidate word 3 "three moments" can be generated for matching area 3.
This method of extracting candidate words first extracts original words from the text. Compared with a rule that directly extracts candidate words, or a rule that directly extracts accurate time words, the rule for extracting original words is simpler, so that as many original words as possible can be extracted from the text. A matching area corresponding to each original word is then determined, and candidate words are generated from the matching area. This simplifies the rules for extracting candidate words from the text and further reduces omissions caused by complex extraction rules.
The candidate word extracted in step S200 has at least one semantic meaning for representing time; that is, the candidate word may or may not represent time in the text, so there is ambiguity. For example, when the character before the candidate word "No. 3" in the text is "man", "No. 3" indicates the number of a certain person in a series of persons or things in the text, and does not represent time. For another example, when the character after the candidate word "7.6" in the text is "yuan", "gram", "meter" or the like, "7.6" indicates a quantity in the text, and does not represent time. For this reason, after step S200, steps S300 and S400 determine whether each candidate word is a time word representing time, thereby disambiguating the candidate words and accurately extracting the time words in the text.
S300: determining semantic regions corresponding to the candidate words in the text respectively, wherein each semantic region comprises the candidate word and a preset number of characters before and after the candidate word.
In step S300, each candidate word corresponds to at least one semantic area in the text.
For example, text 2 reads: "She was born on August 15, 1893 (written in Chinese numerals, 一八九三年八月十五日), that is, on August 15, 1893 (written in Arabic numerals, 1893年8月15日), and on her eightieth birthday Mr. and Mrs. Zhou held a banquet to celebrate her longevity. At the shen hour, guests arrived at the Zhou mansion one after another." The candidate words extracted from text 2 are: candidate word 1, the date written in Chinese numerals; candidate word 2, the date written in Arabic numerals; candidate word 3 "chen hour" (辰时), which spans the last character of the word for "birthday" (生辰) and the character "time" (时) that follows it; and candidate word 4 "shen hour" (申时).
Suppose it is preset that the semantic area corresponding to the candidate word "chen hour" consists of the candidate word and the 1 character before it in the text; that the semantic area corresponding to the candidate word "shen hour" consists of the candidate word and the 1 character before it in the text; and that the semantic area of a candidate word in the format "X year X month X day" runs from the 4th character before the character "year" to the character "day". The semantic areas corresponding to the candidate words in text 2 of the time word to be extracted are then determined as follows:
Semantic area 1: [一八九三年八月十五日], the date written in Chinese numerals (candidate word 1)
Semantic area 2: [1893年8月15日], the date written in Arabic numerals (candidate word 2)
Semantic area 3: [生辰时], the character "life" (生) followed by candidate word 3 "chen hour"
Semantic area 4: [。申时], the preceding full stop followed by candidate word 4 "shen hour"
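Determining a semantic area amounts to slicing the text around the candidate word. The sketch below assumes a per-candidate window table; the names `SEMANTIC_WINDOW` and `semantic_region` and the shortened sample sentence are hypothetical.

```python
# Assumed semantic-area sizes per candidate word: (characters before, characters after).
SEMANTIC_WINDOW = {"辰时": (1, 0), "申时": (1, 0)}


def semantic_region(text, start, end, default=(0, 0)):
    """Return the candidate text[start:end] together with its configured context.

    `end` is exclusive. Candidates without an entry in SEMANTIC_WINDOW fall
    back to `default`, i.e. the area is the candidate itself.
    """
    before, after = SEMANTIC_WINDOW.get(text[start:end], default)
    return text[max(0, start - before):min(len(text), end + after)]


sample = "在她八十生辰时，周氏夫妇设宴。申时，宾客到达。"
i = sample.find("辰时")
j = sample.find("申时")
region_3 = semantic_region(sample, i, i + 2)   # "生辰时"
region_4 = semantic_region(sample, j, j + 2)   # "。申时"
```

Clamping with `max` and `min` keeps the slice valid when the candidate sits at the very beginning or end of the text.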
S400: if the semantic region does not contain a first preset character string corresponding to the candidate word, determining the candidate word as a time word, and outputting the time word.
In step S400, the first preset character string corresponding to a candidate word is a character string such that, when it appears in the same semantic area as the candidate word, the candidate word does not represent time. Different candidate words may correspond to different first preset character strings. When the first preset character string corresponding to a candidate word is empty, the candidate word has only the semantic meaning of representing time, and there is no ambiguity.
The first preset character string corresponding to each candidate word may be pre-stored in the corpus. The first predetermined character string in the corpus may be accumulated from past experience, or may be generated in other manners.
For example, in an implementation manner of generating a first preset character string corresponding to a certain candidate word, a predetermined number of candidate sentences including the candidate word may be selected first; then, selecting a selected sentence from the candidate sentences, wherein the candidate words in the selected sentence do not represent time; and finally, extracting a first preset character string from the selected sentence, wherein the first preset character string only appears in the selected sentence and does not appear in other candidate sentences except the selected sentence.
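One possible reading of this generation procedure is sketched below. It assumes that each candidate sentence has already been labelled by hand as representing time or not, and that only the context immediately around the candidate word is considered; the function name `derive_first_preset`, its parameters and the sample sentences are all hypothetical.

```python
def derive_first_preset(candidate, labelled_sentences, before=1, after=0):
    """Collect context strings seen only where `candidate` does not represent time.

    labelled_sentences: iterable of (sentence, represents_time) pairs, each
    sentence containing `candidate`. The labels are assumed to be given.
    """
    def contexts(sentence):
        found = set()
        start = sentence.find(candidate)
        while start != -1:
            end = start + len(candidate)
            left = sentence[max(0, start - before):start]
            right = sentence[end:end + after]
            found.update(s for s in (left, right) if s)
            start = sentence.find(candidate, start + 1)
        return found

    non_time, time = set(), set()
    for sentence, represents_time in labelled_sentences:
        (time if represents_time else non_time).update(contexts(sentence))
    # Keep only strings that never occur next to a genuine time usage.
    return non_time - time


presets = derive_first_preset(
    "辰时",
    [("八十生辰时设宴", False), ("辰时出发", True), ("到辰时方回", True)],
)
```

The set difference implements the condition that a first preset character string appears in the selected sentences and in no other candidate sentence.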
The first preset character string is compared with the text in the semantic area corresponding to the candidate word. If the semantic area does not contain the first preset character string corresponding to the candidate word, the candidate word is a time word, that is, the candidate word represents time in the text of the time word to be extracted. If the semantic area contains a first preset character string corresponding to the candidate word, the candidate word is considered not to represent time in the text of the time word to be extracted, and is therefore not a time word.
For example, following the example in step S300, the first preset character string corresponding to the candidate word "chen hour" (辰时) is preset to be either of "life" (生) and "birth"; the first preset character string corresponding to the candidate word "shen hour" (申时) is preset to be "quote" (引); and the first preset character string corresponding to candidate words in the format "X year X month X day" is preset to be empty.
Since the first preset character string corresponding to candidate words in the format "X year X month X day" is empty, it can be determined that candidate word 1, the date written in Chinese numerals, and candidate word 2, the date written in Arabic numerals, are time words.
The first preset character string corresponding to the candidate word "chen hour" is either of "life" and "birth". Comparison shows that semantic area 3 contains the first preset character string "life" corresponding to candidate word 3 "chen hour", so candidate word 3 "chen hour" is not a time word in text 2 of the time word to be extracted.
The first preset character string corresponding to the candidate word "shen hour" is "quote". Comparison shows that semantic area 4 does not contain the first preset character string "quote" corresponding to candidate word 4 "shen hour", so candidate word 4 "shen hour" is a time word in text 2 of the time word to be extracted.
Finally, the time words 一八九三年八月十五日 (August 15, 1893, in Chinese numerals), 1893年8月15日 (August 15, 1893, in Arabic numerals) and "shen hour" (申时) are output.
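The decision in S400 reduces to a substring test on the semantic area. The following is a minimal sketch; the table `FIRST_PRESET` (including the second blocker for "辰时") and the function `is_time_word` are assumptions chosen to mirror the example above.

```python
# Assumed first preset character strings per candidate word.
# A missing entry or an empty list means the candidate is unambiguous.
FIRST_PRESET = {"辰时": ["生", "诞"], "申时": ["引"]}


def is_time_word(candidate, region, first_preset=FIRST_PRESET):
    """Return True if no first preset string of `candidate` occurs in `region`."""
    blockers = first_preset.get(candidate, [])
    return not any(blocker in region for blocker in blockers)


decisions = [
    is_time_word("1893年8月15日", "1893年8月15日"),  # empty preset: always a time word
    is_time_word("辰时", "生辰时"),                   # blocked by "生"
    is_time_word("申时", "。申时"),                   # no blocker present
]
```

Keeping the blockers in a lookup table means that new ambiguous usages can be handled by adding entries rather than by rewriting extraction rules.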
According to the method for extracting the time words in the technical scheme, firstly, the text of the time words to be extracted is obtained, and all candidate words are extracted from the text. Each candidate word has at least one semantic meaning for representing time, that is, the candidate word may or may not be a time word representing time in the text. And then determining semantic areas corresponding to the candidate words in the text respectively, and judging whether the semantic areas contain first preset character strings corresponding to the candidate words or not, so as to determine whether the candidate words are time words or not in the text and eliminate ambiguity. And finally, outputting the time words to finish the process of extracting the time words from the text.
The method of the embodiment does not directly extract the accurate time words from the text at one time, but extracts the candidate words first, determines the semantic area of the candidate words, and then judges whether the candidate words are the time words in the text by using the semantic area and the first preset character string, thereby accurately extracting the time words from the text. Therefore, on one hand, the extraction rule can be simplified, the number of extracted candidate words can be increased, and the condition that a large number of time words are omitted due to the fact that the extraction rule is too complex is avoided; on the other hand, the candidate words are disambiguated, so that the time words in the text can be extracted more accurately, and the method is particularly suitable for Chinese texts with diversified time word expression forms. The extraction method of the time words is applied to extraction of the time words of the Chinese text, so that the extracted time words can be covered more comprehensively and have more diversified forms, and meanwhile, the missing quantity is also greatly reduced.
Optionally, referring to fig. 3, in step S400, the step of outputting the time word may include:
S411: judging whether the time word contains a number; if so, executing S412; if not, executing S415.
S412: judging whether the time word is of a preset exclusion type; if not, executing S413; if so, executing S415.
S413: converting the time word into a preset format.
S414: outputting the time word after the format conversion, and ending.
S415: outputting the time word, and ending.
In step S411, the number contained in the time word may be a number written in Chinese numerals, a number written in Arabic numerals, or the like, which is not limited in the present application. For example, the time word 一八九三年八月十五日 (August 15, 1893) contains numbers written in Chinese numerals; for another example, the time word "1893.08.15" contains Arabic numerals.
The steps of S411 to S414 are mainly to unify the format of the time words containing numbers. However, for time words such as "two or three years", "several decades", "one or two days", although including chinese numerals, the original semantic meaning of the time words is changed once they are converted into arabic numerals. For example, the Chinese number in "two or three years" is converted into Arabic number "23 years", and the semantics of the two are changed. The time word whose format is thus converted and whose semantics are changed is taken as a preset exclusion type, and such time word is excluded from the time words containing numerals by performing the step of S412.
In step S413, the preset format may be set by the user as needed. For example, all dates with a year, month and day may be uniformly expressed in the format "XXXX year XX month XX day". For another example, times given in hours and minutes may be uniformly expressed as "XX:XX", as shown in the examples in Table 1.
TABLE 1
Time word before conversion | Converted time word
Two zero one seven year, eight month, one number (Chinese numerals) | 2017 year 08 month 01 day
1998/7/20 | 1998 year 07 month 20 day
From 5/6/2007 to 9/10/2008 | 2007 year 06 month 05 day to 2008 year 10 month 09 day
3 point 20 | 3:20
A quarter past 3 | 3:15
From 3 to five | 3:00-5:00
In this step, the time words containing numbers in the ancient chinese text, such as "one more", "two drums", can also be converted to a unified preset format, as shown in the example in table 2.
TABLE 2
Through the steps, the extracted time words with the numbers can be converted into a unified preset format and then output, so that the format of the output time words is more standard, and the subsequent utilization is facilitated.
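The flow of S411 to S415 can be sketched as below. The regular expressions, the two supported input formats and the rule used for the exclusion type (two adjacent Chinese numerals, or the character "几", taken as an approximate quantity) are simplifying assumptions for illustration; `to_preset_format` and `output_time_word` are hypothetical names.

```python
import re

NUMERALS = "零〇一二两三四五六七八九十"
HAS_NUMBER = re.compile(f"[0-9{NUMERALS}]")
# Assumed exclusion type: "几" or exactly two adjacent Chinese digits,
# as in "两三年" (two or three years).
EXCLUDED = re.compile(
    f"几|(?<![{NUMERALS}])[一二两三四五六七八九]{{2}}(?![{NUMERALS}])"
)
SLASH_DATE = re.compile(r"(\d{4})/(\d{1,2})/(\d{1,2})")
CLOCK = re.compile(r"(\d{1,2})点(\d{1,2})")


def to_preset_format(word):
    """S413: convert the few formats this sketch knows; leave others unchanged."""
    m = SLASH_DATE.fullmatch(word)
    if m:
        year, month, day = m.groups()
        return f"{year}年{int(month):02d}月{int(day):02d}日"
    m = CLOCK.fullmatch(word)
    if m:
        return f"{int(m.group(1))}:{int(m.group(2)):02d}"
    return word


def output_time_word(word):
    if not HAS_NUMBER.search(word):   # S411: no number, go to S415
        return word
    if EXCLUDED.search(word):         # S412: exclusion type, go to S415
        return word
    return to_preset_format(word)     # S413 and S414
```

The exclusion check runs before any conversion, so an approximate expression such as "两三年" is never turned into "23年".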
In addition, optionally, when the time word does not contain a number, it may also be determined whether the time word is a preset ancient time word, such as "zi hour" (子时) or "shen hour" (申时). If the time word is a preset ancient time word, it is converted into the preset format corresponding to it; if it is not a preset ancient time word, it is output directly without format conversion. For example, traditional hour names such as "zi hour" and "chou hour" (丑时) may be converted into a unified preset format, as shown in the examples in Table 3.
TABLE 3
Time word before conversion | Converted time word
Zi hour (子时) | 0:00-2:00
Chen hour (辰时) | 8:00-10:00
Shen hour (申时) | 16:00-18:00
Xu hour (戌时) | 20:00-22:00
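The conversion in Table 3 can be expressed as a lookup over the twelve earthly branches. The sketch below assumes that the four rows of the table generalize to two-hour slots starting at 0:00 for "子时"; the function name `ancient_hour_range` is hypothetical.

```python
BRANCHES = "子丑寅卯辰巳午未申酉戌亥"


def ancient_hour_range(word):
    """Map a branch hour such as "辰时" to a clock range; return other words unchanged.

    Assumes the slot convention of Table 3: 子时 is 0:00-2:00 and each
    following branch starts two hours later.
    """
    if len(word) == 2 and word[1] == "时" and word[0] in BRANCHES:
        start = BRANCHES.index(word[0]) * 2
        return f"{start}:00-{start + 2}:00"
    return word
```

Words that are not preset ancient time words pass through unchanged, which matches the direct-output branch described above.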
Optionally, referring to fig. 4, in step S400, the step of outputting the time word may include:
S421: determining a start-stop position of each time word in the text;
S422: merging the time words with overlapping or adjacent start-stop positions;
S423: outputting the merged time words.
In step S421, the start-stop position of a time word in the text includes a start position and an end position. The start-stop position may be determined by character order. For example, in text 2 of the time word to be extracted in step S300, the start position of the time word 一八九三年八月十五日 (August 15, 1893, in Chinese numerals) is the 5th character and its end position is the 14th character; the start position of the time word "shen hour" (申时) is the 47th character and its end position is the 48th character. Besides character order, the position of a time word may be recorded in other ways, for example by X-axis and Y-axis coordinates.
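Recording positions by character order can be sketched as follows. The Chinese sentence `TEXT_2` is a hypothetical rendering of text 2, chosen so that the positions agree with those quoted above; the function name `start_stop` is likewise an assumption.

```python
# Hypothetical Chinese rendering of text 2 (an assumption of this sketch).
TEXT_2 = ("她出生于一八九三年八月十五日，即1893年8月15日，在她八十生辰时，"
          "周氏夫妇设宴为她祝寿。申时，宾客陆续到达周府。")


def start_stop(text, word, search_from=0):
    """Return the 1-based, inclusive (start, end) of `word` in `text`, or None."""
    index = text.find(word, search_from)
    if index == -1:
        return None
    return (index + 1, index + len(word))


date_position = start_stop(TEXT_2, "一八九三年八月十五日")
shen_position = start_stop(TEXT_2, "申时")
```

In practice the positions would normally be kept from the extraction step itself, since searching again by string can pick the wrong occurrence when a word appears more than once.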
[Text 2 of the time word to be extracted, annotated with the start-stop positions of its time words.]
In the process of extracting the candidate words or the original words, some candidate words or original words may match several extraction rules, so that some of the finally extracted time words may overlap or be adjacent to one another. In step S422, time words whose start-stop positions overlap or are adjacent fall into three cases.
In the first case, the previous time word and the next time word partially overlap. For example, from the text "8 o'clock in the morning of September 1, 2015" (2015年9月1日上午8点), time word 1 "the morning of September 1, 2015" (2015年9月1日上午) and time word 2 "8 o'clock in the morning" (上午8点) are extracted. The start-stop positions of time word 1 and time word 2 can be determined, and the two overlap on "morning" (上午). Such time words with overlapping start-stop positions may be merged into "8 o'clock in the morning of September 1, 2015".
In the second case, the previous time word contains the next time word, or the next time word contains the previous time word, so the two overlap. For example, from the text "8 o'clock in the morning of September 1, 2015", time word 1 "the morning of September 1, 2015" (2015年9月1日上午) and time word 2 "morning" (上午) are extracted. The start-stop positions of time word 1 and time word 2 can be determined, and time word 1 contains time word 2. Such time words with overlapping start-stop positions may be merged into "the morning of September 1, 2015".
In the third case, the previous time word is adjacent to the next time word. For example, from the text "8 o'clock in the morning of September 1, 2015", time word 1 "September 1, 2015" (2015年9月1日) and time word 2 "8 o'clock in the morning" (上午8点) are extracted. The start-stop positions of time word 1 and time word 2 can be determined, and the two are adjacent in the text. Such time words with adjacent start-stop positions may be merged into "8 o'clock in the morning of September 1, 2015".
Before determining whether two time words overlap or are adjacent in the text, all the time words extracted from the text may be sorted by their start-stop positions in the text. Then, to decide whether the current time word and the next time word overlap, are adjacent, or are separated, only the start and end positions of these two time words need to be compared; a given time word does not have to be compared position by position with every other time word. This considerably reduces the amount of computation in the merging step. More specifically, referring to fig. 5, step S422 of merging the time words with overlapping or adjacent start-stop positions may include:
S4221: judging whether the start-stop position of the current time word overlaps or is adjacent to the start-stop position of the next time word;
S4222: if they overlap or are adjacent, updating the current time word and the next time word into the union of the current time word and the next time word;
S4223: determining the start-stop position of the updated time word in the text;
S4224: repeating steps S4221 to S4223 until the start-stop position of the updated time word neither overlaps nor is adjacent to the start-stop position of the next time word after it, and then taking the updated time word as the merged time word.
If the determination in S4221 is that the start-stop position of the current time word neither overlaps nor is adjacent to the start-stop position of the next time word, the current time word is output.
By the above method, two or more adjacent or overlapping time words can be merged into a single time word for output.
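The sort-then-compare loop of S4221 to S4224 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: positions are taken to be inclusive character indices, only the positions are merged (the text of the union is not rebuilt), and the function name `merge_spans` is hypothetical.

```python
def merge_spans(spans):
    """Merge (start, end) spans that overlap or are adjacent.

    Assumption: start and end are inclusive character positions,
    so two spans are adjacent when next.start == current.end + 1.
    """
    merged = []
    for start, end in sorted(spans):  # sort by start, then end
        if merged and start <= merged[-1][1] + 1:
            # S4221/S4222: overlapping or adjacent, so take the union.
            current_start, current_end = merged[-1]
            # S4223: the union's positions become the new current span.
            merged[-1] = (current_start, max(current_end, end))
        else:
            # S4224: the previous span is final; start a new current span.
            merged.append((start, end))
    return merged


print(merge_spans([(7, 9), (1, 4), (5, 6), (31, 36)]))
```

Because the spans are sorted first, each span is compared only with the current merged span, which is the reduction in computation described above.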
The time word extraction method of the present application is described below by another specific example.
Text 3, from which time words are to be extracted:
6.27, in the evening, at 5 o'clock and half, the user ran 10 km in the city and returned home to building No. 9, taking 44 minutes 33 seconds in total. The user has kept up running for two or three years, and the rewards have been plentiful. Before going to sleep, the user read "Jiu Ming Qin Cui" (disorder of nine Ming Qi) by Wutrimang of the Qing Dynasty, in which it is written: "I has counted here, brothers, is rarely at home, cannot catch a drum if they are lost, all alone!"
After text 3 is obtained, 8 candidate words are extracted from it:
Candidate word 1: 6.27
Candidate word 2: evening
Candidate word 3: 5 o'clock and half
Candidate word 4: No. 9
Candidate word 5: 44 minutes 33 seconds
Candidate word 6: two or three years
Candidate word 7: Qing Dynasty
Candidate word 8: one drum.
For a candidate word in the format "X.X" (where X is a number and may represent one or more digits; the same below), the candidate word together with the 1 character following it in the text is preset to constitute the semantic area corresponding to "X.X"; the first preset character string corresponding to the candidate word "X.X" is any one of "gram", "jin", "meter", and "yuan".
For a candidate word in the format "No. X", the candidate word together with the 1 character following it in the text is preset to constitute the semantic area corresponding to "No. X"; the first preset character string corresponding to the candidate word "No. X" is any one of "building", "house", and "shop".
For the candidate word "one drum", the candidate word together with the 2 characters following it in the text is preset to constitute the semantic area corresponding to "one drum"; the first preset character string corresponding to the candidate word "one drum" is any one of the strings listed for it in Table 4, such as "one plate".
For a candidate word in the format "X o'clock and half", the candidate word itself is preset to constitute its semantic area. Likewise, the candidate word "evening", a candidate word in the format "X minutes X seconds", a candidate word in the format "X years", and the candidate word "Qing Dynasty" each constitute their own semantic area. The first preset character strings corresponding to "X o'clock and half", "evening", "X minutes X seconds", "X years", and "Qing Dynasty" are all empty.
The semantic area corresponding to each candidate word in text 3 is then determined, as shown in Table 4.
TABLE 4
| No. | Candidate word | Semantic region | First preset character string |
|---|---|---|---|
| 1 | 6.27 | 6.27 plus the 1 following character | gram, jin, meter, yuan |
| 2 | evening | evening | (empty) |
| 3 | 5 o'clock and half | 5 o'clock and half | (empty) |
| 4 | No. 9 | No. 9 building | building, house, shop |
| 5 | 44 minutes 33 seconds | 44 minutes 33 seconds | (empty) |
| 6 | two or three years | two or three years | (empty) |
| 7 | Qing Dynasty | Qing Dynasty | (empty) |
| 8 | one drum | one drum and catch | one plate and one catch |
It is then judged whether the semantic region corresponding to each candidate word contains the corresponding first preset character string. If it does not, the candidate word is a time word in text 3; if it does, the candidate word is not a time word in text 3. In this way, 6 of the 8 candidate words are determined to be time words, as follows:
Time word 1: 6.27
Time word 2: evening
Time word 3: 5 o'clock and half
Time word 4: 44 minutes 33 seconds
Time word 5: two or three years
Time word 6: Qing Dynasty.
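The filtering step can be sketched as follows. This is a toy illustration rather than the claimed implementation: the rule table `RULES`, the function name `extract_time_words`, the single-letter unit strings "g" and "m", and the English sample sentence are all assumptions chosen so that the example runs on ASCII text.

```python
import re

# Hypothetical rule table. Each rule is:
# (candidate pattern, characters taken before, characters taken after,
#  first preset character strings that rule the candidate out)
RULES = [
    (re.compile(r"\d+\.\d+"), 0, 1, ("g", "m")),
    (re.compile(r"evening"), 0, 0, ()),
]


def extract_time_words(text, rules=RULES):
    """Return (word, start, end) for candidates that pass the check.

    start and end follow Python slicing (0-based, end exclusive).
    """
    results = []
    for pattern, before, after, preset_strings in rules:
        for match in pattern.finditer(text):
            start, end = match.span()
            # Semantic region: the candidate plus its neighbouring characters.
            region = text[max(0, start - before):end + after]
            # A candidate is a time word only if no preset string occurs
            # in its semantic region.
            if not any(s in region for s in preset_strings):
                results.append((match.group(), start, end))
    return sorted(results, key=lambda item: item[1])


print(extract_time_words("6.27 evening, 3.50m rope"))
```

In this toy input, "3.50" is rejected because its semantic region "3.50m" contains the preset string "m", in the same way that "No. 9" is rejected in Table 4 because its region contains "building".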
The following conversions are preset: a time word in the format "X o'clock and half" is converted into the preset format "X:30"; a time word in the format "X.X" is converted into the preset format "X month X day"; a time word in the format "X minutes X seconds" keeps its original format without conversion. The preset exclusion types are "XY year", "XY month", and "XY day", where X and Y are Chinese numerals and X is 1 less than Y.
It is then judged whether each of the 6 time words contains a number. 4 of them do: time word 1 "6.27", time word 3 "5 o'clock and half", time word 4 "44 minutes 33 seconds", and time word 5 "two or three years".
It is then judged whether the 4 time words containing numbers are of a preset exclusion type. Time word 1 "6.27", time word 3 "5 o'clock and half", and time word 4 "44 minutes 33 seconds" are not of a preset exclusion type, whereas time word 5 "two or three years" is.
The 3 number-containing time words that are not of a preset exclusion type are converted into the preset format, as shown in Table 5.
TABLE 5
| Time word | Time word before conversion | Converted time word |
|---|---|---|
| Time word 1 | 6.27 | 6 month 27 day |
| Time word 3 | 5 o'clock and half | 5:30 |
| Time word 4 | 44 minutes 33 seconds | 44 minutes 33 seconds |
Time words that contain no number, or that are of a preset exclusion type, remain unchanged. At this point, the time words extracted from text 3 are:
Time word 1: 6 month 27 day
Time word 2: evening
Time word 3: 5:30
Time word 4: 44 minutes 33 seconds
Time word 5: two or three years
Time word 6: Qing Dynasty.
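The format conversion with its exclusion check can be sketched as follows. This is an illustrative sketch, not the claimed implementation: the patterns are written against the English renderings used in this example rather than the original Chinese strings, and the names `NUMERALS`, `is_excluded`, `contains_number`, and `convert` are hypothetical.

```python
import re

# Assumption: English number words stand in for the Chinese numerals.
NUMERALS = ["one", "two", "three", "four", "five",
            "six", "seven", "eight", "nine", "ten"]

HALF_PAST = re.compile(r"^(\d+) o' ?clock and half$")
DOTTED = re.compile(r"^(\d+)\.(\d+)$")
APPROXIMATE = re.compile(r"^(\w+) or (\w+) (year|month|day)s?$")


def is_excluded(word):
    """True for the 'XY year/month/day' type, where X is 1 less than Y."""
    match = APPROXIMATE.match(word)
    if not match:
        return False
    first, second = match.group(1), match.group(2)
    return (first in NUMERALS and second in NUMERALS
            and NUMERALS.index(second) - NUMERALS.index(first) == 1)


def contains_number(word):
    return bool(re.search(r"\d", word)) or any(
        token in NUMERALS for token in word.split())


def convert(word):
    """Convert a time word to the preset format where a rule applies."""
    if not contains_number(word) or is_excluded(word):
        return word  # left unchanged
    match = HALF_PAST.match(word)
    if match:
        return f"{match.group(1)}:30"
    match = DOTTED.match(word)
    if match:
        return f"{match.group(1)} month {match.group(2)} day"
    return word  # e.g. "X minutes X seconds" keeps its original format


for w in ["6.27", "evening", "5 o'clock and half",
          "44 minutes 33 seconds", "two or three years", "Qing Dynasty"]:
    print(w, "->", convert(w))
```

Checking the exclusion type before converting keeps approximate expressions such as "two or three years" from being rewritten as if they were exact values.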
The 6 time words may be output, or the 6 time words may be output after the merging step.
The respective start and stop positions of the 6 time words in the text 3 are determined, and if the format conversion step has already been performed, the position of the time word before the format conversion in the text 3 is still taken as the start and stop position of the time word in the text 3. The results are shown in Table 6.
TABLE 6
| No. | Time word | Starting position | End position |
|---|---|---|---|
| 1 | 6.27 | 1 | 4 |
| 2 | evening | 5 | 6 |
| 3 | 5 o'clock and half | 7 | 9 |
| 4 | 44 minutes 33 seconds | 31 | 36 |
| 5 | two or three years | 45 | 47 |
| 6 | Qing Dynasty | 50 | 51 |
It is judged whether the current time word "6.27" overlaps or is adjacent to the next time word "evening". The two are adjacent, so the union "6.27 evening" replaces "6.27" and "evening"; that is, "6.27 evening" becomes the new current time word. The start-stop position of the updated time word "6.27 evening" in text 3 is then determined, as shown in Table 7.
TABLE 7
| No. | Time word | Starting position | End position |
|---|---|---|---|
| 1 | 6.27 evening | 1 | 6 |
| 2 | 5 o'clock and half | 7 | 9 |
| 3 | 44 minutes 33 seconds | 31 | 36 |
| 4 | two or three years | 45 | 47 |
| 5 | Qing Dynasty | 50 | 51 |
The loop then judges whether the current time word "6.27 evening" overlaps or is adjacent to the next time word "5 o'clock and half". The two are adjacent, so the union "6.27 evening 5 o'clock and half" replaces "6.27 evening" and "5 o'clock and half"; that is, "6.27 evening 5 o'clock and half" becomes the new current time word. The start-stop position of the updated time word "6.27 evening 5 o'clock and half" in text 3 is determined again, as shown in Table 8.
TABLE 8
| No. | Time word | Starting position | End position |
|---|---|---|---|
| 1 | 6.27 evening 5 o'clock and half | 1 | 9 |
| 2 | 44 minutes 33 seconds | 31 | 36 |
| 3 | two or three years | 45 | 47 |
| 4 | Qing Dynasty | 50 | 51 |
The loop then judges whether the current time word "6.27 evening 5 o'clock and half" overlaps or is adjacent to the next time word "44 minutes 33 seconds". The two neither overlap nor are adjacent, so the current time word "6.27 evening 5 o'clock and half" is output. Any part that has already been converted into the preset format is still output in the preset format; that is, "6 month 27 day evening 5:30" is output.
Next, "44 minutes 33 seconds" is taken as the current time word, and it is judged whether it overlaps or is adjacent to the next time word "two or three years". The two neither overlap nor are adjacent, so "44 minutes 33 seconds" is output directly.
Similarly, "two or three years" is taken as the current time word, and it is judged whether it overlaps or is adjacent to the next time word "Qing Dynasty". The two neither overlap nor are adjacent, so "two or three years" is output directly.
"Qing Dynasty" is the last time word in text 3. Since there is no next time word, "Qing Dynasty" is output directly.
Therefore, the time words output for text 3 are: "6 month 27 day evening 5:30", "44 minutes 33 seconds", "two or three years", and "Qing Dynasty".
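The walk-through above merges by the positions of the unconverted words while outputting the converted text. A minimal sketch of that bookkeeping is given below, using the positions from Table 6 and the converted forms from Table 5. It is an assumption-laden illustration, not the claimed implementation: it handles only the adjacent case that occurs in text 3 (joining the texts of overlapping words would need extra handling), and the names `TimeWord` and `merge_adjacent` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TimeWord:
    text: str   # text to output (already converted where applicable)
    start: int  # start position of the unconverted word in the text
    end: int    # end position (inclusive) of the unconverted word


def merge_adjacent(words):
    """Join time words whose original positions are directly adjacent."""
    merged = []
    for word in sorted(words, key=lambda w: (w.start, w.end)):
        if merged and word.start == merged[-1].end + 1:
            last = merged[-1]
            # Keep the converted texts, but extend the original positions.
            merged[-1] = TimeWord(f"{last.text} {word.text}", last.start, word.end)
        else:
            merged.append(word)
    return merged


words = [
    TimeWord("6 month 27 day", 1, 4),
    TimeWord("evening", 5, 6),
    TimeWord("5:30", 7, 9),
    TimeWord("44 minutes 33 seconds", 31, 36),
    TimeWord("two or three years", 45, 47),
    TimeWord("Qing Dynasty", 50, 51),
]
for word in merge_adjacent(words):
    print(word.text, word.start, word.end)
```

Storing the original positions separately from the output text is what allows format conversion to run before merging without disturbing the adjacency test.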
Referring to Fig. 6, a second embodiment of the present application provides a time word extraction apparatus, including:
an acquisition unit 1, configured to acquire a text from which time words are to be extracted;
a processing unit 2, configured to extract all candidate words in the text, determine the semantic area corresponding to each candidate word in the text, and determine that a candidate word is a time word when its semantic area does not contain the first preset character string corresponding to that candidate word, wherein each candidate word has at least one semantic meaning representing time, and the semantic area comprises the candidate word and a preset number of characters before and after it; and
an output unit 3, configured to output the time words.
Optionally, the processing unit 2 is further configured to extract original words from the text, determine matching regions corresponding to the original words in the text, and generate candidate words; the matching area comprises an original word and a predetermined number of characters before and after the original word, and the candidate word is a word which contains the original word in the matching area and at least has one semantic meaning for representing time.
Optionally, the processing unit 2 is further configured to determine, when a time word contains a number, whether the time word is of a preset exclusion type, and if not, to convert the time word into a preset format; the output unit 3 is further configured to output the time word after format conversion.
Optionally, the processing unit 2 is further configured to determine the start-stop position of each time word in the text, and to merge time words with overlapping or adjacent start-stop positions; the output unit 3 is further configured to output the merged time words.
Optionally, the processing unit 2 is further configured to determine whether a start-stop position of the current time word overlaps or is adjacent to a start-stop position of the next time word, update the current time word and the next time word to a union of the current time word and the next time word if the start-stop positions of the current time word and the next time word overlap or are adjacent, determine a start-stop position of the updated time word in the text, and take the updated time word as the merged time word if the start-stop position of the updated time word does not overlap and is not adjacent to the start-stop position of the next time word.
The same and similar parts in the various embodiments in this specification may be referred to each other. The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.