CN106933783A - Method and device for intelligently extracting dates from text
- Publication number
- CN106933783A (application CN201511033057.2A)
- Authority
- CN
- China
- Prior art keywords
- date
- text
- month
- year
- day
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/177—Editing, e.g. inserting or deleting of tables; using ruled lines
- G06F40/18—Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Document Processing Apparatus (AREA)
Abstract
The present invention relates to a method for intelligently extracting dates from text, comprising the following steps. Step 1: obtain the text from which dates are to be extracted, together with the document filling date. Step 2: preprocess the text, converting the dates that appear in it to a canonical form. Step 3: construct regular expressions, match them against the date expressions in the text, and extract the date expressions that conform to the regular expression formats. Step 4: for each date expression format, extract the year, month, and day numbers it contains. Step 5: use the document filling date to fill in any year or month data missing from the text. Step 6: combine the recognized year, month, and day numbers into a complete date and store it in date format. The present invention also provides a device that implements the above method.
Description
Technical field
The present invention relates to a date extraction method, in particular a date extraction method for electronic documents. The invention further relates to a device that implements the extraction method.
Background
In the course of enterprise informatization, electronic documents are gradually replacing paper documents as an important medium for transmitting and auditing business data. Entering and reviewing electronic documents, and analyzing data based on them, are important parts of financial management software. While using such software, users accumulate a large amount of electronic document data available for analysis.
When entering an electronic document in financial management software, users often have to fill in the date on which the business occurred. In some financial management software this date is recorded as free text rather than in a standard date format. As a result, when users analyze the data it is difficult to extract the date from the text, and the business occurrence date, a key piece of information, cannot be used.
Electronic document entry also has its own peculiarities. Date expressions take many forms, and dates are often incomplete: they may contain only one or two of the three components (year, month, day), or they may be written with Chinese numerals (for example, the month of September written out in Chinese numerals). This poses new challenges for date recognition and extraction.
Existing methods for extracting data from text, however, mainly target other fields, such as search engines or the recognition of other kinds of data. Existing date extraction methods therefore mainly handle complete, well-formatted date expressions. They lack the ability to recognize incomplete dates that are missing one or two of the three components (such as "August 1" with no year), dates written in Chinese numerals, relative dates (such as "last month"), and time periods (such as "August to October 2014"). Applying existing extraction techniques directly to such text yields only the literal wording; the actual date it refers to cannot be recognized.
Summary of the invention
The object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a method and device for intelligently extracting dates from text.
The invention is realized by the following technical scheme. A method for intelligently extracting dates from text comprises the following steps:
Step 1: obtain the text from which dates are to be extracted, together with the document filling date;
Step 2: preprocess the text, converting the dates that appear in it to a canonical form;
Step 3: construct regular expressions, match them against the date expressions in the text, and extract the date expressions that conform to the regular expression formats;
Step 4: for each date expression format, extract the year, month, and day numbers it contains;
Step 5: use the document filling date to fill in any year or month data missing from the text;
Step 6: combine the recognized year, month, and day numbers into a complete date and store it in date format.
As a further improvement of the present invention, step 2 specifically includes the following steps:
Step 21: use regular expressions to match the variant month marker and the variant day marker ("number"), and convert them to the canonical "month" and "day" markers respectively;
Step 22: for date expressions in the text that are related to the document filling date, calculate the corresponding date;
Step 23: convert numbers written in Chinese numerals to Arabic numerals.
As a further improvement of the present invention, the expressions related to the document filling date in step 22 include: "this year", "current year", "last year", "the previous year", "next year", "the following year", "this month", "next month", and "the current month". After one of these date expressions is matched, the corresponding date is calculated from the document filling date.
As a further improvement of the present invention, in step 3, once a regular expression has matched a date in the text, the matched date expression field is extracted and then deleted from the text.
As a further improvement of the present invention, in step 5, when missing year and month data are filled in, the basis for completion includes the year and month of the document filling date, and/or a year or month related to the document filling date.
The present invention also provides a device for intelligently extracting dates from text, comprising:
an acquisition module, for obtaining the text from which dates are to be extracted and the document filling date;
a preprocessing module, for converting the dates that appear in the text to a canonical form;
a matching module, for constructing regular expressions, matching them against the date expressions in the text, and extracting the date expressions that conform to the regular expression formats;
an extraction module, for extracting the year, month, and day numbers from date expressions of each format;
a date completion module, for using the document filling date to fill in any year or month data missing from the text;
a storage module, for combining the recognized year, month, and day numbers into a complete date and storing it in date format.
As a further improvement of the present invention, the preprocessing module includes:
a first preprocessing submodule, for using regular expressions to match the variant month marker and the variant day marker ("number") and converting them to the canonical "month" and "day" markers respectively;
a second preprocessing submodule, for calculating the corresponding date for date expressions in the text that are related to the document filling date;
a third preprocessing submodule, for converting numbers written in Chinese numerals to Arabic numerals.
As a further improvement of the present invention, the expressions related to the document filling date in the second preprocessing submodule include: "this year", "current year", "last year", "the previous year", "next year", "the following year", "this month", "next month", and "the current month". After one of these date expressions is matched, the corresponding date is calculated from the document filling date.
As a further improvement of the present invention, once the matching module has matched a date in the text with a regular expression, it extracts the matched date expression field and then deletes it from the text.
As a further improvement of the present invention, when the date completion module fills in missing year and month data, the basis for completion includes the year and month of the document filling date, and/or a year or month related to the document filling date.
Compared with the prior art, the present invention has the following advantages:
1. The invention intelligently extracts the business occurrence date that is stored as text in the payment description and converts it into an analyzable date format. On this basis, reimbursement documents can be classified and summarized by month and further financial expenditure analysis can be performed, enabling effective control of reimbursement costs, reducing corporate spending, and improving enterprise profit.
2. The invention recognizes a variety of common date expressions, specifically: (1) Chinese numerals; (2) relative dates; (3) forms such as "XX month", "XX months", "XX days", and "No. XX"; (4) date ranges; (5) purely numeric dates without date keywords, such as "20120809", "2012-08-09", "2012.08-2012.09", and "2012 08 09-2012 08 21".
3. With the invention, users filling in electronic documents can write dates as text in whatever form they are used to, without being bound to a fixed date format or spending effort clicking through a date picker. This saves users time and improves the user experience.
For better understanding and implementation, the invention is described in detail below with reference to the accompanying drawings.
Brief description of the drawings
Fig. 1 is a flowchart of the method for intelligently extracting dates from text according to the invention.
Fig. 2 is a flowchart of matching text with regular expressions in this embodiment.
Fig. 3 is a block diagram of the device for intelligently extracting dates from text according to the invention.
Detailed description
To solve the problem that dates cannot be extracted from electronic documents in the prior art, the present invention provides a method and device for intelligently extracting dates from text. A preferred embodiment is described in detail below.
The method for intelligently extracting dates from text comprises the following steps:
Step S1: obtain the text from which dates are to be extracted, together with the document filling date.
Step S2: preprocess the text, converting the dates that appear in it to a canonical form.
In this embodiment, step S2 specifically includes the following sub-steps:
Step 21: use regular expressions to match the variant month marker and the variant day marker ("number"), and convert them to the canonical "month" and "day" markers respectively.
Step 22: for date expressions in the text that are related to the document filling date, calculate the corresponding date.
Specifically, the expressions related to the document filling date in this step include: "this year", "current year", "last year", "the previous year", "next year", "the following year", "this month", "next month", and "the current month". After one of these date expressions is matched, the corresponding date is calculated from the document filling date. For example, if the document filling date falls in August 2012, then "this year", "last year", "this month", and "next month" are converted to "2012", "2011", "August", and "September" respectively.
Step 23: convert numbers written in Chinese numerals to Arabic numerals. For example, the numerals from "one" to "thirty-one" are converted to "1" through "31", and "zero" is converted to "0".
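The three preprocessing sub-steps can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `preprocess`, the Chinese surface forms in the lookup tables (the translated text gives only English glosses), the use of plain string replacement in place of regular expressions for Step 21, and the digit-by-digit handling of numerals such as a spelled-out year are all assumptions.

```python
import re
from datetime import date

# Assumed Chinese surface forms; the translated text only gives English glosses.
MARKER_VARIANTS = {"月份": "月", "号": "日"}
DIGITS = {ch: i for i, ch in enumerate("零一二三四五六七八九")}
# Either a tens form (ten, twelve, twenty-one, ...) or a plain run of digits.
NUMERAL = re.compile(r"[一二三]?十[一二三四五六七八九]?|[零一二三四五六七八九]+")


def _numeral_to_arabic(match):
    s = match.group(0)
    if "十" in s:
        tens, _, ones = s.partition("十")
        value = (DIGITS[tens] if tens else 1) * 10 + (DIGITS[ones] if ones else 0)
        return str(value)
    # Assumption: a run without "ten" is read digit by digit.
    return "".join(str(DIGITS[ch]) for ch in s)


def preprocess(text, fill_date):
    # Step 21: normalize variant month/day markers (simplified to plain replacement).
    for variant, canonical in MARKER_VARIANTS.items():
        text = text.replace(variant, canonical)
    # Step 22: resolve expressions relative to the document filling date.
    relative = {
        "今年": f"{fill_date.year}年", "本年": f"{fill_date.year}年",
        "去年": f"{fill_date.year - 1}年", "上一年": f"{fill_date.year - 1}年",
        "明年": f"{fill_date.year + 1}年", "下一年": f"{fill_date.year + 1}年",
        "本月": f"{fill_date.month}月", "这个月": f"{fill_date.month}月",
        # Simplification: December wraps to month 1 without touching the year.
        "下个月": f"{fill_date.month % 12 + 1}月",
    }
    for phrase, resolved in relative.items():
        text = text.replace(phrase, resolved)
    # Step 23: convert Chinese numerals to Arabic numerals.
    return NUMERAL.sub(_numeral_to_arabic, text)
```

Relative expressions are resolved before numeral conversion, following the order of the sub-steps, so that a numeral inside a relative phrase is not rewritten first.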
Step S3: construct regular expressions, match them against the date expressions in the text, and extract the date expressions that conform to the regular expression formats.
In this step, once a regular expression has matched a date in the text, the matched date expression field is extracted and then deleted from the text to prevent it from being matched again.
Refer to Fig. 2, a flowchart of matching the text with regular expressions in this embodiment. Suppose the user has constructed N regular expressions as needed. The preprocessed text is matched against the regular expressions for the various date formats in turn. If the n-th regular expression matches, the handling flow for that expression is executed, and when it completes the text is matched against the (n+1)-th regular expression. If the n-th regular expression does not match, matching continues directly with the (n+1)-th regular expression.
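The matching loop described above can be sketched as follows. The three patterns are simplified stand-ins rather than the embodiment's 17 expressions, the Chinese markers are assumed, and the function name `match_all` is hypothetical. The sketch replaces each matched field with a space instead of deleting it outright, so that neighbouring characters do not run together.

```python
import re

# Simplified stand-in patterns, ordered from most to least specific.
PATTERNS = [
    ("ymd", re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")),
    ("ym", re.compile(r"(\d{4})年(\d{1,2})月")),
    ("md", re.compile(r"(\d{1,2})月(\d{1,2})日")),
]


def match_all(text):
    """Try each pattern in turn; remove matched fields before the next pattern."""
    found = []
    for name, pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append((name, m.group(0)))
        # Blank out what was matched so later patterns cannot match it again.
        text = pattern.sub(" ", text)
    return found, text
```

Without the removal step, the month-day pattern would match the tail of a full date that the first pattern had already consumed.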
Step S4: for each date expression format, extract the year, month, and day numbers it contains. Because the format of each regular expression used to match the text is fixed, the position of each number is known. In the year-month format "XXXX year XX month", for example, the first run of digits is the year and the second is the month, so matching the digits in the date expression in order identifies them as the year and the month.
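A minimal sketch of this positional extraction, assuming the caller knows which fields the matched format contains (the function name `extract_fields` and the field labels are hypothetical):

```python
import re


def extract_fields(expression, fields):
    """Assign the digit runs in a matched expression to field names, in order."""
    numbers = [int(n) for n in re.findall(r"\d+", expression)]
    return dict(zip(fields, numbers))
```

For a year-month expression the caller would pass `("year", "month")`; for a full date, `("year", "month", "day")`.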
Step S5: use the document filling date to fill in any year or month data missing from the text.
Specifically, when missing year and month data are filled in, the basis for completion includes the year and month of the document filling date, and/or a year or month related to the document filling date.
Each matched date is checked for completeness. If it is complete, a date in canonical format is generated directly; if it is incomplete, the date is first completed and then generated in canonical format. The definition of a complete date can differ between application scenarios. When financial reimbursement documents are analyzed, the year and month are usually required but a specific day is not, so the present invention treats a date as complete if it contains a year and a month. Users of the invention can decide which canonical format to use. This example uses the form "XXXX year XX month XX day", for example 21 March 2012 written with a two-digit month and day. If the year or month is missing from a matched date field, it is supplied from the document filling date. For example, if the document filling date is 10 October 2012 and the matched date expression is "September 8", the completed result is 8 September 2012.
Depending on the document, the year and month used for completion may be those of the document filling date, or the previous year, next year, previous month, or next month relative to it. The choice depends on the document type and on the relationship between the document filling date and the date being completed. The rule for the year is: if the month value is less than or equal to the month of the document filling date plus 1, the year of the document filling date is taken as the year in which the business occurred; otherwise, the year of the document filling date minus 1 is taken. For example, if the document filling date is 10 October 2012 and the matched date expression is "September 8", the completed result is 8 September 2012. If the document filling date is 10 January 2012 and the matched date expression is "November 8", the completed result is 8 November 2011.
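The year-completion rule can be written out as a short sketch. The function name `complete_date` is hypothetical, and two details are assumptions not stated in the text: a missing month defaults to the month of the document filling date, and a missing day defaults to 1 so that the result can be held in a `date` object.

```python
from datetime import date


def complete_date(fill_date, year=None, month=None, day=None):
    """Fill in a missing year or month from the document filling date."""
    if month is None:
        month = fill_date.month  # assumption: default to the filling month
    if year is None:
        # Rule from the text: month <= filling month + 1 keeps the filling year,
        # otherwise the date is taken to belong to the previous year.
        year = fill_date.year if month <= fill_date.month + 1 else fill_date.year - 1
    if day is None:
        day = 1  # assumption: a day is needed to build a date object
    return date(year, month, day)
```

With a filling date of 10 January 2012, a matched "November 8" has month 11, which exceeds 1 + 1, so the previous year is used, in line with the second example above.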
Step S6: combine the recognized year, month, and day numbers into a complete date and store it in date format.
In this step, to meet the needs of data analysis, duplicate dates are merged, and all resulting dates are stored in a single canonical format.
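A minimal sketch of the merge-and-store step, assuming the canonical format is year, month, and day with Chinese markers and two-digit month and day (the function name `merge_and_format` and the exact markers are assumptions):

```python
from datetime import date


def merge_and_format(dates):
    """Drop duplicate dates, keeping first-seen order, and render them uniformly."""
    unique = list(dict.fromkeys(dates))
    return [f"{d.year:04d}年{d.month:02d}月{d.day:02d}日" for d in unique]
```

Keeping first-seen order is a design choice of this sketch; the text only requires that duplicates be merged and that one format be used throughout.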
The regular expressions used for date matching are briefly introduced below. This embodiment uses 17 regular expressions, as follows:
(1) pattern_date (0)=" ((19)9[5-9]|(20)[0-4] [0-9]) [year -- /] (1 [012] | 0[1-9]) [moon --
\./\\](30|31|([12][0-9])|(0[1-9])) day([extremely arrive~~] | -+| -+) ((19)9[5-9]|(20)[0-4] [0-9]) [year --
\./\\](1[012]|0[1-9]) [moon -- /] (30 | 31 | ([12] [0-9]) | (0[1-9])) (day | (![d -- /])) "
Description: year, month, and day, each followed by its marker or by one of the separators - . / \ (the day marker is optional), then a range connector (one or more dashes, a word meaning "to", or a tilde), then a second full year-month-day date. The year may be 1995 to 2049, or the two-digit forms 95 to 99 and 00 to 49. The month may be 1 to 12 or 01 to 09. The day may be 01 to 09, 1 to 9, or 10 to 31. Example: a range from 11 January 1995 to a date in January 2010.
(2) pattern_date (1)=" ((19)9[5-9]|(20)[0-4] [0-9]) [year -- /] (1 [012] | 0[1-9]) [moon --
\./\\](30|31|([12][0-9])|(0[1-9])) day([extremely arrive~~] | -+| -+) (1 [012] | 0[1-9]) [moon --
\./\\](30|31|([12][0-9])|(0[1-9])) (day | (![d -- /])) "
Description: a full year-month-day date, a range connector, then a month-day date with no year; the day marker is optional. The year may be 1995 to 2049, or 95 to 99 and 00 to 49. The month may be 1 to 12 or 01 to 09. The day may be 01 to 09, 1 to 9, or 10 to 31. Example: from 1 January 1995 to 1 October.
(3) pattern_date (2)=" ((19)9[5-9]|(20)[0-4] [0-9]) [year -- /] (1 [012] | 0[1-9]) moon([to arriving
~~] | -+| -+) ((19)9[5-9]|(20)[0-4] [0-9]) [year -- /] (1 [012] | 0[1-9]) (moon | (![d -- /])) "
Description: year-month, a range connector, then year-month, written either with markers or with separators, i.e. xxxx(-./\)xx (connector) xxxx(-./\)xx. Note: this pattern can also match an expression such as "11-08~12-09". Because the current year is already 2015, such an expression more likely means 8 November to 9 December than August 2011 to September 2012, so the program makes a special judgment for this case.
(4) pattern_date (3)=" ((19)9[5-9]|(20)[0-4] [0-9]) [year /] (1 [012] | 0[1-9]) moon([to arriving
~~] | -+| -+) (1 [012] | 0[1-9]) (moon | (![d -- /])) | ((19)9[5-9]|(20)[0-4] [0-9]) [--] (1 [012] | 0[1-9])
Month[extremely arrive~~] (1 [012] | 0[1-9]) (moon | (![d -- /])) "
Description: year-month, a range connector, then a month with no year, i.e. xxxx(-/)xx (connector) xx. The pattern is written as two alternatives joined by "|", mainly to prevent expressions of the form xx-xx-xx from being matched. The zero-width assertion prevents an expression such as 12-07~12-21 from being matched as 12-07~12.
(5) pattern_date (4)=" ((19)9[5-9]|(20)[0-4] [0-9]) [year -- /] (1 [012] | 0[1-9]) [moon --
\./\\](30|31|([12][0-9])|(0[1-9])) day([extremely arrive~~] | -+| -+) (30 | 31 | ([12] [0-9]) | (0[1-9])) (day | (![moon d --
\./\\]))"
Description: a full year-month-day date, a range connector, then a day only, i.e. xxxx(-/)xx(-/)xx (connector) xx. Example: 1 to 10 January 1999.
(6) pattern_date (5)=" ((19)9[5-9]|(20)[0-4] [0-9]) [year -- /] (1 [012] | 0[1-9]) [moon --
\./\\](30|31|([12][0-9])|(0[1-9])) (day | (![moon d -- /])) "
Description: a single full date, written with markers or as xxxx(-./\)xx(-./\)xx. Example: 1 January 2000; expressions such as 2000-1-1 are also matched.
(7) pattern_date (6)=" ((19)9[5-9]|(20)[0-4] [0-9]) year ((1 [012] | 0[1-9]) (moon
[,, ]| [,, ])) * ((1 [012] | 0[1-9]) moon[and with and])(1[012]|0[1-9]) moon "
Description: a year followed by a list of months. When the month marker is present, no separator is required between two months, so that a year followed directly by "month 1 month 2" is matched.
(8) pattern_date (7)=" (199 [5-9] | 20 [0-4] [0-9]) [-- /] (1 [012] | 0[1-9])(![d -- /]) "
Description: xxxx(-./\)xx. Example: 2000-12, meaning December 2000.
(9) pattern_date (8)=" (199 [5-9] | 20 [0-4] [0-9]) (1 [012] | 0 [1-9]) ([extremely arrive~~] | -+|-
+)(199[5-9]|20[0-4][0-9])(1[012]|0[1-9])(!\d)"
Description: YYYYMM, a range connector, then YYYYMM.
(10) pattern_date (9)=
"(199[5-9]|20[0-4][0-9])(1[012]|0[1-9])(30|31|([12][0-9])|(0[1-9]))(?!\d)|(199[5-9]|20[0-4][0-9])(1[012]|0[1-9])(?!\d)"
Description: YYYYMMDD or YYYYMM.
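Pattern (10) contains only digits, so it can be tried directly. The sketch below assumes that the trailing assertion is the negative lookahead `(?!\d)`, which keeps a longer digit run from being read as a date; the names `PATTERN_DATE_9` and `parse_compact` are hypothetical.

```python
import re

# Pattern (10): YYYYMMDD, or YYYYMM, not followed by a further digit.
PATTERN_DATE_9 = re.compile(
    r"(199[5-9]|20[0-4][0-9])(1[012]|0[1-9])(30|31|([12][0-9])|(0[1-9]))(?!\d)"
    r"|(199[5-9]|20[0-4][0-9])(1[012]|0[1-9])(?!\d)"
)


def parse_compact(text):
    """Return (year, month, day) for the first compact date; day is None for YYYYMM."""
    m = PATTERN_DATE_9.search(text)
    if m is None:
        return None
    if m.group(1):  # first alternative: YYYYMMDD
        return int(m.group(1)), int(m.group(2)), int(m.group(3))
    return int(m.group(6)), int(m.group(7)), None  # second alternative: YYYYMM
```

The eight-digit alternative comes first so that a full date is not cut short to its year and month.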
(11) pattern_date (10)=" (1 [012] | 0[1-9]) [moon -- /] (30 | 31 | ([12] [0-9]) | (0[1-9])) day([to arriving
~~] | -+| -+) (1 [012] | 0[1-9]) [moon -- /] (30 | 31 | ([12] [0-9]) | (0[1-9])) (day | (![moon d -- /])) "
Description: month-day, a range connector, then month-day; the day marker is optional. The month may be 1 to 12 or 01 to 09; the day may be 01 to 09, 1 to 9, or 10 to 31.
(12) pattern_date (11)=" (1 [012] | 0[1-9]) [moon -- /] (30 | 31 | ([12] [0-9]) | (0[1-9])) day([to arriving
~~] | -+| -+) (30 | 31 | ([12] [0-9]) | (0[1-9])) (day | (![moon d -- /])) "
Description: month-day, a range connector, then a day only; also the separator form xx(-/)xx (connector) xx.
(13) pattern_date (12)=" (1 [012] | 0[1-9]) moon([extremely arrive~~] | -+| -+) (1 [012] | 0[1-9]) moon "
Description: month, a range connector, then month; the first month marker is optional. The month may be 1 to 12 or 01 to 09.
(14) pattern_date (13)=" (30 | 31 | ([12] [0-9]) | (0[1-9])) day([extremely arrive~~] | -+|-
+)(30|31|([12][0-9])|(0[1-9])) day "
Description: day, a range connector, then day; the first day marker is optional.
(15) pattern_date (14)=" ((19)9[5-9]|(20)[0-4] [0-9]) year([extremely arrive~~] | -+|-
+)((19)9[5-9]|(20)[0-4] [0-9]) year "
Description: year, a range connector, then year; the first year marker is optional.
(16) pattern_date (15)=" b ((1 [012] | 0[1-9]) moon[,, ]) * ((1 [012] | 0[1-9]) moon[and with
And])(1[012]|0[1-9]) moon "
Description: (xx[month](list separator or conjunction)){0,11} followed by a final month, which also covers a single month on its own. To match cases in which two months, or a month and a day, are written back to back, no zero-width lookahead assertion is added.
(17) pattern_date (16)=" ((19)9[5-9]|(20)[0-4] [0-9]) year "
Description: a four-digit year (xxxx) or a two-digit year (xx), followed by the year marker.
Refer also to Fig. 3, a block diagram of the device for intelligently extracting dates from text according to the invention. The present invention also provides a device for intelligently extracting dates from text, comprising: an acquisition module 1, a preprocessing module 2, a matching module 3, an extraction module 4, a date completion module 5, and a storage module 6.
The acquisition module 1 obtains the text from which dates are to be extracted and the document filling date.
The preprocessing module 2 converts the dates that appear in the text to a canonical form.
As a further improvement of the present invention, the preprocessing module 2 includes:
a first preprocessing submodule 21, for using regular expressions to match the variant month marker and the variant day marker ("number") and converting them to the canonical "month" and "day" markers respectively;
a second preprocessing submodule 22, for calculating the corresponding date for date expressions in the text that are related to the document filling date. The expressions related to the document filling date in the second preprocessing submodule include: "this year", "current year", "last year", "the previous year", "next year", "the following year", "this month", "next month", and "the current month". After one of these date expressions is matched, the corresponding date is calculated from the document filling date;
a third preprocessing submodule 23, for converting numbers written in Chinese numerals to Arabic numerals.
The matching module 3 constructs regular expressions, matches them against the date expressions in the text, and extracts the date expressions that conform to the regular expression formats.
Further, once the matching module 3 has matched a date in the text with a regular expression, it extracts the matched date expression field and then deletes it from the text.
The extraction module 4 extracts the year, month, and day numbers from date expressions of each format.
The date completion module 5 uses the document filling date to fill in any year or month data missing from the text.
Further, when the date completion module 5 fills in missing year and month data, the basis for completion includes the year and month of the document filling date, and/or a year or month related to the document filling date.
The storage module 6 combines the recognized year, month, and day numbers into a complete date and stores it in date format.
The present invention is not limited to the above embodiment. Various changes or modifications that do not depart from the spirit and scope of the invention, and that fall within the scope of the claims and their technical equivalents, are also intended to be covered by the invention.
Claims (10)
1. A method for intelligently extracting dates from text, comprising the following steps:
Step 1: obtain the text from which dates are to be extracted, together with the document filling date;
Step 2: preprocess the text, converting the dates that appear in it to a canonical form;
Step 3: construct regular expressions, match them against the date expressions in the text, and extract the date expressions that conform to the regular expression formats;
Step 4: for each date expression format, extract the year, month, and day numbers it contains;
Step 5: use the document filling date to fill in any year or month data missing from the text;
Step 6: combine the recognized year, month, and day numbers into a complete date and store it in date format.
2. The method for intelligently extracting dates from text according to claim 1, characterized in that step 2 specifically includes the following steps:
Step 21: use regular expressions to match the variant month marker and the variant day marker ("number"), and convert them to the canonical "month" and "day" markers respectively;
Step 22: for date expressions in the text that are related to the document filling date, calculate the corresponding date;
Step 23: convert numbers written in Chinese numerals to Arabic numerals.
3. The method for intelligently extracting dates from text according to claim 2, characterized in that the expressions related to the document filling date in step 22 include: "this year", "current year", "last year", "the previous year", "next year", "the following year", "this month", "next month", and "the current month"; after one of these date expressions is matched, the corresponding date is calculated from the document filling date.
4. The method for intelligently extracting dates from text according to claim 1, characterized in that, in step 3, once a regular expression has matched a date in the text, the matched date expression field is extracted and then deleted from the text.
5. The method for intelligently extracting dates from text according to claim 1, characterized in that, in step 5, when missing year and month data are filled in, the basis for completion includes the year and month of the document filling date, and/or a year or month related to the document filling date.
6. A device for intelligently extracting dates from text, characterized by comprising:
Acquisition module, for obtaining the text from which dates are to be extracted and the document filling date;
Preprocessing module, for converting the dates that appear in the text into dates of canonical form;
Matching module, for constructing regular expressions, matching them against the date expressions in the text,
and extracting the date expressions that conform to the regular expression forms;
Extraction module, for extracting the year, month and day numbers from date expressions of different formats;
Date completion module, for completing the year or month data missing from the text by means of the document filling date;
Storage module, for combining the identified year, month and day numbers into a complete date and storing it in a date format.
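The storage module's final step, combining the numbers into one date value, might look like the sketch below. The function name `to_date` and the decision to return `None` for impossible dates are assumptions; the claim only says the numbers are combined and stored in a date format.

```python
import datetime


def to_date(year, month, day):
    """Combine year, month and day into a date, or return None if they do not form one."""
    try:
        return datetime.date(year, month, day)
    except ValueError:
        # E.g. February 30: the numbers were extracted but are not a real date.
        return None
```

Using a real date type rather than a string lets the calendar check reject combinations such as month 13 before anything is stored.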
7. The device for intelligently extracting dates from text according to claim 6, characterized in that: the preprocessing module
comprises:
First preprocessing submodule, for matching the colloquial markers "month" and "number" with regular expressions and, once matched,
converting them to the standard "month" and "day" markers;
Second preprocessing submodule, for calculating, from the document filling date, the dates that correspond to the date expressions
in the text that are related to the document filling date;
Third preprocessing submodule, for converting numerals written in Chinese into Arabic numerals.
8. The device for intelligently extracting dates from text according to claim 7, characterized in that: in the second preprocessing
submodule, the expressions associated with the document filling date include: "this year", "current year", "last year", "previous year",
"next year", "next year", "this month", "next month", "this month"; after any of the above date expressions is
matched, the corresponding date is calculated with reference to the document filling date.
9. The device for intelligently extracting dates from text according to claim 6, characterized in that: after the matching module
successfully matches a date in the text by a regular expression, it extracts the matched date expression field and deletes the
matched field from the text.
10. The device for intelligently extracting dates from text according to claim 6, characterized in that: when the date completion
module completes missing year and month data, the completion is based on the year and month of the document filling date,
and/or on a year or month that has an association with the document filling date.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511033057.2A CN106933783A (en) | 2015-12-31 | 2015-12-31 | A kind of method and device on the intelligent extraction date from text |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511033057.2A CN106933783A (en) | 2015-12-31 | 2015-12-31 | A kind of method and device on the intelligent extraction date from text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106933783A true CN106933783A (en) | 2017-07-07 |
Family
ID=59443734
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201511033057.2A Pending CN106933783A (en) | 2015-12-31 | 2015-12-31 | A kind of method and device on the intelligent extraction date from text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106933783A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109933198A (en) * | 2019-03-13 | 2019-06-25 | 广东小天才科技有限公司 | A kind of method for recognizing semantics and device |
CN113536732A (en) * | 2021-06-24 | 2021-10-22 | 北京天健源达科技股份有限公司 | Date and time data formatting display method applied to electronic medical record |
CN114356972A (en) * | 2021-12-03 | 2022-04-15 | 四川科瑞软件有限责任公司 | Data processing method, and event time-based retrieval method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102103612A (en) * | 2009-12-22 | 2011-06-22 | 北大方正集团有限公司 | Information extraction method and device |
CN102184204A (en) * | 2011-04-28 | 2011-09-14 | 常州大学 | Auto fill method and system of intelligent Web form |
CN105183742A (en) * | 2015-06-12 | 2015-12-23 | 南京富士通南大软件技术有限公司 | Resume identification method |
- 2015-12-31 CN CN201511033057.2A patent/CN106933783A/en active Pending
Non-Patent Citations (1)
Title |
---|
周小甲: "中文病历文本的时间信息提取研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Harmon et al. | The index of linguistic diversity: A new quantitative measure of trends in the status of the world's languages | |
US9495347B2 (en) | Systems and methods for extracting table information from documents | |
CN107544984A (en) | A kind of method and apparatus of data processing | |
CN104361018B (en) | Electronic archives information reorganization method and device | |
CN102122280B (en) | Method and system for intelligently extracting content object | |
CN106021389A (en) | System and method for automatically generating news based on template | |
WO2016060547A1 (en) | Emulating manual system of filing using electronic document and electronic file | |
US20170228356A1 (en) | System Generator Module for Electronic Document and Electronic File | |
CN103282903A (en) | Topic extraction device and program | |
CN106445910A (en) | Document analysis method and apparatus | |
CN106933783A (en) | A kind of method and device on the intelligent extraction date from text | |
CN108255966A (en) | A kind of data migration method and storage medium | |
CN107491563A (en) | Towards the data processing method and system of settlement for account | |
CN105956125A (en) | Patent monitoring system and method | |
CN103425653A (en) | Method and system for realizing DICOM (digital imaging and communication in medicine) image quadratic search | |
CN103455896A (en) | Paperless assembling quality control method based on internet of things | |
CN105488471B (en) | A kind of font recognition methods and device | |
CN103399848A (en) | Engine test data standardized specific format leading-in processing method | |
CN101976394A (en) | Data acquiring and counting system and method | |
US20170235757A1 (en) | Electronic processing system for electronic document and electronic file | |
CN110275938B (en) | Knowledge extraction method and system based on unstructured document | |
CN105335459A (en) | XBRL intelligent report platform based statement consolidation data extraction method | |
CN103679382A (en) | Information statistics management system | |
US20040243536A1 (en) | Information capturing, indexing, and authentication system | |
AU2015331032A1 (en) | Electronic filing system for electronic document and electronic file |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170707 |