CN106933783A - Method and device for intelligently extracting dates from text - Google Patents

Method and device for intelligently extracting dates from text

Info

Publication number
CN106933783A
Authority
CN
China
Prior art keywords
date
text
month
year
day
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201511033057.2A
Other languages
Chinese (zh)
Inventor
孙晓东
向万红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuanguang Software Co Ltd
Original Assignee
Yuanguang Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuanguang Software Co Ltd
Priority to CN201511033057.2A
Publication of CN106933783A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a method for intelligently extracting dates from text, comprising the following steps. Step 1: obtain the text from which dates are to be extracted, together with the document filling date. Step 2: pre-process the text, converting the dates that appear in it to a canonical form. Step 3: construct regular expressions and use them to match the date expressions in the text, extracting those that conform to the regular-expression formats. Step 4: for each date-expression format, extract the year, month and day numbers it contains. Step 5: use the document filling date to fill in the year or month data missing from the text. Step 6: combine the recognized year, month and day numbers into a complete date and store it in a date format. The invention also provides a device that implements the method.

Description

Method and device for intelligently extracting dates from text
Technical field
The present invention relates to a date extraction method, in particular a date extraction method for electronic documents; it also relates to a device that implements the method.
Background
As enterprises move to digital processes, electronic documents are gradually replacing paper documents and have become an important medium for transmitting and auditing business data. Entering and reviewing electronic documents, and analyzing data based on them, are core functions of financial management software. In using such software, users accumulate large volumes of electronic-document data available for analysis.
When entering an electronic document in financial management software, users often have to fill in the date on which the business event occurred. In some financial management software this date is recorded as free text rather than in a standard date format. As a result, when users analyze the data it is difficult to extract the date from the text, and this key piece of information cannot be used.
Moreover, the entry of electronic documents has its own peculiarities. Dates are expressed in many forms; they are often incomplete, containing only one or two of the three components (year, month, day); and they often contain Chinese numerals (for example, "September" written with a Chinese numeral). All of this poses new challenges for date recognition and extraction.
Existing methods for extracting data from text, however, mainly target other fields, such as search engines or the recognition of other kinds of data. Existing date extraction methods therefore mainly handle complete, well-formatted date expressions. They lack the ability to recognize incomplete dates that are missing one or two of the three components (such as "August 1"), dates written with Chinese numerals (such as "September"), relative dates (such as "last month") and time periods (such as "August to October 2014"). Applying existing extraction techniques directly to such expressions yields only the literal text; the actual date it denotes cannot be recognized.
Summary of the invention
The invention aims to overcome the shortcomings and deficiencies of the prior art by providing a method and device for intelligently extracting dates from text.
The invention is realized by the following technical solution: a method for intelligently extracting dates from text, comprising the following steps:
Step 1: Obtain the text from which dates are to be extracted and the document filling date;
Step 2: Pre-process the text, converting the dates that appear in it to a canonical form;
Step 3: Construct regular expressions and use them to match the date expressions in the text, extracting those that conform to the regular-expression formats;
Step 4: For each date-expression format, extract the year, month and day numbers it contains;
Step 5: Use the document filling date to fill in the year or month data missing from the text;
Step 6: Combine the recognized year, month and day numbers into a complete date and store it in a date format.
As a further improvement of the invention, step 2 specifically comprises the following steps:
Step 21: Match the variant markers "月份" (month) and "号" (day number) with regular expressions and convert them to the canonical markers "月" (month) and "日" (day);
Step 22: Compute the corresponding date for each date expression in the text that is defined relative to the document filling date;
Step 23: Convert numbers written with Chinese numerals to Arabic numerals.
As a further improvement of the invention, the expressions in step 22 that are relative to the document filling date include: "this year", "the current year", "last year", "the previous year", "next year", "the following year", "this month", "next month" and "the current month". When one of these expressions is matched, the corresponding date is computed from the document filling date.
As a further improvement of the invention, in step 3, once a regular expression has matched a date in the text, the matched date expression field is extracted and then deleted from the text.
As a further improvement of the invention, in step 5, the missing year and month data are filled in on the basis of the year and month of the document filling date, and/or a year or month related to the document filling date.
The invention also provides a device for intelligently extracting dates from text, comprising:
An acquisition module, for obtaining the text from which dates are to be extracted and the document filling date;
A pre-processing module, for converting the dates that appear in the text to a canonical form;
A matching module, for constructing regular expressions and using them to match the date expressions in the text, extracting those that conform to the regular-expression formats;
An extraction module, for extracting the year, month and day numbers from each date-expression format;
A date completion module, for using the document filling date to fill in the year or month data missing from the text;
A storage module, for combining the recognized year, month and day numbers into a complete date and storing it in a date format.
As a further improvement of the invention, the pre-processing module comprises:
A first pre-processing submodule, for matching the variant markers "月份" (month) and "号" (day number) with regular expressions and converting them to the canonical markers "月" (month) and "日" (day);
A second pre-processing submodule, for computing the corresponding date for each date expression in the text that is defined relative to the document filling date;
A third pre-processing submodule, for converting numbers written with Chinese numerals to Arabic numerals.
As a further improvement of the invention, the expressions handled by the second pre-processing submodule include: "this year", "the current year", "last year", "the previous year", "next year", "the following year", "this month", "next month" and "the current month". When one of these expressions is matched, the corresponding date is computed from the document filling date.
As a further improvement of the invention, once the matching module has matched a date in the text with a regular expression, it extracts the matched date expression field and deletes it from the text.
As a further improvement of the invention, the date completion module fills in the missing year and month data on the basis of the year and month of the document filling date, and/or a year or month related to the document filling date.
Compared with the prior art, the invention has the following advantages:
1. The invention intelligently extracts business occurrence dates that are stored as text in payment descriptions and converts them to an analyzable date format. On this basis, reimbursement documents can be classified and summarized by month and further financial expenditure analysis can be carried out, enabling effective control of reimbursement costs, saving corporate expenses and improving profit.
2. The invention recognizes a variety of common date expressions, specifically: (1) Chinese numerals; (2) relative dates; (3) variant month and day markers such as "XX月份", "XX月", "XX日" and "XX号"; (4) date ranges; (5) purely numeric dates without year, month or day keywords, such as "20120809", "2012-08-09", "2012.08-2012.09" and "2012 08 09-2012 08 21".
3. With the invention, users filling in electronic documents can enter dates as text in whatever way they are used to, without being bound to a fixed date format or wasting effort clicking through a date picker, which saves time and improves the user experience.
For a better understanding and implementation, the invention is described in detail below with reference to the accompanying drawings.
Brief description of the drawings
Fig. 1 is a flow chart of the method for intelligently extracting dates from text according to the invention.
Fig. 2 is a flow chart of matching text with regular expressions in the present embodiment.
Fig. 3 is a block diagram of the device for intelligently extracting dates from text according to the invention.
Detailed description of the embodiments
To solve the prior-art problem that dates cannot be extracted from electronic documents, the invention provides a method and device capable of intelligently extracting dates from text. A preferred embodiment is described in detail below.
The invention provides a method for intelligently extracting dates from text, comprising the following steps:
Step S1: Obtain the text from which dates are to be extracted and the document filling date.
Step S2: Pre-process the text, converting the dates that appear in it to a canonical form.
Further, in the present embodiment, step S2 specifically comprises the following steps:
Step 21: Match the variant markers "月份" (month) and "号" (day number) with regular expressions and convert them to the canonical markers "月" (month) and "日" (day).
Step 22: Compute the corresponding date for each date expression in the text that is defined relative to the document filling date.
Specifically, the expressions in this step that are relative to the document filling date include: "this year", "the current year", "last year", "the previous year", "next year", "the following year", "this month", "next month" and "the current month". When one of these expressions is matched, the corresponding date is computed from the document filling date. For example, if the document filling date is in August 2012, then "this year", "last year", "this month" and "next month" are converted to "2012", "2011", "August" and "September" respectively.
Step 23: Convert numbers written with Chinese numerals to Arabic numerals. For example, the numerals from "one" to "thirty-one" are converted to "1" to "31", and "zero" is converted to "0".
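The three pre-processing sub-steps can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the Chinese terms in the tables are assumed from the translated glosses, the names `preprocess`, `chinese_to_arabic`, `RELATIVE_YEARS`, `RELATIVE_MONTHS` and `NUMERAL` are hypothetical, and the digit-by-digit reading of numerals such as 二零一二 is an extension beyond the "one" to "thirty-one" range the text describes.

```python
import re
from datetime import date

# Assumed source terms; the translated text shows only their English glosses.
RELATIVE_YEARS = {"今年": 0, "本年": 0, "去年": -1, "上一年": -1, "明年": 1, "下一年": 1}
RELATIVE_MONTHS = {"本月": 0, "这个月": 0, "下个月": 1}
DIGITS = {c: i for i, c in enumerate("零一二三四五六七八九")}
# Either a tens form (十, 十一 ... 三十一) or a plain run of digit characters.
NUMERAL = re.compile(r"[一二三]?十[一二三四五六七八九]?|[零一二三四五六七八九]+")


def chinese_to_arabic(numeral: str) -> str:
    """Convert one Chinese numeral matched by NUMERAL to Arabic digits."""
    if "十" not in numeral:
        # Digit-by-digit reading, e.g. 九 -> 9 (extension: 二零一二 -> 2012).
        return "".join(str(DIGITS[c]) for c in numeral)
    tens, _, ones = numeral.partition("十")
    return str((DIGITS[tens] if tens else 1) * 10 + (DIGITS[ones] if ones else 0))


def preprocess(text: str, fill_date: date) -> str:
    # Step 21: unify variant markers (月份 -> 月, 号 -> 日).
    text = re.sub("月份", "月", text)
    text = re.sub("号", "日", text)
    # Step 22: resolve relative expressions against the document filling date.
    for word, offset in RELATIVE_YEARS.items():
        text = text.replace(word, f"{fill_date.year + offset}年")
    for word, offset in RELATIVE_MONTHS.items():
        # Year rollover for "next month" in December is ignored in this sketch.
        month = (fill_date.month - 1 + offset) % 12 + 1
        text = text.replace(word, f"{month}月")
    # Step 23: convert Chinese numerals to Arabic numerals.
    return NUMERAL.sub(lambda m: chinese_to_arabic(m.group()), text)
```

With a document filling date in August 2012, `preprocess("去年九月份十一号", date(2012, 8, 15))` yields `"2011年9月11日"`. Relative expressions are resolved before numeral conversion so that terms containing a numeral character, such as 上一年, are not broken apart.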
Step S3: Construct regular expressions and use them to match the date expressions in the text, extracting those that conform to the regular-expression formats.
Further, in this step, once a regular expression has matched a date in the text, the matched date expression field is extracted and then deleted from the text, to prevent it from being matched again.
Referring to Fig. 2, a flow chart of matching text with regular expressions in the present embodiment: suppose the user has constructed n regular expressions as required. The regular expressions for the various date formats are applied to the pre-processed text in turn. If the N-th regular expression matches, the handling procedure for the N-th regular expression is executed, and once it completes the text is matched against the (N+1)-th regular expression; if the N-th regular expression does not match, matching continues directly with the (N+1)-th regular expression.
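The match-then-delete loop described above might look like the following sketch. The three patterns are simplified, hypothetical stand-ins ordered from most to least specific; they are not the 17 expressions of the embodiment, and `match_dates` is an illustrative name.

```python
import re

# Hypothetical, simplified patterns, ordered from most to least specific.
PATTERNS = [
    ("year_month_day", re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")),
    ("year_month", re.compile(r"(\d{4})年(\d{1,2})月")),
    ("month_day", re.compile(r"(\d{1,2})月(\d{1,2})日")),
]


def match_dates(text: str) -> list[tuple[str, str]]:
    """Apply each pattern in turn, removing matched fields from the text."""
    found = []
    for name, pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append((name, m.group()))
        # Delete matched fields so later, looser patterns cannot match them
        # again; a space is left behind so neighbouring text does not merge.
        text = pattern.sub(" ", text)
    return found
```

For the input `"2012年9月8日出差，10月3日返回"` this returns the full date once and the month-day expression once. Without the deletion step, the looser month-day pattern would also match `9月8日` inside the already extracted full date.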
Step S4: For each date-expression format, extract the year, month and day numbers it contains. Because every regular expression used to match the text has a fixed format, the meaning of each number is known from its position. For "XXXX年XX月" (year then month), for example, the first run of digits is the year and the second is the month, so matching the digits in the expression and reading them in order identifies the year and the month.
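Positional extraction of this kind can be sketched in a few lines. The function name `extract_fields` and the idea of passing the field order as a tuple are assumptions made for illustration; the source only states that the digit runs are read in order.

```python
import re


def extract_fields(expression: str, fields: tuple[str, ...]) -> dict[str, int]:
    """Assign the digit runs in a matched expression to field names by position."""
    numbers = re.findall(r"\d+", expression)
    return {name: int(num) for name, num in zip(fields, numbers)}
```

For example, `extract_fields("2012年08月", ("year", "month"))` gives `{"year": 2012, "month": 8}`. Each pattern would carry its own field order, so a month-day pattern would pass `("month", "day")` instead.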
Step S5: Use the document filling date to fill in the year or month data missing from the text.
Specifically, in this step, the missing year and month data are filled in on the basis of the year and month of the document filling date, and/or a year or month related to the document filling date.
Each matched date is checked for completeness. If it is complete, a date in the canonical format is generated directly; if not, the date is first completed intelligently and then generated in the canonical format. The definition of a complete date can vary with the application scenario. When analyzing financial reimbursement documents, the year and month are generally required but the specific day is not, so the invention treats a date as complete if it contains a year and a month. Users of the invention may decide which canonical format to adopt; this example uses "XXXX年XX月XX日" (year, month, day), for example "2012年03月21日" (March 21, 2012). If the matched date field lacks the year or the month, the invention fills it in intelligently from the document filling date. For example, if the document filling date is October 10, 2012 and the matched date expression is "September 8", the completed result is September 8, 2012.
Depending on the document, the year and month used for completion may be those of the document filling date, or the previous year, the following year, the previous month or the following month relative to it; this depends on the document type and on the relationship between the document filling date and the date to be completed. For the year, the completion rule is: if the month value is less than or equal to the month of the document filling date plus 1, the year of the document filling date is taken as the year in which the business event occurred; otherwise that year minus 1 is taken. For example, if the document filling date is October 10, 2012 and the matched date expression is "September 8", the completed result is September 8, 2012. For another example, if the document filling date is January 10, 2012 and the matched date expression is "November 8", the completed result is November 8, 2011.
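The year-completion rule stated above translates directly into code. The sketch below covers only the missing-year case with a known month and day; the function names `complete_year` and `complete_date` are hypothetical.

```python
from datetime import date


def complete_year(month: int, fill_date: date) -> int:
    """Pick the year for a date that has a month but no year."""
    # Rule from the text: month <= filling month + 1 means the filling year,
    # otherwise the previous year.
    if month <= fill_date.month + 1:
        return fill_date.year
    return fill_date.year - 1


def complete_date(month: int, day: int, fill_date: date) -> date:
    """Build a full date from a month-day expression and the filling date."""
    return date(complete_year(month, fill_date), month, day)
```

Both worked examples from the text hold: `complete_date(9, 8, date(2012, 10, 10))` is September 8, 2012, and `complete_date(11, 8, date(2012, 1, 10))` is November 8, 2011. The "plus 1" allowance lets a document refer to the month just after its filling date without being pushed back a year.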
Step S6: Combine the recognized year, month and day numbers into a complete date and store it in a date format.
In this step, to meet the needs of data analysis, duplicate dates are merged, and all resulting dates are stored in a unified canonical format.
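Merging duplicates and emitting a unified format can be sketched as below. The function name `merge_and_format` is hypothetical, the output format assumes the "XXXX年XX月XX日" form used in this example, and sorting the result is an added convenience that the source does not mention.

```python
from datetime import date


def merge_and_format(dates: list[date]) -> list[str]:
    """Drop duplicate dates and render the rest in one canonical format."""
    unique = sorted(set(dates))  # sorting is an assumption, not in the source
    return [f"{d.year:04d}年{d.month:02d}月{d.day:02d}日" for d in unique]
```

Given March 21, 2012 twice and November 8, 2011 once, the function returns two strings, `"2011年11月08日"` and `"2012年03月21日"`.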
The date-matching regular expressions used by the invention are briefly introduced below. The present embodiment employs 17 regular expressions, as follows:
(1) pattern_date (0)=" ((19)9[5-9]|(20)[0-4] [0-9]) [year -- /] (1 [012] | 0[1-9]) [moon -- \./\\](30|31|([12][0-9])|(0[1-9])) day([extremely arrive~~] | -+| -+) ((19)9[5-9]|(20)[0-4] [0-9]) [year -- \./\\](1[012]|0[1-9]) [moon -- /] (30 | 31 | ([12] [0-9]) | (0[1-9])) (day | (![d -- /])) "
Description: xxxx[year -./\]xx[month -./\]xx[day] (range connector) xxxx[year -./\]xx[month -./\]xx[day], where the range connector is a run of hyphens or dashes, a tilde, or a word meaning "to". The year may be 1995 to 2049, or the two-digit forms 95 to 99 and 00 to 49. The month may be 1 to 12 or 01 to 09; the day may be 01 to 09, 1 to 9 or 10 to 31. Example: a range starting in January 1995 and ending in January 2010.
(2) pattern_date (1)=" ((19)9[5-9]|(20)[0-4] [0-9]) [year -- /] (1 [012] | 0[1-9]) [moon -- \./\\](30|31|([12][0-9])|(0[1-9])) day([extremely arrive~~] | -+| -+) (1 [012] | 0[1-9]) [moon -- \./\\](30|31|([12][0-9])|(0[1-9])) (day | (![d -- /])) "
Description: xxxx[year]xx[month]xx[day] (range connector) xx[month]xx[day]. The year may be 1995 to 2049, or the two-digit forms 95 to 99 and 00 to 49. The month may be 1 to 12 or 01 to 09; the day may be 01 to 09, 1 to 9 or 10 to 31. Example: January 1, 1995 to October 1.
(3) pattern_date (2)=" ((19)9[5-9]|(20)[0-4] [0-9]) [year -- /] (1 [012] | 0[1-9]) moon([to arriving ~~] | -+| -+) ((19)9[5-9]|(20)[0-4] [0-9]) [year -- /] (1 [012] | 0[1-9]) (moon | (![d -- /])) "
Description: xxxx[year]xx[month] (range connector) xxxx[year]xx[month]; xxxx(-./\)xx (range connector) xxxx(-./\)xx. Note: this pattern can also match an expression such as "11-08~12-09". Because the current year is already 2015, such an expression more likely denotes November 8 to December 9 than August 2011 to September 2012, so the program makes a special judgment for this case.
(4) pattern_date (3)=" ((19)9[5-9]|(20)[0-4] [0-9]) [year /] (1 [012] | 0[1-9]) moon([to arriving ~~] | -+| -+) (1 [012] | 0[1-9]) (moon | (![d -- /])) | ((19)9[5-9]|(20)[0-4] [0-9]) [--] (1 [012] | 0[1-9]) Month[extremely arrive~~] (1 [012] | 0[1-9]) (moon | (![d -- /])) "
Description: xxxx[year]xx[month] (range connector) xx[month]; xxxx(-/)xx (range connector) xx. The pattern is written as two alternatives joined by "|", mainly to prevent expressions of the form xx-xx-xx from being matched. The zero-width assertion prevents an expression such as 12-07~12-21 from being matched as 12-07~12.
(5) pattern_date (4)=" ((19)9[5-9]|(20)[0-4] [0-9]) [year -- /] (1 [012] | 0[1-9]) [moon -- \./\\](30|31|([12][0-9])|(0[1-9])) day([extremely arrive~~] | -+| -+) (30 | 31 | ([12] [0-9]) | (0[1-9])) (day | (![moon d -- \./\\]))"
Description: xxxx[year]xx[month]xx[day] (range connector) xx[day]; xxxx(-/)xx(-/)xx (range connector) xx. Example: January 1 to 10, 1999.
(6) pattern_date (5)=" ((19)9[5-9]|(20)[0-4] [0-9]) [year -- /] (1 [012] | 0[1-9]) [moon -- \./\\](30|31|([12][0-9])|(0[1-9])) (day | (![moon d -- /])) "
Description: xxxx[year]xx[month]xx[day]; xxxx(-./\)xx(-./\)xx. Example: January 1, 2000; expressions such as 2000-1-1 are also matched.
(7) pattern_date (6)=" ((19)9[5-9]|(20)[0-4] [0-9]) year ((1 [012] | 0[1-9]) (moon [,, ]| [,, ])) * ((1 [012] | 0[1-9]) moon[and with and])(1[012]|0[1-9]) moon "
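Description: a year followed by a list of months. When the month character "月" is present, no separator is required between two months, so an expression that lists months back to back (such as January and February of a given year) is matched.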
(8) pattern_date (7)=" (199 [5-9] | 20 [0-4] [0-9]) [-- /] (1 [012] | 0[1-9])(![d -- /]) "
Description: xxxx(-./\)xx. Example: 2000-12, meaning December 2000.
(9) pattern_date (8)=" (199 [5-9] | 20 [0-4] [0-9]) (1 [012] | 0 [1-9]) ([extremely arrive~~] | -+|- +)(199[5-9]|20[0-4][0-9])(1[012]|0[1-9])(!\d)"
Description: YYYYMM (range connector) YYYYMM.
(10) pattern_date (9)= "(199[5-9]|20[0-4][0-9])(1[012]|0[1-9])(30|31|([12][0-9])|(0[1-9]))(!\d)|(199[5-9]|20[0-4][0-9])(1[0 12]|0[1-9])(!\d)"
Description: YYYYMMDD or YYYYMM.
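As an illustration of how the compact numeric pattern (10) could be written in a real regex engine, the sketch below rebuilds it in Python. It assumes that the "(!\d)" in the listing was originally the negative lookahead "(?!\d)"; the inner grouping is simplified, and the names `COMPACT_DATE` and `parse_compact` are hypothetical.

```python
import re

YEAR = r"(199[5-9]|20[0-4][0-9])"   # 1995 to 2049
MONTH = r"(1[012]|0[1-9])"          # 01 to 12
DAY = r"(30|31|[12][0-9]|0[1-9])"   # 01 to 31
# Assumption: "(!\d)" in the listing stands for the lookahead "(?!\d)".
COMPACT_DATE = re.compile(rf"{YEAR}{MONTH}{DAY}(?!\d)|{YEAR}{MONTH}(?!\d)")


def parse_compact(text: str):
    """Return (year, month, day) for the first match; day is None for YYYYMM."""
    m = COMPACT_DATE.search(text)
    if m is None:
        return None
    if m.group(1):  # first alternative: YYYYMMDD
        return int(m.group(1)), int(m.group(2)), int(m.group(3))
    return int(m.group(4)), int(m.group(5)), None  # second alternative: YYYYMM
```

`parse_compact("报销20120809差旅费")` returns `(2012, 8, 9)` and `parse_compact("201208")` returns `(2012, 8, None)`. The trailing lookahead keeps a six-digit or eight-digit prefix of a longer digit run from being read as a date.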
(11) pattern_date (10)=" (1 [012] | 0[1-9]) [moon -- /] (30 | 31 | ([12] [0-9]) | (0[1-9])) day([to arriving ~~] | -+| -+) (1 [012] | 0[1-9]) [moon -- /] (30 | 31 | ([12] [0-9]) | (0[1-9])) (day | (![moon d -- /])) "
Description: xx[month -./\]xx[day] (range connector) xx[month -./\]xx[day]. The month may be 1 to 12 or 01 to 09; the day may be 01 to 09, 1 to 9 or 10 to 31.
(12) pattern_date (11)=" (1 [012] | 0[1-9]) [moon -- /] (30 | 31 | ([12] [0-9]) | (0[1-9])) day([to arriving ~~] | -+| -+) (30 | 31 | ([12] [0-9]) | (0[1-9])) (day | (![moon d -- /])) "
Description: xx[month]xx[day] (range connector) xx[day]; xx(-/)xx (range connector) xx.
(13) pattern_date (12)=" (1 [012] | 0[1-9]) moon([extremely arrive~~] | -+| -+) (1 [012] | 0[1-9]) moon "
Description: xx[month] (range connector) xx[month]. The month may be 1 to 12 or 01 to 09.
(14) pattern_date (13)=" (30 | 31 | ([12] [0-9]) | (0[1-9])) day([extremely arrive~~] | -+|- +)(30|31|([12][0-9])|(0[1-9])) day "
Description: xx[day] (range connector) xx[day].
(15) pattern_date (14)=" ((19)9[5-9]|(20)[0-4] [0-9]) year([extremely arrive~~] | -+|- +)((19)9[5-9]|(20)[0-4] [0-9]) year "
Description: xxxx[year] (range connector) xxxx.
(16) pattern_date (15)=" b ((1 [012] | 0[1-9]) moon[,, ]) * ((1 [012] | 0[1-9]) moon[and with And])(1[012]|0[1-9]) moon "
Description: (xx[month] followed by a comma or a word meaning "and"){0,11} xx[month], which includes the single-month form xx[month]. To handle cases such as "January, December" and "December 1", no zero-width lookahead assertion is added.
(17) pattern_date (16)=" ((19)9[5-9]|(20)[0-4] [0-9]) year "
Description: a four-digit (xxxx) or two-digit (xx) year followed by the year character.
Referring to Fig. 3, a block diagram of the device for intelligently extracting dates from text according to the invention: the invention also provides a device for intelligently extracting dates from text, comprising an acquisition module 1, a pre-processing module 2, a matching module 3, an extraction module 4, a date completion module 5 and a storage module 6.
The acquisition module 1 obtains the text from which dates are to be extracted and the document filling date.
The pre-processing module 2 converts the dates that appear in the text to a canonical form.
As a further improvement of the invention, the pre-processing module 2 comprises:
A first pre-processing submodule 21, for matching the variant markers "月份" (month) and "号" (day number) with regular expressions and converting them to the canonical markers "月" (month) and "日" (day).
A second pre-processing submodule 22, for computing the corresponding date for each date expression in the text that is defined relative to the document filling date. Further, the expressions handled by the second pre-processing submodule include: "this year", "the current year", "last year", "the previous year", "next year", "the following year", "this month", "next month" and "the current month". When one of these expressions is matched, the corresponding date is computed from the document filling date.
A third pre-processing submodule 23, for converting numbers written with Chinese numerals to Arabic numerals.
The matching module 3 constructs regular expressions and uses them to match the date expressions in the text, extracting those that conform to the regular-expression formats.
Further, once the matching module 3 has matched a date in the text with a regular expression, it extracts the matched date expression field and deletes it from the text.
The extraction module 4 extracts the year, month and day numbers from each date-expression format.
The date completion module 5 uses the document filling date to fill in the year or month data missing from the text.
Further, the date completion module 5 fills in the missing year and month data on the basis of the year and month of the document filling date, and/or a year or month related to the document filling date.
The storage module 6 combines the recognized year, month and day numbers into a complete date and stores it in a date format.
The invention is not limited to the above embodiment. Where changes or variations to the invention do not depart from its spirit and scope, and fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to cover those changes and variations as well.

Claims (10)

1. a kind of method on the intelligent extraction date from text, it is comprised the following steps:
Step 1:Acquisition will therefrom extract the text and document filling date on date;
Step 2:Text is pre-processed, the date that will occur in text is converted to the date of canonical form;
Step 3:Construction regular expression, and the date expression formula in text is matched using regular expression, therefrom extract Go out to meet the date expression formula of regular expression form;
Step 4:For the date expression formula of different-format, year, month, day numeral therein is extracted respectively;
Step 5:By filling out the phase in odd-numbered day, the time lacked in completion text or month data;
Step 6:The year, month, day numeral that will identify that is combined into the complete date, and is stored with date format.
2. according to claim 1 from text the intelligent extraction date method, it is characterised in that:It is specific in the step 2 Comprise the following steps:
Step 21:" month ", " number " are used into matching regular expressions respectively, " moon " is converted to after matching, " day ";
Step 22:There is the date expression-form of incidence relation according to occurring in text and fill out the phase in odd-numbered day, the corresponding date is calculated;
Step 23:The numeral of Chinese expression is converted into Arabic numerals.
3. according to claim 2 from text the intelligent extraction date method, it is characterised in that:In the step 22 with The expression formula that filling out the phase in odd-numbered day is associated includes:" this year ", " current year ", " last year ", " upper one year ", " next year ", " next year ", " this month ", " next month ", " this month ", after above-mentioned date expression-form is matched, with reference to filling out Phase in odd-numbered day calculates the correspondence date.
4. the method on text intelligent extraction date according to claim 1, it is characterised in that:In the step 3, work as basis After regular expression to date match in text to succeeding, the date expression field that extraction is matched, and the field that this is matched Delete.
5. The method for intelligently extracting dates from text according to claim 1, characterized in that, in step 5, when the missing year and month data are completed, the completion is based on the year and month of the document filling date, and/or on a year or month that has an association relationship with the document filling date.
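A minimal sketch of this completion rule is shown below. The parameters `context_year` and `context_month` are an interpretation of "a year or month that has an association relationship with the document filling date" (for example, a year already resolved from a relative expression); the function name and signature are hypothetical.

```python
from datetime import date

def complete_date(year, month, day, filling_date,
                  context_year=None, context_month=None):
    """Fill a missing year or month, preferring an associated value if given."""
    if year is None:
        year = context_year if context_year is not None else filling_date.year
    if month is None:
        month = context_month if context_month is not None else filling_date.month
    return date(year, month, day)
```

For instance, a bare "3月5日" in a document filled on 31 December 2015 would be completed to 5 March 2015 under this rule.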
6. A device for intelligently extracting dates from text, characterized by comprising:
An acquisition module, for obtaining the text from which dates are to be extracted and the document filling date;
A preprocessing module, for converting the dates that appear in the text into dates in a canonical form;
A matching module, for constructing regular expressions and matching them against the date expressions in the text, extracting the date expressions that conform to the regular-expression format;
An extraction module, for extracting the year, month, and day numbers from date expressions of different formats;
A date completion module, for completing the year or month data missing from the text by using the document filling date;
A storage module, for combining the identified year, month, and day numbers into a complete date and storing it in date format.
7. The device for intelligently extracting dates from text according to claim 6, characterized in that the preprocessing module comprises:
A first preprocessing submodule, for matching the expressions "month" (月份) and "number" (号) with regular expressions and, after matching, converting them to "month" (月) and "day" (日) respectively;
A second preprocessing submodule, for calculating the corresponding date for date expressions in the text that have an association relationship with the document filling date;
A third preprocessing submodule, for converting numbers written in Chinese numerals into Arabic numerals.
8. The device for intelligently extracting dates from text according to claim 7, characterized in that the expressions associated with the document filling date in the second preprocessing submodule include: "this year", "the current year", "last year", "the previous year", "next year", "the following year", "this month", "next month", and "the current month"; after any of these date expressions is matched, the corresponding date is calculated with reference to the document filling date.
9. The device for intelligently extracting dates from text according to claim 6, characterized in that, after the matching module successfully matches a date in the text by a regular expression, it extracts the matched date expression field and deletes the matched field from the text.
10. The device for intelligently extracting dates from text according to claim 6, characterized in that, when the date completion module completes the missing year and month data, the completion is based on the year and month of the document filling date, and/or on a year or month that has an association relationship with the document filling date.
CN201511033057.2A 2015-12-31 2015-12-31 A kind of method and device on the intelligent extraction date from text Pending CN106933783A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511033057.2A CN106933783A (en) 2015-12-31 2015-12-31 A kind of method and device on the intelligent extraction date from text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511033057.2A CN106933783A (en) 2015-12-31 2015-12-31 A kind of method and device on the intelligent extraction date from text

Publications (1)

Publication Number Publication Date
CN106933783A true CN106933783A (en) 2017-07-07

Family

ID=59443734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511033057.2A Pending CN106933783A (en) 2015-12-31 2015-12-31 A kind of method and device on the intelligent extraction date from text

Country Status (1)

Country Link
CN (1) CN106933783A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933198A (en) * 2019-03-13 2019-06-25 广东小天才科技有限公司 A kind of method for recognizing semantics and device
CN113536732A (en) * 2021-06-24 2021-10-22 北京天健源达科技股份有限公司 Date and time data formatting display method applied to electronic medical record
CN114356972A (en) * 2021-12-03 2022-04-15 四川科瑞软件有限责任公司 Data processing method, and event time-based retrieval method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102103612A (en) * 2009-12-22 2011-06-22 北大方正集团有限公司 Information extraction method and device
CN102184204A (en) * 2011-04-28 2011-09-14 常州大学 Auto fill method and system of intelligent Web form
CN105183742A (en) * 2015-06-12 2015-12-23 南京富士通南大软件技术有限公司 Resume identification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
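Zhou Xiaojia (周小甲): "Research on Temporal Information Extraction from Chinese Medical Record Text" (中文病历文本的时间信息提取研究), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库信息科技辑》) *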

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170707