CN107943785A

CN107943785A - A big-data-based PDF document processing method and device

Info

Publication number: CN107943785A
Application number: CN201711080720.3A
Authority: CN
Inventors: 贾义动; 纪晓阳; 高峰
Original assignee: Guangdong Industry Kaiyuan Science And Technology Co Ltd
Current assignee: Guangdong Industry Kaiyuan Science And Technology Co Ltd
Priority date: 2017-11-06
Filing date: 2017-11-06
Publication date: 2018-04-20
Anticipated expiration: 2037-11-06
Also published as: CN107943785B

Abstract

The invention discloses a big-data-based PDF document processing method and device. The method includes: using deduplication and format conversion to build a multi-format document pool containing financial documents in multiple different document formats; using the regular expression rules of financial indicators, start feature indicators, and end feature indicators to locate and parse the financial documents in the different document formats, obtaining financial data together with the indicator name and time corresponding to the financial data; and verifying the financial data using the different parsing results corresponding to it. The device includes a memory for storing a program and a processor for loading the program and executing the big-data-based PDF document processing method. With the invention, financial data can be parsed and extracted quickly and accurately from financial documents in a variety of formats. As a big-data-based PDF document processing method and device, the invention can be widely applied in the field of big data parsing.

Description

A big-data-based PDF document processing method and device

Technical field

The present invention relates to big data processing technology, and in particular to a big-data-based PDF document processing method and device.

Background art

Explanation of technical terms:

Regular expression: a single character string used to describe and match a series of strings that conform to a given syntactic rule.

Balance sheet: the main accounting statement showing an enterprise's financial position (its assets, liabilities, and owners' equity) on a given date, usually the end of each accounting period.

Income statement: a statement reflecting an enterprise's operating results during a given accounting period.

Cash flow statement: a statement reflecting the inflows and outflows of an enterprise's cash and cash equivalents during a given accounting period.

In the field of corporate financial big data analysis, much financial data has to be extracted from documents such as the annual reports disclosed by companies or the prospectuses issued with bond offerings, and the extracted data must be highly accurate. At present, these financial documents are usually saved in PDF format, and some of the PDF documents are in picture format. How to parse and extract the financial data in these PDF documents automatically, quickly, and accurately is therefore of great significance to enterprises in reducing data acquisition costs and in improving data accuracy and processing efficiency.

Summary of the invention

To solve the above technical problem, the object of the present invention is to provide a big-data-based PDF document processing method, system, and device that can parse and extract financial data from multiple financial documents quickly and accurately.

The first technical solution of the present invention is a big-data-based PDF document processing method, comprising the following steps:

Using deduplication and format conversion, build a multi-format document pool, where the multi-format document pool contains financial documents in multiple different document formats;

Using the regular expression rules of financial indicators, start feature indicators, and end feature indicators, locate and parse the financial documents in the multiple different document formats to obtain financial data together with the indicator name and time corresponding to the financial data;

Using the different parsing results corresponding to the financial data, verify the financial data.

The second technical solution of the present invention is a big-data-based PDF document processing system, comprising:

A construction unit, configured to build a multi-format document pool using deduplication and format conversion, where the multi-format document pool contains financial documents in multiple different document formats;

A parsing unit, configured to locate and parse the financial documents in the multiple different document formats using the regular expression rules of financial indicators, start feature indicators, and end feature indicators, to obtain financial data together with the indicator name and time corresponding to the financial data;

A verification unit, configured to verify the financial data using the different parsing results corresponding to the financial data.

The third technical solution of the present invention is a big-data-based PDF document processing device, comprising:

At least one processor;

At least one memory, for storing at least one program;

When the at least one program is executed by the at least one processor, the at least one processor implements the big-data-based PDF document processing method described in the first technical solution above.

The beneficial effects of the method, system, and device of the present invention are as follows. After building a multi-format document pool using deduplication and format conversion, the invention uses the regular expression rules of financial indicators, start feature indicators, and end feature indicators to locate and parse the financial documents in the multiple different document formats, obtaining financial data together with the indicator name and time corresponding to the financial data. It then verifies the financial data using the different parsing results, from different document sources, that correspond to the financial data. With the invention, financial data can therefore be parsed and extracted from financial documents quickly and accurately, and the financial data obtained is highly accurate.

Brief description of the drawings

Fig. 1 is a flow chart of the steps of the big-data-based PDF document processing method of the present invention;

Fig. 2 is a structural block diagram of the big-data-based PDF document processing system of the present invention;

Fig. 3 is a flow chart of the steps of a specific embodiment of the big-data-based PDF document processing method of the present invention.

Embodiment

Embodiment 1

As shown in Fig. 1, a big-data-based PDF document processing method comprises the following steps:

As a further preferred embodiment of the method, the step of building a multi-format document pool using deduplication and format conversion specifically includes:

Using deduplication, build a document download link pool;

Using at least one PDF financial document download link contained in the document download link pool, download the corresponding at least one PDF financial document;

Using format conversion, convert the downloaded PDF financial documents into financial documents in different document formats, and then put the financial documents in the different document formats into the multi-format document pool.

As a further preferred embodiment of the method, the step of building a multi-format document pool using deduplication and format conversion also specifically includes:

Calculate the confidence value of each financial document in the multi-format document pool.

As a further preferred embodiment of the method, the step of locating and parsing the financial documents in the multiple different document formats using the regular expression rules of financial indicators, start feature indicators, and end feature indicators, to obtain financial data together with the indicator name and time corresponding to the financial data, specifically includes:

Using the regular expression rules of financial indicators, start feature indicators, and end feature indicators, locate the financial statements in the financial documents in the multiple different document formats;

After locating the data within the located financial statements, record the financial data together with the indicator name and time corresponding to the financial data;

Perform unit conversion on the financial data of numeric type.
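The three sub-steps above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the start and end feature strings, the row pattern `ROW_RULE`, the unit table `UNIT_FACTORS`, and both function names are assumptions made for the example, and English labels stand in for the Chinese statement text.

```python
import re

# Hypothetical start/end feature indicators for a balance sheet (assumptions).
START_FEATURE = "Consolidated Balance Sheet"
END_FEATURE = "Total liabilities and owners' equity"

# Hypothetical regular expression rule: an indicator name followed by
# one or more numeric columns.
ROW_RULE = re.compile(r"^(?P<name>[A-Za-z' ]+?)\s+(?P<values>-?\d[-\d,.\s]*)$")

# Hypothetical unit table; values are converted to yuan.
UNIT_FACTORS = {"yuan": 1, "thousand yuan": 1_000, "ten thousand yuan": 10_000}


def locate_statement(lines, start=START_FEATURE, end=END_FEATURE):
    """Return the lines from the start feature through the end feature."""
    begin = next(i for i, line in enumerate(lines) if start in line)
    stop = next(i for i in range(begin, len(lines)) if lines[i].startswith(end))
    return lines[begin:stop + 1]


def parse_statement(lines, periods, unit):
    """Return (indicator name, period, value in yuan) records."""
    factor = UNIT_FACTORS[unit]
    records = []
    for line in lines:
        match = ROW_RULE.match(line.strip())
        if not match:
            continue  # titles, unit lines, and other non-data rows
        values = [float(v.replace(",", "")) for v in match.group("values").split()]
        for period, value in zip(periods, values):
            records.append((match.group("name").strip(), period, value * factor))
    return records
```

Locating the statement first keeps the row pattern from matching unrelated tables elsewhere in the document, which is why the two functions are kept separate in this sketch.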

As a further preferred embodiment of the method, the following steps are provided before the step of locating and parsing the financial documents in the multiple different document formats using the regular expression rules of financial indicators, start feature indicators, and end feature indicators:

Build the regular expression rules of the financial indicators;

And/or

Obtain the start feature indicators and end feature indicators of the financial statements.

As a further preferred embodiment of the method, the step of verifying the financial data using the different parsing results corresponding to the financial data specifically includes:

Divide the different parsing results corresponding to the financial data into categories;

From the at least one category obtained by the division, select the category that meets a first preset condition as the correct category;

From the correct category, select the parsing result that meets a second preset condition as the correct data, and set the corresponding data confidence for the correct data;

Use the correct data as the verified financial data.

As a further preferred embodiment of the method, the step of selecting, from the at least one category obtained by the division, the category that meets the first preset condition as the correct category specifically includes:

When the division yields exactly one category, use that category as the correct category;

When the division yields at least two categories, select the correct category from them according to the number of parsing results contained in each category, the sum of the confidence values of the financial documents, and/or the issue time of the financial documents.
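A minimal sketch of this category selection is given below. The text lists three criteria joined by "and/or" without fixing their priority, so the ordering used here (result count, then confidence sum, then latest issue date) is an assumption, as are the `ParseResult` type, the grouping of results by identical value, and the function name.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ParseResult:
    value: float       # parsed financial figure
    confidence: float  # confidence value of the source document
    issued: str        # ISO date on which the source document was issued


def choose_correct_category(results):
    """Group identical values, then pick one group as the correct category."""
    categories = defaultdict(list)
    for result in results:
        categories[result.value].append(result)
    if len(categories) == 1:
        # Only one category: it is the correct category.
        return next(iter(categories.values()))
    # Assumed priority: more results, then higher confidence sum, then newer.
    return max(
        categories.values(),
        key=lambda cat: (
            len(cat),
            sum(r.confidence for r in cat),
            max(r.issued for r in cat),
        ),
    )
```

Comparing a tuple keeps the three criteria in one place, so a different priority order only requires reordering the tuple.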

As a further preferred embodiment of the method, the method further includes a document parsing correction and optimization step, which specifically includes the following steps:

Calculate the data parsing accuracy from the different parsing results corresponding to the financial data and the correct data;

When the calculated data parsing accuracy is below a threshold, correct and optimize the document parsing process according to a preset correction and optimization strategy until the calculated data parsing accuracy is greater than or equal to the threshold.
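The correction loop can be sketched as follows. The text does not define how the accuracy is computed, so this sketch assumes it is the share of parsing results that equal the verified value. The callables `run_parse` and `apply_correction` are hypothetical placeholders for the parsing process and the preset correction strategy, and the `max_rounds` guard is an addition so that the example always terminates.

```python
def parsing_accuracy(parse_results, correct_value):
    """Share of parsing results that agree with the verified value (assumed definition)."""
    if not parse_results:
        return 0.0
    agreeing = sum(1 for value in parse_results if value == correct_value)
    return agreeing / len(parse_results)


def refine(run_parse, apply_correction, threshold, max_rounds=10):
    """Re-run parsing after each correction until accuracy reaches the threshold.

    run_parse() returns (parse_results, correct_value); apply_correction()
    adjusts the parsing process. Both are hypothetical callables.
    Returns (last accuracy, number of corrections applied).
    """
    accuracy = 0.0
    for round_no in range(max_rounds):
        results, correct = run_parse()
        accuracy = parsing_accuracy(results, correct)
        if accuracy >= threshold:
            return accuracy, round_no
        apply_correction()
    return accuracy, max_rounds
```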

As a further preferred embodiment of the method, the step of building a multi-format document pool using deduplication and format conversion, and/or the step of locating and parsing the financial documents in the multiple different document formats using the regular expression rules of financial indicators, start feature indicators, and end feature indicators to obtain financial data together with the corresponding indicator name and time, is performed in a distributed processing mode;

And/or

The multi-format document pool is stored on a distributed storage server.

Embodiment 2

As shown in Fig. 2, a system corresponding to the above method, namely a big-data-based PDF document processing system, includes:

A verification unit, configured to verify the financial data using the different parsing results corresponding to the financial data. The construction unit, parsing unit, and/or verification unit may each be a program module, a hardware module, or a device module combining software and hardware.

The content of the above method embodiment applies to this system embodiment. The steps implemented by this system embodiment are the same as those of the method embodiment, and the beneficial effects achieved are also the same.

Embodiment 3

A device corresponding to the above method, namely a big-data-based PDF document processing device, includes:

At least one processor;

At least one memory, for storing at least one program;

When the at least one program is executed by the at least one processor, the at least one processor implements the big-data-based PDF document processing method described in the method embodiment above.

The content of the above method embodiment applies to this device embodiment. The steps implemented by this device embodiment are the same as those of the method embodiment, and the beneficial effects achieved are also the same.

Embodiment 4

As shown in Fig. 3, this embodiment provides a big-data-based PDF document processing method whose specific steps are as follows.

Step (1): Using deduplication and format conversion, build a multi-format document pool, where the multi-format document pool contains financial documents in multiple different document formats.

Specifically, step (1) implements a complete PDF document preprocessing flow.

Preferably, step (1) specifically includes the following steps.

S101. Using deduplication, build a document download link pool.

Specifically, step S101 is mainly used to collect and deduplicate the PDF financial document download links after they have been crawled from public information on the network, so as to form a document download link pool that is comprehensive, free of duplicates, and up to date. Here, a PDF financial document is a financial document whose document format is PDF.

Preferably, step S101 specifically includes the following steps:

S1011. Crawl the required PDF financial document download links;

Specifically, crawl the required PDF financial document download links from multiple website channels, so that the required PDF financial documents are covered as comprehensively as possible;

S1012. Using the simhash algorithm, compute the simhash code of the PDF financial document title corresponding to each crawled PDF financial document download link, thereby obtaining the simhash code of each PDF financial document title;

Preferably, step S1012 specifically includes the following steps:

S10121. Collect the crawled PDF financial document download links into a single document download link pool;

S10122. Compute the simhash code of the PDF financial document title corresponding to each PDF financial document download link in the document download link pool, that is, of the title corresponding to each crawled download link, thereby obtaining the simhash code of each PDF financial document title;

S1013. According to the simhash codes of the PDF financial document titles, classify the PDF financial document download links, where download links classified into the same type correspond to the same PDF financial document;

Preferably, the classification used in step S1013 is based on the Hamming distance between the simhash codes of the document titles. That is, this step is specifically: compute the Hamming distances between the simhash codes of the PDF financial document titles, and classify the PDF financial document download links according to the computed Hamming distances;

Preferably, step S1013 above includes the following steps:

Compute the Hamming distance between the simhash codes of any two PDF financial document titles;

When the computed Hamming distance between the simhash codes of two PDF financial document titles is less than a first threshold n, the PDF financial document download links corresponding to the two titles are judged to belong to the same type, i.e. they correspond to the same PDF financial document;

When the computed Hamming distance between the simhash codes of two PDF financial document titles is greater than or equal to the first threshold n, the PDF financial document download links corresponding to the two titles are judged not to belong to the same type, i.e. they correspond to different PDF financial documents. The threshold n in this step is a distance threshold;

Apply the above calculation and judgment to the simhash codes of all PDF financial document titles until all PDF financial document download links have been classified;

It can be seen that, after classification, each type contains at least one PDF financial document download link. A type is equivalent to a set, and each set contains at least one PDF financial document download link;
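The title-based classification of steps S1012 and S1013 can be sketched as below. Several details are assumptions made for the example: titles are tokenized into character bigrams, each bigram is hashed with MD5 truncated to 64 bits, the default threshold `n=3` is a placeholder, and each new title is compared only against the first member of each existing group rather than against every other title.

```python
import hashlib


def simhash(title, bits=64):
    """Compute a simhash code over the character bigrams of a title."""
    weights = [0] * bits
    grams = [title[i:i + 2] for i in range(max(len(title) - 1, 1))]
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def group_links(links, n=3):
    """Classify (url, title) pairs; titles closer than n bits share a group."""
    groups = []  # each entry: (code of first member, list of links)
    for url, title in links:
        code = simhash(title)
        for representative, members in groups:
            if hamming(code, representative) < n:
                members.append((url, title))
                break
        else:
            groups.append((code, [(url, title)]))
    return [members for _, members in groups]
```

Each returned list corresponds to one type in the sense of step S1013, so every list contains at least one download link.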

S1014. Using the crawl timestamps of the PDF financial document download links, select from the download links contained in each type the one with the largest crawl timestamp. For the crawl timestamp, a smaller value represents an earlier time and a larger value represents a later time;

Specifically, in the document download link pool, if the same document corresponds to two or more PDF financial document download links, that is, if one type contains two or more download links, the crawl timestamps of those download links are compared, the download links with smaller crawl timestamps are deleted, and only the newest download link is kept. In other words, according to the crawl timestamps, the download link with the largest crawl timestamp is selected from the two or more download links belonging to the same type and is retained;

If the same document corresponds to only one PDF financial document download link, that is, if a type contains only one download link, that download link is itself the selected download link with the largest crawl timestamp;

S1015. Store all the selected PDF financial document download links in the document download link pool. The document download link pool at this point is the required document download link pool.
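Steps S1014 and S1015 amount to keeping, for each type, the link with the largest crawl timestamp. A minimal sketch, assuming each link is a `(url, crawl_timestamp)` pair and each type is a list of such pairs (both the data layout and the function name are illustrative):

```python
def newest_links(groups):
    """Keep only the link with the largest crawl timestamp in each group.

    groups: list of lists of (url, crawl_timestamp) pairs, one list per type.
    Returns the resulting download link pool as a flat list.
    """
    return [max(group, key=lambda link: link[1]) for group in groups]
```

A group with a single link is returned unchanged, which matches the single-link case described above.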

In addition, for step S10121 above, the crawled PDF financial document download links may instead first be collected in another preset location. After the above processing has been applied to them and the download links with the largest crawl timestamps have been screened out, the screened download links are then stored in the document download link pool.

S102. Using at least one PDF financial document download link contained in the document download link pool, download the corresponding at least one PDF financial document. Generally, one PDF financial document download link corresponds to the download of one PDF financial document.

S103. Using format conversion, convert the downloaded PDF financial documents into financial documents in different document formats, and then put the financial documents in the different document formats into the multi-format document pool.

Preferably, the step of using format conversion to convert the downloaded PDF financial documents into financial documents in different document formats includes the following steps:

Determine whether the downloaded PDF financial document is a picture-format PDF financial document. If so, use the first conversion mode to convert the downloaded PDF financial document into documents in different formats; otherwise, use the second conversion mode to convert the downloaded PDF financial document into documents in different formats.

Preferably, the step of determining whether the downloaded PDF financial document is a picture-format PDF financial document includes the following steps:

S1031. Convert the downloaded PDF financial document into a document in a preset format;

Specifically, in this embodiment the preset format is TXT. That is, this step is specifically: using a preset PDF-to-TXT tool, convert the downloaded PDF financial document into a TXT financial document, i.e. a financial document whose document format is TXT;

The preset PDF-to-TXT tool used above is only suitable for processing PDF documents that are not in picture format. When it is used to convert a picture-format PDF document, the resulting TXT document therefore contains garbled characters, so calculating and judging the garbled-character rate of the TXT document makes it possible to determine whether the PDF document is a picture-format document;

S1032. Calculate the garbled-character rate of the preset-format document, i.e. the garbled-character rate of the TXT financial document obtained by the above conversion;

Specifically, to calculate the garbled-character rate of a document, the characters of the document must first be parsed, and the result of the parsing is then used to calculate the garbled-character rate;

Preferably, step S1032 includes:

S10321. Randomly select s characters from the TXT financial document;

S10322. Match each selected character one by one against the pre-stored characters in a preset dictionary library to obtain the number of matched characters p;

Specifically, when a selected character matches a pre-stored character in the preset dictionary library, the match count is incremented by 1. After all the selected characters have been matched in this way, the resulting total is the number of matches p. The preset dictionary library stores pre-stored characters such as commonly used simplified and traditional Chinese characters, digits, letters, and common special characters;

S10323. Calculate the garbled-character rate r1 of the TXT financial document using the following formula:

r1 = (s - p) / s

S1033. Determine whether the calculated garbled-character rate r1 is less than or equal to a second threshold. If so, the downloaded PDF financial document is a PDF financial document that is not in picture format; otherwise, the downloaded PDF financial document is a picture-format PDF financial document. The threshold here is the garbled-character rate threshold;

Specifically, testing and statistics on a certain number of samples show that a TXT financial document whose garbled-character rate is less than or equal to the second threshold has a probability of more than 95% of coming from a PDF financial document that is not in picture format. The garbled-character rate of the TXT financial document is therefore used to judge whether the corresponding PDF financial document is in picture format. That is, for any PDF financial document, if the garbled-character rate of the TXT financial document obtained by converting it is less than or equal to the second threshold, the PDF financial document is judged not to be in picture format; otherwise, it is judged to be in picture format.
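Steps S10321 to S1033 can be sketched as follows. The dictionary contents, the sample size, the default threshold of 0.05, and the treatment of an empty conversion result as fully garbled are all assumptions made for the example; the text gives no value for the second threshold.

```python
import random
import string

# Hypothetical stand-in for the preset dictionary library, which is described
# as holding common Chinese characters, digits, letters, and special characters.
DICTIONARY = set(string.ascii_letters + string.digits + string.punctuation + " \n")
DICTIONARY |= set("资产负债表利润现金流量合计")


def garbled_rate(text, s=1000):
    """Compute r1 = (s - p) / s over up to s randomly selected characters."""
    if not text:
        return 1.0  # assumption: an empty conversion result counts as fully garbled
    sample = random.sample(text, min(s, len(text)))
    p = sum(1 for ch in sample if ch in DICTIONARY)  # matched characters
    return (len(sample) - p) / len(sample)


def is_picture_format(txt_text, threshold=0.05):
    """A rate above the threshold marks the source PDF as picture format."""
    return garbled_rate(txt_text) > threshold
```

Sampling a fixed number of characters keeps the check cheap on long annual reports, at the cost of a small sampling error near the threshold.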

Preferably, the step of using the first conversion mode to convert the downloaded PDF financial document into documents in different formats specifically includes:

S1034. When the downloaded PDF financial document is judged to be a picture-format PDF financial document, use the corresponding first conversion module, calling an OCR text recognition module as an aid, to convert the PDF financial document into a corresponding non-picture-format PDF financial document, WORD financial document, and EXCEL financial document. Then put the PDF-format version of this financial document (the non-picture-format PDF financial document obtained by the conversion), the WORD-format version (the WORD financial document obtained by the conversion), and the EXCEL-format version (the EXCEL financial document obtained by the conversion) into the document pools of the corresponding formats, namely the PDF document pool, the WORD document pool, and the EXCEL document pool;

Preferably, the step of using the second conversion mode to convert the downloaded PDF financial document into documents in different formats specifically includes:

S1035. When the downloaded PDF financial document is judged not to be a picture-format PDF financial document, use the corresponding second conversion module to convert the PDF financial document into a corresponding WORD financial document and EXCEL financial document. Then put the PDF-format version of this financial document (the PDF financial document that is not in picture format), the WORD-format version (the WORD financial document obtained by the conversion), the EXCEL-format version (the EXCEL financial document obtained by the conversion), and the TXT-format version of this financial document obtained in step S1031 above (i.e. the TXT financial document) into the document pools of the corresponding formats, namely the PDF document pool, the WORD document pool, the EXCEL document pool, and the TXT document pool. It follows that if the preset-format document selected in step S1031 is not a TXT document but a document in some other format, step S1035 must additionally include a step of using a preset document format conversion tool to convert the non-picture-format PDF financial document into a TXT financial document.
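The routing between the two conversion modes can be sketched as below. The converters are passed in as hypothetical callables (`ocr_pdf`, `to_word`, `to_excel`), since the text does not name any concrete conversion tool. Producing the WORD and EXCEL versions from the OCR output rather than from the original scan is also an assumption of this sketch.

```python
def convert_and_pool(pdf, txt, is_picture, converters, pools):
    """Route one downloaded PDF into the per-format document pools.

    converters: dict of hypothetical callables 'ocr_pdf', 'to_word', 'to_excel'.
    pools: dict mapping a format name to the list of documents in that pool.
    """
    if is_picture:
        # First conversion mode: OCR-assisted, no usable TXT version is kept.
        pdf = converters["ocr_pdf"](pdf)
        formats = {"PDF": pdf}
    else:
        # Second conversion mode: reuse the TXT document produced in S1031.
        formats = {"PDF": pdf, "TXT": txt}
    formats["WORD"] = converters["to_word"](pdf)
    formats["EXCEL"] = converters["to_excel"](pdf)
    for fmt, doc in formats.items():
        pools.setdefault(fmt, []).append(doc)
    return formats
```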

S104. Calculate the confidence value of each financial document in the multi-format document pool. The confidence value is specifically a confidence score.

Specifically, for each financial document in the document pools of the different formats (i.e. the multi-format document pool), the confidence score of the financial document can be calculated from its garbled-character rate, so that the quality of the document conversion can be judged quickly from the confidence score.

Preferably, step S104 includes:

S1041. For each financial document in the PDF document pool, the WORD document pool, and the EXCEL document pool, calculate the garbled-character rate of each PDF, WORD, and EXCEL financial document using the calculation of step S10323 above. For TXT financial documents, the garbled-character rate obtained in step S1032 can be used directly. Likewise, if the preset-format document selected in step S1031 is not a TXT document but a document in some other format, step S1041 must include a step of calculating the garbled-character rate of each TXT financial document using the calculation of step S10323;

S1042. From the garbled-character rate of each financial document, calculate the confidence score of each financial document using the following formula:

k = (1 - r) * λ

where k is the confidence score of the financial document, r is the garbled-character rate of the financial document, and λ is an empirical value;

S1043. Record and store the confidence score k corresponding to each financial document.
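The confidence score of step S1042 is a one-line computation. In the sketch below, the default `lam=100.0` is only a placeholder, since the text says λ is an empirical value without giving it, and the range check is an addition for the example.

```python
def confidence_score(garbled_rate, lam=100.0):
    """Compute k = (1 - r) * lambda for one financial document.

    garbled_rate: r, the garbled-character rate in [0, 1].
    lam: the empirical value lambda (100.0 is a placeholder, not from the source).
    """
    if not 0.0 <= garbled_rate <= 1.0:
        raise ValueError("garbled_rate must lie in [0, 1]")
    return (1.0 - garbled_rate) * lam
```

With this placeholder λ, a cleanly converted document (r = 0) scores 100 and a fully garbled one (r = 1) scores 0, so a lower score signals a poorer conversion.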

It follows from the above that, in the multi-format document pool, one financial document corresponds to documents in several different formats. The documents in the multi-format document pool serve as the data source for the subsequent data parsing, and the confidence scores of the documents provide a basis for judgment in the subsequent verification.

Step (2): Using the regular expression rules of financial indicators, start feature indicators, and end feature indicators, locate and parse the financial documents in the multiple different document formats to obtain financial data together with the indicator name and time corresponding to the financial data.

Preferably, step (2) specifically includes the following steps.

S201, the regular expression rule for building financial index.

Preferably, the step S201 is specifically included：

S2011, obtain title storehouse, and the title of financial index is stored with the title storehouse；

Specifically, this step S2011 is preferably by the Ministry of Finance《Accounting standards for enterprises》In index name as finance The title storehouse of index dictionary, that is to say, that with the Ministry of Finance《Accounting standards for enterprises》In index name refer to as finance Target title, and the title of these financial index is stored in the title storehouse；

S2012, build an actual name library, in which the actual names of financial indexes extracted from financial documents are stored;

Specifically, step S2012 is implemented as follows: first, several financial documents containing actual financial statements are randomly selected from the multi-format document pool; then the actual name library of the financial index dictionary is built from the wording used for financial indexes in these documents. That is, the wording used to express a financial index in a financial document containing an actual financial statement is taken as an actual name of that financial index, and these actual names are stored in the actual name library;

S2013, establish the name mapping relations between the actual names in the actual name library and the standard names in the standard name library;

Specifically, any financial index may have one or more different actual names in practical use, so the standard name corresponding to each actual name in the actual name library must be determined first. During this determination, if the financial index corresponding to an actual name in the actual name library has a corresponding name in the standard name library, that name is the standard name of the financial index. If it has no corresponding name in the standard name library, the occurrence frequency of every actual name of the financial index is counted, the actual name with the highest frequency is taken as the standard name of the financial index, and this standard name is added to the standard name library. For example, suppose the actual names of a financial index are a1, a2 and a3, and in the randomly selected documents a1 occurs 10 times, a2 occurs 8 times and a3 occurs 2 times; a1 is then taken as the standard name of the financial index and added to the standard name library. The name mapping relations between the actual names in the actual name library and the standard names in the standard name library can then be established from the correspondence between them. Accordingly, step S2013 preferably includes the following steps:

S20131, determine whether the financial index corresponding to an actual name in the actual name library has a corresponding name in the standard name library; if so, take that name as the standard name of the financial index; otherwise, count the occurrence frequency of each actual name of the financial index, take the actual name with the highest frequency as the standard name of the financial index, and add this standard name to the standard name library;

S20132, once the financial index corresponding to every actual name in the actual name library has a corresponding standard name in the standard name library, establish the name mapping relations between the actual names in the actual name library and the standard names in the standard name library according to the financial index they share. For example, if the actual names of financial index A are a1, a2 and a3 and the standard name of financial index A is b1, a mapping from each of the actual names a1, a2 and a3 to the standard name b1 is established for financial index A;

S2014, build the regular expression rules of the financial indexes according to the name mapping relations;

Specifically, a regular expression rule is formulated for each financial index according to the name mapping relations between its actual names and its standard name. The regular expression rule of a financial index is a regular-expression-based recognition rule for identifying that financial index. With these rules, the financial index names in a massive number of financial documents can be identified quickly and accurately.
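As a non-limiting illustration, steps S2013 and S2014 can be sketched as follows. The helper names `choose_standard_name` and `build_index_pattern`, the sample names, and the simplification that an index "has a corresponding name" when one of its actual names already appears in the standard name library are assumptions of this sketch; the source does not specify how the rules are encoded.

```python
import re
from collections import Counter


def choose_standard_name(actual_names, occurrences, standard_names):
    """Return the standard name for one index, following step S20131."""
    for name in actual_names:
        if name in standard_names:          # already covered by the library
            return name
    # No match: promote the most frequent actual name and store it.
    counts = Counter({name: occurrences.get(name, 0) for name in actual_names})
    best = counts.most_common(1)[0][0]
    standard_names.add(best)
    return best


def build_index_pattern(name_map):
    """Compile one pattern that matches any known actual name."""
    # Longer names first, so a short name cannot shadow a longer one.
    alternatives = sorted(name_map, key=len, reverse=True)
    return re.compile("|".join(re.escape(name) for name in alternatives))


standard_names = {"Total assets"}
groups = [
    (["Total assets", "Assets total"], {}),
    (["a1", "a2", "a3"], {"a1": 10, "a2": 8, "a3": 2}),
]

# S20132: map every actual name to the standard name of its index.
name_map = {}
for actual_names, occurrences in groups:
    standard = choose_standard_name(actual_names, occurrences, standard_names)
    for name in actual_names:
        name_map[name] = standard

pattern = build_index_pattern(name_map)
hit = pattern.search("Assets total   1,200.50")
print(name_map[hit.group(0)])
```

Escaping each name with `re.escape` keeps punctuation in index names from being read as regular expression syntax.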

S202, obtain the start feature indexes and the end feature indexes of the financial statements.

Specifically, step S202 preferably includes:

S2021, extract multiple financial documents containing financial statements;

Specifically, the financial statements are of three major types: balance sheet, income statement and cash flow statement. For each type, several financial documents containing an actual financial statement of that type are therefore randomly selected; for example, for the balance sheet type, several financial documents containing an actual balance sheet are randomly selected;

S2022, extract indexes from the starting content of the financial statement in each financial document; then sort the extracted indexes in descending order of occurrence frequency (i.e. number of occurrences) and select the first m1 indexes to build the start feature index list. For example, if the indexes extracted from these financial documents are q1, q2 and q3, occurring 7, 8 and 4 times respectively, and the first 2 indexes are selected, the resulting start feature index list contains indexes q2 and q1;

Specifically, step S2022 yields a start feature index list for each type of financial statement. For example, in step S2021 several financial documents containing actual balance sheets are randomly selected; indexes are then extracted from the starting content of the balance sheet in each of these documents, the extracted indexes are sorted in descending order of occurrence frequency (i.e. number of occurrences), and the first m1 indexes are selected to build the start feature index list for the balance sheet type. The start feature index lists for the income statement and the cash flow statement are built in the same way. The start feature index lists so built are the start feature indexes of the financial statements that are to be obtained;

S2023, extract indexes from the ending content of the financial statement in each financial document; then sort the extracted indexes in descending order of occurrence frequency and select the first m2 indexes to build the end feature index list;

Specifically, the end feature index list of step S2023 is built in a manner similar to the start feature index list above and is not described again here. Step S2023 thus yields the end feature index lists for the three major types of financial statement: balance sheet, income statement and cash flow statement. The end feature index lists so built are the end feature indexes of the financial statements that are to be obtained.

Preferably, m1 and m2 have the same value.
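As a non-limiting illustration, the frequency ranking of steps S2022 and S2023 can be sketched as follows. The function name `build_feature_list` and the sample data, arranged so that q1, q2 and q3 occur 7, 8 and 4 times as in the example of step S2022, are assumptions of this sketch.

```python
from collections import Counter


def build_feature_list(extracted_per_document, m):
    """Return the m most frequent indexes across the sampled documents."""
    counts = Counter()
    for indexes in extracted_per_document:
        counts.update(indexes)
    # most_common sorts by descending count, as required by S2022.
    return [name for name, _ in counts.most_common(m)]


# Eight sampled documents: q2 appears in all 8, q1 in 7 and q3 in 4.
sampled = [
    ["q2"] + (["q1"] if i < 7 else []) + (["q3"] if i < 4 else [])
    for i in range(8)
]
start_features = build_feature_list(sampled, m=2)
```

The same function serves for the end feature index list when it is fed the indexes extracted from the ending content and m2 in place of m1.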

S203, using the regular expression rules of the financial indexes, the start feature indexes and the end feature indexes, locate the financial statements in the financial documents of the multiple different document formats in the document pool.

Preferably, step S203 includes:

S2031, assign a corresponding ID to each financial statement table in the financial document according to the order in which the tables appear;

Specifically, for each financial document, incrementing IDs are assigned to the financial statement tables in the order in which they appear in the document, so that the ID of a financial statement table indicates its position in the financial document. For example, the tables are assigned the IDs ID₁、ID₂、ID₃、ID₄、ID₅、……、ID_K in order of appearance; the table with ID₁ thus appears before the table with ID₂ in the document;

S2032, using the regular expression rules of the financial indexes, the start feature indexes and the end feature indexes, analyse each financial statement table in the financial document to locate the start table and the end table, where the start table is the financial statement table at the start position of a financial statement and the end table is the financial statement table at the end position of a financial statement;

Specifically, step S2032 includes:

S20320, set the start flags of the financial statement tables for the three major types (balance sheet, income statement and cash flow statement) to asset_begin_sign, profit_begin_sign and cash_begin_sign respectively, each with an initial value of False;

S20321, when the start flags are False, use the regular expression rules of the financial indexes to identify and extract n1 indexes from the starting content of the current financial statement table; match the n1 extracted indexes against the indexes in the start feature index list to obtain a first matching rate; when the first matching rate is greater than a third threshold, determine that the current financial statement table is a start table;

Specifically, when the start flags begin_sign of all three major types are False, the regular expression rules of the financial indexes are used to identify and extract n1 indexes (n1 < m1) from the starting content of the current financial statement table. The n1 extracted indexes are then matched against the start feature index list of each of the three major types obtained in step S202. If the matching rate with one of these lists (say list a) is higher than the third threshold, the table is considered the start table of the financial statement type corresponding to list a, for example the start table of the balance sheet. The ID of the table is recorded as the start table ID of that financial statement type, and the start flag begin_sign of that type is set to True; for example, if the table is the start table of the balance sheet, asset_begin_sign is set to True;

S20322, when a start flag is True, use the regular expression rules of the financial indexes to identify and extract n2 indexes from the ending content of the current financial statement table; match the n2 extracted indexes against the indexes in the end feature index list to obtain a second matching rate; when the second matching rate is greater than a fourth threshold, determine that the current financial statement table is an end table;

Specifically, when one of the start flags begin_sign of the three major types is True, the regular expression rules of the financial indexes are used to identify and extract n2 indexes (n2 < m2) from the ending content of the current financial statement table. The n2 extracted indexes are then matched against the end feature index list of each of the three major types obtained in step S202. If the matching rate with one of these lists (say list b) is higher than the fourth threshold, the table is considered the end table of the financial statement type corresponding to list b, for example the end table of the balance sheet. The ID of the table is recorded as the end table ID of that financial statement type, and the start flag begin_sign of that type is set to False;

S20323, according to the ID of the start table and the ID of the end table of a financial statement, mark the data in the financial statement tables corresponding to all IDs between the two (including the ID of the start table and the ID of the end table) as financial data of that financial statement type;

Specifically, if the table with ID₁ is the start table of a financial statement and the table with ID₅ is its end table, the data in the 5 financial statement tables with IDs ID₁ to ID₅ are marked as financial data of that financial statement type, for example as financial data of the balance sheet.

Preferably, n1 and n2 have the same value.
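As a non-limiting illustration, the flag-driven locating of steps S20320 to S20323 can be sketched as follows. Several points are assumptions of this sketch and are not stated in the source: the matching rate is taken as the fraction of extracted indexes found in the feature list, table IDs are consecutive integers, a table that has just been marked as a start table is also tested as an end table, and both thresholds are 0.6. The names `match_rate` and `locate_statements` are hypothetical.

```python
def match_rate(extracted, feature_list):
    """Fraction of extracted indexes present in the feature list (assumed)."""
    if not extracted:
        return 0.0
    return sum(1 for name in extracted if name in feature_list) / len(extracted)


def locate_statements(tables, start_features, end_features, t3=0.6, t4=0.6):
    """tables: (table_id, head_indexes, tail_indexes) in document order.

    Returns {statement_type: [table IDs from start table to end table]}.
    """
    begin_sign = {kind: False for kind in start_features}   # S20320
    start_id = {}
    located = {}
    for table_id, head, tail in tables:
        if not any(begin_sign.values()):                     # S20321
            for kind, features in start_features.items():
                if match_rate(head, features) > t3:
                    begin_sign[kind] = True
                    start_id[kind] = table_id
                    break
        for kind, features in end_features.items():          # S20322
            if begin_sign[kind] and match_rate(tail, features) > t4:
                # S20323: every table from start to end belongs to this type.
                located[kind] = list(range(start_id[kind], table_id + 1))
                begin_sign[kind] = False
    return located


start_features = {"balance_sheet": ["Cash", "Receivables", "Inventory"]}
end_features = {"balance_sheet": ["Total liabilities", "Total equity"]}
tables = [
    (1, ["Revenue breakdown", "Region"], ["Other"]),
    (2, ["Cash", "Receivables"], ["Fixed assets"]),
    (3, ["Goodwill"], ["Total assets"]),
    (4, ["Short-term loans"], ["Total liabilities", "Total equity"]),
    (5, ["Cash", "Note"], ["Other"]),
]
located = locate_statements(tables, start_features, end_features)
```

In this sample, table 5 is not mistaken for a new start table because only one of its two extracted indexes matches, which leaves the matching rate below the threshold.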

S204, locate the data in the financial statements obtained by the locating processing, and record the financial data together with the index name and time corresponding to each financial datum.

Specifically, each financial datum in a financial statement is uniquely determined by its index name and time. Preferably, step S204 includes:

S2041, establish a first mapping relation between the index names corresponding to the financial data in the financial statement table and the row numbers in which they lie; that is, the first mapping relation is the mapping between the index name corresponding to a financial datum and the row number of that datum;

If the financial index corresponding to the index name of a financial datum has no corresponding standard name in the standard name library, the index name is added to both the actual name library and the standard name library, and a regular expression rule for this financial index is added;

S2042, establish a second mapping relation between the time information corresponding to the financial data in the financial statement table and the column numbers in which they lie; that is, the second mapping relation is the mapping between the time information corresponding to a financial datum and the column number of that datum;

S2043, using the row and column numbers of the financial data in the financial statement table together with the first mapping relation and the second mapping relation, record the financial data and the index name and time corresponding to each financial datum;

Specifically, the index name and time of each financial datum can be determined from its row and column numbers through the two mapping relations "row number - index name" and "column number - time information", so that the datum can be recorded together with its index name and time.
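As a non-limiting illustration, steps S2041 to S2043 can be sketched as follows. The sketch assumes a table held as a list of rows in which row 0 carries the time labels and column 0 carries the index names; this layout, the function name `record_cells` and the sample figures are assumptions and are not taken from the source.

```python
def record_cells(grid):
    """Record every datum with its index name and time."""
    # First mapping relation: row number -> index name.
    row_to_name = {r: row[0] for r, row in enumerate(grid) if r > 0}
    # Second mapping relation: column number -> time information.
    col_to_time = {c: label for c, label in enumerate(grid[0]) if c > 0}
    records = []
    for r, name in row_to_name.items():
        for c, period in col_to_time.items():
            records.append({"index": name, "time": period, "value": grid[r][c]})
    return records


grid = [
    ["Item", "2016", "2015"],
    ["Total assets", "1,200.50", "1,100.00"],
    ["Total liabilities", "700.00", "650.25"],
]
records = record_cells(grid)
```

The values are kept as raw strings at this point, because unit conversion is a separate step.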

S205, perform unit conversion on the financial data of numeric type.

Specifically, the financial data recorded in step S204 above are raw data (the data as presented in the document). Data of numeric type (numeric data for short) must additionally undergo unit conversion to yield their actual values. Step S205 therefore preferably includes:

S2050, build the regular expression rules of unit information;

Specifically, several financial documents are randomly selected, the ways in which unit information is expressed in them are analysed, and regular expression rules of unit information are established for these different expressions;

S2051, identify the unit information of the financial statement using the regular expression rules of unit information;

Specifically, step S2051 preferably includes:

S20511, traverse the financial data in the financial statement table and use the regular expression rules of unit information to determine whether the table contains unit information. If so, the unit information identified in the table is taken as the unit information of the financial statement. Otherwise, after the table title of the financial statement has been identified using the regular expression rules, unit information is searched for in the n3 characters following the identified table title, and the unit information found nearest to the table title is taken as the unit information of the financial statement;

S2052, according to the unit information identified in the above step, convert the numeric financial data so that they are expressed with yuan as the monetary unit, and record the converted data in place of the data before conversion.
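As a non-limiting illustration, steps S2051 and S2052 can be sketched as follows. The unit expressions covered by `UNIT_PATTERN`, the conversion factors in `UNIT_FACTORS`, the function names and the default n3 = 50 are assumptions of this sketch; the source leaves the actual unit rules to be derived from sampled documents.

```python
import re
from decimal import Decimal

# Hypothetical unit expressions and their factors relative to yuan.
UNIT_FACTORS = {
    "元": Decimal(1),
    "千元": Decimal(1000),
    "万元": Decimal(10000),
    "百万元": Decimal(10**6),
    "亿元": Decimal(10**8),
}
UNIT_PATTERN = re.compile(r"单位\s*[:：]\s*(?:人民币)?\s*(百万元|千元|万元|亿元|元)")


def find_unit(table_text, text_after_title, n3=50):
    """Look for a unit in the table first, then in the n3 characters after the title."""
    match = UNIT_PATTERN.search(table_text)
    if match is None:
        # search() returns the leftmost hit, i.e. the one nearest the title.
        match = UNIT_PATTERN.search(text_after_title[:n3])
    return match.group(1) if match else None


def to_yuan(raw_value, unit):
    """Convert a raw numeric string to yuan."""
    return Decimal(raw_value.replace(",", "")) * UNIT_FACTORS[unit]


unit = find_unit("项目 2016 2015", "单位：万元 编制单位：A公司")
value = to_yuan("1,200.50", unit)
```

`Decimal` is used here instead of floating point so that monetary amounts are not affected by binary rounding.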

Step (3): verify the financial data using the different parsing results corresponding to the financial data.

In the document pool, each financial document corresponds to documents in multiple different formats, such as a financial document in PDF format, a financial document in WORD format, a financial document in EXCEL format and a financial document in TXT format. Through step (2) above, one financial datum can therefore yield parsing results from documents of several formats. For example, suppose a financial document a contains a financial datum b, and in the document pool a corresponds to the WORD document a_d, the EXCEL document a_e and the TXT document a_t. After a_d、a_e and a_t are parsed by step (2) above, three parsing results b_d、b_e and b_t are obtained for the financial datum b. The reason is as follows: although documents of all formats are parsed by the method of step (2), the details of the code differ, because the actual data storage form differs between formats, so the parsed data may differ. In this embodiment, moreover, a separate parsing program module is configured and/or separate auxiliary components are called for the documents of each format. Parsing the same financial datum in documents of different formats with independent parsing program modules therefore yields mutually independent parsing results: if an erroneous result arises for the document of one format because of a code error, the parsing result for the document of another format is not affected.

In addition, if a financial document is a prospectus or a tracking rating report, an index in the document often carries data for more than one period. For example, the 2016 prospectus of company A may contain financial data for 2013, 2014, 2015 and a certain quarter of 2016, while the 2015 prospectus of company A may contain financial data for 2012, 2013, 2014 and a certain quarter of 2015. The financial datum for the total assets of company A in 2013 can thus come from two different document sources, the 2015 prospectus and the 2016 prospectus of company A. If each document source in turn has three format sources (WORD, EXCEL and TXT), parsing the documents of different formats from both document sources yields data from 6 different sources for the total assets of company A in 2013; that is, this financial datum has 6 different parsing results.

For the parsing results from different sources that correspond to the financial datum of one index (the same index of the same company for the same time), the accuracy of the different parsing results differs. The financial datum must therefore be verified using its different parsing results, and the datum obtained after verification is taken as the final financial datum.

Preferably, step (3) specifically includes the following steps.

S301, divide the different parsing results corresponding to the financial datum into categories;

Specifically, the different parsing results corresponding to the financial datum are first compared pairwise. When the financial datum is of numeric type, for example a datum of the total assets index, the lengths of the parts before the decimal point, i.e. the numbers of digits before the decimal point, are compared; if the lengths are equal, the first m3 digits of the data are compared, and if these digits are equal, the two data, i.e. the two parsing results, are considered to belong to the same category. When the financial datum is of ratio type, for example a datum of the return on net assets index, the parts before the decimal point are compared; if they are equal, the first n4 digits after the decimal point are compared, and if these are equal, the two data, i.e. the two parsing results, are considered to belong to the same category;

The above processing is applied to the different parsing results corresponding to the financial datum until all parsing results have been classified; each category obtained after classification contains at least one parsing result;
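As a non-limiting illustration, the category division of step S301 can be sketched as follows. The sketch assumes that parsing results are non-negative decimal strings, that "the first m3 digits" are counted over the digits of the value with the decimal point removed, and that comparing a new result against the first member of each category is equivalent to pairwise comparison (the test used is an equivalence relation for fixed m3 and n4). The function names and sample values are hypothetical.

```python
def same_category(a, b, kind, m3=3, n4=2):
    """Decide whether two parsing results belong to the same category."""
    a_int, _, a_frac = a.replace(",", "").partition(".")
    b_int, _, b_frac = b.replace(",", "").partition(".")
    if kind == "numeric":
        # Same number of integer digits, then same leading m3 digits.
        return (len(a_int) == len(b_int)
                and (a_int + a_frac)[:m3] == (b_int + b_frac)[:m3])
    if kind == "ratio":
        # Same integer part, then same first n4 decimal digits.
        return a_int == b_int and a_frac[:n4] == b_frac[:n4]
    raise ValueError(f"unknown data kind: {kind}")


def group_results(results, kind, m3=3, n4=2):
    """Partition parsing results into categories."""
    categories = []
    for value in results:
        for category in categories:
            if same_category(value, category[0], kind, m3, n4):
                category.append(value)
                break
        else:
            categories.append([value])      # no match: open a new category
    return categories


numeric = group_results(["1200.50", "1200.5", "1200.49", "120.05"], "numeric", m3=4)
ratio = group_results(["12.34", "12.341", "12.43"], "ratio", n4=2)
```

In the numeric sample, the value with a misplaced decimal point falls into a category of its own because its integer part has a different length.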

S302, from the at least one category obtained by the division, select the category that satisfies a first preset condition as the correct category;

Preferably, step S302 specifically includes:

S3021, when the division yields 1 category, take that category as the correct category;

S3022, when the division yields at least 2 categories, select the correct category from them according to the number of parsing results contained in each category, the sum of the confidence scores of the financial documents and/or the issue time of the financial documents;

Specifically, step S3022 preferably includes:

S30221, tally the categories obtained in step S301 above;

S30222, from the multiple different categories, select the category or categories containing the largest number of parsing results;

S30223, if step S30222 selects 1 category, take the selected category as the correct category;

S30224, if step S30222 selects at least two categories, calculate the sum of the confidence scores of the financial documents corresponding to each category, and select from them the category or categories with the largest sum;

For example, if step S30222 selects category 1, category 2 and category 3, the sum of the confidence scores of the financial documents for category 1 is calculated as follows: the confidence scores k of the financial document sources corresponding to the different parsing results contained in category 1 are added together, and the total is the sum of the confidence scores of the financial documents. The sums for category 2 and category 3 are calculated in the same way. The category with the highest sum, for example category 1, is then selected from category 1, category 2 and category 3;

S30225, if step S30224 selects 1 category, take the selected category as the correct category;

S30226, if step S30224 selects at least two categories, compare the issue times of the financial documents of the categories and select the category containing the parsing result whose issue time is closest to the current date;

Specifically, the different parsing results contained in a category come from different document sources with different document issue times; the category containing the parsing result whose document source has the issue time closest to the current date is the category to be selected;

S30227, if step S30226 selects 1 category, take the selected category as the correct category;

S30228, if step S30226 selects at least two categories, determine the correct category according to another preset strategy, and take the category finally selected as the correct category; the other preset strategy can be configured according to the actual situation;
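As a non-limiting illustration, the successive tie-breaks of steps S3021 to S30228 can be sketched as follows. The record layout (dictionaries with `value`, `score` and `issued` keys), the function names and the sample data are assumptions of this sketch. It also assumes that issue dates lie in the past, so that the latest issue date is the one closest to the current date, and it returns the first remaining candidate where the source calls for another preset strategy.

```python
from datetime import date


def total_score(category):
    """Sum of the confidence scores k of the source documents."""
    return sum(result["score"] for result in category)


def latest_issue(category):
    """Most recent issue date among the source documents."""
    return max(result["issued"] for result in category)


def keep_best(categories, key):
    """Keep only the categories that reach the maximum of key."""
    best = max(key(category) for category in categories)
    return [category for category in categories if key(category) == best]


def pick_correct_category(categories):
    """Apply the criteria in order: count, score sum, issue time."""
    candidates = categories
    for key in (len, total_score, latest_issue):
        candidates = keep_best(candidates, key)
        if len(candidates) == 1:
            return candidates[0]
    return candidates[0]    # placeholder for the "other preset strategy"


cat_a = [
    {"value": "1200.50", "score": 98, "issued": date(2016, 4, 1)},
    {"value": "1200.5", "score": 95, "issued": date(2015, 4, 1)},
]
cat_b = [
    {"value": "1300.00", "score": 90, "issued": date(2016, 4, 1)},
    {"value": "1300.0", "score": 99, "issued": date(2016, 4, 1)},
]
cat_c = [{"value": "12.00", "score": 100, "issued": date(2016, 4, 1)}]
chosen = pick_correct_category([cat_a, cat_b, cat_c])
```

Here `cat_a` and `cat_b` tie on the number of parsing results, and `cat_a` wins on the sum of confidence scores (193 against 189).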

S303, from the correct category, select the parsing result that satisfies a second preset condition as the correct data, and set a corresponding data confidence level for the correct data;

Specifically, if the correct category contains more than 1 parsing result, the parsing result with the highest precision is taken as the correct data and the confidence level of this correct data is set to high; if the correct category contains 1 parsing result, that parsing result is taken as the correct data, and the confidence level of this correct data is set to；

S304, take the correct data as the financial data obtained after verification.

Step (4): correction and optimisation of document parsing.

Preferably, step (4) specifically includes:

S401, calculate the data parsing accuracy from the different parsing results corresponding to the financial data and the correct data;

S402, when the calculated data parsing accuracy is lower than a fifth threshold, correct and optimise the document parsing process according to a preset correction and optimisation strategy until the calculated data parsing accuracy is greater than or equal to the fifth threshold;

Specifically, the preset correction and optimisation strategy is as follows: each time a certain number of documents has been parsed, a portion of the data that step (3) placed in incorrect categories is randomly sampled from these documents and checked, and any error found is corrected and the processing optimised. If the error is found not to lie in the source document itself, the code of the corresponding format module in step (2) is analysed, the cause of the error is identified and fixed, and the parameters involved in step (2) are adjusted and optimised. Steps (2), (3) and (4) are then executed again, so that the data parsing accuracy rises continuously until it finally reaches the fifth threshold.

As a further preferred implementation of this embodiment, the step of building the multi-format document pool using deduplication and format conversion techniques, i.e. step (1), and/or the step of locating and parsing the financial documents of the multiple different document formats using the regular expression rules of the financial indexes, the start feature indexes and the end feature indexes to obtain the financial data and the index name and time corresponding to each financial datum, i.e. step (2), is performed in a distributed processing manner;

And/or

The multi-format document pool is stored on distributed storage servers, so as to achieve distributed processing and/or storage of the documents.

Specifically, because step S103 and/or step (2) takes a long time to execute for each document, step S103 and/or step (2) is executed in a distributed processing manner to improve overall document processing efficiency; the document pools of the multiple different formats are stored in a distributed manner to improve subsequent document reading efficiency. The present invention therefore further preferably includes the following steps:

Step ①, according to the pending task amount n of each server, send the PDF financial documents downloaded in step S102 to the server with the smallest pending task amount for the processing of step S103;

Step ②, monitor each server, i.e. set up a monitoring module;

Specifically, let n_max and n_min be the maximum and the minimum of the pending task amounts of the servers. Determine whether n_min is greater than or equal to a preset sixth threshold n_extreme; if so, pause step S102 until n_max is less than or equal to a preset reasonable value n_recommend, and then resume step S102. If, within a time period t, the number of times c that n_min >= n_extreme exceeds a preset warning count c_alarm, a system alarm is issued to remind the monitoring personnel to solve the problem by adding servers. This improves overall document processing efficiency, allows problems caused by insufficient server resources to be handled promptly, and improves the smoothness, stability and reliability of the whole document processing flow;

Step ③, store the document pools of the multiple different formats produced in step S103 on fast_dfs distributed storage servers to await subsequent parsing;

Step ④, according to the pending task amount n of each server, send the financial documents in the document pool to the server with the smallest pending task amount for the processing of step (2) to obtain the parsing results, improving the efficiency of parsing massive numbers of documents.
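As a non-limiting illustration, the least-loaded dispatch and the monitoring logic described above can be sketched as follows. The class `DownloadMonitor`, the function `pick_server`, the server names and the idea of resetting the overload counter at the start of each time period t through `reset_window` are assumptions of this sketch; the source does not describe how the monitoring module is implemented.

```python
def pick_server(pending):
    """Return the server with the smallest pending task amount n."""
    return min(pending, key=pending.get)


class DownloadMonitor:
    """Decides when to pause step S102 and when to raise an alarm."""

    def __init__(self, n_extreme, n_recommend, c_alarm):
        self.n_extreme = n_extreme
        self.n_recommend = n_recommend
        self.c_alarm = c_alarm
        self.paused = False
        self.overload_count = 0     # c, counted within the current period t

    def reset_window(self):
        """Call at the start of each time period t (assumed mechanism)."""
        self.overload_count = 0

    def check(self, pending):
        """Return (paused, alarm) for the current pending task amounts."""
        n_min, n_max = min(pending.values()), max(pending.values())
        alarm = False
        if n_min >= self.n_extreme:
            self.paused = True
            self.overload_count += 1
            alarm = self.overload_count > self.c_alarm
        elif self.paused and n_max <= self.n_recommend:
            self.paused = False     # load has dropped: resume downloading
        return self.paused, alarm


monitor = DownloadMonitor(n_extreme=10, n_recommend=4, c_alarm=1)
history = [
    monitor.check({"s1": 12, "s2": 10}),
    monitor.check({"s1": 6, "s2": 3}),
    monitor.check({"s1": 4, "s2": 2}),
    monitor.check({"s1": 11, "s2": 11}),
]
target = pick_server({"s1": 4, "s2": 2})
```

Using a lower resume value n_recommend than the pause threshold n_extreme keeps the download step from toggling rapidly when the load hovers around a single threshold.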

As can be seen from the above, the present invention provides a PDF document processing method for massive numbers of documents. Using deduplication and format conversion techniques, the data of each PDF document is extended from a single source to multiple sources in different formats, so as to build a multi-format document pool. The regular expression rules for financial indicators, the start feature indicators and the end feature indicators are then used to parse the financial statement data in the documents of multiple different formats in the pool, and the different parsing results corresponding to the financial data of the same indicator are used for verification, so that the data with the highest confidence is taken as the finally parsed financial data. The processing scheme of the present invention can therefore parse financial data quickly and accurately and yields financial data of high accuracy. In addition, the document parsing flow is iteratively revised and optimized according to the verification results, which further improves the accuracy of document parsing.
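The cross-format verification summarized above can be sketched in Python as follows. This is a minimal sketch under assumptions, not the selection rule of the invention: `ParseResult` and `verify` are hypothetical names, parsing results with identical values are treated as one class, the winning class is the one with the most results (ties broken by the sum of certainty values), and the reliability is taken to be the share of results that fall in the winning class.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseResult:
    source_format: str  # hypothetical format label, e.g. "txt" or "html"
    value: float        # value parsed for one financial data item and period
    certainty: float    # certainty value of the document it was parsed from


def verify(results):
    """Return (chosen_result, reliability) for one financial data item."""
    if not results:
        raise ValueError("no parsing results to verify")
    # Divide the results into classes: identical values form one class.
    classes = defaultdict(list)
    for r in results:
        classes[r.value].append(r)
    # Assumed first condition: most results, then largest summed certainty.
    best = max(classes.values(),
               key=lambda c: (len(c), sum(r.certainty for r in c)))
    # Assumed second condition: highest certainty inside the chosen class.
    chosen = max(best, key=lambda r: r.certainty)
    # Assumed reliability measure: fraction of results that agree.
    reliability = len(best) / len(results)
    return chosen, reliability
```

For example, if the same value is parsed from two format variants of a document and a different value from a third, the sketch selects the value on which two variants agree and reports a reliability of 2/3.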

The document processing scheme of the present invention is applicable to the field of financial big data parsing, for example parsing the financial data in enterprise annual reports, in bond issuance prospectuses, and in bond issuance tracking rating reports.

All technical content of this embodiment may be split or used in combination, in any manner, with Embodiments 1 to 3 above.

The above describes preferred implementations of the present invention, but the invention is not limited to these embodiments. Those skilled in the art may make various equivalent variations or substitutions without departing from the spirit of the invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of this application.

Claims

1. A PDF document processing method based on big data, characterized in that the method comprises the following steps:

building a multi-format document pool using deduplication and format conversion techniques, wherein the multi-format document pool contains financial documents in multiple different document formats;

performing positioning and parsing on the financial documents in multiple different document formats using regular expression rules for financial indicators, start feature indicators and end feature indicators, to obtain financial data together with the indicator name and time corresponding to the financial data;

verifying the financial data using the different parsing results corresponding to the financial data.
2. The PDF document processing method based on big data according to claim 1, characterized in that the step of building a multi-format document pool using deduplication and format conversion techniques specifically comprises:

building a document download link pool using deduplication;

downloading at least one corresponding PDF financial document using at least one PDF financial document download link contained in the document download link pool;

converting the downloaded PDF financial documents into financial documents in different document formats using format conversion techniques, and then placing the financial documents in different document formats into the multi-format document pool.
3. The PDF document processing method based on big data according to claim 2, characterized in that the step of building a multi-format document pool using deduplication and format conversion techniques further specifically comprises:

calculating the certainty value of each financial document in the multi-format document pool.
4. The PDF document processing method based on big data according to claim 3, characterized in that the step of performing positioning and parsing on the financial documents in multiple different document formats using regular expression rules for financial indicators, start feature indicators and end feature indicators, to obtain financial data together with the indicator name and time corresponding to the financial data, specifically comprises:

locating the financial statements in the financial documents in multiple different document formats using the regular expression rules for financial indicators, the start feature indicators and the end feature indicators;

locating the data within the financial statements so located, and then recording the financial data together with the indicator name and time corresponding to the financial data;

performing unit conversion on the financial data of numeric type.
5. The PDF document processing method based on big data according to claim 4, characterized in that the following steps are provided before the step of performing positioning and parsing on the financial documents in multiple different document formats using regular expression rules for financial indicators, start feature indicators and end feature indicators:

building the regular expression rules for financial indicators;

and/or

obtaining the start feature indicators and end feature indicators of the financial statements.
6. The PDF document processing method based on big data according to claim 4, characterized in that the step of verifying the financial data using the different parsing results corresponding to the financial data specifically comprises:

dividing the different parsing results corresponding to the financial data into classes;

selecting, from the at least one class obtained by the division, the class that satisfies a first preset condition as the correct class;

selecting, from the correct class, the parsing result that satisfies a second preset condition as the correct data, and setting the data reliability corresponding to the correct data;

taking the correct data as the verified financial data.
7. The PDF document processing method based on big data according to claim 6, characterized in that the step of selecting, from the at least one class obtained by the division, the class that satisfies the first preset condition as the correct class specifically comprises:

when the number of classes obtained by the division is 1, taking the class obtained by the division as the correct class;

when the number of classes obtained by the division is at least 2, selecting the corresponding class from the at least two classes as the correct class according to the number of parsing results contained in each class, the sum of the certainty values of the financial documents, and/or the issuing time of the financial documents.
8. The PDF document processing method based on big data according to claim 6 or 7, characterized in that the method further comprises a document parsing correction and optimization step, which specifically comprises the following steps:

calculating a data parsing accuracy from the different parsing results corresponding to the financial data and the correct data;

when the calculated data parsing accuracy is below a threshold, revising and optimizing the document parsing process according to a preset correction and optimization strategy until the calculated data parsing accuracy is greater than or equal to the threshold.
9. The PDF document processing method based on big data according to any one of claims 1-7, characterized in that the step of building a multi-format document pool using deduplication and format conversion techniques, and/or the step of performing positioning and parsing on the financial documents in multiple different document formats using regular expression rules for financial indicators, start feature indicators and end feature indicators, to obtain financial data together with the indicator name and time corresponding to the financial data, is executed in a distributed processing manner;

and/or

the multi-format document pool is stored on a distributed storage server.
10. A PDF document processing device based on big data, characterized in that the device comprises:

at least one processor;

at least one memory, for storing at least one program;

wherein, when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the PDF document processing method based on big data according to any one of claims 1-9.