CN107943785A - PDF document processing method and device based on big data
- Publication number
- CN107943785A, CN201711080720.3A
- Authority
- CN
- China
- Prior art keywords
- financial
- data
- index
- document
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Abstract
The invention discloses a PDF document processing method and device based on big data. The method comprises: using a deduplication technique and a format conversion technique to build a multi-format document pool containing financial documents in a plurality of different document formats; using regular expression rules for financial indicators, a start feature indicator and an end feature indicator to locate and parse the financial documents in the different document formats, thereby obtaining financial data together with the indicator name and time corresponding to the financial data; and verifying the financial data using the different parsing results that correspond to it. The device comprises a memory for storing a program and a processor for loading the program and executing the big-data-based PDF document processing method. With the invention, financial data can be parsed and extracted quickly and accurately from financial documents in a variety of formats. As a PDF document processing method and device based on big data, the invention can be widely applied in the field of big data parsing.
Description
Technical field
The present invention relates to big data processing technology, and more particularly to a PDF document processing method and device based on big data.
Background
Explanation of technical terms:
Regular expression: a pattern that describes a set of strings conforming to certain syntactic rules, so that a single pattern string can be used for matching.
Balance sheet: the main accounting statement showing an enterprise's financial position (i.e., its assets, liabilities and owners' equity) on a given date, usually the end of each accounting period.
Income statement: a statement reflecting an enterprise's operating results during a given accounting period.
Cash flow statement: a statement reflecting the inflows and outflows of an enterprise's cash and cash equivalents during a given accounting period.
In the field of corporate finance big data analysis, much financial data has to be extracted from documents such as annual reports disclosed by companies or prospectuses issued with bond offerings, and the accuracy requirements on the extracted data are very high. At present, these financial documents are usually saved in PDF format, and some of the PDF documents are in picture format. How to parse and extract the financial data in these PDF documents automatically, quickly and accurately is therefore of great significance to enterprises in reducing data acquisition costs and improving data accuracy and processing efficiency.
Summary of the invention
In order to solve the above technical problem, the object of the present invention is to provide a PDF document processing method, system and device based on big data that can parse and extract financial data from multiple financial documents quickly and accurately.
The first technical solution of the present invention is a PDF document processing method based on big data, the method comprising the following steps:
Using a deduplication technique and a format conversion technique, building a multi-format document pool, wherein the multi-format document pool contains financial documents in a plurality of different document formats;
Using regular expression rules for financial indicators, a start feature indicator and an end feature indicator, locating and parsing the financial documents in the different document formats to obtain financial data and the indicator name and time corresponding to the financial data;
Using the different parsing results corresponding to the financial data, verifying the financial data.
The second technical solution of the present invention is a PDF document processing system based on big data, the system comprising:
A construction unit, for building a multi-format document pool using a deduplication technique and a format conversion technique, wherein the multi-format document pool contains financial documents in a plurality of different document formats;
A parsing unit, for using regular expression rules for financial indicators, a start feature indicator and an end feature indicator to locate and parse the financial documents in the different document formats, obtaining financial data and the indicator name and time corresponding to the financial data;
A verification unit, for verifying the financial data using the different parsing results corresponding to the financial data.
The third technical solution of the present invention is a PDF document processing device based on big data, the device comprising:
At least one processor;
At least one memory, for storing at least one program;
When the at least one program is executed by the at least one processor, the at least one processor implements the PDF document processing method based on big data described in the first technical solution above.
The beneficial effects of the method, system and device of the present invention are as follows. After building a multi-format document pool using the deduplication technique and the format conversion technique, the invention uses the regular expression rules for financial indicators, the start feature indicator and the end feature indicator to locate and parse the financial documents in the different document formats, obtaining the financial data and the indicator name and time corresponding to the financial data. It then verifies the financial data using the different parsing results, drawn from different document sources, that correspond to the financial data. With the invention, financial data can therefore be parsed and extracted from financial documents quickly and accurately, and the financial data obtained is highly accurate.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the PDF document processing method based on big data of the present invention;
Fig. 2 is a structural block diagram of the PDF document processing system based on big data of the present invention;
Fig. 3 is a flow chart of the steps of a specific embodiment of the PDF document processing method based on big data of the present invention.
Detailed description of the embodiments
Embodiment 1
As shown in Fig. 1, a PDF document processing method based on big data comprises the following steps:
Using a deduplication technique and a format conversion technique, building a multi-format document pool, wherein the multi-format document pool contains financial documents in a plurality of different document formats;
Using regular expression rules for financial indicators, a start feature indicator and an end feature indicator, locating and parsing the financial documents in the different document formats to obtain financial data and the indicator name and time corresponding to the financial data;
Using the different parsing results corresponding to the financial data, verifying the financial data.
As a further preferred embodiment of the method, the step of building the multi-format document pool using the deduplication technique and the format conversion technique specifically comprises:
Using the deduplication technique, building a document download link pool;
Using at least one PDF financial document download link contained in the document download link pool, downloading the corresponding at least one PDF financial document;
Using the format conversion technique, converting each downloaded PDF financial document into financial documents in different document formats, and then placing the financial documents in the different document formats into the multi-format document pool.
As a further preferred embodiment of the method, the step of building the multi-format document pool using the deduplication technique and the format conversion technique further comprises:
Calculating a confidence value for each financial document in the multi-format document pool.
As a further preferred embodiment of the method, the step of using the regular expression rules for financial indicators, the start feature indicator and the end feature indicator to locate and parse the financial documents in the different document formats, obtaining the financial data and the indicator name and time corresponding to the financial data, specifically comprises:
Using the regular expression rules for financial indicators, the start feature indicator and the end feature indicator, locating the financial statements within the financial documents in the different document formats;
Locating the data within the financial statements so found, and recording the financial data and the indicator name and time corresponding to the financial data;
Performing unit conversion on the financial data that is of numeric type.
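The locating and unit-conversion steps above can be illustrated with a minimal Python sketch. The marker patterns `START` and `END`, the row pattern `ROW`, the `UNIT` table and both function names are assumptions introduced for illustration; the source does not give concrete regular expressions, markers or units, and extraction of the reporting time is omitted here.

```python
import re

# Hypothetical start/end feature indicators for a balance sheet.
START = re.compile(r"合并资产负债表")
END = re.compile(r"负债和所有者权益总计")
# Hypothetical row rule: an indicator name followed by one numeric value.
ROW = re.compile(r"^(?P<name>[^\d\n]+?)[ \t]+(?P<value>-?[\d,]+(?:\.\d+)?)[ \t]*$", re.M)
# Hypothetical unit table used to normalise every value to yuan.
UNIT = {"元": 1, "千元": 1_000, "万元": 10_000, "百万元": 1_000_000}

def locate_statement(text):
    """Return the text from the start marker through the line holding the end marker."""
    start = START.search(text)
    if start is None:
        return None
    end = END.search(text, start.end())
    if end is None:
        return None
    line_end = text.find("\n", end.end())
    return text[start.start():line_end if line_end != -1 else len(text)]

def parse_rows(statement, unit="元"):
    """Map each indicator name to its value, converted to the base unit."""
    factor = UNIT[unit]
    return {
        m["name"].strip(): float(m["value"].replace(",", "")) * factor
        for m in ROW.finditer(statement)
    }
```

Restricting the row pattern to the located span keeps figures from the notes or other statements out of the result, which is the apparent purpose of the start and end feature indicators.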
As a further preferred embodiment of the method, the following steps are provided before the step of using the regular expression rules for financial indicators, the start feature indicator and the end feature indicator to locate and parse the financial documents in the different document formats:
Building the regular expression rules for the financial indicators;
And/or
Obtaining the start feature indicator and the end feature indicator of the financial statements.
As a further preferred embodiment of the method, the step of verifying the financial data using the different parsing results corresponding to the financial data specifically comprises:
Dividing the different parsing results corresponding to the financial data into categories;
From the at least one category obtained by the division, selecting the category that meets a first preset condition as the correct category;
From the correct category, selecting the parsing result that meets a second preset condition as the correct data, and setting a corresponding data reliability for the correct data;
Taking the correct data as the verified financial data.
As a further preferred embodiment of the method, the step of selecting, from the at least one category obtained by the division, the category that meets the first preset condition as the correct category specifically comprises:
When the number of categories obtained by the division is 1, taking that category as the correct category;
When the number of categories obtained by the division is at least 2, selecting the correct category from the at least two categories according to the number of parsing results contained in each category, the sum of the confidence values of the financial documents and/or the issue time of the financial documents.
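The verification steps above can be sketched in Python as follows. The source leaves the two preset conditions open, so this sketch makes several assumptions: parsing results with equal values form one category, categories are ranked by result count, then by summed document confidence, then by latest issue time, the highest-confidence result in the winning category is taken as the correct data, and the data reliability is the share of results that fall in the winning category. The dictionary keys are hypothetical.

```python
from collections import defaultdict

def verify(results):
    """results: dicts with hypothetical keys 'value', 'confidence' and 'issued'.

    Returns (correct_value, reliability).
    """
    if not results:
        raise ValueError("at least one parsing result is required")

    # Category division: results carrying the same value share a category (assumption).
    categories = defaultdict(list)
    for result in results:
        categories[result["value"]].append(result)

    if len(categories) == 1:
        correct = next(iter(categories.values()))
    else:
        # Assumed first preset condition: count, then summed confidence, then issue time.
        correct = max(
            categories.values(),
            key=lambda c: (
                len(c),
                sum(r["confidence"] for r in c),
                max(r["issued"] for r in c),
            ),
        )

    # Assumed second preset condition: the result from the most trusted document.
    best = max(correct, key=lambda r: r["confidence"])
    # Assumed reliability measure: share of all results that agree with the winner.
    reliability = len(correct) / len(results)
    return best["value"], reliability
```

For example, if the PDF and WORD versions of a report both yield 100 for an indicator while the EXCEL version yields 1000, the value 100 is returned with a reliability of two thirds under these assumptions.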
As a further preferred embodiment of the method, the method further comprises a document parsing correction and optimization step, which specifically comprises the following steps:
Calculating a data parsing accuracy from the different parsing results corresponding to the financial data and the correct data;
When the calculated data parsing accuracy is less than a threshold, correcting and optimizing the document parsing process according to a preset correction and optimization strategy until the calculated data parsing accuracy is greater than or equal to the threshold.
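A minimal Python sketch of this correction loop is given below. The source does not define how the data parsing accuracy is computed or what the correction strategy does, so the accuracy here is assumed to be the fraction of parsing results equal to the correct data, and `parse` and `adjust` are hypothetical callbacks standing in for the parsing process and the preset strategy. The `max_rounds` guard is also an addition, so that the loop ends even if the threshold is never reached.

```python
def parse_accuracy(parsed_values, correct_value):
    """Assumed accuracy measure: fraction of parsing results equal to the correct data."""
    if not parsed_values:
        return 0.0
    matches = sum(1 for value in parsed_values if value == correct_value)
    return matches / len(parsed_values)

def optimise_until_accurate(parse, adjust, correct_value, threshold=0.95, max_rounds=10):
    """Re-run parsing, applying the correction strategy while accuracy is below threshold.

    Returns (final_accuracy, number_of_corrections_applied).
    """
    for round_no in range(max_rounds):
        accuracy = parse_accuracy(parse(), correct_value)
        if accuracy >= threshold:
            return accuracy, round_no
        adjust()  # hypothetical hook, e.g. refining a regular expression rule
    return parse_accuracy(parse(), correct_value), max_rounds
```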
As a further preferred embodiment of the method, the step of building the multi-format document pool using the deduplication technique and the format conversion technique, and/or the step of using the regular expression rules for financial indicators, the start feature indicator and the end feature indicator to locate and parse the financial documents in the different document formats and obtain the financial data and the indicator name and time corresponding to the financial data, is performed in a distributed processing manner;
And/or
The multi-format document pool is stored on a distributed storage server.
Embodiment 2
As shown in Fig. 2, a system corresponding to the above method, namely a PDF document processing system based on big data, comprises:
A construction unit, for building a multi-format document pool using a deduplication technique and a format conversion technique, wherein the multi-format document pool contains financial documents in a plurality of different document formats;
A parsing unit, for using regular expression rules for financial indicators, a start feature indicator and an end feature indicator to locate and parse the financial documents in the different document formats, obtaining financial data and the indicator name and time corresponding to the financial data;
A verification unit, for verifying the financial data using the different parsing results corresponding to the financial data. The construction unit, parsing unit and/or verification unit may be program modules, hardware modules, or device modules combining software and hardware.
The content of the above method embodiment applies to this system embodiment: the steps implemented by this system embodiment are the same as those of the method embodiment, and the beneficial effects achieved are the same as those achieved by the method embodiment.
Embodiment 3
A device corresponding to the above method, namely a PDF document processing device based on big data, comprises:
At least one processor;
At least one memory, for storing at least one program;
When the at least one program is executed by the at least one processor, the at least one processor implements the PDF document processing method based on big data described in the above method embodiment.
The content of the above method embodiment applies to this device embodiment: the steps implemented by this device embodiment are the same as those of the method embodiment, and the beneficial effects achieved are the same as those achieved by the method embodiment.
Embodiment 4
As shown in Fig. 3, the PDF document processing method based on big data provided by this embodiment comprises the following steps.
Step (1): using a deduplication technique and a format conversion technique, building a multi-format document pool, wherein the multi-format document pool contains financial documents in a plurality of different document formats.
Specifically, step (1) implements a complete PDF document preprocessing pipeline.
Preferably, step (1) specifically comprises the following steps.
S101: using the deduplication technique, building a document download link pool.
Specifically, step S101 is mainly used to collect and deduplicate the PDF financial document download links crawled from public information on the network, so as to form a document download link pool that is comprehensive, free of duplicates and up to date. Here, a PDF financial document is a financial document whose document format is PDF.
Preferably, step S101 specifically comprises the following steps:
S1011: crawling the required PDF financial document download links;
Specifically, the required PDF financial document download links are crawled from multiple website channels, so that the required PDF financial documents are covered as comprehensively as possible;
S1012: using the simhash algorithm, calculating a simhash code for the PDF financial document title corresponding to each crawled PDF financial document download link, thereby obtaining the simhash code of each PDF financial document title;
Preferably, step S1012 specifically comprises the following steps:
S10121: collecting the crawled PDF financial document download links into a single document download link pool;
S10122: calculating a simhash code for the PDF financial document title corresponding to each PDF financial document download link in the document download link pool, that is, for the title corresponding to each crawled download link, thereby obtaining the simhash code of each PDF financial document title;
S1013: according to the simhash codes of the PDF financial document titles, sorting the PDF financial document download links into types, wherein the download links sorted into the same type correspond to the same PDF financial document;
Preferably, the sorting in step S1013 is based on the Hamming distance between the simhash codes of the document titles; that is, this step is specifically: calculating the Hamming distances between the simhash codes of the PDF financial document titles and sorting the PDF financial document download links according to the calculated Hamming distances;
Preferably, step S1013 comprises the following steps:
Calculating the Hamming distance between the simhash codes of any two PDF financial document titles;
When the calculated Hamming distance between the simhash codes of two PDF financial document titles is less than a first threshold n, judging that the PDF financial document download links corresponding to the two titles belong to the same type, i.e., the two download links correspond to the same PDF financial document;
When the calculated Hamming distance between the simhash codes of two PDF financial document titles is greater than or equal to the first threshold n, judging that the PDF financial document download links corresponding to the two titles do not belong to the same type, i.e., the two download links correspond to different PDF financial documents; the threshold n in this step is a distance threshold;
Applying the above calculation and judgment to the simhash codes of all PDF financial document titles until all PDF financial document download links have been sorted;
After the sorting, each type contains at least one PDF financial document download link; in other words, each type represents a set, and each set contains at least one PDF financial document download link;
S1014: using the crawl timestamps of the PDF financial document download links, selecting from the download links contained in each type the one whose crawl timestamp is the largest; for the crawl timestamp, a smaller value represents an earlier time and a larger value represents a later time;
Specifically, if in the document download link pool the same document corresponds to two or more PDF financial document download links, i.e., a type contains two or more download links, the crawl timestamps of those download links are compared, the download links with smaller crawl timestamps are deleted, and only the newest download link is retained. In other words, according to the crawl timestamps, the download link with the largest crawl timestamp is selected from the two or more download links of the same type and retained;
If the same document corresponds to only one PDF financial document download link, i.e., a type contains only one download link, that download link is taken as the selected download link with the largest crawl timestamp;
S1015: storing all the selected PDF financial document download links in the document download link pool, which at this point is the required document download link pool.
In addition, in step S10121 the crawled PDF financial document download links may instead first be collected at some other preset location; after the above processing has been applied to them and the download links with the largest crawl timestamps have been selected, the selected download links are then stored in the document download link pool.
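Steps S1012 to S1014 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the source names only "simhash" and a distance threshold n, so the character-bigram tokenisation, the use of MD5 as the token hash, the 64-bit width, the default `n=3`, the comparison against one representative per type rather than every pair, and the dictionary keys `url`, `title` and `crawled_at` are all hypothetical.

```python
import hashlib

def simhash(title, bits=64):
    """Simhash fingerprint of a title, built from hashed character bigrams."""
    tokens = [title[i:i + 2] for i in range(max(len(title) - 1, 1))]
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        for bit in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def deduplicate(links, n=3):
    """Group links whose title fingerprints differ by fewer than n bits,
    then keep the link with the largest crawl timestamp in each group."""
    groups = []  # list of (representative fingerprint, member links)
    for link in links:
        code = simhash(link["title"])
        for representative, members in groups:
            if hamming(code, representative) < n:
                members.append(link)
                break
        else:
            groups.append((code, [link]))
    return [max(members, key=lambda l: l["crawled_at"]) for _, members in groups]
```

Comparing each new title against one representative per type avoids a full pairwise comparison; a production system would need to decide how to treat titles that are close to several types at once, which the source does not address.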
S102: using at least one PDF financial document download link contained in the document download link pool, downloading the corresponding at least one PDF financial document. In general, one PDF financial document download link corresponds to one downloaded PDF financial document.
S103: using the format conversion technique, converting each downloaded PDF financial document into financial documents in different document formats, and then placing the financial documents in the different document formats into the multi-format document pool.
Preferably, the step of converting the downloaded PDF financial document into financial documents in different document formats using the format conversion technique comprises the following steps:
Judging whether the downloaded PDF financial document is a picture-format PDF financial document; if so, converting the downloaded PDF financial document into documents in different formats using a first conversion mode; otherwise, converting the downloaded PDF financial document into documents in different formats using a second conversion mode.
Preferably, the step of judging whether the downloaded PDF financial document is a picture-format PDF financial document comprises the following steps:
S1031: converting the downloaded PDF financial document into a document in a preset format;
Specifically, the preset format in this embodiment is TXT; that is, this step is specifically: using a preset PDF-to-TXT tool, converting the downloaded PDF financial document into a TXT financial document, i.e., a financial document whose document format is TXT;
The preset PDF-to-TXT tool used above is only suitable for processing PDF documents that are not in picture format. When it is used to convert a picture-format PDF document, the resulting TXT document therefore contains garbled characters, so by calculating and evaluating the garbled characters in the TXT document it is possible to judge whether the PDF document is a picture-format document;
S1032: calculating the garbled-character rate of the preset-format document, i.e., the garbled-character rate of the TXT financial document obtained by the above conversion;
Specifically, to calculate the garbled-character rate of a document, the characters of the document must first be parsed, and the parsing result is then used to calculate the rate;
Preferably, step S1032 comprises:
S10321: randomly sampling s characters from the TXT financial document;
S10322: matching each sampled character one by one against the pre-stored characters in a preset dictionary library to obtain the number of matched characters p;
Specifically, when a sampled character matches a pre-stored character in the preset dictionary library, the match count is incremented by 1; after all sampled characters have been processed in this way, the resulting total is the number of matches p. The preset dictionary library stores pre-stored characters such as commonly used simplified and traditional Chinese characters, digits, letters and common special characters;
S10323: calculating the garbled-character rate r1 of the TXT financial document using the following formula:
r1 = (s - p) / s
S1033: judging whether the calculated garbled-character rate r1 is less than or equal to a second threshold; if so, the downloaded PDF financial document is a non-picture-format PDF financial document; otherwise, the downloaded PDF financial document is a picture-format PDF financial document; the threshold here is a garbled-character rate threshold;
Specifically, tests and statistics on a certain number of samples show that a TXT financial document whose garbled-character rate is less than or equal to the second threshold has a probability of more than 95% of coming from a non-picture-format PDF financial document. The garbled-character rate of the TXT financial document is therefore used to judge whether the corresponding PDF financial document is in picture format: for any PDF financial document, if the garbled-character rate of the TXT financial document converted from it is less than or equal to the second threshold, the PDF financial document is judged to be in non-picture format; otherwise, it is judged to be in picture format.
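Steps S10321 to S1033 can be sketched in Python as follows. The character set `KNOWN_CHARS` is a tiny hypothetical stand-in for the preset dictionary library, and the sample size `s=1000`, the threshold `0.2`, sampling with replacement and treating empty text as fully garbled are all assumptions; the source gives no concrete values for s or the second threshold.

```python
import random
import string

# Hypothetical stand-in for the preset dictionary library of known characters.
KNOWN_CHARS = set(string.ascii_letters + string.digits + string.punctuation + " \n\t")
KNOWN_CHARS |= set("资产负债表利润现金流量年度报告合并货币元万")

def garble_rate(text, s=1000, known=KNOWN_CHARS, rng=None):
    """Estimate r1 = (s - p) / s from s randomly sampled characters."""
    if not text:
        return 1.0  # assumption: an empty conversion result counts as fully garbled
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    sample = [rng.choice(text) for _ in range(s)]  # sampling with replacement
    p = sum(1 for ch in sample if ch in known)  # number of matched characters
    return (s - p) / s

def is_picture_format(txt_text, threshold=0.2, **kwargs):
    """A rate above the threshold suggests the source PDF was in picture format."""
    return garble_rate(txt_text, **kwargs) > threshold
```

Sampling keeps the cost independent of document length, at the price of an estimate rather than an exact rate; a larger s tightens the estimate.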
Preferably, the step of converting the downloaded PDF financial document into documents in different formats using the first conversion mode specifically comprises:
S1034: when the downloaded PDF financial document is judged to be a picture-format PDF financial document, using the corresponding first conversion module, with an OCR text recognition module called as an aid, to convert the PDF financial document into a corresponding non-picture-format PDF financial document, WORD financial document and EXCEL financial document; then placing the PDF version of this financial document (the non-picture-format PDF financial document obtained by the conversion), the WORD version (the WORD financial document obtained by the conversion) and the EXCEL version (the EXCEL financial document obtained by the conversion) into the document pools of the corresponding formats, namely the PDF document pool, the WORD document pool and the EXCEL document pool;
Preferably, the step of converting the downloaded PDF financial document into documents in different formats using the second conversion mode specifically comprises:
S1035: when the downloaded PDF financial document is judged not to be a picture-format PDF financial document, using the corresponding second conversion module to convert the PDF financial document into a corresponding WORD financial document and EXCEL financial document; then placing the PDF version of this financial document (the PDF financial document that is not in picture format), the WORD version (the WORD financial document obtained by the conversion), the EXCEL version (the EXCEL financial document obtained by the conversion) and the TXT version of this financial document obtained in step S1031 (i.e., the TXT financial document) into the document pools of the corresponding formats, namely the PDF document pool, the WORD document pool, the EXCEL document pool and the TXT document pool. It follows that if the preset format chosen in step S1031 is not TXT but some other format, step S1035 must also include a step of converting the non-picture-format PDF financial document into a TXT financial document using a preset document format conversion tool.
S104: calculating a confidence value for each financial document in the multi-format document pool, the confidence value being specifically a confidence score.
Specifically, for every financial document in the document pools of the different formats (i.e., the multi-format document pool), the confidence score of the document can be calculated from its garbled-character rate, so that the quality of the document conversion can be determined quickly from the confidence score.
Preferably, step S104 includes:
S1041, for each financial document in the PDF document pool, the WORD document pool and the EXCEL document pool, calculate its garbled-character rate using the calculation method of step S10323 above; for a TXT financial document, the garbled-character rate already obtained in step S1032 can be used directly. By the same reasoning, if the preset format selected in step S1031 is not TXT but some other format, step S1041 must also apply the calculation method of step S10323 to each TXT financial document;
S1042, from the garbled-character rate of each financial document, calculate the document's confidence score using the following formula:
k = (1 - r) * λ
where k is the confidence score of the financial document, r is its garbled-character rate, and λ is an empirically chosen value;
S1043, record and save the confidence score k of each financial document.
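The calculation in steps S1042 and S1043 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the document identifiers and the choice λ = 100 are assumptions made only for the example, and the garbled-character rate r is taken as an input because its calculation (step S10323) is described elsewhere.

```python
def confidence_score(r: float, lam: float = 100.0) -> float:
    """Return k = (1 - r) * lam for a garbled-character rate r in [0, 1].

    lam stands for the empirical value in the formula; 100.0 is an assumed default.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("garbled-character rate must lie between 0 and 1")
    return (1.0 - r) * lam


# Hypothetical garbled-character rates for three versions of one document.
rates = {"report.pdf": 0.0, "report.docx": 0.25, "report.txt": 0.5}

# S1043: keep the score of every document for the later verification step.
scores = {doc: confidence_score(r) for doc, r in rates.items()}
```

A document with fewer garbled characters receives a higher score, which is what the later verification step relies on when it sums scores per category.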
As described above, each financial document in the multi-format document pool exists as documents of several different formats. The documents in the pool serve as the data source for the subsequent parsing, and their confidence scores provide a basis for judgment in the subsequent verification.
Step (2): using the regular expression rules of the financial indexes, the start feature indexes and the end feature indexes, locate and parse the financial documents of the different document formats to obtain the financial data together with the index name and time corresponding to each item of financial data.
Preferably, step (2) specifically includes the following steps.
S201, build the regular expression rules of the financial indexes.
Preferably, step S201 specifically includes:
S2011, obtain a standard name library, in which the standard names of the financial indexes are stored;
Specifically, step S2011 preferably uses the index names in the Ministry of Finance's "Accounting Standards for Business Enterprises" as the standard name library of the financial index dictionary; that is, those index names are taken as the standard names of the financial indexes and stored in the standard name library;
S2012, build an actual name library, in which the actual names of financial indexes extracted from financial documents are stored;
Specifically, step S2012 is implemented as follows: first, randomly select several financial documents containing actual financial statements from the multi-format document pool; then build the actual name library of the financial index dictionary from the wording these documents use for financial indexes. That is, the words used to denote financial indexes in those documents are taken as the actual names of the financial indexes and stored in the actual name library;
S2013, establish the name mapping between the actual names in the actual name library and the standard names in the standard name library;
Specifically, any financial index may go by one or more different actual names in practice, so the standard name corresponding to each actual name in the actual name library must be determined first. If the financial index denoted by an actual name already has a name in the standard name library, that name is the standard name of the financial index. If it does not, the occurrences of all actual names of that financial index are counted, the actual name that occurs most often is taken as the standard name of the financial index, and this standard name is added to the standard name library. For example, suppose the actual names of a financial index are a1, a2 and a3, and in the randomly selected documents a1 occurs 10 times, a2 occurs 8 times and a3 occurs 2 times; a1 is then taken as the standard name of the financial index and added to the standard name library. The name mapping can then be established from the correspondence between the actual names and the standard names. Accordingly, step S2013 preferably includes the following steps:
S20131, determine whether the financial index denoted by an actual name in the actual name library has a corresponding name in the standard name library. If so, that name is the standard name of the financial index; otherwise, count the occurrences of each actual name of the financial index, take the most frequent actual name as its standard name, and add this standard name to the standard name library;
S20132, once the financial index denoted by every actual name in the actual name library has a corresponding standard name in the standard name library, establish the name mapping between the actual names and the standard names according to the financial index they share. For example, if the actual names of financial index A are a1, a2 and a3 and its standard name is b1, a mapping from each of a1, a2 and a3 to b1 is established for financial index A;
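Steps S20131 and S20132 can be sketched as follows. This is an illustrative sketch under assumptions: the function name, the dictionary layout (index identifier to list of actual names) and the use of a Counter for occurrence counts are not taken from the source.

```python
from collections import Counter


def build_name_mapping(actual_names, occurrences, standard_names):
    """Map every actual name to a standard name.

    actual_names:   index id -> list of actual names seen in sampled documents
    occurrences:    Counter of how often each actual name occurred
    standard_names: index id -> standard name; extended in place (S20131)
    """
    mapping = {}
    for index_id, names in actual_names.items():
        standard = standard_names.get(index_id)
        if standard is None:
            # No standard name yet: promote the most frequent actual name.
            standard = max(names, key=lambda name: occurrences[name])
            standard_names[index_id] = standard
        for name in names:  # S20132: actual name -> standard name
            mapping[name] = standard
    return mapping


# Index A already has standard name b1; index B has none, so x1 (10 hits) is promoted.
actual = {"A": ["a1", "a2", "a3"], "B": ["x1", "x2", "x3"]}
counts = Counter({"a1": 5, "a2": 3, "a3": 1, "x1": 10, "x2": 8, "x3": 2})
standards = {"A": "b1"}
name_mapping = build_name_mapping(actual, counts, standards)
```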
S2014, build the regular expression rules of the financial indexes according to the name mapping;
Specifically, a regular expression rule is formulated for each financial index from the mapping between its actual names and its standard name. Such a rule is a recognition rule, based on regular expressions, for identifying the financial index. With these rules, financial index names in a massive number of financial documents can be identified quickly and accurately.
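One way to turn the name mapping into regular expression rules is to build, for each standard name, an alternation over all of its known spellings. The sketch below is an assumption about how such rules might look; the source does not give the rule format, and the function names and sample index names are hypothetical.

```python
import re


def build_index_rules(name_mapping):
    """Build one compiled pattern per standard name from actual -> standard pairs."""
    variants = {}
    for actual, standard in name_mapping.items():
        variants.setdefault(standard, set()).update({actual, standard})
    rules = {}
    for standard, names in variants.items():
        # Longest spellings first, each escaped so it is matched literally.
        alternation = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
        # Allow surrounding whitespace and an optional trailing colon.
        rules[standard] = re.compile(rf"^\s*(?:{alternation})\s*[:：]?\s*$")
    return rules


def identify_index(cell_text, rules):
    """Return the standard name whose rule matches the cell text, or None."""
    for standard, pattern in rules.items():
        if pattern.match(cell_text):
            return standard
    return None


rules = build_index_rules({"Assets total": "Total assets", "Sum of assets": "Total assets"})
```

Anchoring the pattern at both ends keeps a rule for "Total assets" from firing on a longer cell such as "Total assets less current liabilities"; whether the real rules are anchored this way is not stated in the source.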
S202, obtain the start feature indexes and end feature indexes of the financial statements.
Specifically, step S202 preferably includes:
S2021, extract multiple financial documents that contain financial statements;
Specifically, financial statements are of three major types: the balance sheet, the income statement and the cash flow statement. For each type, several financial documents containing an actual financial statement of that type are therefore selected at random; for example, for the balance sheet type, several financial documents containing an actual balance sheet are randomly selected;
S2022, extract indexes from the starting content of the financial statement in each financial document; then sort the extracted indexes by frequency of occurrence (i.e., number of occurrences) in descending order and select the first m1 indexes to build the start feature index list. For example, if the indexes extracted from these financial documents are q1, q2 and q3, occurring 7, 8 and 4 times respectively, then selecting the first 2 indexes gives a start feature index list containing q2 and q1;
Specifically, step S2022 yields a start feature index list for each type of financial statement. For example, step S2021 randomly selects several financial documents containing an actual balance sheet; indexes are then extracted from the starting content of the balance sheet in each of these documents, sorted by frequency of occurrence in descending order, and the first m1 indexes are selected to build the start feature index list for the balance sheet type. The start feature index lists for the income statement and the cash flow statement are built in the same way. The start feature index lists so built are the start feature indexes of the financial statements that this step is required to obtain;
S2023, extract indexes from the ending content of the financial statement in each financial document; then sort the extracted indexes by frequency of occurrence in descending order and select the first m2 indexes to build the end feature index list;
Specifically, the end feature index list of step S2023 is built in the same way as the start feature index list above, so the details are not repeated here. Step S2023 thus yields the end feature index lists for the three major types of financial statement: the balance sheet, the income statement and the cash flow statement. The end feature index lists so built are the end feature indexes of the financial statements that this step is required to obtain.
Preferably, m1 and m2 have the same value.
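The frequency ranking used in steps S2022 and S2023 can be sketched with a counter. The function name and the sample layout (one list of extracted index names per sampled document) are assumptions for illustration; the data reproduces the q1/q2/q3 example above.

```python
from collections import Counter


def build_feature_list(samples, m):
    """Return the m most frequent index names across the sampled statements.

    samples: one list of index names per sampled document, taken from the
             starting (or ending) content of its financial statement.
    """
    counts = Counter()
    for indexes in samples:
        counts.update(indexes)
    return [name for name, _ in counts.most_common(m)]


# q2 occurs 8 times, q1 7 times and q3 4 times across 8 sampled documents.
samples = [["q2", "q1", "q3"]] * 4 + [["q2", "q1"]] * 3 + [["q2"]]
start_features = build_feature_list(samples, m=2)
```

The same function serves for the end feature index list when it is fed the indexes extracted from the ending content of each statement.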
S203, use the regular expression rules of the financial indexes, the start feature indexes and the end feature indexes to locate the financial statements in the financial documents of the different document formats in the document pool.
Preferably, step S203 includes:
S2031, assign a corresponding ID to each financial statement table in a financial document according to the order in which the tables appear;
Specifically, for each financial document, incrementing IDs are assigned to the financial statement tables in the order in which they appear in the document, so that the ID of a table indicates its position in that order. For example, the tables are assigned ID1, ID2, ID3, ID4, ID5, ..., IDK from first to last; the table with ID1 therefore appears before the table with ID2 in the document;
S2032, use the regular expression rules of the financial indexes, the start feature indexes and the end feature indexes to analyze each financial statement table in the financial document, thereby locating the start table and the end table, where the start table is the table at the start position of a financial statement and the end table is the table at the end position of a financial statement;
Specifically, step S2032 includes:
S20320, set start flags for the tables of the three major types of financial statement (balance sheet, income statement, cash flow statement), namely asset_begin_sign, profit_begin_sign and cash_begin_sign, each with the initial value False;
S20321, when the start flags are False, use the regular expression rules of the financial indexes to identify and extract n1 indexes from the starting content of the current financial statement table; match the n1 extracted indexes against the indexes in the start feature index lists to obtain a first matching rate; when the first matching rate is greater than the third threshold, judge the current table to be a start table;
Specifically, when the start flags begin_sign of all three types of financial statement table are False, the regular expression rules of the financial indexes are used to identify and extract n1 indexes (n1 < m1) from the starting content of the current financial statement table. The n1 extracted indexes are then matched against each of the start feature index lists obtained in step S202 for the three types of financial statement. If the matching rate against one of the lists (say list a) is higher than the third threshold, the table is taken to be the start table of the financial statement type corresponding to list a, for example the start table of the balance sheet. The ID of this table is recorded as the start table ID of that financial statement type, and the start flag begin_sign of that type is set to True; for example, if the table is the start table of the balance sheet, asset_begin_sign is set to True;
S20322, when a start flag is True, use the regular expression rules of the financial indexes to identify and extract n2 indexes from the ending content of the current financial statement table; match the n2 extracted indexes against the indexes in the end feature index lists to obtain a second matching rate; when the second matching rate is greater than the fourth threshold, judge the current table to be an end table;
Specifically, when one of the start flags begin_sign of the three types of financial statement table is True, the regular expression rules of the financial indexes are used to identify and extract n2 indexes (n2 < m2) from the ending content of the current financial statement table. The n2 extracted indexes are then matched against each of the end feature index lists obtained in step S202 for the three types of financial statement. If the matching rate against one of the lists (say list b) is higher than the fourth threshold, the table is taken to be the end table of the financial statement type corresponding to list b, for example the end table of the balance sheet. The ID of this table is recorded as the end table ID of that financial statement type, and the start flag begin_sign of that type is set back to False;
S20323, according to the ID of the start table and the ID of the end table of a financial statement, mark the data in the tables of all IDs from the start table ID to the end table ID (both inclusive) as financial data of that financial statement type;
Specifically, if the table with ID1 is the start table of a financial statement and the table with ID5 is its end table, the data in the 5 tables ID1 to ID5 is marked as financial data of that financial statement type, for example as balance sheet financial data.
Preferably, n1 and n2 have the same value.
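The flag-driven scan of steps S20320 to S20323 can be sketched as a small state machine. Several points are assumptions made for the example rather than statements of the source: each table is represented as the list of index names already recognized in it, the matching rate is the fraction of extracted indexes found in the feature list, a single n is used for both n1 and n2, a table may be both the start and the end table of a one-table statement, and an open statement is matched only against the end list of its own type. The threshold values are placeholders.

```python
def match_rate(extracted, feature_list):
    """Fraction of the extracted indexes that appear in the feature list."""
    if not extracted:
        return 0.0
    return sum(1 for name in extracted if name in feature_list) / len(extracted)


def locate_statements(tables, start_lists, end_lists, n=3,
                      third_threshold=0.6, fourth_threshold=0.6):
    """Return {statement type: (start table ID, end table ID)}, IDs starting at 1."""
    begin_sign = {kind: False for kind in start_lists}  # S20320: all flags False
    start_id, spans = {}, {}
    for table_id, indexes in enumerate(tables, start=1):
        if not any(begin_sign.values()):
            # S20321: compare the first n indexes with every start feature list.
            for kind, features in start_lists.items():
                if match_rate(indexes[:n], features) > third_threshold:
                    begin_sign[kind] = True
                    start_id[kind] = table_id
                    break
        for kind, is_open in begin_sign.items():
            # S20322: compare the last n indexes with the open type's end list.
            if is_open and match_rate(indexes[-n:], end_lists[kind]) > fourth_threshold:
                spans[kind] = (start_id[kind], table_id)  # S20323: inclusive span
                begin_sign[kind] = False
    return spans


start_lists = {"balance_sheet": ["Cash", "Receivables", "Inventory"],
               "income_statement": ["Revenue", "Costs", "Operating profit"]}
end_lists = {"balance_sheet": ["Total liabilities", "Total equity",
                               "Total liabilities and equity"],
             "income_statement": ["Net profit", "Earnings per share"]}
tables = [["Company name", "Address"],
          ["Cash", "Receivables", "Inventory", "Fixed assets"],
          ["Short-term loans", "Payables"],
          ["Total liabilities", "Total equity", "Total liabilities and equity"],
          ["Revenue", "Costs", "Operating profit"],
          ["Net profit", "Earnings per share"]]
spans = locate_statements(tables, start_lists, end_lists)
```

In this sample the balance sheet spans tables 2 to 4 and the income statement spans tables 5 to 6; table 1 matches no start list and is left unmarked.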
S204, locate the data within the financial statements found above and record the financial data together with the index name and time corresponding to each item of financial data.
Specifically, each item of financial data in a financial statement is uniquely determined by its index name and time. Preferably, step S204 includes:
S2041, establish a first mapping between the index name corresponding to each item of financial data in a financial statement table and the row number where it is located; that is, the first mapping maps the row number of an item of financial data to its index name;
If the financial index denoted by such an index name has no corresponding standard name in the standard name library, the index name is added to both the actual name library and the standard name library, and a regular expression rule for the new financial index is added;
S2042, establish a second mapping between the time information corresponding to each item of financial data in a financial statement table and the column number where it is located; that is, the second mapping maps the column number of an item of financial data to its time information;
S2043, use the row and column numbers of each item of financial data in the financial statement table, together with the first and second mappings, to record the financial data and its corresponding index name and time;
Specifically, from the row and column numbers of an item of financial data and the two mappings "row number to index name" and "column number to time information", the index name and time of that item can be determined, and the item can then be recorded together with them.
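The two mappings of steps S2041 to S2043 can be sketched as follows. The table layout is an assumption for the example: the first row holds the time labels, the first column holds the index names, and every other non-empty cell is an item of financial data. Real statements may be laid out differently, and the function and field names are hypothetical.

```python
def record_financial_data(table):
    """Attach an index name and a time to every data cell of a statement table."""
    header, *body = table
    # Second mapping (S2042): column number -> time information.
    col_to_time = {c: label for c, label in enumerate(header) if c > 0 and label}
    # First mapping (S2041): row number -> index name.
    row_to_name = {r: row[0] for r, row in enumerate(body, start=1) if row and row[0]}
    records = []
    for r, row in enumerate(body, start=1):
        for c, value in enumerate(row):
            # S2043: resolve each cell through its row and column numbers.
            if r in row_to_name and c in col_to_time and value != "":
                records.append({"index_name": row_to_name[r],
                                "time": col_to_time[c],
                                "value": value})
    return records


table = [["Item", "2016-12-31", "2015-12-31"],
         ["Total assets", "1,000", "900"],
         ["Total liabilities", "400", ""]]
records = record_financial_data(table)
```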
S205, convert the units of the financial data that is of numeric type.
Specifically, the financial data recorded in step S204 is raw data (the data as presented in the document); data of numeric type (numeric data for short) must additionally undergo unit conversion to yield its actual value. Step S205 therefore preferably includes:
S2050, build the regular expression rules for unit information;
Specifically, several financial documents are randomly selected, the ways in which they express unit information are analyzed, and regular expression rules for unit information are established for these different expressions;
S2051, use the regular expression rules for unit information to identify the unit information of the financial statement;
Specifically, step S2051 preferably includes:
S20511, traverse the financial data in the financial statement table and use the regular expression rules for unit information to determine whether the table contains unit information. If it does, the unit information identified in the table is taken as the unit information of the financial statement. Otherwise, the table title of the financial statement is identified by regular expression rules, unit information is searched for within the n3 characters following the identified table title, and the unit information found nearest to the table title is taken as the unit information of the financial statement;
S2052, according to the unit information identified above, convert the numeric financial data so that it is expressed with the yuan as the monetary unit, and record the converted data in place of the data before conversion.
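Unit identification and conversion (steps S2050 to S2052) can be sketched as follows. The unit expressions covered by the pattern, the conversion factors and the default n3 = 50 are assumptions chosen for the example; the source derives its rules from sampled documents and does not list them.

```python
import re
from decimal import Decimal

# Assumed unit expressions and their factors relative to the yuan.
UNIT_FACTORS = {"元": Decimal(1), "千元": Decimal(1000), "万元": Decimal(10000),
                "百万元": Decimal(1000000), "亿元": Decimal(100000000)}
# Longer unit names come first so that "百万元" is not read as "万元" or "元".
UNIT_RULE = re.compile(r"单位\s*[:：]?\s*(?:人民币)?\s*(亿元|百万元|万元|千元|元)")


def find_unit(table_text, text_after_title, n3=50):
    """Look for a unit inside the table first, then in the n3 characters after its title."""
    match = UNIT_RULE.search(table_text)
    if match is None:
        # search() returns the first hit, i.e. the one nearest to the title.
        match = UNIT_RULE.search(text_after_title[:n3])
    return match.group(1) if match else None


def to_yuan(raw_value, unit):
    """Convert a numeric cell such as '1,234.5' expressed in `unit` to yuan."""
    return Decimal(raw_value.replace(",", "")) * UNIT_FACTORS[unit]


unit = find_unit("资产总计 1,234.5", "合并资产负债表 单位：人民币万元 2016年12月31日")
amount = to_yuan("1,234.5", unit)
```

Decimal is used instead of float so that large monetary amounts are not altered by binary rounding during the multiplication.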
Step (3): verify the financial data using the different parsing results corresponding to it.
In the document pool, each financial document corresponds to documents of several different formats, such as a PDF version, a WORD version, an EXCEL version and a TXT version. After step (2), one item of financial data can therefore have parsing results from documents of several formats. For example, suppose financial document a contains an item of financial data b, and in the document pool a corresponds to the WORD document a_d, the EXCEL document a_e and the TXT document a_t. After step (2) parses a_d, a_e and a_t, the financial data b has three parsing results, b_d, b_e and b_t. The results can differ because, although documents of every format are parsed by the method of step (2), the actual data storage form differs between formats, so the implementation details of the code differ and the parsed data may differ. In this embodiment, a separate parsing program module is configured for the documents of each format and/or independent auxiliary components are called. When the same item of financial data is parsed from documents of different formats by these independent modules, the parsing results are therefore independent of one another: if a code error produces a wrong result when the document of one format is parsed, this does not affect the parsing result from a document of another format.
In addition, if a financial document is a prospectus or a follow-up rating report, an index in the document often carries data for more than one period. For example, the 2016 prospectus of company A may contain financial data for 2013, 2014, 2015 and one quarter of 2016, while the 2015 prospectus of company A may contain financial data for 2012, 2013, 2014 and one quarter of 2015. The financial data for the index "total assets of company A in 2013" can thus come from two different document sources, the 2015 and 2016 prospectuses of company A. If each document source in turn corresponds to three format sources (WORD, EXCEL and TXT), then parsing the documents of the different formats from both sources yields data from 6 different sources for this index; that is, the financial data for the total assets of company A in 2013 has 6 different parsing results.
For the parsing results from different sources that correspond to the financial data of the same index (the same index of the same company at the same time), the accuracy of the different results may differ. The financial data therefore needs to be verified using its different parsing results, and the data obtained after verification is taken as the final financial data.
Preferably, step (3) specifically includes the following steps.
S301, divide the different parsing results corresponding to an item of financial data into categories;
Specifically, the different parsing results corresponding to the financial data are first compared pairwise. When the financial data is of numeric type, for example data of the financial index total assets, the lengths of the data before the decimal point (i.e., the number of digits before the decimal point) are compared; if the lengths are equal, the first m3 digits of the data are compared; if these digits are equal, the two data, i.e., the two parsing results, are considered to belong to the same category. When the financial data is of ratio type, for example data of the financial index return on net assets, the digits before the decimal point are compared; if they are equal, the first n4 digits after the decimal point are compared; if these are also equal, the two parsing results are considered to belong to the same category;
This processing is applied to the different parsing results corresponding to the financial data until all parsing results have been classified; each category obtained contains at least one parsing result;
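The comparison rules of step S301 can be sketched as follows. The sketch rests on assumptions that the source does not settle: values are non-negative decimal strings with thousands separators already removed, the "first m3 digits" are read from the digit string with the decimal point dropped, each new result is compared with the first member of every existing category rather than with all members, and m3 = 4 and n4 = 2 are placeholder values.

```python
def same_category(a, b, kind, m3=4, n4=2):
    """Decide whether two parsing results (decimal strings) fall in one category."""
    int_a, _, frac_a = a.partition(".")
    int_b, _, frac_b = b.partition(".")
    if kind == "numeric":
        # Same number of integer digits, then the same leading m3 digits.
        return len(int_a) == len(int_b) and (int_a + frac_a)[:m3] == (int_b + frac_b)[:m3]
    if kind == "ratio":
        # Same integer part, then the same first n4 digits after the decimal point.
        return int_a == int_b and frac_a[:n4] == frac_b[:n4]
    raise ValueError(f"unknown data kind: {kind}")


def classify(results, kind, m3=4, n4=2):
    """Group parsing results into categories; every category has at least one result."""
    categories = []
    for value in results:
        for group in categories:
            if same_category(group[0], value, kind, m3, n4):
                group.append(value)
                break
        else:
            categories.append([value])
    return categories


numeric_categories = classify(["123456.78", "123400.00", "12345.67"], "numeric")
ratio_categories = classify(["12.345", "12.349", "12.40"], "ratio")
```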
S302, from the at least one category obtained by the division, select the category that satisfies a first preset condition as the correct category;
Preferably, step S302 specifically includes:
S3021, when the division yields 1 category, take that category as the correct category;
S3022, when the division yields at least 2 categories, select the correct category from them according to the number of parsing results each category contains, the sum of the confidence values of the financial documents, and/or the issue time of the financial documents;
Specifically, step S3022 preferably includes:
S30221, count the categories obtained in step S301;
S30222, from the different categories, select the category containing the largest number of parsing results;
S30223, if step S30222 selects 1 category, take the selected category as the correct category;
S30224, if step S30222 selects at least two categories, calculate for each category the sum of the confidence values of its corresponding financial documents, and select the category for which this sum is largest;
For example, suppose step S30222 selects category 1, category 2 and category 3. The sum of the confidence values for category 1 is calculated by adding up the confidence scores k of the financial document sources corresponding to the parsing results in category 1; the sums for category 2 and category 3 are calculated in the same way. The category with the highest sum, for example category 1, is then selected from categories 1, 2 and 3;
S30225, if step S30224 selects 1 category, take the selected category as the correct category;
S30226, if step S30224 selects at least two categories, compare the issue times of the financial documents of the categories and select the category that contains the parsing result whose issue time is closest to the current date;
Specifically, the parsing results in a category come from different document sources with different issue times; the category containing the parsing result whose document source was issued closest to the current date is the category to be selected;
S30227, if step S30226 selects 1 category, take the selected category as the correct category;
S30228, if step S30226 selects at least two categories, determine the correct category according to other preset strategies and take the category finally selected as the correct category; these other preset strategies can be configured according to the actual situation;
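The cascade of tie-breakers in steps S30222 to S30228 can be sketched as follows. The record type and function names are hypothetical. Two simplifications are assumptions: "closest to the current date" is read as the latest issue date (no document is dated in the future), and the unspecified "other preset strategies" of S30228 are replaced by simply taking the first remaining category.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ParseResult:
    value: str          # parsed financial data
    confidence: float   # confidence score k of the source document
    issued: date        # issue date of the source document


def keep_best(categories, key):
    """Keep only the categories that reach the maximum of `key`."""
    best = max(key(c) for c in categories)
    return [c for c in categories if key(c) == best]


def select_correct_category(categories):
    """Apply the size, confidence-sum and issue-date criteria in that order."""
    candidates = keep_best(categories, key=len)                     # S30222
    if len(candidates) > 1:                                         # S30224
        candidates = keep_best(candidates, key=lambda c: sum(r.confidence for r in c))
    if len(candidates) > 1:                                         # S30226
        candidates = keep_best(candidates, key=lambda c: max(r.issued for r in c))
    return candidates[0]  # S30228 placeholder: first remaining category


cat1 = [ParseResult("100", 90.0, date(2015, 5, 1)), ParseResult("100", 80.0, date(2016, 5, 1))]
cat2 = [ParseResult("101", 95.0, date(2015, 5, 1)), ParseResult("101", 90.0, date(2016, 5, 1))]
cat3 = [ParseResult("999", 99.0, date(2016, 5, 1))]
correct = select_correct_category([cat1, cat2, cat3])
```

Here category 3 is dropped for holding only one result, and category 2 wins over category 1 because its confidence sum (185) exceeds 170.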
S303, select from the correct category the parsing result that satisfies a second preset condition as the correct data, and set the corresponding data confidence for the correct data;
Specifically, if the correct category contains more than 1 parsing result, the parsing result with the highest precision is taken as the correct data and the confidence of this correct data is set to high; if it contains exactly 1 parsing result, that parsing result is taken as the correct data and the confidence of this correct data is set to [...];
S304, take the correct data as the financial data obtained after verification.
Step (4): correct and optimize the document parsing.
Preferably, step (4) specifically includes:
S401, calculate the data parsing accuracy from the different parsing results corresponding to the financial data and the correct data;
S402, when the calculated data parsing accuracy is lower than a fifth threshold, correct and optimize the document parsing process according to a preset correction and optimization strategy until the calculated data parsing accuracy is greater than or equal to the fifth threshold;
Specifically, the preset correction and optimization strategy is as follows: each time a certain number of documents has been parsed, a portion of the data that step (3) placed in incorrect categories for these documents is randomly sampled and checked, and any error found is corrected and optimized. If the error is found not to originate in the source document itself, the code of the corresponding format module in step (2) is analyzed, the cause of the error is identified and fixed, and the parameters involved in step (2) are adjusted and optimized at the same time; steps (2), (3) and (4) are then executed again, so that the data parsing accuracy keeps improving until it finally reaches the fifth threshold.
As a further preferred implementation of this embodiment, the step of building the multi-format document pool using deduplication and format conversion techniques, i.e., step (1), and/or the step of locating and parsing the financial documents of the different document formats using the regular expression rules of the financial indexes, the start feature indexes and the end feature indexes to obtain the financial data and its corresponding index names and times, i.e., step (2), is performed in a distributed processing manner;
and/or
the multi-format document pool is stored on distributed storage servers, so as to achieve distributed processing and/or storage of the documents.
Specifically, because step S103 and/or step (2) takes a long time to execute for each document, step S103 and/or step (2) is performed in a distributed processing manner to improve the overall document processing efficiency; and the document pools of the different formats are stored in a distributed manner to improve the efficiency of subsequent document reads. The present invention therefore further preferably includes the following steps:
Step ①: according to the pending task count n of each server, send the PDF financial documents downloaded in step S102 to the server with the smallest pending task count, where the processing of step S103 is carried out;
Step ②: monitor each server, i.e. set up a monitoring module;
Specifically, let n_max be the maximum and n_min the minimum of the pending task counts across the servers. Determine whether n_min is greater than or equal to a preset sixth threshold n_extreme; if so, pause step S102 until n_max is less than or equal to a preset reasonable value n_recommend, and then resume step S102. If, within a time period t, the number of times c that n_min >= n_extreme occurs exceeds a preset warning count c_alarm, a system alarm is issued to remind the monitoring staff to resolve the problem by adding servers. This improves overall document processing efficiency, allows problems caused by insufficient server resources to be handled promptly, and improves the fluency, stability and reliability of the whole document processing flow;
Step ③: store the document pool of multiple different formats produced in step S103 on a fast_dfs distributed storage server to await subsequent parsing;
Step ④: according to the pending task count n of each server, send the financial documents in the document pool to the server with the smallest pending task count, where the processing of step (2) is carried out to obtain the parse results, improving the parsing efficiency for massive numbers of documents.
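The dispatch rule of steps ① and ④ and the monitoring rule of step ② can be sketched as follows. This is only an illustration under stated assumptions: each call to `observe` is treated as one monitoring poll, and each poll with n_min >= n_extreme counts as one occurrence toward c. The class `DownloadMonitor`, the function `pick_server` and the `window` parameter (standing in for the period t, in seconds) are hypothetical names, not part of the patent.

```python
from collections import deque


def pick_server(pending):
    """Return the server with the smallest pending task count n."""
    return min(pending, key=pending.get)


class DownloadMonitor:
    """Pause and resume downloading (step S102) and raise overload alarms."""

    def __init__(self, n_extreme, n_recommend, c_alarm, window):
        self.n_extreme = n_extreme      # sixth threshold on n_min
        self.n_recommend = n_recommend  # resume level for n_max
        self.c_alarm = c_alarm          # warning count within the period
        self.window = window            # length of period t (assumed seconds)
        self.paused = False
        self.overloads = deque()        # timestamps of overloaded polls

    def observe(self, pending, now):
        """Process one poll; return (download_paused, alarm_raised)."""
        n_min = min(pending.values())
        n_max = max(pending.values())
        if n_min >= self.n_extreme:
            # Even the least loaded server is saturated: stop downloading.
            self.paused = True
            self.overloads.append(now)
        elif self.paused and n_max <= self.n_recommend:
            # Every server has drained to the reasonable level: resume.
            self.paused = False
        # Forget overload events that fall outside the period t.
        while self.overloads and now - self.overloads[0] > self.window:
            self.overloads.popleft()
        alarm = len(self.overloads) > self.c_alarm
        return self.paused, alarm
```

Pausing on n_min and resuming on n_max gives the rule hysteresis: downloading stops only when every server is saturated and restarts only once every server has drained to the reasonable level, which avoids rapid toggling between the two states.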
As can be seen from the above, the present invention provides a PDF document processing method for massive numbers of documents. Using deduplication and format conversion techniques, the data of each PDF document is expanded from a single source into multiple sources of different formats in order to build a multi-format document pool. The regular expression rules of the financial indicators, the start feature indicators and the end feature indicators are then used to parse the financial statement data from the documents of multiple different formats in the document pool, and the different parse results corresponding to the financial data of the same indicator are used for verification, so that the data with the highest confidence is taken as the final parsed financial data. The processing scheme of the present invention can therefore parse financial data quickly and accurately and yield financial data of high accuracy. In addition, the document parsing flow is iteratively corrected and optimized according to the verification results, which further improves the accuracy of document parsing.
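The cross-format verification summarized above can be illustrated with a short sketch. The preset conditions used to choose the correct category are not spelled out in this passage, so the sketch assumes one plausible rule: group the parse results of one indicator and period by value, prefer the group with the most results, and break ties by the summed confidence of the source documents. The reliability figure returned (the share of results that agree with the chosen value) is likewise an assumption, and the function name `verify` is hypothetical.

```python
from collections import defaultdict


def verify(parse_results):
    """Choose the most trusted value among results parsed from several formats.

    parse_results: list of (value, document_confidence) pairs for a single
    indicator and period, one pair per document format that produced a value.
    Returns (chosen_value, reliability).
    """
    if not parse_results:
        raise ValueError("no parse results to verify")
    # Category division: identical values fall into the same category.
    groups = defaultdict(list)
    for value, confidence in parse_results:
        groups[value].append(confidence)
    # Assumed selection rule: most results first, then highest summed confidence.
    best_value, confidences = max(
        groups.items(), key=lambda item: (len(item[1]), sum(item[1]))
    )
    reliability = len(confidences) / len(parse_results)
    return best_value, reliability
```

For example, if the values parsed from three converted formats of one report are 100.0, 100.0 and 90.0, the sketch selects 100.0 with a reliability of 2/3. Grouping relies on exact equality, so values would need to be normalized (for example after unit conversion) before being compared.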
The document processing scheme of the present invention is applicable to financial big data parsing tasks such as parsing the financial data in corporate annual reports, in bond issuance prospectuses, and in follow-up credit rating reports for bond issues.
All technical content of this embodiment may be freely split out of, or used in combination with, Embodiments 1 to 3 above.
The above describes the preferred implementations of the present invention, but the invention is not limited to these embodiments. Those skilled in the art may make various equivalent variations or substitutions without departing from the spirit of the invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of this application.
Claims (10)
- 1. A PDF document processing method based on big data, characterized in that the method comprises the following steps: building a multi-format document pool using deduplication and format conversion techniques, wherein the multi-format document pool contains financial documents of multiple different document formats; after locating and parsing the financial documents of multiple different document formats using the regular expression rules of the financial indicators, the start feature indicators and the end feature indicators, obtaining the financial data and the indicator names and times corresponding to the financial data; and verifying the financial data using the different parse results corresponding to the financial data.
- 2. The PDF document processing method based on big data according to claim 1, characterized in that the step of building the multi-format document pool using deduplication and format conversion techniques specifically comprises: building a document download link pool using deduplication techniques; downloading at least one corresponding PDF financial document using at least one PDF financial document download link contained in the document download link pool; and, after converting the downloaded PDF financial documents into financial documents of different document formats using format conversion techniques, placing the financial documents of different document formats into the multi-format document pool.
- 3. The PDF document processing method based on big data according to claim 2, characterized in that the step of building the multi-format document pool using deduplication and format conversion techniques further specifically comprises: calculating the confidence value of each financial document in the multi-format document pool.
- 4. The PDF document processing method based on big data according to claim 3, characterized in that the step of obtaining the financial data and the indicator names and times corresponding to the financial data, after locating and parsing the financial documents of multiple different document formats using the regular expression rules of the financial indicators, the start feature indicators and the end feature indicators, specifically comprises: locating the financial statements in the financial documents of multiple different document formats using the regular expression rules of the financial indicators, the start feature indicators and the end feature indicators; after locating the data within the located financial statements, recording the financial data and the indicator names and times corresponding to the financial data; and performing unit conversion on the financial data of numeric type.
- 5. The PDF document processing method based on big data according to claim 4, characterized in that the following steps are provided before the step of locating and parsing the financial documents of multiple different document formats using the regular expression rules of the financial indicators, the start feature indicators and the end feature indicators: building the regular expression rules of the financial indicators; and/or obtaining the start feature indicators and end feature indicators of the financial statements.
- 6. The PDF document processing method based on big data according to claim 4, characterized in that the step of verifying the financial data using the different parse results corresponding to the financial data specifically comprises: dividing the different parse results corresponding to the financial data into categories; selecting, from the at least one category obtained by the division, the category that satisfies a first preset condition as the correct category; selecting, from the correct category, the parse result that satisfies a second preset condition as the correct data, and setting the data reliability corresponding to the correct data; and taking the correct data as the verified financial data.
- 7. The PDF document processing method based on big data according to claim 6, characterized in that the step of selecting, from the at least one category obtained by the division, the category that satisfies the first preset condition as the correct category specifically comprises: when the number of categories obtained by the division is 1, taking that category as the correct category; when the number of categories obtained by the division is at least 2, selecting the corresponding category from the at least 2 categories as the correct category according to the number of parse results contained in each category, the sum of the confidence values of the financial documents and/or the publication time of the financial documents.
- 8. The PDF document processing method based on big data according to claim 6 or 7, characterized in that the method further comprises a document parsing correction and optimization step, which specifically comprises the following steps: computing the data parsing accuracy from the different parse results corresponding to the financial data and from the correct data; and, when the computed data parsing accuracy is below a threshold, revising and optimizing the document parsing process according to a preset correction and optimization strategy until the computed data parsing accuracy is greater than or equal to the threshold.
- 9. The PDF document processing method based on big data according to any one of claims 1-7, characterized in that the step of building the multi-format document pool using deduplication and format conversion techniques, and/or the step of obtaining the financial data and the indicator names and times corresponding to the financial data after locating and parsing the financial documents of multiple different document formats using the regular expression rules of the financial indicators, the start feature indicators and the end feature indicators, is performed in a distributed processing manner; and/or the multi-format document pool is stored on distributed storage servers.
- 10. A PDF document processing device based on big data, characterized in that the device comprises: at least one processor; and at least one memory for storing at least one program; wherein, when the at least one program is executed by the at least one processor, the at least one processor implements the PDF document processing method based on big data according to any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711080720.3A CN107943785B (en) | 2017-11-06 | 2017-11-06 | PDF document processing method and device based on big data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711080720.3A CN107943785B (en) | 2017-11-06 | 2017-11-06 | PDF document processing method and device based on big data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107943785A (en) | 2018-04-20 |
CN107943785B (en) | 2021-07-20 |
Family
ID=61934391
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711080720.3A (Active; CN107943785B) | PDF document processing method and device based on big data | 2017-11-06 | 2017-11-06 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107943785B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102063493A (en) * | 2010-12-30 | 2011-05-18 | 北京大学 | Content extraction method based on regular expression group and control logic |
CN102508860A (en) * | 2011-09-29 | 2012-06-20 | 广州中浩控制技术有限公司 | Data mining method based on XBRL (extensible business reporting language) embodiment document |
US20170235848A1 (en) * | 2012-08-29 | 2017-08-17 | Dennis Van Dusen | System and method for fuzzy concept mapping, voting ontology crowd sourcing, and technology prediction |
US20140195891A1 (en) * | 2013-01-04 | 2014-07-10 | Cognizant Technology Solutions India Pvt. Ltd. | System and method for automatically extracting multi-format data from documents and converting into xml |
CN103744983A (en) * | 2014-01-15 | 2014-04-23 | 北京理工大学 | Method for extracting meta-information of electronic documents |
CN104731941A (en) * | 2015-03-31 | 2015-06-24 | 浪潮集团有限公司 | Method for capturing data from unstructured financial report based on XBRL technology |
CN106445910A (en) * | 2015-09-02 | 2017-02-22 | 深圳市览网络股份有限公司 | Document analysis method and apparatus |
CN105843783A (en) * | 2016-03-21 | 2016-08-10 | 哈尔滨工程大学 | Chinese PDF file text content extraction method oriented to network flow transmission |
CN106484663A (en) * | 2016-10-12 | 2017-03-08 | 天闻数媒科技(湖南)有限公司 | A kind of extracting method of document content and device |
Non-Patent Citations (3)
Title |
---|
DUY DUC AN BUI 等: "PDF text classification to leverage information extraction from publication reports", 《JOURNAL OF BIOMEDICAL INFORMATICS》 * |
刘力: "科技文档信息抽取与格式化技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
胡伟: "基于数据挖掘的上市公司财务数据分析系统的设计", 《中国优秀硕士学位论文全文数据库 经济与管理科学辑》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110543475A (en) * | 2019-08-29 | 2019-12-06 | 深圳市原点参数科技有限公司 | financial statement data automatic identification and analysis method based on machine learning |
CN110909226A (en) * | 2019-11-28 | 2020-03-24 | 达而观信息科技(上海)有限公司 | Financial document information processing method and device, electronic equipment and storage medium |
CN110909226B (en) * | 2019-11-28 | 2023-06-06 | 达而观信息科技(上海)有限公司 | Financial document information processing method and device, electronic equipment and storage medium |
CN112015727A (en) * | 2020-09-01 | 2020-12-01 | 民生科技有限责任公司 | Automatic checking and correcting system and method for financial statement data and readable storage device |
Also Published As
Publication number | Publication date |
---|---|
CN107943785B (en) | 2021-07-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||