CN106919553A - Document analysis method and apparatus - Google Patents

Info

Publication number
CN106919553A
Authority
CN
China
Prior art keywords
file
parsing
document analysis
fragment
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610716428.5A
Other languages
Chinese (zh)
Inventor
毛启明
王啸
曾宪玺
吴笑笑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201610716428.5A
Publication of CN106919553A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a document analysis method and apparatus in which a file is divided into at least two file fragments and the resulting fragments are parsed in parallel. Parallel parsing increases the speed at which the file is parsed and thereby solves the technical problem of slow file parsing. In particular, when the file holds a large amount of data, parsing can be completed before a preset point in time; in time-sensitive scenarios such as financial services, this avoids the economic loss caused when downstream business stalls because parsing has not finished.

Description

Document analysis method and apparatus
Technical field
The present invention relates to information technology, and in particular to a document analysis method and apparatus.
Background
After a file is received, it must first be parsed to confirm that the file is accurate and that it can be recognized during subsequent processing, so that further processing can be carried out on it. During parsing, a series of operations such as scanning and verification may be performed on the content or format of the file.
For example, in financial services, when a file recording financial data is obtained from an external company, the format and field contents of the file must be confirmed to be accurate before the financial data is processed, so that the next data-processing step is accurate. The file obtained from the external company is therefore parsed, and the corresponding data-processing procedure is executed only after parsing passes.
At present, an obtained file is mostly parsed by a single process. When the file holds a large amount of data, parsing is slow, so there is no guarantee that parsing will be completed before a preset point in time. In scenarios with strict timeliness requirements, such as financial services, this can stall downstream business and cause serious losses.
Summary of the invention
The present invention provides a document analysis method and apparatus for solving the technical problem in the prior art that file parsing is slow.
To achieve the above object, embodiments of the present invention adopt the following technical solutions:
In a first aspect, a document analysis method is provided, including:
dividing a file into at least two file fragments;
parsing the at least two file fragments in parallel.
In a second aspect, a document analysis apparatus is provided, including:
a division module, configured to divide a file into at least two file fragments;
a parsing module, configured to parse the at least two file fragments in parallel.
In the document analysis method and apparatus provided by embodiments of the present invention, a file is divided into at least two file fragments and the resulting fragments are parsed in parallel. Parallel parsing increases the speed at which the file is parsed and thereby solves the technical problem of slow file parsing. In particular, when the file holds a large amount of data, parsing can be completed before a preset point in time; in time-sensitive scenarios such as financial services, this avoids the economic loss caused when downstream business stalls because parsing has not finished.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and practiced according to the content of the specification, and so that the above and other objects, features and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to a person of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, identical parts are denoted by identical reference numerals. In the drawings:
Fig. 1 is a flowchart of a document analysis method provided by embodiment one of the present invention;
Fig. 2 is a diagram of a file parsing scenario;
Fig. 3 is a diagram of the principle of file parsing;
Fig. 4 is a flowchart of a document analysis method provided by embodiment two of the present invention;
Fig. 5 is a structural diagram of a document analysis apparatus provided by embodiment three of the present invention;
Fig. 6 is a structural diagram of a document analysis apparatus provided by embodiment four of the present invention;
Fig. 7 is a structural diagram of a document analysis system provided by embodiment five of the present invention;
Fig. 8 is a diagram of the principle of the document analysis system.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be conveyed in full to those skilled in the art.
The document analysis method, apparatus and system provided by embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment one
Fig. 1 is a flowchart of a document analysis method provided by embodiment one of the present invention. The method provided by this embodiment may be performed by a document analysis system. As shown in the scenario diagram of Fig. 2, the document analysis system may obtain or receive a file to be parsed from an external system and parse it, so that after parsing another business processing system can execute its business processing flow. In one possible application scenario, the document analysis system runs in a distributed cluster; in another, it runs on a single machine. This embodiment places no limitation on this.
As shown in Fig. 1, the document analysis method includes:
Step 101: divide a file into at least two file fragments.
Specifically, the file may be divided into multiple file fragments according to a preset partition strategy. Dividing the file here means dividing the bytes that make up the file. Before dividing, a minimum division unit must therefore be determined according to the type of the file, and the file is divided along the boundaries of that unit, so that the bytes belonging to one minimum division unit are never split across different file fragments, which would make the file impossible to parse. The minimum division unit is thus the smallest unit of a given file type that can be parsed on its own in the subsequent parsing.
For example, for a file of document type, the minimum division unit may be a sentence, a paragraph or a page; for a file of table type, it may be a cell, a row or a column.
The partition strategy may divide the file according to a preset data amount per file fragment or according to a preset number of file fragments, yielding multiple file fragments.
For example, the file may be divided into a fixed 100 file fragments, each holding 1% of the file's data. Alternatively, the division may keep the data amount of each file fragment at approximately a fixed value, in which case the number of file fragments is not fixed and is adjusted according to the data amount of the file.
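The two partition strategies can be illustrated with a minimal sketch. It assumes a table-type file stored as newline-separated rows, with the row taken as the minimum division unit; the function names and the choice of separator are illustrative assumptions rather than part of the embodiment. Cuts are made only directly after a row separator, so no row is split across fragments.

```python
def split_by_size(data: bytes, size: int, sep: bytes = b"\n") -> list[tuple[int, int]]:
    """Return (first_byte, last_byte) offsets of fragments of roughly `size` bytes.

    A fragment is always extended to the next row separator, so the bytes of
    one row (the assumed minimum division unit) never land in two fragments.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    fragments = []
    start = 0
    while start < len(data):
        # Find the first separator at or after the target cut point.
        cut = data.find(sep, min(start + size, len(data)) - 1)
        end = len(data) - 1 if cut == -1 else cut + len(sep) - 1
        fragments.append((start, end))
        start = end + 1
    return fragments


def split_by_count(data: bytes, count: int, sep: bytes = b"\n") -> list[tuple[int, int]]:
    """Aim for about `count` fragments; uneven rows may yield fewer."""
    if count <= 0:
        raise ValueError("count must be positive")
    return split_by_size(data, max(1, len(data) // count), sep)
```

With a preset number the target size is derived from the file size; with a preset data amount the number of fragments follows from the file size, as the paragraph above describes.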
Step 102: parse the at least two file fragments in parallel.
Specifically, a parsing task is generated for each file fragment according to its position in the file, and each parsing task is scheduled to a corresponding process among at least two processes. The at least two processes then execute their scheduled parsing tasks in parallel.
In one possible implementation, when the at least two processes execute the scheduled parsing tasks in parallel, each process reads its corresponding file fragment from the file according to the position recorded in the parsing task scheduled to it. The position recorded in a parsing task may be the relative positions in the file of the first and last bytes of the file fragment. When the parsing task indicates that a verification operation is to be performed on the file fragment, the process verifies the fragment it has read according to preconfigured verification rules. A verification rule may check, for example, whether the text in a preset row or column conforms to a certain format, whether a number falls within a certain value range, or whether the value in a certain cell is empty.
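The read-then-verify behaviour of each worker can be sketched as follows. This is an illustration under stated assumptions: the ParseTask fields, the comma-separated row format and the single verification rule are hypothetical, and a thread pool stands in for the at least two processes so that the sketch stays self-contained (a process pool could be substituted).

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseTask:
    path: str        # file that the fragment belongs to
    first_byte: int  # offset of the fragment's first byte in the file
    last_byte: int   # offset of the fragment's last byte (inclusive)


def read_fragment(task: ParseTask) -> bytes:
    """Read only the bytes of the fragment recorded in the task."""
    with open(task.path, "rb") as f:
        f.seek(task.first_byte)
        return f.read(task.last_byte - task.first_byte + 1)


def run_task(task: ParseTask) -> bool:
    """Hypothetical rule: each row has two cells and the second is a number."""
    for row in read_fragment(task).decode().splitlines():
        cells = row.split(",")
        if len(cells) != 2 or not cells[1].strip().isdigit():
            return False
    return True


def parse_in_parallel(tasks: list[ParseTask], workers: int = 2) -> list[bool]:
    """Execute the tasks concurrently; results are in task order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, tasks))
```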
It should be noted that parsing may include one or a combination of operations such as verification, format conversion and scanning. The verification operation in this embodiment is only an illustrative example and does not limit the embodiments of the present invention.
As shown in Fig. 3, the file is divided into at least two file fragments and the resulting fragments are parsed in parallel. Parallel parsing increases the speed at which the file is parsed and solves the technical problem of slow file parsing; in particular, when the file holds a large amount of data, parsing can be completed before a preset point in time.
Embodiment two
Fig. 4 is a flowchart of a document analysis method provided by embodiment two of the present invention. The method of this embodiment may be performed by the document analysis system shown in Fig. 2 and is typically used to parse financial report files; that is, the file in this embodiment is of table type. As shown in Fig. 4, the method includes:
Step 201: obtain a file when a file notification is received.
Specifically, when a file notification is received, the generated file is obtained from the external system and stored in memory.
Step 202: scan the obtained file.
Specifically, the file in memory is scanned, including a byte scan and a hash scan. The byte scan determines the data amount of the file. The hash scan determines a check code for the file, which is used to judge whether the file was damaged in transmission: if the file is undamaged, the next step is executed; otherwise the file is obtained from the external system again. Once scanning is complete, attribute information such as the data amount of the file is available.
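A minimal sketch of the two scans is given below. It assumes SHA-256 as the hash and assumes that the external system supplies the expected check code alongside the file; the embodiment specifies neither, and the function names are illustrative.

```python
import hashlib


def scan_file(data: bytes) -> dict:
    """Byte scan gives the data amount; hash scan gives the check code."""
    return {"size": len(data), "check_code": hashlib.sha256(data).hexdigest()}


def is_intact(data: bytes, expected_check_code: str) -> bool:
    """True when the received bytes match the check code from the sender."""
    return scan_file(data)["check_code"] == expected_check_code
```

If is_intact returns False, the file would be fetched from the external system again before step 203 is attempted.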
Step 203: divide the file into a preset number of file fragments and obtain the position of each file fragment in the file.
The position is the relative position in the file of the first byte of the file fragment and the relative position in the file of its last byte. For example, it may state which bytes of the file the first and last bytes of the fragment are, or it may give the storage locations of the first and last bytes of the fragment within the region where the file is stored.
Specifically, because the file in this embodiment is of table type, the cell can be taken as the minimum division unit and the file divided into the preset number of file fragments, with roughly equal data amounts across fragments.
For example, the file is divided into a fixed 100 file fragments, each holding 1% of the file's total data. Each file fragment corresponds to a fixed set of cells in the report file, and the cells of all file fragments together make up the complete report file.
Step 204: generate one parsing task for each file fragment.
Specifically, a parsing task includes the relative position in the file of the first byte of the file fragment, the relative position in the file of its last byte, and the data amount of the file fragment. The parsing task may also be given an identifier derived from the file fragment, which makes it easier to schedule the task to the process corresponding to that fragment.
Step 205: schedule the parsing tasks to the corresponding processes.
Specifically, the correspondence between file fragments and processes may be preset, so that the corresponding process is determined from the file fragment indicated by the identifier of the parsing task, and the parsing task is scheduled to that process for parsing.
Step 206: the processes execute the parsing tasks in parallel.
Specifically, each process is configured in advance with corresponding verification rules, which specify the format or data to be verified for each row and column of the file fragment. For example, the verification rules may require that the first column of the file fragment be text and that the second column be a number and not be empty.
Each process thus has at least one corresponding verification rule, and the verification rules may differ between processes.
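The column rules in the example of step 206 might be configured as in the following sketch. The rule functions, the RULES mapping and the representation of rows as lists of strings are illustrative assumptions, not part of the embodiment.

```python
from typing import Callable

Rule = Callable[[str], bool]


def is_text(cell: str) -> bool:
    """Assumed meaning of 'text': contains at least one letter."""
    return any(ch.isalpha() for ch in cell)


def is_number(cell: str) -> bool:
    """A non-empty cell that parses as a number."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


# Column index -> rule: first column is text, second is a non-empty number.
RULES: dict[int, Rule] = {0: is_text, 1: is_number}


def check_rows(rows: list[list[str]], rules: dict[int, Rule]) -> list[tuple[int, int]]:
    """Return the (row, column) indices of cells that violate their rule."""
    failures = []
    for r, row in enumerate(rows):
        for c, rule in rules.items():
            if c >= len(row) or not rule(row[c]):
                failures.append((r, c))
    return failures
```

Because the rules are passed in as a mapping, a different mapping can be given to each process, matching the statement that verification rules may differ between processes.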
Step 207: generate a parsing result and judge whether it indicates success. If so, carry out business processing; otherwise, indicate the file fragment that caused parsing to fail and prompt for manual handling.
If the verification result of at least one process is a failure, parsing of the file is determined to have failed. If the verification results of all processes are passes, parsing of the file is determined to have succeeded.
When parsing of the file fails, the position in the file of the file fragment that failed to parse is located, and the position in the file of the cell that failed verification may even be located as well, so that the parsing result indicates the position in the file of the failed file fragment and/or cell.
This matters because files are large in practice. If it were known only that parsing had failed, and the location of the problem could not be determined, manual handling would have to correct the whole file, which is a large amount of work. If instead the position in the file of the fragment parsed by the failing process is looked up, and combined with the position within that fragment of the failing cell as recorded by the process, the position in the file of the specific cell that caused the failure can be determined. Only that cell then needs to be corrected, which greatly reduces the correction work.
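The position arithmetic described above can be sketched as follows: the offset of the failing cell within its fragment is added to the fragment's first-byte position in the file. The function name and the comma and newline separators are assumptions made for illustration.

```python
def failed_cell_span(fragment: bytes, fragment_first_byte: int, row: int, col: int,
                     row_sep: bytes = b"\n", cell_sep: bytes = b",") -> tuple[int, int]:
    """Return (first_byte, last_byte) of a cell as offsets in the whole file."""
    rows = fragment.split(row_sep)
    # Bytes taken up by the rows that precede the failing row.
    offset = sum(len(r) + len(row_sep) for r in rows[:row])
    cells = rows[row].split(cell_sep)
    # Bytes taken up by the cells that precede the failing cell in its row.
    offset += sum(len(c) + len(cell_sep) for c in cells[:col])
    start = fragment_first_byte + offset
    return start, start + len(cells[col]) - 1
```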
Embodiment three
Fig. 5 is a structural diagram of a document analysis apparatus provided by embodiment three of the present invention. As shown in Fig. 5, the apparatus includes a division module 31 and a parsing module 32.
The division module 31 is configured to divide a file into at least two file fragments.
Specifically, the division module 31 is configured to determine the minimum division unit of the file according to the type of the file, and to divide the file using the minimum division unit.
The parsing module 32 is configured to parse the at least two file fragments in parallel.
The division module 31 divides the file into at least two file fragments, and the parsing module 32 parses the resulting fragments in parallel. Parallel parsing increases the speed at which the file is parsed and solves the technical problem of slow file parsing. In particular, when the file holds a large amount of data, parsing can be completed before a preset point in time; in time-sensitive scenarios such as financial services, this avoids the economic loss caused when downstream business stalls because parsing has not finished.
The document analysis apparatus provided by this embodiment is configured to perform the document analysis method provided by embodiment one. For the implementation of its functional modules, refer to the related description in the preceding embodiment; it is not repeated here.
Embodiment four
To describe the document analysis apparatus of embodiment three more clearly, Fig. 6 is a structural diagram of a document analysis apparatus provided by embodiment four of the present invention. On the basis of Fig. 5, in one possible implementation shown in Fig. 6, the division module 31 includes a first division unit 311 and a second division unit 312.
The first division unit 311 is configured to divide the file according to a preset data amount, to obtain file fragments that satisfy the preset data amount.
The second division unit 312 is configured to divide the file according to a preset number, to obtain file fragments that satisfy the preset number.
It should be noted that in the document analysis apparatus of this embodiment the division module 31 may include the first division unit 311 and/or the second division unit 312; Fig. 6 merely illustrates one possible implementation and does not limit the embodiments of the present invention.
Further, the parsing module 32 includes a generation unit 321, a scheduling unit 322 and an execution unit 323.
The generation unit 321 is configured to generate a parsing task for each file fragment according to its position in the file.
The scheduling unit 322 is configured to schedule each parsing task to a corresponding process among at least two processes.
The execution unit 323 is configured to have the at least two processes execute the scheduled parsing tasks in parallel.
The execution unit 323 includes a reading subunit 3231 and a parsing subunit 3232.
The reading subunit 3231 is configured, for each process, to read the corresponding file fragment from the file according to the position recorded in the scheduled parsing task.
The parsing subunit 3232 is configured to parse the file fragment that has been read.
In one possible implementation, the parsing task is used to perform a verification operation on the file fragment. On this basis, the parsing subunit 3232 is configured to verify the file fragment that has been read according to the verification rules preconfigured for the process.
Further, the document analysis apparatus also includes a scanning module 33, an object module 34 and a locating module 35.
The scanning module 33 is configured to scan the file to determine that it has been received correctly.
The object module 34 is configured to determine that parsing of the file has failed when at least one of the at least two file fragments fails to parse.
The locating module 35 is configured to locate the position in the file of the file fragment that failed to parse.
The document analysis apparatus provided by this embodiment is configured to perform the document analysis methods provided by embodiments one and two. For the implementation of its functional modules, refer to the related description in the preceding embodiments; it is not repeated here.
Embodiment five
Fig. 7 is a structural diagram of a document analysis system provided by embodiment five of the present invention. The document analysis system runs in a distributed cluster. As shown in Fig. 7, its architecture includes a pre-partitioning layer, a scheduling layer and a task parsing layer.
At least one pre-partitioning server runs on the pre-partitioning layer, at least one scheduling server runs on the scheduling layer, and multiple task parsing servers run on the task parsing layer.
The pre-partitioning server is configured to divide a file into at least two file fragments.
There may be one or more scheduling servers. Different scheduling servers may run in different clusters, and each scheduling server schedules the task parsing servers in its own cluster to execute parsing tasks.
Specifically, when the scheduling layer contains only one scheduling server, the pre-partitioning layer sends all the file fragments to be parsed to that scheduling server, which generates parsing tasks from the fragments and schedules the tasks. When the scheduling layer contains two or more scheduling servers, one round of scheduling is first carried out among them to determine which file fragments each scheduling server is to handle and to send those fragments to it; each scheduling server then processes the fragments it has received, including generating parsing tasks and scheduling them. File fragments are thus distributed once when the scheduling layer contains a single scheduling server, and twice when it contains at least two.
To explain the operation of the document analysis system clearly, Fig. 8 is a diagram of the principle of the document analysis system. As shown in Fig. 8, the operation includes:
Step 501: after generating a file, the external system sends the document analysis system a notification message that the file has arrived.
Step 502: the pre-partitioning layer of the document analysis system obtains the current partition strategy and divides the file according to it, obtaining the file fragments.
Specifically, the file may be divided into multiple file fragments according to a preset partition strategy. Dividing the file here means dividing the bytes that make up the file. Before dividing, a minimum division unit must therefore be determined according to the type of the file, and the file is divided along the boundaries of that unit, so that the bytes belonging to one minimum division unit are never split across different file fragments, which would make the file impossible to parse. The minimum division unit is thus the smallest unit of a given file type that can be parsed on its own in the subsequent parsing.
For example, for a file of document type, the minimum division unit may be a sentence, a paragraph or a page; for a file of table type, it may be a cell, a row or a column.
The partition strategy may divide the file according to a preset data amount per file fragment or according to a preset number of file fragments, yielding multiple file fragments.
Step 503: the pre-partitioning layer records the relative positions in the file of the first and last bytes of each file fragment, obtaining a node record.
Step 504: the content of the file and the node record are stored, and the stored content and node record are sent to the distribution scheduling layer.
Step 505: the distribution scheduling layer stores the file fragments of the file in separate blocks according to the node record.
Specifically, the distribution scheduling layer reads the bytes recorded in the node record, determines from them the first and last bytes of each file fragment and hence the content of each fragment, and stores each fragment so determined in blocks.
In one possible implementation, each file fragment is stored in its own physical storage block, and that block holds only that file fragment.
In another possible implementation, when storage space is insufficient, different file fragments may be stored in the same physical storage block, with a special distinguishing byte used to separate two file fragments.
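Sharing one storage block between several fragments might look like the sketch below. The separator value (the ASCII record separator, 0x1E) and the function names are assumptions; the embodiment says only that a special distinguishing byte is used, so the sketch rejects fragments that already contain the chosen byte.

```python
SEPARATOR = b"\x1e"  # assumed special byte; must not occur inside any fragment


def pack_fragments(fragments: list[bytes]) -> bytes:
    """Store several fragments in one block, separated by the special byte."""
    if any(SEPARATOR in fragment for fragment in fragments):
        raise ValueError("fragment contains the separator byte")
    return SEPARATOR.join(fragments)


def unpack_fragments(block: bytes) -> list[bytes]:
    """Recover the individual fragments from a shared block."""
    return block.split(SEPARATOR)
```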
Step 506: the distribution scheduling layer generates parsing tasks according to the storage location of each file fragment.
Each parsing task corresponds to one storage block, and the first and last bytes of the file fragment may be recorded in the parsing task.
Step 507: the distribution scheduling layer schedules each parsing task to the corresponding task parsing server in the task parsing layer, so that the task parsing server executes the parsing task and thereby parses the file fragment.
Specifically, the distribution scheduling layer may schedule according to a preset scheduling strategy and send each parsing task to the corresponding task parsing server. For example, the scheduling strategy may be a preset correspondence between parsing tasks and task parsing servers, or it may be load balancing.
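The two scheduling strategies named above can be sketched as follows. The function names, the use of fragment size as the load measure and the greedy least-loaded assignment are illustrative assumptions; the embodiment does not prescribe a particular load-balancing algorithm.

```python
def schedule_preset(task_ids: list[str], mapping: dict[str, str]) -> dict[str, str]:
    """Preset correspondence: each task goes to its preconfigured server."""
    return {task: mapping[task] for task in task_ids}


def schedule_least_loaded(task_sizes: dict[str, int], servers: list[str]) -> dict[str, str]:
    """Load balancing: assign the largest tasks first to the least-loaded server."""
    load = {server: 0 for server in servers}
    assignment = {}
    for task, size in sorted(task_sizes.items(), key=lambda item: -item[1]):
        server = min(servers, key=lambda s: load[s])
        assignment[task] = server
        load[server] += size
    return assignment
```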
Step 508: each task parsing server reads the file fragment from the corresponding storage block according to the parsing task it has received.
Step 509: each task parsing server parses its file fragment.
When the parsing task indicates that a verification operation is to be performed on the file fragment, the fragment that has been read is verified according to preconfigured verification rules. A verification rule may check, for example, whether the text in a preset row or column conforms to a certain format, whether a number falls within a certain value range, or whether the value in a certain cell is empty.
Step 510: each task parsing server returns a parsing result to the distribution scheduling layer.
The parsing result includes indication information stating whether the file fragment was parsed successfully or failed.
Step 511: the distribution scheduling layer aggregates the parsing results it has received.
If at least one parsing result is a failure, parsing of the file is determined to have failed. If all parsing results are passes, parsing of the file is determined to have succeeded.
When parsing of the file fails, the position in the file of the file fragment that failed to parse is located. Specifically, based on the task parsing server that sent the failed parsing result, the scheduling records can be queried to find which file fragment's parsing task was scheduled to that server, thereby determining which file fragment in the file failed to parse.
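The aggregation in step 511 can be sketched as follows. The dictionary layout of the results and of the recorded fragment positions is an assumption made for illustration: each task identifier maps to a pass or fail flag and to the (first byte, last byte) position of its fragment.

```python
def summarize(results: dict[str, bool],
              task_positions: dict[str, tuple[int, int]]) -> dict:
    """Parsing succeeds only if every task passed; failed fragments are located."""
    failed = sorted(task_positions[task] for task, ok in results.items() if not ok)
    return {"success": not failed, "failed_fragments": failed}
```

A single failed task is enough to mark the whole file as failed, and the returned positions tell manual handling which fragments to inspect.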
Step 512: the distribution scheduling layer returns the aggregated result to the pre-partitioning layer.
Because the distribution scheduling layer stored each file fragment in an earlier step, it knows the storage location and content of each fragment. When the aggregated result is a parsing failure, the aggregated result can therefore carry the content of the file fragment that caused the failure, or an identifier such as its relative position in the file.
The pre-partitioning layer can then determine from the aggregated result how to continue processing the file. For example, when the aggregated result indicates that parsing succeeded, the business processing system that handles the file is notified to execute its data-processing flow; when it indicates that parsing failed, manual handling is requested.
In this embodiment, the file is divided into at least two file fragments and the resulting fragments are parsed in parallel. Parallel parsing increases the speed at which the file is parsed and solves the technical problem of slow file parsing; in particular, when the file holds a large amount of data, parsing can be completed before a preset point in time.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disc.
Finally, it should be noted that the above embodiments serve only to illustrate the technical solutions of the present invention and not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art will understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (16)

1. A file parsing method, characterized by comprising:
dividing a file into at least two file fragments; and
parsing the at least two file fragments in parallel.
2. The file parsing method according to claim 1, characterized in that dividing the file into at least two file fragments comprises:
dividing the file according to a preset data amount, to obtain file fragments that conform to the preset data amount;
and/or dividing the file according to a preset number, to obtain the preset number of file fragments.
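The two division strategies of claim 2 can be illustrated with a minimal sketch. It is not taken from the patent: the function names `split_by_size` and `split_by_count` and the representation of a fragment as an `(offset, length)` pair in bytes are assumptions made for illustration only.

```python
from typing import List, Tuple

Fragment = Tuple[int, int]  # hypothetical representation: (start offset, length) in bytes


def split_by_size(file_size: int, fragment_size: int) -> List[Fragment]:
    """Divide by a preset data amount: each fragment holds at most fragment_size bytes."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    return [
        (start, min(fragment_size, file_size - start))
        for start in range(0, file_size, fragment_size)
    ]


def split_by_count(file_size: int, fragment_count: int) -> List[Fragment]:
    """Divide by a preset number: exactly fragment_count fragments of nearly equal size."""
    if fragment_count <= 0:
        raise ValueError("fragment_count must be positive")
    base, extra = divmod(file_size, fragment_count)
    fragments: List[Fragment] = []
    start = 0
    for index in range(fragment_count):
        # The first `extra` fragments absorb one leftover byte each.
        length = base + (1 if index < extra else 0)
        fragments.append((start, length))
        start += length
    return fragments
```

Under these assumptions, a 10-byte file divided with a preset data amount of 4 bytes yields three fragments, the last one shorter; dividing the same file into a preset number of 3 fragments yields lengths 4, 3, and 3.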
3. The file parsing method according to claim 1, characterized in that dividing the file into at least two file fragments comprises:
determining a minimum division unit of the file according to the type of the file; and
dividing the file using the minimum division unit.
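One possible reading of claim 3 is that the cut points must fall on boundaries of the minimum division unit, so that no unit is split across two fragments. The sketch below assumes, purely for illustration, that the unit of a text-like file is one line terminated by a newline; the `UNIT_DELIMITERS` table, the function name `split_on_units`, and the `target_size` parameter are hypothetical and do not appear in the patent.

```python
from typing import List, Tuple

# Hypothetical mapping from file type to the byte sequence that ends one minimum unit.
UNIT_DELIMITERS = {"csv": b"\n", "txt": b"\n"}


def split_on_units(data: bytes, file_type: str, target_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) fragments of roughly target_size bytes.

    Every fragment ends right after a unit delimiter (or at end of data),
    so no minimum unit is cut in half.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    delimiter = UNIT_DELIMITERS[file_type]
    fragments: List[Tuple[int, int]] = []
    start = 0
    while start < len(data):
        end = min(start + target_size, len(data))
        if end < len(data):
            # Move the cut forward to the end of the unit containing the tentative cut.
            found = data.find(delimiter, max(start, end - len(delimiter)))
            end = len(data) if found == -1 else found + len(delimiter)
        fragments.append((start, end - start))
        start = end
    return fragments
```

Fragments produced this way may be somewhat larger than `target_size`, because each cut is pushed forward to the next unit boundary rather than backward.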
4. The file parsing method according to any one of claims 1-3, characterized in that parsing the at least two file fragments in parallel comprises:
generating a parsing task for each file fragment according to the position of that file fragment in the file;
scheduling each parsing task to a corresponding process among at least two processes; and
executing the scheduled parsing tasks in parallel by the at least two processes.
5. The file parsing method according to claim 4, characterized in that executing the scheduled parsing tasks in parallel by the at least two processes comprises:
for each process, reading the corresponding file fragment from the file according to the position recorded in the parsing task scheduled to that process; and
parsing the file fragment that was read.
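The task generation, scheduling, and position-based reading described in claims 4 and 5 might look roughly as follows. This is a sketch under stated assumptions, not the patented implementation: the `ParseTask` record, the helper names, and the comma-separated line format are hypothetical, and Python's standard `concurrent.futures.ProcessPoolExecutor` stands in for whatever scheduler an actual system would use.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Type


@dataclass(frozen=True)
class ParseTask:
    """Hypothetical task record: it carries only the fragment's position, not its data."""
    path: str
    offset: int
    length: int


def make_tasks(path: str, fragments: Sequence[Tuple[int, int]]) -> List[ParseTask]:
    """Generate one parsing task per (offset, length) fragment."""
    return [ParseTask(path, offset, length) for offset, length in fragments]


def run_task(task: ParseTask) -> List[List[str]]:
    """Read the fragment at the recorded position and parse it into fields."""
    with open(task.path, "rb") as handle:
        handle.seek(task.offset)
        chunk = handle.read(task.length)
    # Assumed format: UTF-8 text, one comma-separated record per line.
    return [line.split(",") for line in chunk.decode("utf-8").splitlines()]


def parse_in_parallel(
    tasks: Sequence[ParseTask],
    workers: int = 2,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
) -> List[List[List[str]]]:
    """Schedule the tasks onto a pool of worker processes; results keep task order."""
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(run_task, tasks))
```

Because each task records only a path, an offset, and a length, little data has to be passed between processes; each worker opens the file itself and reads just its own fragment. When calling `parse_in_parallel` with the default process pool from a script, the call should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes.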
6. The file parsing method according to claim 5, characterized in that the parsing task is used to perform a verification operation on the file fragment, and parsing the file fragment that was read comprises:
verifying the file fragment that was read according to a verification rule preconfigured for the process.
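The patent does not specify what a verification rule looks like. As a hedged illustration of claim 6, the sketch below models preconfigured rules as named predicates applied to every record of a fragment; the rule names, the two-field record layout, and the function `validate_fragment` are all assumptions.

```python
from typing import Callable, Dict, List

Rule = Callable[[List[str]], bool]

# Hypothetical preconfigured rules for records of the form "<id>,<amount>".
RULES: Dict[str, Rule] = {
    "has_two_fields": lambda fields: len(fields) == 2,
    "amount_is_numeric": lambda fields: len(fields) == 2 and fields[1].isdigit(),
}


def validate_fragment(text: str, rules: Dict[str, Rule] = RULES) -> List[str]:
    """Apply every rule to every line of the fragment and collect the violations."""
    errors: List[str] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split(",")
        for name, rule in rules.items():
            if not rule(fields):
                errors.append(f"line {line_no}: failed {name}")
    return errors
```

An empty list means the fragment passed verification; a worker process could return this list as the result of its parsing task.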
7. The file parsing method according to any one of claims 1-3, characterized in that, before dividing the file into at least two file fragments, the method further comprises:
performing a scan operation on the file to determine that the file has been properly received.
8. The file parsing method according to any one of claims 1-3, characterized in that, after parsing the at least two file fragments in parallel, the method further comprises:
when parsing of at least one of the at least two file fragments fails, determining that parsing of the file has failed; and
locating the position, in the file, of the file fragment whose parsing failed.
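The aggregation logic of claim 8 can be sketched as follows. The `FragmentResult` record and the `summarize` function are hypothetical names; the only behavior taken from the claim is that one failed fragment makes the whole file fail and that the failed fragment's position in the file is reported.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class FragmentResult:
    """Hypothetical per-fragment outcome reported by a worker process."""
    offset: int
    length: int
    ok: bool
    error: Optional[str] = None


def summarize(results: Sequence[FragmentResult]) -> Tuple[bool, List[Tuple[int, int]]]:
    """Return (file_ok, failed_positions).

    The file counts as parsed only if every fragment succeeded; otherwise the
    (offset, length) of each failed fragment locates the problem in the file.
    """
    failed = [(r.offset, r.length) for r in results if not r.ok]
    return (not failed, failed)
```

Reporting positions rather than fragment indices lets an operator, or a retry mechanism, go straight to the affected byte range of the original file.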
9. A file parsing apparatus, characterized by comprising:
a division module, configured to divide a file into at least two file fragments; and
a parsing module, configured to parse the at least two file fragments in parallel.
10. The file parsing apparatus according to claim 9, characterized in that the division module comprises:
a first division unit, configured to divide the file according to a preset data amount, to obtain file fragments that conform to the preset data amount;
and/or a second division unit, configured to divide the file according to a preset number, to obtain the preset number of file fragments.
11. The file parsing apparatus according to claim 9, characterized in that
the division module is specifically configured to determine a minimum division unit of the file according to the type of the file, and to divide the file using the minimum division unit.
12. The file parsing apparatus according to any one of claims 9-11, characterized in that the parsing module comprises:
a generation unit, configured to generate a parsing task for each file fragment according to the position of that file fragment in the file;
a scheduling unit, configured to schedule each parsing task to a corresponding process among at least two processes; and
an execution unit, configured to execute the scheduled parsing tasks in parallel by the at least two processes.
13. The file parsing apparatus according to claim 12, characterized in that the execution unit comprises:
a reading subunit, configured to, for each process, read the corresponding file fragment from the file according to the position recorded in the parsing task scheduled to that process; and
a parsing subunit, configured to parse the file fragment that was read.
14. The file parsing apparatus according to claim 13, characterized in that the parsing task is used to perform a verification operation on the file fragment;
the parsing subunit is specifically configured to verify the file fragment that was read according to a verification rule preconfigured for the process.
15. The file parsing apparatus according to any one of claims 9-11, characterized in that the apparatus further comprises:
a scan module, configured to perform a scan operation on the file to determine that the file has been properly received.
16. The file parsing apparatus according to any one of claims 9-11, characterized in that the apparatus further comprises:
a result module, configured to determine that parsing of the file has failed when parsing of at least one of the at least two file fragments fails; and
a locating module, configured to locate the position, in the file, of the file fragment whose parsing failed.
CN201610716428.5A 2016-08-24 2016-08-24 Document analysis method and apparatus Pending CN106919553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610716428.5A CN106919553A (en) 2016-08-24 2016-08-24 Document analysis method and apparatus

Publications (1)

Publication Number Publication Date
CN106919553A true CN106919553A (en) 2017-07-04

Family

ID=59454269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610716428.5A Pending CN106919553A (en) 2016-08-24 2016-08-24 Document analysis method and apparatus

Country Status (1)

Country Link
CN (1) CN106919553A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101329665A (en) * 2007-06-18 2008-12-24 国际商业机器公司 Method for analyzing marking language document and analyzer
CN102411602A (en) * 2011-08-15 2012-04-11 浙江大学 Extensible markup language (XML) parallel speculation analysis method realized on basis of field programmable gate array (FPGA)
CN102495722A (en) * 2011-10-18 2012-06-13 成都康赛电子科大信息技术有限责任公司 XML (extensible markup language) parallel parsing method for multi-core fragmentation
CN103020176A (en) * 2012-11-28 2013-04-03 方跃坚 Data block dividing method in XML parsing and XML parsing method
CN104462581A (en) * 2014-12-30 2015-03-25 成都因纳伟盛科技股份有限公司 Micro-channel memory mapping and Smart-Slice based ultrafast file fingerprint extraction system and method
CN105491132A (en) * 2015-12-11 2016-04-13 北京元心科技有限公司 File server, terminal and file subpackage transmission method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
赵刚等: "《轧制过程的计算机控制系统》", 31 January 2002, 冶金工业出版社 *
邢锋: "《电磁场数值计算与仿真分析》", 30 June 2014, 国防工业出版社 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019242359A1 (en) * 2018-06-22 2019-12-26 阿里巴巴集团控股有限公司 File processing method and device
TWI711935B (en) * 2018-06-22 2020-12-01 開曼群島商創新先進技術有限公司 File processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201015

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201015

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20170704
