CN105574164B - The data analysis method and device of Excel document - Google Patents
- Publication number
- CN105574164B
- Authority
- CN
- China
- Prior art keywords
- document
- worksheet
- data
- excel
- stream
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/258—Data format conversion from or to a database
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a data parsing method for Excel documents. The method comprises: step 10, obtaining the file stream of the Excel document to be parsed; step 20, parsing the file stream to obtain the information it contains about the workbook and its worksheets; step 30, reading the XML file corresponding to each worksheet separately, using multiple threads; step 40, parsing the shared-data XML file in the file stream, locating the storage positions of the shared data corresponding to each worksheet, and reading that data separately using multiple threads; step 50, combining the XML file of each worksheet with that worksheet's shared data and parsing them separately using multiple threads to obtain the data of the Excel document. The invention further relates to a data parsing device for Excel documents. The method and device of the invention are generally applicable to Excel documents of various formats, can handle large Excel documents, and improve parsing efficiency.
Description
Technical field
This application relates to the technical field of data processing, and in particular to a data parsing method and device for Excel documents.
Background
Microsoft Excel is a component of Microsoft's office suite, Microsoft Office. Since version 2007, Microsoft Office has used the Office Open XML file format, which differs from the binary file formats of earlier versions. The container of the new format is a compressed ZIP file built from simple parts. These parts include XML files that describe application data, metadata, and custom data, files that describe the relationships between parts, and non-XML files such as binary image files or OLE objects embedded in the document. At its core, the Office Open XML format combines a set of XML reference schemas with a ZIP container. Take a newly created blank Excel document with the .xlsx suffix as an example: after decompression, the top-level directory contains the folders _rels, xl, and docProps and the file [Content_Types].xml, and each folder in turn contains various XML and non-XML files.
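As a rough illustration of the container layout described above, the following Python sketch builds a small stand-in archive in memory and lists its top-level entries. The individual part names and their contents are placeholders chosen for illustration, not a valid workbook; only the top-level names come from the text.

```python
import io
import zipfile

def top_level_entries(xlsx_bytes):
    """Return the sorted top-level names inside an Office Open XML (ZIP) container."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as zf:
        return sorted({name.split("/", 1)[0] for name in zf.namelist()})

# Build a stand-in archive; the part contents are dummies (assumption).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for part in ("[Content_Types].xml", "_rels/.rels",
                 "docProps/app.xml", "xl/workbook.xml"):
        zf.writestr(part, "<placeholder/>")

entries = top_level_entries(buf.getvalue())
print(entries)
```

With a real .xlsx file, the same function could be given the file's bytes instead of the stand-in archive.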
Existing methods for parsing an Excel document (workbook) to read its data fall roughly into two kinds. The first parses the Excel document through an API (Application Programming Interface) provided by a specific system and then imports the data into that system. The second parses the Excel document by calling the API of a relatively mature open-source package; a popular choice is the API of Apache POI (the Java API for Microsoft Documents). Both methods have unavoidable drawbacks. The first can only parse Excel documents of a specific format, so it is not generally applicable. The second cannot handle large Excel documents, because it loads the entire document into memory during parsing, which may exhaust memory and prevent parsing from continuing. A more generally applicable and efficient data parsing method is therefore urgently needed.
Summary of the invention
An object of the present invention is to provide a generally applicable and efficient data parsing method for Excel documents.
Another object of the present invention is to provide a generally applicable and efficient data parsing device for Excel documents.
To achieve the above object, the present invention provides a data parsing method for Excel documents, comprising:
Step 10: obtain the file stream of the Excel document to be parsed;
Step 20: parse the file stream to obtain the information it contains about the workbook and its worksheets;
Step 30: read the XML file corresponding to each worksheet separately, using multiple threads;
Step 40: parse the shared-data XML file in the file stream, locate the storage positions of the shared data corresponding to each worksheet, and read that data separately using multiple threads;
Step 50: combine the XML file of each worksheet with that worksheet's shared data and parse them separately using multiple threads to obtain the data of the Excel document.
The file stream is in ZIP format.
The information includes information about the workbook and the mapping between the worksheets and the XML files in the file stream.
The XML file corresponding to each worksheet is obtained according to the mapping between worksheets and XML files in the file stream information.
The method further includes a step of restoring the data of the Excel document obtained by parsing.
To achieve the above object, the present invention also provides a data parsing device for Excel documents, comprising:
a file stream module, which obtains the file stream of the Excel document to be parsed;
a file stream information parsing module, which parses the file stream to obtain the information it contains about the workbook and its worksheets;
a worksheet reading module, which reads the XML file corresponding to each worksheet separately, using multiple threads;
a shared data parsing module, which parses the shared-data XML file in the file stream, locates the storage positions of the shared data corresponding to each worksheet, and reads that data separately using multiple threads;
an Excel document data parsing module, which combines the XML file of each worksheet with that worksheet's shared data and parses them separately using multiple threads to obtain the data of the Excel document.
The file stream is in ZIP format.
The information includes information about the workbook and the mapping between the worksheets and the XML files in the file stream.
The XML file corresponding to each worksheet is obtained according to the mapping between worksheets and XML files in the file stream information.
The device further restores the data of the Excel document obtained by parsing.
In summary, the data parsing method and device for Excel documents of the present invention are generally applicable to Excel documents of various formats, can handle large Excel documents, and improve parsing efficiency.
Brief description of the drawings
Fig. 1 is a flow chart of a preferred embodiment of the data parsing method for Excel documents of the present invention.
Detailed description of the embodiments
The technical solution of the present invention and its advantages will become apparent from the following detailed description of specific embodiments, taken together with the accompanying drawing.
Fig. 1 is a flow chart of a preferred embodiment of the data parsing method for Excel documents of the present invention. Referring to Fig. 1, the method mainly includes:
Step 10: obtain the file stream of the Excel document to be parsed. In this step the file stream is in ZIP format. As specified by the Office Open XML file format, the ZIP-format file stream contains at least the XML files that describe application data, metadata, custom data, and so on.
Step 20: parse the file stream to obtain the information it contains about the workbook and its worksheets. This mainly involves parsing two relevant file streams within the ZIP-format file stream to read the information about the workbook and the worksheets (worksheet or sheet). The workbook information gives an overview of the Excel document, and the worksheet information gives the mapping between the worksheets and the XML files in the file stream. Specifically, this information is obtained from files in the file stream such as workbook.xml.
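Step 20 can be sketched in Python as follows. The text names only workbook.xml; the use of xl/_rels/workbook.xml.rels as the second file, the sheet names, and the hand-written stand-in parts are assumptions made for illustration, and the sketch handles only relationship targets that are relative to the xl directory.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"

def sheet_map(zf):
    """Map each worksheet name to the path of its XML part inside the archive."""
    rels = ET.fromstring(zf.read("xl/_rels/workbook.xml.rels"))
    targets = {r.get("Id"): r.get("Target")
               for r in rels.iter("{%s}Relationship" % PKG_REL)}
    book = ET.fromstring(zf.read("xl/workbook.xml"))
    # Simplification: targets are assumed to be relative to the xl/ directory.
    return {s.get("name"): "xl/" + targets[s.get("{%s}id" % DOC_REL)]
            for s in book.iter("{%s}sheet" % MAIN)}

# Hand-written stand-in parts (hypothetical sheet names).
WORKBOOK = ('<workbook xmlns="%s" xmlns:r="%s"><sheets>'
            '<sheet name="Orders" sheetId="1" r:id="rId1"/>'
            '<sheet name="Users" sheetId="2" r:id="rId2"/>'
            '</sheets></workbook>' % (MAIN, DOC_REL))
RELS = ('<Relationships xmlns="%s">'
        '<Relationship Id="rId1" Target="worksheets/sheet1.xml"/>'
        '<Relationship Id="rId2" Target="worksheets/sheet2.xml"/>'
        '</Relationships>' % PKG_REL)

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/workbook.xml", WORKBOOK)
    zf.writestr("xl/_rels/workbook.xml.rels", RELS)

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    mapping = sheet_map(zf)
print(mapping)
```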
Step 30: read the XML file corresponding to each worksheet separately, using multiple threads.
The XML file corresponding to each worksheet to be parsed can be obtained from the mapping between worksheets and XML files in the file stream information. Reading the files separately in multiple threads improves processing efficiency.
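A minimal Python sketch of step 30 follows, assuming the archive is available as bytes and that a sheet-name-to-path mapping has already been obtained in step 20. The function names, the worker count, and the stand-in archive are hypothetical; the patent does not prescribe a particular threading API.

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor

def read_part(archive_bytes, path):
    """Read one part. Each call opens its own ZipFile, so threads share no handle."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.read(path)

def read_sheets(archive_bytes, mapping, workers=4):
    """Read every worksheet part in `mapping` (sheet name -> path), one task per sheet."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(read_part, archive_bytes, path)
                   for name, path in mapping.items()}
        return {name: future.result() for name, future in futures.items()}

# Stand-in archive and mapping; real worksheet parts would hold <sheetData> rows.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet>orders</worksheet>")
    zf.writestr("xl/worksheets/sheet2.xml", "<worksheet>users</worksheet>")

mapping = {"Orders": "xl/worksheets/sheet1.xml", "Users": "xl/worksheets/sheet2.xml"}
raw = read_sheets(buf.getvalue(), mapping)
print({name: len(data) for name, data in raw.items()})
```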
Step 40: parse the shared-data XML file in the file stream, locate the storage positions of the shared data corresponding to each worksheet, and read that data separately using multiple threads. During parsing, the data corresponding to each worksheet can be determined from the positions at which data is stored in the shared-data XML file and the relationship of those positions to each worksheet.
The shared-data XML file stores the shared data of all worksheets. The shared data of every worksheet is usually stored in one specific shared-data XML file, typically /xl/sharedStrings.xml, that is, sharedStrings.xml under the xl directory. Without this step, parsing the data of a given worksheet requires not only parsing the data of its XML file but also loading and parsing the data in sharedStrings.xml that corresponds to that worksheet before the worksheet's data is complete. The data in sharedStrings.xml that corresponds to other worksheets is then loaded and parsed as well, and sharedStrings.xml has to be loaded and parsed again for every worksheet, which raises memory usage and slows parsing. In addition, when multiple threads parse the XML files of the worksheets, every thread needs to access sharedStrings.xml, which complicates the multithreaded processing and may even cancel out the speed gained from multithreading.
With this step, the relationship between the data stored in the shared-data XML file and each worksheet is established first, and multiple threads then parse only the data that corresponds to the relevant worksheet, so parsing is faster and needs less memory.
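The text does not spell out how the storage positions are found. One plausible reading, sketched below in Python, is to collect in parallel the shared-string indices that each worksheet references (cells marked t="s") and then stream sharedStrings.xml once, keeping only the referenced entries. The function names and sample data are hypothetical, and rich-text entries are handled only by concatenating their text runs.

```python
import io
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = "{%s}" % MAIN

def shared_indices(sheet_xml):
    """Collect the shared-string indices referenced by one worksheet part."""
    needed = set()
    for _, el in ET.iterparse(io.BytesIO(sheet_xml)):
        if el.tag == NS + "c" and el.get("t") == "s":
            value = el.findtext(NS + "v")
            if value is not None:
                needed.add(int(value))
            el.clear()  # release the cell once it has been inspected
    return needed

def load_shared(shared_xml, wanted):
    """Stream sharedStrings.xml once, keeping only the entries listed in `wanted`."""
    found, index = {}, 0
    for _, el in ET.iterparse(io.BytesIO(shared_xml)):
        if el.tag == NS + "si":
            if index in wanted:
                found[index] = "".join(t.text or "" for t in el.iter(NS + "t"))
            index += 1
            el.clear()
    return found

def sheet(cells):
    return ('<worksheet xmlns="%s"><sheetData><row>%s</row></sheetData></worksheet>'
            % (MAIN, cells)).encode()

SHEETS = {
    "Orders": sheet('<c r="A1" t="s"><v>0</v></c><c r="B1"><v>7</v></c>'),
    "Users": sheet('<c r="A1" t="s"><v>2</v></c>'),
}
SHARED = ('<sst xmlns="%s"><si><t>id</t></si><si><t>unused</t></si>'
          '<si><t>name</t></si></sst>' % MAIN).encode()

with ThreadPoolExecutor() as pool:
    per_sheet = dict(zip(SHEETS, pool.map(shared_indices, SHEETS.values())))
strings = load_shared(SHARED, set().union(*per_sheet.values()))
shared_by_sheet = {name: {i: strings[i] for i in idx}
                   for name, idx in per_sheet.items()}
print(shared_by_sheet)
```

In this sketch the entry at index 1 is never materialized because no worksheet refers to it, which is the memory saving the paragraph above describes.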
Step 50: combine the XML file of each worksheet with that worksheet's shared data and parse them separately using multiple threads to obtain the data of the Excel document. By obtaining the file stream information of the Excel document, the invention identifies the XML files that need to be parsed and the corresponding shared data, and parses them with multiple threads to obtain the correct data to be parsed from the Excel document.
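Step 50 can be sketched in Python as follows, assuming each worksheet's XML and its own subset of shared strings are already in hand from the earlier steps. The names and sample data are hypothetical, cells other than shared-string cells are treated as plain numbers for simplicity, and each worksheet is parsed in full with ET.fromstring for brevity where a streaming parser would suit large files better.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = "{%s}" % MAIN

def parse_sheet(sheet_xml, shared):
    """Return {cell reference: value}, resolving shared-string cells via `shared`."""
    cells = {}
    for c in ET.fromstring(sheet_xml).iter(NS + "c"):
        raw = c.findtext(NS + "v")
        if raw is None:
            continue  # cell without a stored value
        # Simplification: t="s" is a shared-string index, anything else is numeric.
        cells[c.get("r")] = shared[int(raw)] if c.get("t") == "s" else float(raw)
    return cells

def sheet(cells):
    return ('<worksheet xmlns="%s"><sheetData><row>%s</row></sheetData></worksheet>'
            % (MAIN, cells)).encode()

# Hypothetical inputs standing in for the results of steps 30 and 40.
SHEETS = {
    "Orders": sheet('<c r="A1" t="s"><v>0</v></c><c r="B1"><v>7</v></c>'),
    "Users": sheet('<c r="A1" t="s"><v>2</v></c>'),
}
SHARED_BY_SHEET = {"Orders": {0: "id"}, "Users": {2: "name"}}

with ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(parse_sheet, part, SHARED_BY_SHEET[name])
               for name, part in SHEETS.items()}
    data = {name: future.result() for name, future in futures.items()}
print(data)
```

Because each task receives only its own worksheet's shared strings, the threads do not contend for a common sharedStrings.xml structure.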
The present invention may further include a step of restoring the data of the Excel document obtained by parsing. For example, suppose the data parsed from a given XML file is:
<c r="F2" s="3">
<v>41574.833599537036</v>
</c>
The value actually displayed at the corresponding position in the Excel document is a date: 2013/10/27 20:00:23. From the XML file format specification, from other XML element attribute values such as s="3", and from the positions of these attributes in the XML file, the value of this cell can be determined to be of date type. Converting it yields a date-type offset, and this date-type value is then converted as needed to a display value in the desired time format. That is, through parsing and restoring, the value 41574.833599537036 is finally restored to the display value 2013/10/27 20:00:23. This realizes the step of restoring certain data after parsing.
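The conversion in this example can be sketched in Python as follows, assuming Excel's default 1900 date system. The function name is hypothetical, and the sketch covers only the numeric conversion; deciding that a cell is a date (for example from the style referenced by s="3") is not shown.

```python
from datetime import datetime, timedelta

# In Excel's default 1900 date system, serial 1 is 1900-01-01 and the
# nonexistent 1900-02-29 is counted, so 1899-12-30 serves as the epoch
# for serials from 61 (1900-03-01) onward.
EXCEL_EPOCH = datetime(1899, 12, 30)

def serial_to_datetime(serial):
    """Convert an Excel serial number to a datetime, rounded to the nearest second."""
    seconds = round(serial * 86400)
    return EXCEL_EPOCH + timedelta(seconds=seconds)

restored = serial_to_datetime(41574.833599537036)
print(restored.strftime("%Y/%m/%d %H:%M:%S"))  # 2013/10/27 20:00:23
```

The integer part of the serial counts days and the fractional part is the fraction of a day, which is why the value is scaled by 86400 seconds before rounding.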
By analyzing the structure of Excel documents and parsing only specific XML files, the present invention realizes a data parsing approach that is generally applicable to Excel documents of various formats, reduces the memory that parsing an Excel document demands of the computer system, can handle large Excel documents, and improves the efficiency of parsing Excel document data.
Correspondingly, the present invention also provides a data parsing device for Excel documents. The device includes:
A file stream module, which obtains the file stream of the Excel document to be parsed. The file stream is in ZIP format. As specified by the Office Open XML file format, the ZIP-format file stream contains at least the XML files that describe application data, metadata, custom data, and so on.
A file stream information parsing module, which parses the file stream to obtain the information it contains about the workbook and its worksheets. The information obtained includes information about the workbook and the mapping between the worksheets and the XML files in the file stream.
A worksheet reading module, which reads the XML file corresponding to each worksheet separately, using multiple threads. The XML file corresponding to each worksheet to be parsed can be obtained from the mapping between worksheets and XML files in the file stream information. Reading the files separately in multiple threads improves processing efficiency.
A shared data parsing module, which parses the shared-data XML file in the file stream, locates the storage positions of the shared data corresponding to each worksheet, and reads that data separately using multiple threads. During parsing, the data corresponding to each worksheet can be determined from the positions at which data is stored in the shared-data XML file and the relationship of those positions to each worksheet.
An Excel document data parsing module, which combines the XML file of each worksheet with that worksheet's shared data and parses them separately using multiple threads to obtain the data of the Excel document. This module can further restore the data of the Excel document obtained by parsing.
By obtaining the file stream information of the Excel document, the data parsing device of the present invention identifies the XML files that need to be parsed and the corresponding shared data, and parses them with multiple threads to obtain the correct data to be parsed from the Excel document.
In summary, the data parsing method and device for Excel documents of the present invention are generally applicable to Excel documents of various formats, can handle large Excel documents, and improve parsing efficiency.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (4)
1. A data parsing method for Excel documents, characterized by comprising:
Step 10, obtaining the file stream of the Excel document to be parsed;
Step 20, parsing the file stream to obtain the information it contains about the workbook and its worksheets;
Step 30, reading the XML file corresponding to each worksheet separately, using multiple threads;
Step 40, parsing the shared-data XML file in the file stream, locating the storage positions of the shared data corresponding to each worksheet, and reading that data separately using multiple threads;
Step 50, combining the XML file of each worksheet with that worksheet's shared data and parsing them separately using multiple threads to obtain the data of the Excel document;
wherein the file stream is in ZIP format;
the information includes information about the workbook and the mapping between the worksheets and the XML files in the file stream;
the XML file corresponding to each worksheet is obtained according to the mapping between worksheets and XML files in the file stream information.
2. The data parsing method for Excel documents according to claim 1, characterized in that the method further comprises a step of restoring the data of the Excel document obtained by parsing.
3. A data parsing device for Excel documents, characterized by comprising:
a file stream module, which obtains the file stream of the Excel document to be parsed;
a file stream information parsing module, which parses the file stream to obtain the information it contains about the workbook and its worksheets;
a worksheet reading module, which reads the XML file corresponding to each worksheet separately, using multiple threads;
a shared data parsing module, which parses the shared-data XML file in the file stream, locates the storage positions of the shared data corresponding to each worksheet, and reads that data separately using multiple threads;
an Excel document data parsing module, which combines the XML file of each worksheet with that worksheet's shared data and parses them separately using multiple threads to obtain the data of the Excel document;
wherein the file stream is in ZIP format;
the information includes information about the workbook and the mapping between the worksheets and the XML files in the file stream;
the XML file corresponding to each worksheet is obtained according to the mapping between worksheets and XML files in the file stream information.
4. The data parsing device for Excel documents according to claim 3, characterized in that the device further restores the data of the Excel document obtained by parsing.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510946709.5A CN105574164B (en) | 2015-12-16 | 2015-12-16 | The data analysis method and device of Excel document |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510946709.5A CN105574164B (en) | 2015-12-16 | 2015-12-16 | The data analysis method and device of Excel document |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105574164A CN105574164A (en) | 2016-05-11 |
CN105574164B true CN105574164B (en) | 2019-03-19 |
Family
ID=55884295
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510946709.5A Active CN105574164B (en) | 2015-12-16 | 2015-12-16 | The data analysis method and device of Excel document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105574164B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106021390A (en) * | 2016-05-12 | 2016-10-12 | 福建南威软件有限公司 | File management method and device |
CN107977440B (en) * | 2017-12-07 | 2020-11-27 | 网宿科技股份有限公司 | Method, device and system for analyzing data file |
CN109783554A (en) * | 2018-12-13 | 2019-05-21 | 重庆金融资产交易所有限责任公司 | Excel document analytic method, device and computer readable storage medium |
CN113900656A (en) * | 2021-09-24 | 2022-01-07 | 紫金诚征信有限公司 | Java-based multi-file data report concurrent analysis method and device and computer medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102495722A (en) * | 2011-10-18 | 2012-06-13 | 成都康赛电子科大信息技术有限责任公司 | XML (extensible markup language) parallel parsing method for multi-core fragmentation |
CN102760118A (en) * | 2011-04-25 | 2012-10-31 | 中兴通讯股份有限公司 | Method and device for exporting data as Excel file |
CN103020176A (en) * | 2012-11-28 | 2013-04-03 | 方跃坚 | Data block dividing method in XML parsing and XML parsing method |
CN104881275A (en) * | 2015-02-11 | 2015-09-02 | 中国农业银行股份有限公司 | Electronic spreadsheet generating method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090094606A1 (en) * | 2007-10-04 | 2009-04-09 | National Chung Cheng University | Method for fast XSL transformation on multithreaded environment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105574164B (en) | The data analysis method and device of Excel document | |
CN105447099B (en) | Log-structuredization information extracting method and device | |
US9256582B2 (en) | Conversion of a presentation to Darwin Information Typing Architecture (DITA) | |
CN104484216A (en) | Method and device for generating service interface document and on-line test tool | |
US9235559B2 (en) | Progressive page loading | |
US9128912B2 (en) | Efficient XML interchange schema document encoding | |
CN103970736A (en) | Method for converting Excel sheet to database table | |
CN103885925A (en) | Method for encapsulating XBRL (extensible business reporting language) instance documents | |
CN106227575B (en) | Method for generating and analyzing text file | |
CN105573967A (en) | Multi-format file online browsing method and system | |
CN102387120B (en) | File transmission method and network transmission system | |
CN104572744B (en) | structured document generation method and device | |
CN103345501A (en) | Method and device for updating database | |
CN107566090B (en) | Fixed-length/variable-length text message processing method and device | |
US10282400B2 (en) | Grammar generation for simple datatypes | |
CN102479216A (en) | Method for realizing multimedia annotation of electronic book | |
CN109165198A (en) | A kind of increment amending method based on OFD document | |
CN103646015B (en) | Transmission, the method and system for receiving and transmitting XML message | |
IL192265A (en) | Automatic package conformance validation | |
CN113010473A (en) | Method and equipment for editing YAML file | |
CN104317935B (en) | A kind of method and system of XML billing files Mass production html page | |
US9519627B2 (en) | Grammar generation for XML schema definitions | |
CN104503753A (en) | Software management method and system based on modular design | |
CN109408577A (en) | ORACLE database JSON analytic method, system, device and can storage medium | |
CN115841095A (en) | Document establishing method, system, medium and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20230505
Address after: 518000 2203/2204, Building 1, Huide Building, Beizhan Community, Minzhi Street, Longhua District, Shenzhen, Guangdong
Patentee after: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.
Address before: 15th Floor, Design Building, No. 8 Huixin East Street, Chaoyang District, Beijing, 100029
Patentee before: BEIJING HUAAODA DATA TECHNOLOGY Co.,Ltd.
Patentee before: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.