CN105574164B - Data parsing method and device for Excel documents - Google Patents

Data parsing method and device for Excel documents

Info

Publication number
CN105574164B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510946709.5A
Other languages
Chinese (zh)
Other versions
CN105574164A (en
Inventor
刘倍材
樊文飞
贾西贝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huaao Data Technology Co Ltd
Original Assignee
BEIJING HUAAODA DATA TECHNOLOGY Co Ltd
Shenzhen Huaao Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING HUAAODA DATA TECHNOLOGY Co Ltd, Shenzhen Huaao Data Technology Co Ltd
Priority to CN201510946709.5A
Publication of CN105574164A
Application granted
Publication of CN105574164B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a data parsing method for Excel documents. The method comprises: step 10, obtaining the file stream of the Excel document to be parsed; step 20, parsing the file stream to obtain the information it contains about the workbook and its worksheets; step 30, reading the XML file corresponding to each worksheet using multiple threads; step 40, parsing the shared data XML file in the file stream, locating the storage position of the shared data corresponding to each worksheet, and reading that data using multiple threads; step 50, combining the XML file of each worksheet with the shared data of that worksheet and parsing them using multiple threads to obtain the data of the Excel document. The invention further relates to a data parsing device for Excel documents. The method and device are generally applicable to Excel documents of various formats, can handle large Excel documents, and improve parsing efficiency.

Description

Data parsing method and device for Excel documents
Technical field
This application relates to the technical field of data processing, and in particular to a data parsing method and device for Excel documents.
Background
Microsoft Excel is a component of Microsoft Office, Microsoft's office suite. Since version 2007, Microsoft Office has used the Office Open XML file format, which differs from the binary file format of earlier versions. The container of the new file format is a compressed ZIP file made up of simple parts. These parts include XML files that describe application data, metadata, and custom data, as well as files that describe the relationships between parts and non-XML files such as the binary files of pictures or OLE objects embedded in the document. At its core, the Office Open XML format uses an XML reference schema and a ZIP container. Taking a newly created blank Excel document with the .xlsx suffix as an example, decompressing it yields the folders _rels, xl, and docProps in the top-level directory, together with the file [Content_Types].xml; each folder in turn contains various XML and non-XML files.
Existing methods for parsing an Excel document (workbook) to read its data fall roughly into two kinds. The first uses an API (Application Programming Interface) provided by a specific system to parse the Excel document and then imports the data into that system. The second calls the API of a mature open-source package; the most popular choice is the POI API (the Java API for Microsoft Documents) provided by Apache. Both methods have unavoidable drawbacks. The first can only parse Excel documents of a specific format, so it is not generally applicable. The second cannot cope with large Excel documents, because it loads the entire document into memory during parsing, which may cause a memory overflow and prevent parsing from continuing. A more generally applicable and more efficient data parsing method is therefore urgently needed.
Summary of the invention
An object of the present invention is to provide a generally applicable and efficient data parsing method for Excel documents.
Another object of the present invention is to provide a generally applicable and efficient data parsing device for Excel documents.
To achieve the above object, the present invention provides a data parsing method for Excel documents, comprising:
Step 10: obtain the file stream of the Excel document to be parsed;
Step 20: parse the file stream to obtain the information it contains about the workbook and its worksheets;
Step 30: read the XML file corresponding to each worksheet using multiple threads;
Step 40: parse the shared data XML file in the file stream, locate the storage position of the shared data corresponding to each worksheet, and read that data using multiple threads;
Step 50: combine the XML file of each worksheet with the shared data of that worksheet and parse them using multiple threads to obtain the data of the Excel document.
The file stream is in zip format.
The information includes the workbook information and the mapping between the worksheets and the XML files in the file stream.
The XML file corresponding to each worksheet is obtained from the worksheet-to-XML-file mapping in the file stream information.
The method further comprises a step of restoring the data obtained by parsing the Excel document.
To achieve the above object, the present invention also provides a data parsing device for Excel documents, comprising:
A file stream module, which obtains the file stream of the Excel document to be parsed;
A file stream information parsing module, which parses the file stream to obtain the information it contains about the workbook and its worksheets;
A worksheet reading module, which reads the XML file corresponding to each worksheet using multiple threads;
A shared data parsing module, which parses the shared data XML file in the file stream, locates the storage position of the shared data corresponding to each worksheet, and reads that data using multiple threads;
An Excel document data parsing module, which combines the XML file of each worksheet with the shared data of that worksheet and parses them using multiple threads to obtain the data of the Excel document.
The file stream is in zip format.
The information includes the workbook information and the mapping between the worksheets and the XML files in the file stream.
The XML file corresponding to each worksheet is obtained from the worksheet-to-XML-file mapping in the file stream information.
The device further restores the data obtained by parsing the Excel document.
In summary, the data parsing method and device of the present invention are generally applicable to Excel documents of various formats, can handle large Excel documents, and improve parsing efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of a preferred embodiment of the data parsing method for Excel documents according to the present invention.
Detailed description
The technical solution of the present invention and its advantages will become apparent from the following detailed description of specific embodiments, taken with reference to the accompanying drawing.
Referring to Fig. 1, which is a flowchart of a preferred embodiment of the data parsing method for Excel documents according to the present invention, the method mainly comprises:
Step 10: obtain the file stream of the Excel document to be parsed. In this step the file stream is in zip format. According to the Office Open XML file format specification, a zip-format file stream contains at least the XML files that describe application data, metadata, custom data, and so on.
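Step 10 can be illustrated with a minimal Python sketch. It is not the patented implementation: the helper names `build_sample_package` and `list_xml_parts` are hypothetical, and the sample archive only imitates the part names of a real .xlsx package with placeholder content.

```python
import io
import zipfile
from typing import BinaryIO


def build_sample_package() -> bytes:
    """Build a tiny zip whose entry names imitate an .xlsx package (placeholder content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("xl/workbook.xml", "<workbook/>")
        zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
        zf.writestr("xl/sharedStrings.xml", "<sst/>")
    return buf.getvalue()


def list_xml_parts(stream: BinaryIO) -> list[str]:
    """Treat the file stream as a zip archive and return the names of its XML parts."""
    with zipfile.ZipFile(stream) as zf:
        return sorted(name for name in zf.namelist() if name.endswith(".xml"))


# The stream could equally come from open(path, "rb") or an upload; BytesIO keeps the sketch self-contained.
stream = io.BytesIO(build_sample_package())
parts = list_xml_parts(stream)
print(parts)
```

Only the archive directory is read here; no part is decompressed until a later step asks for it.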
Step 20: parse the file stream to obtain the information it contains about the workbook and its worksheets. This mainly involves parsing two relevant files within the zip-format file stream and reading the workbook and worksheet (sheet) information from them. The workbook information gives an overview of the Excel document, and the worksheet information gives the mapping between each worksheet and its XML file in the file stream. This information is obtained from files in the file stream such as workbook.xml.
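The worksheet-to-XML-file mapping of step 20 might be built as in the following sketch. The text names only workbook.xml; taking the second of the "two relevant files" to be xl/_rels/workbook.xml.rels is an assumption based on how Office Open XML packages resolve relationship ids. The function name `sheet_to_part` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG = "http://schemas.openxmlformats.org/package/2006/relationships"

# Sample content standing in for xl/workbook.xml (illustrative only).
WORKBOOK_XML = (
    f'<workbook xmlns="{MAIN}" xmlns:r="{REL}"><sheets>'
    '<sheet name="Orders" sheetId="1" r:id="rId1"/>'
    '<sheet name="Customers" sheetId="2" r:id="rId2"/>'
    "</sheets></workbook>"
)
# Sample content standing in for xl/_rels/workbook.xml.rels (illustrative only).
RELS_XML = (
    f'<Relationships xmlns="{PKG}">'
    '<Relationship Id="rId1" Type="worksheet" Target="worksheets/sheet1.xml"/>'
    '<Relationship Id="rId2" Type="worksheet" Target="worksheets/sheet2.xml"/>'
    "</Relationships>"
)


def sheet_to_part(workbook_xml: str, rels_xml: str) -> dict[str, str]:
    """Map each worksheet name to the path of its XML part inside the zip."""
    # Assumption: targets are relative to the xl/ folder, as in the sample above.
    targets = {
        rel.get("Id"): "xl/" + rel.get("Target")
        for rel in ET.fromstring(rels_xml).iter(f"{{{PKG}}}Relationship")
    }
    mapping = {}
    for sheet in ET.fromstring(workbook_xml).iter(f"{{{MAIN}}}sheet"):
        mapping[sheet.get("name")] = targets[sheet.get(f"{{{REL}}}id")]
    return mapping


mapping = sheet_to_part(WORKBOOK_XML, RELS_XML)
print(mapping)
```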
Step 30: read the XML file corresponding to each worksheet using multiple threads.
The XML file corresponding to each worksheet to be parsed can be obtained from the worksheet-to-XML-file mapping in the file stream information. Reading the files in separate threads can improve processing efficiency.
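One way to read the worksheet parts concurrently is sketched below with a thread pool. The names `read_part` and `read_sheets` are hypothetical, the sample package is a placeholder, and giving each thread its own `ZipFile` handle is a design choice of this sketch rather than something the text prescribes. No speedup is measured here.

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor


def build_sample_package() -> bytes:
    """Build a small zip with two placeholder worksheet parts."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("xl/worksheets/sheet1.xml", "<worksheet>one</worksheet>")
        zf.writestr("xl/worksheets/sheet2.xml", "<worksheet>two</worksheet>")
    return buf.getvalue()


def read_part(package: bytes, part: str) -> bytes:
    """Read one part; each call opens its own ZipFile so threads never share a file position."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.read(part)


def read_sheets(package: bytes, parts: list[str]) -> dict[str, bytes]:
    """Read every listed worksheet part in its own worker thread."""
    with ThreadPoolExecutor(max_workers=max(1, len(parts))) as pool:
        payloads = pool.map(lambda part: read_part(package, part), parts)
        return dict(zip(parts, payloads))


package = build_sample_package()
sheets = read_sheets(package, ["xl/worksheets/sheet1.xml", "xl/worksheets/sheet2.xml"])
print(sheets)
```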
Step 40: parse the shared data XML file in the file stream, locate the storage position of the shared data corresponding to each worksheet, and read that data using multiple threads. During parsing, the data belonging to each worksheet can be determined from the position of the data stored in the shared data XML file and its relationship to each worksheet.
The shared data XML file stores the shared data of all worksheets. This data is usually kept in one specific file, typically /xl/sharedStrings.xml, that is, sharedStrings.xml under the xl directory. Without this step, parsing a worksheet would require not only parsing its own XML file but also loading and parsing sharedStrings.xml to obtain the worksheet's complete data. The data belonging to other worksheets in sharedStrings.xml would be loaded and parsed at the same time, and sharedStrings.xml would have to be loaded and parsed again for every worksheet, which raises memory requirements and slows parsing. In addition, when multiple threads parse the worksheet XML files, every thread needs to access sharedStrings.xml, which complicates the multithreaded processing and may even cancel out the speed gained from multithreading.
With this step, the relationship between the data stored in the shared data XML file and each worksheet is established first, and multiple threads then parse only the data that belongs to the relevant worksheet, so parsing is faster and requires less memory.
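The idea of step 40 can be sketched as follows, under one reading of the text: each worksheet's cells of type `t="s"` hold an index into sharedStrings.xml, so the indices a worksheet needs are collected first, and the shared file is then scanned once, keeping only the needed entries per worksheet. The function names and sample XML are hypothetical. Threads are used here only for collecting indices, whereas the text also reads the shared data with multiple threads.

```python
import io
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
C, ROW, SI, T, V = (f"{{{MAIN}}}{tag}" for tag in ("c", "row", "si", "t", "v"))


def sheet(cells: str) -> bytes:
    """Wrap sample cells in a minimal worksheet document (illustrative only)."""
    return f'<worksheet xmlns="{MAIN}"><sheetData><row r="1">{cells}</row></sheetData></worksheet>'.encode()


SHEETS = {
    "sheet1.xml": sheet('<c r="A1" t="s"><v>0</v></c><c r="B1"><v>7</v></c>'),
    "sheet2.xml": sheet('<c r="A1" t="s"><v>2</v></c>'),
}
SST = f'<sst xmlns="{MAIN}"><si><t>apple</t></si><si><t>unused</t></si><si><t>pear</t></si></sst>'.encode()


def needed_indices(sheet_xml: bytes) -> set[int]:
    """Stream one worksheet and collect the shared-string indices its cells refer to."""
    wanted = set()
    for _, el in ET.iterparse(io.BytesIO(sheet_xml)):
        if el.tag == C and el.get("t") == "s":
            wanted.add(int(el.findtext(V)))
        elif el.tag == ROW:
            el.clear()  # drop finished rows so memory stays flat
    return wanted


def split_shared(sst_xml: bytes, wanted: dict[str, set[int]]) -> dict[str, dict[int, str]]:
    """Scan the shared data file once and keep, per worksheet, only the entries it needs."""
    out = {name: {} for name in wanted}
    index = 0
    for _, el in ET.iterparse(io.BytesIO(sst_xml)):
        if el.tag == SI:
            text = "".join(t.text or "" for t in el.iter(T))  # simplification: joins every <t> descendant
            for name, indices in wanted.items():
                if index in indices:
                    out[name][index] = text
            index += 1
            el.clear()
    return out


with ThreadPoolExecutor(max_workers=2) as pool:
    wanted = dict(zip(SHEETS, pool.map(needed_indices, SHEETS.values())))
shared = split_shared(SST, wanted)
print(shared)
```

The entry at index 1 is never kept because no worksheet refers to it, which is where the memory saving described above would come from.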
Step 50: combine the XML file of each worksheet with the shared data of that worksheet and parse them using multiple threads to obtain the data of the Excel document. From the file stream information of the Excel document, the present invention obtains the XML files that need to be parsed and their corresponding shared data, and parses them with multiple threads to obtain the correct data to be extracted from the Excel document.
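A sketch of step 50 follows: each worker thread parses one worksheet and resolves its shared-string cells against that worksheet's own subset of shared data. The names `parse_sheet`, `SHEETS`, and `SHARED` are hypothetical, and only shared-string and plain numeric cells are handled; other cell types are outside this sketch.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def sheet_xml(cells: str) -> bytes:
    """Wrap sample cells in a minimal worksheet document (illustrative only)."""
    return f'<worksheet xmlns="{MAIN}"><sheetData><row r="1">{cells}</row></sheetData></worksheet>'.encode()


# Worksheet XML as read in step 30, and per-worksheet shared data as selected in step 40.
SHEETS = {
    "Orders": sheet_xml('<c r="A1" t="s"><v>0</v></c><c r="B1"><v>7</v></c>'),
    "Customers": sheet_xml('<c r="A1" t="s"><v>2</v></c>'),
}
SHARED = {"Orders": {0: "apple"}, "Customers": {2: "pear"}}


def parse_sheet(name: str) -> tuple[str, dict[str, object]]:
    """Return (worksheet name, {cell reference: value}) for one worksheet."""
    cells = {}
    for c in ET.fromstring(SHEETS[name]).iter(f"{{{MAIN}}}c"):
        raw = c.findtext(f"{{{MAIN}}}v")
        if raw is None:
            continue  # empty cell
        if c.get("t") == "s":
            cells[c.get("r")] = SHARED[name][int(raw)]  # look up the shared string
        else:
            cells[c.get("r")] = float(raw)  # assumption: anything else is numeric
    return name, cells


with ThreadPoolExecutor(max_workers=2) as pool:
    workbook_data = dict(pool.map(parse_sheet, SHEETS))
print(workbook_data)
```

Because each thread touches only its own worksheet and its own slice of shared data, no locking is needed in this sketch.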
The present invention may further include a step of restoring the data obtained by parsing the Excel document. For example, suppose the data parsed from a given XML file is:
<c r="F2" s="3">
<v>41574.833599537036</v>
</c>
The value actually displayed at the corresponding position in the Excel document is the date 2013/10/27 20:00:23. Based on the XML file format specification, on the values of other XML element attributes such as s="3", and on where these attributes are defined within the XML files, the value of this cell can be determined to be of date type. Converting it yields a date-type offset value, which can then be rendered as a display value in whatever time format is required. In other words, through parsing and restoration the data 41574.833599537036 is finally restored to the display value 2013/10/27 20:00:23, which completes the restoration of data after parsing.
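The conversion in this example can be reproduced with a short sketch. It assumes the 1900 date system, in which serial values count days from 1899-12-30, and it assumes that rounding to whole seconds is acceptable for display. The function name `serial_to_datetime` is hypothetical, and the style lookup that identifies the cell as a date is not shown.

```python
from datetime import datetime, timedelta

# Day 0 of the 1900 date system (assumption: the workbook does not use the 1904 system).
EXCEL_EPOCH = datetime(1899, 12, 30)


def serial_to_datetime(serial: float) -> datetime:
    """Convert a serial date (days since the epoch, fraction = time of day) to a datetime."""
    seconds = round(serial * 86400)  # round away floating-point noise below one second
    return EXCEL_EPOCH + timedelta(seconds=seconds)


restored = serial_to_datetime(41574.833599537036)
display = restored.strftime("%Y/%m/%d %H:%M:%S")
print(display)
```

The integer part, 41574, selects the day, and the fractional part, 0.833599537036, corresponds to 72023 seconds after midnight.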
By analyzing the structure of Excel documents and parsing only specific XML files, the present invention provides a data parsing approach that is generally applicable to Excel documents of various formats. It reduces the memory that parsing demands of the computer system, can handle large Excel documents, and improves the efficiency of parsing Excel document data.
Correspondingly, the present invention also provides a data parsing device for Excel documents. The device comprises:
A file stream module, which obtains the file stream of the Excel document to be parsed. The file stream is in zip format. According to the Office Open XML file format specification, a zip-format file stream contains at least the XML files that describe application data, metadata, custom data, and so on.
A file stream information parsing module, which parses the file stream to obtain the information it contains about the workbook and its worksheets. The information obtained includes the workbook information and the mapping between the worksheets and the XML files in the file stream.
A worksheet reading module, which reads the XML file corresponding to each worksheet using multiple threads. The XML file corresponding to each worksheet to be parsed can be obtained from the worksheet-to-XML-file mapping in the file stream information. Reading the files in separate threads can improve processing efficiency.
A shared data parsing module, which parses the shared data XML file in the file stream, locates the storage position of the shared data corresponding to each worksheet, and reads that data using multiple threads. During parsing, the data belonging to each worksheet can be determined from the position of the data stored in the shared data XML file and its relationship to each worksheet.
An Excel document data parsing module, which combines the XML file of each worksheet with the shared data of that worksheet and parses them using multiple threads to obtain the data of the Excel document. This module can also restore the data obtained by parsing the Excel document.
From the file stream information of the Excel document, the data parsing device of the present invention obtains the XML files that need to be parsed and their corresponding shared data, and parses them with multiple threads to obtain the correct data to be extracted from the Excel document.
In summary, the data parsing method and device of the present invention are generally applicable to Excel documents of various formats, can handle large Excel documents, and improve parsing efficiency.
The above describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (4)

1. A data parsing method for Excel documents, characterized by comprising:
Step 10: obtaining the file stream of the Excel document to be parsed;
Step 20: parsing the file stream to obtain the information it contains about the workbook and its worksheets;
Step 30: reading the XML file corresponding to each worksheet using multiple threads;
Step 40: parsing the shared data XML file in the file stream, locating the storage position of the shared data corresponding to each worksheet, and reading that data using multiple threads;
Step 50: combining the XML file of each worksheet with the shared data of that worksheet and parsing them using multiple threads to obtain the data of the Excel document;
wherein the file stream is in zip format;
the information includes the workbook information and the mapping between the worksheets and the XML files in the file stream;
the XML file corresponding to each worksheet is obtained from the worksheet-to-XML-file mapping in the file stream information.
2. The data parsing method for Excel documents according to claim 1, characterized in that the method further comprises a step of restoring the data obtained by parsing the Excel document.
3. A data parsing device for Excel documents, characterized by comprising:
a file stream module, which obtains the file stream of the Excel document to be parsed;
a file stream information parsing module, which parses the file stream to obtain the information it contains about the workbook and its worksheets;
a worksheet reading module, which reads the XML file corresponding to each worksheet using multiple threads;
a shared data parsing module, which parses the shared data XML file in the file stream, locates the storage position of the shared data corresponding to each worksheet, and reads that data using multiple threads;
an Excel document data parsing module, which combines the XML file of each worksheet with the shared data of that worksheet and parses them using multiple threads to obtain the data of the Excel document;
wherein the file stream is in zip format;
the information includes the workbook information and the mapping between the worksheets and the XML files in the file stream;
the XML file corresponding to each worksheet is obtained from the worksheet-to-XML-file mapping in the file stream information.
4. The data parsing device for Excel documents according to claim 3, characterized in that the device further restores the data obtained by parsing the Excel document.
CN201510946709.5A 2015-12-16 2015-12-16 The data analysis method and device of Excel document Active CN105574164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510946709.5A CN105574164B (en) 2015-12-16 2015-12-16 The data analysis method and device of Excel document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510946709.5A CN105574164B (en) 2015-12-16 2015-12-16 The data analysis method and device of Excel document

Publications (2)

Publication Number Publication Date
CN105574164A CN105574164A (en) 2016-05-11
CN105574164B true CN105574164B (en) 2019-03-19

Family

ID=55884295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510946709.5A Active CN105574164B (en) 2015-12-16 2015-12-16 The data analysis method and device of Excel document

Country Status (1)

Country Link
CN (1) CN105574164B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021390A (en) * 2016-05-12 2016-10-12 福建南威软件有限公司 File management method and device
CN107977440B (en) * 2017-12-07 2020-11-27 网宿科技股份有限公司 Method, device and system for analyzing data file
CN109783554A (en) * 2018-12-13 2019-05-21 重庆金融资产交易所有限责任公司 Excel document analytic method, device and computer readable storage medium
CN113900656A (en) * 2021-09-24 2022-01-07 紫金诚征信有限公司 Java-based multi-file data report concurrent analysis method and device and computer medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102495722A (en) * 2011-10-18 2012-06-13 成都康赛电子科大信息技术有限责任公司 XML (extensible markup language) parallel parsing method for multi-core fragmentation
CN102760118A (en) * 2011-04-25 2012-10-31 中兴通讯股份有限公司 Method and device for exporting data as Excel file
CN103020176A (en) * 2012-11-28 2013-04-03 方跃坚 Data block dividing method in XML parsing and XML parsing method
CN104881275A (en) * 2015-02-11 2015-09-02 中国农业银行股份有限公司 Electronic spreadsheet generating method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090094606A1 (en) * 2007-10-04 2009-04-09 National Chung Cheng University Method for fast XSL transformation on multithreaded environment

Also Published As

Publication number Publication date
CN105574164A (en) 2016-05-11

Similar Documents

Publication Publication Date Title
CN105574164B (en) The data analysis method and device of Excel document
CN105447099B (en) Log-structuredization information extracting method and device
US9256582B2 (en) Conversion of a presentation to Darwin Information Typing Architecture (DITA)
CN104484216A (en) Method and device for generating service interface document and on-line test tool
US9235559B2 (en) Progressive page loading
US9128912B2 (en) Efficient XML interchange schema document encoding
CN103970736A (en) Method for converting Excel sheet to database table
CN103885925A (en) Method for encapsulating XBRL (extensible business reporting language) instance documents
CN106227575B (en) Method for generating and analyzing text file
CN105573967A (en) Multi-format file online browsing method and system
CN102387120B (en) File transmission method and network transmission system
CN104572744B (en) structured document generation method and device
CN103345501A (en) Method and device for updating database
CN107566090B (en) Fixed-length/variable-length text message processing method and device
US10282400B2 (en) Grammar generation for simple datatypes
CN102479216A (en) Method for realizing multimedia annotation of electronic book
CN109165198A (en) A kind of increment amending method based on OFD document
CN103646015B (en) Transmission, the method and system for receiving and transmitting XML message
IL192265A (en) Automatic package conformance validation
CN113010473A (en) Method and equipment for editing YAML file
CN104317935B (en) A kind of method and system of XML billing files Mass production html page
US9519627B2 (en) Grammar generation for XML schema definitions
CN104503753A (en) Software management method and system based on modular design
CN109408577A (en) ORACLE database JSON analytic method, system, device and can storage medium
CN115841095A (en) Document establishing method, system, medium and equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230505

Address after: 518000 2203/2204, Building 1, Huide Building, Beizhan Community, Minzhi Street, Longhua District, Shenzhen, Guangdong

Patentee after: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.

Address before: 15th Floor, Design Building, No. 8 Huixin East Street, Chaoyang District, Beijing, 100029

Patentee before: BEIJING HUAAODA DATA TECHNOLOGY Co.,Ltd.

Patentee before: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.
