CN105574164A - Excel document data analysis method and device - Google Patents

Excel document data analysis method and device

Info

Publication number
CN105574164A
Authority
CN
China
Prior art keywords
document
worksheet
data
excel
xml file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510946709.5A
Other languages
Chinese (zh)
Other versions
CN105574164B (en)
Inventor
刘倍材
樊文飞
贾西贝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huaao Data Technology Co Ltd
Original Assignee
BEIJING HUAAODA DATA TECHNOLOGY Co Ltd
Shenzhen Huaao Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING HUAAODA DATA TECHNOLOGY Co Ltd and Shenzhen Huaao Data Technology Co Ltd
Priority to CN201510946709.5A
Publication of CN105574164A
Application granted
Publication of CN105574164B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities

Abstract

The invention relates to a data parsing method for Excel documents. The method comprises: step 10, obtaining the document stream of an Excel document to be parsed; step 20, parsing the document stream to obtain information about the workbook and the worksheets in it; step 30, reading the XML (Extensible Markup Language) files corresponding to the individual worksheets using multiple threads; step 40, parsing the shared data XML file in the document stream, finding the storage locations of the shared data corresponding to each worksheet, and reading them using multiple threads; step 50, using multiple threads to parse the XML file of each worksheet in combination with the shared data of that worksheet, so as to obtain the data of the Excel document. The invention further relates to a data parsing device for Excel documents. The method and device are generally applicable to Excel documents of various formats, can handle larger Excel documents, and improve data parsing efficiency.

Description

Data parsing method and device for Excel documents
Technical field
The present application relates to the technical field of data processing, and in particular to a data parsing method and device for Excel documents.
Background art
Microsoft Excel is one of the components of Microsoft Office, the office software suite from Microsoft. Starting with the 2007 version, Microsoft Office adopted the Office Open XML file format, which differs from earlier versions (which used a binary file format). The container of the new file format is a compressed ZIP file built from simple parts. These parts include XML files describing application data, metadata, and custom data, as well as non-XML files such as binary files describing the relationships between parts and pictures or OLE objects embedded in the document. At its core, the new Office Open XML format uses a set of XML reference schemas and a ZIP container. When a newly created blank Excel document with the .xlsx suffix is decompressed, the top-level directory contains the folders _rels, xl, and docProps and the file [Content_Types].xml, and each folder in turn contains various XML and non-XML files.
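The package layout described above can be inspected with any ZIP library. The following is a minimal Python sketch, not part of the patent: the function names and the stand-in package built by `build_demo_package` are hypothetical, and the placeholder parts do not form a valid workbook.

```python
import io
import zipfile


def top_level_entries(xlsx_bytes: bytes) -> list[str]:
    """Return the sorted top-level names inside an .xlsx (ZIP) package."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as package:
        return sorted({name.split("/", 1)[0] for name in package.namelist()})


def build_demo_package() -> bytes:
    """Build a tiny stand-in package (hypothetical content, not a valid workbook)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name in (
            "[Content_Types].xml",
            "_rels/.rels",
            "docProps/app.xml",
            "docProps/core.xml",
            "xl/workbook.xml",
            "xl/worksheets/sheet1.xml",
        ):
            package.writestr(name, "<placeholder/>")
    return buffer.getvalue()


if __name__ == "__main__":
    print(top_level_entries(build_demo_package()))
```

For a real file, the bytes would come from reading the .xlsx from disk or from an upload rather than from the demo builder.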
Existing methods for parsing and reading the data in an Excel document (workbook) fall roughly into two kinds. The first uses an API (Application Programming Interface) provided by a specific system to parse the Excel document and then imports the data into that system. The second parses the Excel document by calling the API of a mature open-source library; the most popular choice is the API of POI (the Java API for Microsoft Documents) provided by Apache. Both methods have unavoidable drawbacks. The first can only parse Excel documents of a specific format, so it is not generally applicable. The second cannot cope with larger Excel documents, because it loads the entire document into memory during parsing, which may cause a memory overflow and prevent the parsing of the Excel document data from continuing. There is therefore an urgent need for a more generally applicable and efficient data parsing method.
Summary of the invention
An object of the present invention is to provide a generally applicable and efficient data parsing method for Excel documents.
Another object of the present invention is to provide a generally applicable and efficient data parsing device for Excel documents.
To achieve the above objects, the present invention provides a data parsing method for Excel documents, comprising:
Step 10: obtaining the document stream of the Excel document to be parsed;
Step 20: parsing the document stream to obtain information about the workbook and the worksheets in the document stream;
Step 30: reading the XML file corresponding to each worksheet using multiple threads;
Step 40: parsing the shared data XML file in the document stream, finding the storage locations of the shared data corresponding to each worksheet, and reading them using multiple threads;
Step 50: using multiple threads to parse the XML file corresponding to each worksheet in combination with the shared data corresponding to that worksheet, so as to obtain the data of the Excel document.
The document stream is in zip format.
The information includes information about the workbook and the mapping between the worksheets and the XML files in the document stream.
The XML file corresponding to each worksheet is obtained from the mapping between worksheets and XML files in the document stream information.
The method further comprises a step of restoring the data obtained by parsing the Excel document.
To achieve the above objects, the present invention also provides a data parsing device for Excel documents, comprising:
a document stream module, which obtains the document stream of the Excel document to be parsed;
a document stream information parsing module, which parses the document stream to obtain information about the workbook and the worksheets in the document stream;
a worksheet reading module, which reads the XML file corresponding to each worksheet using multiple threads;
a shared data parsing module, which parses the shared data XML file in the document stream, finds the storage locations of the shared data corresponding to each worksheet, and reads them using multiple threads;
an Excel document data parsing module, which uses multiple threads to parse the XML file corresponding to each worksheet in combination with the shared data corresponding to that worksheet, so as to obtain the data of the Excel document.
The document stream is in zip format.
The information includes information about the workbook and the mapping between the worksheets and the XML files in the document stream.
The XML file corresponding to each worksheet is obtained from the mapping between worksheets and XML files in the document stream information.
The device further restores the data obtained by parsing the Excel document.
In summary, the data parsing method and device for Excel documents of the present invention are generally applicable to Excel documents of various formats, can handle larger Excel documents, and improve data parsing efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of a preferred embodiment of the data parsing method for Excel documents of the present invention.
Detailed description of the embodiments
The technical solution and beneficial effects of the present invention will become apparent from the following detailed description of specific embodiments, taken in conjunction with the accompanying drawing.
Fig. 1 is a flowchart of a preferred embodiment of the data parsing method for Excel documents of the present invention. The method mainly comprises:
Step 10: obtaining the document stream of the Excel document to be parsed. In this step, the document stream is in zip format. According to the Office Open XML file format, a zip-format document stream contains at least the XML files describing application data, metadata, and custom data.
Step 20: parsing the document stream to obtain information about the workbook and the worksheets in the document stream. This mainly involves parsing the two relevant file streams in the zip-format stream and reading the information about the workbook and the worksheets (worksheet or sheet) from them. The workbook information gives an overview of the Excel document, and the worksheet information gives the mapping between the worksheets and the XML files in the document stream. Specifically, this information is obtained from files in the document stream such as workbook.xml.
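The worksheet-to-file mapping of step 20 can be sketched as follows. This is an illustrative Python sketch, not the patented implementation. The source names only workbook.xml; treating xl/_rels/workbook.xml.rels as the second relevant stream is an assumption based on the Office Open XML packaging conventions, and the function name and sample XML are hypothetical. The sketch also assumes that relationship targets are relative to the xl folder, which is common but not guaranteed.

```python
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
DOC_REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def map_sheets_to_parts(workbook_xml: str, workbook_rels_xml: str) -> dict[str, str]:
    """Map each worksheet name to the path of its XML part inside the package."""
    # Relationship id -> target path (assumed relative to the xl/ folder).
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(workbook_rels_xml).iter(f"{{{PKG_REL_NS}}}Relationship")
    }
    mapping = {}
    for sheet in ET.fromstring(workbook_xml).iter(f"{{{MAIN_NS}}}sheet"):
        rel_id = sheet.get(f"{{{DOC_REL_NS}}}id")
        mapping[sheet.get("name")] = "xl/" + targets[rel_id]
    return mapping


# Hypothetical sample content standing in for xl/workbook.xml.
WORKBOOK_XML = f"""<workbook xmlns="{MAIN_NS}" xmlns:r="{DOC_REL_NS}">
  <sheets>
    <sheet name="Orders" sheetId="1" r:id="rId1"/>
    <sheet name="Customers" sheetId="2" r:id="rId2"/>
  </sheets>
</workbook>"""

# Hypothetical sample content standing in for xl/_rels/workbook.xml.rels.
WORKBOOK_RELS_XML = f"""<Relationships xmlns="{PKG_REL_NS}">
  <Relationship Id="rId1" Type="{DOC_REL_NS}/worksheet" Target="worksheets/sheet1.xml"/>
  <Relationship Id="rId2" Type="{DOC_REL_NS}/worksheet" Target="worksheets/sheet2.xml"/>
</Relationships>"""

if __name__ == "__main__":
    print(map_sheets_to_parts(WORKBOOK_XML, WORKBOOK_RELS_XML))
```

The resulting dictionary is the kind of mapping that later steps can use to decide which XML file to hand to which thread.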
Step 30: reading the XML file corresponding to each worksheet using multiple threads.
The XML file corresponding to each worksheet to be parsed can be obtained from the mapping between worksheets and XML files in the document stream information. Reading the files separately with multiple threads can improve processing efficiency.
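One way to read several worksheet files concurrently is a thread pool, sketched below in Python. The patent does not prescribe a threading API; `ThreadPoolExecutor`, the function names, and the demo package are assumptions made for illustration. Each task opens its own `ZipFile` over the same bytes so that the threads do not share a file handle.

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor


def read_part(package_bytes: bytes, part_name: str) -> bytes:
    """Read one part; every call opens its own ZipFile, so no handle is shared."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.read(part_name)


def read_parts_in_parallel(package_bytes: bytes, part_names: list[str],
                           max_workers: int = 4) -> dict[str, bytes]:
    """Read the listed parts on worker threads and return {part name: content}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        contents = pool.map(lambda name: read_part(package_bytes, name), part_names)
        return dict(zip(part_names, contents))


def build_demo_package() -> bytes:
    """Build a stand-in package with two hypothetical worksheet parts."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("xl/worksheets/sheet1.xml", "<worksheet>one</worksheet>")
        package.writestr("xl/worksheets/sheet2.xml", "<worksheet>two</worksheet>")
    return buffer.getvalue()


if __name__ == "__main__":
    names = ["xl/worksheets/sheet1.xml", "xl/worksheets/sheet2.xml"]
    for name, content in read_parts_in_parallel(build_demo_package(), names).items():
        print(name, len(content))
```

How much speedup threads actually give depends on the runtime and on file sizes, so the gain should be measured rather than assumed.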
Step 40: parsing the shared data XML file in the document stream, finding the storage locations of the shared data corresponding to each worksheet, and reading them using multiple threads. During parsing, the data corresponding to each worksheet can be determined from the positions at which data is stored in the shared data XML file and from the relationship between those positions and each worksheet.
The shared data XML file stores the data shared by all worksheets. This shared data is normally stored in a specific shared data XML file, typically /xl/sharedStrings.xml, that is, sharedStrings.xml under the xl directory. Without this step, parsing the data of a given worksheet would require not only parsing the data in its own XML file but also loading and parsing the data in sharedStrings.xml that corresponds to that worksheet, since only then is the complete data of the worksheet obtained. The data in sharedStrings.xml that corresponds to other worksheets would be loaded and parsed at the same time, and sharedStrings.xml would have to be loaded and parsed again for every worksheet, which increases memory usage and slows down parsing. When multiple threads parse the XML files of the individual worksheets, every thread would need to access sharedStrings.xml at the same time, which complicates the multithreaded processing and may even cancel out the speed gain that multithreading provides.
With this step, the relationship between the data stored in the shared data XML file and each worksheet is determined first, and multiple threads then parse only the data corresponding to the relevant worksheet, so that parsing is faster and memory usage is lower.
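The idea of loading only the shared data that a worksheet needs can be sketched as follows. This Python sketch is one possible reading of step 40, not the patented implementation: it assumes that a cell with t="s" stores an index into sharedStrings.xml in its v element, as in SpreadsheetML, and the function names and sample XML are hypothetical. It streams the shared data file once and keeps only the wanted entries.

```python
import io
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def shared_string_indexes(sheet_xml: bytes) -> set[int]:
    """Collect the shared-string indexes that one worksheet actually references."""
    indexes = set()
    for cell in ET.fromstring(sheet_xml).iter(f"{{{MAIN_NS}}}c"):
        if cell.get("t") == "s":  # t="s": the <v> value is a shared-string index
            value = cell.find(f"{{{MAIN_NS}}}v")
            if value is not None and value.text is not None:
                indexes.add(int(value.text))
    return indexes


def load_shared_strings(shared_xml: bytes, wanted: set[int]) -> dict[int, str]:
    """Stream sharedStrings.xml once and keep only the entries listed in wanted."""
    found = {}
    position = -1
    for _, element in ET.iterparse(io.BytesIO(shared_xml), events=("end",)):
        if element.tag == f"{{{MAIN_NS}}}si":
            position += 1
            if position in wanted:
                # Joins all <t> runs of the entry; phonetic runs are not filtered here.
                found[position] = "".join(
                    t.text or "" for t in element.iter(f"{{{MAIN_NS}}}t")
                )
            element.clear()  # drop the entry's content once it has been inspected
    return found


# Hypothetical sample data standing in for one worksheet and sharedStrings.xml.
SHEET_XML = (
    f'<worksheet xmlns="{MAIN_NS}"><sheetData>'
    '<row r="1"><c r="A1" t="s"><v>2</v></c><c r="B1"><v>41574.5</v></c></row>'
    '<row r="2"><c r="A2" t="s"><v>0</v></c></row>'
    '</sheetData></worksheet>'
).encode("utf-8")

SHARED_XML = (
    f'<sst xmlns="{MAIN_NS}" count="3" uniqueCount="3">'
    '<si><t>Name</t></si><si><t>Unused</t></si><si><t>Date</t></si>'
    '</sst>'
).encode("utf-8")

if __name__ == "__main__":
    wanted = shared_string_indexes(SHEET_XML)
    print(load_shared_strings(SHARED_XML, wanted))
```

In a multithreaded setting, each worksheet's index set could be computed on its own thread, and each thread then receives only its own small dictionary instead of the whole shared data file.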
Step 50: using multiple threads to parse the XML file corresponding to each worksheet in combination with the shared data corresponding to that worksheet, so as to obtain the data of the Excel document. By obtaining the document stream information of the Excel document, the present invention obtains the XML files to be parsed and the corresponding shared data, and parses them with multiple threads to obtain the correct data to be parsed from the Excel document.
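The combination described in step 50 can be illustrated with a short Python sketch. It assumes that the shared data for each worksheet has already been collected into a dictionary keyed by index; the function names, the per-sheet dictionaries, and the sample XML are hypothetical, and the thread pool is only one of several ways to run one task per worksheet.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def resolve_sheet(sheet_xml: str, shared: dict[int, str]) -> dict[str, str]:
    """Return {cell reference: value}, replacing shared-string indexes with text."""
    cells = {}
    for cell in ET.fromstring(sheet_xml).iter(f"{{{MAIN_NS}}}c"):
        value = cell.find(f"{{{MAIN_NS}}}v")
        if value is None or value.text is None:
            continue  # skip cells without a stored value
        text = value.text
        if cell.get("t") == "s":
            text = shared[int(text)]
        cells[cell.get("r")] = text
    return cells


def resolve_workbook(sheets: dict[str, str],
                     shared_by_sheet: dict[str, dict[int, str]]) -> dict[str, dict[str, str]]:
    """Resolve every worksheet on its own worker thread."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(resolve_sheet, xml, shared_by_sheet[name])
            for name, xml in sheets.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Hypothetical sample worksheets and their already-extracted shared data.
SHEETS = {
    "Orders": f'<worksheet xmlns="{MAIN_NS}"><sheetData><row r="1">'
              '<c r="A1" t="s"><v>0</v></c><c r="B1"><v>12</v></c>'
              '</row></sheetData></worksheet>',
    "Customers": f'<worksheet xmlns="{MAIN_NS}"><sheetData><row r="1">'
                 '<c r="A1" t="s"><v>5</v></c>'
                 '</row></sheetData></worksheet>',
}
SHARED_BY_SHEET = {"Orders": {0: "Quantity"}, "Customers": {5: "Alice"}}

if __name__ == "__main__":
    print(resolve_workbook(SHEETS, SHARED_BY_SHEET))
```

Because each task receives only its own worksheet and its own shared data, the worker threads do not need to synchronize access to a common structure.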
The present invention may further include a step of restoring the data obtained by parsing the Excel document. For example, suppose the data parsed from a given XML file is:
<c r="F2" s="3">
<v>41574.833599537036</v>
</c>
In the Excel document, the corresponding position actually displays a date value: 2013/10/27 20:00:23. According to the XML file format specification, other XML element attribute values such as s="3", and the positions of these attribute values in the XML files, the value of this cell can be determined to be of the date type. Converting it yields a date-type value, which can then be rendered as a time display value in different formats as required. In other words, after parsing and restoration, the data 41574.833599537036 is finally restored to the displayed value 2013/10/27 20:00:23. This implements the step of restoring certain data after parsing.
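The conversion in this example can be reproduced with a few lines of Python. The sketch assumes the default 1900 date system, in which serial values after February 1900 count days from 1899-12-30; it does not look up the cell style, so deciding that a cell holds a date (for instance from the style referenced by s="3") is left out. The names `EXCEL_EPOCH` and `serial_to_datetime` are hypothetical.

```python
from datetime import datetime, timedelta

# Day 0 of the 1900 date system for serial values after February 1900 (assumption
# stated in the lead-in; workbooks using the 1904 date system need another epoch).
EXCEL_EPOCH = datetime(1899, 12, 30)


def serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel serial date to a datetime, rounded to the nearest second."""
    total_seconds = round(serial * 86400)  # 86400 seconds per day
    return EXCEL_EPOCH + timedelta(seconds=total_seconds)


if __name__ == "__main__":
    restored = serial_to_datetime(41574.833599537036)
    print(restored.strftime("%Y/%m/%d %H:%M:%S"))  # prints 2013/10/27 20:00:23
```

The integer part of the serial value is the day count and the fractional part is the time of day, which is why 0.8335995... maps to 20:00:23.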
By analyzing the structure of the Excel document and parsing only specific XML files, the present invention provides a data parsing approach that is generally applicable to Excel documents of various formats, reduces the computer system memory required to parse Excel document data, can handle larger Excel documents, and at the same time improves the data parsing efficiency for Excel documents.
Correspondingly, the present invention also provides a data parsing device for Excel documents. The device comprises:
A document stream module, which obtains the document stream of the Excel document to be parsed. The document stream is in zip format. According to the Office Open XML file format, a zip-format document stream contains at least the XML files describing application data, metadata, and custom data.
A document stream information parsing module, which parses the document stream to obtain information about the workbook and the worksheets in the document stream. The information obtained includes information about the workbook and the mapping between the worksheets and the XML files in the document stream.
A worksheet reading module, which reads the XML file corresponding to each worksheet using multiple threads. The XML file corresponding to each worksheet to be parsed can be obtained from the mapping between worksheets and XML files in the document stream information. Reading the files separately with multiple threads can improve processing efficiency.
A shared data parsing module, which parses the shared data XML file in the document stream, finds the storage locations of the shared data corresponding to each worksheet, and reads them using multiple threads. During parsing, the data corresponding to each worksheet can be determined from the positions at which data is stored in the shared data XML file and from the relationship between those positions and each worksheet.
An Excel document data parsing module, which uses multiple threads to parse the XML file corresponding to each worksheet in combination with the shared data corresponding to that worksheet, so as to obtain the data of the Excel document. This module can also restore the data obtained by parsing the Excel document.
By obtaining the document stream information of the Excel document, the data parsing device of the present invention obtains the XML files to be parsed and the corresponding shared data, and parses them with multiple threads to obtain the correct data to be parsed from the Excel document.
In summary, the data parsing method and device for Excel documents of the present invention are generally applicable to Excel documents of various formats, can handle larger Excel documents, and improve data parsing efficiency.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A data parsing method for Excel documents, characterized by comprising:
Step 10: obtaining the document stream of the Excel document to be parsed;
Step 20: parsing the document stream to obtain information about the workbook and the worksheets in the document stream;
Step 30: reading the XML file corresponding to each worksheet using multiple threads;
Step 40: parsing the shared data XML file in the document stream, finding the storage locations of the shared data corresponding to each worksheet, and reading them using multiple threads;
Step 50: using multiple threads to parse the XML file corresponding to each worksheet in combination with the shared data corresponding to that worksheet, so as to obtain the data of the Excel document.
2. The data parsing method for Excel documents according to claim 1, characterized in that the document stream is in zip format.
3. The data parsing method for Excel documents according to claim 1, characterized in that the information includes information about the workbook and the mapping between the worksheets and the XML files in the document stream.
4. The data parsing method for Excel documents according to claim 1, characterized in that the XML file corresponding to each worksheet is obtained from the mapping between worksheets and XML files in the document stream information.
5. The data parsing method for Excel documents according to claim 1, characterized in that the method further comprises a step of restoring the data obtained by parsing the Excel document.
6. A data parsing device for Excel documents, characterized by comprising:
a document stream module, which obtains the document stream of the Excel document to be parsed;
a document stream information parsing module, which parses the document stream to obtain information about the workbook and the worksheets in the document stream;
a worksheet reading module, which reads the XML file corresponding to each worksheet using multiple threads;
a shared data parsing module, which parses the shared data XML file in the document stream, finds the storage locations of the shared data corresponding to each worksheet, and reads them using multiple threads;
an Excel document data parsing module, which uses multiple threads to parse the XML file corresponding to each worksheet in combination with the shared data corresponding to that worksheet, so as to obtain the data of the Excel document.
7. The data parsing device for Excel documents according to claim 6, characterized in that the document stream is in zip format.
8. The data parsing device for Excel documents according to claim 6, characterized in that the information includes information about the workbook and the mapping between the worksheets and the XML files in the document stream.
9. The data parsing device for Excel documents according to claim 6, characterized in that the XML file corresponding to each worksheet is obtained from the mapping between worksheets and XML files in the document stream information.
10. The data parsing device for Excel documents according to claim 6, characterized in that the device further restores the data obtained by parsing the Excel document.
CN201510946709.5A 2015-12-16 2015-12-16 The data analysis method and device of Excel document Active CN105574164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510946709.5A CN105574164B (en) 2015-12-16 2015-12-16 The data analysis method and device of Excel document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510946709.5A CN105574164B (en) 2015-12-16 2015-12-16 The data analysis method and device of Excel document

Publications (2)

Publication Number Publication Date
CN105574164A true CN105574164A (en) 2016-05-11
CN105574164B CN105574164B (en) 2019-03-19

Family

ID=55884295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510946709.5A Active CN105574164B (en) 2015-12-16 2015-12-16 The data analysis method and device of Excel document

Country Status (1)

Country Link
CN (1) CN105574164B (en)

Also Published As

Publication number Publication date
CN105574164B (en) 2019-03-19

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230505

Address after: 518000 2203/2204, Building 1, Huide Building, Beizhan Community, Minzhi Street, Longhua District, Shenzhen, Guangdong

Patentee after: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.

Address before: 15th Floor, Design Building, No. 8 Huixin East Street, Chaoyang District, Beijing, 100029

Patentee before: BEIJING HUAAODA DATA TECHNOLOGY Co.,Ltd.

Patentee before: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.

TR01 Transfer of patent right