CN105574164A - Excel document data analysis method and device - Google Patents
- Publication number
- CN105574164A (application number CN201510946709.5A)
- Authority
- CN
- China
- Prior art keywords
- document
- worksheet
- data
- excel
- xml file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/258—Data format conversion from or to a database
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
Abstract
The invention relates to a data analysis method for Excel documents. The method comprises the following steps: step 10, acquiring the document flow of an Excel document to be parsed; step 20, parsing the document flow to obtain information about the workbook and worksheets in it; step 30, reading the XML (Extensible Markup Language) file corresponding to each worksheet using multiple threads; step 40, parsing the shared data XML file in the document flow, finding the storage locations of the shared data corresponding to each worksheet, and reading them using multiple threads; step 50, using multiple threads, parsing the XML file of each worksheet in combination with the shared data corresponding to that worksheet to obtain the data of the Excel document. The invention further relates to a data analysis device for Excel documents. The method and device are generally applicable to Excel documents of various formats, can handle larger Excel documents, and improve data parsing efficiency.
Description
Technical field
The application relates to the technical field of data processing, and in particular to a data analysis method and device for Excel documents.
Background technology
Microsoft Excel is one of the components of Microsoft Office, Microsoft's office software suite. Starting with version 2007, Microsoft Office uses the Office Open XML file format, which differs from earlier versions (which used a binary file format). The container of the new file format is a compressed ZIP file built from simple parts. These parts include XML files that describe application data, metadata, and custom data, parts that describe the relationships between parts, and non-XML files such as binary files, pictures embedded in the document, or OLE objects. At its core, the new Office Open XML format uses a set of XML reference schemas and a ZIP container. For example, when a newly created blank Excel document with the xlsx suffix is decompressed, the top-level directory contains the folders _rels, xl, and docProps and the file [Content_Types].xml, and each folder in turn contains various XML and non-XML files.
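The container layout described above can be inspected with a few lines of Python. This is a minimal sketch, not part of the patent: the helper names `list_package_parts` and `build_sample_package` are hypothetical, and the sample package holds placeholder parts rather than a valid workbook, so only the directory layout is meaningful.

```python
import io
import zipfile


def list_package_parts(xlsx_bytes: bytes) -> list[str]:
    """Return the names of all parts stored in the xlsx ZIP container."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as package:
        return sorted(package.namelist())


def build_sample_package() -> bytes:
    """Build a tiny stand-in package (placeholder contents, not a valid workbook)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name in (
            "[Content_Types].xml",
            "_rels/.rels",
            "docProps/app.xml",
            "docProps/core.xml",
            "xl/workbook.xml",
            "xl/worksheets/sheet1.xml",
        ):
            package.writestr(name, "<placeholder/>")
    return buffer.getvalue()


parts = list_package_parts(build_sample_package())
# Collapse part names to their first path segment to see the top-level layout.
top_level = sorted({name.split("/")[0] for name in parts})
print(top_level)
```

Passing the bytes of a real .xlsx file to `list_package_parts` should show the same top-level entries, plus whatever additional parts that workbook contains.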
Existing methods for parsing and reading the data in an Excel document (workbook) fall roughly into two kinds. The first uses the API (Application Programming Interface) provided by a specific system to parse the Excel document and then imports the data into that system. The second parses the Excel document by calling the API of a relatively mature open-source package; the more popular choice is the API of POI (the Java API for Microsoft Documents) provided by Apache. Both methods have unavoidable drawbacks. The first can only parse Excel documents of a specific format, so it is not generally applicable. The second cannot cope with larger Excel documents, because it loads the entire document into memory during parsing, which may cause a memory overflow and make it impossible to continue parsing the Excel document data. There is therefore an urgent need for a more generally applicable and efficient data analysis method.
Summary of the invention
An object of the present invention is to provide a generally applicable and efficient data analysis method for Excel documents.
Another object of the present invention is to provide a generally applicable and efficient data analysis device for Excel documents.
To achieve the above objects, the invention provides a data analysis method for Excel documents, comprising:
Step 10: acquire the document flow of the Excel document to be parsed;
Step 20: parse the document flow to obtain information about the workbook and worksheets in it;
Step 30: use multithreading to read the xml file corresponding to each worksheet;
Step 40: parse the shared data xml file in the document flow, find the storage location of the shared data corresponding to each worksheet, and read it using multithreading;
Step 50: using multithreading, parse the xml file of each worksheet in combination with the shared data corresponding to that worksheet to obtain the data of the Excel document.
The document flow is in zip format.
The information comprises information about the workbook and the mapping between worksheets and the xml files in the document flow.
The xml file corresponding to each worksheet is obtained from the mapping between worksheets and xml files in the document flow information.
The method further comprises a step of restoring the parsed data of the Excel document.
To achieve the above objects, the invention also provides a data analysis device for Excel documents, comprising:
A document flow module, which acquires the document flow of the Excel document to be parsed;
A document flow information parsing module, which parses the document flow to obtain information about the workbook and worksheets in it;
A worksheet reading module, which uses multithreading to read the xml file corresponding to each worksheet;
A shared data parsing module, which parses the shared data xml file in the document flow, finds the storage location of the shared data corresponding to each worksheet, and reads it using multithreading;
An Excel document data parsing module, which uses multithreading to parse the xml file of each worksheet in combination with the shared data corresponding to that worksheet to obtain the data of the Excel document.
The document flow is in zip format.
The information comprises information about the workbook and the mapping between worksheets and the xml files in the document flow.
The xml file corresponding to each worksheet is obtained from the mapping between worksheets and xml files in the document flow information.
The device further restores the parsed data of the Excel document.
In summary, the data analysis method and device for Excel documents of the present invention are generally applicable to Excel documents of various formats, can handle larger Excel documents, and improve data parsing efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of a preferred embodiment of the data analysis method for Excel documents of the present invention.
Detailed description
The technical solution of the present invention and its beneficial effects will become apparent from the following detailed description of specific embodiments, given with reference to the accompanying drawing.
Fig. 1 is a flowchart of a preferred embodiment of the data analysis method for Excel documents of the present invention. The method mainly comprises:
Step 10: acquire the document flow of the Excel document to be parsed. In this step the document flow is in zip format. As specified by the Office Open XML file format, the zip-format document flow contains at least the xml files that describe application data, metadata, and custom data.
Step 20: parse the document flow to obtain information about the workbook and worksheets in it. This mainly involves parsing the two relevant files in the zip-format file stream and reading from them the information about the workbook and the worksheets (worksheet or sheet). The workbook information gives an overview of the Excel document, and the worksheet information gives the mapping between worksheets and the xml files in the document flow. Specifically, this information is obtained from files in the document flow such as workbook.xml.
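Step 20 can be illustrated with the following sketch. The patent names only workbook.xml; pairing it with the workbook relationships part (xl/_rels/workbook.xml.rels in the Office Open XML packaging) is an assumption made here, as are the function name `map_sheets_to_parts` and the sample sheet names. Relationship targets are assumed to be relative to the xl directory.

```python
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
DOC_REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def map_sheets_to_parts(workbook_xml: str, workbook_rels_xml: str) -> dict[str, str]:
    """Map each worksheet name to the path of its xml part inside the package."""
    # Relationship id -> target path, taken from the relationships part.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(workbook_rels_xml).iter(f"{{{PKG_REL_NS}}}Relationship")
    }
    mapping = {}
    for sheet in ET.fromstring(workbook_xml).iter(f"{{{MAIN_NS}}}sheet"):
        rel_id = sheet.get(f"{{{DOC_REL_NS}}}id")
        # Assumption: targets are relative to xl/; absolute targets are not handled.
        mapping[sheet.get("name")] = "xl/" + targets[rel_id]
    return mapping


# Hypothetical sample content for the two parts.
WORKBOOK_XML = (
    f'<workbook xmlns="{MAIN_NS}" xmlns:r="{DOC_REL_NS}"><sheets>'
    '<sheet name="Orders" sheetId="1" r:id="rId1"/>'
    '<sheet name="Customers" sheetId="2" r:id="rId2"/>'
    "</sheets></workbook>"
)
RELS_XML = (
    f'<Relationships xmlns="{PKG_REL_NS}">'
    f'<Relationship Id="rId1" Type="{DOC_REL_NS}/worksheet" Target="worksheets/sheet1.xml"/>'
    f'<Relationship Id="rId2" Type="{DOC_REL_NS}/worksheet" Target="worksheets/sheet2.xml"/>'
    "</Relationships>"
)

sheet_parts = map_sheets_to_parts(WORKBOOK_XML, RELS_XML)
print(sheet_parts)
```

The resulting dictionary plays the role of the worksheet-to-xml mapping that the later steps rely on.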
Step 30: use multithreading to read the xml file corresponding to each worksheet.
The xml file corresponding to each worksheet to be parsed can be obtained from the mapping between worksheets and xml files in the document flow information. Reading the files separately on multiple threads can improve processing efficiency.
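One way to read the worksheet parts concurrently is a thread pool, sketched below. The patent does not prescribe a threading API; `ThreadPoolExecutor`, the function names, and the sample part contents are assumptions for illustration. Each task opens its own `ZipFile` over the same bytes so that threads do not share file state.

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor


def read_part(xlsx_bytes: bytes, part_name: str) -> bytes:
    """Read one part; each call opens its own ZipFile so threads share no file state."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as package:
        return package.read(part_name)


def read_sheets_concurrently(xlsx_bytes: bytes, sheet_parts: dict[str, str],
                             max_workers: int = 4) -> dict[str, bytes]:
    """Read every worksheet part on a thread pool and return {sheet name: raw xml}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(read_part, xlsx_bytes, part)
            for name, part in sheet_parts.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Stand-in package with two worksheet parts (hypothetical contents).
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w") as _package:
    _package.writestr("xl/worksheets/sheet1.xml", "<worksheet>one</worksheet>")
    _package.writestr("xl/worksheets/sheet2.xml", "<worksheet>two</worksheet>")

raw_sheets = read_sheets_concurrently(
    _buffer.getvalue(),
    {"Orders": "xl/worksheets/sheet1.xml", "Customers": "xl/worksheets/sheet2.xml"},
)
```

How much speedup threads give depends on the runtime and on whether the work is dominated by I/O and decompression or by XML parsing, so the gain should be measured rather than assumed.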
Step 40: parse the shared data xml file in the document flow, find the storage location of the shared data corresponding to each worksheet, and read it using multithreading. During parsing, the data corresponding to each worksheet can be determined from the position of the data stored in the shared data xml file and its relationship to each worksheet.
The shared data xml file stores the data shared by all worksheets. This data is normally kept in one specific shared data xml file, generally /xl/sharedStrings.xml, that is, sharedStrings.xml in the xl directory. Without this step, parsing the data of a given worksheet would require not only parsing its own xml file but also loading and parsing the data for that worksheet in sharedStrings.xml in order to obtain the worksheet's complete data. The data in sharedStrings.xml that belongs to other worksheets would be loaded and parsed at the same time, and sharedStrings.xml would have to be loaded and parsed again for every worksheet, which demands a lot of memory and slows parsing down. When multiple threads parse the xml files of the worksheets, every thread would also need to access sharedStrings.xml at the same time, which complicates the multithreaded processing and may even cancel out the speedup that multithreading brings.
With this step, the relationship between the data stored in the shared data xml file and each worksheet is determined first, and multithreading is then used to parse only the data belonging to the relevant worksheet, so parsing is faster and requires less memory.
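The patent does not spell out how the relationship between shared data and worksheets is established. The sketch below shows one possible reading, assuming the usual SpreadsheetML convention that a cell with t="s" stores an index into sharedStrings.xml: first collect the indices a worksheet refers to, then stream sharedStrings.xml and keep only those entries. The function names and sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def shared_string_indices(sheet_xml: bytes) -> set[int]:
    """Collect the sharedStrings indices one worksheet refers to (cells with t="s")."""
    indices = set()
    for _, cell in ET.iterparse(io.BytesIO(sheet_xml)):
        if cell.tag == f"{{{MAIN_NS}}}c" and cell.get("t") == "s":
            raw = cell.findtext(f"{{{MAIN_NS}}}v")
            if raw is not None:
                indices.add(int(raw))
    return indices


def load_shared_strings(shared_xml: bytes, wanted: set[int]) -> dict[int, str]:
    """Stream sharedStrings.xml and keep only the entries whose index is wanted."""
    found = {}
    index = 0
    for _, item in ET.iterparse(io.BytesIO(shared_xml)):
        if item.tag == f"{{{MAIN_NS}}}si":
            if index in wanted:
                # Join all <t> runs so that multi-run entries are kept whole.
                found[index] = "".join(t.text or "" for t in item.iter(f"{{{MAIN_NS}}}t"))
            index += 1
            item.clear()  # release the entry once it has been inspected
    return found


# Hypothetical sample data: four shared strings, one worksheet using two of them.
SHARED_XML = (
    f'<sst xmlns="{MAIN_NS}" count="4" uniqueCount="4">'
    "<si><t>Name</t></si><si><t>Alice</t></si>"
    "<si><t>City</t></si><si><t>Paris</t></si></sst>"
).encode("utf-8")
SHEET1_XML = (
    f'<worksheet xmlns="{MAIN_NS}"><sheetData><row r="1">'
    '<c r="A1" t="s"><v>0</v></c><c r="B1" t="s"><v>1</v></c><c r="C1"><v>42</v></c>'
    "</row></sheetData></worksheet>"
).encode("utf-8")

wanted = shared_string_indices(SHEET1_XML)
strings_for_sheet1 = load_shared_strings(SHARED_XML, wanted)
```

Because each worksheet ends up with its own small lookup table, the parsing threads no longer need to share the full contents of sharedStrings.xml.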
Step 50: using multithreading, parse the xml file of each worksheet in combination with the shared data corresponding to that worksheet to obtain the data of the Excel document. By obtaining the document flow information of the Excel document, the invention identifies the xml files and the corresponding shared data that need to be parsed, and parses them using multithreading to obtain the correct data from the Excel document.
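A minimal sketch of step 50 follows, assuming that each worksheet's xml and its own shared-string lookup table are already available. The names `resolve_cells` and `parse_workbook` and the sample data are hypothetical, and only plain values and shared-string cells are handled.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def resolve_cells(sheet_xml: str, shared: dict[int, str]) -> dict[str, str]:
    """Return {cell reference: value}, replacing shared-string indices with text."""
    cells = {}
    for cell in ET.fromstring(sheet_xml).iter(f"{{{MAIN_NS}}}c"):
        raw = cell.findtext(f"{{{MAIN_NS}}}v")
        if raw is None:
            continue  # empty cell, nothing to record
        cells[cell.get("r")] = shared[int(raw)] if cell.get("t") == "s" else raw
    return cells


def parse_workbook(sheets: dict[str, str],
                   shared_by_sheet: dict[str, dict[int, str]]) -> dict[str, dict[str, str]]:
    """Parse every worksheet on its own task, each with its own shared-data table."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(resolve_cells, xml, shared_by_sheet[name])
            for name, xml in sheets.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Hypothetical sample data for two worksheets.
def _sheet(cells: str) -> str:
    return f'<worksheet xmlns="{MAIN_NS}"><sheetData><row r="1">{cells}</row></sheetData></worksheet>'


sheets = {
    "Orders": _sheet('<c r="A1" t="s"><v>0</v></c><c r="B1"><v>42</v></c>'),
    "Customers": _sheet('<c r="A1" t="s"><v>3</v></c>'),
}
shared_by_sheet = {"Orders": {0: "Name"}, "Customers": {3: "Paris"}}

workbook_data = parse_workbook(sheets, shared_by_sheet)
```

Keeping the per-worksheet tables separate means no task reads data it does not need, which is the memory benefit the description attributes to step 40.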
The present invention can further include a step of restoring the parsed data of the Excel document. For example, suppose the data parsed from a given xml file is:
<c r="F2" s="3">
<v>41574.833599537036</v>
</c>
In the Excel document, the data above is displayed at the corresponding position as a date: 2013/10/27 20:00:23. Based on the xml file format specification, on other xml attribute values such as s="3", and on the position of those attributes in the xml file, the value of this cell can be determined to be of date type. Converting it yields a date value, which can then be rendered as a time display value in whatever format is required. That is, after parsing and restoration, the data 41574.833599537036 is finally restored to the displayed value 2013/10/27 20:00:23. This realizes the step of restoring certain data after parsing.
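The restoration of this particular value can be sketched as follows. The sketch assumes the workbook uses Excel's default 1900 date system, in which serial numbers above 60 count days from 1899-12-30; it does not look up the style referenced by s="3" and simply treats the value as a date. The function name `serial_to_datetime` is hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: 1900 date system; valid for serial numbers greater than 60.
EXCEL_EPOCH = datetime(1899, 12, 30)
SECONDS_PER_DAY = 86400


def serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel serial date to a datetime, rounded to the nearest second."""
    total_seconds = round(serial * SECONDS_PER_DAY)
    return EXCEL_EPOCH + timedelta(seconds=total_seconds)


restored = serial_to_datetime(41574.833599537036)
display = restored.strftime("%Y/%m/%d %H:%M:%S")
print(display)  # 2013/10/27 20:00:23
```

Rounding to whole seconds avoids the small floating-point error in the fractional day, which would otherwise show up as a time a fraction of a second before 20:00:23.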
By analyzing the structure of the Excel document and parsing only specific xml files, the present invention provides a data parsing approach that is generally applicable to Excel documents of various formats. It reduces the amount of computer system memory needed to parse Excel document data, can handle larger Excel documents, and at the same time improves the efficiency of parsing Excel document data.
Accordingly, the present invention also provides a data analysis device for Excel documents. The device comprises:
A document flow module, which acquires the document flow of the Excel document to be parsed. The document flow is in zip format. As specified by the Office Open XML file format, the zip-format document flow contains at least the xml files that describe application data, metadata, and custom data.
A document flow information parsing module, which parses the document flow to obtain information about the workbook and worksheets in it. The information obtained comprises information about the workbook and the mapping between worksheets and the xml files in the document flow.
A worksheet reading module, which uses multithreading to read the xml file corresponding to each worksheet. The xml file corresponding to each worksheet to be parsed can be obtained from the mapping between worksheets and xml files in the document flow information. Reading the files separately on multiple threads can improve processing efficiency.
A shared data parsing module, which parses the shared data xml file in the document flow, finds the storage location of the shared data corresponding to each worksheet, and reads it using multithreading. During parsing, the data corresponding to each worksheet can be determined from the position of the data stored in the shared data xml file and its relationship to each worksheet.
An Excel document data parsing module, which uses multithreading to parse the xml file of each worksheet in combination with the shared data corresponding to that worksheet to obtain the data of the Excel document. This module can further restore the parsed data of the Excel document.
By obtaining the document flow information of the Excel document, the data analysis device of the present invention identifies the xml files and the corresponding shared data that need to be parsed, and parses them using multithreading to obtain the correct data from the Excel document.
In summary, the data analysis method and device for Excel documents of the present invention are generally applicable to Excel documents of various formats, can handle larger Excel documents, and improve data parsing efficiency.
The above is only a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
1. A data analysis method for Excel documents, characterized in that it comprises:
Step 10: acquiring the document flow of the Excel document to be parsed;
Step 20: parsing the document flow to obtain information about the workbook and worksheets in it;
Step 30: using multithreading to read the xml file corresponding to each worksheet;
Step 40: parsing the shared data xml file in the document flow, finding the storage location of the shared data corresponding to each worksheet, and reading it using multithreading;
Step 50: using multithreading, parsing the xml file of each worksheet in combination with the shared data corresponding to that worksheet to obtain the data of the Excel document.
2. The data analysis method for Excel documents according to claim 1, characterized in that the document flow is in zip format.
3. The data analysis method for Excel documents according to claim 1, characterized in that the information comprises information about the workbook and the mapping between worksheets and the xml files in the document flow.
4. The data analysis method for Excel documents according to claim 1, characterized in that the xml file corresponding to each worksheet is obtained from the mapping between worksheets and xml files in the document flow information.
5. The data analysis method for Excel documents according to claim 1, characterized in that the method further comprises a step of restoring the parsed data of the Excel document.
6. A data analysis device for Excel documents, characterized in that it comprises:
A document flow module, which acquires the document flow of the Excel document to be parsed;
A document flow information parsing module, which parses the document flow to obtain information about the workbook and worksheets in it;
A worksheet reading module, which uses multithreading to read the xml file corresponding to each worksheet;
A shared data parsing module, which parses the shared data xml file in the document flow, finds the storage location of the shared data corresponding to each worksheet, and reads it using multithreading;
An Excel document data parsing module, which uses multithreading to parse the xml file of each worksheet in combination with the shared data corresponding to that worksheet to obtain the data of the Excel document.
7. The data analysis device for Excel documents according to claim 6, characterized in that the document flow is in zip format.
8. The data analysis device for Excel documents according to claim 6, characterized in that the information comprises information about the workbook and the mapping between worksheets and the xml files in the document flow.
9. The data analysis device for Excel documents according to claim 6, characterized in that the xml file corresponding to each worksheet is obtained from the mapping between worksheets and xml files in the document flow information.
10. The data analysis device for Excel documents according to claim 6, characterized in that the device further restores the parsed data of the Excel document.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510946709.5A CN105574164B (en) | 2015-12-16 | 2015-12-16 | The data analysis method and device of Excel document |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105574164A (en) | 2016-05-11 |
CN105574164B (en) | 2019-03-19 |
Family
ID=55884295
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510946709.5A Active CN105574164B (en) | 2015-12-16 | 2015-12-16 | The data analysis method and device of Excel document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105574164B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106021390A (en) * | 2016-05-12 | 2016-10-12 | 福建南威软件有限公司 | File management method and device |
CN107977440A (en) * | 2017-12-07 | 2018-05-01 | 网宿科技股份有限公司 | A kind of methods, devices and systems for parsing data file |
CN109783554A (en) * | 2018-12-13 | 2019-05-21 | 重庆金融资产交易所有限责任公司 | Excel document analytic method, device and computer readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090094606A1 (en) * | 2007-10-04 | 2009-04-09 | National Chung Cheng University | Method for fast XSL transformation on multithreaded environment |
CN102495722A (en) * | 2011-10-18 | 2012-06-13 | 成都康赛电子科大信息技术有限责任公司 | XML (extensible markup language) parallel parsing method for multi-core fragmentation |
CN102760118A (en) * | 2011-04-25 | 2012-10-31 | 中兴通讯股份有限公司 | Method and device for exporting data as Excel file |
CN103020176A (en) * | 2012-11-28 | 2013-04-03 | 方跃坚 | Data block dividing method in XML parsing and XML parsing method |
CN104881275A (en) * | 2015-02-11 | 2015-09-02 | 中国农业银行股份有限公司 | Electronic spreadsheet generating method and device |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106021390A (en) * | 2016-05-12 | 2016-10-12 | 福建南威软件有限公司 | File management method and device |
CN107977440A (en) * | 2017-12-07 | 2018-05-01 | 网宿科技股份有限公司 | A kind of methods, devices and systems for parsing data file |
CN107977440B (en) * | 2017-12-07 | 2020-11-27 | 网宿科技股份有限公司 | Method, device and system for analyzing data file |
CN109783554A (en) * | 2018-12-13 | 2019-05-21 | 重庆金融资产交易所有限责任公司 | Excel document analytic method, device and computer readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN105574164B (en) | 2019-03-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20230505; Address after: 518000 2203/2204, Building 1, Huide Building, Beizhan Community, Minzhi Street, Longhua District, Shenzhen, Guangdong; Patentee after: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.; Address before: 15th Floor, Design Building, No. 8 Huixin East Street, Chaoyang District, Beijing, 100029; Patentee before: BEIJING HUAAODA DATA TECHNOLOGY Co.,Ltd.; Patentee before: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd. |