CN112069130A - Data recombination and restoration method for Microsoft EXCEL file - Google Patents
Data recombination and restoration method for Microsoft EXCEL file Download PDFInfo
- Publication number
- CN112069130A CN112069130A CN202010896794.XA CN202010896794A CN112069130A CN 112069130 A CN112069130 A CN 112069130A CN 202010896794 A CN202010896794 A CN 202010896794A CN 112069130 A CN112069130 A CN 112069130A
- Authority
- CN
- China
- Prior art keywords
- sector
- data
- sid
- stream
- directory
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/0223—User address space allocation, e.g. contiguous or non contiguous base addressing
- G06F12/023—Free address space management
- G06F12/0238—Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
- G06F12/0246—Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory in block erasable memory, e.g. flash memory
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data recombination and restoration method aiming at Microsoft EXCEL files, which is characterized by comprising the following steps: s100: judging whether the byte content conforms to the sector structure of the first sector of the Microsoft EXCEL file, if so, executing the step S300, otherwise, executing the step S200; s200: traversing all sectors to obtain a directory stream sector and a configuration table sector; s300: analyzing the sector of the configuration table and the sector of the directory stream, detecting whether the structure of the sector is complete, if so, executing the step S400, otherwise, executing the step S600; s400: analyzing, checking and constructing a sector structure of a file header of the Microsoft EXCEL file; s500: recombining all sector data of the Microsoft EXCEL file, and executing the step S800; s600: analyzing data stream sector data to obtain sub-stream data of the workbook data; s700: recombining the work table data in the Microsoft EXCEL file and executing the step S900; s800: writing the recombined new EXCEL file sector data into a new EXCEL file, and ending the process; s900: and writing the TXT files with the corresponding number according to the number of the reorganized worksheet data.
Description
Technical Field
The invention belongs to the field of electronic evidence obtaining and data recovery, and relates to a data recombination and restoration method for Microsoft EXCEL files.
Background
Microsoft office from Microsoft corporation is the most common office software, and EXCEL files are the indispensable tools for processing large amounts of tabular data, the contents of which usually include statistical and numerical data marked with text, or graphics or pictures embedded in a spreadsheet with the suffix name of xls. Mathematical equations for calculating numerical data or statistical data, and partial function libraries are also provided as an integration function of Microsoft EXCEL. Formatting properties and other layout elements can be implemented into a new EXCEL spreadsheet using XLS files (commonly referred to as EXCEL templates).
Microsoft Excel files store Microsoft Excel documents using the xls proprietary format. This file format is called binary exchange file format (BIFF). The data in all Office document files is stored in a series of fixed-size streams. The location of each fixed-size stream is described by several sector allocation tables. These streams contain directory information, configuration table information, and all data information about the document.
Microsoft Office is used as the most widely applied Office software in daily work and life, wherein the EXCEL format is used as a table format file, and the application range is very wide. The EXCEL file also has a fragile characteristic as one of electronic data due to its convenience in data processing and extremely high frequency of use. When some emergency conditions affect the normal use of Microsoft EXCEL, the risk of damage or loss of important files is increased, and the bad or lost EXCEL files are extremely difficult to recover, so that the data in the EXCEL files of users are lost, and the working efficiency of the users is reduced and even the loss is caused. In the prior art, no data recombination and repair method for microsoft EXCEL files exists.
Disclosure of Invention
Aiming at the defects of the prior art, the data reorganization and restoration method for the EXCEL file of Microsoft Office is provided, and by judging whether the file format is the Xls format (namely, the EXCEL format), traversing the file data content sector by sector, analyzing the data type of each sector and classifying the sector content, performing word-by-word analysis on each sector data in different modes, extracting the effective data for restoring the EXCEL file, recombining the data by combining the cell data content and the cell style information, thereby realizing the EXCEL file restoration, and finally exporting a new Xls file or exporting the txt file restored for each worksheet.
The invention comprises the following steps:
s100: loading the Microsoft EXCEL file, reading byte content of a first sector, judging whether the byte content conforms to a sector structure of the first sector of the Microsoft EXCEL file, if so, executing a step S300, otherwise, executing a step S200, wherein the Microsoft EXCEL file is a file to be repaired;
s200: traversing all sectors of the Microsoft EXCEL file, and acquiring a directory stream sector and a configuration table sector;
s300: analyzing the configuration table sector and the directory stream sector, detecting whether the sector structure is complete, if so, executing the step S400, otherwise, executing the step S600;
s400: analyzing, checking and constructing a sector structure of a file header of the Microsoft EXCEL file according to complete configuration table sector information and complete directory stream sector information;
s500: according to the constructed head structure sector content and the sector data of the file to be repaired, all the sector data of the Microsoft EXCEL file are recombined, and the step S800 is executed;
s600: analyzing data stream sector data to obtain sub-stream data of the workbook data;
s700: analyzing the sub-stream data of the workbook data, recombining the worksheet data in the Microsoft EXCEL file, and executing step S900;
s800: writing the recombined new EXCEL file sector data into a new EXCEL file, and ending the process;
s900: writing the data into TXT files with corresponding number according to the number of the reorganized worksheet data: and newly creating a TXT file, and writing the cell data analyzed from the workbook substream of the worksheet type into the TXT file according to the contents of the array reordered by the two-dimensional array.
Preferably, the sector structure of the first sector of the microsoft EXCEL file is shown in table 1 below, wherein the content of the first 8 bytes is 0xD0 CF 11E0 a 1B 11A E1;
the SID represents a Sector Identifier, the SSAT represents a short Sector allocation table, the SAT represents a Sector allocation table, and the MSAT represents a main Sector allocation table.
Preferably, the step S200 includes the steps of:
s201: reading the byte content of each sector one by one for parsing each sector and comparing with the directory stream storage structure shown in table 2, determining whether the current sector belongs to a directory stream sector or a configuration table sector,
table 2 directory stream structure
Wherein, the DID represents a directory entry Identifier directory _ entry Identifier;
s202: judging whether the byte content of the current sector simultaneously meets the following five conditions of the directory stream sector, if so, judging that the current sector is the directory stream sector, recording the offset address of the current sector in the Microsoft EXCEL file, and recording the directory stream structure of the current sector according to the directory stream sector structure of the table 2; otherwise, judging whether the byte content of the next sector simultaneously meets the following five conditions of the directory stream sector until the judgment of the byte content of each sector is completed;
the first condition is as follows: the directory structure comprises 4 directory stream structures shown in the table 2, and each directory stream structure has a fixed byte length of 0x80 bytes;
and a second condition: the size of the area storing the name does not exceed 0x40 bytes;
and (3) carrying out a third condition: entry type does not exceed 0x05 bytes;
and a fourth condition: DID values of a left node, a right node and a root node in the directory structure meet the condition that DID is greater than 0 or DID is equal to-1;
and a fifth condition: the entry SID value of the flow is not less than 0 or SID value is equal to-2;
s203: judging whether the byte content of the current sector simultaneously meets the following two conditions of the configuration table sector, if so, judging that the current sector is the configuration table sector, and continuously judging which SSAT/SAT/MSAT the current configuration table sector belongs to, otherwise, judging whether the byte content of the next sector simultaneously meets the following two conditions of the configuration table sector until the judgment of the byte content of each sector is completed;
the first condition is as follows: the 0x80 byte content of the current sector uses 4 bytes as a group, and the SID values of each group are judged one by one, and each SID value needs to satisfy:
when the SID is greater than 0, the SID is smaller than (the total size of the Microsoft EXCEL file is-0 x200)/0x 200;
when the SID is less than 0, the SID value can be only one of-1, -2, -3 or-4;
and a second condition: when the SID value is equal to-1, all the SID values thereafter are-1;
judging the sector of the current configuration table belongs to one of four categories of SSAT/SAT/MSAT:
the first condition is as follows: if the configuration table sector is positioned in front of the directory stream sector, the current configuration table sector is SAT;
case two: if the configuration table sector is positioned behind the directory stream sector where the Root Entry is positioned, and the SID of the first positive number in the configuration table sector is equal to the intra-sector offset address/4 of the current SID value, the current configuration table sector is SSAT;
case three: if only the positive value and-1 negative value exist in the sector of the configuration table, the current sector of the configuration table is MSAT;
case four: if SID values with negative values of-3 and-4 exist in the sector of the configuration table, the sector of the current configuration table is SAT;
if the Root Entry exists in the directory stream sector, judging the maximum SID value of the SSAT according to an Entry Sid value and an Entry byte number in a Root Entry directory stream structure;
s204: generating SAT-SID linked list by SID value of SAT configuration table sector according to the classified configuration table sector: the SAT-SID linked list is formed by SAT array indexes and SIDs in one-to-one correspondence, wherein the array indexes are natural numbers, and the content of the SIDs is the SAT array index corresponding to the next SID of the current SID;
s205: according to the classified sectors of the configuration table, forming SSAT-SID linked list by SID values of the sectors of the SSAT configuration table: the SSAT-SID linked list is formed by SSAT array indexes and SIDs in one-to-one correspondence, wherein the SSAT array indexes are natural numbers, and the SIDs are SSAT array indexes corresponding to the SIDs next to the current SID.
Preferably, the step S300 includes the steps of:
judging whether the sector structure simultaneously meets the following four conditions, if so, the sector structure is a complete sector structure, and executing the step S400, otherwise, executing the step S600:
the first condition is as follows: detecting the SID values of all SAT configuration table sectors according to the SAT-SID linked list, wherein the linked list values corresponding to the SAT-SID linked list are all-3;
and a second condition: detecting that SID values of all directory stream sectors have a complete sub-chain table in the SAT-SID linked list according to the SAT-SID linked list, wherein the sub-chain table starts with the SID value of the directory stream sector where Root Entry is located, and the sector of the pointed next SID value is also the directory stream sector; and the child chain length of a directory stream sector is equal to the number of directory stream sectors and the last SID value of the child chain table of a directory stream sector is-2;
and (3) carrying out a third condition: when the entry byte number is more than or equal to 0x1000, detecting the entry SID value and the entry byte number of each directory stream structure in the directory stream sector according to the SAT-SID linked list to form a complete sub-link list in the SAT-SID linked list: starting with the entry SID value of the directory stream structure, the sub-chain length formed in the SAT-SID linked list is the larger of the entry byte number/0 x200-1 and the entry byte number/0 x 200;
and a fourth condition: when the entry byte number is less than 0x1000, detecting the entry SID value and entry byte number of each directory flow structure in the directory flow sector according to the SSAT-SID linked list to form a complete sub-chain list in the SSAT-SID linked list: starting with the entry SID value of the directory flow structure, the length of the child chain formed in the SSAT-SID linked list is equal to the larger of the entry byte number/0 x40-1 and the entry byte number/0 x 40.
Preferably, the step S400 includes the steps of:
according to the sector structure of the first sector of the microsoft EXCEL file shown in table 1, a new first sector of the file is constructed:
the fixed value parameter is assigned according to the content of the table 1;
variable parameter part: and setting the variable parameters by combining the known SSAT configuration table sector information, SAT configuration table sector information and directory stream sector information to generate a complete file first sector.
Preferably, the step S500 includes the steps of:
and recombining the complete file first sector generated in the step S400 and all the sector data except the complete file first sector in the file to be repaired into the sector data of the new EXCEL file.
7. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 1, wherein the step S600 comprises the following steps:
s601: acquiring data stream sectors in a file to be repaired, wherein the data stream sectors except a file head sector, a configuration table sector and a directory stream sector in the file to be repaired are data stream sectors;
s602: analyzing the sector content of the data stream, judging according to the identifiers of the data sub-streams pointed by different directory stream inlets, and acquiring the content of the first sector of the data sub-stream, wherein the data sub-stream is the data stream content with the following identifiers as the initial part:
Root Entry:0x01 00 00 02 08 00 00 00/0xFE FF 00 00/0x01 00 FE FF
\005DocumentInformation:0xFE FF 00 00
\005Summary DocumentInfomation:0xFE FF 00 00
workbook of 0x 090810000006
S603: after determining the first sector of the data substream of the workbook, traversing each data stream sector behind the first sector of the data substream of the workbook, and judging whether each data sector has one of a Root Entry identifier, a DocumentInformation identifier and a Summary DocumentInformation identifier one by one, if not, judging that the current sector is still the data of the substream sector of the workbook data, if so, judging that the data of the substream sector of the workbook data is finished, and neither the current sector nor the data stream sectors behind the current sector are the data of the substream sector of the workbook data;
s604: all sector data of the workbook data substream is obtained.
Preferably, the step S700 includes the steps of:
s701: the storage structure of the analysis workbook data sub-stream is 2 byte record type +2 byte record length + record content;
s702: the parse workbook sub-stream data is of the following type:
0x 0005: representing Workbook Global workbooks globals
0x 0006: representing VB Module Visual Basic Module
0x 0010: representing a table or dialog box Sheet or dialog
0x 0020: representing a Chart Chart
0x 0040: representing Macro-watch Macro sheet
0x 0100: represent workplace
The workbook substream of the 0x 0005 record type comprises all character string data and relevant file attribute information in the microsoft EXCEL file and the number and name of worksheets contained in the microsoft EXCEL file;
the workbook sub-stream of the 0x 0010 record type comprises cell data of a worksheet of a Microsoft EXCEL file, and the storage structure of the workbook sub-stream is cell row index value, cell column index value and cell data content;
s703: analyzing the workflow data record of the workbook according to the workflow data type of the workbook in step S702: the content of the cell data in the current worksheet is recorded in the workbook of the 0x 0010 worksheet type, the worksheet cell data is recorded according to the cell row index value, the cell column index value and the cell data content format, and the cell data information is shown in table 3:
TABLE 3 cell data information
Acquiring cell data of the workbook sub-stream according to the record content described in the above table 3;
s704: and for the cell data in each worksheet substream data, sorting the acquired cell data according to the row and column formats of the tables by using the cell row index value and the cell column index value.
Preferably, the step S704 includes the steps of:
s7041: presetting a two-dimensional array, wherein the size of the two-dimensional array is determined by the value of DIMENSION in a table 3 and is used for determining the range of rows and columns of the current working table; the two-dimensional array is used for storing cell data;
s7042: writing the records into the two-dimensional array according to respective cell row index values and cell column index values;
s7043: writing all the cell data acquired from the working table into the two-dimensional array according to the respective cell row index value and cell column index value to realize the reordering of the cell data of the working table, and executing step S900.
The invention has the beneficial effects that:
1. the method solves the technical problem that no data recombination and restoration method aiming at the Microsoft EXCEL file exists in the prior art.
2. Not only effectively extracting the EXCEL file data stream, but also performing sequencing and recombination to recover the data.
Drawings
FIG. 1 is a general flow diagram of a method provided by the present invention;
FIG. 2 is a diagram illustrating the generation of a SAT-SID linked list according to an embodiment of the present invention.
Detailed Description
Fig. 1 shows a general flow chart of the method provided by the present invention. As shown in fig. 1, the method comprises the following steps:
s100: loading the Microsoft EXCEL file, reading the byte content of the first sector, judging whether the byte content conforms to the sector structure of the first sector of the Microsoft EXCEL file, if so, executing the step S300, otherwise, executing the step S200, wherein the Microsoft EXCEL file is the file to be repaired.
Specifically, firstly, microsoft EXCEL file is loaded, the data of the first sector of microsoft EXCEL file is read, and the size of the read sector is defaulted to 0x200 bytes.
Secondly, according to the sector structure of the first sector of the microsoft EXCEL file shown in table 1, it is determined whether the first sector of the EXCEL file to be repaired meets the sector structure of the first sector of the microsoft EXCEL file, that is, it is determined whether the following conditions are met:
whether the first 8 bytes of the first sector is 0xD0 CF 11E0 a 1B 11A E1, if yes, execute step S300, otherwise execute step S200.
Table 1: sector structure of first sector of Microsoft EXCEL file
The SID represents a Sector Identifier, the SSAT represents a short Sector allocation table, the SAT represents a Sector allocation table, and the MSAT represents a main Sector allocation table.
S200: traversing all sectors of the Microsoft EXCEL file, acquiring a directory stream sector and a configuration table sector,
the method comprises the following steps:
s201: reading the byte content of each sector one by one for parsing each sector and comparing with the directory stream storage structure shown in table 2, determining whether the current sector belongs to a directory stream sector or a configuration table sector,
table 2 directory stream structure
Wherein, the DID represents a directory entry Identifier directory _ entry Identifier;
s202: judging whether the byte content of the current sector simultaneously meets the following five conditions of the directory stream sector, if so, judging that the current sector is the directory stream sector, recording the offset address of the current sector in the Microsoft EXCEL file, and recording the directory stream structure of the current sector according to the directory stream sector structure in the table 2; otherwise, judging whether the byte content of the next sector simultaneously meets the following five conditions of the directory stream sector until the judgment of the byte content of each sector is completed;
the first condition is as follows: contains 4 directory stream structures shown in table 2, and each directory stream structure has a fixed byte length of 0x80 bytes;
and a second condition: the size of the area storing the name does not exceed 0x40 bytes;
and (3) carrying out a third condition: entry type does not exceed 0x05 bytes;
and a fourth condition: DID values of a left node, a right node and a root node in the directory structure meet the condition that DID is greater than 0 or DID is equal to-1;
and a fifth condition: the entry SID value of the flow is not less than 0 or SID value is equal to-2;
s203: judging whether the byte content of the current sector simultaneously meets the following two conditions of the configuration table sector, if so, judging that the current sector is the configuration table sector, and continuously judging which SSAT/SAT/MSAT the current configuration table sector belongs to, otherwise, judging whether the byte content of the next sector simultaneously meets the following two conditions of the configuration table sector until the judgment of the byte content of each sector is completed;
the first condition is as follows: the 0x80 byte content of the current sector uses 4 bytes as a group, and the SID values of each group are judged one by one, and each SID value needs to satisfy:
when the SID is greater than 0, the SID is smaller than (Microsoft EXCEL file total size-0 x200)/0x 200;
when the SID is less than 0, the SID value can be only one of-1, -2, -3 or-4;
and a second condition: when the SID value is equal to-1, all the SID values thereafter are-1;
judging the sector of the current configuration table belongs to one of four categories of SSAT/SAT/MSAT:
the first condition is as follows: if the configuration table sector is positioned in front of the directory stream sector, the current configuration table sector is SAT;
case two: if the configuration table sector is positioned behind the directory stream sector where the Root Entry is positioned, and the SID of the first positive number in the configuration table sector is equal to the intra-sector offset address/4 of the current SID value, the current configuration table sector is SSAT;
case three: if only the positive value and-1 negative value exist in the sector of the configuration table, the current sector of the configuration table is MSAT;
case four: if SID values with negative values of-3 and-4 exist in the sector of the configuration table, the sector of the current configuration table is SAT;
if the Root Entry exists in the directory stream sector, judging the maximum SID value of the SSAT according to an Entry Sid value and an Entry byte number in a Root Entry directory stream structure;
s204: FIG. 2 is a diagram illustrating the generation of a SAT-SID linked list according to an embodiment of the present invention. As shown in fig. 2, according to the classified configuration table sectors, the SID value of the SAT configuration table sector is generated into an SAT-SID linked list:
the SAT-SID linked list is formed by SAT Array indexes (namely, Array indexes) and SIDs (namely, SAT contents (SecIDs)) in a one-to-one correspondence mode, wherein the Array indexes (Array indexes) are natural numbers, and the content of the SIDs is the SAT Array index (Array indexes) corresponding to the SID next to the current SID;
for example, as shown in fig. 2, the Array index (Array indexes) is 0, and then SID (SAT contents (SecIDs)) is 2, i.e., the SAT Array index (Array indexes) corresponding to the next SID is 2, and its corresponding SID (SAT contents (SecIDs)) is 3, and then 3 is equal to the SAT Array index (Array indexes) corresponding to the next SID, and its corresponding SID (SAT contents (SecIDs)) is-2, and so on.
S205: similarly, according to the classified configuration table sector, the SID value of the SSAT configuration table sector forms an SSAT-SID linked list: the SSAT-SID linked list is formed by SSAT array index and SID in one-to-one correspondence, the SSAT array index is a natural number, and the SID content is the SSAT array index corresponding to the next SID of the current SID.
S300: analyzing the sector of the configuration table and the sector of the directory stream, detecting whether the sector structure is complete, if so, executing the step S400, otherwise, executing the step S600, and comprising the following steps:
judging whether the sector structure simultaneously meets the following four conditions, if so, the sector structure is a complete sector structure, and executing the step S400, otherwise, executing the step S600:
the first condition is as follows: detecting the SID values of all SAT configuration table sectors according to the SAT-SID linked list, wherein the linked list values corresponding to the SAT-SID linked list are all-3;
and a second condition: detecting that SID values of all directory stream sectors have a complete sub-chain table in the SAT-SID linked list according to the SAT-SID linked list, wherein the sub-chain table starts with the SID value of the directory stream sector where Root Entry is located, and the sector of the pointed next SID value is also the directory stream sector; and the child chain length of a directory stream sector is equal to the number of directory stream sectors and the last SID value of the child chain table of a directory stream sector is-2;
and (3) carrying out a third condition: when the entry byte number is more than or equal to 0x1000, detecting the entry SID value and the entry byte number of each directory stream structure in the directory stream sector according to the SAT-SID linked list to form a complete sub-link list in the SAT-SID linked list: starting with the entry SID value of the directory stream structure, the sub-chain length formed in the SAT-SID linked list is the larger of the entry byte number/0 x200-1 and the entry byte number/0 x 200;
and a fourth condition: when the entry byte number is less than 0x1000, detecting the entry SID value and entry byte number of each directory flow structure in the directory flow sector according to the SSAT-SID linked list to form a complete sub-chain list in the SSAT-SID linked list: starting with the entry SID value of the directory flow structure, the length of the child chain formed in the SSAT-SID linked list is equal to the larger of the entry byte number/0 x40-1 and the entry byte number/0 x 40.
S400: analyzing, checking and constructing the sector structure of the file header of the Microsoft EXCEL file according to the complete configuration table sector information and the complete directory stream sector information, comprising the following steps:
constructing a new file first sector according to the sector structure of the first sector of the Microsoft EXCEL file shown in the table 1, wherein the fixed value parameter is assigned according to the content of the table 1;
variable parameter part: and setting the variable parameters by combining the known SSAT configuration table sector information, SAT configuration table sector information and directory stream sector information to generate a complete file first sector.
S500: according to the constructed head structure sector content and the sector data of the file to be repaired, all the sector data of the microsoft EXCEL file are recombined, and the step S800 is executed, which comprises the following steps:
and recombining the complete file first sector generated in the step S400 and all the sector data except the complete file first sector in the file to be repaired into the sector data of the new EXCEL file.
S600: analyzing data stream sector data to obtain data substream data of the workbook data, comprising the following steps:
s601: acquiring data stream sectors in a file to be repaired, wherein the data stream sectors except a file head sector, a configuration table sector and a directory stream sector in the file to be repaired are data stream sectors;
s602: analyzing the sector content of the data stream, judging according to the identifiers of the data sub-streams pointed by different directory stream inlets, and acquiring the content of the first sector of the data sub-stream, wherein the data sub-stream is the data stream content with the following identifiers as the initial part:
Root Entry:0x01 00 00 02 08 00 00 00/0xFE FF 00 00/0x01 00 FE FF
\005DocumentInformation:0xFE FF 00 00
\005Summary DocumentInfomation:0xFE FF 00 00
workbook of 0x 090810000006
S603: after determining the first sector of the data substream of the workbook, traversing each data stream sector behind the first sector of the data substream of the workbook, and judging whether each data sector has one of a Root Entry identifier, a DocumentInformation identifier and a Summary DocumentInformation identifier one by one, if not, judging that the current sector is still the data of the data substream sector of the workbook, if so, judging that the data of the substream sector of the workbook data is finished, and neither the current sector nor the data stream sectors behind the current sector are the data of the substream sector of the workbook data;
s604: all sector data of the workbook data substream is obtained.
S700: analyzing the sub-stream data of the workbook data, recombining the worksheet data in the Microsoft EXCEL file, and executing the step S900;
the method comprises the following steps:
s701: the storage structure of the analysis workbook data sub-stream is 2 byte record type +2 byte record length + record content;
s702: the parse workbook sub-stream data is of the following type:
0x 0005: representing Workbook Global workbooks globals
0x 0006: representing VB Module Visual Basic Module
0x 0010: representing a table or dialog box Sheet or dialog
0x 0020: representing a Chart Chart
0x 0040: representing Macro-watch Macro sheet
0x 0100: represent workplace
The workbook substream of the 0x 0005 record type comprises all character string data and relevant file attribute information in the microsoft EXCEL file and the number and name of worksheets contained in the microsoft EXCEL file;
the workbook sub-stream of the 0x 0010 record type comprises cell data of a worksheet of a Microsoft EXCEL file, and the storage structure of the workbook sub-stream is cell row index value, cell column index value and cell data content;
s703: analyzing the workflow data record of the workbook according to the workflow data type of the workbook in step S702: the content of the cell data in the current worksheet is recorded in the workbook of the 0x 0010 worksheet type, the worksheet cell data is recorded according to the cell row index value, the cell column index value and the cell data content format, and the cell data information is shown in table 3:
TABLE 3 cell data information
Acquiring cell data of the workbook sub-stream according to the record content described in the above table 3;
s704: for the cell data in each worksheet substream data, sorting the acquired cell data according to the row and column formats of the table by using the cell row index value and the cell column index value, comprising the following steps:
s7041: presetting a two-dimensional array, wherein the size of the two-dimensional array is determined by the value of DIMENSION in a table 3 and is used for determining the range of rows and columns of the current working table; the two-dimensional array is used for storing cell data;
s7042: writing the records into a two-dimensional array according to respective cell row index values and cell column index values;
s7043: writing all the cell data acquired from the worksheet into the two-dimensional array according to the respective cell row index value and cell column index value to realize the reordering of the worksheet cell data, and executing step S900.
S800: writing the recombined new EXCEL file sector data into a new EXCEL file, and ending the process;
s900: writing the data into TXT files with corresponding number according to the number of the reorganized worksheet data: and newly creating a TXT file, and writing the cell data analyzed from the workbook substream of the worksheet type into the TXT file according to the contents of the array reordered by the two-dimensional array.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations are possible to those skilled in the art in light of the above teachings, and that all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.
Claims (9)
1. A data reorganization and restoration method for Microsoft EXCEL files is characterized by comprising the following steps:
s100: loading the Microsoft EXCEL file, reading byte content of a first sector, judging whether the byte content conforms to a sector structure of the first sector of the Microsoft EXCEL file, if so, executing a step S300, otherwise, executing a step S200, wherein the Microsoft EXCEL file is a file to be repaired;
s200: traversing all sectors of the Microsoft EXCEL file, and acquiring a directory stream sector and a configuration table sector;
s300: analyzing the configuration table sector and the directory stream sector, detecting whether the sector structure is complete, if so, executing the step S400, otherwise, executing the step S600;
s400: analyzing, checking and constructing a sector structure of a file header of the Microsoft EXCEL file according to complete configuration table sector information and complete directory stream sector information;
s500: according to the constructed head structure sector content and the sector data of the file to be repaired, all the sector data of the Microsoft EXCEL file are recombined, and the step S800 is executed;
s600: analyzing data stream sector data to obtain sub-stream data of the workbook data;
s700: analyzing the sub-stream data of the workbook data, recombining the worksheet data in the Microsoft EXCEL file, and executing step S900;
s800: writing the recombined new EXCEL file sector data into a new EXCEL file, and ending the process;
s900: writing the data into TXT files with corresponding number according to the number of the reorganized worksheet data: and newly creating a TXT file, and writing the cell data analyzed from the workbook substream of the worksheet type into the TXT file according to the contents of the array reordered by the two-dimensional array.
2. The method for data reassembly and repair of microsoft EXCEL file according to claim 1, wherein the sector structure of the first sector of microsoft EXCEL file is shown in table 1 below, wherein the content of the first 8 bytes is 0xD0 CF 11E0 a 1B 11A E1;
table 1: sector structure of first sector of Microsoft EXCEL file
The SID represents a Sector Identifier, the SSAT represents a short Sector allocation table, the SAT represents a Sector allocation table, and the MSAT represents a main Sector allocation table.
3. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 1, wherein the step S200 comprises the steps of:
s201: reading the byte content of each sector one by one for parsing each sector and comparing with the directory stream storage structure shown in table 2, determining whether the current sector belongs to a directory stream sector or a configuration table sector,
table 2 directory stream structure
Wherein, the DID represents a directory entry Identifier directory _ entry Identifier;
s202: judging whether the byte content of the current sector simultaneously meets the following five conditions of the directory stream sector, if so, judging that the current sector is the directory stream sector, recording the offset address of the current sector in the Microsoft EXCEL file, and recording the directory stream structure of the current sector according to the directory stream sector structure of the table 2; otherwise, judging whether the byte content of the next sector simultaneously meets the following five conditions of the directory stream sector until the judgment of the byte content of each sector is completed;
the first condition is as follows: the directory structure comprises 4 directory stream structures shown in the table 2, and each directory stream structure has a fixed byte length of 0x80 bytes;
and a second condition: the size of the area storing the name does not exceed 0x40 bytes;
and (3) carrying out a third condition: entry type does not exceed 0x05 bytes;
and a fourth condition: DID values of a left node, a right node and a root node in the directory structure meet the condition that DID is greater than 0 or DID is equal to-1;
and a fifth condition: the entry SID value of the flow is not less than 0 or SID value is equal to-2;
s203: judging whether the byte content of the current sector simultaneously meets the following two conditions of the configuration table sector, if so, judging that the current sector is the configuration table sector, and continuously judging which SSAT/SAT/MSAT the current configuration table sector belongs to, otherwise, judging whether the byte content of the next sector simultaneously meets the following two conditions of the configuration table sector until the judgment of the byte content of each sector is completed;
the first condition is as follows: the 0x80 byte content of the current sector uses 4 bytes as a group, and the SID values of each group are judged one by one, and each SID value needs to satisfy:
when the SID is greater than 0, the SID is smaller than (the total size of the Microsoft EXCEL file is-0 x200)/0x 200;
when the SID is less than 0, the SID value can be only one of-1, -2, -3 or-4;
and a second condition: when the SID value is equal to-1, all the SID values thereafter are-1;
judging the sector of the current configuration table belongs to one of four categories of SSAT/SAT/MSAT:
the first condition is as follows: if the configuration table sector is positioned in front of the directory stream sector, the current configuration table sector is SAT;
case two: if the configuration table sector is positioned behind the directory stream sector where the Root Entry is positioned, and the SID of the first positive number in the configuration table sector is equal to the intra-sector offset address/4 of the current SID value, the current configuration table sector is SSAT;
case three: if only the positive value and-1 negative value exist in the sector of the configuration table, the current sector of the configuration table is MSAT;
case four: if SID values with negative values of-3 and-4 exist in the sector of the configuration table, the sector of the current configuration table is SAT;
if the Root Entry exists in the directory stream sector, judging the maximum SID value of the SSAT according to an Entry Sid value and an Entry byte number in a Root Entry directory stream structure;
s204: generating SAT-SID linked list by SID value of SAT configuration table sector according to the classified configuration table sector: the SAT-SID linked list is formed by SAT array indexes and SIDs in one-to-one correspondence, wherein the array indexes are natural numbers, and the content of the SIDs is the SAT array index corresponding to the next SID of the current SID;
s205: according to the classified sectors of the configuration table, forming SSAT-SID linked list by SID values of the sectors of the SSAT configuration table: the SSAT-SID linked list is formed by SSAT array indexes and SIDs in one-to-one correspondence, wherein the SSAT array indexes are natural numbers, and the SIDs are SSAT array indexes corresponding to the SIDs next to the current SID.
4. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 1, wherein the step S300 comprises the steps of:
judging whether the sector structure simultaneously meets the following four conditions, if so, the sector structure is a complete sector structure, and executing the step S400, otherwise, executing the step S600:
the first condition is as follows: detecting the SID values of all SAT configuration table sectors according to the SAT-SID linked list, wherein the linked list values corresponding to the SAT-SID linked list are all-3;
and a second condition: detecting that SID values of all directory stream sectors have a complete sub-chain table in the SAT-SID linked list according to the SAT-SID linked list, wherein the sub-chain table starts with the SID value of the directory stream sector where Root Entry is located, and the sector of the pointed next SID value is also the directory stream sector; and the child chain length of a directory stream sector is equal to the number of directory stream sectors and the last SID value of the child chain table of a directory stream sector is-2;
and (3) carrying out a third condition: when the entry byte number is more than or equal to 0x1000, detecting the entry SID value and the entry byte number of each directory stream structure in the directory stream sector according to the SAT-SID linked list to form a complete sub-link list in the SAT-SID linked list: starting with the entry SID value of the directory stream structure, the sub-chain length formed in the SAT-SID linked list is the larger of the entry byte number/0 x200-1 and the entry byte number/0 x 200;
and a fourth condition: when the entry byte number is less than 0x1000, detecting the entry SID value and entry byte number of each directory flow structure in the directory flow sector according to the SSAT-SID linked list to form a complete sub-chain list in the SSAT-SID linked list: starting with the entry SID value of the directory flow structure, the length of the child chain formed in the SSAT-SID linked list is equal to the larger of the entry byte number/0 x40-1 and the entry byte number/0 x 40.
5. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 1, wherein the step S400 comprises the steps of:
according to the sector structure of the first sector of the microsoft EXCEL file shown in table 1, a new first sector of the file is constructed:
the fixed value parameter is assigned according to the content of the table 1;
variable parameter part: and setting the variable parameters by combining the known SSAT configuration table sector information, SAT configuration table sector information and directory stream sector information to generate a complete file first sector.
6. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 5, wherein the step S500 comprises the steps of:
and recombining the complete file first sector generated in the step S400 and all the sector data except the complete file first sector in the file to be repaired into the sector data of the new EXCEL file.
7. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 1, wherein the step S600 comprises the following steps:
s601: acquiring data stream sectors in a file to be repaired, wherein the data stream sectors except a file head sector, a configuration table sector and a directory stream sector in the file to be repaired are data stream sectors;
s602: analyzing the sector content of the data stream, judging according to the identifiers of the data sub-streams pointed by different directory stream inlets, and acquiring the content of the first sector of the data sub-stream, wherein the data sub-stream is the data stream content with the following identifiers as the initial part:
Root Entry:0x01 00 00 02 08 00 00 00/0xFE FF 00 00/0x01 00 FE FF
\005DocumentInformation:0xFE FF 00 00
\005Summary DocumentInfomation:0xFE FF 00 00
workbook of 0x 090810000006
S603: after determining the first sector of the data substream of the workbook, traversing each data stream sector behind the first sector of the data substream of the workbook, and judging whether each data sector has one of a Root Entry identifier, a DocumentInformation identifier and a Summary DocumentInformation identifier one by one, if not, judging that the current sector is still the data of the substream sector of the workbook data, if so, judging that the data of the substream sector of the workbook data is finished, and neither the current sector nor the data stream sectors behind the current sector are the data of the substream sector of the workbook data;
s604: all sector data of the workbook data substream is obtained.
8. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 1, wherein the step S700 comprises the steps of:
s701: the storage structure of the analysis workbook data sub-stream is 2 byte record type +2 byte record length + record content;
s702: the parse workbook sub-stream data is of the following type:
0x 0005: representing Workbook Global workbooks globals
0x 0006: representing VB Module Visual Basic Module
0x 0010: representing a table or dialog box Sheet or dialog
0x 0020: representing a Chart Chart
0x 0040: representing Macro-watch Macro sheet
0x 0100: represent workplace
The workbook substream of the 0x 0005 record type comprises all character string data and relevant file attribute information in the microsoft EXCEL file and the number and name of worksheets contained in the microsoft EXCEL file;
the workbook sub-stream of the 0x 0010 record type comprises cell data of a worksheet of a Microsoft EXCEL file, and the storage structure of the workbook sub-stream is cell row index value, cell column index value and cell data content;
s703: analyzing the workflow data record of the workbook according to the workflow data type of the workbook in step S702: the content of the cell data in the current worksheet is recorded in the workbook of the 0x 0010 worksheet type, the worksheet cell data is recorded according to the cell row index value, the cell column index value and the cell data content format, and the cell data information is shown in table 3:
TABLE 3 cell data information
Acquiring cell data of the workbook sub-stream according to the record content described in the above table 3;
s704: and for the cell data in each worksheet substream data, sorting the acquired cell data according to the row and column formats of the tables by using the cell row index value and the cell column index value.
9. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 8, wherein the step S704 includes the steps of:
s7041: presetting a two-dimensional array, wherein the size of the two-dimensional array is determined by the value of DIMENSION in a table 3 and is used for determining the range of rows and columns of the current working table; the two-dimensional array is used for storing cell data;
s7042: writing the records into the two-dimensional array according to respective cell row index values and cell column index values;
s7043: writing all the cell data acquired from the working table into the two-dimensional array according to the respective cell row index value and cell column index value to realize the reordering of the cell data of the working table, and executing step S900.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010896794.XA CN112069130B (en) | 2020-08-31 | 2020-08-31 | Data reorganization and repair method for Microsoft EXCEL file |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010896794.XA CN112069130B (en) | 2020-08-31 | 2020-08-31 | Data reorganization and repair method for Microsoft EXCEL file |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112069130A true CN112069130A (en) | 2020-12-11 |
CN112069130B CN112069130B (en) | 2023-05-02 |
Family
ID=73666131
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010896794.XA Active CN112069130B (en) | 2020-08-31 | 2020-08-31 | Data reorganization and repair method for Microsoft EXCEL file |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112069130B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050028031A1 (en) * | 2003-07-28 | 2005-02-03 | Matsushita Electric Industrial Co., Ltd. | Data recovery method and data recording apparatus |
CN101882132A (en) * | 2009-04-27 | 2010-11-10 | 深圳市闪联信息技术有限公司 | Creating and reading method and device for virtual FAT file system |
CN106155845A (en) * | 2016-08-02 | 2016-11-23 | 四川效率源信息安全技术股份有限公司 | A kind of restructuring restoration methods based on XFS file system data |
CN106445736A (en) * | 2016-08-31 | 2017-02-22 | 四川效率源信息安全技术股份有限公司 | Method for extracting and recombining MTK62 series word stock data of mobile phone |
CN110083102A (en) * | 2019-05-10 | 2019-08-02 | 河南省启源电力勘测设计有限公司 | A kind of high-performance low-voltage power distribution cabinet is long-range and site monitoring system |
WO2020103493A1 (en) * | 2018-11-22 | 2020-05-28 | 厦门市美亚柏科信息股份有限公司 | Method and system for recovering deleted file based on fat32 file system |
-
2020
- 2020-08-31 CN CN202010896794.XA patent/CN112069130B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050028031A1 (en) * | 2003-07-28 | 2005-02-03 | Matsushita Electric Industrial Co., Ltd. | Data recovery method and data recording apparatus |
CN101882132A (en) * | 2009-04-27 | 2010-11-10 | 深圳市闪联信息技术有限公司 | Creating and reading method and device for virtual FAT file system |
CN106155845A (en) * | 2016-08-02 | 2016-11-23 | 四川效率源信息安全技术股份有限公司 | A kind of restructuring restoration methods based on XFS file system data |
CN106445736A (en) * | 2016-08-31 | 2017-02-22 | 四川效率源信息安全技术股份有限公司 | Method for extracting and recombining MTK62 series word stock data of mobile phone |
WO2020103493A1 (en) * | 2018-11-22 | 2020-05-28 | 厦门市美亚柏科信息股份有限公司 | Method and system for recovering deleted file based on fat32 file system |
CN110083102A (en) * | 2019-05-10 | 2019-08-02 | 河南省启源电力勘测设计有限公司 | A kind of high-performance low-voltage power distribution cabinet is long-range and site monitoring system |
Non-Patent Citations (2)
Title |
---|
ZHAO ZHENZHOU等: "The research of Excel file fragmentaton data recovery" * |
王维雄: "基于介质存储结构的数据恢复技术研究" * |
Also Published As
Publication number | Publication date |
---|---|
CN112069130B (en) | 2023-05-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6502101B1 (en) | Converting a hierarchical data structure into a flat data structure | |
US6904430B1 (en) | Method and system for efficiently identifying differences between large files | |
US6016492A (en) | Forward extensible property modifiers for formatting information in a program module | |
US7689630B1 (en) | Two-level bitmap structure for bit compression and data management | |
US20060239562A1 (en) | System and method for binary persistence format for a recognition result lattice | |
US8458231B1 (en) | Word processor data organization | |
CN1794224A (en) | File formats, methods, and computer program products for representing workbooks | |
CN104715039A (en) | Column-based storage and research method and equipment based on hard disk and internal storage | |
CN105144157A (en) | System and method for compressing data in database | |
CN116580414A (en) | Contract document difference detection method and device based on ICR character matrix | |
CN114021543B (en) | Document comparison analysis method and system based on table structure analysis | |
CN107291832A (en) | A kind of date storage method based on list storage structure | |
CN112069130B (en) | Data reorganization and repair method for Microsoft EXCEL file | |
CN105938469A (en) | Code storage method, data storage structure of texts and method for compressed storage of texts and statistics output | |
CN112434085A (en) | Roaring Bitmap-based user data statistical method | |
CN101464875B (en) | Method for representing electronic dictionary catalog data by XML | |
CN110111852A (en) | A kind of magnanimity DNA sequencing data lossless Fast Compression platform | |
US6118391A (en) | Compression into arbitrary character sets | |
CN112183029A (en) | Digital conversion method for PDF drawing in sheet metal industry | |
CN112579546B (en) | File compression method, system, storage medium and terminal | |
CN117217172B (en) | Table information acquisition method, apparatus, computer device, and storage medium | |
US20240088913A1 (en) | Graph data compression method and apparatus | |
CN112232032B (en) | Automatic conversion method for content style of docx document | |
CN114218895B (en) | Method for converting UOF document | |
CN113779218B (en) | Question-answer pair construction method, question-answer pair construction device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |