CN112069130A - Data recombination and restoration method for Microsoft EXCEL file - Google Patents

Data recombination and restoration method for Microsoft EXCEL file Download PDF

Info

Publication number
CN112069130A
CN112069130A CN202010896794.XA CN202010896794A CN112069130A CN 112069130 A CN112069130 A CN 112069130A CN 202010896794 A CN202010896794 A CN 202010896794A CN 112069130 A CN112069130 A CN 112069130A
Authority
CN
China
Prior art keywords
sector
data
sid
stream
directory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010896794.XA
Other languages
Chinese (zh)
Other versions
CN112069130B (en
Inventor
梁效宁
许超明
何丽萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xly Salvationdata Technology Inc
Original Assignee
Xly Salvationdata Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xly Salvationdata Technology Inc filed Critical Xly Salvationdata Technology Inc
Priority to CN202010896794.XA priority Critical patent/CN112069130B/en
Publication of CN112069130A publication Critical patent/CN112069130A/en
Application granted granted Critical
Publication of CN112069130B publication Critical patent/CN112069130B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • G06F12/0238Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • G06F12/0246Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory in block erasable memory, e.g. flash memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data recombination and restoration method aiming at Microsoft EXCEL files, which is characterized by comprising the following steps: s100: judging whether the byte content conforms to the sector structure of the first sector of the Microsoft EXCEL file, if so, executing the step S300, otherwise, executing the step S200; s200: traversing all sectors to obtain a directory stream sector and a configuration table sector; s300: analyzing the sector of the configuration table and the sector of the directory stream, detecting whether the structure of the sector is complete, if so, executing the step S400, otherwise, executing the step S600; s400: analyzing, checking and constructing a sector structure of a file header of the Microsoft EXCEL file; s500: recombining all sector data of the Microsoft EXCEL file, and executing the step S800; s600: analyzing data stream sector data to obtain sub-stream data of the workbook data; s700: recombining the work table data in the Microsoft EXCEL file and executing the step S900; s800: writing the recombined new EXCEL file sector data into a new EXCEL file, and ending the process; s900: and writing the TXT files with the corresponding number according to the number of the reorganized worksheet data.

Description

Data recombination and restoration method for Microsoft EXCEL file
Technical Field
The invention belongs to the field of electronic evidence obtaining and data recovery, and relates to a data recombination and restoration method for Microsoft EXCEL files.
Background
Microsoft office from Microsoft corporation is the most common office software, and EXCEL files are the indispensable tools for processing large amounts of tabular data, the contents of which usually include statistical and numerical data marked with text, or graphics or pictures embedded in a spreadsheet with the suffix name of xls. Mathematical equations for calculating numerical data or statistical data, and partial function libraries are also provided as an integration function of Microsoft EXCEL. Formatting properties and other layout elements can be implemented into a new EXCEL spreadsheet using XLS files (commonly referred to as EXCEL templates).
Microsoft Excel files store Microsoft Excel documents using the xls proprietary format. This file format is called binary exchange file format (BIFF). The data in all Office document files is stored in a series of fixed-size streams. The location of each fixed-size stream is described by several sector allocation tables. These streams contain directory information, configuration table information, and all data information about the document.
Microsoft Office is used as the most widely applied Office software in daily work and life, wherein the EXCEL format is used as a table format file, and the application range is very wide. The EXCEL file also has a fragile characteristic as one of electronic data due to its convenience in data processing and extremely high frequency of use. When some emergency conditions affect the normal use of Microsoft EXCEL, the risk of damage or loss of important files is increased, and the bad or lost EXCEL files are extremely difficult to recover, so that the data in the EXCEL files of users are lost, and the working efficiency of the users is reduced and even the loss is caused. In the prior art, no data recombination and repair method for microsoft EXCEL files exists.
Disclosure of Invention
Aiming at the defects of the prior art, the data reorganization and restoration method for the EXCEL file of Microsoft Office is provided, and by judging whether the file format is the Xls format (namely, the EXCEL format), traversing the file data content sector by sector, analyzing the data type of each sector and classifying the sector content, performing word-by-word analysis on each sector data in different modes, extracting the effective data for restoring the EXCEL file, recombining the data by combining the cell data content and the cell style information, thereby realizing the EXCEL file restoration, and finally exporting a new Xls file or exporting the txt file restored for each worksheet.
The invention comprises the following steps:
s100: loading the Microsoft EXCEL file, reading byte content of a first sector, judging whether the byte content conforms to a sector structure of the first sector of the Microsoft EXCEL file, if so, executing a step S300, otherwise, executing a step S200, wherein the Microsoft EXCEL file is a file to be repaired;
s200: traversing all sectors of the Microsoft EXCEL file, and acquiring a directory stream sector and a configuration table sector;
s300: analyzing the configuration table sector and the directory stream sector, detecting whether the sector structure is complete, if so, executing the step S400, otherwise, executing the step S600;
s400: analyzing, checking and constructing a sector structure of a file header of the Microsoft EXCEL file according to complete configuration table sector information and complete directory stream sector information;
s500: according to the constructed head structure sector content and the sector data of the file to be repaired, all the sector data of the Microsoft EXCEL file are recombined, and the step S800 is executed;
s600: analyzing data stream sector data to obtain sub-stream data of the workbook data;
s700: analyzing the sub-stream data of the workbook data, recombining the worksheet data in the Microsoft EXCEL file, and executing step S900;
s800: writing the recombined new EXCEL file sector data into a new EXCEL file, and ending the process;
s900: writing the data into TXT files with corresponding number according to the number of the reorganized worksheet data: and newly creating a TXT file, and writing the cell data analyzed from the workbook substream of the worksheet type into the TXT file according to the contents of the array reordered by the two-dimensional array.
Preferably, the sector structure of the first sector of the microsoft EXCEL file is shown in table 1 below, wherein the content of the first 8 bytes is 0xD0 CF 11E0 a 1B 11A E1;
Figure BDA0002658645410000031
the SID represents a Sector Identifier, the SSAT represents a short Sector allocation table, the SAT represents a Sector allocation table, and the MSAT represents a main Sector allocation table.
Preferably, the step S200 includes the steps of:
s201: reading the byte content of each sector one by one for parsing each sector and comparing with the directory stream storage structure shown in table 2, determining whether the current sector belongs to a directory stream sector or a configuration table sector,
table 2 directory stream structure
Figure BDA0002658645410000041
Wherein, the DID represents a directory entry Identifier directory _ entry Identifier;
s202: judging whether the byte content of the current sector simultaneously meets the following five conditions of the directory stream sector, if so, judging that the current sector is the directory stream sector, recording the offset address of the current sector in the Microsoft EXCEL file, and recording the directory stream structure of the current sector according to the directory stream sector structure of the table 2; otherwise, judging whether the byte content of the next sector simultaneously meets the following five conditions of the directory stream sector until the judgment of the byte content of each sector is completed;
the first condition is as follows: the directory structure comprises 4 directory stream structures shown in the table 2, and each directory stream structure has a fixed byte length of 0x80 bytes;
and a second condition: the size of the area storing the name does not exceed 0x40 bytes;
and (3) carrying out a third condition: entry type does not exceed 0x05 bytes;
and a fourth condition: DID values of a left node, a right node and a root node in the directory structure meet the condition that DID is greater than 0 or DID is equal to-1;
and a fifth condition: the entry SID value of the flow is not less than 0 or SID value is equal to-2;
s203: judging whether the byte content of the current sector simultaneously meets the following two conditions of the configuration table sector, if so, judging that the current sector is the configuration table sector, and continuously judging which SSAT/SAT/MSAT the current configuration table sector belongs to, otherwise, judging whether the byte content of the next sector simultaneously meets the following two conditions of the configuration table sector until the judgment of the byte content of each sector is completed;
the first condition is as follows: the 0x80 byte content of the current sector uses 4 bytes as a group, and the SID values of each group are judged one by one, and each SID value needs to satisfy:
when the SID is greater than 0, the SID is smaller than (the total size of the Microsoft EXCEL file is-0 x200)/0x 200;
when the SID is less than 0, the SID value can be only one of-1, -2, -3 or-4;
and a second condition: when the SID value is equal to-1, all the SID values thereafter are-1;
judging the sector of the current configuration table belongs to one of four categories of SSAT/SAT/MSAT:
the first condition is as follows: if the configuration table sector is positioned in front of the directory stream sector, the current configuration table sector is SAT;
case two: if the configuration table sector is positioned behind the directory stream sector where the Root Entry is positioned, and the SID of the first positive number in the configuration table sector is equal to the intra-sector offset address/4 of the current SID value, the current configuration table sector is SSAT;
case three: if only the positive value and-1 negative value exist in the sector of the configuration table, the current sector of the configuration table is MSAT;
case four: if SID values with negative values of-3 and-4 exist in the sector of the configuration table, the sector of the current configuration table is SAT;
if the Root Entry exists in the directory stream sector, judging the maximum SID value of the SSAT according to an Entry Sid value and an Entry byte number in a Root Entry directory stream structure;
s204: generating SAT-SID linked list by SID value of SAT configuration table sector according to the classified configuration table sector: the SAT-SID linked list is formed by SAT array indexes and SIDs in one-to-one correspondence, wherein the array indexes are natural numbers, and the content of the SIDs is the SAT array index corresponding to the next SID of the current SID;
s205: according to the classified sectors of the configuration table, forming SSAT-SID linked list by SID values of the sectors of the SSAT configuration table: the SSAT-SID linked list is formed by SSAT array indexes and SIDs in one-to-one correspondence, wherein the SSAT array indexes are natural numbers, and the SIDs are SSAT array indexes corresponding to the SIDs next to the current SID.
Preferably, the step S300 includes the steps of:
judging whether the sector structure simultaneously meets the following four conditions, if so, the sector structure is a complete sector structure, and executing the step S400, otherwise, executing the step S600:
the first condition is as follows: detecting the SID values of all SAT configuration table sectors according to the SAT-SID linked list, wherein the linked list values corresponding to the SAT-SID linked list are all-3;
and a second condition: detecting that SID values of all directory stream sectors have a complete sub-chain table in the SAT-SID linked list according to the SAT-SID linked list, wherein the sub-chain table starts with the SID value of the directory stream sector where Root Entry is located, and the sector of the pointed next SID value is also the directory stream sector; and the child chain length of a directory stream sector is equal to the number of directory stream sectors and the last SID value of the child chain table of a directory stream sector is-2;
and (3) carrying out a third condition: when the entry byte number is more than or equal to 0x1000, detecting the entry SID value and the entry byte number of each directory stream structure in the directory stream sector according to the SAT-SID linked list to form a complete sub-link list in the SAT-SID linked list: starting with the entry SID value of the directory stream structure, the sub-chain length formed in the SAT-SID linked list is the larger of the entry byte number/0 x200-1 and the entry byte number/0 x 200;
and a fourth condition: when the entry byte number is less than 0x1000, detecting the entry SID value and entry byte number of each directory flow structure in the directory flow sector according to the SSAT-SID linked list to form a complete sub-chain list in the SSAT-SID linked list: starting with the entry SID value of the directory flow structure, the length of the child chain formed in the SSAT-SID linked list is equal to the larger of the entry byte number/0 x40-1 and the entry byte number/0 x 40.
Preferably, the step S400 includes the steps of:
according to the sector structure of the first sector of the microsoft EXCEL file shown in table 1, a new first sector of the file is constructed:
the fixed value parameter is assigned according to the content of the table 1;
variable parameter part: and setting the variable parameters by combining the known SSAT configuration table sector information, SAT configuration table sector information and directory stream sector information to generate a complete file first sector.
Preferably, the step S500 includes the steps of:
and recombining the complete file first sector generated in the step S400 and all the sector data except the complete file first sector in the file to be repaired into the sector data of the new EXCEL file.
7. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 1, wherein the step S600 comprises the following steps:
s601: acquiring data stream sectors in a file to be repaired, wherein the data stream sectors except a file head sector, a configuration table sector and a directory stream sector in the file to be repaired are data stream sectors;
s602: analyzing the sector content of the data stream, judging according to the identifiers of the data sub-streams pointed by different directory stream inlets, and acquiring the content of the first sector of the data sub-stream, wherein the data sub-stream is the data stream content with the following identifiers as the initial part:
Root Entry:0x01 00 00 02 08 00 00 00/0xFE FF 00 00/0x01 00 FE FF
\005DocumentInformation:0xFE FF 00 00
\005Summary DocumentInfomation:0xFE FF 00 00
workbook of 0x 090810000006
S603: after determining the first sector of the data substream of the workbook, traversing each data stream sector behind the first sector of the data substream of the workbook, and judging whether each data sector has one of a Root Entry identifier, a DocumentInformation identifier and a Summary DocumentInformation identifier one by one, if not, judging that the current sector is still the data of the substream sector of the workbook data, if so, judging that the data of the substream sector of the workbook data is finished, and neither the current sector nor the data stream sectors behind the current sector are the data of the substream sector of the workbook data;
s604: all sector data of the workbook data substream is obtained.
Preferably, the step S700 includes the steps of:
s701: the storage structure of the analysis workbook data sub-stream is 2 byte record type +2 byte record length + record content;
s702: the parse workbook sub-stream data is of the following type:
0x 0005: representing Workbook Global workbooks globals
0x 0006: representing VB Module Visual Basic Module
0x 0010: representing a table or dialog box Sheet or dialog
0x 0020: representing a Chart Chart
0x 0040: representing Macro-watch Macro sheet
0x 0100: represent workplace
The workbook substream of the 0x 0005 record type comprises all character string data and relevant file attribute information in the microsoft EXCEL file and the number and name of worksheets contained in the microsoft EXCEL file;
the workbook sub-stream of the 0x 0010 record type comprises cell data of a worksheet of a Microsoft EXCEL file, and the storage structure of the workbook sub-stream is cell row index value, cell column index value and cell data content;
s703: analyzing the workflow data record of the workbook according to the workflow data type of the workbook in step S702: the content of the cell data in the current worksheet is recorded in the workbook of the 0x 0010 worksheet type, the worksheet cell data is recorded according to the cell row index value, the cell column index value and the cell data content format, and the cell data information is shown in table 3:
TABLE 3 cell data information
Figure BDA0002658645410000091
Acquiring cell data of the workbook sub-stream according to the record content described in the above table 3;
s704: and for the cell data in each worksheet substream data, sorting the acquired cell data according to the row and column formats of the tables by using the cell row index value and the cell column index value.
Preferably, the step S704 includes the steps of:
s7041: presetting a two-dimensional array, wherein the size of the two-dimensional array is determined by the value of DIMENSION in a table 3 and is used for determining the range of rows and columns of the current working table; the two-dimensional array is used for storing cell data;
s7042: writing the records into the two-dimensional array according to respective cell row index values and cell column index values;
s7043: writing all the cell data acquired from the working table into the two-dimensional array according to the respective cell row index value and cell column index value to realize the reordering of the cell data of the working table, and executing step S900.
The invention has the beneficial effects that:
1. the method solves the technical problem that no data recombination and restoration method aiming at the Microsoft EXCEL file exists in the prior art.
2. Not only effectively extracting the EXCEL file data stream, but also performing sequencing and recombination to recover the data.
Drawings
FIG. 1 is a general flow diagram of a method provided by the present invention;
FIG. 2 is a diagram illustrating the generation of a SAT-SID linked list according to an embodiment of the present invention.
Detailed Description
Fig. 1 shows a general flow chart of the method provided by the present invention. As shown in fig. 1, the method comprises the following steps:
s100: loading the Microsoft EXCEL file, reading the byte content of the first sector, judging whether the byte content conforms to the sector structure of the first sector of the Microsoft EXCEL file, if so, executing the step S300, otherwise, executing the step S200, wherein the Microsoft EXCEL file is the file to be repaired.
Specifically, firstly, microsoft EXCEL file is loaded, the data of the first sector of microsoft EXCEL file is read, and the size of the read sector is defaulted to 0x200 bytes.
Secondly, according to the sector structure of the first sector of the microsoft EXCEL file shown in table 1, it is determined whether the first sector of the EXCEL file to be repaired meets the sector structure of the first sector of the microsoft EXCEL file, that is, it is determined whether the following conditions are met:
whether the first 8 bytes of the first sector is 0xD0 CF 11E0 a 1B 11A E1, if yes, execute step S300, otherwise execute step S200.
Table 1: sector structure of first sector of Microsoft EXCEL file
Figure BDA0002658645410000101
Figure BDA0002658645410000111
The SID represents a Sector Identifier, the SSAT represents a short Sector allocation table, the SAT represents a Sector allocation table, and the MSAT represents a main Sector allocation table.
S200: traversing all sectors of the Microsoft EXCEL file, acquiring a directory stream sector and a configuration table sector,
the method comprises the following steps:
s201: reading the byte content of each sector one by one for parsing each sector and comparing with the directory stream storage structure shown in table 2, determining whether the current sector belongs to a directory stream sector or a configuration table sector,
table 2 directory stream structure
Figure BDA0002658645410000112
Wherein, the DID represents a directory entry Identifier directory _ entry Identifier;
s202: judging whether the byte content of the current sector simultaneously meets the following five conditions of the directory stream sector, if so, judging that the current sector is the directory stream sector, recording the offset address of the current sector in the Microsoft EXCEL file, and recording the directory stream structure of the current sector according to the directory stream sector structure in the table 2; otherwise, judging whether the byte content of the next sector simultaneously meets the following five conditions of the directory stream sector until the judgment of the byte content of each sector is completed;
the first condition is as follows: contains 4 directory stream structures shown in table 2, and each directory stream structure has a fixed byte length of 0x80 bytes;
and a second condition: the size of the area storing the name does not exceed 0x40 bytes;
and (3) carrying out a third condition: entry type does not exceed 0x05 bytes;
and a fourth condition: DID values of a left node, a right node and a root node in the directory structure meet the condition that DID is greater than 0 or DID is equal to-1;
and a fifth condition: the entry SID value of the flow is not less than 0 or SID value is equal to-2;
s203: judging whether the byte content of the current sector simultaneously meets the following two conditions of the configuration table sector, if so, judging that the current sector is the configuration table sector, and continuously judging which SSAT/SAT/MSAT the current configuration table sector belongs to, otherwise, judging whether the byte content of the next sector simultaneously meets the following two conditions of the configuration table sector until the judgment of the byte content of each sector is completed;
the first condition is as follows: the 0x80 byte content of the current sector uses 4 bytes as a group, and the SID values of each group are judged one by one, and each SID value needs to satisfy:
when the SID is greater than 0, the SID is smaller than (Microsoft EXCEL file total size-0 x200)/0x 200;
when the SID is less than 0, the SID value can be only one of-1, -2, -3 or-4;
and a second condition: when the SID value is equal to-1, all the SID values thereafter are-1;
judging the sector of the current configuration table belongs to one of four categories of SSAT/SAT/MSAT:
the first condition is as follows: if the configuration table sector is positioned in front of the directory stream sector, the current configuration table sector is SAT;
case two: if the configuration table sector is positioned behind the directory stream sector where the Root Entry is positioned, and the SID of the first positive number in the configuration table sector is equal to the intra-sector offset address/4 of the current SID value, the current configuration table sector is SSAT;
case three: if only the positive value and-1 negative value exist in the sector of the configuration table, the current sector of the configuration table is MSAT;
case four: if SID values with negative values of-3 and-4 exist in the sector of the configuration table, the sector of the current configuration table is SAT;
if the Root Entry exists in the directory stream sector, judging the maximum SID value of the SSAT according to an Entry Sid value and an Entry byte number in a Root Entry directory stream structure;
s204: FIG. 2 is a diagram illustrating the generation of a SAT-SID linked list according to an embodiment of the present invention. As shown in fig. 2, according to the classified configuration table sectors, the SID value of the SAT configuration table sector is generated into an SAT-SID linked list:
the SAT-SID linked list is formed by SAT Array indexes (namely, Array indexes) and SIDs (namely, SAT contents (SecIDs)) in a one-to-one correspondence mode, wherein the Array indexes (Array indexes) are natural numbers, and the content of the SIDs is the SAT Array index (Array indexes) corresponding to the SID next to the current SID;
for example, as shown in fig. 2, the Array index (Array indexes) is 0, and then SID (SAT contents (SecIDs)) is 2, i.e., the SAT Array index (Array indexes) corresponding to the next SID is 2, and its corresponding SID (SAT contents (SecIDs)) is 3, and then 3 is equal to the SAT Array index (Array indexes) corresponding to the next SID, and its corresponding SID (SAT contents (SecIDs)) is-2, and so on.
S205: similarly, according to the classified configuration table sector, the SID value of the SSAT configuration table sector forms an SSAT-SID linked list: the SSAT-SID linked list is formed by SSAT array index and SID in one-to-one correspondence, the SSAT array index is a natural number, and the SID content is the SSAT array index corresponding to the next SID of the current SID.
S300: analyzing the sector of the configuration table and the sector of the directory stream, detecting whether the sector structure is complete, if so, executing the step S400, otherwise, executing the step S600, and comprising the following steps:
judging whether the sector structure simultaneously meets the following four conditions, if so, the sector structure is a complete sector structure, and executing the step S400, otherwise, executing the step S600:
the first condition is as follows: detecting the SID values of all SAT configuration table sectors according to the SAT-SID linked list, wherein the linked list values corresponding to the SAT-SID linked list are all-3;
and a second condition: detecting that SID values of all directory stream sectors have a complete sub-chain table in the SAT-SID linked list according to the SAT-SID linked list, wherein the sub-chain table starts with the SID value of the directory stream sector where Root Entry is located, and the sector of the pointed next SID value is also the directory stream sector; and the child chain length of a directory stream sector is equal to the number of directory stream sectors and the last SID value of the child chain table of a directory stream sector is-2;
and (3) carrying out a third condition: when the entry byte number is more than or equal to 0x1000, detecting the entry SID value and the entry byte number of each directory stream structure in the directory stream sector according to the SAT-SID linked list to form a complete sub-link list in the SAT-SID linked list: starting with the entry SID value of the directory stream structure, the sub-chain length formed in the SAT-SID linked list is the larger of the entry byte number/0 x200-1 and the entry byte number/0 x 200;
and a fourth condition: when the entry byte number is less than 0x1000, detecting the entry SID value and entry byte number of each directory flow structure in the directory flow sector according to the SSAT-SID linked list to form a complete sub-chain list in the SSAT-SID linked list: starting with the entry SID value of the directory flow structure, the length of the child chain formed in the SSAT-SID linked list is equal to the larger of the entry byte number/0 x40-1 and the entry byte number/0 x 40.
S400: analyzing, checking and constructing the sector structure of the file header of the Microsoft EXCEL file according to the complete configuration table sector information and the complete directory stream sector information, comprising the following steps:
constructing a new file first sector according to the sector structure of the first sector of the Microsoft EXCEL file shown in the table 1, wherein the fixed value parameter is assigned according to the content of the table 1;
variable parameter part: and setting the variable parameters by combining the known SSAT configuration table sector information, SAT configuration table sector information and directory stream sector information to generate a complete file first sector.
S500: according to the constructed head structure sector content and the sector data of the file to be repaired, all the sector data of the microsoft EXCEL file are recombined, and the step S800 is executed, which comprises the following steps:
and recombining the complete file first sector generated in the step S400 and all the sector data except the complete file first sector in the file to be repaired into the sector data of the new EXCEL file.
S600: analyzing data stream sector data to obtain data substream data of the workbook data, comprising the following steps:
s601: acquiring data stream sectors in a file to be repaired, wherein the data stream sectors except a file head sector, a configuration table sector and a directory stream sector in the file to be repaired are data stream sectors;
s602: analyzing the sector content of the data stream, judging according to the identifiers of the data sub-streams pointed by different directory stream inlets, and acquiring the content of the first sector of the data sub-stream, wherein the data sub-stream is the data stream content with the following identifiers as the initial part:
Root Entry:0x01 00 00 02 08 00 00 00/0xFE FF 00 00/0x01 00 FE FF
\005DocumentInformation:0xFE FF 00 00
\005Summary DocumentInfomation:0xFE FF 00 00
workbook of 0x 090810000006
S603: after determining the first sector of the data substream of the workbook, traversing each data stream sector behind the first sector of the data substream of the workbook, and judging whether each data sector has one of a Root Entry identifier, a DocumentInformation identifier and a Summary DocumentInformation identifier one by one, if not, judging that the current sector is still the data of the data substream sector of the workbook, if so, judging that the data of the substream sector of the workbook data is finished, and neither the current sector nor the data stream sectors behind the current sector are the data of the substream sector of the workbook data;
s604: all sector data of the workbook data substream is obtained.
S700: analyzing the sub-stream data of the workbook data, recombining the worksheet data in the Microsoft EXCEL file, and executing the step S900;
the method comprises the following steps:
s701: the storage structure of the analysis workbook data sub-stream is 2 byte record type +2 byte record length + record content;
s702: the parse workbook sub-stream data is of the following type:
0x 0005: representing Workbook Global workbooks globals
0x 0006: representing VB Module Visual Basic Module
0x 0010: representing a table or dialog box Sheet or dialog
0x 0020: representing a Chart Chart
0x 0040: representing Macro-watch Macro sheet
0x 0100: represent workplace
The workbook substream of the 0x 0005 record type comprises all character string data and relevant file attribute information in the microsoft EXCEL file and the number and name of worksheets contained in the microsoft EXCEL file;
the workbook sub-stream of the 0x 0010 record type comprises cell data of a worksheet of a Microsoft EXCEL file, and the storage structure of the workbook sub-stream is cell row index value, cell column index value and cell data content;
s703: analyzing the workflow data record of the workbook according to the workflow data type of the workbook in step S702: the content of the cell data in the current worksheet is recorded in the workbook of the 0x 0010 worksheet type, the worksheet cell data is recorded according to the cell row index value, the cell column index value and the cell data content format, and the cell data information is shown in table 3:
TABLE 3 cell data information
Figure BDA0002658645410000171
Acquiring cell data of the workbook sub-stream according to the record content described in the above table 3;
s704: for the cell data in each worksheet substream data, sorting the acquired cell data according to the row and column formats of the table by using the cell row index value and the cell column index value, comprising the following steps:
s7041: presetting a two-dimensional array, wherein the size of the two-dimensional array is determined by the value of DIMENSION in a table 3 and is used for determining the range of rows and columns of the current working table; the two-dimensional array is used for storing cell data;
s7042: writing the records into a two-dimensional array according to respective cell row index values and cell column index values;
s7043: writing all the cell data acquired from the worksheet into the two-dimensional array according to the respective cell row index value and cell column index value to realize the reordering of the worksheet cell data, and executing step S900.
S800: writing the recombined new EXCEL file sector data into a new EXCEL file, and ending the process;
s900: writing the data into TXT files with corresponding number according to the number of the reorganized worksheet data: and newly creating a TXT file, and writing the cell data analyzed from the workbook substream of the worksheet type into the TXT file according to the contents of the array reordered by the two-dimensional array.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations are possible to those skilled in the art in light of the above teachings, and that all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (9)

1. A data reorganization and restoration method for Microsoft EXCEL files is characterized by comprising the following steps:
s100: loading the Microsoft EXCEL file, reading byte content of a first sector, judging whether the byte content conforms to a sector structure of the first sector of the Microsoft EXCEL file, if so, executing a step S300, otherwise, executing a step S200, wherein the Microsoft EXCEL file is a file to be repaired;
s200: traversing all sectors of the Microsoft EXCEL file, and acquiring a directory stream sector and a configuration table sector;
s300: analyzing the configuration table sector and the directory stream sector, detecting whether the sector structure is complete, if so, executing the step S400, otherwise, executing the step S600;
s400: analyzing, checking and constructing a sector structure of a file header of the Microsoft EXCEL file according to complete configuration table sector information and complete directory stream sector information;
s500: according to the constructed head structure sector content and the sector data of the file to be repaired, all the sector data of the Microsoft EXCEL file are recombined, and the step S800 is executed;
s600: analyzing data stream sector data to obtain sub-stream data of the workbook data;
s700: analyzing the sub-stream data of the workbook data, recombining the worksheet data in the Microsoft EXCEL file, and executing step S900;
s800: writing the recombined new EXCEL file sector data into a new EXCEL file, and ending the process;
s900: writing the data into TXT files with corresponding number according to the number of the reorganized worksheet data: and newly creating a TXT file, and writing the cell data analyzed from the workbook substream of the worksheet type into the TXT file according to the contents of the array reordered by the two-dimensional array.
2. The method for data reassembly and repair of microsoft EXCEL file according to claim 1, wherein the sector structure of the first sector of microsoft EXCEL file is shown in table 1 below, wherein the content of the first 8 bytes is 0xD0 CF 11E0 a 1B 11A E1;
table 1: sector structure of first sector of Microsoft EXCEL file
Figure FDA0002658645400000021
The SID represents a Sector Identifier, the SSAT represents a short Sector allocation table, the SAT represents a Sector allocation table, and the MSAT represents a main Sector allocation table.
3. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 1, wherein the step S200 comprises the steps of:
s201: reading the byte content of each sector one by one for parsing each sector and comparing with the directory stream storage structure shown in table 2, determining whether the current sector belongs to a directory stream sector or a configuration table sector,
table 2 directory stream structure
Figure FDA0002658645400000022
Figure FDA0002658645400000031
Wherein, the DID represents a directory entry Identifier directory _ entry Identifier;
s202: judging whether the byte content of the current sector simultaneously meets the following five conditions of the directory stream sector, if so, judging that the current sector is the directory stream sector, recording the offset address of the current sector in the Microsoft EXCEL file, and recording the directory stream structure of the current sector according to the directory stream sector structure of the table 2; otherwise, judging whether the byte content of the next sector simultaneously meets the following five conditions of the directory stream sector until the judgment of the byte content of each sector is completed;
the first condition is as follows: the directory structure comprises 4 directory stream structures shown in the table 2, and each directory stream structure has a fixed byte length of 0x80 bytes;
and a second condition: the size of the area storing the name does not exceed 0x40 bytes;
and (3) carrying out a third condition: entry type does not exceed 0x05 bytes;
and a fourth condition: DID values of a left node, a right node and a root node in the directory structure meet the condition that DID is greater than 0 or DID is equal to-1;
and a fifth condition: the entry SID value of the flow is not less than 0 or SID value is equal to-2;
s203: judging whether the byte content of the current sector simultaneously meets the following two conditions of the configuration table sector, if so, judging that the current sector is the configuration table sector, and continuously judging which SSAT/SAT/MSAT the current configuration table sector belongs to, otherwise, judging whether the byte content of the next sector simultaneously meets the following two conditions of the configuration table sector until the judgment of the byte content of each sector is completed;
the first condition is as follows: the 0x80 byte content of the current sector uses 4 bytes as a group, and the SID values of each group are judged one by one, and each SID value needs to satisfy:
when the SID is greater than 0, the SID is smaller than (the total size of the Microsoft EXCEL file is-0 x200)/0x 200;
when the SID is less than 0, the SID value can be only one of-1, -2, -3 or-4;
and a second condition: when the SID value is equal to-1, all the SID values thereafter are-1;
judging the sector of the current configuration table belongs to one of four categories of SSAT/SAT/MSAT:
the first condition is as follows: if the configuration table sector is positioned in front of the directory stream sector, the current configuration table sector is SAT;
case two: if the configuration table sector is positioned behind the directory stream sector where the Root Entry is positioned, and the SID of the first positive number in the configuration table sector is equal to the intra-sector offset address/4 of the current SID value, the current configuration table sector is SSAT;
case three: if only the positive value and-1 negative value exist in the sector of the configuration table, the current sector of the configuration table is MSAT;
case four: if SID values with negative values of-3 and-4 exist in the sector of the configuration table, the sector of the current configuration table is SAT;
if the Root Entry exists in the directory stream sector, judging the maximum SID value of the SSAT according to an Entry Sid value and an Entry byte number in a Root Entry directory stream structure;
s204: generating SAT-SID linked list by SID value of SAT configuration table sector according to the classified configuration table sector: the SAT-SID linked list is formed by SAT array indexes and SIDs in one-to-one correspondence, wherein the array indexes are natural numbers, and the content of the SIDs is the SAT array index corresponding to the next SID of the current SID;
s205: according to the classified sectors of the configuration table, forming SSAT-SID linked list by SID values of the sectors of the SSAT configuration table: the SSAT-SID linked list is formed by SSAT array indexes and SIDs in one-to-one correspondence, wherein the SSAT array indexes are natural numbers, and the SIDs are SSAT array indexes corresponding to the SIDs next to the current SID.
4. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 1, wherein the step S300 comprises the steps of:
judging whether the sector structure simultaneously meets the following four conditions, if so, the sector structure is a complete sector structure, and executing the step S400, otherwise, executing the step S600:
the first condition is as follows: detecting the SID values of all SAT configuration table sectors according to the SAT-SID linked list, wherein the linked list values corresponding to the SAT-SID linked list are all-3;
and a second condition: detecting that SID values of all directory stream sectors have a complete sub-chain table in the SAT-SID linked list according to the SAT-SID linked list, wherein the sub-chain table starts with the SID value of the directory stream sector where Root Entry is located, and the sector of the pointed next SID value is also the directory stream sector; and the child chain length of a directory stream sector is equal to the number of directory stream sectors and the last SID value of the child chain table of a directory stream sector is-2;
and (3) carrying out a third condition: when the entry byte number is more than or equal to 0x1000, detecting the entry SID value and the entry byte number of each directory stream structure in the directory stream sector according to the SAT-SID linked list to form a complete sub-link list in the SAT-SID linked list: starting with the entry SID value of the directory stream structure, the sub-chain length formed in the SAT-SID linked list is the larger of the entry byte number/0 x200-1 and the entry byte number/0 x 200;
and a fourth condition: when the entry byte number is less than 0x1000, detecting the entry SID value and entry byte number of each directory flow structure in the directory flow sector according to the SSAT-SID linked list to form a complete sub-chain list in the SSAT-SID linked list: starting with the entry SID value of the directory flow structure, the length of the child chain formed in the SSAT-SID linked list is equal to the larger of the entry byte number/0 x40-1 and the entry byte number/0 x 40.
5. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 1, wherein the step S400 comprises the steps of:
according to the sector structure of the first sector of the microsoft EXCEL file shown in table 1, a new first sector of the file is constructed:
the fixed value parameter is assigned according to the content of the table 1;
variable parameter part: and setting the variable parameters by combining the known SSAT configuration table sector information, SAT configuration table sector information and directory stream sector information to generate a complete file first sector.
6. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 5, wherein the step S500 comprises the steps of:
and recombining the complete file first sector generated in the step S400 and all the sector data except the complete file first sector in the file to be repaired into the sector data of the new EXCEL file.
7. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 1, wherein the step S600 comprises the following steps:
s601: acquiring data stream sectors in a file to be repaired, wherein the data stream sectors except a file head sector, a configuration table sector and a directory stream sector in the file to be repaired are data stream sectors;
s602: analyzing the sector content of the data stream, judging according to the identifiers of the data sub-streams pointed by different directory stream inlets, and acquiring the content of the first sector of the data sub-stream, wherein the data sub-stream is the data stream content with the following identifiers as the initial part:
Root Entry:0x01 00 00 02 08 00 00 00/0xFE FF 00 00/0x01 00 FE FF
\005DocumentInformation:0xFE FF 00 00
\005Summary DocumentInfomation:0xFE FF 00 00
workbook of 0x 090810000006
S603: after determining the first sector of the data substream of the workbook, traversing each data stream sector behind the first sector of the data substream of the workbook, and judging whether each data sector has one of a Root Entry identifier, a DocumentInformation identifier and a Summary DocumentInformation identifier one by one, if not, judging that the current sector is still the data of the substream sector of the workbook data, if so, judging that the data of the substream sector of the workbook data is finished, and neither the current sector nor the data stream sectors behind the current sector are the data of the substream sector of the workbook data;
s604: all sector data of the workbook data substream is obtained.
8. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 1, wherein the step S700 comprises the steps of:
s701: the storage structure of the analysis workbook data sub-stream is 2 byte record type +2 byte record length + record content;
s702: the parse workbook sub-stream data is of the following type:
0x 0005: representing Workbook Global workbooks globals
0x 0006: representing VB Module Visual Basic Module
0x 0010: representing a table or dialog box Sheet or dialog
0x 0020: representing a Chart Chart
0x 0040: representing Macro-watch Macro sheet
0x 0100: represent workplace
The workbook substream of the 0x 0005 record type comprises all character string data and relevant file attribute information in the microsoft EXCEL file and the number and name of worksheets contained in the microsoft EXCEL file;
the workbook sub-stream of the 0x 0010 record type comprises cell data of a worksheet of a Microsoft EXCEL file, and the storage structure of the workbook sub-stream is cell row index value, cell column index value and cell data content;
s703: analyzing the workflow data record of the workbook according to the workflow data type of the workbook in step S702: the content of the cell data in the current worksheet is recorded in the workbook of the 0x 0010 worksheet type, the worksheet cell data is recorded according to the cell row index value, the cell column index value and the cell data content format, and the cell data information is shown in table 3:
TABLE 3 cell data information
Figure FDA0002658645400000081
Acquiring cell data of the workbook sub-stream according to the record content described in the above table 3;
s704: and for the cell data in each worksheet substream data, sorting the acquired cell data according to the row and column formats of the tables by using the cell row index value and the cell column index value.
9. The method for data reorganization and restoration according to microsoft EXCEL file according to claim 8, wherein the step S704 includes the steps of:
s7041: presetting a two-dimensional array, wherein the size of the two-dimensional array is determined by the value of DIMENSION in a table 3 and is used for determining the range of rows and columns of the current working table; the two-dimensional array is used for storing cell data;
s7042: writing the records into the two-dimensional array according to respective cell row index values and cell column index values;
s7043: writing all the cell data acquired from the working table into the two-dimensional array according to the respective cell row index value and cell column index value to realize the reordering of the cell data of the working table, and executing step S900.
CN202010896794.XA 2020-08-31 2020-08-31 Data reorganization and repair method for Microsoft EXCEL file Active CN112069130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010896794.XA CN112069130B (en) 2020-08-31 2020-08-31 Data reorganization and repair method for Microsoft EXCEL file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010896794.XA CN112069130B (en) 2020-08-31 2020-08-31 Data reorganization and repair method for Microsoft EXCEL file

Publications (2)

Publication Number Publication Date
CN112069130A true CN112069130A (en) 2020-12-11
CN112069130B CN112069130B (en) 2023-05-02

Family

ID=73666131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010896794.XA Active CN112069130B (en) 2020-08-31 2020-08-31 Data reorganization and repair method for Microsoft EXCEL file

Country Status (1)

Country Link
CN (1) CN112069130B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050028031A1 (en) * 2003-07-28 2005-02-03 Matsushita Electric Industrial Co., Ltd. Data recovery method and data recording apparatus
CN101882132A (en) * 2009-04-27 2010-11-10 深圳市闪联信息技术有限公司 Creating and reading method and device for virtual FAT file system
CN106155845A (en) * 2016-08-02 2016-11-23 四川效率源信息安全技术股份有限公司 A kind of restructuring restoration methods based on XFS file system data
CN106445736A (en) * 2016-08-31 2017-02-22 四川效率源信息安全技术股份有限公司 Method for extracting and recombining MTK62 series word stock data of mobile phone
CN110083102A (en) * 2019-05-10 2019-08-02 河南省启源电力勘测设计有限公司 A kind of high-performance low-voltage power distribution cabinet is long-range and site monitoring system
WO2020103493A1 (en) * 2018-11-22 2020-05-28 厦门市美亚柏科信息股份有限公司 Method and system for recovering deleted file based on fat32 file system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050028031A1 (en) * 2003-07-28 2005-02-03 Matsushita Electric Industrial Co., Ltd. Data recovery method and data recording apparatus
CN101882132A (en) * 2009-04-27 2010-11-10 深圳市闪联信息技术有限公司 Creating and reading method and device for virtual FAT file system
CN106155845A (en) * 2016-08-02 2016-11-23 四川效率源信息安全技术股份有限公司 A kind of restructuring restoration methods based on XFS file system data
CN106445736A (en) * 2016-08-31 2017-02-22 四川效率源信息安全技术股份有限公司 Method for extracting and recombining MTK62 series word stock data of mobile phone
WO2020103493A1 (en) * 2018-11-22 2020-05-28 厦门市美亚柏科信息股份有限公司 Method and system for recovering deleted file based on fat32 file system
CN110083102A (en) * 2019-05-10 2019-08-02 河南省启源电力勘测设计有限公司 A kind of high-performance low-voltage power distribution cabinet is long-range and site monitoring system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHAO ZHENZHOU等: "The research of Excel file fragmentaton data recovery" *
王维雄: "基于介质存储结构的数据恢复技术研究" *

Also Published As

Publication number Publication date
CN112069130B (en) 2023-05-02

Similar Documents

Publication Publication Date Title
US6502101B1 (en) Converting a hierarchical data structure into a flat data structure
US6904430B1 (en) Method and system for efficiently identifying differences between large files
US6016492A (en) Forward extensible property modifiers for formatting information in a program module
US7689630B1 (en) Two-level bitmap structure for bit compression and data management
US20060239562A1 (en) System and method for binary persistence format for a recognition result lattice
US8458231B1 (en) Word processor data organization
CN1794224A (en) File formats, methods, and computer program products for representing workbooks
CN104715039A (en) Column-based storage and research method and equipment based on hard disk and internal storage
CN105144157A (en) System and method for compressing data in database
CN116580414A (en) Contract document difference detection method and device based on ICR character matrix
CN114021543B (en) Document comparison analysis method and system based on table structure analysis
CN107291832A (en) A kind of date storage method based on list storage structure
CN112069130B (en) Data reorganization and repair method for Microsoft EXCEL file
CN105938469A (en) Code storage method, data storage structure of texts and method for compressed storage of texts and statistics output
CN112434085A (en) Roaring Bitmap-based user data statistical method
CN101464875B (en) Method for representing electronic dictionary catalog data by XML
CN110111852A (en) A kind of magnanimity DNA sequencing data lossless Fast Compression platform
US6118391A (en) Compression into arbitrary character sets
CN112183029A (en) Digital conversion method for PDF drawing in sheet metal industry
CN112579546B (en) File compression method, system, storage medium and terminal
CN117217172B (en) Table information acquisition method, apparatus, computer device, and storage medium
US20240088913A1 (en) Graph data compression method and apparatus
CN112232032B (en) Automatic conversion method for content style of docx document
CN114218895B (en) Method for converting UOF document
CN113779218B (en) Question-answer pair construction method, question-answer pair construction device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant