CN112069130B - Data reorganization and repair method for Microsoft EXCEL file - Google Patents

Data reorganization and repair method for Microsoft EXCEL file Download PDF

Info

Publication number
CN112069130B
CN112069130B CN202010896794.XA CN202010896794A CN112069130B CN 112069130 B CN112069130 B CN 112069130B CN 202010896794 A CN202010896794 A CN 202010896794A CN 112069130 B CN112069130 B CN 112069130B
Authority
CN
China
Prior art keywords
sector
data
sid
stream
directory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010896794.XA
Other languages
Chinese (zh)
Other versions
CN112069130A (en
Inventor
梁效宁
许超明
何丽萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xly Salvationdata Technology Inc
Original Assignee
Xly Salvationdata Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xly Salvationdata Technology Inc filed Critical Xly Salvationdata Technology Inc
Priority to CN202010896794.XA priority Critical patent/CN112069130B/en
Publication of CN112069130A publication Critical patent/CN112069130A/en
Application granted granted Critical
Publication of CN112069130B publication Critical patent/CN112069130B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • G06F12/0238Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • G06F12/0246Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory in block erasable memory, e.g. flash memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data reorganization and repair method for Microsoft EXCEL files, which is characterized by comprising the following steps: s100: judging whether the byte content accords with the sector structure of the first sector of the Microsoft EXCEL file, if so, executing the step S300, otherwise, executing the step S200; s200: traversing all the sectors to obtain a catalog stream sector and a configuration table sector; s300: analyzing the configuration table sector and the directory stream sector, detecting whether the sector structure is complete, if yes, executing step S400, otherwise, executing step S600; s400: resolving, checking and constructing a sector structure of a file header of the Microsoft EXCEL file; s500: reorganizing all sector data of the Microsoft EXCEL file, and executing step S800; s600: analyzing data of a data stream sector to obtain data of a workbook data substream; s700: recombining the work table data in the Microsoft EXCEL file, and executing the step S900; s800: writing the recombined new EXCEL file sector data into a new EXCEL file, and ending the flow; s900: and writing TXT files with corresponding numbers according to the number of the recombined worksheet data.

Description

Data reorganization and repair method for Microsoft EXCEL file
Technical Field
The invention belongs to the field of electronic evidence obtaining and data recovery, and relates to a data reorganization and restoration method for Microsoft EXCEL files.
Background
Microsoft office, microsoft corporation, is the most commonly used office software, and EXCEL files are an essential tool for handling large amounts of form data, whose content typically includes text-tagged statistics and digital data, or graphics or pictures embedded in a spreadsheet with a file suffix name of. Xls. Mathematical equations for calculating numerical data or statistical data, and a partial function library is also provided as an integrated function of Microsoft EXCEL. Formatting properties and other layout elements can be implemented into the new EXCEL spreadsheet using XLS files (commonly referred to as EXCEL templates).
Microsoft Excel files use the xls proprietary format to store Microsoft Excel documents. This file format is referred to as binary exchange file format (BIFF). All data in Office document files is stored in a series of fixed-size streams. The location of each fixed-size stream is described by several sector allocation tables. These streams contain directory information about the documents, configuration table information, and all data information.
Microsoft Office is used as the most widely used Office software in daily work and life, wherein the EXCEL format is used as a table format file, and the application range is quite wide. Because of the convenience and extremely high frequency of use of EXCEL files for data processing, it also has vulnerable characteristics as one of electronic data. When some emergency conditions influence the normal use of Microsoft EXCEL, the risk of damaging or losing important files is increased, and the damaged or lost EXCEL files are extremely difficult to recover, so that data in the EXCEL files of users are lost, and meanwhile, the working efficiency of the users is reduced, and even the loss is caused. In the prior art, a data reorganization and repair method aiming at Microsoft EXCEL files does not exist.
Disclosure of Invention
For overcoming the defects of the prior art, we propose a method for restoring the EXCEL file data of Microsoft Office by judging whether the file format is in the type xls format (namely, the EXCEL format), traversing the file data content by sector, analyzing the data type of each sector, classifying the sector content, analyzing each sector data word by word in different modes, extracting the effective data of restoring the EXCEL file, restoring the data by combining the cell data content and the cell style information, thereby realizing restoring the EXCEL file, and finally, deriving a new type xls file or deriving the type txt file restored for each worksheet.
The application of the invention comprises the following steps:
s100: loading the Microsoft EXCEL file and reading byte content of a first sector, judging whether the byte content accords with the sector structure of the first sector of the Microsoft EXCEL file, if so, executing step S300, otherwise, executing step S200, wherein the Microsoft EXCEL file is a file to be repaired;
s200: traversing all sectors of the Microsoft EXCEL file to obtain a catalog stream sector and a configuration table sector;
s300: analyzing the configuration table sector and the directory stream sector, detecting whether the sector structure is complete, if yes, executing step S400, otherwise, executing step S600;
s400: analyzing, checking and constructing a sector structure of a file header of the Microsoft EXCEL file according to the complete configuration table sector information and the complete directory stream sector information;
s500: according to the constructed sector content of the header structure and the sector data of the file to be repaired, reorganizing all sector data of the Microsoft EXCEL file, and executing step S800;
s600: analyzing data of a data stream sector to obtain data of a workbook data substream;
s700: analyzing the workbook data sub-stream data, reorganizing the worktable data in the Microsoft EXCEL file, and executing step S900;
s800: writing the recombined new EXCEL file sector data into a new EXCEL file, and ending the flow;
s900: writing TXT files with corresponding numbers according to the number of the recombined worksheet data: and creating a TXT file, and writing the cell data analyzed from the workflow book sub-stream of the worksheet type into the TXT file according to the content of the array reordered by the two-dimensional array.
Preferably, the sector structure of the first sector of the microsoft EXCEL file is shown in table 1 below, wherein the first 8 bytes of content is 0xD0 CF 11E0 A1 B11AE1;
Figure GDA0004084626480000031
where SID represents sector identifier Sector Identifier, SSAT represents short sector configuration table short sector allocation table, SAT represents sector configuration table sector allocation table, and MSAT represents master sector configuration table main sector allocation table.
Preferably, the step S200 includes the steps of:
s201: the byte content of each sector is read one by one to analyze each sector and compare with the directory stream storage structure shown in table 2 to determine whether the current sector belongs to the directory stream sector or the configuration table sector,
table 2 directory stream structure
Figure GDA0004084626480000041
Wherein, DID represents directory entry Identifier directory_entry Identifier;
s202: judging whether byte contents of a current sector simultaneously meet the following five conditions of a directory stream sector, if so, judging the current sector as the directory stream sector, recording an offset address of the current sector in the Microsoft EXCEL file, and recording a directory stream structure of the current sector according to the directory stream sector structure of the table 2; otherwise, judging whether the byte content of the next sector meets the following five conditions of the directory stream sector at the same time until the judgment of the byte content of each sector is completed;
condition one: the table 2 shows 4 directory stream structures, and each directory stream structure has a fixed byte length of 0x80 bytes;
condition II: the area size storing the name is not more than 0x40 bytes;
and (3) a third condition: the entry type does not exceed 0x05 bytes;
condition four: DID values of a left node, a right node and a root node in the directory structure all meet the condition that the DID is larger than 0 or the DID is equal to-1;
condition five: the entry SID value of the stream is not less than 0 or the SID value is equal to-2;
s203: judging whether the byte content of the current sector meets the following two conditions of the configuration table sector at the same time, if so, judging that the current sector is the configuration table sector, continuously judging which of SSAT/SAT/MSAT the current configuration table sector belongs to, otherwise, judging whether the byte content of the next sector meets the following two conditions of the configuration table sector at the same time until the judgment of the byte content of each sector is completed;
condition one: the 0x80 byte content of the current sector takes 4 bytes as a group, SID values of each group are judged one by one, and each SID value needs to satisfy:
when the SID is greater than 0, the SID is less than (the Microsoft EXCEL file total size-0 x 200)/0 x200;
when SID is smaller than 0, SID value can be only one of-1, -2, -3 or-4;
condition II: when SID value is equal to-1, all SID values are-1 afterwards;
judging which of four classifications SSAT/SAT/MSAT the current configuration table sector belongs to:
case one: if the configuration table sector is located before the directory stream sector, the current configuration table sector is SAT;
and a second case: if the configuration table sector is positioned behind the directory stream sector where the Root Entry is positioned and the SID of the first positive number in the configuration table sector is equal to the intra-sector offset address/4 of the current SID value, the current configuration table sector is indicated to be SSAT;
and a third case: if the configuration table sector has only a positive value and a negative value of-1, the current configuration table sector is MSAT;
case four: if the SID values with minus 3 and minus 4 negative values exist in the configuration table sector, the current configuration table sector is SAT;
if the Root Entry exists in the directory stream sector, judging the maximum SID value of the SSAT according to the Entry Sid value and the Entry byte number in the Root Entry directory stream structure;
s204: generating a SAT-SID linked list from SID values of SAT configuration table sectors according to the classified configuration table sectors: the SAT-SID linked list is formed by one-to-one correspondence of SAT array indexes and SIDs, wherein the array indexes are natural numbers, and the content of the SIDs is the SAT array index corresponding to the next SID of the current SID;
s205: forming the SID value of the SSAT configuration table sector into an SSAT-SID linked list according to the classified configuration table sector: the SSAT-SID linked list is formed by one-to-one correspondence of SSAT array indexes and SIDs, wherein the SSAT array indexes are natural numbers, and the content of the SIDs is the SSAT array index corresponding to the next SID of the current SID.
Preferably, the step S300 includes the steps of:
judging whether the sector structure meets the following four conditions simultaneously, if yes, executing step S400 for the complete sector structure, otherwise, executing step S600:
condition one: detecting SID values of all SAT configuration table sectors according to the SAT-SID linked list, wherein the corresponding linked list values in the SAT-SID linked list are all-3;
condition II: according to the SAT-SID linked list, detecting that SID values of all directory stream sectors have complete sub-linked lists in the SAT-SID linked list, wherein the sub-linked list starts with SID values of directory stream sectors where Root Entry is located, and the sector with the next directed SID value is also the directory stream sector; and the sub chain length of the directory stream sector is equal to the number of the directory stream sectors and the last SID value of the sub linked list of the directory stream sector is-2;
and (3) a third condition: when the number of the entrance bytes is greater than or equal to 0x1000, detecting the entrance SID value and the number of the entrance bytes of each directory flow structure in the directory flow sector according to the SAT-SID linked list to form a complete sub-linked list in the SAT-SID linked list: starting with the entry SID value of the directory stream structure, the sub-link length formed in the SAT-SID linked list is the larger of the entry byte number/0 x200-1 and the entry byte number/0 x200;
condition four: when the number of the entrance bytes is smaller than 0x1000, detecting the entrance SID value and the number of the entrance bytes of each directory flow structure in the directory flow sector according to the SSAT-SID linked list to form a complete sub-linked list in the SSAT-SID linked list: starting with the entry SID value of the directory stream structure, the sub-chain length formed in the SSAT-SID linked list is equal to the greater of the entry byte number/0 x40-1 and the entry byte number/0 x 40.
Preferably, the step S400 includes the steps of:
constructing a new first sector of the file according to the sector structure of the first sector of the microsoft EXCEL file shown in table 1:
the fixed value parameter is assigned according to the content of the table 1;
variable parameter part: and setting variable parameters by combining known SSAT configuration table sector information, SAT configuration table sector information and directory stream sector information to generate a complete first sector of the file.
Preferably, the step S500 includes the steps of:
and (3) reconstructing the first sector of the complete file generated in the step (S400) and all sector data except the first sector of the complete file in the file to be repaired into the sector data of the new EXCEL file.
7. The method for restoring data to microsoft EXCEL files according to claim 1, wherein said step S600 comprises the steps of:
s601: acquiring a data stream sector in a file to be repaired, wherein the data stream sector is the file to be repaired except a first sector, a configuration table sector and a directory stream sector;
s602: analyzing the sector content of the data stream, judging according to identifiers of data substreams pointed by different directory inflow ports, and obtaining the content of the first sector of the data substream, wherein the data substream is the data stream content with the following identifiers as the starting point:
Root Entry:0x0100000208000000/0xFE FF 0000/0x0100 FE FF
\005DocumentInformation:0xFE FF 0000
\005Summary DocumentInfomation:0xFE FF 0000
workbook 0x090810000006
S603: after determining the first sector of the working book data substream, traversing each data stream sector after the first sector of the working book data substream, judging whether each data sector has one of a Root Entry identifier, a documetInformationidentifier and a Summary DocumentInfomation identifier one by one, if not, judging that the current sector is still the data of the data substream sector of the working book data, if so, judging that the data of the data substream sector of the working book data is finished, and judging that the current sector and the data stream sectors after the current sector are not the data of the data substream sector of the working book data;
s604: all sector data of the workbook data substream is acquired.
Preferably, the step S700 includes the steps of:
s701: the storage structure of the data substream of the analysis workbook is 2-byte record type + 2-byte record length + record content;
s702: the parsing workflow data is of the following type:
0x0005: representing Workbook global
0x0006: representing VB Module Visual Basic module
0x0010: a representation or dialog box Sheet or dialogue
0x0020: representation Chart Chart
0x0040: macro sheet representing Macro table
0x0100: representing a Workspace
The work book sub-stream of the 0x0005 record type comprises all character string data in the Microsoft EXCEL file, information of related file attributes and the number and names of worksheets contained in the Microsoft EXCEL file;
the workflow of the 0x0010 record type comprises cell data of a worksheet of the Microsoft EXCEL file, and the storage structure of the workflow is cell row index value, cell column index value and cell data content;
s703: according to the workflow sub-stream data type in step S702, the workflow sub-stream data record is parsed: the 0x0010 worksheet type workbook records the cell data content in the current worksheet, and records the worksheet cell data according to the cell row index value, the cell column index value and the cell data content format, wherein the cell data information is shown in table 3:
TABLE 3 cell data information
Figure GDA0004084626480000091
Obtaining cell data of a workflow according to the record content described in the table 3;
s704: and ordering the acquired cell data according to the row and column formats of the table by using the cell row index value and the cell column index value for the cell data in each worksheet sub-stream data.
Preferably, the step S704 includes the steps of:
s7041: presetting a two-dimensional array, wherein the size of the two-dimensional array is determined by the value of a DIMENSION in a table 3, and the two-dimensional array is used for determining the row and column range of the current worksheet; the two-dimensional array is used for storing cell data;
s7042: writing the records into the two-dimensional array according to the respective cell row index values and the cell column index values;
s7043: and writing all the cell data acquired in the worksheet into the two-dimensional array according to the respective cell row index value and the cell column index value to reorder the cell data of the worksheet, and executing step S900.
The beneficial effects of the invention are as follows:
1. the technical problem that a data reorganization and repair method aiming at Microsoft EXCEL files does not exist in the prior art is solved.
2. Not only can the EXCEL file data stream be effectively extracted, but also the data can be recovered by sequencing and reorganizing.
Drawings
FIG. 1 is a general flow chart of the method provided by the present invention;
FIG. 2 is a schematic diagram of generating a SAT-SID linked list in an embodiment of the present invention.
Detailed Description
Fig. 1 shows a general flow chart of the method provided by the invention. As shown in fig. 1, the method comprises the following steps:
s100: and loading the Microsoft EXCEL file and reading the byte content of the first sector, judging whether the byte content accords with the sector structure of the first sector of the Microsoft EXCEL file, if so, executing the step S300, otherwise, executing the step S200, wherein the Microsoft EXCEL file is the file to be repaired.
Specifically, first, the microsoft EXCEL file is loaded, the data of the first sector of the microsoft EXCEL file is read, and the read sector size defaults to 0×200 bytes.
Next, according to the sector structure of the first sector of the microsoft EXCEL file as shown in table 1, it is determined whether the first sector of the EXCEL file to be repaired satisfies the sector structure of the first sector of the microsoft EXCEL file, that is, whether the following conditions are satisfied:
if the first 8 bytes of the first sector is 0xD0 CF 11E0 A1 B11A E1, step S300 is executed, otherwise step S200 is executed.
Table 1: sector structure of first sector of Microsoft EXCEL file
Figure GDA0004084626480000101
Figure GDA0004084626480000111
Where SID represents sector identifier Sector Identifier, SSAT represents short sector configuration table short sector allocation table, SAT represents sector configuration table sector allocation table, and MSAT represents master sector configuration table main sector allocation table.
S200: traversing all sectors of Microsoft EXCEL file to obtain catalog stream sector and configuration table sector,
the method comprises the following steps:
s201: the byte content of each sector is read one by one to analyze each sector and compare with the directory stream storage structure shown in table 2 to determine whether the current sector belongs to the directory stream sector or the configuration table sector,
table 2 directory stream structure
Figure GDA0004084626480000112
Wherein, DID represents directory entry Identifier directory_entry Identifier;
s202: judging whether byte contents of the current sector simultaneously meet the following five conditions of the directory stream sector, if so, judging the current sector as the directory stream sector, recording an offset address of the current sector in a Microsoft EXCEL file, and recording a directory stream structure of the current sector according to the directory stream sector structure of the table 2; otherwise, judging whether the byte content of the next sector meets the following five conditions of the directory stream sector at the same time until the judgment of the byte content of each sector is completed;
condition one: comprises 4 directory stream structures shown in table 2, and each directory stream structure has a fixed byte length of 0x80 bytes;
condition II: the area size storing the name is not more than 0x40 bytes;
and (3) a third condition: the entry type does not exceed 0x05 bytes;
condition four: DID values of a left node, a right node and a root node in the directory structure all meet the condition that the DID is larger than 0 or the DID is equal to-1;
condition five: the entry SID value of the stream is not less than 0 or the SID value is equal to-2;
s203: judging whether the byte content of the current sector meets the following two conditions of the configuration table sector at the same time, if so, judging that the current sector is the configuration table sector, continuously judging which of SSAT/SAT/MSAT the current configuration table sector belongs to, otherwise, judging whether the byte content of the next sector meets the following two conditions of the configuration table sector at the same time until the judgment of the byte content of each sector is completed;
condition one: the 0x80 byte content of the current sector takes 4 bytes as a group, SID values of each group are judged one by one, and each SID value needs to satisfy:
when SID is greater than 0, SID is less than (Microsoft EXCEL file total size-0 x 200)/0 x200;
when SID is smaller than 0, SID value can be only one of-1, -2, -3 or-4;
condition II: when SID value is equal to-1, all SID values are-1 afterwards;
judging which of four classifications SSAT/SAT/MSAT the current configuration table sector belongs to:
case one: if the configuration table sector is located before the directory stream sector, the current configuration table sector is SAT;
and a second case: if the configuration table sector is positioned behind the directory stream sector where the Root Entry is positioned and the SID of the first positive number in the configuration table sector is equal to the intra-sector offset address/4 of the current SID value, the current configuration table sector is indicated to be SSAT;
and a third case: if the configuration table sector has only a positive value and a negative value of-1, the current configuration table sector is MSAT;
case four: if the SID values with minus 3 and minus 4 negative values exist in the configuration table sector, the current configuration table sector is SAT;
if the Root Entry exists in the directory stream sector, judging the maximum SID value of the SSAT according to the Entry Sid value and the Entry byte number in the Root Entry directory stream structure;
s204: FIG. 2 is a schematic diagram of generating a SAT-SID linked list in an embodiment of the present invention. As shown in fig. 2, from the classified configuration table sectors, the SID values of the SAT configuration table sectors are generated into a SAT-SID linked list:
the SAT-SID linked list is formed by one-to-one correspondence of SAT Array indexes (namely, array indexes) and SIDs (namely, SAT contents (secIDs)), wherein the Array indexes (Array indexes) are natural numbers, and the content of the SIDs is the SAT Array index (Array indexes) corresponding to the next SID of the current SID;
for example, as shown in fig. 2, the Array index (Array indexes) =0, and SID (SAT contents) =2, i.e., the SAT Array index (Array indexes) corresponding to the next SID
=2, its corresponding SID (i.e., SAT contents (SecIDs))=3, at this time, 3 is equal to the SAT Array index (Array index) corresponding to the next SID, its corresponding SID (i.e., SAT contents (SecIDs))= -2, and so on.
S205: similarly, according to the classified configuration table sectors, the SID values of the SSAT configuration table sectors form an SSAT-SID linked list: the SSAT-SID linked list is formed by one-to-one correspondence of SSAT array indexes and SIDs, wherein the SSAT array indexes are natural numbers, and the SIDs are SSAT array indexes corresponding to the next SID of the current SID.
S300: analyzing the configuration table sector and the directory stream sector, detecting whether the sector structure is complete, if yes, executing step S400, otherwise, executing step S600, including the following steps:
judging whether the sector structure meets the following four conditions simultaneously, if yes, executing step S400 for the complete sector structure, otherwise, executing step S600:
condition one: detecting SID values of all SAT configuration table sectors according to the SAT-SID linked list, wherein the corresponding linked list values in the SAT-SID linked list are all-3;
condition II: according to the SAT-SID linked list, detecting that SID values of all directory stream sectors have complete sub-linked lists in the SAT-SID linked list, wherein the sub-linked list starts with the SID value of the directory stream sector where the Root Entry is located, and the sector with the next directed SID value is also the directory stream sector; and the sub chain length of the directory stream sector is equal to the number of the directory stream sectors and the last SID value of the sub linked list of the directory stream sector is-2;
and (3) a third condition: when the number of the entrance bytes is greater than or equal to 0x1000, detecting the entrance SID value and the number of the entrance bytes of each directory flow structure in the directory flow sector according to the SAT-SID linked list to form a complete sub-linked list in the SAT-SID linked list: starting with the entry SID value of the directory stream structure, the sub-link length formed in the SAT-SID linked list is the larger of the entry byte number/0 x200-1 and the entry byte number/0 x200;
condition four: when the number of the entrance bytes is smaller than 0x1000, detecting the entrance SID value and the number of the entrance bytes of each directory flow structure in the directory flow sector according to the SSAT-SID linked list to form a complete sub-linked list in the SSAT-SID linked list: starting with the entry SID value of the directory stream structure, the sub-chain length formed in the SSAT-SID linked list is equal to the greater of the entry byte number/0 x40-1 and the entry byte number/0 x 40.
S400: according to the complete configuration table sector information and the complete directory stream sector information, resolving, checking and constructing the sector structure of the file header of the Microsoft EXCEL file, comprising the following steps:
constructing a new first sector of the file according to the sector structure of the first sector of the Microsoft EXCEL file shown in the table 1, wherein fixed value parameters are assigned according to the content of the table 1;
variable parameter part: and setting variable parameters by combining known SSAT configuration table sector information, SAT configuration table sector information and directory stream sector information to generate a complete first sector of the file.
S500: and (2) reorganizing all sector data of the Microsoft EXCEL file according to the constructed head structure sector content and the sector data of the file to be repaired, and executing a step S800, wherein the method comprises the following steps:
and (3) reconstructing the first sector of the complete file generated in the step (S400) and all sector data except the first sector of the complete file in the file to be repaired into the sector data of the new EXCEL file.
S600: analyzing the data stream sector data to obtain the data sub-stream data of the workbook data, comprising the following steps:
s601: acquiring a data stream sector in a file to be repaired, wherein the data stream sector is the file to be repaired except a first sector, a configuration table sector and a directory stream sector;
s602: analyzing the sector content of the data stream, judging according to identifiers of data substreams pointed by different directory inflow ports, and obtaining the content of the first sector of the data substream, wherein the data substream is the data stream content with the following identifiers as the starting point:
Root Entry:0x0100000208000000/0xFE FF 0000/0x0100 FE FF
\005DocumentInformation:0xFE FF 0000
\005Summary DocumentInfomation:0xFE FF 0000
workbook 0x090810000006
S603: after determining the first sector of the working book data substream, traversing each data stream sector after the first sector of the working book data substream, judging whether each data sector has one of a Root Entry identifier, a document information identifier and a Summary DocumentInfomation identifier one by one, if not, judging that the current sector is still the data of the data substream sector of the working book data, if so, judging that the data of the data substream sector of the working book data is finished, and judging that the current sector and the data stream sectors after the current sector are not the data of the data substream sector of the working book data;
s604: all sector data of the workbook data substream is acquired.
S700: analyzing the workbook data sub-stream data, reorganizing the worktable data in the Microsoft EXCEL file, and executing step S900;
the method comprises the following steps:
s701: the storage structure of the data substream of the analysis workbook is 2-byte record type + 2-byte record length + record content;
s702: the parsing workflow data is of the following type:
0x0005: representing Workbook global
0x0006: representing VB Module Visual Basic module
0x0010: a representation or dialog box Sheet or dialogue
0x0020: representation Chart Chart
0x0040: macro sheet representing Macro table
0x0100: representing a Workspace
The work book sub-stream of the 0x0005 record type comprises all character string data in the Microsoft EXCEL file, information of related file attributes and the number and names of worksheets contained in the Microsoft EXCEL file;
the workflow of the 0x0010 record type comprises cell data of a worksheet of the Microsoft EXCEL file, and the storage structure of the workflow is cell row index value, cell column index value and cell data content;
s703: according to the workbook sub-stream data type of step S702, the workbook sub-stream data record is parsed: the 0x0010 worksheet type workbook records the cell data content in the current worksheet, and records the worksheet cell data according to the cell row index value, the cell column index value and the cell data content format, wherein the cell data information is shown in table 3:
TABLE 3 cell data information
Figure GDA0004084626480000171
Obtaining cell data of a workflow according to the record content described in the table 3;
s704: the method comprises the steps of ordering the acquired cell data according to the row and column formats of the table by using the cell row index value and the cell column index value for the cell data in each worksheet sub-stream data, and comprises the following steps:
s7041: presetting a two-dimensional array, wherein the size of the two-dimensional array is determined by the value of a DIMENSION in a table 3, and the two-dimensional array is used for determining the row and column range of the current worksheet; the two-dimensional array is used for storing cell data;
s7042: writing the records into a two-dimensional array according to the respective cell row index values and the cell column index values;
s7043: and writing all the cell data acquired in the worksheet into a two-dimensional array according to the respective cell row index value and the cell column index value to reorder the cell data of the worksheet, and executing step S900.
S800: writing the recombined new EXCEL file sector data into a new EXCEL file, and ending the flow;
s900: writing TXT files with corresponding numbers according to the number of the recombined worksheet data: and creating a TXT file, and writing the cell data analyzed from the workflow book sub-stream of the worksheet type into the TXT file according to the content of the array reordered by the two-dimensional array.
It is to be understood that the invention is not limited to the examples described above, and that modifications and variations may be effected in light of the above teachings by those skilled in the art, all of which are intended to be within the scope of the invention as defined in the appended claims.

Claims (9)

1. The data reorganization and repair method for the Microsoft EXCEL file is characterized by comprising the following steps of:
s100: loading the Microsoft EXCEL file and reading byte content of a first sector, judging whether the byte content accords with the sector structure of the first sector of the Microsoft EXCEL file, if so, executing step S300, otherwise, executing step S200, wherein the Microsoft EXCEL file is a file to be repaired;
s200: traversing all sectors of the Microsoft EXCEL file to obtain a catalog stream sector and a configuration table sector;
s300: analyzing the configuration table sector and the directory stream sector, detecting whether the sector structure is complete, if yes, executing step S400, otherwise, executing step S600;
s400: analyzing, checking and constructing a sector structure of a file header of the Microsoft EXCEL file according to the complete configuration table sector information and the complete directory stream sector information;
s500: according to the constructed sector content of the header structure and the sector data of the file to be repaired, reorganizing all sector data of the Microsoft EXCEL file, and executing step S800;
s600: analyzing data of a data stream sector to obtain data of a workbook data substream;
s700: analyzing the workbook data sub-stream data, reorganizing the worktable data in the Microsoft EXCEL file, and executing step S900;
s800: writing the recombined new EXCEL file sector data into a new EXCEL file, and ending the flow;
s900: writing TXT files with corresponding numbers according to the number of the recombined worksheet data: and creating a TXT file, and writing the cell data analyzed from the workflow book sub-stream of the worksheet type into the TXT file according to the content of the array reordered by the two-dimensional array.
2. The method for reorganizing and repairing data for microsoft EXCEL files according to claim 1, wherein the sector structure of the first sector of the microsoft EXCEL file is shown in table 1, wherein the first 8 bytes are 0xD0 CF 11E0 A1 B11A E1;
table 1: sector structure of first sector of Microsoft EXCEL file
Figure FDA0004084626470000021
Where SID represents sector identifier Sector Identifier, SSAT represents short sector configuration table short sector allocation table, SAT represents sector configuration table sector allocation table, and MSAT represents master sector configuration table main sector allocation table.
3. The method for restoring data to microsoft EXCEL files according to claim 1, wherein said step S200 comprises the steps of:
s201: the byte content of each sector is read one by one to analyze each sector and compare with the directory stream storage structure shown in table 2 to determine whether the current sector belongs to the directory stream sector or the configuration table sector,
table 2 directory stream structure
Figure FDA0004084626470000022
Figure FDA0004084626470000031
Wherein, DID represents directory entry Identifier directory_entry Identifier;
s202: judging whether byte contents of a current sector simultaneously meet the following five conditions of a directory stream sector, if so, judging the current sector as the directory stream sector, recording an offset address of the current sector in the Microsoft EXCEL file, and recording a directory stream structure of the current sector according to the directory stream sector structure of the table 2; otherwise, judging whether the byte content of the next sector meets the following five conditions of the directory stream sector at the same time until the judgment of the byte content of each sector is completed;
condition one: the table 2 shows 4 directory stream structures, and each directory stream structure has a fixed byte length of 0x80 bytes;
condition II: the area size storing the name is not more than 0x40 bytes;
and (3) a third condition: the entry type does not exceed 0x05 bytes;
condition four: DID values of a left node, a right node and a root node in the directory structure all meet the condition that the DID is larger than 0 or the DID is equal to-1;
condition five: the entry SID value of the stream is not less than 0 or the SID value is equal to-2;
s203: judging whether the byte content of the current sector meets the following two conditions of the configuration table sector at the same time, if so, judging that the current sector is the configuration table sector, continuously judging which of SSAT/SAT/MSAT the current configuration table sector belongs to, otherwise, judging whether the byte content of the next sector meets the following two conditions of the configuration table sector at the same time until the judgment of the byte content of each sector is completed;
condition one: the 0x80 byte content of the current sector takes 4 bytes as a group, SID values of each group are judged one by one, and each SID value needs to satisfy:
when the SID is greater than 0, the SID is less than (the Microsoft EXCEL file total size-0 x 200)/0 x200;
when SID is smaller than 0, SID value can be only one of-1, -2, -3 or-4;
condition II: when SID value is equal to-1, all SID values are-1 afterwards;
judging which of four classifications SSAT/SAT/MSAT the current configuration table sector belongs to:
case one: if the configuration table sector is located before the directory stream sector, the current configuration table sector is SAT;
and a second case: if the configuration table sector is positioned behind the directory stream sector where the Root Entry is positioned and the SID of the first positive number in the configuration table sector is equal to the intra-sector offset address/4 of the current SID value, the current configuration table sector is indicated to be SSAT;
and a third case: if the configuration table sector has only a positive value and a negative value of-1, the current configuration table sector is MSAT;
case four: if the SID values with minus 3 and minus 4 negative values exist in the configuration table sector, the current configuration table sector is SAT;
if the Root Entry exists in the directory stream sector, judging the maximum SID value of the SSAT according to the Entry Sid value and the Entry byte number in the Root Entry directory stream structure;
s204: generating a SAT-SID linked list from SID values of SAT configuration table sectors according to the classified configuration table sectors: the SAT-SID linked list is formed by one-to-one correspondence of SAT array indexes and SIDs, wherein the array indexes are natural numbers, and the content of the SIDs is the SAT array index corresponding to the next SID of the current SID;
s205: forming the SID value of the SSAT configuration table sector into an SSAT-SID linked list according to the classified configuration table sector: the SSAT-SID linked list is formed by one-to-one correspondence of SSAT array indexes and SIDs, wherein the SSAT array indexes are natural numbers, and the content of the SIDs is the SSAT array index corresponding to the next SID of the current SID.
4. The method for restoring data to microsoft EXCEL files according to claim 1, wherein said step S300 comprises the steps of:
judging whether the sector structure meets the following four conditions simultaneously, if yes, executing step S400 for the complete sector structure, otherwise, executing step S600:
condition one: detecting SID values of all SAT configuration table sectors according to the SAT-SID linked list, wherein the corresponding linked list values in the SAT-SID linked list are all-3;
condition II: according to the SAT-SID linked list, detecting that SID values of all directory stream sectors have complete sub-linked lists in the SAT-SID linked list, wherein the sub-linked list starts with SID values of directory stream sectors where Root Entry is located, and the sector with the next directed SID value is also the directory stream sector; and the sub chain length of the directory stream sector is equal to the number of the directory stream sectors and the last SID value of the sub linked list of the directory stream sector is-2;
and (3) a third condition: when the number of the entrance bytes is greater than or equal to 0x1000, detecting the entrance SID value and the number of the entrance bytes of each directory flow structure in the directory flow sector according to the SAT-SID linked list to form a complete sub-linked list in the SAT-SID linked list: starting with the entry SID value of the directory stream structure, the sub-link length formed in the SAT-SID linked list is the larger of the entry byte number/0 x200-1 and the entry byte number/0 x200;
condition four: when the number of the entrance bytes is smaller than 0x1000, detecting the entrance SID value and the number of the entrance bytes of each directory flow structure in the directory flow sector according to the SSAT-SID linked list to form a complete sub-linked list in the SSAT-SID linked list: starting with the entry SID value of the directory stream structure, the sub-chain length formed in the SSAT-SID linked list is equal to the greater of the entry byte number/0 x40-1 and the entry byte number/0 x 40.
5. The method for restoring data to microsoft EXCEL files according to claim 1, wherein said step S400 comprises the steps of:
constructing a new first sector of the file according to the sector structure of the first sector of the microsoft EXCEL file shown in table 1:
the fixed value parameter is assigned according to the content of the table 1;
variable parameter part: and setting variable parameters by combining known SSAT configuration table sector information, SAT configuration table sector information and directory stream sector information to generate a complete first sector of the file.
6. The method for restoring data to microsoft EXCEL files according to claim 5, wherein said step S500 comprises the steps of:
and (3) reconstructing the first sector of the complete file generated in the step (S400) and all sector data except the first sector of the complete file in the file to be repaired into the sector data of the new EXCEL file.
7. The method for restoring data to microsoft EXCEL files according to claim 1, wherein said step S600 comprises the steps of:
s601: acquiring a data stream sector in a file to be repaired, wherein the data stream sector is the file to be repaired except a first sector, a configuration table sector and a directory stream sector;
s602: analyzing the sector content of the data stream, judging according to identifiers of data substreams pointed by different directory inflow ports, and obtaining the content of the first sector of the data substream, wherein the data substream is the data stream content with the following identifiers as the starting point:
Root Entry:0x0100000208000000/0xFE FF 0000/0x0100FE FF
\005DocumentInformation:0xFE FF 0000
\005Summary DocumentInfomation:0xFE FF 0000
workbook 0x090810000006
S603: after determining the first sector of the working book data substream, traversing each data stream sector after the first sector of the working book data substream, judging whether each data sector has one of a Root Entry identifier, a documetInformationidentifier and a Summary DocumentInfomation identifier one by one, if not, judging that the current sector is still the data of the data substream sector of the working book data, if so, judging that the data of the data substream sector of the working book data is finished, and judging that the current sector and the data stream sectors after the current sector are not the data of the data substream sector of the working book data;
s604: all sector data of the workbook data substream is acquired.
8. The method for restoring data to microsoft EXCEL files according to claim 1, wherein said step S700 comprises the steps of:
s701: the storage structure of the data substream of the analysis workbook is 2-byte record type + 2-byte record length + record content;
s702: the parsing workflow data is of the following type:
0x0005: representing Workbook global
0x0006: representing VB Module Visual Basic module
0x0010: a representation or dialog box Sheet or dialogue
0x0020: representation Chart Chart
0x0040: macro sheet representing Macro table
0x0100: representing a Workspace
The work book sub-stream of the 0x0005 record type comprises all character string data in the Microsoft EXCEL file, information of related file attributes and the number and names of worksheets contained in the Microsoft EXCEL file;
the workflow of the 0x0010 record type comprises cell data of a worksheet of the Microsoft EXCEL file, and the storage structure of the workflow is cell row index value, cell column index value and cell data content;
s703: according to the workflow sub-stream data type in step S702, the workflow sub-stream data record is parsed: the 0x0010 worksheet type workbook records the cell data content in the current worksheet, and records the worksheet cell data according to the cell row index value, the cell column index value and the cell data content format, wherein the cell data information is shown in table 3:
TABLE 3 cell data information
Figure FDA0004084626470000081
Obtaining cell data of a workflow according to the record content described in the table 3;
s704: and ordering the acquired cell data according to the row and column formats of the table by using the cell row index value and the cell column index value for the cell data in each worksheet sub-stream data.
9. The method for restoring data to microsoft EXCEL files according to claim 8, wherein said step S704 comprises the steps of:
s7041: presetting a two-dimensional array, wherein the size of the two-dimensional array is determined by the value of a DIMENSION in a table 3, and the two-dimensional array is used for determining the row and column range of the current worksheet; the two-dimensional array is used for storing cell data;
s7042: writing the records into the two-dimensional array according to the respective cell row index values and the cell column index values;
s7043: and writing all the cell data acquired in the worksheet into the two-dimensional array according to the respective cell row index value and the cell column index value to reorder the cell data of the worksheet, and executing step S900.
CN202010896794.XA 2020-08-31 2020-08-31 Data reorganization and repair method for Microsoft EXCEL file Active CN112069130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010896794.XA CN112069130B (en) 2020-08-31 2020-08-31 Data reorganization and repair method for Microsoft EXCEL file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010896794.XA CN112069130B (en) 2020-08-31 2020-08-31 Data reorganization and repair method for Microsoft EXCEL file

Publications (2)

Publication Number Publication Date
CN112069130A CN112069130A (en) 2020-12-11
CN112069130B true CN112069130B (en) 2023-05-02

Family

ID=73666131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010896794.XA Active CN112069130B (en) 2020-08-31 2020-08-31 Data reorganization and repair method for Microsoft EXCEL file

Country Status (1)

Country Link
CN (1) CN112069130B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101882132A (en) * 2009-04-27 2010-11-10 深圳市闪联信息技术有限公司 Creating and reading method and device for virtual FAT file system
CN106155845A (en) * 2016-08-02 2016-11-23 四川效率源信息安全技术股份有限公司 A kind of restructuring restoration methods based on XFS file system data
CN106445736A (en) * 2016-08-31 2017-02-22 四川效率源信息安全技术股份有限公司 Method for extracting and recombining MTK62 series word stock data of mobile phone
CN110083102A (en) * 2019-05-10 2019-08-02 河南省启源电力勘测设计有限公司 A kind of high-performance low-voltage power distribution cabinet is long-range and site monitoring system
WO2020103493A1 (en) * 2018-11-22 2020-05-28 厦门市美亚柏科信息股份有限公司 Method and system for recovering deleted file based on fat32 file system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005050073A (en) * 2003-07-28 2005-02-24 Matsushita Electric Ind Co Ltd Data restoration method, and data recorder

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101882132A (en) * 2009-04-27 2010-11-10 深圳市闪联信息技术有限公司 Creating and reading method and device for virtual FAT file system
CN106155845A (en) * 2016-08-02 2016-11-23 四川效率源信息安全技术股份有限公司 A kind of restructuring restoration methods based on XFS file system data
CN106445736A (en) * 2016-08-31 2017-02-22 四川效率源信息安全技术股份有限公司 Method for extracting and recombining MTK62 series word stock data of mobile phone
WO2020103493A1 (en) * 2018-11-22 2020-05-28 厦门市美亚柏科信息股份有限公司 Method and system for recovering deleted file based on fat32 file system
CN110083102A (en) * 2019-05-10 2019-08-02 河南省启源电力勘测设计有限公司 A kind of high-performance low-voltage power distribution cabinet is long-range and site monitoring system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhao zhenzhou等.The research of Excel file fragmentaton data recovery.《Mechanics and Materials》.2014,2274-2278页. *
王维雄.基于介质存储结构的数据恢复技术研究.《中国优秀硕士学位论文全文数据库 信息科技辑》.2012,I138-112. *

Also Published As

Publication number Publication date
CN112069130A (en) 2020-12-11

Similar Documents

Publication Publication Date Title
US6502101B1 (en) Converting a hierarchical data structure into a flat data structure
US6457013B1 (en) Data formating property modifiers
KR101169089B1 (en) File formats, methods, and computer program products for representing presentations
US5355472A (en) System for substituting tags for non-editable data sets in hypertext documents and updating web files containing links between data sets corresponding to changes made to the tags
US7617444B2 (en) File formats, methods, and computer program products for representing workbooks
US8458231B1 (en) Word processor data organization
CN112036144B (en) Data analysis method, device, computer equipment and readable storage medium
CN112926299B (en) Text comparison method, contract review method and auditing system
CN100338605C (en) Recording method for extendable mark language file repairing trace
CN112069130B (en) Data reorganization and repair method for Microsoft EXCEL file
CN110297781B (en) Method for recovering deleted data in APFS (advanced File System) based on copy-on-write
KR100697359B1 (en) Automatic recording method of description details and file names for title block of AUTO CAD
CN116580414A (en) Contract document difference detection method and device based on ICR character matrix
CN107291832A (en) A kind of date storage method based on list storage structure
US11816419B2 (en) Method for saving documents in blocks
CN101464875B (en) Method for representing electronic dictionary catalog data by XML
CN105938469A (en) Code storage method, data storage structure of texts and method for compressed storage of texts and statistics output
CN113326063B (en) Data processing method, electronic device and storage medium
US20100042640A1 (en) Migration Apparatus Which Convert SAM/VSAM Files of Mainframe System into SAM/VSAM Files of Open System and Method for Thereof
JP3036505B2 (en) CAD data management device
KR102253751B1 (en) A System for Generating Documents by Integrating Word-Processor and Database and a Method Therefore
CN118012656A (en) Damaged PDF document repairing method, device, equipment and storage medium
KR100216068B1 (en) A method of converting dxf data format to informap database
TW527543B (en) Method for directly converting drawing file and integrating into database and recovering the drawing file
CN115858480A (en) Sharing platform and sharing method thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant