CN111241096A

CN111241096A - Text extraction method, system, terminal and storage medium for EXCEL document

Info

Publication number: CN111241096A
Application number: CN202010013792.1A
Authority: CN
Inventors: 苗功勋; 董盼山; 崔新安; 王金国
Original assignee: BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD; Nanjing Zhongfu Information Technology Co Ltd; Zhongfu Information Co Ltd; Zhongfu Safety Technology Co Ltd
Current assignee: BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD; Nanjing Zhongfu Information Technology Co Ltd; Zhongfu Information Co Ltd; Zhongfu Safety Technology Co Ltd
Priority date: 2020-01-07
Filing date: 2020-01-07
Publication date: 2020-06-05

Abstract

The invention provides a text extraction method, system, terminal and storage medium for an EXCEL document. The method comprises the following steps: screening the SST tag from the WorkBook data stream of an Excel document, and extracting text index library information from the SST tag; screening out all Sheet tags in the WorkBook data stream by tag Type value, and reading the Sheet tags to generate a sheet index library; extracting the offset position of each sheet page from the sheet index library, extracting the LabelSst tags from all tags of the sheet page according to the offset position, and reading the index information of the LabelSst tags; and looking up the corresponding text in the text index library according to the index information, and splicing the corresponding texts according to the position, in the sheet index library, of the LabelSst tag to which the index information belongs. The invention has good compatibility; it reads the file in binary form and processes it with precise positioning, which significantly improves execution efficiency; the whole implementation is hand-coded and calls only system API functions, does not depend on any third-party program files, and yields a small program package.

Description

Text extraction method, system, terminal and storage medium for EXCEL document

Technical Field

The invention relates to the technical field of character extraction, in particular to a text extraction method, a text extraction system, a text extraction terminal and a text extraction storage medium for an EXCEL document.

Background

Microsoft Excel, as spreadsheet software, holds a dominant market share in today's office environment, and the Excel file format has become a de facto spreadsheet standard that other products in the industry, such as Kingsoft (Jinshan) Office ET, also adopt.

In many application scenarios, characters in Excel need to be extracted for secondary processing, such as a character inspection tool, a text comparison tool, a file format conversion tool, and the like, and how to extract the characters in Excel completely and efficiently becomes a problem faced at present.

At present, there are two common methods for extracting text from Excel. One extracts text through the secondary development interface provided by Office Excel or Kingsoft (Jinshan) ET (hereinafter referred to as secondary interface development); the other extracts text using the OPI technology provided by JAVA. However, both techniques have the following disadvantages:

Disadvantages of secondary interface development:

Poor compatibility. The method depends entirely on Office Excel or Kingsoft (Jinshan) ET components: either the Office Excel program must be installed in advance on the computer, or the Office Excel component programs must be manually extracted and integrated into the extraction tool. Because the method depends on Office Excel, text extraction fails if Excel is not completely installed or is configured incorrectly.

Low efficiency. Secondary interface development uses COM technology, and the data passes through multiple layers of conversion, so efficiency is low.

Disadvantages of the JAVA OPI technology:

Large program package. Extracting the text of a document with the JAVA OPI technology requires a JAVA virtual machine environment to be shipped with the runtime environment, which makes the program installation package too large.

Low efficiency. JAVA runs slowly, which makes text extraction inefficient.

Disclosure of Invention

In view of the above-mentioned deficiencies of the prior art, the present invention provides a method, a system, a terminal and a storage medium for extracting a text of an EXCEL document, so as to solve the above-mentioned technical problems.

In a first aspect, the present invention provides a text extraction method for an EXCEL document, including:

screening SST labels from a WorkBook data stream of an Excel document, and extracting text index library information from the SST labels;

screening all Sheet tags in a WorkBook data stream through screening tag Type values, and reading the Sheet tags to generate a Sheet index library;

extracting offset positions of the sheet pages in the sheet index library, extracting LabelSst labels from all labels of the sheet pages according to the offset positions, and reading index information of the LabelSst labels;

and searching for the corresponding text in the text index library according to the index information, and splicing the corresponding texts according to the position, in the sheet index library, of the LabelSst label to which the index information belongs.

Further, the extracting of the text index library information from the SST tag includes:

entering the SST label to read the size of the text index library;

reading the number of characters of a text and the encoding format of the text from the XLUnicodeRichExtendedString tag numbered 0;

and extracting the corresponding text according to the number of characters and the encoding format, and calculating the length of the text.

Further, screening out all Sheet tags in the WorkBook data stream by tag Type value and reading the Sheet tags to generate a sheet index library includes:

screening a Sheet tag with the Type value of 133 from the WorkBook data stream;

and reading the offset position of the current Sheet page in the Excel file and the name of the current Sheet page from the Sheet tag.

Further, splicing the corresponding texts according to the position, in the sheet index library, of the LabelSst label to which the index information belongs includes:

sorting the LabelSst labels under the same sheet page;

splicing the corresponding texts according to the order of the LabelSst labels;

and marking the name of the sheet page to which the spliced text belongs.

In a second aspect, the present invention provides a text extraction system for EXCEL documents, comprising:

a first creating unit, configured to screen SST tags from a WorkBook data stream of an Excel document and extract text index library information from the SST tags;

the second creating unit is configured to screen out all Sheet tags in the WorkBook data stream through the screening tag Type value, and read the Sheet tags to generate a Sheet index library;

the index extraction unit is configured to extract offset positions of the sheet pages in the sheet index library, extract LabelSst labels from all labels of the sheet pages according to the offset positions, and read index information of the LabelSst labels;

and the text splicing unit is configured to search for the corresponding text in the text index library according to the index information, and to splice the corresponding texts according to the position, in the sheet index library, of the LabelSst label to which the index information belongs.

Further, the first creating unit includes:

the size reading module is configured for entering SST tags to read the size of the text index library;

the tag reading module is configured for reading the number of characters of a text and the encoding format of the text from the XLUnicodeRichExtendedString tag numbered 0;

and the text extraction module is configured for extracting corresponding texts through the number of characters and the encoding format of the texts and calculating the length of the texts.

Further, the second creating unit includes:

the tag screening module is configured to screen a Sheet tag with a Type value of 133 from the WorkBook data stream;

and the form reading module is configured for reading the offset position of the current Sheet page in the Excel file and the name of the current Sheet page from the Sheet tag.

Further, the text splicing unit includes:

the label sequencing module is configured and used for sequencing LabelSst labels under the same sheet page;

the text splicing module is configured to splice corresponding texts according to the ordering of the LabelSst labels;

and the name marking module is configured to mark the spliced text with the name of the sheet page to which the text belongs.

In a third aspect, a terminal is provided, including:

a processor, a memory, wherein,

the memory is used for storing a computer program which,

the processor is used for calling and running the computer program from the memory so as to make the terminal execute the method described above.

In a fourth aspect, a computer storage medium is provided having stored therein instructions that, when executed on a computer, cause the computer to perform the method of the above aspects.

The beneficial effect of the invention is that,

according to the text extraction method, system, terminal and storage medium for an EXCEL document provided by the invention, the binary data of the Excel document is parsed and, following the principle by which text is stored in the document, the text in the Excel document is extracted in its original order. The method is not limited to Office Excel files; it can also extract text from files that adopt the Office Excel text storage principle, such as Kingsoft (Jinshan) Office ET files. The invention does not depend on Office Excel or Kingsoft (Jinshan) ET components and therefore has good compatibility; it reads the file in binary form and processes it with precise positioning, which significantly improves execution efficiency; the whole implementation is hand-coded and calls only system API functions, does not depend on any third-party program files, and yields a small program package.

In addition, the invention has reliable design principle, simple structure and very wide application prospect.

Drawings

In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.

FIG. 1 is a schematic diagram of Excel document storage according to the method of one embodiment of the present invention.

FIG. 2 is a schematic flow diagram of a method of one embodiment of the invention.

FIG. 3 is a schematic flow chart diagram of a method of one embodiment of the present invention.

FIG. 4 is a schematic block diagram of a system of one embodiment of the present invention.

Fig. 5 is a schematic structural diagram of a terminal according to an embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The following explains key terms appearing in the present invention.

Excel document storage principle: as shown in fig. 1, analysis of Excel files shows that Excel uses "compressed storage" for text. This is not compression in the sense of ordinary compression software. Instead, taking the cell as the unit, when several cells contain the same text (regardless of formatting), only one copy of that text is stored in a table. Each stored text has a corresponding index number, so a cell only needs to record the index of its text, and the text can then be found by looking up that index.

For example, suppose the text of cell A2 (row 2, column 1) in sheet1 of an Excel file is "123", and the text of cell B1 (row 1, column 2) in sheet2 is also "123". The Excel file then stores only one copy of "123"; if the index value of "123" is 1, the index value recorded by cell A2 of sheet1 is 1, and the index value recorded by cell B1 of sheet2 is also 1.
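The storage principle can be illustrated with a small in-memory model. This is only a sketch of the idea, not the Excel file layout; the names `shared_strings`, `index_of`, `cells`, `set_cell` and `get_cell` are hypothetical and introduced here for illustration.

```python
# Hypothetical in-memory model of the shared-text storage principle.
shared_strings = []   # the text index library: index -> text
index_of = {}         # text -> index, used to avoid storing duplicates
cells = {}            # (sheet name, cell reference) -> index into shared_strings


def set_cell(sheet, ref, text):
    """Store the text once and record only its index in the cell."""
    if text not in index_of:
        index_of[text] = len(shared_strings)
        shared_strings.append(text)
    cells[(sheet, ref)] = index_of[text]


def get_cell(sheet, ref):
    """Resolve a cell's text by looking up its recorded index."""
    return shared_strings[cells[(sheet, ref)]]


set_cell("sheet1", "A1", "abc")   # stored at index 0
set_cell("sheet1", "A2", "123")   # stored at index 1
set_cell("sheet2", "B1", "123")   # reuses index 1, no second copy
```

Both cells holding "123" end up recording the same index, and the text itself is stored only once.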

FIG. 2 is a schematic flow diagram of a method of one embodiment of the invention. The method of fig. 2 may be executed by a text extraction system for EXCEL documents.

As shown in fig. 2, the method 200 includes:

step 210, screening SST labels from a WorkBook data stream of an Excel document, and extracting text index library information from the SST labels;

step 220, screening out all Sheet tags in the WorkBook data stream through the screening tag Type value, and reading the Sheet tags to generate a Sheet index library;

step 230, extracting the offset position of the sheet page in the sheet index library, extracting LabelSst labels from all labels of the sheet page according to the offset position, and reading the index information of the LabelSst labels;

step 240, searching for the corresponding text in the text index library according to the index information, and splicing the corresponding texts according to the position, in the sheet index library, of the LabelSst label to which the index information belongs.

In order to facilitate understanding of the present invention, the following further describes the text extraction method of the EXCEL document provided by the present invention with reference to the text extraction process of the EXCEL document in the embodiment.

Referring to fig. 3, specifically, the text extraction method for the EXCEL document includes:

S1, screening SST labels from the WorkBook data stream of the Excel document, and extracting text index library information from the SST labels.

Open the WorkBook data stream of the Excel document and read from the first byte. Read 4 bytes and parse the Type value and the Length value from them, then advance by the recorded length and repeat, reading tag after tag until a tag with Type value 10 (EOF) is found. Many tags are passed during this traversal; when a tag with Type value 252 (SST) is found, it is the SST, which stores the index library of all text in the Excel file.
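The traversal can be sketched in Python as follows. This is a minimal sketch, assuming the WorkBook stream has already been read into a `bytes` object and that the 4-byte header is a 2-byte Type followed by a 2-byte Length, both little-endian (the usual layout of the binary Excel record header). The function names `iter_records` and `find_sst` are hypothetical.

```python
import struct

EOF_TYPE = 10    # Type value of the EOF tag
SST_TYPE = 252   # Type value of the SST tag


def iter_records(stream, start=0):
    """Yield (type, payload) for each tag from `start` up to and including EOF."""
    pos = start
    while pos + 4 <= len(stream):
        # Assumed header layout: 2-byte Type, then 2-byte Length, little-endian.
        rec_type, length = struct.unpack_from("<HH", stream, pos)
        payload = stream[pos + 4 : pos + 4 + length]
        yield rec_type, payload
        if rec_type == EOF_TYPE:
            break
        pos += 4 + length


def find_sst(stream):
    """Return the payload of the first SST tag, or None if there is none."""
    for rec_type, payload in iter_records(stream):
        if rec_type == SST_TYPE:
            return payload
    return None
```

Yielding the payload together with the type lets later steps reuse the same loop for Sheet tags without re-reading the header.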

After entering the SST, first read the size of the index library, cstUnique, and then read the texts in the index library in a loop according to that size. Start with the XLUnicodeRichExtendedString label numbered 0: the label stores the number of characters cch of the text and the encoding format fHighByte of the text, where fHighByte equal to 1 means Unicode encoding and any other value means ANSI encoding. Take out the corresponding text according to the number of characters and the encoding format, calculate the length of the text, and move on to the next XLUnicodeRichExtendedString, until all XLUnicodeRichExtendedString labels have been extracted.

Once all XLUnicodeRichExtendedString labels in the SST have been extracted, the text index library has been created.
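A sketch of reading the text index library from an SST payload is given below. It rests on several assumptions that the description does not spell out: the payload begins with a 4-byte total count followed by the 4-byte cstUnique; each string starts with a 2-byte cch and a 1-byte flag field whose lowest bit is fHighByte; optional formatting-run and phonetic data are skipped; and the whole SST fits in a single tag with no continuation tags. The function name `parse_sst` is hypothetical, and latin-1 stands in for the "ANSI" case.

```python
import struct


def parse_sst(payload):
    """Return the list of texts stored in an SST tag payload."""
    # Assumed layout: total reference count (skipped), then cstUnique.
    _cst_total, cst_unique = struct.unpack_from("<II", payload, 0)
    pos = 8
    texts = []
    for _ in range(cst_unique):
        cch, flags = struct.unpack_from("<HB", payload, pos)
        pos += 3
        f_high_byte = flags & 0x01   # 1: two bytes per character
        f_ext_st = flags & 0x04      # phonetic data follows the text
        f_rich_st = flags & 0x08     # formatting runs follow the text
        c_run = cb_ext = 0
        if f_rich_st:
            (c_run,) = struct.unpack_from("<H", payload, pos)
            pos += 2
        if f_ext_st:
            (cb_ext,) = struct.unpack_from("<I", payload, pos)
            pos += 4
        if f_high_byte:
            size = cch * 2
            text = payload[pos : pos + size].decode("utf-16-le")
        else:
            size = cch
            # One byte per character; latin-1 is an assumption for "ANSI".
            text = payload[pos : pos + size].decode("latin-1")
        texts.append(text)
        # Skip the text plus any trailing formatting runs and phonetic data.
        pos += size + 4 * c_run + cb_ext
    return texts
```

The computed `size` plays the role of the "length of the text" in the description: it is what allows the reader to step to the next string.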

S2, screening all Sheet labels in the WorkBook data stream through the screening label Type value, and reading the Sheet labels to generate a Sheet index library.

While traversing the WorkBook data stream, each tag found with Type value 133 (Sheet) describes one sheet page of the Excel document; the number of tags with Type value 133 encountered during the traversal equals the number of sheet pages in the document.

The Sheet tag mainly stores two sub-tags, lpPlyPos and SheetName. lpPlyPos records the offset position of the current sheet page in the Excel file, and the table in the sheet page can be read from that offset position. SheetName records the name of the current sheet page. The SheetName sub-tag stores the number of characters cch of the name and the encoding format A of the name; when A is 1 the name is Unicode encoded, and any other value means ANSI encoding.

After all Sheet tags have been read, the sheet index library has been created.
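Building the sheet index library might look like the sketch below. The byte layout is an assumption: a 4-byte lpPlyPos, 2 bytes of sheet state and kind that are skipped, then a SheetName made of a 1-byte cch, a 1-byte encoding flag and the characters. The names `parse_sheet_tag` and `build_sheet_index` are hypothetical, and the input is assumed to be a sequence of (type, payload) pairs from a tag traversal.

```python
import struct

SHEET_TYPE = 133  # Type value of the Sheet tag


def parse_sheet_tag(payload):
    """Return (offset, name) from one Sheet tag payload."""
    # lpPlyPos: 4-byte offset of the sheet page. The next 2 bytes
    # (sheet state and kind) are not needed here and are skipped.
    (lp_ply_pos,) = struct.unpack_from("<I", payload, 0)
    # SheetName: 1-byte character count, then 1-byte encoding flag.
    cch, flags = struct.unpack_from("<BB", payload, 6)
    if flags & 0x01:
        name = payload[8 : 8 + cch * 2].decode("utf-16-le")
    else:
        # latin-1 is an assumption standing in for "ANSI".
        name = payload[8 : 8 + cch].decode("latin-1")
    return lp_ply_pos, name


def build_sheet_index(records):
    """Build the sheet index library from (type, payload) pairs."""
    return [parse_sheet_tag(p) for t, p in records if t == SHEET_TYPE]
```

Keeping the entries in traversal order preserves the order of the sheet pages in the document.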

S3, extracting the offset position of the sheet page in the sheet index library, extracting LabelSst labels from all labels of the sheet page according to the offset position, and reading the index information of the LabelSst labels.

Traverse the sheet index library in a loop. For each entry, take the recorded offset position of the sheet page and start reading data there: read 4 bytes, parse the Type value and the Length value, then advance by the recorded length and repeat until a tag with Type value 10 (EOF) is found. Many tags are passed during this traversal. Each tag with Type value 253 (LabelSst) represents one cell and contains the following information: the row number (rw), the column number (col), and the number (isst) of the text index library entry that the cell references.
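The per-sheet traversal can be sketched as follows. The LabelSst payload layout used here is an assumption: a 2-byte rw, a 2-byte col, a 2-byte format index that is skipped, and a 4-byte isst, all little-endian. The function name `read_labelsst_cells` is hypothetical.

```python
import struct

EOF_TYPE = 10        # Type value of the EOF tag
LABELSST_TYPE = 253  # Type value of the LabelSst tag


def read_labelsst_cells(stream, offset):
    """Return (rw, col, isst) tuples for the sheet page starting at `offset`."""
    cells = []
    pos = offset
    while pos + 4 <= len(stream):
        rec_type, length = struct.unpack_from("<HH", stream, pos)
        if rec_type == EOF_TYPE:
            break  # end of this sheet page
        if rec_type == LABELSST_TYPE:
            # Assumed payload: rw, col, format index (skipped), isst.
            rw, col, _ixfe, isst = struct.unpack_from("<HHHI", stream, pos + 4)
            cells.append((rw, col, isst))
        pos += 4 + length
    return cells
```

Stopping at the first EOF tag keeps the cells of one sheet page separate from those of the next page stored after it in the same stream.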

S4, searching for the corresponding text in the text index library according to the index information, and splicing the corresponding texts according to the position, in the sheet index library, of the LabelSst label to which the index information belongs.

The index information of a LabelSst label locates the corresponding text in the text index library, which yields the text content of the current cell. Splicing the texts obtained from all LabelSst labels of one sheet page in the sheet index library gives the full text of that sheet page. The LabelSst labels of the other sheet pages in the sheet index library are then processed and spliced in turn, and the final spliced text is the full text extracted from the Excel file.
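The lookup and splicing step might be sketched as below. The ordering by row and then column, the newline separator and the bracketed sheet-name marker are all assumptions chosen for illustration; the description only states that the labels are sorted, the texts spliced and the sheet name marked. The function name `splice_text` is hypothetical.

```python
def splice_text(shared_strings, sheets):
    """Join each sheet's cell texts in order, labelled with the sheet name.

    `shared_strings` is the text index library (a list of texts).
    `sheets` is a list of (sheet_name, cells) pairs, where `cells` is a
    list of (rw, col, isst) tuples.
    """
    parts = []
    for name, cells in sheets:
        # Assumed ordering: by row number first, then by column number.
        ordered = sorted(cells, key=lambda c: (c[0], c[1]))
        texts = [shared_strings[isst] for _rw, _col, isst in ordered]
        # Assumed marker format: the sheet name in square brackets.
        parts.append("\n".join([f"[{name}]"] + texts))
    return "\n".join(parts)
```

For instance, with `shared_strings = ["abc", "123"]`, a sheet1 whose cells are `(1, 0, 1)` and `(0, 0, 0)` produces "abc" before "123" because row 0 sorts ahead of row 1.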

As shown in fig. 4, the system 400 includes:

the first creating unit 410 is configured to screen SST tags from a WorkBook data stream of an Excel document, and extract text index library information from the SST tags;

the second creating unit 420 is configured to screen out all Sheet tags in the WorkBook data stream through the screening tag Type value, and read the Sheet tags to generate a Sheet index library;

an index extracting unit 430, configured to extract offset positions of the sheet pages in the sheet index library, extract LabelSst labels from all labels of the sheet pages according to the offset positions, and read index information of the LabelSst labels;

and the text splicing unit 440 is configured to search for the corresponding text in the text index library according to the index information, and to splice the corresponding texts according to the position, in the sheet index library, of the LabelSst label to which the index information belongs.

Optionally, as an embodiment of the present invention, the first creating unit includes:

Optionally, as an embodiment of the present invention, the second creating unit includes:

Optionally, as an embodiment of the present invention, the text splicing unit includes:

Fig. 5 is a schematic structural diagram of a terminal system 500 according to an embodiment of the present invention, where the terminal system 500 may be used to execute the text extraction method of the EXCEL document according to the embodiment of the present invention.

The terminal system 500 may include: a processor 510, a memory 520, and a communication unit 530. The components communicate via one or more buses. Those skilled in the art will appreciate that the architecture shown in the figure is not limiting: it may be a bus architecture or a star architecture, and may include more or fewer components than shown, combine some components, or arrange the components differently.

The memory 520 may be used for storing instructions executed by the processor 510, and the memory 520 may be implemented by any type of volatile or non-volatile storage device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The executable instructions in memory 520, when executed by processor 510, enable terminal 500 to perform some or all of the steps in the method embodiments described above.

The processor 510 is the control center of the terminal. It connects the various parts of the entire terminal using various interfaces and lines, and performs the functions of the terminal and/or processes data by running or executing software programs and/or modules stored in the memory 520 and calling data stored in the memory. The processor may be composed of an Integrated Circuit (IC), for example, a single packaged IC, or a plurality of packaged ICs connected with the same or different functions. For example, processor 510 may include only a Central Processing Unit (CPU). In the embodiment of the present invention, the CPU may be a single operation core, or may include multiple operation cores.

The communication unit 530 is used for establishing a communication channel so that the terminal can communicate with other terminals, receiving user data sent by other terminals or sending user data to other terminals.

The present invention also provides a computer storage medium, wherein the computer storage medium may store a program, and the program may include some or all of the steps in the embodiments provided by the present invention when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).

Therefore, the invention parses the binary data of the Excel document and, following the principle by which text is stored in the document, extracts the text in the Excel document in its original order. The method is not limited to Office Excel files; it can also extract text from files that adopt the Office Excel text storage principle, such as Kingsoft (Jinshan) Office ET files. The invention does not depend on Office Excel or Kingsoft (Jinshan) ET components and therefore has good compatibility; it reads the file in binary form and processes it with precise positioning, which significantly improves execution efficiency; the whole implementation is hand-coded and calls only system API functions, does not depend on any third-party program files, and yields a small program package. For the technical effects achieved by this embodiment, refer to the description above; they are not repeated here.

Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be embodied in the form of a software product, where the computer software product is stored in a storage medium, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like, and the storage medium can store program codes, and includes instructions for enabling a computer terminal (which may be a personal computer, a server, or a second terminal, a network terminal, and the like) to perform all or part of the steps of the method in the embodiments of the present invention.

The same and similar parts in the various embodiments in this specification may be referred to each other. Especially, for the terminal embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant points can be referred to the description in the method embodiment.

In the embodiments provided in the present invention, it should be understood that the disclosed system and method can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, systems or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

Although the present invention has been described in detail by referring to the drawings in connection with the preferred embodiments, the present invention is not limited thereto. Various equivalent modifications or substitutions can be made to the embodiments of the present invention by those skilled in the art without departing from the spirit and scope of the present invention, and these modifications or substitutions fall within the scope of the present invention. Any change or substitution that a person skilled in the art can easily conceive of within the technical scope disclosed by the present invention shall likewise be covered. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A text extraction method of an EXCEL document is characterized by comprising the following steps:

2. The method of claim 1, wherein extracting text index library information from the SST tag comprises:

entering SST labels to read the size of a text index library;

3. The method as claimed in claim 1, wherein the screening out of all Sheet tags in the WorkBook data stream by tag Type value and the reading of the Sheet tags to generate a sheet index library comprises:

screening a Sheet tag with the Type value of 133 from the WorkBook data stream;

4. The method of claim 1, wherein the splicing of the corresponding texts according to the position, in the sheet index library, of the LabelSst label to which the index information belongs comprises:

sequencing LabelSst labels under the same sheet page;

splicing corresponding texts according to the sequence of the LabelSst labels;

and marking the name of the sheet page to which the spliced text belongs.

5. A text extraction system for EXCEL documents, comprising:

6. The system according to claim 5, wherein the first creating unit includes:

7. The system according to claim 5, wherein the second creating unit includes:

8. The system of claim 5, wherein the text stitching unit comprises:

9. A terminal, comprising:

a processor;

a memory for storing instructions for execution by the processor;

wherein the processor is configured to perform the method of any one of claims 1-4.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-4.