CN111241787A - Method and system for analyzing the Word binary format and extracting characters in a document

Info

Publication number: CN111241787A
Application number: CN202010031347.8A
Authority: CN
Inventors: 苗功勋; 董盼山; 李显程; 崔新安; 王金国
Original assignee: BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD; Nanjing Zhongfu Information Technology Co Ltd; Zhongfu Information Co Ltd; Zhongfu Safety Technology Co Ltd
Current assignee: BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD; Nanjing Zhongfu Information Technology Co Ltd; Zhongfu Information Co Ltd; Zhongfu Safety Technology Co Ltd
Priority date: 2020-01-13
Filing date: 2020-01-13
Publication date: 2020-06-05

Abstract

The invention provides a method and a system for analyzing the Word binary format and extracting the characters in a document. All characters in a Word document are extracted by parsing the document's binary data according to the principle by which the characters are stored. Compared with extraction through secondary interface development, the JAVA OPI technical interface, or a conversion tool, extraction by parsing the binary format offers good compatibility, high efficiency, and a small program package. It does not depend on Office PowerPoint/Jinshan Wps components and does not require installing Office or extracting Office component files; it only calls system API functions. Because the file is read in binary mode and the data is located precisely, execution efficiency is markedly improved. The entire implementation is hand-coded against system API functions and does not depend on any third-party program files.

Description

Method and system for analyzing the Word binary format and extracting characters in a document

Technical Field

The invention relates to the technical field of Word text extraction, and in particular to a method and a system for analyzing the Word binary format and extracting characters in a document.

Background

With the growing adoption of office informatization, electronic office work has become a basic requirement for enterprises, public institutions, and individuals. Microsoft Word has been popular as word-processing software since the 1990s and is now widely used in office environments; many office systems, such as ERP (enterprise resource planning) systems, also use Word to exchange information. The Word file format has become a common industry standard and is also adopted by other products such as Jinshan Office Wps.

Many application scenarios require extracting the text of a Word document for secondary processing, for example Word inspection tools, text comparison tools, and file format conversion tools. How to extract the text of a Word document completely and efficiently is therefore an important problem in the prior art.

Existing general-purpose Word text extraction falls into three main methods. The first uses the secondary development interface provided by Office Word or Jinshan Wps software. The second uses the OPI technology provided by JAVA. The third uses a third-party conversion tool to convert the Word document into HTML or another format and then extracts the text from that format. These three techniques have the following disadvantages:

For the first: compatibility is poor. The method depends entirely on Office Word or Jinshan Wps components for secondary interface development, so an Office Word program must be pre-installed on the computer, or the Office Word component programs must be manually extracted and integrated into the extraction tool. Because extraction relies on Office Word, it can fail if Word is incompletely installed or incorrectly configured. Efficiency is also low: secondary interface development uses COM technology, and the data passes through several layers of conversion.

For the second: the program package is large. Extracting the characters with the JAVA OPI technology requires a JAVA virtual machine in the operating environment, which makes the installation package excessively large. Efficiency is also low: JAVA performance is low, so text extraction is slow.

For the third: efficiency is low. The text is extracted by first converting the Word document into an intermediate HTML file with another tool, so extraction passes through two stages of conversion.

Disclosure of Invention

The invention aims to provide a method and a system for analyzing the Word binary format and extracting the characters in a document, so as to solve the low efficiency and poor compatibility of Word text extraction in the prior art, remove the dependence on Word components, and improve compatibility and extraction efficiency.

To achieve this technical purpose, the invention provides a method for analyzing the Word binary format and extracting characters in a document, comprising the following steps:

S1, opening the WordDocument data stream of the Word document in binary mode, and reading the data header FIB and the data structure storage linked list FIBTable respectively;

S2, reading the start position fcClx and the offset lcbClx of the data from the FIBTable, and reading the root node Clx of the characters according to the start position fcClx and the offset lcbClx;

S3, establishing a linked list PcdtList of the character segments according to the root node Clx;

S4, cyclically reading the linked list PcdtList, reading the encoding format value from each node Pcdt in the PcdtList, calculating the number of characters and the storage position of the characters, and fetching the character data stored by the node Pcdt from the WordDocument data stream according to the encoding format value, the number of characters and the storage position of the characters;

S5, splicing the character data extracted from each node of the linked list PcdtList to form the character data of the whole Word file.

Preferably, the data header FIB is the 32 bytes of content counted from the start position of the WordDocument stream, and the data structure storage linked list FIBTable is the 774 bytes of content counted from the 154th byte of the WordDocument stream.
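As a minimal sketch of this preferred layout, the snippet below slices the two regions out of a WordDocument stream that has already been loaded into memory. The function name `read_fib_and_table` and the constant names are illustrative, the "154th byte" is assumed to mean the zero-based byte offset 154, and obtaining the stream from the compound file is left out.

```python
from typing import Tuple

FIB_HEADER_SIZE = 32     # bytes of the data header FIB, counted from the stream start
FIB_TABLE_OFFSET = 154   # assumed zero-based offset at which FIBTable begins
FIB_TABLE_SIZE = 774     # bytes of FIBTable, as stated in the text


def read_fib_and_table(word_document: bytes) -> Tuple[bytes, bytes]:
    """Return (FIB, FIBTable) sliced from the WordDocument stream."""
    end = FIB_TABLE_OFFSET + FIB_TABLE_SIZE
    if len(word_document) < end:
        raise ValueError("WordDocument stream is too short")
    fib = word_document[:FIB_HEADER_SIZE]
    fib_table = word_document[FIB_TABLE_OFFSET:end]
    return fib, fib_table
```

Working on an in-memory `bytes` object keeps the sketch independent of how the stream is located inside the file.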

Preferably, each character node Pcdt in the linked list PcdtList records the number of characters, the character encoding of the characters, and the offset position of the characters.

Preferably, an encoding format value of 1 indicates a character string stored in ANSI format, and an encoding format value of 0 indicates a character string stored in Unicode format.
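The mapping from the encoding format value to a decoder can be sketched as follows. This is only an illustration: the function name `decode_run` is hypothetical, and the concrete codecs (Windows code page 1252 for ANSI, little-endian UTF-16 for Unicode) are assumptions that the text does not state.

```python
def decode_run(raw: bytes, encoding_flag: int) -> str:
    """Decode one run of character bytes according to the encoding format value."""
    if encoding_flag == 1:
        # ANSI: one byte per character; cp1252 is an assumed code page.
        return raw.decode("cp1252", errors="replace")
    if encoding_flag == 0:
        # Unicode: two bytes per character; little-endian UTF-16 is assumed.
        return raw.decode("utf-16-le", errors="replace")
    raise ValueError(f"unexpected encoding format value: {encoding_flag}")
```

A document written on a system with a different ANSI code page would need a different codec in the first branch.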

The invention also provides a system for analyzing the Word binary format and extracting characters in a document, the system comprising:

an initialization module, used for opening the WordDocument data stream of a Word document in binary mode and reading the data header FIB and the data structure storage linked list FIBTable respectively;

a root node reading module, used for reading the start position fcClx and the offset lcbClx of the data from the FIBTable and reading the root node Clx of the characters according to the start position fcClx and the offset lcbClx;

a character linked list establishing module, used for establishing a linked list PcdtList of the character segments according to the root node Clx;

a character extraction module, used for cyclically reading the linked list PcdtList, reading the encoding format value from each node Pcdt in the PcdtList, calculating the number of characters and the storage position of the characters, and fetching the character data stored by the node Pcdt from the WordDocument data stream according to the encoding format value, the number of characters and the storage position of the characters;

and a character splicing module, used for splicing the character data extracted from each node of the linked list PcdtList to form the character data of the whole Word file.

The effects described in this summary are only those of the embodiments, not all effects of the invention. One of the above technical solutions has the following advantages or beneficial effects:

Compared with the prior art, the invention parses the binary data of the Word document and extracts all characters in the document according to the principle by which the characters are stored. Compared with extraction through secondary interface development, the JAVA OPI technical interface, or a conversion tool, extraction by parsing the binary format offers good compatibility, high efficiency, and a small program package. It does not depend on Office PowerPoint/Jinshan Wps components and does not require installing Office or extracting Office component files; it only calls system API functions. Because the file is read in binary mode and the data is located precisely, execution efficiency is markedly improved. The entire implementation is hand-coded against system API functions and does not depend on any third-party program files. The invention is not limited to extracting text from Office Word files: any file that adopts the Office Word text storage principle, such as a Jinshan Wps file, can be processed by this method.

Drawings

FIG. 1 is a flowchart of a method for analyzing the Word binary format and extracting characters in a document according to an embodiment of the present invention;

FIG. 2 is a block diagram of a system for analyzing the Word binary format and extracting characters in a document according to an embodiment of the present invention.

Detailed Description

In order to clearly explain the technical features of the present invention, the following detailed description of the present invention is provided with reference to the accompanying drawings. The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. To simplify the disclosure of the present invention, the components and arrangements of specific examples are described below. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. It should be noted that the components illustrated in the figures are not necessarily drawn to scale. Descriptions of well-known components and processing techniques and procedures are omitted so as to not unnecessarily limit the invention.

The following describes in detail, with reference to the accompanying drawings, a method and a system for analyzing the Word binary format and extracting characters in a document according to an embodiment of the present invention.

As shown in FIG. 1, the embodiment of the invention discloses a method for analyzing the Word binary format and extracting characters in a document, which comprises the following steps:

Microsoft Office Word 2003 and earlier versions use the MS Word binary file format as their default file format, and the underlying data storage is the compound document format.
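As an illustration of the compound document container mentioned above, the following minimal sketch checks whether a buffer begins with the 8-byte signature that compound files carry. The function name is hypothetical, and this pre-check is not one of the steps of the described method.

```python
# Signature found in the first 8 bytes of a compound file.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_compound_document(data: bytes) -> bool:
    """Return True if the buffer starts with the compound file signature."""
    return data[:8] == CFB_SIGNATURE
```

Such a check can reject newer ZIP-based .docx files early, since those begin with different bytes.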

In the binary data of a Word 2003 document, the text data are associated through a linked list structure: after one node is read, the next node is read.

The Word document is opened in binary mode, its WordDocument data stream is opened, and the 32-byte data header FIB is read from the start position; the FIB records basic information about the WordDocument data stream.

The 774-byte data structure storage linked list FIBTable of the Word document is read starting from the 154th byte of the WordDocument data stream. The FIBTable records the start addresses and offsets of the various kinds of data of the Word document within the binary data, including the start address fcClx and the offset lcbClx. The start address fcClx and the offset lcbClx are read from the FIBTable, and the text root node Clx is read according to them.
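A minimal sketch of this lookup is given below. The text does not say where fcClx and lcbClx sit inside the FIBTable. The value 264 used here is an assumption taken from the publicly documented Word 97 layout, where the pair is located at stream offset 418. The buffer that holds the Clx is passed in as the hypothetical parameter `storage`, because in the publicly documented layout fcClx is an offset into the document's table stream.

```python
import struct

# Assumed position of the fcClx/lcbClx pair inside FIBTable: stream offset 418
# minus the table start at 154. The source text gives no number for this.
FC_CLX_OFFSET_IN_TABLE = 264


def read_clx(fib_table: bytes, storage: bytes) -> bytes:
    """Read fcClx and lcbClx from FIBTable and slice the root node Clx."""
    # Both values are read as little-endian unsigned 32-bit integers.
    fc_clx, lcb_clx = struct.unpack_from("<II", fib_table, FC_CLX_OFFSET_IN_TABLE)
    if fc_clx + lcb_clx > len(storage):
        raise ValueError("Clx lies outside the supplied data")
    return storage[fc_clx:fc_clx + lcb_clx]
```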

A linked list PcdtList of the character segments is established according to the root node Clx; each character node Pcdt records the number of characters, the character encoding of the characters, the offset position of the characters, and so on.
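The construction of the node list can be sketched as follows. The byte layout used here is an assumption based on the publicly documented .doc format, not on the text above: optional property blocks marked 0x01, then a table marked 0x02 that holds n+1 character positions followed by n 8-byte descriptors. The names `PcdtNode` and `build_pcdt_list` are illustrative.

```python
import struct
from typing import List, NamedTuple


class PcdtNode(NamedTuple):
    cp_start: int   # first character position covered by this node
    cp_end: int     # character position just past this node
    fc_field: int   # raw 32-bit field holding the position and the encoding bit


def build_pcdt_list(clx: bytes) -> List[PcdtNode]:
    """Walk the Clx and return one node per run of text."""
    pos = 0
    # Skip leading property blocks: marker 0x01, 2-byte size, then payload.
    while pos < len(clx) and clx[pos] == 0x01:
        (size,) = struct.unpack_from("<H", clx, pos + 1)
        pos += 3 + size
    if pos >= len(clx) or clx[pos] != 0x02:
        raise ValueError("piece table marker 0x02 not found")
    (table_size,) = struct.unpack_from("<I", clx, pos + 1)
    table = clx[pos + 5:pos + 5 + table_size]
    # The table size is 4 * (n + 1) + 8 * n bytes, which gives n.
    n = (len(table) - 4) // 12
    cps = struct.unpack_from(f"<{n + 1}I", table, 0)
    nodes = []
    for i in range(n):
        # Each descriptor: 2 bytes of flags, 4-byte fc field, 2 bytes of prm.
        (fc_field,) = struct.unpack_from("<I", table, 4 * (n + 1) + 8 * i + 2)
        nodes.append(PcdtNode(cps[i], cps[i + 1], fc_field))
    return nodes
```

A Python list stands in for the linked list here, since the nodes are only ever visited in order.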

The linked list PcdtList is read in a loop. For each character node Pcdt taken from the PcdtList, the encoding format CharSet of the characters is read, where a value of 1 indicates a character string stored in ANSI format and a value of 0 indicates a character string stored in Unicode format, and the number of characters and the storage position of the characters are calculated.
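The calculation for one node might look like the sketch below. The bit layout is an assumption drawn from the publicly documented .doc format rather than from the text above: bit 30 of a 32-bit field carries the encoding flag, the low 30 bits carry the position, and the position of an ANSI run is stored doubled. The function name `describe_node` and its parameters are hypothetical.

```python
from typing import Tuple


def describe_node(cp_start: int, cp_end: int, fc_field: int) -> Tuple[int, int, int]:
    """Return (CharSet, Count, Pos) for one Pcdt node."""
    char_set = (fc_field >> 30) & 1    # 1 = ANSI, 0 = Unicode
    count = cp_end - cp_start          # number of characters in the run
    fc = fc_field & 0x3FFFFFFF         # low 30 bits: stored position
    # Assumed rule: halve the stored position for ANSI runs to get the
    # byte position in the stream; Unicode positions are used unchanged.
    pos = fc // 2 if char_set == 1 else fc
    return char_set, count, pos
```

With these three values, an ANSI run occupies `count` bytes starting at `pos`, and a Unicode run occupies `2 * count` bytes.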

The character data stored by the character node Pcdt are taken from the WordDocument data stream according to the encoding format CharSet, the number of characters Count, and the storage position Pos. In the same way, the characters of all character nodes Pcdt in the linked list PcdtList are extracted and spliced to obtain the character data of the whole Word file.
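The extraction and splicing loop can be sketched as follows, with each node reduced to a (CharSet, Count, Pos) triple. The function name `extract_text` is illustrative, and the codecs (cp1252 for ANSI, little-endian UTF-16 for Unicode) are assumptions rather than details given in the text.

```python
from typing import Iterable, Tuple


def extract_text(word_document: bytes, nodes: Iterable[Tuple[int, int, int]]) -> str:
    """Splice the text of all nodes; each node is (CharSet, Count, Pos)."""
    parts = []
    for char_set, count, pos in nodes:
        if char_set == 1:
            # ANSI run: one byte per character (assumed cp1252).
            raw = word_document[pos:pos + count]
            parts.append(raw.decode("cp1252", errors="replace"))
        else:
            # Unicode run: two bytes per character (assumed UTF-16 LE).
            raw = word_document[pos:pos + 2 * count]
            parts.append(raw.decode("utf-16-le", errors="replace"))
    return "".join(parts)
```

For example, a stream that stores "Hello" as ANSI at byte 16 and two Unicode characters at byte 21 would be described by the nodes `[(1, 5, 16), (0, 2, 21)]`.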

Compared with extraction through secondary interface development, the JAVA OPI technical interface, or a conversion tool, extraction by parsing the binary format offers good compatibility, high efficiency, and a small program package. It does not depend on Office PowerPoint/Jinshan Wps components and does not require installing Office or extracting Office component files; it only calls system API functions. Because the file is read in binary mode and the data is located precisely, execution efficiency is markedly improved. The entire implementation is hand-coded against system API functions and does not depend on any third-party program files. By parsing the binary data of the Word document, all characters in the document are extracted according to the principle by which the characters are stored. The invention is not limited to extracting text from Office Word files: any file that adopts the Office Word text storage principle, such as a Jinshan Wps file, can be processed by this method.

As shown in FIG. 2, the embodiment of the present invention further discloses a system for analyzing the Word binary format and extracting characters in a document, where the system includes:

The initialization module reads and initializes the file: it opens the Word document in binary mode, opens the WordDocument data stream of the Word document, and reads the 32-byte data header FIB from the start position; the FIB records basic information about the WordDocument data stream.

The 774-byte data structure storage linked list FIBTable of the Word document is read starting from the 154th byte of the WordDocument data stream. The FIBTable records the start addresses and offsets of the various kinds of data of the Word document within the binary data, including the start address fcClx and the offset lcbClx.

The root node reading module reads the start address fcClx and the offset lcbClx from the FIBTable and reads the text root node Clx according to them.

The character linked list establishing module establishes a linked list PcdtList of the character segments according to the root node Clx; each character node Pcdt records the number of characters, the character encoding of the characters, the offset position of the characters, and so on.

The character extraction module reads the linked list PcdtList in a loop. For each character node Pcdt taken from the PcdtList, it reads the encoding format CharSet of the characters, where a value of 1 indicates a character string stored in ANSI format and a value of 0 indicates a character string stored in Unicode format, and it calculates the number of characters and the storage position of the characters.

The character data stored by the character node Pcdt are taken from the WordDocument data stream according to the encoding format CharSet, the number of characters Count, and the storage position Pos. In the same way, the character splicing module extracts and splices the characters of all character nodes Pcdt in the linked list PcdtList to obtain the character data of the whole Word file.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A method for analyzing word binary format and extracting characters in a document is characterized by comprising the following steps:

2. The method of claim 1, wherein the data header FIB is the 32 bytes of content counted from the start position of the WordDocument stream, and the data structure storage linked list FIBTable is the 774 bytes of content counted from the 154th byte of the WordDocument stream.

3. The method of claim 1, wherein each character node Pcdt in the linked list PcdtList records the number of characters, the character encoding of the characters, and the offset position of the characters.

4. The method for analyzing the Word binary format and extracting characters in a document according to any one of claims 1-3, wherein an encoding format value of 1 indicates a character string stored in ANSI format, and an encoding format value of 0 indicates a character string stored in Unicode format.

5. A system for parsing word binary format and extracting text in a document, the system comprising:

6. The system for analyzing the Word binary format and extracting characters in a document according to claim 5, wherein the data header FIB is the 32 bytes of content counted from the start position of the WordDocument stream, and the data structure storage linked list FIBTable is the 774 bytes of content counted from the 154th byte of the WordDocument stream.

7. The system of claim 5, wherein each character node Pcdt in the linked list PcdtList records the number of characters, the character encoding of the characters, and the offset position of the characters.

8. The system for analyzing the Word binary format and extracting characters in a document according to any one of claims 5-7, wherein an encoding format value of 1 indicates a character string stored in ANSI format, and an encoding format value of 0 indicates a character string stored in Unicode format.