CN111241787A - Method and system for analyzing word binary format and extracting characters in document - Google Patents

Info

Publication number
CN111241787A
Authority
CN
China
Prior art keywords
characters
character
word
format
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010031347.8A
Other languages
Chinese (zh)
Inventor
苗功勋
董盼山
李显程
崔新安
王金国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD
Nanjing Zhongfu Information Technology Co Ltd
Zhongfu Information Co Ltd
Zhongfu Safety Technology Co Ltd
Original Assignee
BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD
Nanjing Zhongfu Information Technology Co Ltd
Zhongfu Information Co Ltd
Zhongfu Safety Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD, Nanjing Zhongfu Information Technology Co Ltd, Zhongfu Information Co Ltd, Zhongfu Safety Technology Co Ltd
Priority to CN202010031347.8A
Publication of CN111241787A
Legal status: Pending

Abstract

The invention provides a method and a system for parsing the Word binary format and extracting the text in a document. The binary data of the Word document is parsed, and all of the text in the document is extracted according to the way the text is stored in the file. Compared with extracting the text through secondary-interface development, the Java POI interface or a conversion tool, parsing the binary format directly offers good compatibility, high efficiency and a small program package: it does not depend on Office Word/Jinshan WPS components, requires neither an Office installation nor extracted Office component files, and calls only system API functions. Because the file is read in binary mode and the data is located precisely, execution efficiency is significantly improved. The whole implementation is hand-coded and calls only system API functions, without depending on any third-party program files.

Description

Method and system for analyzing word binary format and extracting characters in document
Technical Field
The invention relates to the technical field of Word text extraction, and in particular to a method and a system for parsing the Word binary format and extracting the text in a document.
Background
With the growing adoption of office informatization, electronic office work has become a basic requirement for enterprises, public institutions and individuals. Microsoft Word has been in widespread use as word-processing software since the 1990s and is now common in office environments; many office systems, such as ERP (enterprise resource planning) systems, also use Word documents to exchange information. The Word file format has become a de facto industry standard and is also supported by other office suites such as Jinshan WPS Office.
In many application scenarios, such as document inspection tools, text comparison tools and file format conversion tools, the text in a Word document must be extracted for secondary processing. How to extract that text completely and efficiently has become an important problem in the prior art.
Existing Word text extraction approaches fall into three main categories. The first uses the secondary-development interface provided by Office Word or Jinshan WPS to extract the text. The second uses the POI technology available for Java. The third uses a third-party conversion tool to convert the Word document into another format such as HTML and then extracts the text from the HTML. These three techniques have the following disadvantages:
For the first: compatibility is poor. The method depends entirely on Office Word or Jinshan WPS components: Office Word must be pre-installed on the computer, or the Office Word component files must be manually extracted and integrated into the extraction tool, and extraction may fail if Word is incompletely installed or incorrectly configured. Efficiency is also low, because the secondary-development interface is built on COM technology and the data passes through several layers of conversion.
For the second: the program package is large. Extracting the text with Java POI requires a Java virtual machine in the operating environment, which makes the installation package excessively large. Efficiency is also low, because Java performance is low and text extraction is correspondingly slow.
For the third: efficiency is low. The text is extracted by first converting the Word document into an intermediate HTML file with another tool, so extraction passes through two stages of conversion.
Disclosure of Invention
The invention aims to provide a method and a system for parsing the Word binary format and extracting the text in a document, so as to solve the low efficiency and poor compatibility of Word text extraction in the prior art, remove the dependence on Word components, and improve compatibility and extraction efficiency.
In order to achieve the technical purpose, the invention provides a method for analyzing a word binary format and extracting characters in a document, which comprises the following steps:
S1, opening the WordDocument data stream of the Word document in binary mode, and reading the data header FIB and the data structure storage linked list FIBTable respectively;
S2, reading the start position fcClx and the offset lcbClx of the text data from the FIBTable, and reading the text root node Clx according to fcClx and lcbClx;
S3, establishing a linked list PcdtList of the text segments according to the root node Clx;
S4, traversing the linked list PcdtList and, for each text node Pcdt in it, reading the encoding format value, calculating the number of characters and the storage position of the characters, and reading the character data stored for that node from the WordDocument data stream according to the encoding format value, the number of characters and the storage position;
S5, splicing the character data extracted from every node of the linked list PcdtList to form the text data of the whole Word file.
Preferably, the data header FIB is the 32 bytes of content counted from the start of the WordDocument stream, and the data structure storage linked list FIBTable is the 774 bytes of content counted from the 154th byte of the WordDocument stream.
Preferably, each text node Pcdt in the linked list PcdtList records the number of characters, the character encoding of the characters and the offset position of the characters.
Preferably, an encoding format value of 1 indicates a character string stored in ANSI format, and an encoding format value of 0 indicates a character string stored in Unicode format.
The invention also provides a system for analyzing the binary word format and extracting the characters in the document, which comprises:
an initialization module, configured to open the WordDocument data stream of a Word document in binary mode and to read the data header FIB and the data structure storage linked list FIBTable respectively;
a root node reading module, configured to read the start position fcClx and the offset lcbClx of the text data from the FIBTable and to read the text root node Clx according to fcClx and lcbClx;
a text linked list establishing module, configured to establish a linked list PcdtList of the text segments according to the root node Clx;
a text extraction module, configured to traverse the linked list PcdtList and, for each text node Pcdt in it, to read the encoding format value, calculate the number of characters and the storage position of the characters, and read the character data stored for that node from the WordDocument data stream according to the encoding format value, the number of characters and the storage position;
and a text splicing module, configured to splice the character data extracted from every node of the linked list PcdtList to form the text data of the whole Word file.
Preferably, the data header FIB is the 32 bytes of content counted from the start of the WordDocument stream, and the data structure storage linked list FIBTable is the 774 bytes of content counted from the 154th byte of the WordDocument stream.
Preferably, each text node Pcdt in the linked list PcdtList records the number of characters, the character encoding of the characters and the offset position of the characters.
Preferably, an encoding format value of 1 indicates a character string stored in ANSI format, and an encoding format value of 0 indicates a character string stored in Unicode format.
The effect provided in the summary of the invention is only the effect of the embodiment, not all the effects of the invention, and one of the above technical solutions has the following advantages or beneficial effects:
Compared with the prior art, the invention parses the binary data of the Word document and extracts all of the text in the document according to the way the text is stored in the file. Compared with extracting the text through secondary-interface development, the Java POI interface or a conversion tool, parsing the binary format directly offers good compatibility, high efficiency and a small program package: it does not depend on Office Word/Jinshan WPS components, requires neither an Office installation nor extracted Office component files, and calls only system API functions. Because the file is read in binary mode and the data is located precisely, execution efficiency is significantly improved. The whole implementation is hand-coded and calls only system API functions, without depending on any third-party program files. The invention is not limited to extracting text from Office Word files; any file that uses the Office Word text storage scheme, such as a file produced by Jinshan WPS, can be processed by the method.
Drawings
FIG. 1 is a flowchart of a method for parsing a word binary format and extracting a text in a document according to an embodiment of the present invention;
FIG. 2 is a block diagram of a system for parsing a binary word format and extracting text in a document according to an embodiment of the present invention.
Detailed Description
In order to clearly explain the technical features of the present invention, the following detailed description of the present invention is provided with reference to the accompanying drawings. The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. To simplify the disclosure of the present invention, the components and arrangements of specific examples are described below. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. It should be noted that the components illustrated in the figures are not necessarily drawn to scale. Descriptions of well-known components and processing techniques and procedures are omitted so as to not unnecessarily limit the invention.
The following describes a method and a system for parsing a word binary format and extracting a text in a document according to an embodiment of the present invention in detail with reference to the accompanying drawings.
As shown in FIG. 1, the embodiment of the invention discloses a method for analyzing word binary format and extracting characters in a document, which comprises the following steps:
S1, opening the WordDocument data stream of the Word document in binary mode, and reading the data header FIB and the data structure storage linked list FIBTable respectively;
S2, reading the start position fcClx and the offset lcbClx of the text data from the FIBTable, and reading the text root node Clx according to fcClx and lcbClx;
S3, establishing a linked list PcdtList of the text segments according to the root node Clx;
S4, traversing the linked list PcdtList and, for each text node Pcdt in it, reading the encoding format value, calculating the number of characters and the storage position of the characters, and reading the character data stored for that node from the WordDocument data stream according to the encoding format value, the number of characters and the storage position;
S5, splicing the character data extracted from every node of the linked list PcdtList to form the text data of the whole Word file.
Microsoft Office Word 2003 and earlier versions use the MS Word binary file format (.doc) as their default file format, and the underlying storage container is the compound document format.
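Because the compound document container is the outer layer of the file, a parser would typically confirm the container signature before looking for the WordDocument stream. The following is a minimal sketch only; the function name `is_compound_document` is hypothetical, and the 8-byte signature is taken from Microsoft's public compound file specification rather than from this document.

```python
# Signature that opens every compound document file
# (value taken from the public compound file specification, not from this patent).
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def is_compound_document(data: bytes) -> bool:
    """Return True if the buffer starts with the compound document signature."""
    return data[:8] == CFB_SIGNATURE
```

A caller could read the first 8 bytes of a file and reject anything that fails this check before attempting to parse the streams inside it.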
In the binary data of a Word 2003 document, the text data is organized as a linked list: after one node is read, the next node is read in turn.
The Word document is opened in binary mode and its WordDocument data stream is opened; the 32-byte data header FIB is read from the start of the stream. The FIB records basic information about the WordDocument data stream.
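The header read can be sketched as follows, assuming the WordDocument stream has already been extracted from the compound document into a `bytes` object. The field names (`wIdent`, `nFib`), the magic value 0xA5EC, and the flag bits at byte offset 10 follow the public MS-DOC specification and are assumptions beyond what this document states; the function name `read_fib_header` and the returned dictionary keys are hypothetical.

```python
import struct

FIB_HEADER_SIZE = 32   # size of the data header FIB given in this document
WORD_MAGIC = 0xA5EC    # wIdent value for Word binary files (per MS-DOC, an assumption here)


def read_fib_header(word_document: bytes) -> dict:
    """Parse a few basic fields from the first 32 bytes of the WordDocument stream."""
    if len(word_document) < FIB_HEADER_SIZE:
        raise ValueError("WordDocument stream is shorter than the 32-byte FIB header")
    # Little-endian: identifier and format version at the start of the stream.
    w_ident, n_fib = struct.unpack_from("<HH", word_document, 0)
    if w_ident != WORD_MAGIC:
        raise ValueError("not a Word binary stream")
    # 16-bit flag word at byte offset 10 (layout per MS-DOC, an assumption here).
    (flags,) = struct.unpack_from("<H", word_document, 10)
    return {
        "wIdent": w_ident,
        "nFib": n_fib,
        "encrypted": bool(flags & 0x0100),
        "table_stream": "1Table" if flags & 0x0200 else "0Table",
    }
```

Checking the identifier first lets the extractor fail early on streams that are not Word data, before any offsets from later structures are trusted.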
The 774-byte data structure storage linked list FIBTable is read starting at the 154th byte of the WordDocument data stream; the FIBTable records the start address and the offset, within the binary data, of the various kinds of data in the Word document. The start address fcClx and the offset lcbClx (the length of the data in bytes) are read from the FIBTable, and the text root node Clx is read according to fcClx and lcbClx.
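Locating fcClx and lcbClx inside the FIBTable can be sketched as below. The starting offset 154 comes from this document; the position of the fcClx/lcbClx pair (the 34th pair of 4-byte values, index 33) is taken from the public MS-DOC specification and is an assumption here, as is the hypothetical function name `read_clx_location`. This document does not say which stream the returned position refers to; MS-DOC places the Clx in the table stream.

```python
import struct

FIB_TABLE_OFFSET = 154  # start of the FIBTable, as stated in this document
FC_CLX_INDEX = 33       # index of the fcClx/lcbClx pair (per MS-DOC, an assumption here)


def read_clx_location(word_document: bytes) -> tuple:
    """Return (fcClx, lcbClx): start position and byte length of the Clx data."""
    # Each entry in the table is a pair of 4-byte little-endian values.
    pos = FIB_TABLE_OFFSET + FC_CLX_INDEX * 8
    if len(word_document) < pos + 8:
        raise ValueError("WordDocument stream is too short to contain fcClx/lcbClx")
    fc_clx, lcb_clx = struct.unpack_from("<II", word_document, pos)
    return fc_clx, lcb_clx
```

With these constants the pair sits at byte 418 of the stream, so only 8 bytes need to be read once the stream is in memory.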
A linked list PcdtList of the text segments is built from the root node Clx; each text node Pcdt records the number of characters, the character encoding of the characters, the offset position of the characters, and so on.
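Building the list of text nodes from the Clx bytes might look like the sketch below. The byte layout used (optional blocks tagged 0x01, then a block tagged 0x02 holding n+1 character positions followed by n 8-byte descriptors) follows the public MS-DOC specification and is an assumption beyond this document; the function name `build_pcdt_list` and the dictionary keys are hypothetical, and a plain Python list stands in for the linked list.

```python
import struct


def build_pcdt_list(clx: bytes) -> list:
    """Parse the Clx bytes into a list of text nodes (one dict per text segment)."""
    pos = 0
    # Skip any leading property blocks: tag 0x01, 2-byte length, then data.
    while pos < len(clx) and clx[pos] == 0x01:
        (cb,) = struct.unpack_from("<H", clx, pos + 1)
        pos += 3 + cb
    if pos >= len(clx) or clx[pos] != 0x02:
        raise ValueError("no piece table found in Clx")
    (lcb,) = struct.unpack_from("<I", clx, pos + 1)
    plc = clx[pos + 5 : pos + 5 + lcb]
    # The table holds n+1 positions (4 bytes each) and n descriptors (8 bytes each).
    if len(plc) != lcb or lcb < 16 or (lcb - 4) % 12 != 0:
        raise ValueError("malformed piece table")
    n = (lcb - 4) // 12
    cps = struct.unpack_from("<%dI" % (n + 1), plc, 0)
    nodes = []
    for i in range(n):
        # Within each descriptor, the 4-byte position field follows 2 flag bytes.
        (fc,) = struct.unpack_from("<I", plc, 4 * (n + 1) + 8 * i + 2)
        nodes.append({"cp_start": cps[i], "cp_end": cps[i + 1], "fc": fc})
    return nodes
```

Each node keeps the raw position field untouched, so the encoding flag and the real storage position can be derived in a later step.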
The linked list PcdtList is traversed: each text node Pcdt is taken from the list and its encoding format CharSet is read. A value of 1 indicates a character string stored in ANSI format, and a value of 0 indicates a character string stored in Unicode format. The number of characters and the storage position of the characters are then calculated.
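The calculation for one node can be sketched as follows. The names CharSet, Count and Pos are the ones used in this document; the details that the flag is bit 30 of the node's 32-bit position field and that the position of 8-bit text is half the stored value come from the public MS-DOC specification and are assumptions here. The function name `locate_piece` and its parameters are hypothetical.

```python
def locate_piece(cp_start: int, cp_end: int, fc: int) -> tuple:
    """Return (CharSet, Count, Pos) for one text node.

    CharSet is 1 for ANSI (one byte per character) and 0 for Unicode
    (two bytes per character), matching the values given in this document.
    """
    count = cp_end - cp_start      # number of characters in the segment
    char_set = (fc >> 30) & 1      # encoding flag (bit 30, per MS-DOC)
    offset = fc & 0x3FFFFFFF       # lower 30 bits hold the position value
    # For 8-bit text the stored value is twice the real byte position.
    pos = offset // 2 if char_set == 1 else offset
    return char_set, count, pos
```

For example, a node covering character positions 0 to 5 with a position field of 0x40001000 yields CharSet 1, Count 5 and Pos 0x800.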
The character data stored for the text node Pcdt is read from the WordDocument data stream according to the encoding format CharSet, the character count Count and the storage position Pos. The text of all nodes Pcdt in the linked list PcdtList is extracted in the same way and spliced together to obtain the text data of the whole Word file.
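The read-and-splice step can be sketched as below, assuming each node has already been reduced to a (CharSet, Count, Pos) tuple. Decoding ANSI text with the `cp1252` codec is a simplification chosen for illustration (this document only says "ANSI"), and the function name `extract_text` is hypothetical.

```python
def extract_text(word_document: bytes, pieces: list) -> str:
    """Read every text segment from the stream and splice the results together."""
    parts = []
    for char_set, count, pos in pieces:
        if char_set == 1:
            # ANSI: one byte per character (cp1252 is an assumed code page).
            raw = word_document[pos : pos + count]
            parts.append(raw.decode("cp1252", errors="replace"))
        else:
            # Unicode: two bytes per character, little-endian UTF-16.
            raw = word_document[pos : pos + 2 * count]
            parts.append(raw.decode("utf-16-le", errors="replace"))
    return "".join(parts)
```

Collecting the decoded segments in a list and joining once at the end avoids repeated string concatenation when a document has many segments.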
Compared with extracting the text through secondary-interface development, the Java POI interface or a conversion tool, parsing the binary format directly offers good compatibility, high efficiency and a small program package: it does not depend on Office Word/Jinshan WPS components, requires neither an Office installation nor extracted Office component files, and calls only system API functions. Because the file is read in binary mode and the data is located precisely, execution efficiency is significantly improved. The whole implementation is hand-coded and calls only system API functions, without depending on any third-party program files. By parsing the binary data of the Word document, all of the text in the document is extracted according to the way the text is stored in the file. The invention is not limited to extracting text from Office Word files; any file that uses the Office Word text storage scheme, such as a file produced by Jinshan WPS, can be processed by the method.
As shown in fig. 2, the embodiment of the present invention further discloses a system for parsing word binary format and extracting text in a document, where the system includes:
an initialization module, configured to open the WordDocument data stream of a Word document in binary mode and to read the data header FIB and the data structure storage linked list FIBTable respectively;
a root node reading module, configured to read the start position fcClx and the offset lcbClx of the text data from the FIBTable and to read the text root node Clx according to fcClx and lcbClx;
a text linked list establishing module, configured to establish a linked list PcdtList of the text segments according to the root node Clx;
a text extraction module, configured to traverse the linked list PcdtList and, for each text node Pcdt in it, to read the encoding format value, calculate the number of characters and the storage position of the characters, and read the character data stored for that node from the WordDocument data stream according to the encoding format value, the number of characters and the storage position;
and a text splicing module, configured to splice the character data extracted from every node of the linked list PcdtList to form the text data of the whole Word file.
The initialization module reads and initializes the file: the Word document is opened in binary mode, its WordDocument data stream is opened, and the 32-byte data header FIB is read from the start of the stream. The FIB records basic information about the WordDocument data stream.
The 774-byte data structure storage linked list FIBTable is read starting at the 154th byte of the WordDocument data stream; the FIBTable records the start address and the offset, within the binary data, of the various kinds of data in the Word document.
The root node reading module reads the start address fcClx and the offset lcbClx from the FIBTable and reads the text root node Clx according to fcClx and lcbClx.
The text linked list establishing module builds a linked list PcdtList of the text segments from the root node Clx; each text node Pcdt records the number of characters, the character encoding of the characters, the offset position of the characters, and so on.
The text extraction module traverses the linked list PcdtList: each text node Pcdt is taken from the list and its encoding format CharSet is read. A value of 1 indicates a character string stored in ANSI format, and a value of 0 indicates a character string stored in Unicode format. The number of characters and the storage position of the characters are then calculated.
The character data stored for the text node Pcdt is read from the WordDocument data stream according to the encoding format CharSet, the character count Count and the storage position Pos. The text splicing module extracts the text of all nodes Pcdt in the linked list PcdtList in the same way and splices it together to obtain the text data of the whole Word file.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A method for analyzing word binary format and extracting characters in a document is characterized by comprising the following steps:
S1, opening the WordDocument data stream of the Word document in binary mode, and reading the data header FIB and the data structure storage linked list FIBTable respectively;
S2, reading the start position fcClx and the offset lcbClx of the text data from the FIBTable, and reading the text root node Clx according to fcClx and lcbClx;
S3, establishing a linked list PcdtList of the text segments according to the root node Clx;
S4, traversing the linked list PcdtList and, for each text node Pcdt in it, reading the encoding format value, calculating the number of characters and the storage position of the characters, and reading the character data stored for that node from the WordDocument data stream according to the encoding format value, the number of characters and the storage position;
S5, splicing the character data extracted from every node of the linked list PcdtList to form the text data of the whole Word file.
2. The method of claim 1, wherein the data header FIB is the 32 bytes of content counted from the start of the WordDocument stream, and the data structure storage linked list FIBTable is the 774 bytes of content counted from the 154th byte of the WordDocument stream.
3. The method of claim 1, wherein each text node Pcdt in the linked list PcdtList records the number of characters, the character encoding of the characters and the offset position of the characters.
4. The method for parsing word binary format and extracting characters in a document according to any one of claims 1-3, wherein an encoding format value of 1 indicates a character string stored in ANSI format, and an encoding format value of 0 indicates a character string stored in Unicode format.
5. A system for parsing word binary format and extracting text in a document, the system comprising:
an initialization module, configured to open the WordDocument data stream of a Word document in binary mode and to read the data header FIB and the data structure storage linked list FIBTable respectively;
a root node reading module, configured to read the start position fcClx and the offset lcbClx of the text data from the FIBTable and to read the text root node Clx according to fcClx and lcbClx;
a text linked list establishing module, configured to establish a linked list PcdtList of the text segments according to the root node Clx;
a text extraction module, configured to traverse the linked list PcdtList and, for each text node Pcdt in it, to read the encoding format value, calculate the number of characters and the storage position of the characters, and read the character data stored for that node from the WordDocument data stream according to the encoding format value, the number of characters and the storage position;
and a text splicing module, configured to splice the character data extracted from every node of the linked list PcdtList to form the text data of the whole Word file.
6. The system for parsing word binary format and extracting characters in a document according to claim 5, wherein the data header FIB is the 32 bytes of content counted from the start of the WordDocument stream, and the data structure storage linked list FIBTable is the 774 bytes of content counted from the 154th byte of the WordDocument stream.
7. The system of claim 5, wherein each text node Pcdt in the linked list PcdtList records the number of characters, the character encoding of the characters and the offset position of the characters.
8. The system for parsing word binary format and extracting characters in a document according to any one of claims 5-7, wherein an encoding format value of 1 indicates a character string stored in ANSI format, and an encoding format value of 0 indicates a character string stored in Unicode format.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010031347.8A CN111241787A (en) 2020-01-13 2020-01-13 Method and system for analyzing word binary format and extracting characters in document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010031347.8A CN111241787A (en) 2020-01-13 2020-01-13 Method and system for analyzing word binary format and extracting characters in document

Publications (1)

Publication Number Publication Date
CN111241787A true CN111241787A (en) 2020-06-05

Family

ID=70872598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010031347.8A Pending CN111241787A (en) 2020-01-13 2020-01-13 Method and system for analyzing word binary format and extracting characters in document

Country Status (1)

Country Link
CN (1) CN111241787A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767733A (en) * 2020-06-11 2020-10-13 安徽旅贲科技有限公司 Document security classification discrimination method based on statistical word segmentation
CN112256268A (en) * 2020-09-28 2021-01-22 中孚安全技术有限公司 Method, system and equipment for analyzing nested file in WORD

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902918A (en) * 2012-12-30 2014-07-02 航天信息股份有限公司 Method and device for rapidly extracting text from Word document
CN106372507A (en) * 2016-08-30 2017-02-01 北京奇虎科技有限公司 Method and device for detecting malicious document
CN106372508A (en) * 2016-08-30 2017-02-01 北京奇虎科技有限公司 Method and device for processing malicious documents
US10019535B1 (en) * 2013-08-06 2018-07-10 Intuit Inc. Template-free extraction of data from documents
CN108920612A (en) * 2018-06-28 2018-11-30 山东中孚安全技术有限公司 Parsing doc binary format and the method and system for extracting picture in document

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MICROSOFT CORPORATION: "了解 Word MS-DOC 二进制文件格式", 《HTTPS://DOCS.MICROSOFT.COM/ZH-CN/PREVIOUS-VERSIONS/OFFICE/GG615596(V=OFFICE.14)》 *
廖怨婷等: "Word文本解析和关键字快速匹配方法", 《通信技术》 *

Similar Documents

Publication Publication Date Title
CN108491199B (en) Method and terminal for automatically generating interface
US8495136B2 (en) Transaction-initiated batch processing
US8635634B2 (en) Seamless multiple format metadata abstraction
CN101558405B (en) Migration apparatus which convert database of mainframe system into database of open system and method for thereof
US7958133B2 (en) Application conversion of source data
CN111241787A (en) Method and system for analyzing word binary format and extracting characters in document
WO2006136055A1 (en) A text data mining method
CN108959200A (en) A kind of method and system for extracting the picture in PPT document
JP2008052740A (en) Spell checking method for document with marked data block, and signal carrying medium
US20070150477A1 (en) Validating a uniform resource locator ('URL') in a document
CN111552839B (en) Object conversion method based on XML template
CN116561146A (en) Database log recording method, device, computer equipment and computer readable storage medium
CN112084046B (en) Method and device for calling generalization interface in distributed computing
CN110839022A (en) Vehicle-mounted control software communication protocol analysis method based on xml language
CN113918770B (en) Method and device for converting character string and time field
US20020194352A1 (en) Infrared transmission system with automatic character identification
CN101553800B (en) Migration apparatus which convert SAM/VSAM files of mainframe system into SAM/VSAM files of open system and method for thereof
US20060294127A1 (en) Tagging based schema to enable processing of multilingual text data
CN112380142A (en) Interface document management method and device and test equipment
CN111241096A (en) Text extraction method, system, terminal and storage medium for EXCEL document
US20050071756A1 (en) XML to numeric conversion method, system, article of manufacture, and computer program product
CN111782882A (en) TCP message conversion method, device, system and computer storage medium
CN111444680B (en) Encoding expansion method and device for rarely used words, storage medium and electronic equipment
Kalin Input and Output
CN111046841A (en) Character extraction method, system, terminal and storage medium of PowerPoint file

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200605
