CN109918622A

CN109918622A - Java-based method and system for converting Word documents to LaTeX documents

Info

Publication number: CN109918622A
Application number: CN201910143870.7A
Authority: CN
Inventors: 宋军; 徐衡; 朱超群; 彭艳; 张坤; 曹威; 吴雅笛
Original assignee: China University of Geosciences
Current assignee: Beijing anzhengtong Information Technology Co.,Ltd.
Priority date: 2019-02-27
Filing date: 2019-02-27
Publication date: 2019-06-21
Anticipated expiration: 2039-02-27
Also published as: CN109918622B

Abstract

The invention discloses a method and system for converting a Word document to a LaTeX document. For a Word document file submitted by a user, the system uses JACOB to perform an initial analysis of the data in the file, such as text, pictures, formulas, and tables. The data elements in the source file are extracted using Apache POI and JACOB, and the relative position of each element is recorded. The extracted text elements are classified with the naive Bayes algorithm, and the formulas in the source file are converted using a stacked autoencoder. The relative position information is combined with each data element to form the information stream of the LaTeX target document. This information stream is written to the target file, yielding the final LaTeX document. The invention reduces the difficulty and complexity of converting Word documents to LaTeX documents, provides a professional document conversion method for university teachers, students, and researchers, and improves the efficiency of document processing.

Description

Java-based method and system for converting Word documents to LaTeX documents

Technical field

The present invention relates to the field of document conversion and data processing, and more specifically to a Java-based method for converting a Word document to a LaTeX document.

Background art

TeX provides a powerful and extremely flexible typesetting language with as many as 900 commands. TeX also supports macros, so users can define new commands of their own to keep extending the TeX system. LaTeX, developed by Leslie Lamport, is currently the most popular and most widely used TeX macro package in the world. Microsoft Office Word, the core program of the Office suite, provides many easy-to-use document creation tools and is the word processor with the largest market share today. Word's own file format (.docx) has become the de facto most common document standard. Document conversion means converting between document formats such as Word, Pdf, Txt, Ooxml, Odf, and Html. Examples include a previously proposed method for converting Ooxml and Odf documents into HTML documents, and the conversion between the Word and Pdf formats implemented by Adobe Acrobat Professional. Apache POI is an open-source Java library whose main purpose is to access the underlying Word file. JACOB is a Java-COM bridge through which a Java application can call COM components and Win32 libraries. Together, Apache POI and JACOB make it possible to read and write files in the Microsoft Office Word format.

In the course of realizing the present invention, the inventors found that existing document conversion suffers from three main classes of problems, both in the technology itself and in how users work with it. First, existing conversion techniques generally target a small number of source formats and one specific target format; their functionality is narrow and their practical value to users is limited. Second, converting between documents that use different encoding schemes is difficult, as is the case for Microsoft Office Word and LaTeX documents. Finally, a LaTeX document is written in the TeX markup language, and producing a complete LaTeX document requires command of nearly all TeX syntax rules as well as the ability to write code; for non-specialists, writing and typesetting such a document is difficult and complex.

Summary of the invention

The technical problem to be solved by the present invention is to address the above defects by providing a method and system for converting a Word document to a LaTeX document.

The technical solution adopted by the present invention to solve this technical problem is a Java-based method for converting a Word document to a LaTeX document, comprising the following steps:

S1, open the Word source document file submitted by the user through the Word invocation module of the JACOB component;

S2, in the opened source document file, perform an initial analysis of the various data elements in the file through the JACOB component, and obtain and record the data information of each data element in the source document file;

S3, according to the data information recorded in step S2, extract the various data elements in the source document file using the Apache POI component and the JACOB component;

S4, convert the data elements extracted in step S3 into information streams, each class of data element forming its own corresponding information stream;

S5, combine the data information recorded in step S2 with the information stream of each class of data element to form the information stream of the LaTeX target document, while keeping the position of each data element in the source document file unchanged;

S6, write the information stream of the LaTeX target document formed in step S5 to the target file, thereby converting the Word source document file into a LaTeX document.
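Steps S5 and S6 can be illustrated with a minimal sketch in plain Java. It assumes each converted element is already available as a LaTeX fragment paired with the relative position recorded in step S2. The names `LatexAssembler` and `Fragment`, and the bare `article` preamble, are hypothetical and are not taken from the described system.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LatexAssembler {

    /** One converted element: its relative position in the source and its LaTeX fragment. */
    public static final class Fragment {
        public final int position;   // relative position recorded in step S2 (assumed to be an integer index)
        public final String latex;   // per-element information stream produced in step S4

        public Fragment(int position, String latex) {
            this.position = position;
            this.latex = latex;
        }
    }

    /** Step S5: merge the per-class streams in the order of their recorded positions. */
    public static String assemble(List<Fragment> fragments) {
        List<Fragment> ordered = new ArrayList<>(fragments);
        ordered.sort(Comparator.comparingInt(f -> f.position));
        StringBuilder sb = new StringBuilder();
        sb.append("\\documentclass{article}\n\\begin{document}\n");
        for (Fragment f : ordered) {
            sb.append(f.latex).append("\n");
        }
        sb.append("\\end{document}\n");
        return sb.toString();
    }

    /** Step S6: write the assembled stream to the target file as UTF-8. */
    public static void write(Path target, String latex) throws IOException {
        Files.write(target, latex.getBytes(StandardCharsets.UTF_8));
    }
}
```

Sorting on the recorded position is what keeps each element where it was in the source document, regardless of the order in which the text, picture, table, and formula streams were produced.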

Further, the data information obtained and recorded in step S2 includes the category of each data element and its relative position in the source document; the data elements analyzed by the JACOB component include text, picture, table, and formula elements.

Further, the initial analysis of the data elements in the source file in step S2 specifically determines the storage state of every data element in the source file.

Further, in step S2 the category and relative position of each data element are recorded through the Paragraphs, Item, Text, and Table interfaces of the JACOB component.

Further, extracting the various data elements in the source document file in step S3 includes:

For text elements, the text elements in the source document are extracted through the get("Text"), get("Font"), and get("Size") functions of the JACOB component; a text element includes the text content, the text type, and the text format;

For picture elements, the picture elements in the source document are extracted using the XWPFDocument interface of the Apache POI component, and each extracted picture element is saved as a local file using Java's built-in FileOutputStream;

For table elements, the table elements in the source document are extracted by combining the getTable function and the ReadTable function of the JACOB component; the dimensions of a table are obtained through the getTableRowsCount and getTableColumnsCount methods of the JACOB component;

For formula elements, using the data information recorded in step S2, the formula elements of the source document are extracted through the copy method of the JACOB component together with the getContents function of the clipboard class reached through the Toolkit utility class; the clipboard contents are obtained as a Transferable, and the data are converted through the getTransferData method;

Whenever a class of data element is extracted, its relative position in the source document is recorded.
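The clipboard half of the formula extraction can be sketched with the standard `java.awt.datatransfer` API alone. The sketch below assumes the formula has already been copied (in the described method, by the JACOB copy call) and that the clipboard offers a plain-text flavor. The class name `ClipboardFormulaReader` is hypothetical, and a `StringSelection` stands in for the real clipboard so that the example runs without Word.

```java
import java.awt.datatransfer.DataFlavor;
import java.awt.datatransfer.StringSelection;
import java.awt.datatransfer.Transferable;
import java.awt.datatransfer.UnsupportedFlavorException;
import java.io.IOException;

public class ClipboardFormulaReader {

    /**
     * Reads plain text from a Transferable. Returns null when there is no
     * content, when no string flavor is offered, or when the read fails.
     */
    public static String readText(Transferable contents) {
        if (contents == null || !contents.isDataFlavorSupported(DataFlavor.stringFlavor)) {
            return null;
        }
        try {
            return (String) contents.getTransferData(DataFlavor.stringFlavor);
        } catch (UnsupportedFlavorException | IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // In a desktop session the Transferable would come from
        // Toolkit.getDefaultToolkit().getSystemClipboard().getContents(null);
        // a StringSelection is used here as a stand-in.
        Transferable stub = new StringSelection("E = mc^2");
        System.out.println(readText(stub));
    }
}
```

Which flavors Word actually places on the clipboard for an equation is not stated in the text, so the string flavor used here is an assumption.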

Further, the data elements include text, picture, table, and formula elements. In step S4, the naive Bayes algorithm is used to classify the extracted text elements, forming the corresponding LaTeX text element information stream; in step S4, the extracted formula elements are converted using a stacked autoencoder, forming the corresponding LaTeX formula element information stream; in step S4, the remaining data elements are formed directly into information streams in the target document format according to their relative position information.

Further, the steps of classifying the extracted text elements with the naive Bayes algorithm include:

A1, the n extracted text elements are segmented with the JIEBA word segmentation algorithm and converted into an n-dimensional feature vector X = {x₁, x₂, …, x_n}, where x_i is the i-th feature, i ∈ {1, …, n};

A2, the classification of the extracted text data is treated as a binary classification problem, that is, any unknown text data sample d belongs to the category set C = {C₀, C₁}, where C₀ denotes body text and C₁ denotes title text;

A3, the type of each piece of text data is identified with the naive Bayes algorithm as one of two classes, body text or title text;

A4, the probability P that the unknown text sample d belongs to class c is calculated as:

The class with the largest probability value is taken as the class of the unknown text sample d, and the corresponding LaTeX text element is formed according to that class.
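The formula for step A4 is not reproduced in this text. Assuming the standard naive Bayes form, in which the features x₁, …, x_n from step A1 are treated as conditionally independent given the class, the probability and the resulting decision rule would read as follows. This is a sketch of the textbook form, not necessarily the exact expression in the original filing.

```latex
P(c \mid d) = \frac{P(c)\,\prod_{i=1}^{n} P(x_i \mid c)}{P(d)},
\qquad c \in \{C_0, C_1\}

\hat{c} = \arg\max_{c \in \{C_0, C_1\}} \; P(c)\,\prod_{i=1}^{n} P(x_i \mid c)
```

The denominator P(d) is the same for both classes, so it can be dropped when only the class with the largest probability is needed.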

Further, the steps of converting the extracted formula elements using the stacked autoencoder include:

B1, the formula elements extracted in step S3 are encoded with the stacked autoencoder algorithm;

B2, the encoding result obtained in step B1 is approximately matched against the existing encoded data in the formula template library;

B3, the formula template data with the highest matching degree is input into the system's formula conversion function module WordMathToLaTeX, which further converts the formula format of the source file into an encoding that a LaTeX document can recognize.
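The approximate matching in steps B2 and B3 can be sketched as a nearest-neighbour search. The sketch assumes that the stacked autoencoder has already turned the formula and every template into fixed-length numeric vectors, and it uses the Euclidean distance that the text specifies for step B3. The class name `TemplateMatcher` and both method names are hypothetical.

```java
public class TemplateMatcher {

    /** Euclidean distance between two code vectors of equal length. */
    public static double distance(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the index of the template code closest to the encoded formula,
     * or -1 when the template library is empty.
     */
    public static int bestMatch(double[] encoded, double[][] templateCodes) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int k = 0; k < templateCodes.length; k++) {
            double d = distance(encoded, templateCodes[k]);
            if (d < bestDist) {
                bestDist = d;
                best = k;
            }
        }
        return best;
    }
}
```

Here the smallest distance is read as the highest matching degree; the index returned would then select the template passed on to the conversion module.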

Further, in step B3 of converting the extracted formula elements using the stacked autoencoder, the formula template with the highest matching degree is determined from the Euclidean distance between the stacked autoencoder result x and a known sample y, according to the expression:

where x₁, x₂, …, x_n and y₁, y₂, …, y_n are the components of the respective vectors after formula encoding.
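The expression itself is not reproduced in this text. With x = (x₁, …, x_n) and y = (y₁, …, y_n) as defined above, the standard Euclidean distance is given below; the symbol d(x, y) is introduced here only for illustration.

```latex
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
```

Under this reading, the template whose code y minimizes d(x, y) is the one with the highest matching degree.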

The present invention also proposes a Java-based system for converting Word documents to LaTeX documents, which performs document conversion using any one of the above methods for converting a Word document to a LaTeX document.

In the Java-based method and system for converting Word documents to LaTeX documents of the present invention, machine learning algorithms are applied to the original Word document provided by the user to analyze the source file data intelligently and to select automatically the closest or best-matching text elements and formula elements. The overall layout of the source file data is integrated with the specific encoding of the target document to form the target file data stream, together with supplementary streams such as the table of contents, figure captions, tables, and notes. These are written to the target file, realizing conversion between different document types.

Implementing the Java-based method and system for converting Word documents to LaTeX documents proposed by the present invention has the following beneficial effects:

1, it reduces the difficulty and complexity of converting between different document types, and gives university teachers, students, researchers, and others a convenient and professional means of document conversion;

2, it lets users convert a simple Word file into the submission format required for professional technical papers. This spares researchers and university teachers and students from learning complex LaTeX code and spending large amounts of time re-typesetting their papers, improves work efficiency, and fills the domestic gap in Word-to-LaTeX document conversion.

Brief description of the drawings

The present invention is further described below with reference to the accompanying drawings and embodiments. In the drawings:

Fig. 1 is the flowchart of converting a Word document to a LaTeX document;

Fig. 2 is the flowchart of processing the extracted text elements and formula elements with the naive Bayes algorithm and the stacked autoencoder;

Fig. 3 shows the table conversion result when a Word document is converted to a LaTeX document;

Fig. 4 shows the picture conversion result when a Word document is converted to a LaTeX document;

Fig. 5 shows the formula conversion result when a Word document is converted to a LaTeX document;

Fig. 6 shows the overall conversion result when a Word document is converted to a LaTeX document.

Detailed description of the embodiments

For a clearer understanding of the technical features, objects, and effects of the present invention, specific embodiments of the invention are now described in detail with reference to the accompanying drawings.

Refer to Fig. 1, which is the flowchart of converting a Word document to a LaTeX document. The Java-based method for converting a Word document to a LaTeX document proposed by the present invention specifically includes the following steps:

S1, open the Word source document file submitted by the user through the Word invocation module of the JACOB component.

S2, in the opened source document file, perform an initial analysis of the various data elements in the file through the JACOB component, and obtain and record the data information of each data element in the source document file. The data information obtained and recorded includes the category of each data element and its relative position in the source document file; in this embodiment, the category and relative position of each data element are recorded through the Paragraphs, Item, Text, and Table interfaces of the JACOB component. The data elements analyzed by the JACOB component include text, picture, table, and formula elements. The initial analysis specifically determines the storage state of every data element in the source document file.

S3, according to the data information recorded in step S2, extract the various data elements in the source document file using the Apache POI component and the JACOB component. Extracting the various data elements in the source document file includes:

For text elements, the text elements in the source document file are extracted through the get("Text"), get("Font"), and get("Size") functions of the JACOB component; a text element includes the text content, the text type, and the text format;

For picture elements, the picture elements in the source document file are extracted using the XWPFDocument interface of the Apache POI component, and each extracted picture element is saved as a local file using Java's built-in FileOutputStream;

For formula elements, using the data information recorded in step S2, the formula elements of the source document are extracted through the copy method of the JACOB component together with the getContents function of the clipboard class reached through the Toolkit utility class.

S4, convert the data elements extracted in step S3 into information streams, each class of data element forming its own corresponding information stream. The processing into information streams specifically includes: converting the extracted formula elements using a stacked autoencoder, forming the corresponding LaTeX formula element information stream; classifying the extracted text elements with the naive Bayes algorithm, forming the corresponding LaTeX text element information stream; and forming the remaining data elements directly into information streams in the target document format according to their relative position information.
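For the remaining elements in step S4, tables and pictures, the direct conversion into the target format can be sketched as follows. The sketch assumes the table cells have already been read into a rectangular array of strings and that the picture has been saved under a known file name. The class `TableToLatex`, its method names, and the centred, fully ruled column layout are hypothetical choices, and cell text is not escaped for LaTeX special characters.

```java
public class TableToLatex {

    /** Renders a non-empty rectangular grid of cells as a LaTeX tabular environment. */
    public static String toTabular(String[][] cells) {
        if (cells.length == 0 || cells[0].length == 0) {
            throw new IllegalArgumentException("empty table");
        }
        int cols = cells[0].length;   // column count, as read from the source table
        StringBuilder sb = new StringBuilder("\\begin{tabular}{");
        for (int c = 0; c < cols; c++) {
            sb.append("|c");
        }
        sb.append("|}\n\\hline\n");
        for (String[] row : cells) {
            // Cells are joined with the column separator; each row ends with a line break and a rule.
            sb.append(String.join(" & ", row)).append(" \\\\\n\\hline\n");
        }
        sb.append("\\end{tabular}");
        return sb.toString();
    }

    /** Renders a saved picture file as an includegraphics command (requires the graphicx package). */
    public static String toIncludeGraphics(String fileName) {
        return "\\includegraphics{" + fileName + "}";
    }
}
```

Each returned string is one element's information stream; it would be paired with the element's recorded relative position before the final document is assembled.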

Refer to Fig. 2, which is the flowchart of processing the extracted text elements and formula elements with the naive Bayes algorithm and the stacked autoencoder. Specifically, the steps of classifying the extracted text elements with the naive Bayes algorithm include:

The class with the largest probability value is taken as the class δ of the unknown text sample d, and the corresponding LaTeX text element is formed according to class δ;
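The body-versus-title decision can be sketched as a small multinomial naive Bayes classifier in plain Java. The sketch assumes the text has already been segmented into tokens (JIEBA is not called here), uses add-one smoothing, and maps a title to a `\section` command. All of these choices, and the class name `TextKindClassifier`, are illustrative assumptions rather than details from the described system.

```java
import java.util.HashMap;
import java.util.Map;

public class TextKindClassifier {
    public static final int BODY = 0;    // class C0, body text
    public static final int TITLE = 1;   // class C1, title text

    private final int[] docCount = new int[2];     // training samples per class
    private final int[] tokenCount = new int[2];   // training tokens per class
    private final Map<String, int[]> counts = new HashMap<>();

    /** Adds one labelled, already segmented sample to the training counts. */
    public void train(String[] tokens, int label) {
        docCount[label]++;
        for (String t : tokens) {
            int[] c = counts.get(t);
            if (c == null) {
                c = new int[2];
                counts.put(t, c);
            }
            c[label]++;
            tokenCount[label]++;
        }
    }

    /** Log of P(c) times the product of P(x_i | c), with add-one smoothing. */
    public double logScore(String[] tokens, int label) {
        int totalDocs = docCount[0] + docCount[1];
        int vocab = counts.size();
        double score = Math.log((docCount[label] + 1.0) / (totalDocs + 2.0));
        for (String t : tokens) {
            int[] c = counts.get(t);
            int seen = (c == null) ? 0 : c[label];
            score += Math.log((seen + 1.0) / (tokenCount[label] + vocab + 1.0));
        }
        return score;
    }

    /** Returns the class with the larger score; ties go to BODY. */
    public int classify(String[] tokens) {
        return logScore(tokens, TITLE) > logScore(tokens, BODY) ? TITLE : BODY;
    }

    /** Maps the predicted class to a LaTeX text element. */
    public String toLatex(String text, String[] tokens) {
        return classify(tokens) == TITLE ? "\\section{" + text + "}" : text + "\n";
    }
}
```

Working in log space avoids numeric underflow when many small probabilities are multiplied; the class chosen is the same as with the raw product.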

Specifically, the steps of converting the extracted formula elements using the stacked autoencoder include:

B3, the formula template with the highest matching degree is input into the system's formula conversion function module WordMathToLaTeX, which further converts the formula format of the source file into an encoding that a LaTeX document can recognize. The formula template with the highest matching degree is determined from the Euclidean distance between the stacked autoencoder result x and a known sample y, according to the expression:

Based on the above principle, the present invention further proposes a Java-based system for converting Word documents to LaTeX documents, which provides document conversion using any one of the above methods for converting a Word document to a LaTeX document.

Fig. 3 shows the table conversion result when a Word document is converted to a LaTeX document; Fig. 4 shows the picture conversion result; Fig. 5 shows the formula conversion result; Fig. 6 shows the overall conversion result. As Figs. 3 to 6 show, the Java-based method for converting a Word document to a LaTeX document proposed by the present invention can effectively convert a Word document into a LaTeX document.

The embodiments of the present invention have been described above with reference to the accompanying drawings, but the invention is not limited to the specific embodiments described, which are merely illustrative rather than restrictive. Inspired by the present invention, those of ordinary skill in the art can devise many further forms without departing from the purpose of the invention and the scope protected by the claims, all of which fall within the protection of the present invention.

Claims

1. A Java-based method for converting a Word document to a LaTeX document, characterized by comprising the following steps:

S1, opening the Word source document file submitted by the user through the Word invocation module of the JACOB component;

S2, in the opened source document file, performing an initial analysis of the various data elements in the file through the JACOB component, and obtaining and recording the data information of each data element in the source document file;

S3, according to the data information recorded in step S2, extracting the various data elements in the source document file using the Apache POI component and the JACOB component;

S4, converting the data elements extracted in step S3 into information streams, each class of data element forming its own corresponding information stream;

S5, combining the data information recorded in step S2 with the information stream of each class of data element to form the information stream of the LaTeX target document, while keeping the position of each data element in the source document file unchanged;

S6, writing the information stream of the LaTeX target document formed in step S5 to the target file, thereby converting the Word source document file into a LaTeX document.

2. The method for converting a Word document to a LaTeX document according to claim 1, characterized in that the data information obtained and recorded in step S2 includes the category of each data element and its relative position in the source document file; the data elements analyzed by the JACOB component include text, picture, table, and formula elements.

3. The method for converting a Word document to a LaTeX document according to claim 1, characterized in that the initial analysis of the data elements in the source document file in step S2 specifically determines the storage state of every data element in the source document file.

4. The method for converting a Word document to a LaTeX document according to claim 1, characterized in that in step S2 the category and relative position of each data element are recorded through the Paragraphs, Item, Text, and Table interfaces of the JACOB component.

5. The method for converting a Word document to a LaTeX document according to claim 1, characterized in that extracting the various data elements in the source document file in step S3 includes:

For text elements, the text elements in the source document file are extracted through the get("Text"), get("Font"), and get("Size") functions of the JACOB component; a text element includes the text content, the text type, and the text format;

For picture elements, the picture elements in the source document file are extracted using the XWPFDocument interface of the Apache POI component, and each extracted picture element is saved as a local file using Java's built-in FileOutputStream;

For table elements, the table elements in the source document are extracted by combining the getTable function and the ReadTable function of the JACOB component; the dimensions of a table are obtained through the getTableRowsCount and getTableColumnsCount methods of the JACOB component;

6. The method for converting a Word document to a LaTeX document according to claim 1, characterized in that the data elements include text, picture, table, and formula elements; in step S4, the naive Bayes algorithm is used to classify the extracted text elements, forming the corresponding LaTeX text element information stream; in step S4, the extracted formula elements are converted using a stacked autoencoder, forming the corresponding LaTeX formula element information stream; in step S4, the remaining data elements are formed directly into information streams in the target document format according to their relative position information.

7. The method for converting a Word document to a LaTeX document according to claim 6, characterized in that the steps of classifying the extracted text elements with the naive Bayes algorithm include:

A1, the n extracted text elements are segmented with the JIEBA word segmentation algorithm and converted into an n-dimensional feature vector X = {x₁, x₂, …, x_n}, where x_i is the i-th feature, i ∈ {1, …, n};

A2, the classification of the extracted text data is treated as a binary classification problem, that is, any unknown text data sample d belongs to the category set C = {C₀, C₁}, where C₀ denotes body text and C₁ denotes title text;

A3, the type of each piece of text data is identified with the naive Bayes algorithm as one of two classes, body text or title text;

The class with the largest probability value is taken as the class δ of the unknown text sample d, and the corresponding LaTeX text element is formed according to class δ.

8. The method for converting a Word document to a LaTeX document according to claim 6, characterized in that the steps of converting the extracted formula elements using the stacked autoencoder include:

B3, the formula template with the highest matching degree is input into the system's formula conversion function module WordMathToLaTeX, which further converts the formula format of the source file into an encoding that a LaTeX document can recognize.

9. The method for converting a Word document to a LaTeX document according to claim 8, characterized in that in step B3 the formula template with the highest matching degree is determined from the Euclidean distance between the stacked autoencoder result x and a known sample y, according to the expression:

10. A Java-based system for converting a Word document to a LaTeX document, characterized in that it performs document conversion using the method for converting a Word document to a LaTeX document according to any one of claims 1 to 9.