CN106802937A - The conversion method and system of Word document - Google Patents

Conversion method and system for Word documents

Info

Publication number
CN106802937A
Authority
CN
China
Prior art keywords
document
word document
word
predefined
markup language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611252467.0A
Other languages
Chinese (zh)
Inventor
诸葛峰
谢志雄
李济君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Sino Gifted Education Technology Development Co Ltd
Original Assignee
Jiangsu Sino Gifted Education Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Sino Gifted Education Technology Development Co Ltd
Priority to CN201611252467.0A
Publication of CN106802937A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/178 Techniques for file synchronisation in file systems
    • G06F16/1794 Details of file format conversion

Abstract

The present invention discloses a conversion method for Word documents, comprising the steps of: converting the full text of a Word document into HTML markup text and outputting the HTML markup text; setting a predefined structure with regular expressions, performing search matching on the HTML markup text with the predefined structure, and outputting preliminary structured document data; and, according to the error messages prompted after the predefined-structure search matching, having the user manually correct each level and the content of the structure in the preliminary structured document data, and outputting complete structured document data. Through HTML conversion of the Word document, predefined-structure search matching and human-assisted correction, the method converts content organized in natural language in a Word document into structured document data organized in a computer language, which makes the content data convenient to store, query and analyze.

Description

Conversion method and system for Word documents
Technical field
The present invention relates to the technical field of Word document conversion, and more particularly to a conversion method and system for Word documents.
Background
Word is currently the most popular electronic document tool. The prior art usually concerns converting structured document data (such as XML or JSON) into Word documents, or information extraction based on Word documents.
However, a Word file is itself a binary file, so a computer cannot access its data directly by text retrieval. Current Word document information extraction techniques, which address this problem, only retrieve and extract target content; they cannot completely reproduce, as structured document data, the content of a Word document together with its original natural-language organizational structure.
Summary of the invention
In view of the shortcomings of the above technology, the present invention provides a conversion method and system for Word documents. Through HTML conversion of the Word document, predefined-structure search matching and human-assisted correction, content organized in natural language in a Word document is converted into structured document data organized in a computer language, which makes the content data convenient to store, query and analyze.
To achieve these objects and other advantages, the present invention is realized through the following technical solutions:
The present invention provides a conversion method for Word documents, comprising the following steps:
HTML conversion of the Word document: converting the full text of the Word document into HTML markup text, and outputting the HTML markup text;
Predefined-structure search matching: setting a predefined structure with regular expressions, performing search matching on the HTML markup text with the predefined structure, and outputting preliminary structured document data;
Human-assisted correction: according to the error messages prompted after the predefined-structure search matching, the user manually corrects each level and the content of the structure in the preliminary structured document data, and complete structured document data are output.
Preferably, the HTML conversion of the Word document comprises the following steps:
converting all text in the Word document into HTML markup text through Office automation;
converting all non-text resources in the Word document into Base64-encoded text characters;
storing the HTML markup text and the Base64-encoded text characters in the HTML file.
Preferably, the non-text resources include pictures and objects embedded in the Word document.
Preferably, nesting relationships are defined between the structures of the predefined structure, and the search matching includes recursive search matching.
Preferably, the manual correction includes: adding, deleting and moving each level of the structure in the preliminary structured document data; and adding, deleting and modifying the content of the structure in the preliminary structured document data.
Preferably, the manual correction further includes:
updating the predefined structure: after the content of the structure in the preliminary structured document data has been added to, deleted or modified, adding custom information to each level of the structure.
Preferably, the complete structured document data include structured document data in XML and JSON.
A Word document conversion system, comprising:
a local program end, which receives requests from the browser end, selects a Word document, converts the full text of the Word document into HTML markup text, and outputs the HTML markup text;
a browser end, which exchanges Ajax requests with the local program end, sets the predefined structure and its updates, performs the search matching and carries out the human-assisted correction, and outputs complete structured document data; and
a server end, which receives and stores the complete structured document data output by the browser end.
The present invention has at least the following beneficial effects:
1) The conversion method for Word documents provided by the present invention, through HTML conversion of the Word document, predefined-structure search matching and human-assisted correction, converts content organized in natural language in a Word document into structured document data organized in a computer language, which makes the content data convenient to store, query and analyze;
2) Nesting relationships are defined between the structures of the predefined structure, so the search matching includes recursive search matching. The output structured document data are therefore complete and interrelated, and the content of the Word document, with its original natural-language organizational structure, is completely reproduced as structured document data;
3) The manual correction of each level and of the content of the structure in the preliminary structured document data, and the updating of the predefined structure, each improve the accuracy of the output complete structured document data.
Further advantages, objects and features of the present invention will in part become apparent from the following description, and in part will be understood by those skilled in the art through study and practice of the invention.
Brief description of the drawings
Fig. 1 is a flow chart of the conversion method for Word documents of the present invention;
Fig. 2 is a flow chart of the HTML conversion of the Word document of the present invention;
Fig. 3 is a schematic diagram of the conversion system for Word documents of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings, so that those skilled in the art can implement it by following the text of the specification.
It should be understood that terms such as "have", "comprise" and "include" used herein do not exclude the presence or addition of one or more other elements or combinations thereof.
Embodiment 1
As shown in Fig. 1, the present invention provides a conversion method for Word documents, comprising the steps of:
S10, HTML conversion of the Word document: the full text of the Word document is converted into HTML markup text, and the HTML markup text is output.
S20, predefined-structure search matching: a predefined structure with regular expressions is set, search matching is performed on the HTML markup text with the predefined structure, and preliminary structured document data are output.
S30, human-assisted correction: according to the error messages prompted after the predefined-structure search matching, the user manually corrects each level and the content of the structure in the preliminary structured document data, and complete structured document data are output.
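As an illustration of how the error messages used in step S30 might arise from the search matching of step S20, the following minimal Python sketch reports any non-blank text that a predefined pattern fails to cover. The function name `find_unmatched`, the sample pattern and the sample HTML are assumptions made for illustration; the specification does not state how the error messages are generated.

```python
import re

def find_unmatched(text, pattern):
    """Return (matches, errors); errors describe non-blank text no match covers."""
    matches, errors, pos = [], [], 0
    for m in re.finditer(pattern, text):
        gap = text[pos:m.start()]
        if gap.strip():
            errors.append(f"unmatched text at {pos}-{m.start()}: {gap.strip()!r}")
        matches.append(m.group(0))
        pos = m.end()
    tail = text[pos:]
    if tail.strip():
        errors.append(f"unmatched text at {pos}-{len(text)}: {tail.strip()!r}")
    return matches, errors

# Hypothetical rule: a question is a paragraph that starts with "<number>."
QUESTION = r"<p>\d+\..*?</p>"
sample_html = "<p>1. What is 2+2?</p><p>Note to typist</p><p>2. Name a prime.</p>"

matches, errors = find_unmatched(sample_html, QUESTION)
```

Here the middle paragraph matches no rule, so it is reported to the user as a candidate for manual correction rather than being silently dropped.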
In the above embodiment, the HTML conversion of the Word document in step S10, as shown in Fig. 2, comprises the following steps:
S11, converting all text in the Word document into HTML markup text through Office automation;
S12, converting all non-text resources in the Word document into Base64-encoded text characters; the non-text resources include pictures and objects embedded in the Word document;
S13, storing the HTML markup text and the Base64-encoded text characters in the HTML file.
Taking the HTML conversion of the Word document of a test paper as an example: on the one hand, all text in the Word document of the test paper is converted into HTML markup text through Office automation; on the other hand, all non-text resources in the Word document of the test paper, such as embedded pictures and formula objects, are converted into Base64 encoding and stored in the HTML file, so that no other files need to be referenced.
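Steps S12 and S13 can be sketched as follows. This is a minimal Python illustration that assumes the Base64 text is stored as a `data:` URI in an `<img>` tag, which is one common way to keep a resource inside a single HTML file; the specification does not name the storage form. The helper names `to_data_uri` and `embed_image` and the placeholder string are hypothetical.

```python
import base64

def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as Base64 text wrapped in a data: URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def embed_image(html_body: str, placeholder: str, data: bytes, mime_type: str) -> str:
    """Replace a placeholder in the HTML text with an inline <img> tag."""
    img_tag = f'<img src="{to_data_uri(data, mime_type)}">'
    return html_body.replace(placeholder, img_tag)

# Hypothetical usage: the bytes would normally come from a picture in the document.
html_text = embed_image("<p>1. See figure: {{IMG1}}</p>", "{{IMG1}}", b"abc", "image/png")
```

Because the encoded resource travels inside the HTML text, the later regular-expression matching and the browser-side display both work from one self-contained file.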
In the above embodiment, in step S20, the user performs search matching on the HTML markup text by setting a predefined structure with regular expressions. Because nesting relationships are defined between the structures of the predefined structure, recursive search matching can be carried out along those nesting relationships when the regular expressions are applied. The result is presented to the user as a structure nested level by level, and the resource files in the content are reproduced. The output structured document data are therefore complete and interrelated, and the content of the Word document, with its original natural-language organizational structure, is completely reproduced as structured document data. Taking the Word document of a test paper as an example: depending on subject and grade, the user can predefine the content composition of structures such as major questions (regions and subregions), questions, sub-questions and question groups. The user then converts this composition information into corresponding regular expressions for search matching. After search matching, the nesting relationship of the test paper structure is as follows:
Region:
    Subregion
        Question group
            Question
    Subregion
        Question group
            Question
        Question group
            Question
            Question
                Sub-question
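The recursive search matching described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: each level of the predefined structure is assumed to be recognized by a regular expression that matches its heading, a structure is assumed to run until the next heading of the same level, and only three levels (region, subregion, question) are shown. The class name `Spec`, the function `parse` and all patterns are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Spec:
    name: str                       # structure type, e.g. "question"
    pattern: str                    # regex matching the start of one structure
    child: Optional["Spec"] = None  # nested structure searched inside each match

def parse(text: str, spec: Spec) -> list:
    """Split text at every heading match, then recurse into each block."""
    starts = [m.start() for m in re.finditer(spec.pattern, text)]
    nodes = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        block = text[start:end]
        node = {"type": spec.name, "text": block, "children": []}
        if spec.child is not None:
            node["children"] = parse(block, spec.child)
        nodes.append(node)
    return nodes

# Hypothetical predefined structure for a test paper.
paper_spec = Spec("region", r"<p>Part [A-Z]",
                  child=Spec("subregion", r"<p>Section \d+",
                             child=Spec("question", r"<p>\d+\.")))

sample_html = ("<p>Part A</p>"
               "<p>Section 1</p><p>1. Q one</p><p>2. Q two</p>"
               "<p>Section 2</p><p>3. Q three</p>")

tree = parse(sample_html, paper_spec)
```

Each node keeps its own text and its nested children, so the parent of every question remains known after matching.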
In the above embodiment, in step S30, the manual correction includes: adding, deleting and moving each level of the structure in the preliminary structured document data; and adding, deleting and modifying the content of the structure in the preliminary structured document data. Preferably, the complete structured document data include structured document data in XML and JSON. Taking the Word document of a test paper as an example: the user can add, delete or move each structure of the test paper, and can also operate on the content within a structure, for example adding, deleting or modifying the text and pictures in the stem of a question. The manual correction procedure is provided to refine the levels and content of the structure in the preliminary structured document data, so that accurate and complete structured document data are output. As a further preference of this embodiment, the manual correction also includes updating the predefined structure: after the content of the structure in the preliminary structured document data has been added to, deleted or modified, custom information is added to each level of the structure. Taking the Word document of a test paper as an example: custom information can be added to the predefined structure that has been set, for example the question type, the correct answer and the score of a question. Updating the predefined structure improves the accuracy of the output complete structured document data.
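The correction operations and the two output formats can be illustrated with a short Python sketch. It assumes the preliminary structured document data are held as nested dictionaries with `type`, `content` and `children` keys and an optional `info` dictionary for custom information; these key names and the helpers `move_child` and `to_xml` are assumptions for illustration only.

```python
import json
import xml.etree.ElementTree as ET

def move_child(src_parent: dict, index: int, dst_parent: dict) -> dict:
    """Move one structure from one parent level to another."""
    node = src_parent["children"].pop(index)
    dst_parent["children"].append(node)
    return node

def to_xml(node: dict) -> ET.Element:
    """Convert a node and its descendants into an XML element tree."""
    attrs = {key: str(value) for key, value in node.get("info", {}).items()}
    elem = ET.Element(node["type"], attrs)
    elem.text = node.get("content", "")
    for child in node.get("children", []):
        elem.append(to_xml(child))
    return elem

q1 = {"type": "question", "content": "1. Q one", "children": []}
q2 = {"type": "question", "content": "2. Q two", "children": []}
sec1 = {"type": "subregion", "content": "Section 1", "children": [q1, q2]}
sec2 = {"type": "subregion", "content": "Section 2", "children": []}
paper = {"type": "region", "content": "Part A", "children": [sec1, sec2]}

move_child(sec1, 1, sec2)                             # correct a level: move question 2
q1["content"] = "1. Question one"                     # correct content
q1["info"] = {"kind": "single-choice", "score": 5}    # add custom information

json_text = json.dumps(paper, ensure_ascii=False)
xml_text = ET.tostring(to_xml(paper), encoding="unicode")
```

Keeping the corrections on one in-memory tree means the XML and JSON outputs are generated from the same corrected data and cannot drift apart.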
It should be noted that information extraction in the prior art yields fragmentary, unrelated data when the data are reused. For example, extraction from the Word document of a test paper is likely to produce many questions with no association between them. Such data can only be reused question by question, because it is impossible to tell which major question a question belongs to, which questions share a context with it, or which sub-questions it contains. With the Word document conversion method provided by the present invention, all the hierarchical relationships of the original test paper are clear: structures such as questions, major questions, sub-questions and question groups can be accessed at will and reused, and the relationships between them are known. Likewise, data analysis in the prior art gives limited and incomplete results. Taking data analysis after extraction from the Word document of a test paper as an example, a teacher needs more than an analysis of each individual question: the distribution and arrangement of question types, the ordering of questions by difficulty, and even a longitudinal comparison of one paper across years or a lateral comparison with papers of other subjects are also involved. The prior art cannot provide a complete test paper structure for such analysis, whereas the Word document conversion method provided by the present invention obtains and preserves the hierarchical structure of the paper, making it easy for teachers to analyze various indicators. Therefore, the Word document conversion method provided by the present invention, through HTML conversion of the Word document, predefined-structure search matching and human-assisted correction, converts content organized in natural language in a Word document into structured document data organized in a computer language, and has the advantage of making the content data convenient to store, query and analyze.
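The reuse scenario described above, finding which major question a question belongs to and which questions share its context, reduces to a path lookup once the hierarchy is preserved. The following Python sketch assumes the same nested-dictionary form with `type`, `content` and `children` keys; the function name `find_path` and the sample data are hypothetical.

```python
def find_path(node, predicate, path=()):
    """Depth-first search; return the nodes from the root to the first match."""
    path = path + (node,)
    if predicate(node):
        return path
    for child in node.get("children", []):
        found = find_path(child, predicate, path)
        if found is not None:
            return found
    return None

paper = {"type": "region", "content": "Part B: Reading", "children": [
    {"type": "question_group",
     "content": "Read the passage and answer questions 3 and 4.",
     "children": [
         {"type": "question", "content": "3. What is the main idea?", "children": []},
         {"type": "question", "content": "4. Who is the narrator?", "children": []},
     ]},
]}

path = find_path(paper, lambda n: n["content"].startswith("4."))
group = path[-2]   # the question group that supplies the shared context
siblings = [n["content"] for n in group["children"] if n is not path[-1]]
```

With flat extraction results this lookup has nothing to walk, which is the limitation of the prior art that the paragraph above describes.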
Embodiment 2
On the basis of the Word document conversion method provided in Embodiment 1, the present invention provides a Word document conversion system which, as shown in Fig. 3, comprises a local program end 10, a browser end 20 and a server end 30. The local program end 10 receives requests from the browser end, selects a Word document, converts the full text of the Word document into HTML markup text, and outputs the HTML markup text. The browser end 20 exchanges Ajax requests with the local program end, sets the predefined structure and its updates, performs the search matching and carries out the human-assisted correction, and outputs complete structured document data. The server end 30 receives and stores the complete structured document data output by the browser end.
In the above embodiment, after the local program end 10 receives an Ajax request from the browser end 20, it completes the selection of the Word document, converts the full text of the Word document into HTML markup text, and outputs the HTML markup text. The predefined-structure search matching and the human-assisted correction are completed by the browser end 20: working from the result returned by the local program end 10 for the Ajax request, the browser end 20 sets the predefined structure and its updates, performs the search matching and carries out the human-assisted correction, so as to output complete structured document data. The server end 30 is mainly used to store the complete structured document data for subsequent query and analysis.
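The interaction between the browser end and the local program end can be sketched with Python's standard `http.server` module. This is only an illustration under stated assumptions: the specification does not describe the transport details, so the JSON request shape (`{"path": ...}`), the port number, and the names `convert_word_to_html`, `handle_convert`, `ConvertHandler` and `serve` are all hypothetical, and the Word-to-HTML conversion is replaced by a placeholder.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def convert_word_to_html(path: str) -> str:
    """Placeholder for step S10; a real local program would invoke Office here."""
    return f"<p>converted content of {path}</p>"

def handle_convert(body: bytes, convert=convert_word_to_html):
    """Pure request logic: JSON bytes in, (status code, JSON bytes) out."""
    try:
        path = json.loads(body)["path"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected JSON with a 'path' field"}
        return 400, json.dumps(error).encode("utf-8")
    return 200, json.dumps({"html": convert(path)}).encode("utf-8")

class ConvertHandler(BaseHTTPRequestHandler):
    """Answers the browser end's Ajax POST with the HTML markup text."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_convert(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(port: int = 8765) -> None:
    """Run the local program end on the loopback interface (blocks until stopped)."""
    HTTPServer(("127.0.0.1", port), ConvertHandler).serve_forever()
```

Separating `handle_convert` from the HTTP handler keeps the request logic testable without opening a socket. The browser end would then perform matching and correction on the returned HTML and post the final XML or JSON to the server end.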
The Word document conversion system provided by the present invention implements HTML conversion of the Word document, predefined-structure search matching and human-assisted correction, so that content organized in natural language in a Word document is converted into structured document data organized in a computer language, which makes the content data convenient to store, query and analyze.
Although embodiments of the present invention have been disclosed above, they are not limited to the uses listed in the specification and the embodiments. The invention can be applied to any field for which it is suitable, and those skilled in the art can readily make further modifications. Therefore, without departing from the general concept defined by the claims and their equivalents, the present invention is not limited to the specific details or to the illustrations shown and described herein.

Claims (8)

1. A conversion method for Word documents, characterized by comprising the following steps:
HTML conversion of the Word document: converting the full text of the Word document into HTML markup text, and outputting the HTML markup text;
Predefined-structure search matching: setting a predefined structure with regular expressions, performing search matching on the HTML markup text with the predefined structure, and outputting preliminary structured document data;
Human-assisted correction: according to the error messages prompted after the predefined-structure search matching, the user manually corrects each level and the content of the structure in the preliminary structured document data, and complete structured document data are output.
2. The conversion method for Word documents according to claim 1, characterized in that the HTML conversion of the Word document comprises the following steps:
converting all text in the Word document into HTML markup text through Office automation;
converting all non-text resources in the Word document into Base64-encoded text characters;
storing the HTML markup text and the Base64-encoded text characters in the HTML file.
3. The conversion method for Word documents according to claim 2, characterized in that the non-text resources include pictures and objects embedded in the Word document.
4. The conversion method for Word documents according to claim 1, characterized in that nesting relationships are defined between the structures of the predefined structure, and the search matching includes recursive search matching.
5. The conversion method for Word documents according to claim 1, characterized in that the manual correction includes: adding, deleting and moving each level of the structure in the preliminary structured document data; and adding, deleting and modifying the content of the structure in the preliminary structured document data.
6. The conversion method for Word documents according to claim 5, characterized in that the manual correction further includes:
updating the predefined structure: after the content of the structure in the preliminary structured document data has been added to, deleted or modified, adding custom information to each level of the structure.
7. The conversion method for Word documents according to claim 1, characterized in that the complete structured document data include structured document data in XML and JSON.
8. A system that performs conversion using the Word document conversion method according to any one of claims 1-7, characterized in that it comprises:
a local program end, which receives requests from the browser end, selects a Word document, converts the full text of the Word document into HTML markup text, and outputs the HTML markup text;
a browser end, which exchanges Ajax requests with the local program end, sets the predefined structure and its updates, performs the search matching and carries out the human-assisted correction, and outputs complete structured document data; and
a server end, which receives and stores the complete structured document data output by the browser end.
CN201611252467.0A 2016-12-30 2016-12-30 The conversion method and system of Word document Pending CN106802937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611252467.0A CN106802937A (en) 2016-12-30 2016-12-30 The conversion method and system of Word document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611252467.0A CN106802937A (en) 2016-12-30 2016-12-30 The conversion method and system of Word document

Publications (1)

Publication Number Publication Date
CN106802937A true CN106802937A (en) 2017-06-06

Family

ID=58985252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611252467.0A Pending CN106802937A (en) 2016-12-30 2016-12-30 The conversion method and system of Word document

Country Status (1)

Country Link
CN (1) CN106802937A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108255914A (en) * 2017-09-05 2018-07-06 深圳壹账通智能科技有限公司 webpage generating method and application server
CN110018863A (en) * 2018-01-09 2019-07-16 武汉斗鱼网络科技有限公司 A kind of mobile terminal text display method, storage medium, equipment and system
CN111209728A (en) * 2020-01-13 2020-05-29 深圳市企鹅网络科技有限公司 Automatic test question labeling and inputting method
CN111737949A (en) * 2020-07-22 2020-10-02 江西风向标教育科技有限公司 Topic content extraction method and device, readable storage medium and computer equipment
CN112783957A (en) * 2019-11-11 2021-05-11 上海遴睿教育科技有限公司 Method and system for importing word document format for English reading
US11366961B2 (en) 2019-06-14 2022-06-21 Mathresources Incorporated Systems and methods for document publishing

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030140311A1 (en) * 2002-01-18 2003-07-24 Lemon Michael J. Method for content mining of semi-structured documents
CN104199871A (en) * 2014-08-19 2014-12-10 南京富士通南大软件技术有限公司 High-speed test question inputting method for intelligent teaching
CN104699714A (en) * 2013-12-09 2015-06-10 北大方正集团有限公司 Method and device for transferring files of book edition format into files of EPUB format

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108255914A (en) * 2017-09-05 2018-07-06 深圳壹账通智能科技有限公司 webpage generating method and application server
CN110018863A (en) * 2018-01-09 2019-07-16 武汉斗鱼网络科技有限公司 A kind of mobile terminal text display method, storage medium, equipment and system
CN110018863B (en) * 2018-01-09 2022-05-10 武汉斗鱼网络科技有限公司 Mobile terminal text display method, storage medium, equipment and system
US11366961B2 (en) 2019-06-14 2022-06-21 Mathresources Incorporated Systems and methods for document publishing
CN112783957A (en) * 2019-11-11 2021-05-11 上海遴睿教育科技有限公司 Method and system for importing word document format for English reading
CN111209728A (en) * 2020-01-13 2020-05-29 深圳市企鹅网络科技有限公司 Automatic test question labeling and inputting method
CN111209728B (en) * 2020-01-13 2024-01-30 深圳市企鹅网络科技有限公司 Automatic labeling and inputting method for test questions
CN111737949A (en) * 2020-07-22 2020-10-02 江西风向标教育科技有限公司 Topic content extraction method and device, readable storage medium and computer equipment
CN111737949B (en) * 2020-07-22 2021-07-06 江西风向标教育科技有限公司 Topic content extraction method and device, readable storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170606