CN106802937A - The conversion method and system of Word document - Google Patents
The conversion method and system of Word document
- Publication number
- CN106802937A (application CN201611252467.0A)
- Authority
- CN
- China
- Prior art keywords
- document
- word document
- word
- predefined
- markup language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/178—Techniques for file synchronisation in file systems
- G06F16/1794—Details of file format conversion
Abstract
The present invention discloses a conversion method for Word documents, comprising the steps of: converting the full text of a Word document into HTML markup language text and outputting the HTML markup language text; setting a predefined structure with regular expressions, performing search matching on the HTML markup language text using the predefined structure, and outputting preliminary structured document data; and, according to the error information prompted after the predefined-structure search matching, having the user manually correct each level and the content of the structures in the preliminary structured document data, and outputting complete structured document data. Through HTML conversion of the Word document, predefined-structure search matching, and human-assisted correction, the method converts content organized in natural language in the Word document into structured document data organized in a computer language and stores it, which facilitates storing, querying, and analyzing the content data.
Description
Technical field
The present invention relates to the technical field of Word document conversion, and more particularly to a conversion method and system for Word documents.
Background
Word is currently the most popular electronic document tool. The prior art usually concerns either converting structured document data (such as XML or JSON) into Word documents, or information extraction based on Word documents.
However, a Word file is itself a binary file, so a computer cannot access its data directly by text retrieval. Current Word document information extraction techniques, which address this problem, only retrieve and extract targeted content; they cannot completely reproduce, as structured document data, the content of a Word document in its original natural-language organizational structure.
Summary of the invention
To address the shortcomings of the above technology, the present invention provides a conversion method and system for Word documents. Through HTML conversion of the Word document, predefined-structure search matching, and human-assisted correction, content organized in natural language in the Word document is converted into structured document data organized in a computer language and stored, which facilitates storing, querying, and analyzing the content data.
To achieve these objects and other advantages, the present invention is realized through the following technical solutions:
The present invention provides a conversion method for Word documents, comprising the following steps:
Word document HTML conversion: converting the full text of the Word document into HTML markup language text, and outputting the HTML markup language text;
Predefined-structure search matching: setting a predefined structure with regular expressions, performing search matching on the HTML markup language text using the predefined structure, and outputting preliminary structured document data;
Human-assisted correction: according to the error information prompted after the predefined-structure search matching, the user manually corrects each level and the content of the structures in the preliminary structured document data, and complete structured document data is output.
Preferably, the Word document HTML conversion comprises the following steps:
Converting all text in the Word document into HTML markup language text through Office automation;
Converting all non-text resources in the Word document into Base64-encoded text characters;
Storing the HTML markup language text and the Base64-encoded text characters in the HTML.
Preferably, the non-text resources include pictures and objects embedded in the Word document.
Preferably, nesting relations are provided between the structures of the predefined structure, and the search matching includes recursive search matching.
Preferably, the manual correction operations include: adding, deleting, and moving each level of the structures in the preliminary structured document data; and adding, deleting, and modifying the content of the structures in the preliminary structured document data.
Preferably, the manual correction operations further include:
Updating the predefined structure: after the content of the structures in the preliminary structured document data has been added to, deleted, and modified, adding custom information to each level of the structures.
Preferably, the complete structured document data includes structured document data in XML and JSON.
A Word document conversion system, comprising:
A local program end, configured to receive the request from the browser end, select a Word document, convert the full text of the Word document into HTML markup language text, and output the HTML markup language text;
A browser end, configured to respond to the Ajax request of the local program end, set and update the predefined structure, perform the search matching, carry out the human-assisted correction, and output complete structured document data; and
A server end, configured to receive and store the complete structured document data output by the browser end.
The present invention has at least the following beneficial effects:
1) The conversion method for Word documents provided by the present invention, through HTML conversion of the Word document, predefined-structure search matching, and human-assisted correction, converts content organized in natural language in the Word document into structured document data organized in a computer language and stores it, which facilitates storing, querying, and analyzing the content data;
2) Nesting relations are provided between the structures of the predefined structure, so the search matching includes recursive search matching. This makes the output structured document data complete and interrelated, so that the content of the Word document in its original natural-language organizational structure is completely reproduced as structured document data;
3) The manual correction of each level and the content of the structures in the preliminary structured document data, and the updating of the predefined structure, each improve the accuracy of the complete structured document data that is output.
Further advantages, objects, and features of the present invention will be partly embodied in the following description, and partly understood by those skilled in the art through study and practice of the invention.
Brief description of the drawings
Fig. 1 is a flow chart of the Word document conversion method of the present invention;
Fig. 2 is a flow chart of the Word document HTML conversion of the present invention;
Fig. 3 is a schematic diagram of the Word document conversion system of the present invention.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings, so that those skilled in the art can implement it by referring to the text of the specification.
It should be understood that terms such as "having", "comprising", and "including" as used herein do not exclude the presence or addition of one or more other elements or combinations thereof.
Embodiment 1
As shown in Fig. 1, the present invention provides a conversion method for Word documents, comprising the steps of:
S10, Word document HTML conversion: converting the full text of the Word document into HTML markup language text, and outputting the HTML markup language text.
S20, predefined-structure search matching: setting a predefined structure with regular expressions, performing search matching on the HTML markup language text using the predefined structure, and outputting preliminary structured document data.
S30, human-assisted correction: according to the error information prompted after the predefined-structure search matching, the user manually corrects each level and the content of the structures in the preliminary structured document data, and complete structured document data is output.
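The specification does not say how the error information prompted after matching is produced. One possible reading, sketched below in Python under that assumption, treats any paragraph that no predefined structure matches as an error to hand to the user for manual correction. The `QUESTION` pattern, the `<p>` paragraph layout, and the `search_match` function are hypothetical names chosen for illustration only.

```python
import re
from dataclasses import dataclass

# Hypothetical predefined structure: a numbered question such as "1. text".
QUESTION = re.compile(r"<p>(\d+)\.\s*(.*?)</p>", re.S)
# Assumption: the HTML from step S10 wraps each paragraph in <p>...</p>.
PARAGRAPH = re.compile(r"<p>.*?</p>", re.S)


@dataclass
class MatchResult:
    questions: list  # preliminary structured data (step S20 output)
    errors: list     # paragraphs left for manual correction (step S30 input)


def search_match(html: str) -> MatchResult:
    """Match every paragraph against the predefined structure."""
    questions, errors = [], []
    for para in PARAGRAPH.finditer(html):
        m = QUESTION.fullmatch(para.group(0))
        if m:
            questions.append({"number": int(m.group(1)), "stem": m.group(2)})
        else:
            # Nothing matched: record where, so a user can review it.
            errors.append({"offset": para.start(), "text": para.group(0)})
    return MatchResult(questions, errors)


html = "<p>1. What is 2 + 3?</p><p>Answer all parts</p><p>2. Name a prime.</p>"
result = search_match(html)
for err in result.errors:
    print("unmatched at offset", err["offset"], ":", err["text"])
```

Keeping the unmatched text together with its offset lets a correction interface point the user at the exact place in the HTML that needs attention.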
In the above embodiment, the Word document HTML conversion of step S10, as shown in Fig. 2, comprises the following steps:
S11, converting all text in the Word document into HTML markup language text through Office automation;
S12, converting all non-text resources in the Word document into Base64-encoded text characters, the non-text resources including pictures and objects embedded in the Word document;
S13, storing the HTML markup language text and the Base64-encoded text characters in the HTML.
Taking the HTML conversion of the Word document of an exam paper as an example: on the one hand, all text in the Word document of the paper is converted into HTML markup language text through Office automation; on the other hand, all non-text resources in the Word document of the paper, such as embedded pictures and formula objects, are converted into Base64 encoding and stored in the HTML, so that no other files need to be referenced.
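Steps S12 and S13 can be illustrated with a short Python sketch. It assumes the Base64 text is placed in the HTML as a `data:` URI, which is a common way to make an HTML file self-contained; the specification only states that the encoded text is stored in the HTML. The Office automation of step S11 is not shown because it needs an installed copy of Word. The helper names `to_data_uri` and `embed_image` and the file name `image001.png` are hypothetical.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode a binary resource as Base64 text wrapped in a data: URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def embed_image(html_body: str, placeholder: str, data: bytes, mime_type: str) -> str:
    """Replace a reference to an external file with the inline resource."""
    return html_body.replace(placeholder, to_data_uri(data, mime_type))


# Assumed shape of the HTML from S11: an <img> that points at a separate file.
body = '<p>1. Look at the figure.</p><img src="image001.png">'
# Stand-in bytes (only the PNG signature); a real picture would be read from disk.
png_bytes = b"\x89PNG\r\n\x1a\n"

html = embed_image(body, "image001.png", png_bytes, "image/png")
print(html)
```

After this step the HTML no longer refers to any external file, which matches the stated goal of storing everything in one place.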
In the above embodiment, in step S20, the user performs search matching on the HTML markup language text by setting a predefined structure with regular expressions. Nesting relations are provided between the structures of the predefined structure, so when search matching is performed with the regular expressions, recursive search matching can be carried out according to the nesting relations between structures. The result is presented to the user as a layer-by-layer nested structure, with the resource files in the content reproduced. This makes the output structured document data complete and interrelated, so that the content of the Word document in its original natural-language organizational structure is completely reproduced as structured document data. Taking the Word document of an exam paper as an example: according to differences in subject and grade, the user can predefine the content composition information of structures such as major questions (regions and subregions), questions, sub-questions, and question groups. The user then converts the content composition information of these structures into corresponding regular expressions used to perform the search matching. During search matching, the nesting relations of the paper structure are as follows:
Region:
  Subregion
    Question group
      Question
  Subregion
    Question group
      Question
    Question group
      Question
      Question
        Sub-question
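The recursive search matching described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it works on plain text rather than HTML, uses only three levels (region, question, sub-question), and the classes `Structure` and `Node`, the function `match_recursive`, and all three regular expressions are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Structure:
    """A predefined structure: a regex plus the structures nested inside it."""
    name: str
    pattern: re.Pattern  # must define a named group "body"
    children: list = field(default_factory=list)


@dataclass
class Node:
    name: str
    text: str
    children: list = field(default_factory=list)


def match_recursive(structure: Structure, text: str) -> list:
    """Find every occurrence of a structure, then search inside each match."""
    nodes = []
    for m in structure.pattern.finditer(text):
        body = m.group("body")
        node = Node(structure.name, body.strip())
        for child in structure.children:
            node.children.extend(match_recursive(child, body))
        nodes.append(node)
    return nodes


# Illustrative patterns: "(1) ..." sub-questions inside "1. ..." questions
# inside "Part I" regions.
sub_question = Structure("sub-question", re.compile(r"^\(\d+\)(?P<body>.*)$", re.M))
question = Structure(
    "question",
    re.compile(r"^\d+\.(?P<body>.*?)(?=^\d+\.|\Z)", re.M | re.S),
    [sub_question],
)
region = Structure(
    "region",
    re.compile(r"^Part [IVX]+\n(?P<body>.*?)(?=^Part |\Z)", re.M | re.S),
    [question],
)

text = (
    "Part I\n"
    "1. First question\n"
    "(1) sub a\n"
    "(2) sub b\n"
    "2. Second question\n"
    "Part II\n"
    "3. Third question\n"
)
regions = match_recursive(region, text)
```

Each structure only searches inside the text captured by its parent, which is what keeps a question attached to the region it came from.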
In the above embodiment, in step S30, the manual correction operations include: adding, deleting, and moving each level of the structures in the preliminary structured document data; and adding, deleting, and modifying the content of the structures in the preliminary structured document data. Preferably, the complete structured document data includes structured document data in XML and JSON. Taking the Word document of an exam paper as an example: the user can add, delete, or move each structure of the paper, and can also operate on the content within a structure, for example adding, deleting, or modifying the text and pictures in the stem of a question. The manual correction procedure is provided to refine the levels and content of the structures in the preliminary structured document data, so that accurate and complete structured document data is output. As a further preference of this embodiment, the manual correction operations also include updating the predefined structure: after the content of the structures in the preliminary structured document data has been added to, deleted, and modified, custom information is added to each level of the structures. Taking the Word document of an exam paper as an example: custom information can be added to the predefined structure that has been set, such as the question type, correct answer, and score of a question. Updating the predefined structure improves the accuracy of the complete structured document data that is output.
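The correction operations and the two output formats can be sketched in Python as below. The dictionary layout (`kind`, `content`, `info`, `children`) and every function name are assumptions made for illustration; the specification does not define a data model.

```python
import json
import xml.etree.ElementTree as ET


def make_node(kind, content="", **info):
    """One structure in the tree; `info` holds custom information."""
    return {"kind": kind, "content": content, "info": dict(info), "children": []}


def add_child(parent, child):
    parent["children"].append(child)


def delete_child(parent, child):
    # Compare by identity so an equal-looking sibling is never removed.
    parent["children"] = [c for c in parent["children"] if c is not child]


def move_node(node, old_parent, new_parent):
    delete_child(old_parent, node)
    add_child(new_parent, node)


def to_xml(node):
    """Convert the tree to XML, with custom information as attributes."""
    elem = ET.Element(node["kind"], {k: str(v) for k, v in node["info"].items()})
    elem.text = node["content"]
    for child in node["children"]:
        elem.append(to_xml(child))
    return elem


paper = make_node("paper")
part1, part2 = make_node("region", "Part I"), make_node("region", "Part II")
add_child(paper, part1)
add_child(paper, part2)
q = make_node("question", "What is 2 + 3?")
add_child(part1, q)

move_node(q, part1, part2)            # correct a level: wrong region
q["content"] = "What is 2 + 4?"       # correct content: fix the stem
q["info"].update(type="short-answer", answer="6", score=5)  # custom information

as_json = json.dumps(paper, ensure_ascii=False)
as_xml = ET.tostring(to_xml(paper), encoding="unicode")
```

Both serializations come from the same tree, so a correction made once appears in the XML and the JSON output alike.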
It should be noted that information extraction in the prior art yields fragmented, unrelated data for reuse. For example, extracting from the Word document of an exam paper is likely to produce a large number of questions with no association between them, and the data can then only be reused question by question, because there is no way to tell which major question a question belongs to, which other questions share a context with it, or which sub-questions it contains. With the Word document conversion method provided by the present invention, all of the original hierarchical relations of the paper are clear: the questions, major questions, sub-questions, question groups, and other structures can be accessed at random and reused, and the relations between them are known. In addition, data analysis in the prior art yields limited, incomplete results. Taking data analysis after extraction from the Word document of an exam paper as an example: for a teacher, the analysis concerns not only each individual question but also the distribution and arrangement of question types, the ordering of questions by difficulty, and even a longitudinal comparison with papers from previous years or a lateral comparison with papers in other subjects. The prior art cannot provide a complete paper structure for such analysis, whereas the Word document conversion method provided by the present invention obtains and preserves the paper's hierarchical structure, making it easy for teachers to analyze various indicators. Therefore, the Word document conversion method provided by the present invention, through HTML conversion of the Word document, predefined-structure search matching, and human-assisted correction, converts content organized in natural language in the Word document into structured document data organized in a computer language and stores it, which has the advantage of facilitating storage, query, and analysis of the content data.
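The three questions raised above (which major question a question belongs to, which questions share its context, and which sub-questions it contains) become simple tree lookups once the hierarchy is kept. The Python sketch below assumes a nested dictionary with hypothetical `kind`, `id`, and `children` keys; it is an illustration of the idea, not a format defined by the specification.

```python
def find_path(node, target_id, path=()):
    """Return the chain of nodes from the root down to the target, or None."""
    path = path + (node,)
    if node.get("id") == target_id:
        return path
    for child in node.get("children", []):
        found = find_path(child, target_id, path)
        if found:
            return found
    return None


paper = {"kind": "paper", "id": "paper", "children": [
    {"kind": "region", "id": "I", "children": [
        {"kind": "group", "id": "g1", "children": [
            {"kind": "question", "id": "q1", "children": [
                {"kind": "sub-question", "id": "q1a"},
                {"kind": "sub-question", "id": "q1b"},
            ]},
            {"kind": "question", "id": "q2"},
        ]},
    ]},
]}

path = find_path(paper, "q1")
# Which major question (region) does q1 belong to?
region = next(n for n in path if n["kind"] == "region")
# Which questions share a group, and therefore a context, with q1?
group = next(n for n in path if n["kind"] == "group")
shared = [n["id"] for n in group["children"] if n["id"] != "q1"]
# Which sub-questions does q1 contain?
subs = [n["id"] for n in path[-1].get("children", [])]
```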
Embodiment 2
On the basis of the Word document conversion method provided in Embodiment 1, the present invention provides a Word document conversion system which, as shown in Fig. 3, comprises a local program end 10, a browser end 20, and a server end 30. The local program end 10 is configured to receive the request from the browser end, select a Word document, convert the full text of the Word document into HTML markup language text, and output the HTML markup language text. The browser end 20 is configured to respond to the Ajax request of the local program end, set and update the predefined structure, perform the search matching, carry out the human-assisted correction, and output complete structured document data. The server end 30 is configured to receive and store the complete structured document data output by the browser end.
In the above embodiment, after the local program end 10 receives the Ajax request from the browser end 20, it completes the selection of the Word document, converts the full text of the Word document into HTML markup language text, and outputs the HTML markup language text. The predefined-structure search matching and the human-assisted correction are completed by the browser end 20; that is, the browser end 20 responds to the Ajax request of the local program end 10, sets and updates the predefined structure, performs the search matching, and carries out the human-assisted correction, so as to output complete structured document data. The server end 30 is mainly used to store the complete structured document data for subsequent query and analysis.
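The exchange between the browser end and the local program end can be sketched with Python's standard library. This is an assumption-heavy illustration: the specification does not describe the transport, so the `/convert` path, the JSON message shape, the `ConvertHandler` class, and the stubbed `convert_word_to_html` function are all hypothetical, and a small client function stands in for the browser's Ajax call.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def convert_word_to_html(path: str) -> str:
    # Stand-in for step S10; a real local program would drive Word here.
    return f"<html><body><p>converted from {path}</p></body></html>"


class ConvertHandler(BaseHTTPRequestHandler):
    """Local program end: answers a conversion request with HTML."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        html = convert_word_to_html(request.get("path", ""))
        payload = json.dumps({"html": html}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def request_conversion(port: int, path: str) -> dict:
    """Plays the role of the browser end's Ajax request."""
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/convert",
        data=json.dumps({"path": path}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Bypass any system proxy so the request stays on the local machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req) as resp:
        return json.load(resp)


server = HTTPServer(("127.0.0.1", 0), ConvertHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
reply = request_conversion(server.server_port, "exam.docx")
server.shutdown()
server.server_close()
```

A page served from another origin would also need cross-origin handling on the local program, which this sketch leaves out.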
The Word document conversion system provided by the present invention implements HTML conversion of the Word document, predefined-structure search matching, and human-assisted correction, thereby converting content organized in natural language in the Word document into structured document data organized in a computer language and storing it, which facilitates storing, querying, and analyzing the content data.
Although embodiments of the present invention have been disclosed above, they are not limited to the uses listed in the specification and the embodiments. The invention can be applied in all fields to which it is suited, and further modifications can readily be made by those skilled in the art. Therefore, without departing from the general concept defined by the claims and their equivalents, the present invention is not limited to the specific details or to the illustrations shown and described herein.
Claims (8)
1. A conversion method for Word documents, characterized by comprising the following steps:
Word document HTML conversion: converting the full text of the Word document into HTML markup language text, and outputting the HTML markup language text;
Predefined-structure search matching: setting a predefined structure with regular expressions, performing search matching on the HTML markup language text using the predefined structure, and outputting preliminary structured document data;
Human-assisted correction: according to the error information prompted after the predefined-structure search matching, the user manually corrects each level and the content of the structures in the preliminary structured document data, and complete structured document data is output.
2. The conversion method for Word documents according to claim 1, characterized in that the Word document HTML conversion comprises the following steps:
Converting all text in the Word document into HTML markup language text through Office automation;
Converting all non-text resources in the Word document into Base64-encoded text characters;
Storing the HTML markup language text and the Base64-encoded text characters in the HTML.
3. The conversion method for Word documents according to claim 2, characterized in that the non-text resources include pictures and objects embedded in the Word document.
4. The conversion method for Word documents according to claim 1, characterized in that nesting relations are provided between the structures of the predefined structure, and the search matching includes recursive search matching.
5. The conversion method for Word documents according to claim 1, characterized in that the manual correction operations include: adding, deleting, and moving each level of the structures in the preliminary structured document data; and adding, deleting, and modifying the content of the structures in the preliminary structured document data.
6. The conversion method for Word documents according to claim 5, characterized in that the manual correction operations further include:
Updating the predefined structure: after the content of the structures in the preliminary structured document data has been added to, deleted, and modified, adding custom information to each level of the structures.
7. The conversion method for Word documents according to claim 1, characterized in that the complete structured document data includes structured document data in XML and JSON.
8. A system that performs conversion using the Word document conversion method according to any one of claims 1-7, characterized by comprising:
A local program end, configured to receive the request from the browser end, select a Word document, convert the full text of the Word document into HTML markup language text, and output the HTML markup language text;
A browser end, configured to respond to the Ajax request of the local program end, set and update the predefined structure, perform the search matching, carry out the human-assisted correction, and output complete structured document data; and
A server end, configured to receive and store the complete structured document data output by the browser end.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611252467.0A CN106802937A (en) | 2016-12-30 | 2016-12-30 | The conversion method and system of Word document |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106802937A true CN106802937A (en) | 2017-06-06 |
Family
ID=58985252
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611252467.0A Pending CN106802937A (en) | 2016-12-30 | 2016-12-30 | The conversion method and system of Word document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106802937A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030140311A1 (en) * | 2002-01-18 | 2003-07-24 | Lemon Michael J. | Method for content mining of semi-structured documents |
CN104699714A (en) * | 2013-12-09 | 2015-06-10 | 北大方正集团有限公司 | Method and device for transferring files of book edition format into files of EPUB format |
CN104199871A (en) * | 2014-08-19 | 2014-12-10 | 南京富士通南大软件技术有限公司 | High-speed test question inputting method for intelligent teaching |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108255914A (en) * | 2017-09-05 | 2018-07-06 | 深圳壹账通智能科技有限公司 | webpage generating method and application server |
CN110018863A (en) * | 2018-01-09 | 2019-07-16 | 武汉斗鱼网络科技有限公司 | A kind of mobile terminal text display method, storage medium, equipment and system |
CN110018863B (en) * | 2018-01-09 | 2022-05-10 | 武汉斗鱼网络科技有限公司 | Mobile terminal text display method, storage medium, equipment and system |
US11366961B2 (en) | 2019-06-14 | 2022-06-21 | Mathresources Incorporated | Systems and methods for document publishing |
CN112783957A (en) * | 2019-11-11 | 2021-05-11 | 上海遴睿教育科技有限公司 | Method and system for importing word document format for English reading |
CN111209728A (en) * | 2020-01-13 | 2020-05-29 | 深圳市企鹅网络科技有限公司 | Automatic test question labeling and inputting method |
CN111209728B (en) * | 2020-01-13 | 2024-01-30 | 深圳市企鹅网络科技有限公司 | Automatic labeling and inputting method for test questions |
CN111737949A (en) * | 2020-07-22 | 2020-10-02 | 江西风向标教育科技有限公司 | Topic content extraction method and device, readable storage medium and computer equipment |
CN111737949B (en) * | 2020-07-22 | 2021-07-06 | 江西风向标教育科技有限公司 | Topic content extraction method and device, readable storage medium and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170606 |