CN115729465A - Document decoupling and synthesizing system based on paragraph small file storage

Info

Publication number: CN115729465A
Application number: CN202211353362.XA
Authority: CN
Inventors: 刘嘉璇; 刘东升; 张阳; 张鹏; 亢俊钊; 智绪友
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2022-11-01
Filing date: 2022-11-01
Publication date: 2023-03-03

Abstract

The invention discloses a document decoupling and synthesizing system based on paragraph small file storage, comprising: a data document parsing and storage module, which stores complete documents and parses them; a relevant document acquisition module, which retrieves relevant documents according to keywords; a paragraph small file storage module, which handles transfer verification, storage, and management of the split files; a document batch disassembly and merging module, which disassembles and merges files in batches based on the titles of the paragraph list tree structure; a merged resource pool module, which adds selected paragraphs to a resource pool; and a resource pool content synthesis and download module, which synthesizes the resource pool content, generates a new file stream, and offers it for download. By storing paragraphs as small files and querying relevant documents by keyword, the system uses a big data search engine to automatically screen out the paragraphs that match the search content, and it disassembles, merges, and stores document content based on the selected paragraphs, so that the user can review, filter, and further process the relevant paragraphs.

Description

Document decoupling and synthesizing system based on paragraph small file storage

Technical Field

The invention belongs to the technical field of office automation, and particularly relates to a document decoupling and synthesizing system based on paragraph small file storage.

Background

As computer technology has advanced, electronic documents such as Word files have replaced much paper-based work and greatly improved office efficiency. Everyday work, such as taking part in bidding meetings, generates a large number of electronic documents. At present these documents are mainly sorted by hand, which is tedious, time-consuming, labor-intensive, and error-prone, so working efficiency cannot be improved.

When bidding documents are prepared in daily work, the many documents involved and the correlations among their contents are not directly linked, which makes it very inconvenient for a user to find valuable documents. Each user has to download a large number of documents, browse them one by one, and weed out the highly similar duplicates by hand.

Therefore, a system that decouples documents based on their content and merges the parts into new files is needed.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a document decoupling and synthesizing system based on paragraph small file storage. By managing a storage directory tree of paragraph small files and querying relevant documents by document keyword, the system automatically screens out, from a massive set of documents, the paragraphs that match the search content. Based on the content of the selected paragraphs, the user can specify at which title level each relevant paragraph is synthesized. The system also disassembles, merges, and stores document content, including images, video, and text, so that the user can review and filter it and then process it accordingly.

In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:

a document decoupling and synthesizing system based on paragraph small file storage comprises:

the data document analysis and storage module is used for storing the complete document, analyzing the document and obtaining an MD5 code, a paragraph code and paragraph contents based on Hash calculation;

the relevant document acquisition module is used for acquiring relevant documents according to keywords input by a user;

the paragraph small file storage module is used for carrying out transmission verification, storage and management on the split file;

the document batch disassembling and merging module is used for realizing batch disassembling and merging of Word files based on the paragraph list tree structure title;

the combined resource pool module is used for selecting paragraphs to add to the resource pool based on the paragraph tree structure list;

and the resource pool content synthesizing and downloading module is used for synthesizing and downloading the resource pool content and generating a new Word file stream.

In order to optimize the technical scheme, the specific measures adopted further comprise:

In the data document analysis and storage module, complete documents are stored in a MongoDB database built on an application server, and ElasticSearch is used as the search engine. A Java program starts multiple threads to upload Word files to the MongoDB database, performs a hash calculation on each file, and parses the Word file to form an MD5 code that serves as the unique file identifier. At the same time the file content is decomposed into paragraphs; the MD5 code of the uploaded file together with the paragraph code is used as the key, and the paragraph content is stored as the value in the ElasticSearch service so that it can be searched.
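A minimal Java sketch of the key scheme described above, assuming the MD5 code is the hex digest of the raw file bytes and that the key joins the file MD5 and a zero-based paragraph code with a colon. The class name `ParagraphIndexer`, the key layout, and the in-memory map standing in for the ElasticSearch index are illustrative assumptions, not details given in this application.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ParagraphIndexer {
    /** Hex-encoded MD5 digest of the raw file bytes, used as the file identifier. */
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Hypothetical key layout: "<fileMd5>:<paragraphCode>". */
    static String paragraphKey(String fileMd5, int paragraphCode) {
        return fileMd5 + ":" + paragraphCode;
    }

    /** Maps each paragraph key to its content; the Map stands in for the search index. */
    static Map<String, String> index(byte[] fileBytes, List<String> paragraphs) {
        String fileMd5 = md5Hex(fileBytes);
        Map<String, String> entries = new LinkedHashMap<>();
        for (int i = 0; i < paragraphs.size(); i++) {
            entries.put(paragraphKey(fileMd5, i), paragraphs.get(i));
        }
        return entries;
    }

    public static void main(String[] args) {
        byte[] file = "abc".getBytes(StandardCharsets.UTF_8);
        index(file, List.of("Scope of work", "Pricing"))
                .forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Because the key starts with the file MD5, all paragraphs of one file share a common prefix, which makes it straightforward to list them once a matching file has been found.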

The relevant document acquisition module assigns document operation permissions to different users and submits the resource documents to be synthesized, specifically:

(1) A document permission system is introduced to assign the operations each user may perform on the documents found by a search, and to assign read-only or editable permission on the files the user downloads and synthesizes;

(2) The keywords entered by the user are obtained through the interface, file names and file contents are searched and matched in the ElasticSearch service, and once the set of MD5 codes of the matching files is obtained, the paragraph information list of each returned file is traversed so that the user can look through it.
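The lookup flow in step (2) can be sketched as follows. This is only an illustration under stated assumptions: an in-memory map replaces the ElasticSearch service, keys are assumed to have the form `<fileMd5>:<paragraphCode>`, matching is a plain case-insensitive substring test on paragraph content (file-name matching is omitted), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class KeywordSearch {
    /** Returns the MD5 codes of all files that have at least one paragraph containing the keyword. */
    static Set<String> matchingFileMd5s(Map<String, String> index, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        Set<String> md5s = new LinkedHashSet<>();
        for (Map.Entry<String, String> e : index.entrySet()) {
            if (e.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                // Assumes every key contains a ':' separating file MD5 and paragraph code.
                md5s.add(e.getKey().substring(0, e.getKey().indexOf(':')));
            }
        }
        return md5s;
    }

    /** Traverses the index and collects every paragraph that belongs to the given file. */
    static List<String> paragraphsOf(Map<String, String> index, String fileMd5) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : index.entrySet()) {
            if (e.getKey().startsWith(fileMd5 + ":")) {
                result.add(e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> index = new LinkedHashMap<>();
        index.put("aaa:0", "Bid pricing table");
        index.put("aaa:1", "Delivery plan");
        index.put("bbb:0", "Pricing assumptions");
        for (String md5 : matchingFileMd5s(index, "pricing")) {
            System.out.println(md5 + " -> " + paragraphsOf(index, md5));
        }
    }
}
```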

The paragraph small file storage module implements the HTTP protocol on top of Netty, and files are downloaded over HTTP; transfers are verified by splitting the data into packets and checking the MD5 of the file; and the storage format of the split files is reworked so that the data is serialized with protobuf before being written to disk.
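The packet splitting and MD5 check can be illustrated with a short Java sketch. It assumes fixed-size chunks and a single MD5 digest computed over the whole file; the application does not specify the chunk size or the wire format, and the Netty transport and protobuf serialization are left out. All names here are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChunkedTransfer {
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Splits the payload into chunks of at most chunkSize bytes, in order. */
    static List<byte[]> split(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            chunks.add(Arrays.copyOfRange(data, off, Math.min(data.length, off + chunkSize)));
        }
        return chunks;
    }

    /** Joins the chunks and rejects the result if its MD5 differs from the sender's digest. */
    static byte[] reassemble(List<byte[]> chunks, byte[] expectedMd5) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            out.write(chunk, 0, chunk.length);
        }
        byte[] data = out.toByteArray();
        if (!MessageDigest.isEqual(md5(data), expectedMd5)) {
            throw new IllegalStateException("MD5 mismatch: transfer corrupted");
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] payload = "paragraph small file".getBytes(StandardCharsets.UTF_8);
        List<byte[]> chunks = split(payload, 8);
        byte[] received = reassemble(chunks, md5(payload));
        System.out.println(chunks.size() + " chunks, " + received.length + " bytes verified");
    }
}
```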

The paragraph small file storage module has a file deletion and recycle bin unit: a deleted file is moved to the recycle bin, and it can be restored from the recycle bin.
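One way to picture the delete-then-restore behavior is a store that keeps deleted entries in a separate trash area instead of discarding them. The Java sketch below is an in-memory illustration only; the class `TrashableStore` and its conflict rule (restore is refused when a live file already uses the key) are assumptions, and the application does not describe how its storage module handles these cases.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class TrashableStore {
    private final Map<String, byte[]> live = new HashMap<>();
    private final Map<String, byte[]> trash = new HashMap<>();

    void put(String key, byte[] content) {
        live.put(key, content);
    }

    /** Returns the live content, or null if the key is absent or in the trash. */
    byte[] get(String key) {
        return live.get(key);
    }

    boolean inTrash(String key) {
        return trash.containsKey(key);
    }

    /** Moves a live file to the trash; an older trashed copy under the same key is replaced. */
    boolean delete(String key) {
        byte[] content = live.remove(key);
        if (content == null) {
            return false;
        }
        trash.put(key, content);
        return true;
    }

    /** Moves a trashed file back; refused if nothing is trashed or a live file uses the key. */
    boolean restore(String key) {
        if (live.containsKey(key) || !trash.containsKey(key)) {
            return false;
        }
        live.put(key, trash.remove(key));
        return true;
    }

    public static void main(String[] args) {
        TrashableStore store = new TrashableStore();
        store.put("doc:1", "Pricing".getBytes(StandardCharsets.UTF_8));
        store.delete("doc:1");
        System.out.println("in trash: " + store.inTrash("doc:1"));
        store.restore("doc:1");
        System.out.println("restored: " + (store.get("doc:1") != null));
    }
}
```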

The document batch disassembling and merging module has a MongoDB document storage service unit installed. When a filtered file is to be decoupled, the module first checks whether decoupled data for it already exists in the paragraph small file storage module. If it does, the paragraph list tree structure titles are looked up in the MongoDB service by the document's MD5 code and returned to the user side for selection; if it does not, a decoupling operation is performed.
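The check-before-decouple logic amounts to a lookup keyed by the document MD5 that falls back to the decoupling operation only on a miss. The following Java sketch shows that control flow with an in-memory map in place of the MongoDB service; the class name, the `Function` callback that represents the decoupling operation, and the run counter are hypothetical and exist only to make the behavior observable.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class DecouplingCache {
    // Stand-in for the MongoDB collection of decoupled title lists, keyed by document MD5.
    private final Map<String, List<String>> titlesByMd5 = new HashMap<>();
    private int decoupleRuns = 0;

    /** Returns stored titles if the document was already decoupled, otherwise decouples once. */
    List<String> titlesFor(String documentMd5, Function<String, List<String>> decoupler) {
        List<String> cached = titlesByMd5.get(documentMd5);
        if (cached != null) {
            return cached;
        }
        decoupleRuns++;
        List<String> titles = decoupler.apply(documentMd5);
        titlesByMd5.put(documentMd5, titles);
        return titles;
    }

    int decoupleRuns() {
        return decoupleRuns;
    }

    public static void main(String[] args) {
        DecouplingCache cache = new DecouplingCache();
        Function<String, List<String>> decoupler = md5 -> List.of("1 Overview", "1.1 Scope");
        cache.titlesFor("900150983cd24fb0d6963f7d28e17f72", decoupler);
        cache.titlesFor("900150983cd24fb0d6963f7d28e17f72", decoupler);
        System.out.println("decoupling runs: " + cache.decoupleRuns());
    }
}
```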

The decoupling operation process comprises the following steps:

acquiring a Word file, acquiring all title lists of the document, and circularly traversing all levels of titles to acquire paragraph contents;

converting the paragraph contents into a file stream form, storing the file stream form in a paragraph small file storage module, setting parameters corresponding to each paragraph, and finally constructing a document content resource pool in a tree structure storage mode;

adding the corresponding parameters and returning the tree structure to the front-end page for the user to select paragraphs.
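The title traversal in the decoupling operation can be sketched in Java without a Word library by modeling each paragraph as a style name plus text. The sketch assumes heading styles are named `Heading 1`, `Heading 2`, and so on, and that a user-supplied `maxLevel` decides which headings start a new section; deeper headings are folded into the body of the enclosing section. These conventions and all names are illustrative assumptions; a real implementation would read styles from the Word file through a document API.

```java
import java.util.ArrayList;
import java.util.List;

final class HeadingSplitter {
    static final class Para {
        final String style;
        final String text;
        Para(String style, String text) {
            this.style = style;
            this.text = text;
        }
    }

    static final class Section {
        final int level;
        final String title;
        final StringBuilder body = new StringBuilder();
        Section(int level, String title) {
            this.level = level;
            this.title = title;
        }
    }

    /** Returns N for a style named "Heading N", or 0 for any other style. */
    static int headingLevel(String style) {
        if (style == null || !style.startsWith("Heading ")) {
            return 0;
        }
        try {
            return Integer.parseInt(style.substring("Heading ".length()).trim());
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    /** Starts a section at each heading up to maxLevel; other text joins the current section. */
    static List<Section> split(List<Para> paragraphs, int maxLevel) {
        List<Section> sections = new ArrayList<>();
        Section current = null;
        for (Para p : paragraphs) {
            int level = headingLevel(p.style);
            if (level >= 1 && level <= maxLevel) {
                current = new Section(level, p.text);
                sections.add(current);
            } else if (current != null) {
                // Text before the first heading is skipped in this sketch.
                if (current.body.length() > 0) {
                    current.body.append('\n');
                }
                current.body.append(p.text);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        List<Para> doc = List.of(
                new Para("Heading 1", "Overview"),
                new Para("Normal", "This bid covers..."),
                new Para("Heading 2", "Scope"),
                new Para("Normal", "The scope includes..."));
        for (Section s : split(doc, 2)) {
            System.out.println(s.level + " " + s.title + ": " + s.body);
        }
    }
}
```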

In the merged resource pool module, the MD5 code and the paragraphs of the source file are selected and obtained based on the paragraph tree structure list and stored in the resource pool.

The resource pool content synthesizing and downloading module looks up the file paragraphs in the current resource pool by the source file MD5 code, obtains the corresponding paragraph contents from the original Word file, combines the file streams to generate a new Word file stream, and returns it to the user side for downloading.

The invention has the following beneficial effects:

1. The user is free to choose which titles to extract, and documents with multi-dimensional, multi-angle, multi-level directories can be freely combined according to the selected keyword content.

2. The document processing capacity is large, which is a clear advantage for bulk processing: with normal memory available, the number of documents that can be processed is essentially unlimited, and the only time consumed is the keyword search, which greatly improves processing efficiency. Under the same conditions, obtaining the document contents and reworking them by hand takes a long time.

3. The document processing error rate approaches zero. No errors have occurred in the tests carried out so far. Under the same conditions, manual processing is affected by factors such as the individual's time, energy, and operating steps, and errors of various kinds are very likely.

4. The user can retrieve related PPT documents by keyword. Keywords are obtained from the documents, and the required title texts are then captured according to different screening criteria to form the corresponding Word files, which meets the user's need to generate a new document by referring to multiple Word documents. The user can drag the Word files in the personal resource pool into the desired order and then synthesize a new document.

5. Multiple Word files can be processed and decoupled in batches, with preview supported in the resource pool, which reduces the cost of working with content across multiple files and improves working efficiency.

6. Documents are stored on top of a search engine, whose document retrieval mainly uses an index service to provide content lookup, which makes it faster for users to obtain documents. A document uploaded by a user is split into paragraphs identified by MD5 hashes, and the paragraphs are stored in a public resource pool for users to search. The server can then quickly find the documents a user needs and look up the MD5 values of the matching content. Whether a paragraph counts as relevant can be defined by a configurable threshold, so that content related to the user's input is retrieved.
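The threshold test mentioned in item 6 is not specified further in this application. As one possible reading, the Java sketch below scores a paragraph by the fraction of keywords it contains and treats it as relevant when the score reaches the threshold. The scoring rule, the class `RelevanceFilter`, and the sample values are assumptions for illustration; a search engine such as ElasticSearch computes its own relevance scores.

```java
import java.util.List;
import java.util.Locale;

final class RelevanceFilter {
    /** Fraction of keywords that occur in the paragraph (case-insensitive), from 0.0 to 1.0. */
    static double score(String paragraph, List<String> keywords) {
        if (keywords.isEmpty()) {
            return 0.0;
        }
        String text = paragraph.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String keyword : keywords) {
            if (text.contains(keyword.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return (double) hits / keywords.size();
    }

    /** A paragraph is treated as relevant when its score is at least the threshold. */
    static boolean isRelevant(String paragraph, List<String> keywords, double threshold) {
        return score(paragraph, keywords) >= threshold;
    }

    public static void main(String[] args) {
        List<String> keywords = List.of("pricing", "schedule", "warranty");
        String paragraph = "Bid pricing and delivery schedule";
        System.out.println("score = " + score(paragraph, keywords));
        System.out.println("relevant at 0.5: " + isRelevant(paragraph, keywords, 0.5));
    }
}
```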

Drawings

FIG. 1 is a schematic diagram of a document decoupling and composition system design based on paragraph small file storage according to the present invention.

Detailed Description

Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

As shown in FIG. 1, an embodiment of the present application provides a document decoupling and synthesizing system based on paragraph small file storage. The basic idea is to enter a screening condition, select the relevant documents, determine the content specification of each document, disassemble the document paragraphs, and synthesize them. The basic implementation comprises the following modules:

1. the data document analyzing and storing module is used for storing the complete document, analyzing the document and obtaining MD5 codes, paragraph codes and paragraph contents based on Hash calculation;

(1) A MongoDB database is built on an application server and serves as the complete document storage service, and ElasticSearch is used as the search engine;

(2) A Java program starts multiple threads to upload Word files to the MongoDB database, performs a hash calculation on each file, and parses the Word file to form an MD5 code that serves as the unique file identifier. At the same time the file content is decomposed into paragraphs; the MD5 code of the uploaded file together with the paragraph code is used as the key, and the paragraph content is stored as the value in the ElasticSearch service so that it can be searched;

2. the relevant document acquisition module is used for acquiring relevant documents according to keywords input by a user;

Document operation permissions are assigned to different users, and the resource documents to be synthesized are submitted:

(1) A document permission system is introduced to assign the operations each user may perform on the documents found by a search, and to assign read-only or editable permission on the files the user downloads and synthesizes;

(2) The keywords entered by the user are obtained through the interface, file names and file contents are searched and matched in the ElasticSearch service, and once the set of MD5 codes of the matching files is obtained, the paragraph information list of each returned file is traversed so that the user can look through it;

3. the paragraph small file storage module is used for carrying out transmission verification, storage and management on the split file;

(1) The HTTP protocol is implemented on top of Netty, and files can be downloaded over HTTP;

(2) The core upload and download functions are improved: data is split into packets and the MD5 of the file is verified, which ensures that files are transferred accurately;

(3) The storage format of the split files is reworked so that data is serialized with protobuf before it is written to disk, which speeds up serialization and reduces the disk space used;

(4) File deletion and recycle bin function: after a file is deleted it is moved to the recycle bin, from which it can be restored.

4. The document batch disassembling and merging module is used for realizing batch disassembling and merging of Word files based on the paragraph list tree structure title;

(1) The MongoDB document storage service is installed.

(2) When the user clicks to decouple a filtered file, the system checks whether decoupled data for it already exists in the small file system. If it does, the MongoDB service is queried by the file's MD5 code to obtain the decoupled paragraph list tree structure titles, which are returned to the user side for selection.

(3) If no decoupled data exists, the decoupling operation is performed: the Word file is obtained, the list of all titles in the document is obtained, and the titles at every level are traversed in a loop to obtain the paragraph contents.

(4) The paragraph contents are converted into file streams and stored in the self-developed small file system, the parameters of each paragraph are set (source file MD5, parent paragraph ID, own paragraph ID, paragraph name, and paragraph level), and finally a document content resource pool is built in a tree-structured storage form.

(5) The corresponding parameters (source file MD5, paragraph name, sub-paragraphs, and paragraph level) are added, and the tree structure is returned to the front-end page for the user to select paragraphs.
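The paragraph parameters listed in steps (4) and (5) map naturally onto a tree node. The Java sketch below builds such a tree from a flat, document-ordered list of headings by keeping a stack of open ancestors. The sequential IDs (`p1`, `p2`, ...), the class names, and the in-memory representation are assumptions made for illustration; the application does not state how paragraph IDs are generated.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ParagraphTree {
    static final class Heading {
        final int level;
        final String name;
        Heading(int level, String name) {
            this.level = level;
            this.name = name;
        }
    }

    static final class Node {
        final String sourceMd5;
        final String id;
        final String parentId; // null for top-level paragraphs
        final String name;
        final int level;
        final List<Node> children = new ArrayList<>();
        Node(String sourceMd5, String id, String parentId, String name, int level) {
            this.sourceMd5 = sourceMd5;
            this.id = id;
            this.parentId = parentId;
            this.name = name;
            this.level = level;
        }
    }

    /** Builds the tree from headings given in document order. */
    static List<Node> build(String sourceMd5, List<Heading> headings) {
        List<Node> roots = new ArrayList<>();
        Deque<Node> open = new ArrayDeque<>(); // ancestors of the current position, deepest first
        int next = 1;
        for (Heading h : headings) {
            // Close every open node that is not a proper ancestor of this heading.
            while (!open.isEmpty() && open.peek().level >= h.level) {
                open.pop();
            }
            Node parent = open.peek();
            Node node = new Node(sourceMd5, "p" + next++,
                    parent == null ? null : parent.id, h.name, h.level);
            if (parent == null) {
                roots.add(node);
            } else {
                parent.children.add(node);
            }
            open.push(node);
        }
        return roots;
    }

    static void print(List<Node> nodes, String indent) {
        for (Node n : nodes) {
            System.out.println(indent + n.id + " " + n.name + " (level " + n.level + ")");
            print(n.children, indent + "  ");
        }
    }

    public static void main(String[] args) {
        List<Heading> headings = List.of(new Heading(1, "Overview"), new Heading(2, "Scope"),
                new Heading(2, "Schedule"), new Heading(1, "Pricing"));
        print(build("900150983cd24fb0d6963f7d28e17f72", headings), "");
    }
}
```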

5. A merged resource pool module for selecting paragraphs to add to the resource pool;

(1) The user obtains the paragraph tree structure list and selects the paragraphs to add to the resource pool.

(2) The MD5 code of the source file and the selected paragraphs are obtained and submitted for storage in the resource pool table.

6. The resource pool content synthesis and download module is used for synthesizing and downloading the resource pool content and generating a new Word file stream;

(1) The user can adjust the order in which the resource pool contents are generated, click to synthesize them, and download the result.

(2) The source file MD5 codes in the list are obtained and used to look up the document paragraphs in the current resource pool and fetch the corresponding paragraph contents from the original Word document; the document streams are then combined to generate a new Word document stream, which is returned to the user side for downloading.
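The ordering and combining steps can be sketched as a lookup of each resource pool entry in the small file store followed by writing the contents, in the user's order, to one output stream. Note the simplification: this Java sketch concatenates raw bytes, which is enough to show the flow but would not by itself produce a valid Word file, since merging real Word content requires a document library. The key layout `<sourceMd5>:<paragraphId>` and all names are assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ResourcePoolSynthesizer {
    static final class PoolEntry {
        final String sourceMd5;
        final String paragraphId;
        PoolEntry(String sourceMd5, String paragraphId) {
            this.sourceMd5 = sourceMd5;
            this.paragraphId = paragraphId;
        }
        String key() {
            return sourceMd5 + ":" + paragraphId;
        }
    }

    /** Writes the stored content of each pool entry to one stream, in list order. */
    static byte[] synthesize(List<PoolEntry> orderedPool, Map<String, byte[]> smallFiles) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (PoolEntry entry : orderedPool) {
            byte[] content = smallFiles.get(entry.key());
            if (content == null) {
                throw new IllegalArgumentException("No stored paragraph for " + entry.key());
            }
            out.write(content, 0, content.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, byte[]> smallFiles = new HashMap<>();
        smallFiles.put("m1:p1", "Overview\n".getBytes(StandardCharsets.UTF_8));
        smallFiles.put("m2:p7", "Pricing\n".getBytes(StandardCharsets.UTF_8));
        // The list order represents the order chosen by the user in the resource pool.
        List<PoolEntry> pool = List.of(new PoolEntry("m2", "p7"), new PoolEntry("m1", "p1"));
        System.out.print(new String(synthesize(pool, smallFiles), StandardCharsets.UTF_8));
    }
}
```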

7. The detailed document disassembling and synthesizing process comprises the following specific steps:

(1) The user filters the relevant documents by keyword and selects one or more of them to parse at the same time. The selected documents are first hashed, for example with the MD5 algorithm; if the calculated hash value already exists in MongoDB, the document is not uploaded again.

(2) A document object model is constructed through an API (application programming interface): the input stream of the file is read, and the document object is used to extract the list tags of all paragraphs that are list items, forming a set of paragraph nodes. Each paragraph is traversed in a loop. If the style of the current paragraph is one of the built-in heading styles of the Word document, the level of the paragraph is obtained; if that level is smaller than the title level parameter set by the user, its sub-paragraphs are queried recursively, and so on until the last paragraph is found.

After the paragraph title data of each level is obtained, a paragraph object is encapsulated. It contains the paragraph ID, the document ID, the paragraph title name of the parent level, and a subscript index within the level (used to order the paragraphs of the synthesized file). The document ID and the paragraph ID are combined to form the key under which the paragraph content is stored.

The document input stream is read with the paragraph title as the input parameter, and the DOM or XmlDocument of all content nodes is obtained through the API. All nodes are traversed to extract the content of the paragraph. Each node is first validated, that is, checked for existence in the document; if it exists, it is saved to ElasticSearch and to the file storage system: the key formed by combining the document ID and the paragraph ID is used as the primary key, the content is stored in ElasticSearch as the value, and an output stream is constructed and saved as a small file.

(3) The keywords entered by the user are segmented into terms and matched against the values stored in ElasticSearch (ES) to obtain the primary keys of the matching objects. The document IDs are extracted from the primary keys with a regular expression and passed to the page, which displays the tree structure.

(4) After clicking to add titles to the resource pool, the user can select other documents and continue adding paragraphs. In the resource pool the user can sort all paragraphs and then click to synthesize the file. The keys formed from the document ID and paragraph ID of every list item are passed as parameters to the small file system for matching and lookup; the input streams of all resources to be synthesized are then constructed, traversed, and written to a new file, and the output stream is returned to the browser for download. This completes the file synthesis process.

The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention.

Claims

1. A document decoupling and synthesizing system based on paragraph small file storage, comprising:

2. The document decoupling and synthesizing system based on paragraph small file storage according to claim 1, characterized in that in the data document analysis and storage module, complete documents are stored in a MongoDB database built on an application server; ElasticSearch is used as the search engine; a Java program starts multiple threads to upload Word files to the MongoDB database, performs a hash calculation on each file, and parses the Word file to form an MD5 code that serves as the unique file identifier; the file content is decomposed into paragraphs; and the MD5 code of the uploaded file together with the paragraph code is used as the key, with the paragraph content stored as the value in the ElasticSearch service so that it can be searched.

3. The document decoupling and synthesizing system based on paragraph small file storage according to claim 1, wherein the relevant document acquisition module assigns document operation permissions to different users and submits the resource documents to be synthesized, specifically:

4. The document decoupling and synthesizing system based on paragraph small file storage according to claim 1, wherein the paragraph small file storage module implements the HTTP protocol on top of Netty and downloads files over HTTP; verifies transfers by splitting the data into packets and checking the MD5 of the file; and reworks the storage format of the split files so that the data is serialized with protobuf before being written to disk.

5. The document decoupling and synthesizing system based on paragraph small file storage according to claim 4, wherein the paragraph small file storage module has a file deletion and recycle bin unit, configured to move a file to the recycle bin after it is deleted and to restore the file from the recycle bin.

6. The document decoupling and synthesizing system based on paragraph small file storage according to claim 1, wherein the document batch disassembling and merging module has a MongoDB document storage service unit installed; when a filtered file is to be decoupled, the module first checks whether decoupled data exists in the paragraph small file storage module; if decoupled data exists, the paragraph list tree structure titles are looked up in the MongoDB service by the document's MD5 code and returned to the user side for selection; if no decoupled data exists, a decoupling operation is performed.

7. The document decoupling and composition system based on paragraph small file storage according to claim 6, wherein the decoupling operation process comprises:

obtaining a Word file, obtaining all title lists of a document, and circularly traversing all levels of titles to obtain paragraph contents;

converting the paragraph contents into a file stream form and storing the file stream form in a paragraph small file storage module, setting parameters corresponding to each paragraph, and finally constructing a document content resource pool in a tree structure storage mode;

8. The document decoupling and synthesizing system based on paragraph small file storage according to claim 1, wherein in the merged resource pool module, the MD5 code and paragraphs of the source file are selected and obtained based on the paragraph tree structure list and stored in the resource pool.

9. The document decoupling and synthesizing system based on paragraph small file storage according to claim 1, wherein the resource pool content synthesizing and downloading module looks up the file paragraphs in the current resource pool by the source file MD5 code, obtains the corresponding paragraph contents from the original Word file, combines the file streams to generate a new Word file stream, and returns it to the user side for downloading.