CN106021390A - File management method and device - Google Patents
File management method and device
- Publication number
- CN106021390A (application CN201610312975.7A)
- Authority
- CN
- China
- Prior art keywords
- file
- module
- text
- stream
- content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/258—Data format conversion from or to a database
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a file management method and device, which are used to quickly optimize the query and retrieval of files and to spare users the trouble of creating keywords themselves. The file management method provided by the invention comprises: obtaining a to-be-processed file from a temporary folder of a server, and obtaining the file format corresponding to the to-be-processed file; selecting a corresponding text reader according to the obtained file format; reading the to-be-processed file into a file stream using the text reader, and extracting text content from the file stream; and adding the text content to a Lucene index or a database field, so that a user can search the Lucene search engine through the Lucene index, or search the database search engine through the database field.
Description
Technical field
The present invention relates to the field of file management technology, and in particular to a file management method and device.
Background
With the development of informatization, many companies and government departments have accumulated large numbers of files over years of office work. To avoid dependence on optical character recognition (OCR) hardware and to ease the burden of querying file content, file content needs to be recognized as early as the file consolidation stage.
Files pile up, are stored in disorder, and grow month after month, which makes file management chaotic. To solve these problems and let users quickly find the files they want, an effective file management system is needed. In view of this, some vendors have proposed classifying files by keywords that users create themselves, so that users can classify or retrieve related files by keyword. However, because the keywords must be judged and created by the user, this approach burdens the user with that judgment, and the process of creating keywords is also very complex.
In summary, the prior art has long suffered from inconvenient document classification and retrieval, so an improved technical means is needed to solve this problem.
Summary of the invention
An object of the present invention is to provide a file management method and device, which are used to quickly optimize the query and retrieval of files and to spare users the trouble of creating keywords themselves.
To achieve the above object, the present invention adopts the following technical solutions:
In one aspect, the present invention provides a file management method, including:
obtaining a to-be-processed file from a temporary folder of a server, and obtaining the file format corresponding to the to-be-processed file;
selecting a corresponding text reader according to the obtained file format;
reading the to-be-processed file into a file stream using the text reader, and extracting text content from the file stream;
adding the text content to a Lucene index or a database field, so that a user can search the Lucene search engine through the Lucene index, or search the database search engine through the database field.
In another aspect, the present invention provides a file management device, including:
a format acquisition module, configured to obtain a to-be-processed file from a temporary folder of a server, and to obtain the file format corresponding to the to-be-processed file;
a reader selection module, configured to select a corresponding text reader according to the obtained file format;
a content extraction module, configured to read the to-be-processed file into a file stream using the text reader, and to extract text content from the file stream;
a content saving module, configured to add the text content to a Lucene index or a database field, so that a user can search the Lucene search engine through the Lucene index, or search the database search engine through the database field.
With the above technical solutions, the present invention has the following advantages:
In the embodiments of the present invention, files are stored in a temporary folder of the server. When a to-be-processed file exists in the temporary folder, the file format corresponding to it is obtained. A corresponding text reader can be selected for each file format; the text reader reads the to-be-processed file into a file stream, and text content is extracted from the file stream. The embodiments therefore select a specific text reader by judging the file format and automatically extract text content from the file stream. The text content is added to a Lucene index or a database field, and the user searches the Lucene search engine through the Lucene index, or searches the database search engine through the database field. Unprocessed files are thus not placed directly on the server but are staged through the server's temporary folder, and file content is extracted automatically by the text reader, which spares users the trouble of creating keywords themselves. The query and retrieval of files can therefore be quickly optimized. Conversion and extraction of file content data are supported, providing a relatively complete, high-performance text function.
Brief description of the drawings
Fig. 1 is a schematic flow block diagram of a file management method provided by an embodiment of the present invention;
Fig. 2 is a schematic workflow diagram of the file management method provided by an embodiment of the present invention;
Fig. 3-a is a schematic structural diagram of a file management device provided by an embodiment of the present invention;
Fig. 3-b is a schematic structural diagram of a format acquisition module provided by an embodiment of the present invention.
Detailed description of the invention
The embodiments of the present invention provide a file management method and device, which are used to quickly optimize the query and retrieval of files and to spare users the trouble of creating keywords themselves.
To make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.
The terms "include" and "have" and any variations of them in the description, the claims and the above drawings of the present invention are intended to cover non-exclusive inclusion, so that a process, method, system, product or device comprising a series of units is not necessarily limited to those units, but may include other units that are not explicitly listed or that are inherent to such processes, methods, products or devices.
Detailed descriptions are given below. The embodiments of the present invention can quickly retrieve file content, so as to enable the sharing and retrieval of file data, reduce the workload of searching for files, and improve work efficiency. An embodiment of the file management method of the present invention can be applied to the automatic management of files. Referring to Fig. 1, the file management method provided by the present invention may include the following steps:
101. Obtain a to-be-processed file from a temporary folder of a server, and obtain the file format corresponding to the to-be-processed file.
In the embodiments of the present invention, the server is used to store files. Files are saved in folders on the server, and each file has a unique storage path. A user can first store a file in the temporary folder of the server, where it waits for subsequent processing by the method provided by the embodiments. The file obtained by the user may be a file extracted from a database, a file entered manually by the user, a file extracted from a file system, or a file captured from the network; the specific implementation is not limited. For example, the user can place uploaded files in the temporary folder of the server. The temporary folder holds all to-be-processed files, and the embodiments can scan the temporary folder at a preset period, so that to-be-processed files are obtained automatically.
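The scan of the temporary folder described above could be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name TempFolderScanner and the method listPending are hypothetical, and the sketch treats every regular file directly inside the folder as a to-be-processed file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists the regular files currently waiting in the temporary folder.
final class TempFolderScanner {
    private TempFolderScanner() {}

    static List<Path> listPending(Path tempFolder) {
        // Files.list must be closed, hence the try-with-resources block.
        try (Stream<Path> entries = Files.list(tempFolder)) {
            return entries
                    .filter(Files::isRegularFile)   // skip sub-folders
                    .sorted()                        // stable processing order
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A scheduler such as ScheduledExecutorService.scheduleAtFixedRate could call listPending at the preset period; the period itself is left to the deployment.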
In some embodiments of the present invention, obtaining the file format corresponding to the to-be-processed file in step 101 may include the following steps:
A1. Parse the attribute information of the to-be-processed file;
A2. Determine, according to the attribute information, that the file format corresponding to the to-be-processed file is one of the following formats: xls, xlsx, doc, docx, pdf, txt, ppt, pptx.
After the to-be-processed file is obtained from the temporary folder, its file format can be obtained in several ways. In step A1, for example, the attribute information of the file is parsed and the file format is determined from the file's attributes; in practice, files of different formats have different attributes. The file format can also be analyzed from the file name, for example by reading the suffix of the file name. Specifically, the file format is one of the following: xls, xlsx, doc, docx, pdf, txt, ppt, pptx. It should be noted that these are only formats that current files may have, and no limitation is imposed: in the embodiments of the present invention the file format corresponding to the to-be-processed file can also be a format other than xls, xlsx, doc, docx, pdf, txt, ppt and pptx, such as the Visio format, as determined by the application scenario.
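Determining the format from the file name suffix, as mentioned above, might look like the following sketch. The class name FormatDetector and its detect method are hypothetical; the set of formats is the one listed in step A2, and anything else is reported as unrecognized rather than guessed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper: derives the file format from the file name suffix.
final class FormatDetector {
    // The eight formats listed in step A2.
    private static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList(
            "xls", "xlsx", "doc", "docx", "pdf", "txt", "ppt", "pptx"));

    private FormatDetector() {}

    static Optional<String> detect(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty();                 // no suffix at all
        }
        // Suffixes are compared case-insensitively, so "Report.XLSX" counts as xlsx.
        String suffix = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(suffix) ? Optional.of(suffix) : Optional.empty();
    }
}
```

A suffix check is cheap but trusts the file name; an implementation that parses attribute information as in step A1 would not rely on the name alone.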
102. Select a corresponding text reader according to the obtained file format.
In the embodiments of the present invention, once the file format of a to-be-processed file saved in the temporary folder has been obtained, a specific text reader can be selected according to that format. If multiple to-be-processed files are obtained from the temporary folder, a corresponding text reader can be selected for each of them. The text reader can be a dedicated text reading tool, or a text reading module provided in the file management device of the embodiments of the present invention. The text reader is selected according to the specific file format, and it can have a specific implementation for each file format it reads, as described in more detail below.
In some embodiments of the present invention, selecting a corresponding text reader according to the obtained file format in step 102 may include the following steps:
B1. When the file format corresponding to the to-be-processed file is xls, select the HSSFWorkbook module as the text reader;
B2. When the file format corresponding to the to-be-processed file is xlsx, select the XSSFWorkbook module as the text reader;
B3. When the file format corresponding to the to-be-processed file is doc, select the HWPFDocument module as the text reader;
B4. When the file format corresponding to the to-be-processed file is docx, select the XWPFDocument module as the text reader;
B5. When the file format corresponding to the to-be-processed file is pdf, select the PDFParser module as the text reader;
B6. When the file format corresponding to the to-be-processed file is txt, select the FileReader module as the text reader;
B7. When the file format corresponding to the to-be-processed file is ppt, select the HSLFSlideShow module as the text reader;
B8. When the file format corresponding to the to-be-processed file is pptx, select the XSLFSlideShow module as the text reader.
Steps B1 to B8 may all be performed, or only one or several of them may be performed, selected flexibly according to the specific implementation scenario; no limitation is imposed. In addition, different file formats may have different implementations depending on the operating system and application program chosen; the above are examples of several feasible approaches, and the embodiments of the present invention are not limited to them. In the embodiments, the text reader can be selected flexibly according to the file format, for example the HSSFWorkbook module, XSSFWorkbook module, HWPFDocument module, XWPFDocument module, PDFParser module, FileReader module, HSLFSlideShow module or XSLFSlideShow module. These modules are merely concrete examples, implemented with POI, pdfbox and java IO, for commonly used file formats; other kinds of text parsers can also be selected according to the file formats that need to be parsed.
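The selection in steps B1 to B8 amounts to a lookup from file format to reader module. The sketch below shows that lookup with the module names taken from the steps above, held as plain strings so that the example needs no third-party library; the class name ReaderSelector and its select method are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup table for steps B1 to B8: file format -> reader module name.
final class ReaderSelector {
    private static final Map<String, String> READER_BY_FORMAT;

    static {
        Map<String, String> m = new HashMap<>();
        m.put("xls", "HSSFWorkbook");
        m.put("xlsx", "XSSFWorkbook");
        m.put("doc", "HWPFDocument");
        m.put("docx", "XWPFDocument");
        m.put("pdf", "PDFParser");
        m.put("txt", "FileReader");
        m.put("ppt", "HSLFSlideShow");
        m.put("pptx", "XSLFSlideShow");
        READER_BY_FORMAT = Collections.unmodifiableMap(m);
    }

    private ReaderSelector() {}

    static String select(String format) {
        String reader = READER_BY_FORMAT.get(format);
        if (reader == null) {
            // Formats outside B1 to B8 (Visio, for instance) would need their own entry.
            throw new IllegalArgumentException("No text reader registered for format: " + format);
        }
        return reader;
    }
}
```

Keeping the mapping in one table means that supporting a further format is a matter of adding one entry and the matching extractor.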
103. Read the to-be-processed file into a file stream using the text reader, and extract text content from the file stream.
In the embodiments of the present invention, after the text reader corresponding to the file format has been selected in step 102, the text reader reads the to-be-processed file into a file stream, and text content is then extracted from that file stream. The text content is the text data in the file stream and carries the content information of the to-be-processed file. Extracting it from the file stream makes it possible to describe the content of the to-be-processed file automatically, so the text content can serve as the keywords of the to-be-processed file. In the embodiments of the present invention the text content is generated automatically and does not need to be entered by the user.
In some embodiments of the present invention, reading the to-be-processed file into a file stream using the text reader and extracting text content from the file stream in step 103 includes:
C1. When the file format corresponding to the to-be-processed file is xls, use the HSSFWorkbook module to create an xls file stream from the xls file, and use the HSSFExcelExtractor module to extract the xls text content from the xls file stream;
C2. When the file format corresponding to the to-be-processed file is xlsx, use the XSSFWorkbook module to create an xlsx file stream from the xlsx file, and use the XSSFExcelExtractor module to extract the xlsx text content from the xlsx file stream;
C3. When the file format corresponding to the to-be-processed file is doc, use the HWPFDocument module to create a doc file stream from the doc file, and use the HWPFWordExtractor module to extract the doc text content from the doc file stream;
C4. When the file format corresponding to the to-be-processed file is docx, use the XWPFDocument module to create a docx file stream from the docx file, and use the XWPFWordExtractor module to extract the docx text content from the docx file stream;
C5. When the file format corresponding to the to-be-processed file is pdf, use the PDFParser module to create a pdf file stream from the pdf file, and use the PDFTextStripper module to extract the pdf text content from the pdf file stream;
C6. When the file format corresponding to the to-be-processed file is txt, use the FileReader module to create a txt file stream from the txt file, and use the FileReader module to extract the txt text content from the txt file stream;
C7. When the file format corresponding to the to-be-processed file is ppt, use the HSLFSlideShow module to create a ppt file stream from the ppt file, and use the HSLFPowerPointExtractor module to extract the ppt text content from the ppt file stream;
C8. When the file format corresponding to the to-be-processed file is pptx, use the XSLFSlideShow module to create a pptx file stream from the pptx file, and use the XSLFPowerPointExtractor module to extract the pptx text content from the pptx file stream.
Where steps B1 to B8 have been performed, steps C1 to C8 may all be performed, or only one or several of them may be performed, selected flexibly according to the specific implementation scenario; no limitation is imposed. In addition, different file formats may have different implementations depending on the operating system and application program chosen; the above are examples of several feasible approaches, and the embodiments of the present invention are not limited to them.
To facilitate understanding and implementation of the above solutions of the embodiments of the present invention, corresponding application scenarios are described below. Referring to Fig. 2, which is a schematic workflow diagram of the file management method provided by an embodiment of the present invention, the extraction of text content is illustrated for different file formats, taking Excel, Word, PDF, PPT and txt files as examples.
In the implementation scenarios of steps C1 and C2 above, the to-be-processed file is first read into a file stream; for example, the StreamWriter and StreamReader classes in the System.IO namespace can be used to read and write file content. The HSSFWorkbook method of POI is used to create an xls file object, and the ExcelExtractor method of POI then parses the xls file to obtain the text content; the XSSFWorkbook method of POI is used to create an xlsx file object, and the XSSFExcelExtractor method of POI then parses the xlsx file. Apache POI is an open-source library that provides APIs for Java programs to read and write Microsoft Office format files. For example, HSSF provides reading and writing of Microsoft Excel format files. XSSF provides reading and writing of Microsoft Excel OOXML format files. HWPF provides reading and writing of Microsoft Word format files. HSLF provides reading and writing of Microsoft PowerPoint format files. HDGF provides reading and writing of Microsoft Visio format files.
An example follows, starting with the parsing of xls. The class org.apache.poi.hssf.extractor.ExcelExtractor can be used to process Excel 2003 (.xls) files, in either of two ways: (1) as with doc, create the extractor through ExtractorFactory; (2) create a new ExcelExtractor object directly. The object in (2) can be created in the following ways: 1) as with doc, by receiving a POIFSFileSystem; 2) by receiving an HSSFWorkbook, where the HSSFWorkbook object can itself be created by receiving a POIFSFileSystem or InputStream object. Sample code may be as follows:
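import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.hssf.extractor.ExcelExtractor;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

InputStream inp = new FileInputStream(this.filePath);
HSSFWorkbook wb = new HSSFWorkbook(new POIFSFileSystem(inp));
this.extractor = new ExcelExtractor(wb);
// true: output the formula text of formula cells instead of their computed values
this.extractor.setFormulasNotResults(true);
// leave sheet names out of the extracted text
this.extractor.setIncludeSheetNames(false);
this.content = this.extractor.getText();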
Next, the parsing of the file stream is illustrated with xlsx. The classes org.apache.poi.POITextExtractor and org.apache.poi.POIXMLTextExtractor can be used; an XSSFExcelExtractor object is created directly, or is created by receiving an XSSFWorkbook object. An XSSFWorkbook can be created by receiving an InputStream object.
In the implementation scenarios of steps C3 and C4 above, the WordExtractor method of POI is used to parse the doc file; the XWPFDocument method of POI is used to create a document object, and the XWPFWordExtractor method then parses the docx file. WordExtractor is the parsing method that Apache POI developed for Word 2003, and XWPFWordExtractor is the parsing method that Apache POI added for Word 2007. Specifically, the doc parsing steps are as follows. The class org.apache.poi.hwpf.extractor.WordExtractor is used to process Word 2003 documents (.doc), for example in either of the following two ways:
(1) Use ExtractorFactory.createExtractor(InputStream) to create the extractor object. The return value is a common interface object, so a cast is required: InputStream fis = new FileInputStream(filePath); WordExtractor extractor = (WordExtractor) ExtractorFactory.createExtractor(fis);
(2) Use WordExtractor directly to create the extractor object. It can be created by receiving an InputStream, by receiving a POIFSFileSystem (which is itself created by receiving an InputStream), or by receiving an HWPFDocument, which is created by receiving an InputStream or a POIFSFileSystem.
Next, the parsing of the file stream is illustrated with docx. The classes org.apache.poi.POITextExtractor, org.apache.poi.POIXMLTextExtractor and org.apache.poi.xwpf.extractor.XWPFWordExtractor can be used to process Word 2007 (.docx) files. For example: (1) the parent of this class of object can be generated with ExtractorFactory, but the parameter it needs to receive is an abstract OPCPackage object, which is awkward to determine, so the method in (2) is generally used to create the object directly. (2) Create a new XWPFWordExtractor object by receiving an XWPFDocument object; the XWPFDocument can in turn be created by receiving an InputStream object.
In the implementation scenario of step C5 above, the PDFParser method of pdfbox is used to create a pdf document object, and the PDFTextStripper method of pdfbox then extracts the text information. PDF parsing can be implemented with the code in pdfbox and is not described in further detail.
In the implementation scenario of step C6 above, the FileReader method of java.io is used to extract the text information from the txt document object. txt parsing can be implemented with the code in java.io and is not described in further detail.
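For the txt case, a minimal sketch of extraction with java.io.FileReader is shown below. The class name TxtExtractor is hypothetical, and the sketch assumes the file is UTF-8 encoded, which the text above does not specify; a real deployment would need to know or detect the encoding.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a txt file through FileReader and returns its text content.
final class TxtExtractor {
    private TxtExtractor() {}

    static String extract(File txtFile) {
        StringBuilder text = new StringBuilder();
        // FileReader(File, Charset) requires Java 11 or later; UTF-8 is an assumption.
        try (BufferedReader reader =
                     new BufferedReader(new FileReader(txtFile, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');   // normalise line endings to '\n'
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

The try-with-resources block closes the underlying file stream even if reading fails, which matters when many files are processed in a batch.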
In the implementation scenarios of steps C7 and C8 above, the PowerPointExtractor of POI is used to parse the text of the ppt file, and the XSLFPowerPointExtractor of POI is used to parse the text of the pptx file. The ppt parsing steps are as follows. The class org.apache.poi.hslf.extractor.PowerPointExtractor is used to process PowerPoint 2003 (.ppt) files. For example: (1) as with doc, create the object with ExtractorFactory; (2) create the object directly with PowerPointExtractor, which can be done by receiving an HSLFSlideShow object, a POIFSFileSystem object or a file name.
Next, the parsing of the file stream is illustrated with pptx. The classes org.apache.poi.POITextExtractor and org.apache.poi.POIXMLTextExtractor can be used to process PowerPoint 2007 (.pptx) files, and an XSLFPowerPointExtractor object is created directly. For example: (1) receive an XMLSlideShow object, where the XMLSlideShow object is created by receiving an InputStream; (2) receive an XSLFSlideShow object, where the XSLFSlideShow object can be created by receiving a file path.
104. Add the text content to a Lucene index or a database field, so that a user can search the Lucene search engine through the Lucene index, or search the database search engine through the database field.
In the embodiments of the present invention, after the text content corresponding to the to-be-processed file has been obtained in step 103, it can be added to a Lucene index or a database field; the user then searches the Lucene search engine through the Lucene index, or searches the database search engine through the database field. The embodiments use a document content parsing framework to generate the Lucene retrieval file or the database field content, so as to enable the sharing and retrieval of file data, reduce the workload of searching for files, and improve work efficiency. The file retrieval described above comes in two types: Lucene search and database search. Lucene search mainly uses Lucene's index engine to create a content index. The service that Lucene search provides has two parts, one "in" and one "out". "In" means writing: the generated text content is added to the index as a source string, or deleted from the index. "Out" means reading: a full-text search service is provided to the user, who can locate the source string by keyword. Database search stores the text content in a database field and then uses the database's Structured Query Language (SQL) to query the data and present the matching files.
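The "in" and "out" halves of the index service can be illustrated with a toy in-memory inverted index. To be clear, this is not Lucene and does not use Lucene's API: it is a deliberately simplified stand-in, written against the JDK only, that shows adding text to an index, deleting it again, and locating files by keyword. The class name ToyTextIndex and the naive tokenizer are assumptions made for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for a full-text index (NOT Lucene): term -> paths of files containing it.
final class ToyTextIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // "In": add the extracted text content of one file to the index.
    void add(String filePath, String textContent) {
        for (String term : tokenize(textContent)) {
            if (term.isEmpty()) {
                continue;                       // split() can yield a leading empty token
            }
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(filePath);
        }
    }

    // "In": delete one file from the index.
    void delete(String filePath) {
        postings.values().forEach(paths -> paths.remove(filePath));
        postings.values().removeIf(Set::isEmpty);
    }

    // "Out": return the files whose text contains the keyword (case-insensitive).
    Set<String> search(String keyword) {
        Set<String> hits = postings.get(keyword.toLowerCase(Locale.ROOT));
        return hits == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(hits);
    }

    // Naive tokenizer: lower-case, split on anything that is not a letter or digit.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }
}
```

A real Lucene index adds analysis, scoring and on-disk storage on top of this idea; for the database route, the same text would instead be stored in a column and queried with SQL.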
As the application scenarios above illustrate, the embodiments of the present invention use poi, pdfbox, java IO, database retrieval technology and Lucene to quickly optimize the query and retrieval of files. Conversion and extraction of the data content of mainstream file types are supported: Java multithreading, asynchronous execution and a thread pool are used to read the files in a loop and then extract their content in batches. This provides a relatively complete, high-performance function for extracting text from files in batches. In the embodiments of the present invention, a file reading tool can be used to read out the file content, and the document content is stored in the index of the search engine and in the database field, so that files can be searched quickly and their content retrieved.
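The batch extraction with a thread pool mentioned above might be organized as in the following sketch. The class name BatchExtractor, the extractAll method, and the idea of passing the per-file extractor in as a function are assumptions made for illustration; the text does not describe how the threads are actually arranged.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: extracts text from many files concurrently on a fixed thread pool.
final class BatchExtractor {
    private BatchExtractor() {}

    static Map<String, String> extractAll(List<String> filePaths,
                                          Function<String, String> extractor,
                                          int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one asynchronous extraction task per file.
            List<Future<String>> pending = new ArrayList<>();
            for (String path : filePaths) {
                Callable<String> task = () -> extractor.apply(path);
                pending.add(pool.submit(task));
            }
            // Collect the results in submission order.
            Map<String, String> textByPath = new LinkedHashMap<>();
            for (int i = 0; i < filePaths.size(); i++) {
                textByPath.put(filePaths.get(i), pending.get(i).get());
            }
            return textByPath;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while extracting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Extraction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch one failing file aborts the whole batch; a production version would more likely record the failure for that file and carry on with the rest.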
In some embodiments of the present invention, adding the text content to a Lucene index or a database field in step 104 may specifically include the following step:
D1. Add the file path, the file attribute information and the text content of the to-be-processed file together to the Lucene index or the database field.
In practice, when file content is added to the Lucene index or the database field, the file path and file attribute information of the to-be-processed file can be added as well, so that more content related to the to-be-processed file can be retrieved in the Lucene search engine and the database search engine, which makes retrieval more efficient for the user. The file path is the storage path of the to-be-processed file on the server, and the file attribute information is the attributes of the to-be-processed file, such as the file format, the file size, the department of the user who owns the file, and the creation time of the file.
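Step D1 can be pictured as building one record per file that carries the path, the attributes and the text content together. In the sketch below the class name IndexRecordBuilder and the field names ("path", "format", "size", "created", "content") are hypothetical; the owning user's department is omitted because it cannot be read from the file system and would have to come from the application.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: gathers path, attributes and text content into one record.
final class IndexRecordBuilder {
    private IndexRecordBuilder() {}

    static Map<String, String> build(Path file, String format, String textContent) {
        try {
            BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("path", file.toAbsolutePath().toString());   // storage path on the server
            fields.put("format", format);                           // e.g. "txt"
            fields.put("size", Long.toString(attrs.size()));        // size in bytes
            fields.put("created", attrs.creationTime().toString()); // creation time
            fields.put("content", textContent);                     // extracted text
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each entry of the returned map would then become one field of an index document or one column of a database row.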
As the foregoing embodiments illustrate, in the embodiments of the present invention files are stored in a temporary folder of the server. When a to-be-processed file exists in the temporary folder, the file format corresponding to it is obtained. A corresponding text reader can be selected for each file format; the text reader reads the to-be-processed file into a file stream, and text content is extracted from the file stream. The embodiments therefore select a specific text reader by judging the file format and automatically extract text content from the file stream. The text content is added to a Lucene index or a database field, and the user searches the Lucene search engine through the Lucene index, or searches the database search engine through the database field. Unprocessed files are thus not placed directly on the server but are staged through the server's temporary folder, and file content is extracted automatically by the text reader, which spares users the trouble of creating keywords themselves. The query and retrieval of files can therefore be quickly optimized. Conversion and extraction of file content data are supported, providing a relatively complete, high-performance text function.
The foregoing embodiments describe the file management method provided by the embodiments of the present invention. A file management device provided by the embodiments of the present invention is introduced next. Referring to Fig. 3-a, the file management device 300 provided by the embodiments of the present invention may include:
a format acquisition module 301, configured to obtain a to-be-processed file from a temporary folder of a server, and to obtain the file format corresponding to the to-be-processed file;
a reader selection module 302, configured to select a corresponding text reader according to the obtained file format;
a content extraction module 303, configured to read the to-be-processed file into a file stream using the text reader, and to extract text content from the file stream;
a content saving module 304, configured to add the text content to a Lucene index or a database field, so that a user can search the Lucene search engine through the Lucene index, or search the database search engine through the database field.
In some embodiments of the present invention, referring to Fig. 3-b, the format acquisition module 301 includes:
a parsing module 3011, configured to parse the attribute information of the to-be-processed file;
a format determination module 3012, configured to determine, according to the attribute information, that the file format corresponding to the to-be-processed file is one of the following formats: xls, xlsx, doc, docx, pdf, txt, ppt, pptx.
In some embodiments of the present invention, the reader selection module 302 is configured to: when the file format corresponding to the to-be-processed file is the xls format, select the HSSFWorkbook module as the text reader; when the file format is the xlsx format, select the XSSFWorkbook module as the text reader; when the file format is the doc format, select the HWPFDocument module as the text reader; when the file format is the docx format, select the XWPFDocument module as the text reader; when the file format is the pdf format, select the PDFParser module as the text reader; when the file format is the txt format, select the FileReader module as the text reader; when the file format is the ppt format, select the HSLFSlideShow module as the text reader; and when the file format is the pptx format, select the XSLFSlideShow module as the text reader.
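The selection rule above amounts to a lookup table keyed by file format. The following Java sketch shows that table using only the module names given in the text; the class name `ReaderSelector` and the method `select` are hypothetical. To stay free of third-party dependencies, the sketch returns the module name as a string, whereas a real implementation would presumably construct the corresponding reader object.

```java
import java.util.Map;

// Hypothetical helper: format -> name of the reader module listed in the text.
final class ReaderSelector {
    private static final Map<String, String> READERS = Map.of(
            "xls", "HSSFWorkbook",
            "xlsx", "XSSFWorkbook",
            "doc", "HWPFDocument",
            "docx", "XWPFDocument",
            "pdf", "PDFParser",
            "txt", "FileReader",
            "ppt", "HSLFSlideShow",
            "pptx", "XSLFSlideShow");

    static String select(String format) {
        // Map.of rejects null keys on lookup, so guard before calling get.
        String reader = (format == null) ? null : READERS.get(format);
        if (reader == null) {
            throw new IllegalArgumentException("Unsupported file format: " + format);
        }
        return reader;
    }

    public static void main(String[] args) {
        System.out.println(select("docx"));
    }
}
```

Keeping the mapping in one table means that supporting an additional format only requires a new entry, not a new branch.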
In some embodiments of the present invention, further, the content extraction module 303 is specifically configured to: when the file format corresponding to the to-be-processed file is the xls format, use the HSSFWorkbook module to create an xls file stream from the xls file, and use the HSSFExcelExtractor module to extract xls text content from the xls file stream; when the file format is the xlsx format, use the XSSFWorkbook module to create an xlsx file stream from the xlsx file, and use the XSSFExcelExtractor module to extract xlsx text content from the xlsx file stream; when the file format is the doc format, use the HWPFDocument module to create a doc file stream from the doc file, and use the HWPFWordExtractor module to extract doc text content from the doc file stream; when the file format is the docx format, use the XWPFDocument module to create a docx file stream from the docx file, and use the XWPFWordExtractor module to extract docx text content from the docx file stream; when the file format is the pdf format, use the PDFParser module to create a pdf file stream from the pdf file, and use the PDFTextStripper module to extract pdf text content from the pdf file stream; when the file format is the txt format, use the FileWriter module to create a txt file stream from the txt file, and use the FileWriter module to extract txt text content from the txt file stream; when the file format is the ppt format, use the HSLFSlideShow module to create a ppt file stream from the ppt file, and use the HSLF PowerPointExtractor module to extract ppt text content from the ppt file stream; and when the file format is the pptx format, use the XSLFSlideShow module to create a pptx file stream from the pptx file, and use the XSLFPowerPointExtractor module to extract pptx text content from the pptx file stream.
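For the txt case, the read-into-stream and extract steps can be sketched with the Java standard library alone. The text names FileWriter for this step; because `java.io.FileWriter` only writes, this sketch assumes `java.io.FileReader`, the reader selected for the txt format in the preceding paragraph. The class name `TxtTextExtractor`, the UTF-8 encoding, and Java 11 or later are likewise assumptions.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: reads a txt file into a character stream and
// returns its text content with '\n' line endings.
final class TxtTextExtractor {
    static String extract(Path file) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the stream even if reading fails.
        try (BufferedReader reader = new BufferedReader(
                new FileReader(file.toFile(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("pending", ".txt");
        Files.writeString(tmp, "first line\nsecond line\n", StandardCharsets.UTF_8);
        System.out.print(extract(tmp));
        Files.delete(tmp);
    }
}
```

The other formats follow the same two-step shape, opening the document with the selected reader and then handing it to the matching extractor, but they depend on third-party libraries and are not shown here.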
In some embodiments of the present invention, the content saving module 304 is configured to add the file path and file attribute information of the to-be-processed file, together with the text content, to the lucene index or to the database field at the same time.
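The effect of storing the path, attributes, and text content together can be illustrated with a small in-memory inverted index. This is a stand-in written with the Java standard library only, not the lucene API and not the database variant; the class name `InMemoryFileIndex`, its methods, and the sample path and attribute string are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the index: term -> paths of matching files.
final class InMemoryFileIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Indexes path, attribute information, and text content in one call,
    // so a search on any of the three finds the file.
    void add(String filePath, String attributes, String textContent) {
        for (String field : new String[] {filePath, attributes, textContent}) {
            // Split on anything that is not a letter or digit.
            for (String term : field.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new TreeSet<>()).add(filePath);
                }
            }
        }
    }

    // Returns the paths of all files containing the term (case-insensitive).
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of());
    }

    public static void main(String[] args) {
        InMemoryFileIndex index = new InMemoryFileIndex();
        index.add("/tmp/upload/report.docx", "size=2048 owner=alice",
                "Quarterly revenue summary");
        System.out.println(index.search("revenue"));
    }
}
```

Because the extracted text is indexed automatically, the user can find the file by a word from its content without having assigned any keyword beforehand.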
As can be seen from the foregoing description of the embodiments of the present invention, files are stored in a temporary folder of the server. When a to-be-processed file exists in the temporary folder, the file format corresponding to the to-be-processed file is obtained. A corresponding text reader can be selected for each file format; the text reader reads the to-be-processed file into a file stream, and text content is extracted from the file stream. A specific text reader is thus selected by judging the file format, and the text content is extracted automatically from the file stream. The text content is added to a lucene index or to a database field, and the user searches a lucene search engine through the lucene index, or searches a database search engine through the database field. In the embodiments of the present invention, an unprocessed file is therefore not placed directly on the server but is staged through the temporary folder of the server, and the text reader automatically extracts the file content, sparing the user the trouble of creating keywords manually. File query and retrieval can therefore be optimized quickly. Conversion and extraction of file content data are supported, providing complete, high-performance text functionality.
It should further be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the accompanying drawings of the device embodiments provided by the present invention, a connection between modules indicates that they have a communication connection, which may be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, and of course also by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits, or dedicated circuits. For the present invention, however, a software implementation is in most cases the better embodiment. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disk, or optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
In summary, the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the above embodiments, or make equivalent replacements of some of the technical features therein, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A file management method, characterized by comprising:
obtaining a to-be-processed file from a temporary folder of a server, and obtaining the file format corresponding to the to-be-processed file;
selecting a corresponding text reader according to the obtained file format;
using the text reader to read the to-be-processed file into a file stream, and extracting text content from the file stream;
adding the text content to a lucene index or to a database field, so that a user can search a lucene search engine through the lucene index, or search a database search engine through the database field.
2. The file management method according to claim 1, characterized in that obtaining the file format corresponding to the to-be-processed file comprises:
parsing attribute information of the to-be-processed file;
determining, according to the attribute information, that the file format corresponding to the to-be-processed file is one of the following formats: xls, xlsx, doc, docx, pdf, txt, ppt, pptx.
3. The file management method according to claim 2, characterized in that selecting a corresponding text reader according to the obtained file format comprises:
when the file format corresponding to the to-be-processed file is the xls format, selecting the HSSFWorkbook module as the text reader;
when the file format corresponding to the to-be-processed file is the xlsx format, selecting the XSSFWorkbook module as the text reader;
when the file format corresponding to the to-be-processed file is the doc format, selecting the HWPFDocument module as the text reader;
when the file format corresponding to the to-be-processed file is the docx format, selecting the XWPFDocument module as the text reader;
when the file format corresponding to the to-be-processed file is the pdf format, selecting the PDFParser module as the text reader;
when the file format corresponding to the to-be-processed file is the txt format, selecting the FileReader module as the text reader;
when the file format corresponding to the to-be-processed file is the ppt format, selecting the HSLFSlideShow module as the text reader;
when the file format corresponding to the to-be-processed file is the pptx format, selecting the XSLFSlideShow module as the text reader.
4. The file management method according to claim 3, characterized in that using the text reader to read the to-be-processed file into a file stream, and extracting text content from the file stream, comprises:
when the file format corresponding to the to-be-processed file is the xls format, using the HSSFWorkbook module to create an xls file stream from the xls file, and using the HSSFExcelExtractor module to extract xls text content from the xls file stream;
when the file format corresponding to the to-be-processed file is the xlsx format, using the XSSFWorkbook module to create an xlsx file stream from the xlsx file, and using the XSSFExcelExtractor module to extract xlsx text content from the xlsx file stream;
when the file format corresponding to the to-be-processed file is the doc format, using the HWPFDocument module to create a doc file stream from the doc file, and using the HWPFWordExtractor module to extract doc text content from the doc file stream;
when the file format corresponding to the to-be-processed file is the docx format, using the XWPFDocument module to create a docx file stream from the docx file, and using the XWPFWordExtractor module to extract docx text content from the docx file stream;
when the file format corresponding to the to-be-processed file is the pdf format, using the PDFParser module to create a pdf file stream from the pdf file, and using the PDFTextStripper module to extract pdf text content from the pdf file stream;
when the file format corresponding to the to-be-processed file is the txt format, using the FileWriter module to create a txt file stream from the txt file, and using the FileWriter module to extract txt text content from the txt file stream;
when the file format corresponding to the to-be-processed file is the ppt format, using the HSLFSlideShow module to create a ppt file stream from the ppt file, and using the HSLF PowerPointExtractor module to extract ppt text content from the ppt file stream;
when the file format corresponding to the to-be-processed file is the pptx format, using the XSLFSlideShow module to create a pptx file stream from the pptx file, and using the XSLFPowerPointExtractor module to extract pptx text content from the pptx file stream.
5. The file management method according to claim 1, characterized in that adding the text content to a lucene index or to a database field comprises:
adding the file path and file attribute information of the to-be-processed file, together with the text content, to the lucene index or to the database field at the same time.
6. A file management device, characterized by comprising:
a format acquisition module, configured to obtain a to-be-processed file from a temporary folder of a server, and to obtain the file format corresponding to the to-be-processed file;
a reader selection module, configured to select a corresponding text reader according to the obtained file format;
a content extraction module, configured to use the text reader to read the to-be-processed file into a file stream, and to extract text content from the file stream;
a content saving module, configured to add the text content to a lucene index or to a database field, so that a user can search a lucene search engine through the lucene index, or search a database search engine through the database field.
7. The file management device according to claim 6, characterized in that the format acquisition module comprises:
a parsing module, configured to parse attribute information of the to-be-processed file;
a format determination module, configured to determine, according to the attribute information, that the file format corresponding to the to-be-processed file is one of the following formats: xls, xlsx, doc, docx, pdf, txt, ppt, pptx.
8. The file management device according to claim 7, characterized in that the reader selection module is specifically configured to: when the file format corresponding to the to-be-processed file is the xls format, select the HSSFWorkbook module as the text reader; when the file format corresponding to the to-be-processed file is the xlsx format, select the XSSFWorkbook module as the text reader; when the file format corresponding to the to-be-processed file is the doc format, select the HWPFDocument module as the text reader; when the file format corresponding to the to-be-processed file is the docx format, select the XWPFDocument module as the text reader; when the file format corresponding to the to-be-processed file is the pdf format, select the PDFParser module as the text reader; when the file format corresponding to the to-be-processed file is the txt format, select the FileReader module as the text reader; when the file format corresponding to the to-be-processed file is the ppt format, select the HSLFSlideShow module as the text reader; and when the file format corresponding to the to-be-processed file is the pptx format, select the XSLFSlideShow module as the text reader.
9. The file management device according to claim 8, characterized in that the content extraction module is specifically configured to: when the file format corresponding to the to-be-processed file is the xls format, use the HSSFWorkbook module to create an xls file stream from the xls file, and use the HSSFExcelExtractor module to extract xls text content from the xls file stream; when the file format corresponding to the to-be-processed file is the xlsx format, use the XSSFWorkbook module to create an xlsx file stream from the xlsx file, and use the XSSFExcelExtractor module to extract xlsx text content from the xlsx file stream; when the file format corresponding to the to-be-processed file is the doc format, use the HWPFDocument module to create a doc file stream from the doc file, and use the HWPFWordExtractor module to extract doc text content from the doc file stream; when the file format corresponding to the to-be-processed file is the docx format, use the XWPFDocument module to create a docx file stream from the docx file, and use the XWPFWordExtractor module to extract docx text content from the docx file stream; when the file format corresponding to the to-be-processed file is the pdf format, use the PDFParser module to create a pdf file stream from the pdf file, and use the PDFTextStripper module to extract pdf text content from the pdf file stream; when the file format corresponding to the to-be-processed file is the txt format, use the FileWriter module to create a txt file stream from the txt file, and use the FileWriter module to extract txt text content from the txt file stream; when the file format corresponding to the to-be-processed file is the ppt format, use the HSLFSlideShow module to create a ppt file stream from the ppt file, and use the HSLF PowerPointExtractor module to extract ppt text content from the ppt file stream; and when the file format corresponding to the to-be-processed file is the pptx format, use the XSLFSlideShow module to create a pptx file stream from the pptx file, and use the XSLFPowerPointExtractor module to extract pptx text content from the pptx file stream.
10. The file management device according to claim 6, characterized in that the content saving module is configured to add the file path and file attribute information of the to-be-processed file, together with the text content, to the lucene index or to the database field at the same time.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610312975.7A CN106021390A (en) | 2016-05-12 | 2016-05-12 | File management method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610312975.7A CN106021390A (en) | 2016-05-12 | 2016-05-12 | File management method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106021390A (en) | 2016-10-12 |
Family
ID=57100178
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610312975.7A Pending CN106021390A (en) | 2016-05-12 | 2016-05-12 | File management method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021390A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291949A (en) * | 2017-07-17 | 2017-10-24 | 小草数语(北京)科技有限公司 | Information search method and device |
CN108197117A (en) * | 2018-01-31 | 2018-06-22 | 厦门大学 | A kind of Chinese text keyword extracting method based on document subject matter structure with semanteme |
CN111143849A (en) * | 2019-12-31 | 2020-05-12 | 奇安信科技集团股份有限公司 | File type identification method and device applied to electronic equipment and electronic equipment |
CN111881332A (en) * | 2020-06-17 | 2020-11-03 | 武汉光庭信息技术股份有限公司 | Automatic driving simulation data management server and method |
CN111915424A (en) * | 2020-07-30 | 2020-11-10 | 平安证券股份有限公司 | Information storage method and related product |
CN113268283A (en) * | 2021-05-28 | 2021-08-17 | 深圳市蓬莱产业科技有限公司 | Batch processing method based on file materials |
CN111915424B (en) * | 2020-07-30 | 2024-06-28 | 平安证券股份有限公司 | Information storage method and related product |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102819592A (en) * | 2012-08-08 | 2012-12-12 | 河海大学 | Lucene-based desktop searching system and method |
CN104899337A (en) * | 2015-07-01 | 2015-09-09 | 中国农业银行股份有限公司 | File index building method and system |
CN105045852A (en) * | 2015-07-06 | 2015-11-11 | 华东师范大学 | Full-text search engine system for teaching resources |
CN105574164A (en) * | 2015-12-16 | 2016-05-11 | 北京华傲达数据技术有限公司 | Excel document data analysis method and device |
Non-Patent Citations (1)
Title |
---|
HARRYHUANG1990: "Extracting OFFICE text (DOC, DOCX, XLS, XLSX, PPT, PPTX) with Apache POI: Desktop Search development notes [lessons learned]", 《HTTP://BLOG.CSDN.NET/HARRYHUANG1990/ARTICLE/DETAILS/11888561》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291949A (en) * | 2017-07-17 | 2017-10-24 | 小草数语(北京)科技有限公司 | Information search method and device |
CN107291949B (en) * | 2017-07-17 | 2020-11-13 | 绿湾网络科技有限公司 | Information searching method and device |
CN108197117A (en) * | 2018-01-31 | 2018-06-22 | 厦门大学 | A kind of Chinese text keyword extracting method based on document subject matter structure with semanteme |
CN108197117B (en) * | 2018-01-31 | 2020-05-26 | 厦门大学 | Chinese text keyword extraction method based on document theme structure and semantics |
CN111143849A (en) * | 2019-12-31 | 2020-05-12 | 奇安信科技集团股份有限公司 | File type identification method and device applied to electronic equipment and electronic equipment |
CN111881332A (en) * | 2020-06-17 | 2020-11-03 | 武汉光庭信息技术股份有限公司 | Automatic driving simulation data management server and method |
CN111915424A (en) * | 2020-07-30 | 2020-11-10 | 平安证券股份有限公司 | Information storage method and related product |
CN111915424B (en) * | 2020-07-30 | 2024-06-28 | 平安证券股份有限公司 | Information storage method and related product |
CN113268283A (en) * | 2021-05-28 | 2021-08-17 | 深圳市蓬莱产业科技有限公司 | Batch processing method based on file materials |
CN113268283B (en) * | 2021-05-28 | 2022-03-22 | 深圳市蓬莱产业科技有限公司 | Batch processing method based on file materials |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161012 |