CN106021390A - File management method and device - Google Patents

File management method and device

Info

Publication number
CN106021390A
Authority
CN
China
Prior art keywords
file
module
text
stream
content
Prior art date
Legal status
Pending
Application number
CN201610312975.7A
Other languages
Chinese (zh)
Inventor
王俊鹏
王永军
王国顺
Current Assignee
Fujian Linewell Software Co Ltd
Original Assignee
Fujian Linewell Software Co Ltd
Priority date
Filing date
Publication date
Application filed by Fujian Linewell Software Co Ltd
Priority to CN201610312975.7A
Publication of CN106021390A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/258: Data format conversion from or to a database
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/16: File or folder operations, e.g. details of user interfaces specifically adapted to file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a file management method and device for quickly optimizing the querying and retrieval of files and sparing users the trouble of creating keywords themselves. The file management method provided by the invention comprises the steps of: obtaining a file to be processed from a temporary folder of a server, and obtaining the file format corresponding to the file to be processed; selecting a corresponding text reader according to the obtained file format; reading the file to be processed into a file stream with the text reader, and extracting text content from the file stream; and adding the text content to a Lucene index or to a database field, so that the user can search the Lucene search engine through the Lucene index, or search the database search engine through the database field.

Description

File management method and device
Technical field
The present invention relates to the field of file management technology, and in particular to a file management method and device.
Background
With the development of informatization, many companies and government departments have accumulated large numbers of files over years of office work. To avoid dependence on optical character recognition (OCR) hardware and to ease the burden of querying file content, file content needs to be recognized at the file consolidation stage.
Large numbers of files pile up and are stored in a disorganized way, and the sharp growth in files year after year and month after month makes file management chaotic. An effective file management system is therefore needed so that users can quickly find the files they want. In view of this, some vendors have proposed classifying files by keywords that users create themselves, so that users can classify or retrieve related files by keyword. However, this approach requires users to decide on and create the keywords themselves, which not only burdens users with that judgment but also makes the keyword creation process very complicated.
In summary, the prior art has long suffered from inconvenient file classification and retrieval, so an improved technical means is needed to solve this problem.
Summary of the invention
An object of the present invention is to provide a file management method and device that quickly optimize the querying and retrieval of files and spare users the trouble of creating keywords themselves.
To achieve the above object, the present invention adopts the following technical solutions:
In one aspect, the present invention provides a file management method, including:
obtaining a file to be processed from a temporary folder of a server, and obtaining the file format corresponding to the file to be processed;
selecting a corresponding text reader according to the obtained file format;
using the text reader to read the file to be processed into a file stream, and extracting text content from the file stream;
adding the text content to a Lucene index or to a database field, so that a user can search the Lucene search engine through the Lucene index, or search the database search engine through the database field.
In another aspect, the present invention provides a file management device, including:
a format acquisition module, configured to obtain a file to be processed from a temporary folder of a server, and to obtain the file format corresponding to the file to be processed;
a reader selection module, configured to select a corresponding text reader according to the obtained file format;
a content extraction module, configured to use the text reader to read the file to be processed into a file stream, and to extract text content from the file stream;
a content saving module, configured to add the text content to a Lucene index or to a database field, so that a user can search the Lucene search engine through the Lucene index, or search the database search engine through the database field.
With the above technical solutions, the present invention has the following advantages:
In the embodiments of the present invention, files are stored in a temporary folder of the server. When pending files remain in the temporary folder, the file format corresponding to each file to be processed is obtained. A corresponding text reader can be selected for each file format; the text reader reads the file to be processed into a file stream, and text content is extracted from the file stream. The embodiments therefore select a specific text reader by determining the file format and automatically extract text content from the file stream. The text content is added to a Lucene index or to a database field, and the user searches the Lucene search engine through the Lucene index, or searches the database search engine through the database field. Unprocessed files are thus not placed directly on the server but are staged through the server's temporary folder, and the text reader extracts the file content automatically, sparing users the trouble of creating keywords themselves. The querying and retrieval of files can therefore be quickly optimized. The solution supports conversion and extraction of file content data and provides complete, high-performance text extraction.
Brief description of the drawings
Fig. 1 is a schematic flow block diagram of a file management method provided by an embodiment of the present invention;
Fig. 2 is a schematic workflow diagram of the file management method provided by an embodiment of the present invention;
Fig. 3-a is a schematic structural diagram of a file management device provided by an embodiment of the present invention;
Fig. 3-b is a schematic structural diagram of a format acquisition module provided by an embodiment of the present invention.
Detailed description
The embodiments of the present invention provide a file management method and device that quickly optimize the querying and retrieval of files and spare users the trouble of creating keywords themselves.
To make the objects, features, and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.
The terms "include" and "have" and any variations thereof in the description, the claims, and the above drawings are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that comprises a series of units is not necessarily limited to those units, but may include other units that are not expressly listed or that are inherent to such a process, method, product, or device.
Each is described in detail below. The embodiments of the present invention can quickly retrieve file content, enabling the sharing and retrieval of file data, reducing the workload of searching for files, and improving work efficiency. One embodiment of the file management method of the present invention can be applied to the automatic management of files. Referring to Fig. 1, the file management method provided by the present invention may include the following steps:
101. Obtain a file to be processed from a temporary folder of a server, and obtain the file format corresponding to the file to be processed.
In the embodiments of the present invention, the server is used to store files. Files are saved in folders on the server, and each file has a unique storage path. A user can first store a file in the server's temporary folder, where it waits for subsequent processing by the method provided by the embodiments. The file obtained by the user may be extracted from a database, entered manually by the user, extracted from a file system, or crawled from a network; the specific implementation is not limited. For example, the user can place uploaded files in the server's temporary folder. The temporary folder holds all pending files, and the embodiments can scan the temporary folder at a preset period to obtain pending files automatically.
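The scanning part of step 101 can be sketched in plain Java as follows. This is only an illustrative sketch, not the implementation described in the source: the class name `TempFolderScanner` and the method `pendingFiles` are hypothetical, and the sketch assumes that every regular file directly inside the temporary folder counts as pending.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists the files waiting in the server's temporary folder.
final class TempFolderScanner {

    /** Returns the regular files directly inside tempFolder, sorted by path. */
    static List<Path> pendingFiles(Path tempFolder) {
        // Files.list is not recursive; subfolders are skipped by the filter.
        try (Stream<Path> entries = Files.list(tempFolder)) {
            return entries.filter(Files::isRegularFile)
                          .sorted()
                          .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

To scan at a preset period, a call to `pendingFiles` could be wrapped in a task passed to `ScheduledExecutorService.scheduleWithFixedDelay`; the period itself is a deployment choice that the source does not specify.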
In some embodiments of the present invention, obtaining the file format corresponding to the file to be processed in step 101 may include the following steps:
A1. Parse the attribute information of the file to be processed;
A2. Determine, according to the attribute information, that the file format corresponding to the file to be processed is one of the following formats: xls, xlsx, doc, docx, pdf, txt, ppt, pptx.
After a file to be processed is obtained from the temporary folder, its file format can be obtained in several ways. In step A1, for example, the attribute information of the file is parsed so that the file format is determined from the file's attributes; in practice, files of different formats have different attributes. The file format can also be determined from the file name, for example from its suffix. Specifically, the file format is one of the following: xls, xlsx, doc, docx, pdf, txt, ppt, pptx. Note that these are only formats that current files may have and are not limiting: in the embodiments of the present invention, the file format corresponding to the file to be processed may also be a format other than xls, xlsx, doc, docx, pdf, txt, ppt, and pptx, such as the Visio format, as determined by the application scenario.
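The suffix-based variant of format detection mentioned above might look like the following sketch. The class name `FileFormats` and the method `detect` are hypothetical; the sketch assumes the format is taken from the last dot-separated suffix of the file name, compared case-insensitively against the eight formats listed in step A2.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper: derives the file format from the file name suffix.
final class FileFormats {

    // The eight formats named in step A2.
    static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList(
            "xls", "xlsx", "doc", "docx", "pdf", "txt", "ppt", "pptx"));

    /** Returns the lower-cased suffix if it is a supported format, otherwise empty. */
    static Optional<String> detect(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no suffix, or name ends with a dot
        }
        String suffix = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(suffix) ? Optional.of(suffix) : Optional.empty();
    }
}
```

A suffix check is cheap but trusts the file name; parsing file attributes or content, as in step A1, would be the more robust choice when names are unreliable.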
102. Select a corresponding text reader according to the obtained file format.
In the embodiments of the present invention, for the pending files saved in the temporary folder, once the file format has been obtained, a specific text reader can be selected according to it. If multiple pending files are obtained from the temporary folder, a corresponding text reader can be selected for each of them. The text reader may be a dedicated text reading tool, or a text reading module provided in the file management device provided by the embodiments of the present invention. The text reader can be selected according to the specific file format, and may have a specific implementation for reading each file format, as detailed below.
In some embodiments of the present invention, selecting a corresponding text reader according to the obtained file format in step 102 may include the following steps:
B1. When the file format corresponding to the file to be processed is xls, select the HSSFWorkbook module as the text reader;
B2. When the file format corresponding to the file to be processed is xlsx, select the XSSFWorkbook module as the text reader;
B3. When the file format corresponding to the file to be processed is doc, select the HWPFDocument module as the text reader;
B4. When the file format corresponding to the file to be processed is docx, select the XWPFDocument module as the text reader;
B5. When the file format corresponding to the file to be processed is pdf, select the PDFParser module as the text reader;
B6. When the file format corresponding to the file to be processed is txt, select the FileReader module as the text reader;
B7. When the file format corresponding to the file to be processed is ppt, select the HSLFSlideShow module as the text reader;
B8. When the file format corresponding to the file to be processed is pptx, select the XSLFSlideShow module as the text reader.
All of steps B1 to B8, or any one or several of them, may be performed, chosen flexibly according to the implementation scenario; the specific implementation is not limited. In addition, for different file formats, different implementations are possible depending on the chosen operating system and application; the above are only examples of several possible implementations, and the embodiments of the present invention are not limited to them. In the embodiments, the text reader can be selected flexibly according to the file format, for example the HSSFWorkbook module, XSSFWorkbook module, HWPFDocument module, XWPFDocument module, PDFParser module, FileReader module, HSLFSlideShow module, or XSLFSlideShow module. For these common file formats, the above modules are only specific examples implemented with POI, pdfbox, and java IO; other kinds of text parsers may also be selected according to the file formats that need to be parsed.
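The selection in steps B1 to B8 amounts to a lookup from file format to reader module. The following sketch shows that lookup with the module names taken from the steps above; the class `ReaderSelector` and its method `select` are hypothetical, and the sketch returns only the module name as a string so that it stays independent of the POI and pdfbox libraries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup table for steps B1 to B8: file format -> reader module name.
final class ReaderSelector {

    private static final Map<String, String> READER_BY_FORMAT = new LinkedHashMap<>();
    static {
        READER_BY_FORMAT.put("xls", "HSSFWorkbook");   // B1
        READER_BY_FORMAT.put("xlsx", "XSSFWorkbook");  // B2
        READER_BY_FORMAT.put("doc", "HWPFDocument");   // B3
        READER_BY_FORMAT.put("docx", "XWPFDocument");  // B4
        READER_BY_FORMAT.put("pdf", "PDFParser");      // B5
        READER_BY_FORMAT.put("txt", "FileReader");     // B6
        READER_BY_FORMAT.put("ppt", "HSLFSlideShow");  // B7
        READER_BY_FORMAT.put("pptx", "XSLFSlideShow"); // B8
    }

    /** Returns the reader module name for a format, or throws if the format is unsupported. */
    static String select(String format) {
        String reader = READER_BY_FORMAT.get(format);
        if (reader == null) {
            throw new IllegalArgumentException("Unsupported file format: " + format);
        }
        return reader;
    }
}
```

Keeping the mapping in one table means that supporting a further format, such as Visio, only requires one more entry rather than another branch.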
103. Use the text reader to read the file to be processed into a file stream, and extract text content from the file stream.
In the embodiments of the present invention, after the text reader corresponding to the file format is selected in step 102, the text reader can read the file to be processed into a file stream (File Stream), and text content can then be extracted from the file stream. The text content is the text data in the file stream and carries the content information of the file to be processed. Extracting it from the file stream provides an automatic description of the content of the file to be processed, so the text content can serve as the keywords of the file. In the embodiments, the text content is generated automatically and requires no user input.
In some embodiments of the present invention, using the text reader to read the file to be processed into a file stream and extracting text content from the file stream in step 103 includes:
C1. When the file format corresponding to the file to be processed is xls, use the HSSFWorkbook module to create an xls file stream from the xls file, and use the HSSFExcelExtractor module to extract the xls text content from the xls file stream;
C2. When the file format corresponding to the file to be processed is xlsx, use the XSSFWorkbook module to create an xlsx file stream from the xlsx file, and use the XSSFExcelExtractor module to extract the xlsx text content from the xlsx file stream;
C3. When the file format corresponding to the file to be processed is doc, use the HWPFDocument module to create a doc file stream from the doc file, and use the HWPFWordExtractor module to extract the doc text content from the doc file stream;
C4. When the file format corresponding to the file to be processed is docx, use the XWPFDocument module to create a docx file stream from the docx file, and use the XWPFWordExtractor module to extract the docx text content from the docx file stream;
C5. When the file format corresponding to the file to be processed is pdf, use the PDFParser module to create a pdf file stream from the pdf file, and use the PDFTextStripper module to extract the pdf text content from the pdf file stream;
C6. When the file format corresponding to the file to be processed is txt, use the FileReader module to create a txt file stream from the txt file, and use the FileReader module to extract the txt text content from the txt file stream;
C7. When the file format corresponding to the file to be processed is ppt, use the HSLFSlideShow module to create a ppt file stream from the ppt file, and use the HSLF PowerPointExtractor module to extract the ppt text content from the ppt file stream;
C8. When the file format corresponding to the file to be processed is pptx, use the XSLFSlideShow module to create a pptx file stream from the pptx file, and use the XSLFPowerPointExtractor module to extract the pptx text content from the pptx file stream.
Where steps B1 to B8 have been performed as described above, all of steps C1 to C8, or any one or several of them, may be performed, chosen flexibly according to the implementation scenario; the specific implementation is not limited. In addition, for different file formats, different implementations are possible depending on the chosen operating system and application; the above are only examples of several possible implementations, and the embodiments of the present invention are not limited to them.
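Steps C1 to C8 share one shape: look up the extractor for the format, then apply it to the file's data. The sketch below captures only that dispatch. The class `ExtractorRegistry` and its methods are hypothetical, and each extractor is modelled as a function from raw file bytes to text; in a real deployment the registered functions would wrap the POI, pdfbox, or java.io calls named in the steps, which are deliberately left out here so the sketch needs nothing beyond the JDK.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry: dispatches text extraction by file format (steps C1 to C8).
final class ExtractorRegistry {

    // file format -> function that turns the file's bytes into text content
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    /** Registers (or replaces) the extractor used for one file format. */
    void register(String format, Function<byte[], String> extractor) {
        extractors.put(format, extractor);
    }

    /** Extracts text with the extractor registered for the format. */
    String extract(String format, byte[] fileBytes) {
        Function<byte[], String> extractor = extractors.get(format);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + format);
        }
        return extractor.apply(fileBytes);
    }
}
```

For example, a plain-text extractor could be registered as `registry.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8))`, assuming the txt files are UTF-8 encoded.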
To facilitate better understanding and implementation of the above solutions of the embodiments of the present invention, a corresponding application scenario is described below as an example. Fig. 2 is a schematic workflow diagram of the file management method provided by an embodiment of the present invention. The extraction of text content is illustrated below for different file formats, taking the text extraction of Excel, Word, Pdf, PPT, and Txt files as examples.
In the implementation scenarios of steps C1 and C2, the file to be processed is first read into a file stream; for example, the StreamWriter and StreamReader classes in the System.IO namespace can be used to read and write file content. The HSSFWorkbook method of POI creates an xls file object, and the ExcelExtractor method of POI then parses the xls file to obtain the text content; the XSSFWorkbook method of POI creates an xlsx file object, and the XSSFExcelExtractor method of POI then parses the xlsx file. Apache POI is an open-source library that provides Java programs with an API for reading and writing Microsoft Office format files. For example, HSSF reads and writes Microsoft Excel format files. XSSF reads and writes Microsoft Excel OOXML format files. HWPF reads and writes Microsoft Word format files. HSLF reads and writes Microsoft PowerPoint format files. HDGF reads Microsoft Visio format files.
An example follows, starting with xls parsing. The class org.apache.poi.hssf.extractor.ExcelExtractor can be used to process Excel 2003 (.xls) files. (1) As with doc, the extractor can be created through ExtractorFactory. (2) A new ExcelExtractor object can be created directly, in either of the following ways: 1) as with doc, by passing in a POIFSFileSystem; 2) by passing in an HSSFWorkbook, where the HSSFWorkbook object can itself be created from a POIFSFileSystem or InputStream object. Sample code may be as follows:
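// Uses org.apache.poi.hssf.usermodel.HSSFWorkbook, org.apache.poi.hssf.extractor.ExcelExtractor,
// org.apache.poi.poifs.filesystem.POIFSFileSystem, java.io.InputStream, java.io.FileInputStream
InputStream inp = new FileInputStream(this.filePath);
HSSFWorkbook wb = new HSSFWorkbook(new POIFSFileSystem(inp));
this.extractor = new ExcelExtractor(wb);
// return the formula text itself rather than the computed result
this.extractor.setFormulasNotResults(true);
// leave sheet names out of the extracted text
this.extractor.setIncludeSheetNames(false);
this.content = this.extractor.getText();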
Next, the parsing of the file stream is illustrated with xlsx. org.apache.poi.POITextExtractor and org.apache.poi.POIXMLTextExtractor can be used: a new XSSFExcelExtractor object is created directly, or it is created by passing in an XSSFWorkbook object. XSSFWorkbook can accept an InputStream object.
In the implementation scenarios of steps C3 and C4, the WordExtractor method of POI parses doc files; the XWPFDocument method of POI creates a document object, and the XWPFWordExtractor method then parses docx files. WordExtractor is the parsing method that Apache POI developed for Word 2003, and XWPFWordExtractor is the parsing method that Apache POI added for Word 2007. Specifically, doc parsing proceeds as follows: the class org.apache.poi.hwpf.extractor.WordExtractor is used to process Word 2003 documents (.doc). For example, either of the following two approaches can be used:
(1) Use ExtractorFactory.createExtractor(InputStream) to create the extractor object. The return value is a common interface type, so a cast is needed: InputStream fis = new FileInputStream(filePath); WordExtractor extractor = (WordExtractor) ExtractorFactory.createExtractor(fis); (2) Use WordExtractor to create the extractor object directly: by passing in an InputStream; by passing in a POIFSFileSystem, which is itself created from an InputStream; or by passing in an HWPFDocument, which is created from an InputStream or a POIFSFileSystem.
Next, the parsing of the file stream is illustrated with docx. The classes org.apache.poi.POITextExtractor, org.apache.poi.POIXMLTextExtractor, and org.apache.poi.xwpf.extractor.XWPFWordExtractor can be used to process Word 2007 (.docx) files. Specifically: (1) the parent class object can be generated with ExtractorFactory, but the parameter it requires is an abstract OPCPackage object, which is awkward to determine, so the method in (2) is generally used instead. (2) Create a new XWPFWordExtractor object by passing in an XWPFDocument object; XWPFDocument can in turn accept an InputStream object.
In the implementation scenario of step C5, the PDFParser method of pdfbox creates a pdf document object, and the PDFTextStripper method of pdfbox then extracts the text. PDF parsing can be implemented with the code in pdfbox and is not described in further detail.
In the implementation scenario of step C6, the FileReader method of java.io extracts the text from the txt document object. Txt parsing can be implemented with the code in java.io and is not described in further detail.
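Since the source leaves the txt case undetailed, the following is one possible sketch of it using only java.io. The class name `TxtTextExtractor` is hypothetical, and the sketch assumes the file is encoded in the platform default charset, which is what the single-argument `FileReader` constructor uses on older Java versions.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical txt extractor built on java.io.FileReader (step C6).
final class TxtTextExtractor {

    /** Reads the txt file line by line and returns its text, with lines joined by '\n'. */
    static String extract(String filePath) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
            boolean first = true;
            String line;
            while ((line = reader.readLine()) != null) {
                if (!first) {
                    text.append('\n'); // separator between lines, none after the last
                }
                text.append(line);
                first = false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

If the encoding of the txt files is known, wrapping a `FileInputStream` in an `InputStreamReader` with an explicit charset would avoid depending on the platform default.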
In the implementation scenarios of steps C7 and C8, the PowerPointExtractor of POI parses text from ppt files, and the XSLFPowerPointExtractor of POI parses text from pptx files. PPT parsing proceeds as follows: the class org.apache.poi.hslf.extractor.PowerPointExtractor is used to process PowerPoint 2003 files. For example: (1) as with doc, create the object with ExtractorFactory; (2) use PowerPointExtractor directly, creating the object by passing in an HSLFSlideShow object, a POIFSFileSystem object, or a file name.
Next, the parsing of the file stream is illustrated with pptx. The classes org.apache.poi.POITextExtractor and org.apache.poi.POIXMLTextExtractor can be used to process PowerPoint 2007 (.pptx) files by directly creating an XSLFPowerPointExtractor object, for example: (1) by passing in an XMLSlideShow object, which is created from an InputStream; (2) by passing in an XSLFSlideShow object, which can be created from a file path.
104. Add the text content to a Lucene index or to a database field, so that a user can search the Lucene search engine through the Lucene index, or search the database search engine through the database field.
In the embodiments of the present invention, after the text content corresponding to the file to be processed is obtained in step 103, the text content can be added to a Lucene index or to a database field, and the user searches the Lucene search engine through the Lucene index, or searches the database search engine through the database field. The embodiments use a document-parsing framework to generate the content of the Lucene index file or database field, enabling the sharing and retrieval of file data, reducing the workload of searching for files, and improving work efficiency. In the embodiments, file retrieval is of two types: Lucene search and database search. Lucene search mainly uses its index engine to create a content index. The service that Lucene search provides has two parts: one in, one out. "In" means writing: the generated text content is added to the index as a source string, or removed from the index. "Out" means reading: full-text search is provided to the user, so that the user can locate the source string by keyword. Database search stores the text content in a database field and then queries the data with the database's Structured Query Language (SQL), presenting the files interactively.
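The "in" and "out" halves of an index can be illustrated with a toy inverted index. To be clear, this is not Lucene and does not use the Lucene API; it is a JDK-only stand-in whose class name `ToyTextIndex` and methods are hypothetical. It also assumes a crude tokenizer that splits on non-word characters, which works for ASCII text only and would not segment Chinese.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for a full-text index: term -> paths of files containing the term.
final class ToyTextIndex {

    private final Map<String, Set<String>> postings = new HashMap<>();

    /** "In": adds the extracted text of one file to the index under its path. */
    void add(String filePath, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(filePath);
            }
        }
    }

    /** "In": removes one file from the index. */
    void remove(String filePath) {
        postings.values().forEach(paths -> paths.remove(filePath));
    }

    /** "Out": returns the paths of the files whose text contains the keyword. */
    Set<String> search(String keyword) {
        return postings.getOrDefault(keyword.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

A real deployment would hand the same two operations to Lucene's index writer and searcher, or to an SQL `INSERT` and `SELECT` against the database field, rather than to an in-memory map.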
As the foregoing application scenario illustrates, the embodiments of the present invention use POI, pdfbox, java IO, database retrieval technology, and Lucene to quickly optimize the querying and retrieval of files. They support conversion and extraction of the content of mainstream file formats, and use Java multithreading, asynchronous execution, and thread pools to read files in a loop and then extract content in batches, providing complete, high-performance batch text extraction. In the embodiments, a file reading tool reads out the file content, and the text content is stored in the search engine's index and in database fields, so that files can be searched quickly and file content retrieved.
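The thread-pool batch extraction mentioned above could be organized as in the following sketch. The class `BatchExtractor`, its method `extractAll`, and the choice of a fixed-size pool are assumptions made for illustration; the extractor is passed in as a function from file path to text so that any per-format reader can be plugged in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical batch runner: extracts text from many files on a thread pool.
final class BatchExtractor {

    /** Applies the extractor to every path concurrently; results keep the input order. */
    static List<String> extractAll(List<String> filePaths,
                                   Function<String, String> extractor,
                                   int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String path : filePaths) {
                tasks.add(() -> extractor.apply(path));
            }
            List<String> texts = new ArrayList<>();
            // invokeAll blocks until all tasks finish and returns futures in task order.
            for (Future<String> result : pool.invokeAll(tasks)) {
                texts.add(result.get());
            }
            return texts;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while extracting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Extraction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Creating a pool per call keeps the sketch self-contained; a long-running service would more likely share one pool across scans.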
In some embodiments of the present invention, adding the text content to a Lucene index or to a database field in step 104 may specifically include the following step:
D1. Add the file path, the file attribute information, and the text content of the file to be processed together to the Lucene index or to the database field.
In practice, when the text content is added to the Lucene index or to the database field, the file path and file attribute information of the file to be processed can be added as well, so that more content related to the file can be retrieved in the Lucene search engine and the database search engine, making retrieval more efficient for the user. The file path is the storage path of the file to be processed on the server, and the file attribute information is the attributes of the file to be processed, such as the file format, the file size, the department of the user who owns the file, and the file creation time.
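A record for step D1 might be assembled as in the sketch below. The class `IndexRecordBuilder` and the field names `path`, `format`, `size`, `created`, and `content` are hypothetical; the owning department mentioned above is omitted because it is not available from the file system and would have to come from the application's own user data.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical builder for the fields stored together in step D1.
final class IndexRecordBuilder {

    /** Collects path, format, size, creation time, and text content for one file. */
    static Map<String, String> build(Path file, String format, String textContent) {
        try {
            BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("path", file.toAbsolutePath().toString());
            fields.put("format", format);
            fields.put("size", Long.toString(attrs.size()));        // size in bytes
            fields.put("created", attrs.creationTime().toString()); // ISO 8601 timestamp
            fields.put("content", textContent);
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each map entry would then become one field of a Lucene document or one column of a database row, depending on which of the two search back ends is used.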
As the foregoing embodiments illustrate, in the embodiments of the present invention files are stored in a temporary folder of the server. When pending files remain in the temporary folder, the file format corresponding to each file to be processed is obtained. A corresponding text reader can be selected for each file format; the text reader reads the file to be processed into a file stream, and text content is extracted from the file stream. The embodiments therefore select a specific text reader by determining the file format and automatically extract text content from the file stream. The text content is added to a Lucene index or to a database field, and the user searches the Lucene search engine through the Lucene index, or searches the database search engine through the database field. Unprocessed files are thus not placed directly on the server but are staged through the server's temporary folder, and the text reader extracts the file content automatically, sparing users the trouble of creating keywords themselves. The querying and retrieval of files can therefore be quickly optimized. The solution supports conversion and extraction of file content data and provides complete, high-performance text extraction.
The foregoing embodiments describe the file management method provided by the embodiments of the present invention. The following introduces a file management device provided by the embodiments of the present invention. Referring to Fig. 3, the file management device 300 provided by an embodiment of the present invention may include:
a format acquisition module 301, configured to obtain a pending file from a temporary folder of the server and to obtain the file format corresponding to the pending file;
a reader selection module 302, configured to select a corresponding text reader according to the obtained file format;
a content extraction module 303, configured to use the text reader to read the pending file into a file stream and to extract text content from the file stream;
a content saving module 304, configured to add the text content to a lucene index or to a database field, so that a user can search the lucene search engine through the lucene index, or search the database search engine through the database field.
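The division of labour among modules 301 to 304 can be sketched in Java as follows. This is only an illustrative outline under stated assumptions: the names `TextReader`, `ManagementDevice`, `register`, `process` and `savedTextFor` are hypothetical and do not come from the patent, the file format is assumed to be read from the file-name extension, and a plain in-memory map stands in for the lucene index or database field.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a text reader: turns raw file bytes into text. */
interface TextReader {
    String extractText(byte[] fileBytes);
}

/** Hypothetical sketch of device 300; all names here are assumptions. */
final class ManagementDevice {
    private final Map<String, TextReader> readersByFormat = new HashMap<>();
    // In-memory map standing in for the lucene index or database field.
    private final Map<String, String> savedText = new LinkedHashMap<>();

    void register(String format, TextReader reader) {
        readersByFormat.put(format, reader);
    }

    /** Runs the four module steps on one pending file; returns false if no reader fits. */
    boolean process(String fileName, byte[] fileBytes) {
        String format = formatOf(fileName);               // format acquisition (301)
        TextReader reader = readersByFormat.get(format);  // reader selection (302)
        if (reader == null) {
            return false;
        }
        String text = reader.extractText(fileBytes);      // content extraction (303)
        savedText.put(fileName, text);                    // content saving (304)
        return true;
    }

    String savedTextFor(String fileName) {
        return savedText.get(fileName);
    }

    // Assumption: the format is taken from the file-name extension.
    private static String formatOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

Keeping the readers behind one small interface means a new format can be supported by registering another reader, without touching the processing steps.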
In some embodiments of the present invention, referring to Fig. 3-b, the format acquisition module 301 includes:
a parsing module 3011, configured to parse the attribute information of the pending file;
a format determination module 3012, configured to determine, according to the attribute information, that the file format corresponding to the pending file is one of the following formats: xls, xlsx, doc, docx, pdf, txt, ppt, pptx.
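A minimal Java sketch of the format determination step is given below. The patent does not say which attribute is parsed, so the sketch assumes the file-name extension is the attribute used; the class name `FormatDetector` and the method `detect` are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper: maps a file name to one of the eight listed formats. */
final class FormatDetector {
    // The eight formats listed in the description.
    private static final Set<String> SUPPORTED = Collections.unmodifiableSet(new HashSet<>(
            Arrays.asList("xls", "xlsx", "doc", "docx", "pdf", "txt", "ppt", "pptx")));

    /** Returns the lower-cased extension if it is a supported format, otherwise empty. */
    static Optional<String> detect(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension at all, or a trailing dot
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(extension) ? Optional.of(extension) : Optional.empty();
    }
}
```

Returning an empty `Optional` for anything outside the list lets the caller decide what to do with unsupported files instead of guessing a reader.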
In some embodiments of the present invention, the reader selection module 302 is specifically configured to: when the file format corresponding to the pending file is xls, select the HSSFWorkbook module as the text reader; when the file format corresponding to the pending file is xlsx, select the XSSFWorkbook module as the text reader; when the file format corresponding to the pending file is doc, select the HWPFDocument module as the text reader; when the file format corresponding to the pending file is docx, select the XWPFDocument module as the text reader; when the file format corresponding to the pending file is pdf, select the PDFParser module as the text reader; when the file format corresponding to the pending file is txt, select the FileReader module as the text reader; when the file format corresponding to the pending file is ppt, select the HSLFSlideShow module as the text reader; when the file format corresponding to the pending file is pptx, select the XSLFSlideShow module as the text reader.
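The selection rule above amounts to a lookup table from file format to reader module. A minimal Java sketch follows; to stay free of third-party dependencies it returns the module name as a string rather than instantiating the module, and the class name `ReaderSelector` and the method `select` are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical lookup from file format to the reader module named in the description. */
final class ReaderSelector {
    private static final Map<String, String> READER_BY_FORMAT;

    static {
        Map<String, String> readers = new LinkedHashMap<>();
        readers.put("xls", "HSSFWorkbook");
        readers.put("xlsx", "XSSFWorkbook");
        readers.put("doc", "HWPFDocument");
        readers.put("docx", "XWPFDocument");
        readers.put("pdf", "PDFParser");
        readers.put("txt", "FileReader");
        readers.put("ppt", "HSLFSlideShow");
        readers.put("pptx", "XSLFSlideShow");
        READER_BY_FORMAT = Collections.unmodifiableMap(readers);
    }

    /** Returns the reader module name for a format, or throws if the format is not listed. */
    static String select(String format) {
        String reader = READER_BY_FORMAT.get(format);
        if (reader == null) {
            throw new IllegalArgumentException("Unsupported file format: " + format);
        }
        return reader;
    }
}
```

A table keeps the eight cases in one place, so adding a format is a one-line change rather than another branch.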
In some embodiments of the present invention, further, the content extraction module 303 is specifically configured to: when the file format corresponding to the pending file is xls, use the HSSFWorkbook module to create an xls file stream from the xls file, and use the HSSFExcelExtractor module to extract xls text content from the xls file stream; when the file format corresponding to the pending file is xlsx, use the XSSFWorkbook module to create an xlsx file stream from the xlsx file, and use the XSSFExcelExtractor module to extract xlsx text content from the xlsx file stream; when the file format corresponding to the pending file is doc, use the HWPFDocument module to create a doc file stream from the doc file, and use the HWPFWordExtractor module to extract doc text content from the doc file stream; when the file format corresponding to the pending file is docx, use the XWPFDocument module to create a docx file stream from the docx file, and use the XWPFWordExtractor module to extract docx text content from the docx file stream; when the file format corresponding to the pending file is pdf, use the PDFParser module to create a pdf file stream from the pdf file, and use the PDFTextStripper module to extract pdf text content from the pdf file stream; when the file format corresponding to the pending file is txt, use the FileWriter module to create a txt file stream from the txt file, and use the FileWriter module to extract txt text content from the txt file stream; when the file format corresponding to the pending file is ppt, use the HSLFSlideShow module to create a ppt file stream from the ppt file, and use the HSLF PowerPointExtractor module to extract ppt text content from the ppt file stream; when the file format corresponding to the pending file is pptx, use the XSLFSlideShow module to create a pptx file stream from the pptx file, and use the XSLFPowerPointExtractor module to extract pptx text content from the pptx file stream.
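For the txt case, the read-into-stream and extract steps can be sketched with the Java standard library alone. Note that `java.io.FileWriter` is a write-only class, so this sketch reads through a `java.io.Reader` instead (for example the `FileReader` selected as the txt text reader earlier in the description). The class name `TxtTextExtractor` and the method `extract` are hypothetical, and normalizing line endings to `\n` is a choice made for this sketch, not something the patent specifies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical extractor for plain-text files, using only the standard library. */
final class TxtTextExtractor {
    /** Reads the whole character stream and returns its lines joined with '\n'. */
    static String extract(Reader source) {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the stream once the content has been read
        try (BufferedReader reader = new BufferedReader(source)) {
            boolean first = true;
            String line;
            while ((line = reader.readLine()) != null) {
                if (!first) {
                    text.append('\n');
                }
                text.append(line);
                first = false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

A caller holding a file on disk would pass `new FileReader(path)`, which can throw `FileNotFoundException`; the Office and pdf formats would need the corresponding third-party reader and extractor modules named above.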
In some embodiments of the present invention, the content saving module 304 is configured to add the file path and file attribute information of the pending file, together with the text content, to the lucene index or to the database field at the same time.
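The idea of saving the path, the attributes and the extracted text as one record, and then searching by content, can be sketched as follows. This is not lucene and not a database: a simple in-memory list with case-insensitive substring matching stands in for both, purely for illustration. The names `IndexedFile`, `SimpleFileIndex`, `add` and `search` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical record: everything saved for one processed file. */
final class IndexedFile {
    final String path;
    final Map<String, String> attributes;
    final String text;

    IndexedFile(String path, Map<String, String> attributes, String text) {
        this.path = path;
        this.attributes = attributes;
        this.text = text;
    }
}

/** In-memory stand-in for the lucene index or database field (an assumption). */
final class SimpleFileIndex {
    private final List<IndexedFile> records = new ArrayList<>();

    /** Stores path, attributes and text together, as one record. */
    void add(String path, Map<String, String> attributes, String text) {
        records.add(new IndexedFile(path, new LinkedHashMap<>(attributes), text));
    }

    /** Returns the paths of files whose text contains the keyword, ignoring case. */
    List<String> search(String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (IndexedFile record : records) {
            if (record.text.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(record.path);
            }
        }
        return hits;
    }
}
```

Storing the path alongside the text is what lets a content hit be traced back to the file itself.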
It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the device embodiments provided by the present invention, a connection between modules indicates that they have a communication connection, which may be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, and can of course also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and so on. In general, any function performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structure used to implement the same function can also vary, for example analog circuits, digital circuits, or dedicated circuits. For the present invention, however, a software implementation is in most cases the better embodiment. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disk, or optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
In summary, the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the above embodiments, or make equivalent replacements of some of the technical features therein, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A file management method, characterized by comprising:
obtaining a pending file from a temporary folder of a server, and obtaining the file format corresponding to the pending file;
selecting a corresponding text reader according to the obtained file format;
using the text reader to read the pending file into a file stream, and extracting text content from the file stream;
adding the text content to a lucene index or to a database field, so that a user can search the lucene search engine through the lucene index, or search the database search engine through the database field.
2. The file management method according to claim 1, characterized in that obtaining the file format corresponding to the pending file comprises:
parsing the attribute information of the pending file;
determining, according to the attribute information, that the file format corresponding to the pending file is one of the following formats: xls, xlsx, doc, docx, pdf, txt, ppt, pptx.
3. The file management method according to claim 2, characterized in that selecting a corresponding text reader according to the obtained file format comprises:
when the file format corresponding to the pending file is xls, selecting the HSSFWorkbook module as the text reader;
when the file format corresponding to the pending file is xlsx, selecting the XSSFWorkbook module as the text reader;
when the file format corresponding to the pending file is doc, selecting the HWPFDocument module as the text reader;
when the file format corresponding to the pending file is docx, selecting the XWPFDocument module as the text reader;
when the file format corresponding to the pending file is pdf, selecting the PDFParser module as the text reader;
when the file format corresponding to the pending file is txt, selecting the FileReader module as the text reader;
when the file format corresponding to the pending file is ppt, selecting the HSLFSlideShow module as the text reader;
when the file format corresponding to the pending file is pptx, selecting the XSLFSlideShow module as the text reader.
4. The file management method according to claim 3, characterized in that using the text reader to read the pending file into a file stream, and extracting text content from the file stream, comprises:
when the file format corresponding to the pending file is xls, using the HSSFWorkbook module to create an xls file stream from the xls file, and using the HSSFExcelExtractor module to extract xls text content from the xls file stream;
when the file format corresponding to the pending file is xlsx, using the XSSFWorkbook module to create an xlsx file stream from the xlsx file, and using the XSSFExcelExtractor module to extract xlsx text content from the xlsx file stream;
when the file format corresponding to the pending file is doc, using the HWPFDocument module to create a doc file stream from the doc file, and using the HWPFWordExtractor module to extract doc text content from the doc file stream;
when the file format corresponding to the pending file is docx, using the XWPFDocument module to create a docx file stream from the docx file, and using the XWPFWordExtractor module to extract docx text content from the docx file stream;
when the file format corresponding to the pending file is pdf, using the PDFParser module to create a pdf file stream from the pdf file, and using the PDFTextStripper module to extract pdf text content from the pdf file stream;
when the file format corresponding to the pending file is txt, using the FileWriter module to create a txt file stream from the txt file, and using the FileWriter module to extract txt text content from the txt file stream;
when the file format corresponding to the pending file is ppt, using the HSLFSlideShow module to create a ppt file stream from the ppt file, and using the HSLF PowerPointExtractor module to extract ppt text content from the ppt file stream;
when the file format corresponding to the pending file is pptx, using the XSLFSlideShow module to create a pptx file stream from the pptx file, and using the XSLFPowerPointExtractor module to extract pptx text content from the pptx file stream.
5. The file management method according to claim 1, characterized in that adding the text content to a lucene index or to a database field comprises:
adding the file path and file attribute information of the pending file, together with the text content, to the lucene index or to the database field at the same time.
6. A file management device, characterized by comprising:
a format acquisition module, configured to obtain a pending file from a temporary folder of a server and to obtain the file format corresponding to the pending file;
a reader selection module, configured to select a corresponding text reader according to the obtained file format;
a content extraction module, configured to use the text reader to read the pending file into a file stream and to extract text content from the file stream;
a content saving module, configured to add the text content to a lucene index or to a database field, so that a user can search the lucene search engine through the lucene index, or search the database search engine through the database field.
7. The file management device according to claim 6, characterized in that the format acquisition module comprises:
a parsing module, configured to parse the attribute information of the pending file;
a format determination module, configured to determine, according to the attribute information, that the file format corresponding to the pending file is one of the following formats: xls, xlsx, doc, docx, pdf, txt, ppt, pptx.
8. The file management device according to claim 7, characterized in that the reader selection module is specifically configured to: when the file format corresponding to the pending file is xls, select the HSSFWorkbook module as the text reader; when the file format corresponding to the pending file is xlsx, select the XSSFWorkbook module as the text reader; when the file format corresponding to the pending file is doc, select the HWPFDocument module as the text reader; when the file format corresponding to the pending file is docx, select the XWPFDocument module as the text reader; when the file format corresponding to the pending file is pdf, select the PDFParser module as the text reader; when the file format corresponding to the pending file is txt, select the FileReader module as the text reader; when the file format corresponding to the pending file is ppt, select the HSLFSlideShow module as the text reader; when the file format corresponding to the pending file is pptx, select the XSLFSlideShow module as the text reader.
9. The file management device according to claim 8, characterized in that the content extraction module is specifically configured to: when the file format corresponding to the pending file is xls, use the HSSFWorkbook module to create an xls file stream from the xls file, and use the HSSFExcelExtractor module to extract xls text content from the xls file stream; when the file format corresponding to the pending file is xlsx, use the XSSFWorkbook module to create an xlsx file stream from the xlsx file, and use the XSSFExcelExtractor module to extract xlsx text content from the xlsx file stream; when the file format corresponding to the pending file is doc, use the HWPFDocument module to create a doc file stream from the doc file, and use the HWPFWordExtractor module to extract doc text content from the doc file stream; when the file format corresponding to the pending file is docx, use the XWPFDocument module to create a docx file stream from the docx file, and use the XWPFWordExtractor module to extract docx text content from the docx file stream; when the file format corresponding to the pending file is pdf, use the PDFParser module to create a pdf file stream from the pdf file, and use the PDFTextStripper module to extract pdf text content from the pdf file stream; when the file format corresponding to the pending file is txt, use the FileWriter module to create a txt file stream from the txt file, and use the FileWriter module to extract txt text content from the txt file stream; when the file format corresponding to the pending file is ppt, use the HSLFSlideShow module to create a ppt file stream from the ppt file, and use the HSLF PowerPointExtractor module to extract ppt text content from the ppt file stream; when the file format corresponding to the pending file is pptx, use the XSLFSlideShow module to create a pptx file stream from the pptx file, and use the XSLFPowerPointExtractor module to extract pptx text content from the pptx file stream.
10. The file management device according to claim 6, characterized in that the content saving module is configured to add the file path and file attribute information of the pending file, together with the text content, to the lucene index or to the database field at the same time.
CN201610312975.7A 2016-05-12 2016-05-12 File management method and device Pending CN106021390A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610312975.7A CN106021390A (en) 2016-05-12 2016-05-12 File management method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610312975.7A CN106021390A (en) 2016-05-12 2016-05-12 File management method and device

Publications (1)

Publication Number Publication Date
CN106021390A true CN106021390A (en) 2016-10-12

Family

ID=57100178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610312975.7A Pending CN106021390A (en) 2016-05-12 2016-05-12 File management method and device

Country Status (1)

Country Link
CN (1) CN106021390A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819592A (en) * 2012-08-08 2012-12-12 河海大学 Lucene-based desktop searching system and method
CN104899337A (en) * 2015-07-01 2015-09-09 中国农业银行股份有限公司 File index building method and system
CN105045852A (en) * 2015-07-06 2015-11-11 华东师范大学 Full-text search engine system for teaching resources
CN105574164A (en) * 2015-12-16 2016-05-11 北京华傲达数据技术有限公司 Excel document data analysis method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HARRYHUANG1990: "Extracting Office text (DOC, DOCX, XLS, XLSX, PPT, PPTX) with Apache POI: Desktop Search development notes [lessons learned]", 《HTTP://BLOG.CSDN.NET/HARRYHUANG1990/ARTICLE/DETAILS/11888561》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291949A (en) * 2017-07-17 2017-10-24 小草数语(北京)科技有限公司 Information search method and device
CN107291949B (en) * 2017-07-17 2020-11-13 绿湾网络科技有限公司 Information searching method and device
CN108197117A (en) * 2018-01-31 2018-06-22 厦门大学 A kind of Chinese text keyword extracting method based on document subject matter structure with semanteme
CN108197117B (en) * 2018-01-31 2020-05-26 厦门大学 Chinese text keyword extraction method based on document theme structure and semantics
CN111143849A (en) * 2019-12-31 2020-05-12 奇安信科技集团股份有限公司 File type identification method and device applied to electronic equipment and electronic equipment
CN111881332A (en) * 2020-06-17 2020-11-03 武汉光庭信息技术股份有限公司 Automatic driving simulation data management server and method
CN111915424A (en) * 2020-07-30 2020-11-10 平安证券股份有限公司 Information storage method and related product
CN111915424B (en) * 2020-07-30 2024-06-28 平安证券股份有限公司 Information storage method and related product
CN113268283A (en) * 2021-05-28 2021-08-17 深圳市蓬莱产业科技有限公司 Batch processing method based on file materials
CN113268283B (en) * 2021-05-28 2022-03-22 深圳市蓬莱产业科技有限公司 Batch processing method based on file materials

Similar Documents

Publication Publication Date Title
CN106021390A (en) File management method and device
US8799291B2 (en) Forensic index method and apparatus by distributed processing
US8949241B2 (en) Systems and methods for interactive disambiguation of data
US20150088854A1 (en) Securing application information in system-wide search engines
JP5550669B2 (en) SEARCH DEVICE, SEARCH METHOD, AND PROGRAM
US20160188723A1 (en) Cloud website recommendation method and system based on terminal access statistics, and related device
EP2506208A1 (en) Forensic system and forensic method, and forensic program
Elliott Survey of author name disambiguation: 2004 to 2010
CN106055546A (en) Optical disk library full-text retrieval system based on Lucene
CN115145871A (en) File query method and device and electronic equipment
JP5699743B2 (en) SEARCH METHOD, SEARCH DEVICE, AND COMPUTER PROGRAM
CN110489032B (en) Dictionary query method for electronic book and electronic equipment
CN114297143A (en) File searching method, file displaying device and mobile terminal
CN111045994B (en) File classification retrieval method and system based on KV database
KR20090097971A (en) Method and system for searching patent
JP7293780B2 (en) Information processing device, document management system and program
CN110008407B (en) Information retrieval method and device
CN115794745A (en) File searching method, system, device and storage medium
CN112597106A (en) Document page skipping method and system
Nordling South African law may impede human health research
JP7081155B2 (en) Selection program, selection method, and selection device
JP5746912B2 (en) Method, system and computer readable recording medium for refining a web document using text pattern extraction
Liu et al. An improved full-text retrieval for elementary education resource database system
Ali et al. Analysis of windows OS’s fragmented file carving techniques: A systematic literature review
US20190056913A1 (en) Information density of documents

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161012
