CN110096478A - Document index generation method and equipment - Google Patents
- Publication number
- CN110096478A (application number CN201910383600.3A)
- Authority
- CN
- China
- Prior art keywords
- file
- information
- document
- file destination
- destination
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/11—File system administration, e.g. details of archiving or snapshots
- G06F16/116—Details of conversion of file system types or formats
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the present invention provide a document index generation method and device. The method comprises: receiving files, and the metadata information corresponding to the files, sent by clients of various types, wherein the metadata information is stored in a database and the files are stored on a local disk; querying the database according to the metadata information and obtaining, from the local disk, a target file awaiting automatic processing; parsing the target file to obtain document information; and performing word segmentation on the document information to obtain search terms of the document information, and generating index information for the document information according to a preset retrieval framework. Document indexes are thus generated automatically, which is more efficient, can handle large volumes of document information, and saves labor costs.
Description
Technical field
Embodiments of the present invention relate to the field of computer technology, and in particular to a document index generation method and device.
Background
With the rapid development of the Internet, the amount of document information on the Internet has increased sharply. To find the document information they need among this mass of information, users usually look up relevant document information by means of index information.
At present, a document index is usually built by manually sorting documents from different sources to obtain the index information of each document, which is then uploaded to a database server for users to search.
However, the inventors found that the existing process of manually sorting documents from different sources to build index information is cumbersome when the number of documents is large, consumes a great deal of manpower, and is costly.
Summary of the invention
Embodiments of the present invention provide a document index generation method and device to overcome the problem in the prior art that manually sorting documents from different sources to build index information is cumbersome when the number of documents is large, consumes a great deal of manpower, and is costly.
In a first aspect, an embodiment of the present invention provides a document index generation method, comprising:
receiving files, and the metadata information corresponding to the files, sent by clients of various types, wherein the metadata information is stored in a database and the files are stored on a local disk;
querying the database according to the metadata information, and obtaining from the local disk a target file awaiting automatic processing;
parsing the target file to obtain document information;
performing word segmentation on the document information to obtain search terms of the document information, and generating index information for the document information according to a preset retrieval framework.
In one possible design, before receiving the files and the corresponding metadata information sent by the clients of various types, the method further comprises:
sending file extraction information to the clients of various types, wherein the file extraction information includes, for each type of client, the file formats allowed to be extracted and scan path information;
wherein the allowed file formats and the scan path information instruct each type of client to extract files, and the metadata information corresponding to the files, according to its own allowed file formats and scan path information.
In one possible design, parsing the target file to obtain document information comprises:
obtaining the file format of the target file, and judging whether the target file exists;
if the target file exists and its file format is a compressed file format, decompressing the target file to obtain a decompressed file, and parsing the decompressed file to generate the document information of the target file;
if the target file exists and its file format is an uncompressed file format, parsing the target file to generate the document information of the target file.
In one possible design, judging whether the target file exists comprises:
performing a file query using the file path in the metadata information; if a file is found, determining that the target file exists; if no file is found, determining that the target file does not exist.
In one possible design, parsing the target file to generate the document information of the target file comprises:
parsing the target file with the open-source tool Apache Tika to obtain the document information of the target file.
In one possible design, the preset retrieval framework is the Apache Lucene framework.
In a second aspect, an embodiment of the present invention provides a document index generating device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
receiving files, and the metadata information corresponding to the files, sent by clients of various types, wherein the metadata information is stored in a database and the files are stored on a local disk;
querying the database according to the metadata information, and obtaining from the local disk a target file awaiting automatic processing;
parsing the target file to obtain document information;
performing word segmentation on the document information to obtain search terms of the document information, and generating index information for the document information according to a preset retrieval framework.
The processor also implements the following steps when executing the computer program:
before receiving the files and the corresponding metadata information sent by the clients of various types, sending file extraction information to the clients of various types, wherein the file extraction information includes, for each type of client, the file formats allowed to be extracted and scan path information;
wherein the allowed file formats and the scan path information instruct each type of client to extract files, and the metadata information corresponding to the files, according to its own allowed file formats and scan path information.
In one possible design, parsing the target file to obtain document information comprises:
obtaining the file format of the target file, and judging whether the target file exists;
if the target file exists and its file format is a compressed file format, decompressing the target file to obtain a decompressed file, and parsing the decompressed file to generate the document information of the target file;
if the target file exists and its file format is an uncompressed file format, parsing the target file to generate the document information of the target file.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the document index generation method according to the first aspect or any possible design of the first aspect.
In the document index generation method and device provided by the embodiments of the present invention, the method receives files and the corresponding metadata information sent by clients of various types, obtains a target file according to the metadata information, and parses the target file to obtain document information; it then performs word segmentation on the document information to obtain search terms of the document information, and generates index information for the document information according to a preset retrieval framework. Document indexes are thus generated automatically, which is more efficient, can handle large volumes of document information, and saves labor costs.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the system architecture for document index generation provided by an embodiment of the present invention;
Fig. 2 is a first flowchart of the document index generation method provided by an embodiment of the present invention;
Fig. 3 is a second flowchart of the document index generation method provided by an embodiment of the present invention;
Fig. 4 is a first structural diagram of the document index generating device provided by an embodiment of the present invention;
Fig. 5 is a second structural diagram of the document index generating device provided by an embodiment of the present invention;
Fig. 6 is a hardware structure diagram of the document index generating device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, Fig. 1 is a schematic diagram of the system architecture for document index generation provided by an embodiment of the present invention. As shown in Fig. 1, the system provided by this embodiment includes clients 101 of various types and a server 102. The client 101 may be a mobile phone, a tablet, a PC, or the like. This embodiment places no particular limitation on the implementation of the client 101 and the server 102, as long as the client 101 can interact with the server 102. The server 102 manages the document retrieval service and may be a single server or a cluster of multiple servers.
Referring to Fig. 2, Fig. 2 is a first flowchart of the document index generation method provided by an embodiment of the present invention. The method of this embodiment may be executed by the server of the embodiment shown in Fig. 1; this embodiment places no particular limitation on this. As shown in Fig. 2, the method includes:
S201: Receive files, and the metadata information corresponding to the files, sent by clients of various types, wherein the metadata information is stored in a database and the files are stored on a local disk.
In this embodiment, the clients of various types may be mail clients, Text Services Framework (TSF) clients, chat application clients, and so on.
Specifically, file extraction information may be sent to the clients of various types, wherein the file extraction information includes, for each type of client, the file formats allowed to be extracted and scan path information;
wherein the allowed file formats and the scan path information instruct each type of client to extract files, and the metadata information corresponding to the files, according to its own allowed file formats and scan path information.
For example, a mail client scans according to its scan path information to obtain files in mail format, and a TSF client scans according to its scan path information to obtain files in TSF format.
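The client-side extraction described above can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the `ExtractionInfo` structure, its field names, and the `scan_for_files` helper are hypothetical names introduced here, and file formats are approximated by file extensions.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ExtractionInfo:
    """Hypothetical file extraction information sent by the server to one client type."""
    client_type: str            # e.g. "mail" (illustrative value)
    allowed_formats: frozenset  # lowercase extensions allowed to be extracted, e.g. {".eml"}
    scan_path: str              # directory the client is told to scan


def scan_for_files(info: ExtractionInfo) -> list:
    """Return the files under the scan path whose extension is allowed for extraction."""
    root = Path(info.scan_path)
    if not root.is_dir():
        return []  # nothing to extract if the scan path is missing
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in info.allowed_formats
    )
```

A client would then send each returned file, together with its metadata record, to the server.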
The metadata information corresponding to a file may include the details shown in Table 1.
Table 1. Details of the metadata information corresponding to a file
Field | Description |
File_id | Auto-increment field, no specific meaning |
File_tile | Title of the file |
File_path | Storage location of the file after extraction |
Client_type | Specific type of the custom client |
In_time | Extraction time of the file |
author | Creator of the extracted file |
Status | Processing status of the extracted file |
Parent_md5 | MD5 code of the associated parent file, if any |
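As a rough illustration of Table 1, one metadata record could be modeled as below. The Python attribute names, the types, and the sample values are assumptions made for this sketch; only the field meanings come from the table.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FileMetadata:
    """One metadata record; each comment names the Table 1 field the attribute maps to."""
    file_id: int                      # File_id: auto-increment key
    file_title: str                   # File_tile: title of the file
    file_path: str                    # File_path: storage location after extraction
    client_type: str                  # Client_type: type of the client
    in_time: datetime                 # In_time: extraction time
    author: str                       # author: creator of the extracted file
    status: str                       # Status: processing status code
    parent_md5: Optional[str] = None  # Parent_md5: MD5 of the associated parent file, if any


# Hypothetical sample record; every value here is made up for illustration.
record = FileMetadata(
    file_id=1,
    file_title="meeting-notes.docx",
    file_path="/data/extracted/meeting-notes.docx",
    client_type="mail",
    in_time=datetime(2019, 5, 9, 10, 30),
    author="alice",
    status="AUTO_ANALYSE",
)
```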
S202: Query the database according to the metadata information, and obtain from the local disk the target file awaiting automatic processing.
In this embodiment, the metadata information includes the processing status of the extracted file (see Table 1). The processing status may be marked according to the file's format at extraction time and the outcome of processing. For example, a file that cannot be parsed is marked "awaiting manual processing"; a file that can be parsed is marked "awaiting automatic processing"; a file whose manual parsing is complete is marked "manual processing complete"; a file whose automatic parsing is complete is marked "automatic processing complete"; a file found not to exist after scanning is marked "file does not exist error"; a file found during scanning to already exist is marked "file already exists error"; a scanned file whose associated parent file does not exist is marked "associated parent file does not exist error"; a file whose parsed document information cannot be written is marked "failed to write information after document parsing"; and a file found to have no content after scanning is marked "file content does not exist". The correspondence between the state codes and the processing statuses of the various types of extracted files is shown in Table 2.
Table 2. Correspondence between state codes and processing statuses of extracted files
State code | Description |
MANUAL_ANALYSE | Awaiting manual processing |
AUTO_ANALYSE | Awaiting automatic processing |
MANUAL_ANALYSED | Manual processing complete |
AUTO_ANALYSED | Automatic processing complete |
ERROR_NO_PATH | File does not exist error |
ERROR_MD5_EXIST | File already exists error |
ERROR_PARENT_NOT_EXIST | Associated parent file does not exist error |
ERROR_DB_INSERT_ERROR | Failed to write information after document parsing |
WARN_NO_CONTENT | File content does not exist |
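Step S202 can be sketched with the state codes of Table 2 and a small query. The patent does not name a database product or a table layout, so the use of SQLite, the table name `file_metadata`, and the helper `pending_auto_files` are assumptions made for illustration.

```python
import sqlite3
from enum import Enum


class Status(str, Enum):
    """State codes listed in Table 2."""
    MANUAL_ANALYSE = "MANUAL_ANALYSE"
    AUTO_ANALYSE = "AUTO_ANALYSE"
    MANUAL_ANALYSED = "MANUAL_ANALYSED"
    AUTO_ANALYSED = "AUTO_ANALYSED"
    ERROR_NO_PATH = "ERROR_NO_PATH"
    ERROR_MD5_EXIST = "ERROR_MD5_EXIST"
    ERROR_PARENT_NOT_EXIST = "ERROR_PARENT_NOT_EXIST"
    ERROR_DB_INSERT_ERROR = "ERROR_DB_INSERT_ERROR"
    WARN_NO_CONTENT = "WARN_NO_CONTENT"


def pending_auto_files(conn: sqlite3.Connection) -> list:
    """Return (File_id, File_path) rows for files awaiting automatic processing.

    Assumes a hypothetical table `file_metadata` with the Table 1 columns.
    """
    cursor = conn.execute(
        "SELECT File_id, File_path FROM file_metadata "
        "WHERE Status = ? ORDER BY File_id",
        (Status.AUTO_ANALYSE.value,),
    )
    return cursor.fetchall()
```

Each returned path would then be used to read the target file from the local disk.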
S203: Parse the target file to obtain document information.
In this embodiment, if the target file is a compressed file, the compressed file is first decompressed to obtain a decompressed file, and the decompressed file is parsed to obtain the document information; if the target file is not a compressed file, the target file is parsed directly to generate its document information.
The compressed file may be a file in rar format or a file in zip format.
S204: Perform word segmentation on the document information to obtain search terms of the document information, and generate index information for the document information according to a preset retrieval framework.
In this embodiment, the document information may be segmented by the Chinese tokenizer Smart Chinese Analyzer to obtain the search terms of the document information.
The preset retrieval framework may be the Apache Lucene framework.
Specifically, the index information of the document information may be generated from the search terms of the document information using the Apache Lucene framework. The search scope of the Apache Lucene framework may include the document title, the document path, the document content, the document's metadata information, and so on.
In this embodiment, retrieval can be performed through a user interface, which may offer category search, highlighting, viewing of file details, file preview, file association information, and so on.
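To make the idea behind S204 concrete, the sketch below builds a tiny in-memory inverted index. It is not Apache Lucene and does not use Smart Chinese Analyzer: the `tokenize` function is a naive stand-in (it would treat a run of Chinese characters as a single token, which is why a real Chinese tokenizer is needed), and the `InvertedIndex` class and its methods are hypothetical names for illustration.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Naive stand-in for a real analyzer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


class InvertedIndex:
    """Maps each search term to the set of document ids that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, fields: dict) -> None:
        """Index every field (title, path, content, ...) of one document."""
        for value in fields.values():
            for term in tokenize(value):
                self._postings[term].add(doc_id)

    def search(self, query: str) -> set:
        """Return ids of documents containing all terms of the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result
```

A production system would delegate tokenization, storage, and scoring to the retrieval framework rather than keep postings in memory like this.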
As can be seen from the above description, the method receives files and the corresponding metadata information sent by clients of various types, obtains a target file according to the metadata information, and parses the target file to obtain document information; it then performs word segmentation on the document information to obtain search terms of the document information, and generates index information for the document information according to a preset retrieval framework. Document indexes are thus generated automatically, which is more efficient, can handle large volumes of document information, and saves labor costs.
Referring to Fig. 3, Fig. 3 is a second flowchart of the document index generation method provided by an embodiment of the present invention. On the basis of the embodiment corresponding to Fig. 2, this embodiment describes in detail the process of step S203, parsing the target file to obtain document information, as follows:
S301: Obtain the file format of the target file, and judge whether the target file exists.
In this embodiment, file formats are divided into compressed file formats and uncompressed file formats.
Specifically, judging whether the target file exists includes: performing a file query using the file path in the metadata information; if a file is found, determining that the target file exists; if no file is found, determining that the target file does not exist.
S302: If the target file exists and its file format is a compressed file format, decompress the target file to obtain a decompressed file, and parse the decompressed file to generate the document information of the target file.
S303: If the target file exists and its file format is an uncompressed file format, parse the target file to generate the document information of the target file.
In this embodiment, the target file may be parsed with the open-source tool Apache Tika to obtain the document information of the target file. Likewise, the decompressed file may be parsed with Apache Tika to generate the document information of the target file.
When the target file is parsed with Apache Tika, the parsed document information may include the content shown in Table 3.
Table 3. Content of the document information after parsing
As can be seen from the above description, target files in compressed formats are decompressed automatically, so no manual decompression is needed, which improves document parsing efficiency.
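The flow of S301 to S303 can be sketched as below. This is a simplified illustration under several assumptions: only zip archives are handled (Python's standard library has no rar support), the `extract_text` placeholder simply reads UTF-8 text where a real system would call a document parser such as Apache Tika, and the function names and the returned status strings (borrowed from the Table 2 codes) are choices made for this sketch.

```python
import zipfile
from pathlib import Path


def extract_text(path) -> str:
    """Placeholder parser: read the file as UTF-8 text (a real parser would go here)."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


def parse_target_file(file_path, work_dir):
    """Check existence, decompress if needed, then parse.

    Returns (status, texts), where status is one of the Table 2 state codes.
    """
    path = Path(file_path)
    # S301: look the file up by the path stored in its metadata.
    if not path.is_file():
        return "ERROR_NO_PATH", []
    # S302: compressed file -> decompress first, then parse each extracted file.
    if zipfile.is_zipfile(path):
        out_dir = Path(work_dir) / path.stem
        with zipfile.ZipFile(path) as archive:
            archive.extractall(out_dir)
        members = sorted(p for p in out_dir.rglob("*") if p.is_file())
        return "AUTO_ANALYSED", [extract_text(p) for p in members]
    # S303: uncompressed file -> parse it directly.
    return "AUTO_ANALYSED", [extract_text(path)]
```

The caller would write the returned status back to the metadata record so that later queries skip files that are already processed.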
Fig. 4 is a first structural diagram of the document index generating device provided by an embodiment of the present invention. As shown in Fig. 4, the document index generating device 40 includes: a receiving module 401, an obtaining module 402, a parsing module 403, and an index generation module 404.
The receiving module 401 is configured to receive files, and the metadata information corresponding to the files, sent by clients of various types, wherein the metadata information is stored in a database and the files are stored on a local disk;
the obtaining module 402 is configured to query the database according to the metadata information and obtain from the local disk a target file awaiting automatic processing;
the parsing module 403 is configured to parse the target file to obtain document information;
the index generation module 404 is configured to perform word segmentation on the document information to obtain search terms of the document information, and to generate index information for the document information according to a preset retrieval framework.
The device provided by this embodiment can be used to execute the technical solutions of the above method embodiments. Its implementation principle and technical effects are similar and are not repeated here.
Referring to Fig. 5, Fig. 5 is a second structural diagram of the document index generating device provided by an embodiment of the present invention. As shown in Fig. 5, on the basis of the embodiment of Fig. 4, this embodiment further includes a sending module 405.
The sending module 405 is configured to send file extraction information to the clients of various types before the files and the corresponding metadata information sent by the clients are received, wherein the file extraction information includes, for each type of client, the file formats allowed to be extracted and scan path information; the allowed file formats and the scan path information instruct each type of client to extract files, and the metadata information corresponding to the files, according to its own allowed file formats and scan path information.
In one embodiment of the present invention, the parsing module 403 is specifically configured to obtain the file format of the target file and judge whether the target file exists;
if the target file exists and its file format is a compressed file format, decompress the target file to obtain a decompressed file, and parse the decompressed file to generate the document information of the target file;
if the target file exists and its file format is an uncompressed file format, parse the target file to generate the document information of the target file.
In one embodiment of the present invention, the parsing module 403 is further configured to judge whether the target file exists, including:
performing a file query using the file path in the metadata information; if a file is found, determining that the target file exists; if no file is found, determining that the target file does not exist.
In one embodiment of the present invention, the index generation module 404 is specifically configured to parse the target file with the open-source tool Apache Tika to obtain the document information of the target file.
In one embodiment of the present invention, the preset retrieval framework is the Apache Lucene framework.
The device provided by this embodiment can be used to execute the technical solutions of the above method embodiments. Its implementation principle and technical effects are similar and are not repeated here.
Fig. 6 is a hardware structure diagram of the document index generating device provided by an embodiment of the present invention. As shown in Fig. 6, the document index generating device 60 of this embodiment includes a processor 601 and a memory 602, wherein:
the memory 602 is configured to store computer-executable instructions;
the processor 601 is configured to execute the computer-executable instructions stored in the memory to implement the steps performed by the server in the above embodiments. For details, refer to the related description in the foregoing method embodiments.
Optionally, the memory 602 may be independent of the processor 601 or integrated with it.
When the memory 602 is provided independently, the document index generating device further includes a bus 603 for connecting the memory 602 and the processor 601.
An embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the document index generation method described above.
In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into modules is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or modules, and may be electrical, mechanical, or of other forms.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist physically on its own, or two or more modules may be integrated into one unit. The unit formed by integrating the modules may be implemented in hardware, or in hardware plus software functional units.
An integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods described in the embodiments of the present application.
It should be understood that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the invention may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory may include high-speed RAM and may also include non-volatile memory (NVM), for example at least one disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disc, or the like.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. Buses can be divided into address buses, data buses, control buses, and so on. For ease of representation, the bus in the drawings of the present application is not limited to a single bus or a single type of bus.
Above-mentioned storage medium can be by any kind of volatibility or non-volatile memory device or their combination
It realizes, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable
Read-only memory (EPROM), programmable read only memory (PROM), read-only memory (ROM), magnetic memory, flash memory,
Disk or CD.Storage medium can be any usable medium that general or specialized computer can access.
A kind of illustrative storage medium is coupled to processor, believes to enable a processor to read from the storage medium
Breath, and information can be written to the storage medium.Certainly, storage medium is also possible to the component part of processor.It processor and deposits
Storage media can be located at specific integrated circuit (Application Specific Integrated Circuits, abbreviation ASIC)
In.Certainly, pocessor and storage media can also be used as discrete assembly and be present in electronic equipment or main control device.
Those of ordinary skill in the art will appreciate that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments; the storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical discs.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements for some or all of their technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A document index generation method, characterized by comprising:
receiving files sent by clients of each type and metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk;
querying the database according to the metadata information, and obtaining from the local disk a target file awaiting automatic processing;
parsing the target file to obtain document information;
performing word segmentation on the document information to obtain the search terms of the document information, and generating index information of the document information according to a preset retrieval framework.
2. The method according to claim 1, characterized in that, before the receiving of the files sent by clients of each type and the metadata information corresponding to the files, the method further comprises:
sending file extraction information to the clients of each type, wherein the file extraction information includes the file formats permitted for extraction and the scan path information corresponding to each type of client;
wherein the file formats permitted for extraction and the scan path information are used to instruct each type of client to extract files, and the metadata information corresponding to the files, according to its corresponding permitted file formats and scan path information.
3. The method according to claim 1, characterized in that the parsing of the target file to obtain document information comprises:
obtaining the file format of the target file, and determining whether the target file exists;
if the target file exists and its file format is a compressed file format, decompressing the target file to obtain a decompressed file, and parsing the decompressed file to generate the document information of the target file;
if the target file exists and its file format is an uncompressed file format, parsing the target file to generate the document information of the target file.
4. The method according to claim 3, characterized in that determining whether the target file exists comprises:
performing a file query using the file path in the metadata information; if a file is found, determining that the target file exists, and if no file is found, determining that the target file does not exist.
5. The method according to claim 3, characterized in that the parsing of the target file to generate the document information of the target file comprises:
parsing the target file with the open-source tool Apache Tika to obtain the document information of the target file.
6. The method according to any one of claims 1 to 5, characterized in that the preset retrieval framework is the Apache Lucene framework.
7. A document index generating device, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
receiving files sent by clients of each type and metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk;
querying the database according to the metadata information, and obtaining from the local disk a target file awaiting automatic processing;
parsing the target file to obtain document information;
performing word segmentation on the document information to obtain the search terms of the document information, and generating index information of the document information according to a preset retrieval framework.
8. The device according to claim 7, characterized in that the processor, when executing the computer program, further implements the following steps:
before the receiving of the files sent by clients of each type and the metadata information corresponding to the files, sending file extraction information to the clients of each type, wherein the file extraction information includes the file formats permitted for extraction and the scan path information corresponding to each type of client;
wherein the file formats permitted for extraction and the scan path information are used to instruct each type of client to extract files, and the metadata information corresponding to the files, according to its corresponding permitted file formats and scan path information.
9. The device according to claim 7, characterized in that the parsing of the target file to obtain document information comprises:
obtaining the file format of the target file, and determining whether the target file exists;
if the target file exists and its file format is a compressed file format, decompressing the target file to obtain a decompressed file, and parsing the decompressed file to generate the document information of the target file;
if the target file exists and its file format is an uncompressed file format, parsing the target file to generate the document information of the target file.
10. A computer-readable storage medium, characterized in that computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the document index generation method according to any one of claims 1 to 6 is implemented.
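By way of illustration only, the flow recited in claims 1, 3 and 4 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the claims name Apache Tika for parsing and Apache Lucene as the retrieval framework, whereas the sketch substitutes plain UTF-8 decoding and an in-memory inverted index from the Python standard library. The names `parse_target_file`, `tokenize`, `build_index`, the `file_id` and `path` metadata fields, and the choice of gzip as the only compressed format are all hypothetical.

```python
import gzip
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Optional, Set

# Assumption: only gzip is treated as a "compressed file format" here.
COMPRESSED_SUFFIXES = {".gz"}


def parse_target_file(path: Path) -> Optional[str]:
    """Return the text of the target file, or None if it does not exist."""
    # Claim 4: query by the file path taken from the metadata information.
    if not path.is_file():
        return None
    raw = path.read_bytes()
    # Claim 3: decompress first when the format is a compressed format.
    if path.suffix.lower() in COMPRESSED_SUFFIXES:
        raw = gzip.decompress(raw)
    # Stand-in for a real parser such as Apache Tika: plain UTF-8 decoding.
    return raw.decode("utf-8", errors="replace")


def tokenize(text: str) -> List[str]:
    """Crude word segmentation: lowercase runs of word characters.

    A real system would use a language-aware analyzer, in particular for
    Chinese text, where this regex would not split words.
    """
    return re.findall(r"\w+", text.lower())


def build_index(metadata_rows: Iterable[dict]) -> Dict[str, Set[int]]:
    """Map each search term to the set of file ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for row in metadata_rows:
        text = parse_target_file(Path(row["path"]))
        if text is None:
            continue  # target file does not exist; nothing to index
        for term in tokenize(text):
            index[term].add(row["file_id"])
    return dict(index)
```

In this sketch, `metadata_rows` stands in for the rows returned by the database query of claim 1; a file whose path cannot be found is simply skipped rather than indexed.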
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910383600.3A CN110096478B (en) | 2019-05-09 | 2019-05-09 | Document index generation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110096478A (en) | 2019-08-06 |
CN110096478B (en) | 2021-06-29 |
Family
ID=67447334
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910383600.3A Active CN110096478B (en) | 2019-05-09 | 2019-05-09 | Document index generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110096478B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112035409A (en) * | 2020-11-03 | 2020-12-04 | 杭州蚁首网络科技有限公司 | Entity file management method, system and computer storage medium |
CN113312441A (en) * | 2021-06-10 | 2021-08-27 | 中寰卫星导航通信有限公司 | Map operation method and device |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103853832A (en) * | 2014-03-11 | 2014-06-11 | 上海爱数软件有限公司 | Customizable data capturing method in full-text retrieval system |
US20140181056A1 (en) * | 2011-08-30 | 2014-06-26 | Patrick Thomas Sidney Pidduck | System and method of quality assessment of a search index |
CN104376067A (en) * | 2014-11-13 | 2015-02-25 | 北京海泰方圆科技有限公司 | Index file inputting method and retrieval method based on index file |
US20150074080A1 (en) * | 2011-08-30 | 2015-03-12 | Open Text SA | System and method of managing capacity of search index partitions |
CN104715068A (en) * | 2015-03-31 | 2015-06-17 | 北京奇虎科技有限公司 | Method and device for generating document indexes and searching method and device |
CN105205104A (en) * | 2015-08-26 | 2015-12-30 | 成都布林特信息技术有限公司 | Cloud platform data acquisition method |
CN105808615A (en) * | 2014-12-31 | 2016-07-27 | 北京奇虎科技有限公司 | Document index generation method and device based on word segment weights |
CN106599041A (en) * | 2016-11-07 | 2017-04-26 | 中国电子科技集团公司第三十二研究所 | Text processing and retrieval system based on big data platform |
CN106649426A (en) * | 2016-08-05 | 2017-05-10 | 浪潮软件股份有限公司 | Data analysis method, data analysis platform and server |
CN106776746A (en) * | 2016-11-14 | 2017-05-31 | 天津南大通用数据技术股份有限公司 | A kind of creation method and device of full-text index data |
CN107016047A (en) * | 2017-02-20 | 2017-08-04 | 阿里巴巴集团控股有限公司 | Document query, document storing method and device |
CN107038225A (en) * | 2017-03-31 | 2017-08-11 | 江苏飞搏软件股份有限公司 | The search method of information intelligent retrieval system |
CN108228743A (en) * | 2017-12-18 | 2018-06-29 | 深圳供电局有限公司 | A kind of real-time big data search engine system |
CN108241713A (en) * | 2016-12-27 | 2018-07-03 | 南京烽火软件科技有限公司 | A kind of inverted index search method based on polynary cutting |
CN108874956A (en) * | 2018-06-05 | 2018-11-23 | 中国平安人寿保险股份有限公司 | Mass file search method, device, computer equipment and storage medium |
CN109254967A (en) * | 2018-08-29 | 2019-01-22 | 河南智慧云大数据有限公司 | A kind of depth analysis method and device based on multi-source heterogeneous mass data |
Non-Patent Citations (2)
Title |
---|
N. Ragavan: "Efficient key hash indexing scheme with page rank for category based search engine big data", 2017 IEEE International Conference on Intelligent Techniques in Control, Optimization and Signal Processing (INCOS) * |
Xu Xuping et al.: "Research on metadata management based on MongoDB" (基于MongoDB的元数据管理研究), Information Technology (《信息技术》) * |
Also Published As
Publication number | Publication date |
---|---|
CN110096478B (en) | 2021-06-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||