CN110096478A - Document index generation method and equipment - Google Patents

Info

Publication number
CN110096478A
Authority
CN
China
Prior art keywords
file
information
document
target file
target
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910383600.3A
Other languages
Chinese (zh)
Other versions
CN110096478B (en)
Inventor
徐凯
丛新法
侯青军
杨通军
杨哲
高翔
张健钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Application filed by China United Network Communications Group Co Ltd
Priority to CN201910383600.3A
Publication of CN110096478A
Application granted
Publication of CN110096478B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/116 Details of conversion of file system types or formats
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a document index generation method and equipment. The method comprises: receiving files sent by clients of various types, together with the metadata information corresponding to each file, wherein the metadata information is stored in a database and the files are stored on a local disk; querying the database according to the metadata information, and obtaining from the local disk a target file waiting for automatic processing; parsing the target file to obtain document information; and performing word segmentation on the document information to obtain the search terms of the document information, and generating index information for the document information according to a preset retrieval framework. Document indexes can thus be generated automatically, which is more efficient, can handle large volumes of document information, and saves labor costs.

Description

Document index generation method and equipment
Technical field
Embodiments of the present invention relate to the field of computer technology, and in particular to a document index generation method and equipment.
Background
With the rapid development of the Internet, the amount of document information on the Internet has increased sharply. To find the document information they need within this mass of information, users usually look up relevant document information by means of index information.
At present, document indexes are usually established by manually sorting documents from different sources to obtain index information for the different documents, which is then uploaded to a database server for users to search.
However, the inventors found that the existing process of establishing index information by manually sorting documents from different sources is cumbersome when the number of documents is large, and requires a great deal of labor at high cost.
Summary of the invention
Embodiments of the present invention provide a document index generation method and equipment, to overcome the problem in the prior art that establishing index information by manually sorting documents from different sources is cumbersome when the number of documents is large, requires a great deal of labor, and is costly.
In a first aspect, an embodiment of the present invention provides a document index generation method, comprising:
Receiving files sent by clients of various types and the metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk;
Querying the database according to the metadata information, and obtaining, from the local disk, a target file waiting for automatic processing;
Parsing the target file to obtain document information;
Performing word segmentation on the document information to obtain the search terms of the document information, and generating index information of the document information according to a preset retrieval framework.
In one possible design, before receiving the files sent by the clients of various types and the metadata information corresponding to the files, the method further comprises:
Sending file extraction information to the clients of various types, wherein the file extraction information includes, for each client type, the file formats allowed to be extracted and the scan path information;
Wherein the file formats allowed to be extracted and the scan path information instruct each type of client to extract files, and the metadata information corresponding to the files, according to its corresponding allowed file formats and scan path information.
In one possible design, parsing the target file to obtain document information comprises:
Obtaining the file format of the target file, and judging whether the target file exists;
If the target file exists and its file format is a compressed file format, decompressing the target file to obtain a decompressed file, and parsing the decompressed file to generate the document information of the target file;
If the target file exists and its file format is an uncompressed file format, parsing the target file to generate the document information of the target file.
In one possible design, judging whether the target file exists comprises:
Performing a file query using the file path in the metadata information; if the file is found, determining that the target file exists; if the file cannot be found, determining that the target file does not exist.
In one possible design, parsing the target file to generate the document information of the target file comprises:
Parsing the target file with the open-source tool Apache Tika to obtain the document information of the target file.
In one possible design, the preset retrieval framework is the Apache Lucene framework.
In a second aspect, an embodiment of the present invention provides document index generation equipment, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
Receiving files sent by clients of various types and the metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk;
Querying the database according to the metadata information, and obtaining, from the local disk, a target file waiting for automatic processing;
Parsing the target file to obtain document information;
Performing word segmentation on the document information to obtain the search terms of the document information, and generating index information of the document information according to a preset retrieval framework.
The processor further implements the following steps when executing the computer program:
Before receiving the files sent by the clients of various types and the metadata information corresponding to the files, sending file extraction information to the clients of various types, wherein the file extraction information includes, for each client type, the file formats allowed to be extracted and the scan path information;
Wherein the file formats allowed to be extracted and the scan path information instruct each type of client to extract files, and the metadata information corresponding to the files, according to its corresponding allowed file formats and scan path information.
In one possible design, parsing the target file to obtain document information comprises:
Obtaining the file format of the target file, and judging whether the target file exists;
If the target file exists and its file format is a compressed file format, decompressing the target file to obtain a decompressed file, and parsing the decompressed file to generate the document information of the target file;
If the target file exists and its file format is an uncompressed file format, parsing the target file to generate the document information of the target file.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the document index generation method according to the first aspect or any possible design of the first aspect.
In the document index generation method and equipment provided by the embodiments of the present invention, the method receives files sent by clients of various types and the metadata information corresponding to the files, obtains a target file according to the metadata information, and parses the target file to obtain document information; it then performs word segmentation on the document information to obtain the search terms of the document information, and generates index information of the document information according to a preset retrieval framework. Document indexes can thus be generated automatically, which is more efficient, can handle large volumes of document information, and saves labor costs.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings in the following description show some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the system architecture for document index generation provided by an embodiment of the present invention;
Fig. 2 is a first flowchart of the document index generation method provided by an embodiment of the present invention;
Fig. 3 is a second flowchart of the document index generation method provided by an embodiment of the present invention;
Fig. 4 is a first structural schematic diagram of the document index generation equipment provided by an embodiment of the present invention;
Fig. 5 is a second structural schematic diagram of the document index generation equipment provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of the hardware structure of the document index generation equipment provided by an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Evidently, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, Fig. 1 is a schematic diagram of the system architecture for document index generation provided by an embodiment of the present invention. As shown in Fig. 1, the system provided by this embodiment includes clients 101 of various types and a server 102. The client 101 may be a mobile phone, a tablet, a PC, or the like. This embodiment does not particularly limit the implementation of the client 101 and the server 102, as long as the client 101 can interact with the server 102. The server 102 manages the document retrieval service; it may be a single server or a cluster composed of multiple servers.
Referring to Fig. 2, Fig. 2 is a first flowchart of the document index generation method provided by an embodiment of the present invention. The method of this embodiment may be executed by the server of the embodiment shown in Fig. 1, which is not particularly limited here. As shown in Fig. 2, the method includes:
S201: Receive files sent by clients of various types and the metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk.
In this embodiment, the clients of various types may be a mail client, a Text Services Framework (TSF) client, a chat application client, and so on.
Specifically, file extraction information may be sent to the clients of various types, wherein the file extraction information includes, for each client type, the file formats allowed to be extracted and the scan path information;
Wherein the file formats allowed to be extracted and the scan path information instruct each type of client to extract files, and the metadata information corresponding to the files, according to its corresponding allowed file formats and scan path information.
For example, a mail client scans according to its corresponding scan path information to obtain files in mail format, and a TSF client scans according to its corresponding scan path information to obtain files in TSF format.
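The client-side extraction described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `ExtractionInfo` class, the `scan_for_files` function, the `.eml` extension used for mail files, and the shape of the returned metadata dictionaries are all assumptions made for the example.

```python
import os
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class ExtractionInfo:
    """Hypothetical shape of the file extraction information for one client type."""
    client_type: str        # e.g. "mail" or "tsf" (assumed labels)
    allowed_formats: tuple  # lowercase file extensions allowed to be extracted
    scan_path: str          # directory the client is told to scan


def scan_for_files(info: ExtractionInfo) -> list:
    """Walk scan_path and build one metadata dict per file with an allowed format."""
    results = []
    for root, _dirs, names in os.walk(info.scan_path):
        for name in sorted(names):
            path = Path(root) / name
            if path.suffix.lower() in info.allowed_formats:
                results.append({
                    "file_title": path.name,
                    "file_path": str(path),
                    "client_type": info.client_type,
                    # Extraction time recorded as an ISO 8601 UTC timestamp.
                    "in_time": datetime.now(timezone.utc).isoformat(),
                })
    return results
```

A mail client would then call something like `scan_for_files(ExtractionInfo("mail", (".eml",), "/path/to/mailbox"))` and send each matching file together with its metadata to the server.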
The metadata information corresponding to a file may include the details shown in Table 1.
Table 1. Details of the metadata information corresponding to a file
Field          Description
File_id        Auto-increment field; no specific meaning
File_tile      Title of the file
File_path      Storage location of the file after extraction
Client_type    Specific type of the custom client
In_time        Extraction time of the file
author         Creator of the extracted file
Status         Processing status of the extracted file
Parent_md5     MD5 code of the associated parent file
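One way to hold the Table 1 fields is a single database table. The sketch below uses SQLite purely for illustration; the source does not name a database product, and the table name `file_metadata`, the lowercase column names (with the title column spelled `file_title`), the column types, and the helper functions are assumptions.

```python
import sqlite3

# Columns mirror the fields of Table 1; names and types are illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_metadata (
    file_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    file_title  TEXT NOT NULL,
    file_path   TEXT NOT NULL,
    client_type TEXT NOT NULL,
    in_time     TEXT NOT NULL,
    author      TEXT,
    status      TEXT NOT NULL,
    parent_md5  TEXT
)
"""


def open_metadata_db(path=":memory:"):
    """Open (or create) the metadata database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(SCHEMA)
    return conn


def insert_metadata(conn, *, file_title, file_path, client_type, in_time,
                    status, author=None, parent_md5=None):
    """Store one metadata record and return its auto-incremented file_id."""
    cur = conn.execute(
        "INSERT INTO file_metadata "
        "(file_title, file_path, client_type, in_time, author, status, parent_md5) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (file_title, file_path, client_type, in_time, author, status, parent_md5),
    )
    conn.commit()
    return cur.lastrowid
```

Keeping only the path in the database, with the file itself on the local disk, matches the split between metadata storage and file storage described in S201.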
S202: Query the database according to the metadata information, and obtain, from the local disk, the target file waiting for automatic processing.
In this embodiment, the metadata information includes the processing status of the extracted file (see Table 1). The processing status may be marked according to the file's format at extraction time. For example, a file that cannot be parsed is marked "waiting for manual processing"; a file that can be parsed is marked "waiting for automatic processing"; a file whose manual parsing has been completed is marked "manual processing completed"; a file whose automatic parsing has been completed is marked "automatic processing completed"; a file that does not exist after scanning is marked "file does not exist error"; a file for which an error occurs during scanning is marked "file already exists error"; a scanned file whose associated parent file does not exist is marked "associated parent file does not exist error"; a file whose parsed information cannot be written after scanning and parsing is marked "failed to write information after document parsing"; and a file whose content does not exist after scanning is marked "file content does not exist". The correspondence between the status codes and the processing statuses of extracted files of the various types is shown in Table 2.
Table 2. Correspondence between the status codes and processing statuses of extracted files of the various types
Status code               Description
MANUAL_ANALYSE            Waiting for manual processing
AUTO_ANALYSE              Waiting for automatic processing
MANUAL_ANALYSED           Manual processing completed
AUTO_ANALYSED             Automatic processing completed
ERROR_NO_PATH             File does not exist error
ERROR_MD5_EXIST           File already exists error
ERROR_PARENT_NOT_EXIST    Associated parent file does not exist error
ERROR_DB_INSERT_ERROR     Failed to write information after document parsing
WARN_NO_CONTENT           File content does not exist
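Step S202 amounts to selecting the records whose status is AUTO_ANALYSE and updating the status once processing finishes. The following sketch shows that selection under stated assumptions: the status codes come from Table 2, while the SQLite table layout and the helper names (`make_demo_db`, `add_file`, `fetch_pending`, `set_status`) are hypothetical and reduced to the two columns needed here.

```python
import sqlite3
from enum import Enum


class Status(Enum):
    """Status codes as listed in Table 2."""
    MANUAL_ANALYSE = "MANUAL_ANALYSE"
    AUTO_ANALYSE = "AUTO_ANALYSE"
    MANUAL_ANALYSED = "MANUAL_ANALYSED"
    AUTO_ANALYSED = "AUTO_ANALYSED"
    ERROR_NO_PATH = "ERROR_NO_PATH"
    ERROR_MD5_EXIST = "ERROR_MD5_EXIST"
    ERROR_PARENT_NOT_EXIST = "ERROR_PARENT_NOT_EXIST"
    ERROR_DB_INSERT_ERROR = "ERROR_DB_INSERT_ERROR"
    WARN_NO_CONTENT = "WARN_NO_CONTENT"


def make_demo_db():
    """In-memory table holding only the columns this sketch needs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_metadata ("
        "file_id INTEGER PRIMARY KEY, file_path TEXT NOT NULL, status TEXT NOT NULL)"
    )
    return conn


def add_file(conn, file_path, status):
    cur = conn.execute(
        "INSERT INTO file_metadata (file_path, status) VALUES (?, ?)",
        (file_path, status.value),
    )
    conn.commit()
    return cur.lastrowid


def fetch_pending(conn):
    """Return (file_id, file_path) for every file waiting for automatic processing."""
    return conn.execute(
        "SELECT file_id, file_path FROM file_metadata "
        "WHERE status = ? ORDER BY file_id",
        (Status.AUTO_ANALYSE.value,),
    ).fetchall()


def set_status(conn, file_id, status):
    """Record the outcome of processing, e.g. AUTO_ANALYSED or ERROR_NO_PATH."""
    conn.execute(
        "UPDATE file_metadata SET status = ? WHERE file_id = ?",
        (status.value, file_id),
    )
    conn.commit()
```

Updating the status after each file means a later query does not pick up the same file again, and error codes such as ERROR_NO_PATH stay visible for follow-up.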
S203: Parse the target file to obtain document information.
In this embodiment, if the target file is a compressed file, the compressed file is first decompressed to obtain a decompressed file, and the decompressed file is parsed to obtain the document information; if the target file is not a compressed file, the target file is parsed directly to generate the document information of the target file.
The compressed file may be a file in RAR format or a file in ZIP format.
S204: Perform word segmentation on the document information to obtain the search terms of the document information, and generate index information of the document information according to a preset retrieval framework.
In this embodiment, word segmentation may be performed on the document information with the Chinese tokenizer Smart Chinese Analyzer to obtain the search terms of the document information.
The preset retrieval framework may be the Apache Lucene framework.
Specifically, the index information of the document information may be generated from the search terms of the document information using the Apache Lucene framework. The search scope of the Apache Lucene framework may include the document title, the document path, the document content, the document's metadata information, and so on.
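To make the relationship between search terms and index information concrete, the sketch below builds a tiny in-memory inverted index. It is not Apache Lucene and does not reproduce Smart Chinese Analyzer: the `tokenize` function is a deliberately crude stand-in (lowercased alphanumeric runs plus single CJK characters), and `InvertedIndex` is a hypothetical class written only for this illustration.

```python
import re
from collections import defaultdict

# Alphanumeric runs, or one CJK character at a time (crude stand-in for a segmenter).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def tokenize(text):
    """Split text into lowercased search terms."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


class InvertedIndex:
    """Maps each search term to the set of document ids that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_document(self, doc_id, fields):
        """Index every field (e.g. title, path, content) of one document."""
        for text in fields.values():
            for term in tokenize(text):
                self._postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing all terms of the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result
```

A production system would rely on the analyzer and index writer of the retrieval framework itself; the sketch only shows why segmenting the document information into terms is a prerequisite for building the index.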
In this embodiment, retrieval can be performed through a user interface, including category search, highlighting, viewing file details, file preview, file association information, and so on.
As can be seen from the above description, the method receives files sent by clients of various types and the metadata information corresponding to the files, obtains a target file according to the metadata information, and parses the target file to obtain document information; it then performs word segmentation on the document information to obtain the search terms of the document information, and generates index information of the document information according to a preset retrieval framework. Document indexes can thus be generated automatically, which is more efficient, can handle large volumes of document information, and saves labor costs.
Referring to Fig. 3, Fig. 3 is a second flowchart of the document index generation method provided by an embodiment of the present invention. On the basis of the embodiment corresponding to Fig. 2, this embodiment describes in detail the process in step S203 of parsing the target file to obtain document information, as follows:
S301: Obtain the file format of the target file, and judge whether the target file exists.
In this embodiment, file formats are divided into compressed file formats and uncompressed file formats.
Specifically, judging whether the target file exists includes: performing a file query using the file path in the metadata information; if the file is found, determining that the target file exists; if the file cannot be found, determining that the target file does not exist.
S302: If the target file exists and its file format is a compressed file format, decompress the target file to obtain a decompressed file, and parse the decompressed file to generate the document information of the target file.
S303: If the target file exists and its file format is an uncompressed file format, parse the target file to generate the document information of the target file.
In this embodiment, the target file may be parsed with the open-source tool Apache Tika to obtain the document information of the target file. Likewise, the decompressed file may be parsed with Apache Tika to generate the document information of the target file.
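Steps S301 to S303 can be sketched as a single function. The example below is an illustration under stated assumptions, not the patented implementation: it handles only ZIP archives because the Python standard library has no RAR support, it decides the file format from the file extension, and `extract_text` is a placeholder that decodes bytes as UTF-8 where a real system would call a parser such as Apache Tika. The function names are hypothetical.

```python
import zipfile
from pathlib import Path

# Assumption: the compressed format is recognised by extension; RAR is omitted
# because the standard library cannot read it.
COMPRESSED_SUFFIXES = {".zip"}


def extract_text(data: bytes) -> str:
    """Placeholder for a real content parser: decode the bytes as UTF-8 text."""
    return data.decode("utf-8", errors="replace")


def parse_target_file(file_path: str) -> list:
    """Return (name, text) pairs for the target file.

    Raises FileNotFoundError when the path from the metadata does not exist,
    which corresponds to the ERROR_NO_PATH status.
    """
    path = Path(file_path)
    # S301: judge whether the target file exists.
    if not path.is_file():
        raise FileNotFoundError(file_path)
    # S302: compressed format, so read each archive member and parse it.
    if path.suffix.lower() in COMPRESSED_SUFFIXES:
        documents = []
        with zipfile.ZipFile(path) as archive:
            for member in archive.infolist():
                if member.is_dir():
                    continue
                documents.append((member.filename, extract_text(archive.read(member))))
        return documents
    # S303: uncompressed format, so parse the file directly.
    return [(path.name, extract_text(path.read_bytes()))]
```

Reading archive members in memory avoids writing decompressed files to disk; a system that must keep the decompressed files, for instance to record them as children of the archive via Parent_md5, would extract them to a working directory instead.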
When the target file is parsed with the open-source tool Apache Tika, the document information after parsing may include the content shown in Table 3.
Table 3. Content of the document information after parsing
As can be seen from the above description, target files in compressed format are decompressed automatically, so no manual decompression is needed, which improves the efficiency of document parsing.
Fig. 4 is a first structural schematic diagram of the document index generation equipment provided by an embodiment of the present invention. As shown in Fig. 4, the document index generation equipment 40 includes a receiving module 401, an obtaining module 402, a parsing module 403, and an index generation module 404.
The receiving module 401 is configured to receive files sent by clients of various types and the metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk;
The obtaining module 402 is configured to query the database according to the metadata information, and obtain, from the local disk, a target file waiting for automatic processing;
The parsing module 403 is configured to parse the target file to obtain document information;
The index generation module 404 is configured to perform word segmentation on the document information to obtain the search terms of the document information, and to generate index information of the document information according to a preset retrieval framework.
The equipment provided by this embodiment can be used to execute the technical solutions of the above method embodiments; its implementation principles and technical effects are similar and are not repeated here.
Referring to Fig. 5, Fig. 5 is a second structural schematic diagram of the document index generation equipment provided by an embodiment of the present invention. As shown in Fig. 5, on the basis of the embodiment of Fig. 4, this embodiment further includes a sending module 405.
The sending module 405 is configured to send file extraction information to the clients of various types before the files sent by the clients and the metadata information corresponding to the files are received, wherein the file extraction information includes, for each client type, the file formats allowed to be extracted and the scan path information; the file formats allowed to be extracted and the scan path information instruct each type of client to extract files, and the metadata information corresponding to the files, according to its corresponding allowed file formats and scan path information.
In one embodiment of the present invention, the parsing module 403 is specifically configured to obtain the file format of the target file and judge whether the target file exists;
If the target file exists and its file format is a compressed file format, decompress the target file to obtain a decompressed file, and parse the decompressed file to generate the document information of the target file;
If the target file exists and its file format is an uncompressed file format, parse the target file to generate the document information of the target file.
In one embodiment of the present invention, the parsing module 403 is further configured to judge whether the target file exists, including:
Performing a file query using the file path in the metadata information; if the file is found, determining that the target file exists; if the file cannot be found, determining that the target file does not exist.
In one embodiment of the present invention, the index generation module 404 is specifically configured to parse the target file with the open-source tool Apache Tika to obtain the document information of the target file.
In one embodiment of the present invention, the preset retrieval framework is the Apache Lucene framework.
The equipment provided by this embodiment can be used to execute the technical solutions of the above method embodiments; its implementation principles and technical effects are similar and are not repeated here.
Fig. 6 is a schematic diagram of the hardware structure of the document index generation equipment provided by an embodiment of the present invention. As shown in Fig. 6, the document index generation equipment 60 of this embodiment includes a processor 601 and a memory 602, wherein:
The memory 602 is configured to store computer-executable instructions;
The processor 601 is configured to execute the computer-executable instructions stored in the memory, so as to implement the steps performed by the server in the above embodiments. For details, refer to the related description in the foregoing method embodiments.
Optionally, the memory 602 may be either independent of or integrated with the processor 601.
When the memory 602 is provided independently, the document index generation equipment further includes a bus 603 for connecting the memory 602 and the processor 601.
An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the document index generation method described above.
In the several embodiments provided by the present invention, it should be understood that the disclosed equipment and methods may be implemented in other ways. For example, the equipment embodiments described above are merely illustrative. The division into modules is only a division by logical function; other divisions are possible in actual implementation, for example, multiple modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or modules, and may be electrical, mechanical, or in other forms.
Modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, each module may exist physically on its own, or two or more modules may be integrated into one unit. The unit formed from the above modules may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The above integrated module, when implemented in the form of a software functional module, may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods described in the embodiments of this application.
It should be understood that above-mentioned processor can be central processing unit (Central Processing Unit, abbreviation CPU), It can also be other general processors, digital signal processor (Digital Signal Processor, abbreviation DSP), dedicated Integrated circuit (Application Specific Integrated Circuit, abbreviation ASIC) etc..General processor can be Microprocessor or the processor are also possible to any conventional processor etc..It can be in conjunction with the step of invention disclosed method Be embodied directly in hardware processor and execute completion, or in processor hardware and software module combination execute completion.
Memory may include high speed RAM memory, it is also possible to and it further include non-volatile memories NVM, for example, at least one Magnetic disk storage can also be USB flash disk, mobile hard disk, read-only memory, disk or CD etc..
It is total that bus can be industry standard architecture (Industry Standard Architecture, abbreviation ISA) Line, external equipment interconnection (Peripheral Component, abbreviation PCI) bus or extended industry-standard architecture (Extended Industry Standard Architecture, abbreviation EISA) bus etc..It is total that bus can be divided into address Line, data/address bus, control bus etc..For convenient for indicating, the bus in illustrations does not limit an only bus or one The bus of seed type.
The storage medium may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. The storage medium may be any available medium that a general-purpose or special-purpose computer can access.
An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. The storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The processor and the storage medium may also exist as discrete components in an electronic device or a master control device.
Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the direction of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical discs.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A document index generation method, characterized in that it comprises:
receiving files sent by clients of each type and the metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk;
querying the database according to the metadata information, and obtaining from the local disk a target file awaiting automatic processing;
parsing the target file to obtain document information;
performing word segmentation on the document information to obtain the search terms of the document information, and generating index information for the document information according to a preset retrieval framework.
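The last step of claim 1 can be illustrated with a small sketch that segments text into terms and builds an inverted index. It is only a stand-in for the preset retrieval framework, not a reproduction of it: the regex tokenizer and the names `segment`, `build_index`, and `search` are assumptions made for illustration, and real Chinese text would need a dedicated word segmenter.

```python
import re
from collections import defaultdict


def segment(text):
    """Split text into lowercase terms (a naive stand-in for word segmentation)."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it.

    `documents` is a dict of {doc_id: document text}.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in segment(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = segment(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Calling `build_index` on a dictionary of parsed document texts and then `search(index, "some query")` returns the matching document ids; a production system would delegate both steps to the retrieval framework.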
2. The method according to claim 1, characterized in that, before receiving the files sent by the clients of each type and the metadata information corresponding to the files, the method further comprises:
sending file extraction information to the clients of each type, wherein the file extraction information includes, for each type of client, the file formats permitted for extraction and the scan path information;
wherein the permitted file formats and the scan path information instruct each type of client to extract files, and the metadata information corresponding to the files, according to its permitted file formats and scan path information.
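A client-side reading of claim 2 might look like the sketch below, which walks the configured scan paths and keeps only files whose format is permitted. The function name `scan_files`, the use of file extensions to represent formats, and the metadata fields (`path`, `name`, `format`, `size`) are all assumptions for illustration; the claim does not specify them.

```python
import os


def scan_files(scan_paths, allowed_extensions):
    """Collect files under the scan paths whose extension is permitted.

    Returns a list of metadata dicts, one per matching file.
    """
    allowed = {ext.lower() for ext in allowed_extensions}
    found = []
    for root_path in scan_paths:
        for dirpath, _dirnames, filenames in os.walk(root_path):
            for name in sorted(filenames):
                ext = os.path.splitext(name)[1].lower()
                if ext not in allowed:
                    continue  # format not permitted for extraction
                path = os.path.join(dirpath, name)
                found.append({
                    "path": path,
                    "name": name,
                    "format": ext,
                    "size": os.path.getsize(path),
                })
    return found
```

The returned dictionaries play the role of the metadata information that a client would send along with each extracted file.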
3. The method according to claim 1, characterized in that parsing the target file to obtain document information comprises:
obtaining the file format of the target file, and determining whether the target file exists;
if the target file exists and its file format is a compressed file format, decompressing the target file to obtain the decompressed files, and parsing the decompressed files to generate the document information of the target file;
if the target file exists and its file format is an uncompressed file format, parsing the target file to generate the document information of the target file.
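The branching in claim 3 can be sketched as follows. This is a simplified illustration under stated assumptions: only ZIP archives are treated as the compressed format, `extract_text` is a hypothetical placeholder that decodes bytes as UTF-8 rather than a real document parser, and the function names are invented for the example.

```python
import os
import zipfile


def extract_text(data):
    """Placeholder parser: decode bytes as UTF-8 text (an assumption, not a real parser)."""
    return data.decode("utf-8", errors="replace")


def parse_target_file(path):
    """Return a list of (name, text) pairs, or None if the target file does not exist."""
    if not os.path.isfile(path):
        return None
    if zipfile.is_zipfile(path):
        # Compressed format: read each archive member, then parse it.
        documents = []
        with zipfile.ZipFile(path) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                documents.append((info.filename, extract_text(archive.read(info))))
        return documents
    # Uncompressed format: parse the file directly.
    with open(path, "rb") as handle:
        return [(os.path.basename(path), extract_text(handle.read()))]
```

Returning a list in both branches lets the caller index a single file and an archive's members with the same loop.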
4. The method according to claim 3, characterized in that determining whether the target file exists comprises:
performing a file query using the file path in the metadata information; if a file is found, determining that the target file exists; if no file can be found, determining that the target file does not exist.
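One possible way to combine the metadata query with the existence check of claim 4 is sketched below using SQLite. The table name `file_metadata`, its columns, and the `'pending'` status value are hypothetical; the patent does not describe the database schema.

```python
import os
import sqlite3


def create_metadata_table(conn):
    """Create the hypothetical metadata table used in this sketch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_metadata ("
        "id INTEGER PRIMARY KEY, path TEXT NOT NULL, status TEXT NOT NULL)"
    )


def pending_target_files(conn):
    """Split the paths of pending metadata rows into existing and missing files."""
    rows = conn.execute(
        "SELECT path FROM file_metadata WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    existing, missing = [], []
    for (path,) in rows:
        # The file path recorded in the metadata decides whether the file exists.
        (existing if os.path.isfile(path) else missing).append(path)
    return existing, missing
```

Files in the `existing` list would go on to parsing, while those in `missing` could be logged or retried.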
5. The method according to claim 3, characterized in that parsing the target file to generate the document information of the target file comprises:
parsing the target file with the open-source tool Apache Tika to obtain the document information of the target file.
6. The method according to any one of claims 1 to 5, characterized in that the preset retrieval framework is the Apache Lucene framework.
7. A document index generating device, characterized in that it comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
receiving files sent by clients of each type and the metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk;
querying the database according to the metadata information, and obtaining from the local disk a target file awaiting automatic processing;
parsing the target file to obtain document information;
performing word segmentation on the document information to obtain the search terms of the document information, and generating index information for the document information according to a preset retrieval framework.
8. The device according to claim 7, characterized in that the processor further implements the following steps when executing the computer program:
before receiving the files sent by the clients of each type and the metadata information corresponding to the files, sending file extraction information to the clients of each type, wherein the file extraction information includes, for each type of client, the file formats permitted for extraction and the scan path information;
wherein the permitted file formats and the scan path information instruct each type of client to extract files, and the metadata information corresponding to the files, according to its permitted file formats and scan path information.
9. The device according to claim 7, characterized in that parsing the target file to obtain document information comprises:
obtaining the file format of the target file, and determining whether the target file exists;
if the target file exists and its file format is a compressed file format, decompressing the target file to obtain the decompressed files, and parsing the decompressed files to generate the document information of the target file;
if the target file exists and its file format is an uncompressed file format, parsing the target file to generate the document information of the target file.
10. A computer-readable storage medium, characterized in that computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the document index generation method according to any one of claims 1 to 6 is implemented.
CN201910383600.3A 2019-05-09 2019-05-09 Document index generation method and device Active CN110096478B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910383600.3A CN110096478B (en) 2019-05-09 2019-05-09 Document index generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910383600.3A CN110096478B (en) 2019-05-09 2019-05-09 Document index generation method and device

Publications (2)

Publication Number Publication Date
CN110096478A (en) 2019-08-06
CN110096478B (en) 2021-06-29

Family

ID=67447334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910383600.3A Active CN110096478B (en) 2019-05-09 2019-05-09 Document index generation method and device

Country Status (1)

Country Link
CN (1) CN110096478B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035409A (en) * 2020-11-03 2020-12-04 杭州蚁首网络科技有限公司 Entity file management method, system and computer storage medium
CN113312441A (en) * 2021-06-10 2021-08-27 中寰卫星导航通信有限公司 Map operation method and device

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103853832A (en) * 2014-03-11 2014-06-11 上海爱数软件有限公司 Customizable data capturing method in full-text retrieval system
US20140181056A1 (en) * 2011-08-30 2014-06-26 Patrick Thomas Sidney Pidduck System and method of quality assessment of a search index
CN104376067A (en) * 2014-11-13 2015-02-25 北京海泰方圆科技有限公司 Index file inputting method and retrieval method based on index file
US20150074080A1 (en) * 2011-08-30 2015-03-12 Open Text SA System and method of managing capacity of search index partitions
CN104715068A (en) * 2015-03-31 2015-06-17 北京奇虎科技有限公司 Method and device for generating document indexes and searching method and device
CN105205104A (en) * 2015-08-26 2015-12-30 成都布林特信息技术有限公司 Cloud platform data acquisition method
CN105808615A (en) * 2014-12-31 2016-07-27 北京奇虎科技有限公司 Document index generation method and device based on word segment weights
CN106599041A (en) * 2016-11-07 2017-04-26 中国电子科技集团公司第三十二研究所 Text processing and retrieval system based on big data platform
CN106649426A (en) * 2016-08-05 2017-05-10 浪潮软件股份有限公司 Data analysis method, data analysis platform and server
CN106776746A (en) * 2016-11-14 2017-05-31 天津南大通用数据技术股份有限公司 A kind of creation method and device of full-text index data
CN107016047A (en) * 2017-02-20 2017-08-04 阿里巴巴集团控股有限公司 Document query, document storing method and device
CN107038225A (en) * 2017-03-31 2017-08-11 江苏飞搏软件股份有限公司 The search method of information intelligent retrieval system
CN108228743A (en) * 2017-12-18 2018-06-29 深圳供电局有限公司 A kind of real-time big data search engine system
CN108241713A (en) * 2016-12-27 2018-07-03 南京烽火软件科技有限公司 A kind of inverted index search method based on polynary cutting
CN108874956A (en) * 2018-06-05 2018-11-23 中国平安人寿保险股份有限公司 Mass file search method, device, computer equipment and storage medium
CN109254967A (en) * 2018-08-29 2019-01-22 河南智慧云大数据有限公司 A kind of depth analysis method and device based on multi-source heterogeneous mass data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
N. RAGAVAN: "Efficient key hash indexing scheme with page rank for category based search engine big data", 2017 IEEE International Conference on Intelligent Techniques in Control, Optimization and Signal Processing (INCOS) *
徐旭平 et al.: "Research on metadata management based on MongoDB", 《信息技术》 (Information Technology) *

Also Published As

Publication number Publication date
CN110096478B (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN102622592A (en) Name card recognition method based on cloud technology
CN107784205B (en) User product auditing method, device, server and storage medium
CN110096478A (en) Document index generation method and equipment
US8577826B2 (en) Automated document separation
CN110532449B (en) Method, device, equipment and storage medium for processing service document
CN111881094A (en) Method, device, terminal and storage medium for extracting key information in log
CN105550179A (en) Webpage collection method and browser plug-in
CN110888791A (en) Log processing method, device, equipment and storage medium
CN112417195A (en) Trademark inquiry system and method based on mobile terminal and storage medium
CN110972086A (en) Short message processing method and device, electronic equipment and computer readable storage medium
CN116089732B (en) User preference identification method and system based on advertisement click data
CN110472121B (en) Business card information searching method and device, electronic equipment and computer readable storage medium
CN111047657A (en) Picture compression method, device, medium and electronic equipment
CN105681523A (en) Method and apparatus for sending birthday blessing short message automatically
WO2021129849A1 (en) Log processing method, apparatus and device, and storage medium
CN115658127A (en) Data processing method and device, electronic equipment and storage medium
CN105677827B (en) A kind of acquisition methods and device of list
CN114281761A (en) Data file loading method and device, computer equipment and storage medium
CN114339689A (en) Internet of things machine card binding pool control method and device and related medium
CN113468037A (en) Data quality evaluation method, device, medium and electronic equipment
CN113343116A (en) Intelligent chat recommendation method, system, equipment and storage medium based on enterprise warehouse
CN107180054B (en) Data processing method and device
CN114661772B (en) Data processing method and related device
CN110674395B (en) Information pushing method, device and equipment
CN112714033B (en) Method and device for determining characteristic information of video set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant