CN110096478B - Document index generation method and device - Google Patents
- Publication number
- CN110096478B CN201910383600.3A
- Authority
- CN
- China
- Prior art keywords
- file
- information
- target file
- document
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/11—File system administration, e.g. details of archiving or snapshots
- G06F16/116—Details of conversion of file system types or formats
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the invention provides a document index generation method and device. The method comprises: receiving files sent by various types of clients and metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk; querying the database to acquire, according to the metadata information, a target file waiting for automatic processing from the local disk; parsing the target file to obtain document information; and performing word segmentation on the document information to obtain search terms of the document information, and generating index information of the document information according to a preset retrieval framework. Document indexes can thus be generated automatically and efficiently, a large amount of document information can be processed, and labor costs can be saved.
Description
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a document index generation method and device.
Background
With the rapid development of the internet, the document information on the internet is rapidly increasing. In order to find all required document information in the massive information, a user generally finds relevant document information according to the index information.
At present, in order to establish a document index, a common method is to manually sort documents from different sources to obtain index information of different documents, and then upload the index information to a database server for retrieval by a user.
However, the inventor finds that in the existing process of manually sorting and establishing index information for documents from different sources, when the number of documents is large, the operation is complicated, a large amount of manpower is required, and the cost is high.
Disclosure of Invention
The embodiment of the invention provides a document index generation method and device, and aims to solve the problems that in the prior art, documents from different sources are manually sorted and index information is established, when the number of the documents is large, the operation is complex, a large amount of manpower is consumed, and the cost is high.
In a first aspect, an embodiment of the present invention provides a method for generating a document index, including:
receiving files sent by various types of clients and metadata information corresponding to the files, wherein the metadata information is stored in a database, and the files are stored in a local disk;
inquiring the database to acquire a target file waiting for automatic processing from the local disk according to the metadata information;
analyzing the target file to obtain document information;
and performing word segmentation on the document information to obtain search terms of the document information, and generating index information of the document information according to a preset retrieval framework.
In a possible design, before the receiving the file sent by each type of client and the metadata information corresponding to the file, the method further includes:
sending file extraction information to various types of clients, wherein the file extraction information comprises file formats and scanning path information which are allowed to be extracted and correspond to the various types of clients;
the file format and the scanning path information which are allowed to be extracted are used for indicating the clients of each type to extract the file and the metadata information corresponding to the file according to the corresponding file format and the scanning path information which are allowed to be extracted.
In one possible design, the parsing the target file to obtain document information includes:
acquiring the file format of the target file, and judging whether the target file exists or not;
if the target file exists and the file format of the target file is a compressed file format, decompressing the target file to obtain a decompressed file, and analyzing the decompressed file to generate document information of the target file;
and if the target file exists and the file format of the target file is a non-compressed file format, analyzing the target file to generate the document information of the target file.
In one possible design, determining whether the target file exists includes:
and inquiring the file through the file path in the metadata information, if the file is inquired, determining that the target file exists, and if the file cannot be inquired, determining that the target file does not exist.
In one possible design, the parsing the target file to generate the document information of the target file includes:
and parsing the target file with the open-source tool Apache Tika to acquire the document information of the target file.
In one possible design, the preset retrieval framework is the Apache Lucene framework.
In a second aspect, an embodiment of the present invention provides a document index generating device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer program:
receiving files sent by various types of clients and metadata information corresponding to the files, wherein the metadata information is stored in a database, and the files are stored in a local disk;
inquiring the database to acquire a target file waiting for automatic processing from the local disk according to the metadata information;
analyzing the target file to obtain document information;
and performing word segmentation on the document information to obtain search terms of the document information, and generating index information of the document information according to a preset retrieval framework.
The processor, when executing the computer program, further implements the steps of:
before receiving the files sent by the various types of clients and the metadata information corresponding to the files, sending file extraction information to the various types of clients, wherein the file extraction information comprises file formats and scanning path information which are allowed to be extracted and correspond to the various types of clients;
the file format and the scanning path information which are allowed to be extracted are used for indicating the clients of each type to extract the file and the metadata information corresponding to the file according to the corresponding file format and the scanning path information which are allowed to be extracted.
In one possible design, the parsing the target file to obtain document information includes:
acquiring the file format of the target file, and judging whether the target file exists or not;
if the target file exists and the file format of the target file is a compressed file format, decompressing the target file to obtain a decompressed file, and analyzing the decompressed file to generate document information of the target file;
and if the target file exists and the file format of the target file is a non-compressed file format, analyzing the target file to generate the document information of the target file.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium in which computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the document index generation method according to the first aspect or any possible design of the first aspect is implemented.
According to the document index generation method and device provided by the embodiments of the invention, files sent by various types of clients and metadata information corresponding to the files are received, a target file is obtained according to the metadata information, and the target file is parsed to obtain document information; word segmentation is performed on the document information to obtain search terms of the document information, and index information of the document information is generated according to a preset retrieval framework. Document indexes can thus be generated automatically and efficiently, a large amount of document information can be processed, and labor costs can be saved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic diagram of a system architecture for generating a document index according to an embodiment of the present invention;
FIG. 2 is a first flowchart illustrating a document index generating method according to an embodiment of the present invention;
FIG. 3 is a second flowchart illustrating a document index generating method according to an embodiment of the present invention;
FIG. 4 is a first schematic structural diagram of a document index generating device according to an embodiment of the present invention;
FIG. 5 is a second schematic structural diagram of a document index generating device according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a hardware structure of a document index generating device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture for generating a document index according to an embodiment of the present invention. As shown in fig. 1, the system provided by the present embodiment includes various types of clients 101 and a server 102. The client 101 may be a mobile phone, a tablet, a personal computer, or the like. The present embodiment does not particularly limit the implementation of the client 101 and the server 102, as long as the client 101 can interact with the server 102. The server 102 is used for managing document retrieval services, and may be a single server or a cluster of multiple servers.
Referring to fig. 2, fig. 2 is a flowchart illustrating a document index generating method according to an embodiment of the present invention, where an executing subject of the embodiment may be a server according to the embodiment shown in fig. 1, and the embodiment is not limited herein. As shown in fig. 2, the method includes:
S201: receiving files sent by various types of clients and metadata information corresponding to the files, wherein the metadata information is stored in a database, and the files are stored in a local disk.
In this embodiment, the clients of various types may be a mail client, a Text Services Framework (TSF) client, a chat application client, and the like.
Specifically, file extraction information may be sent to each type of client, where the file extraction information includes file formats and scanning path information that are allowed to be extracted and correspond to the each type of client;
the file format and the scanning path information which are allowed to be extracted are used for indicating the clients of each type to extract the file and the metadata information corresponding to the file according to the corresponding file format and the scanning path information which are allowed to be extracted.
For example, the mail client scans the file in the mail format according to the corresponding scanning path information, and the TSF client scans the file in the TSF format according to the corresponding scanning path information.
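The client-side extraction described above can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the `ExtractionInfo` class, the `extract_files` function, and the `.eml` extension are hypothetical names chosen for the example, and only a subset of the metadata fields is filled in.

```python
import os
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExtractionInfo:
    """Hypothetical shape of the file extraction information sent by the server."""
    client_type: str
    allowed_formats: tuple  # file extensions this client may extract, e.g. (".eml",)
    scan_path: str          # directory this client is told to scan


def extract_files(info):
    """Walk the scan path and collect metadata for files in an allowed format."""
    records = []
    for root, _dirs, names in os.walk(info.scan_path):
        for name in sorted(names):
            path = Path(root) / name
            if path.suffix.lower() not in info.allowed_formats:
                continue  # format not allowed for this client type
            records.append({
                "File_tile": name,
                "File_path": str(path),
                "Client_type": info.client_type,
                "In_time": time.strftime("%Y-%m-%d %H:%M:%S"),
            })
    return records


# Usage: a mail client that is only allowed to extract .eml files (assumed extension).
with tempfile.TemporaryDirectory() as scan_dir:
    Path(scan_dir, "hello.eml").write_text("Subject: hi")
    Path(scan_dir, "notes.txt").write_text("not a mail file")
    found = extract_files(ExtractionInfo("mail", (".eml",), scan_dir))
    print([record["File_tile"] for record in found])
```

Keeping the allowed formats and the scan path in one record per client type means the server can change what each client extracts without changing client code.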
The metadata information corresponding to the file may include detailed information as shown in table 1.
Table 1. Details of the metadata information corresponding to a file
Field | Description |
File_id | Auto-increment field with no specific meaning |
File_tile | Name of the file |
File_path | Storage location of the file after extraction |
Client_type | Specific type of the custom client |
In_time | Extraction time of the file |
author | Creator of the extracted document |
Status | Processing state of the extracted file |
Parent_md5 | MD5 code of the associated parent file |
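One way the metadata fields of Table 1 could be stored is sketched below with SQLite from the Python standard library. The source does not name a database product, so the table name `file_metadata`, the column types, and the sample row are all assumptions made for illustration.

```python
import sqlite3

# In-memory database used only for illustration; the source does not name a database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE file_metadata (
        File_id     INTEGER PRIMARY KEY AUTOINCREMENT,  -- auto-increment, no meaning
        File_tile   TEXT NOT NULL,                      -- file name
        File_path   TEXT NOT NULL,                      -- storage location after extraction
        Client_type TEXT NOT NULL,                      -- type of the custom client
        In_time     TEXT NOT NULL,                      -- extraction time
        author      TEXT,                               -- creator of the document
        Status      TEXT NOT NULL,                      -- processing state code
        Parent_md5  TEXT                                -- MD5 of the associated parent file
    )
""")

# Hypothetical sample row; the path, author, and timestamp are made up.
conn.execute(
    "INSERT INTO file_metadata "
    "(File_tile, File_path, Client_type, In_time, author, Status, Parent_md5) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("report.docx", "/data/extracted/report.docx", "mail",
     "2019-05-09 10:00:00", "alice", "AUTO_ANALYSE", None),
)
row = conn.execute("SELECT File_id, File_tile, Status FROM file_metadata").fetchone()
print(row)
```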
S202: and inquiring the database to acquire a target file waiting for automatic processing from the local disk according to the metadata information.
In this embodiment, the metadata information includes the processing state of the extracted file (see Table 1). The processing state may be marked according to the file format during file extraction. For example, a file that cannot be parsed is marked "waiting for manual processing"; a file that can be parsed is marked "waiting for automatic processing"; a file whose manual parsing has completed is marked "manual processing completed"; a file whose automatic parsing has completed is marked "automatic processing completed"; a file that is not found during scanning is marked "file does not exist error"; a file that is found during scanning to already exist is marked "file already exists error"; a file whose parent-level associated file does not exist is marked "parent-level associated file does not exist error"; a file whose information cannot be written after parsing is marked "information write failure after file parsing"; and a file found during scanning to have no content is marked "file content does not exist". The correspondence between the state codes and the processing states of the extracted files is shown in Table 2.
Table 2. Correspondence between state codes and processing states of extracted files
State code | State description |
MANUAL_ANALYSE | Waiting for manual processing |
AUTO_ANALYSE | Waiting for automatic processing |
MANUAL_ANALYSED | Manual processing completed |
AUTO_ANALYSED | Automatic processing completed |
ERROR_NO_PATH | File does not exist error |
ERROR_MD5_EXIST | File already exists error |
ERROR_PARENT_NOT_EXIST | Parent-level associated file does not exist error |
ERROR_DB_INSERT_ERROR | Information write failure after file parsing |
WARN_NO_CONTENT | File content does not exist |
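Step S202 amounts to selecting the records whose state code is AUTO_ANALYSE. The sketch below shows one possible form of that query, again using SQLite as an assumed database; the `Status` enum, the helper names `fetch_pending` and `mark`, the reduced table layout, and the sample paths are hypothetical.

```python
import sqlite3
from enum import Enum


class Status(Enum):
    """State codes copied from Table 2."""
    MANUAL_ANALYSE = "MANUAL_ANALYSE"
    AUTO_ANALYSE = "AUTO_ANALYSE"
    MANUAL_ANALYSED = "MANUAL_ANALYSED"
    AUTO_ANALYSED = "AUTO_ANALYSED"
    ERROR_NO_PATH = "ERROR_NO_PATH"
    ERROR_MD5_EXIST = "ERROR_MD5_EXIST"
    ERROR_PARENT_NOT_EXIST = "ERROR_PARENT_NOT_EXIST"
    ERROR_DB_INSERT_ERROR = "ERROR_DB_INSERT_ERROR"
    WARN_NO_CONTENT = "WARN_NO_CONTENT"


def fetch_pending(conn):
    """Return (File_id, File_path) for every file waiting for automatic processing."""
    rows = conn.execute(
        "SELECT File_id, File_path FROM file_metadata WHERE Status = ? ORDER BY File_id",
        (Status.AUTO_ANALYSE.value,),
    )
    return rows.fetchall()


def mark(conn, file_id, status):
    """Record a new processing state for one file."""
    conn.execute(
        "UPDATE file_metadata SET Status = ? WHERE File_id = ?",
        (status.value, file_id),
    )


# Usage with a reduced, hypothetical table layout and made-up paths.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_metadata (File_id INTEGER PRIMARY KEY, File_path TEXT, Status TEXT)"
)
conn.executemany(
    "INSERT INTO file_metadata (File_path, Status) VALUES (?, ?)",
    [("/data/a.docx", "AUTO_ANALYSE"),
     ("/data/b.bin", "MANUAL_ANALYSE"),
     ("/data/c.pdf", "AUTO_ANALYSE")],
)
pending = fetch_pending(conn)
mark(conn, pending[0][0], Status.AUTO_ANALYSED)  # first file has now been processed
remaining = fetch_pending(conn)
print(pending, remaining)
```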
S203: and analyzing the target file to obtain document information.
In this embodiment, if the target file is a compressed file, the compressed file is first decompressed to obtain a decompressed file, and the decompressed file is parsed to obtain the document information; if the target file is not a compressed file, the target file is parsed directly to generate the document information of the target file.
The compressed file may be a rar compressed format file or a zip compressed format file.
S204: performing word segmentation on the document information to obtain search terms of the document information, and generating index information of the document information according to a preset retrieval framework.
In this embodiment, the document information may be subjected to word segmentation processing by a Chinese word segmenter Smart Chinese Analyzer to obtain a search term of the document information.
The preset retrieval framework may be the Apache Lucene framework.
Specifically, the index information of the document information may be generated from the search terms of the document information using the Apache Lucene framework. The retrieval scope of the Apache Lucene framework may include the title of a document, the path of the document, the content of the document, the metadata information of the document, and the like.
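The idea behind step S204 can be illustrated with a toy inverted index. This sketch deliberately does not use Apache Lucene or the Smart Chinese Analyzer: the `segment` function is a simple stand-in that splits on ASCII letters and digits (so it does not segment Chinese text), and the `InvertedIndex` class, field names, and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def segment(text):
    """Toy word segmentation: lowercase runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps (field, term) pairs to the ids of the documents that contain them."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, fields):
        """Index every field (for example title, path, content) of one document."""
        self.docs[doc_id] = fields
        for field, text in fields.items():
            for term in segment(text):
                self.postings[(field, term)].add(doc_id)

    def search(self, field, query):
        """Return sorted ids of documents whose field contains every query term."""
        terms = segment(query)
        if not terms:
            return []
        hits = set.intersection(
            *(self.postings.get((field, term), set()) for term in terms)
        )
        return sorted(hits)


# Usage with two made-up documents.
index = InvertedIndex()
index.add(1, {"title": "Quarterly report", "path": "/data/report.docx",
              "content": "Revenue grew in the third quarter"})
index.add(2, {"title": "Meeting notes", "path": "/data/notes.txt",
              "content": "Discuss the quarterly report draft"})
by_content = index.search("content", "quarterly report")
by_title = index.search("title", "report")
print(by_content, by_title)
```

Indexing each field separately mirrors the retrieval scope listed above, where title, path, content, and metadata can each be searched.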
In this embodiment, the search may be performed through a user interface, including category search, highlighting, document detail viewing, document preview, document association information, and so on.
As described above, files sent by various types of clients and the metadata information corresponding to the files are received, a target file is obtained according to the metadata information, and the target file is parsed to obtain document information; word segmentation is performed on the document information to obtain search terms of the document information, and index information of the document information is generated according to a preset retrieval framework. Document indexes can thus be generated automatically and efficiently, a large amount of document information can be processed, and labor costs can be saved.
Referring to fig. 3, fig. 3 is a second schematic flow chart of a document index generating method according to an embodiment of the present invention, and on the basis of the embodiment corresponding to fig. 2, this embodiment describes in detail a specific process of analyzing the target file to obtain document information in step S203, which is detailed as follows:
S301: acquiring the file format of the target file, and judging whether the target file exists.
In the present embodiment, the file format is classified into a compressed file format and an uncompressed file format.
Specifically, the determining whether the target file exists includes: and inquiring the file through the file path in the metadata information, if the file is inquired, determining that the target file exists, and if the file cannot be inquired, determining that the target file does not exist.
S302: and if the target file exists and the file format of the target file is a compressed file format, decompressing the target file to obtain a decompressed file, and analyzing the decompressed file to generate the document information of the target file.
S303: and if the target file exists and the file format of the target file is a non-compressed file format, analyzing the target file to generate the document information of the target file.
In this embodiment, the target file may be parsed with the open-source tool Apache Tika to obtain the document information of the target file. Similarly, the decompressed file may be parsed with Apache Tika to generate the document information of the target file.
When the target file is parsed with Apache Tika, the parsed document information may include the contents shown in Table 3.
TABLE 3 content corresponding to parsed document information
As can be seen from the above description, a target file in a compressed format is decompressed automatically rather than manually, which improves the parsing efficiency of documents.
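The flow of steps S301 to S303 can be sketched as follows. This is an illustration under stated assumptions: `parse_file` is a stand-in that reads the file as UTF-8 text in place of Apache Tika, only the zip format is handled because the Python standard library cannot read rar archives, and the function names and sample files are hypothetical.

```python
import os
import tempfile
import zipfile
from pathlib import Path

# The source also names rar; only zip is handled in this sketch.
COMPRESSED_FORMATS = {".zip"}


def parse_file(path):
    """Stand-in parser: read the file as UTF-8 text (the source uses Apache Tika)."""
    path = Path(path)
    return {"title": path.name,
            "content": path.read_text(encoding="utf-8", errors="replace")}


def parse_target(file_path):
    """Return a list of document-information dicts, or None if the file is missing."""
    if not os.path.exists(file_path):
        return None  # S301: the file path from the metadata cannot be found
    if Path(file_path).suffix.lower() in COMPRESSED_FORMATS:
        # S302: decompress first, then parse every decompressed file.
        with tempfile.TemporaryDirectory() as out_dir:
            with zipfile.ZipFile(file_path) as archive:
                archive.extractall(out_dir)
            members = sorted(p for p in Path(out_dir).rglob("*") if p.is_file())
            return [parse_file(p) for p in members]
    return [parse_file(file_path)]  # S303: parse the uncompressed file directly


# Usage: one plain file, one zip archive, and one path that does not exist.
with tempfile.TemporaryDirectory() as work:
    plain = Path(work, "a.txt")
    plain.write_text("plain text body", encoding="utf-8")
    packed = Path(work, "bundle.zip")
    with zipfile.ZipFile(packed, "w") as archive:
        archive.write(plain, arcname="inner.txt")
    plain_docs = parse_target(str(plain))
    zip_docs = parse_target(str(packed))
    missing = parse_target(str(Path(work, "gone.txt")))
    print(plain_docs, zip_docs, missing)
```

Returning None for a missing file lets the caller record a state such as ERROR_NO_PATH instead of raising an exception in the middle of a batch.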
Fig. 4 is a first schematic structural diagram of a document index generating device according to an embodiment of the present invention. As shown in fig. 4, the document index generating apparatus 40 includes: a receiving module 401, an obtaining module 402, a parsing module 403 and an index generating module 404.
The receiving module 401 is configured to receive files sent by various types of clients and metadata information corresponding to the files, where the metadata information is stored in a database, and the files are stored in a local disk;
an obtaining module 402, configured to query the database, according to the metadata information, to obtain a target file to be automatically processed from the local disk;
an analysis module 403, configured to analyze the target file to obtain document information;
an index generating module 404, configured to perform word segmentation on the document information to obtain search terms of the document information, and generate index information of the document information according to a preset retrieval framework.
The device provided in this embodiment may be used to implement the technical solution of the above method embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Referring to fig. 5, fig. 5 is a second schematic structural diagram of a document index generating device according to an embodiment of the present invention. As shown in fig. 5, this embodiment further includes, on the basis of the embodiment in fig. 4: a sending module 405.
The sending module 405 is configured to send file extraction information to each type of client before receiving a file sent by each type of client and metadata information corresponding to the file, where the file extraction information includes file formats and scanning path information that are allowed to be extracted and correspond to each type of client; the file format and the scanning path information which are allowed to be extracted are used for indicating the clients of each type to extract the file and the metadata information corresponding to the file according to the corresponding file format and the scanning path information which are allowed to be extracted.
In an embodiment of the present invention, the parsing module 403 is specifically configured to obtain a file format of the target file, and determine whether the target file exists;
if the target file exists and the file format of the target file is a compressed file format, decompressing the target file to obtain a decompressed file, and analyzing the decompressed file to generate document information of the target file;
and if the target file exists and the file format of the target file is a non-compressed file format, analyzing the target file to generate the document information of the target file.
In an embodiment of the present invention, the parsing module 403 is further configured to determine whether the target file exists, including:
and inquiring the file through the file path in the metadata information, if the file is inquired, determining that the target file exists, and if the file cannot be inquired, determining that the target file does not exist.
In an embodiment of the present invention, the index generating module 404 is specifically configured to parse the target file with the open-source tool Apache Tika to obtain the document information of the target file.
In an embodiment of the present invention, the preset retrieval framework is the Apache Lucene framework.
The device provided in this embodiment may be used to implement the technical solution of the above method embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Fig. 6 is a schematic diagram of a hardware structure of a document index generating device according to an embodiment of the present invention. As shown in fig. 6, the document index generating device 60 of the present embodiment includes a processor 601 and a memory 602, wherein:
the memory 602 is configured to store computer-executable instructions; and
the processor 601 is configured to execute the computer-executable instructions stored in the memory to implement the steps performed by the server in the above embodiments. For details, refer to the description of the method embodiments above.
Alternatively, the memory 602 may be separate or integrated with the processor 601.
When the memory 602 is provided separately, the document index generating apparatus further includes a bus 603 for connecting the memory 602 and the processor 601.
An embodiment of the present invention further provides a computer-readable storage medium in which computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the document index generation method described above is implemented.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit. The unit formed by the modules can be realized in a hardware form, and can also be realized in a form of hardware and a software functional unit.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise a high-speed RAM, and may further comprise a non-volatile memory (NVM), such as at least one disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, or the like.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the bus in the figures of the present application is not limited to only one bus or one type of bus.
The storage medium may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (8)
1. A document index generation method, characterized by comprising the following steps:
receiving files sent by various types of clients and metadata information corresponding to the files, wherein the metadata information is stored in a database, and the files are stored on a local disk;
querying the database and, according to the metadata information, acquiring from the local disk a target file awaiting automatic processing;
parsing the target file to obtain document information;
performing word segmentation processing on the document information to obtain search terms of the document information, and generating index information of the document information according to a preset retrieval framework;
wherein, before receiving the files sent by the various types of clients and the metadata information corresponding to the files, the method further comprises the following steps:
sending file extraction information to the various types of clients, wherein the file extraction information comprises, for each type of client, the file formats allowed to be extracted and scanning path information;
wherein the file formats allowed to be extracted and the scanning path information are used to instruct each type of client to extract files, and the metadata information corresponding to the files, according to its corresponding allowed file formats and scanning path information.
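As an illustration only, the word segmentation and index generation steps of claim 1 can be sketched in Python as a small inverted index. This is not the preset retrieval framework itself and does not use any Apache Lucene API. The names `segment` and `build_index` and the sample documents are hypothetical, and the regular expression handles only ASCII letters and digits, so a real word segmenter would be needed for Chinese text.

```python
import re
from collections import defaultdict


def segment(text):
    """Split text into lowercase search terms (stand-in for a real word segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each search term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in segment(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical document information, keyed by file name.
documents = {
    "a.txt": "Document index generation method",
    "b.txt": "Index files on the local disk",
}
index = build_index(documents)
print(index["index"])  # ['a.txt', 'b.txt']
```

A lookup in `index` then returns the files that contain a given search term, which is the role the index information plays at query time.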
2. The method of claim 1, wherein parsing the target file to obtain document information comprises:
acquiring the file format of the target file, and determining whether the target file exists;
if the target file exists and its file format is a compressed file format, decompressing the target file to obtain a decompressed file, and parsing the decompressed file to generate the document information of the target file;
and if the target file exists and its file format is a non-compressed file format, parsing the target file to generate the document information of the target file.
3. The method of claim 2, wherein determining whether the target file exists comprises:
querying for the file using the file path in the metadata information; if the file is found, determining that the target file exists; and if the file cannot be found, determining that the target file does not exist.
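The branching described in claims 2 and 3 can be sketched as follows, under stated assumptions. The metadata key `file_path`, the function names, and the plain-text `parse_file` helper are hypothetical. ZIP stands in for "a compressed file format", and the format is detected from the file itself rather than read from the metadata. A real system would hand the parsing step to a document parser such as the one named in claim 4.

```python
import os
import tempfile
import zipfile


def parse_file(path):
    """Hypothetical parser: read the file as UTF-8 text, replacing bad bytes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


def extract_document_info(metadata):
    """Return (name, text) pairs for the target file, or None if it is missing."""
    path = metadata["file_path"]
    # Claim 3: the target file exists only if the stored path resolves to a file.
    if not os.path.isfile(path):
        return None
    # Claim 2, compressed branch: decompress first, then parse each member.
    if zipfile.is_zipfile(path):
        results = []
        with tempfile.TemporaryDirectory() as workdir:
            with zipfile.ZipFile(path) as archive:
                archive.extractall(workdir)
            for root, _dirs, names in os.walk(workdir):
                for name in sorted(names):
                    full = os.path.join(root, name)
                    relative = os.path.relpath(full, workdir)
                    results.append((relative, parse_file(full)))
        return results
    # Claim 2, non-compressed branch: parse the target file directly.
    return [(os.path.basename(path), parse_file(path))]
```

Returning `None` for a missing file lets the caller skip or flag the database record instead of failing the whole batch, which is one possible design choice rather than something the claims prescribe.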
4. The method of claim 2, wherein parsing the target file to generate document information of the target file comprises:
parsing the target file using the open-source tool Apache Tika to acquire the document information of the target file.
5. The method of any one of claims 1 to 4, wherein the preset retrieval framework is the Apache Lucene framework.
6. A document index generation device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
receiving files sent by various types of clients and metadata information corresponding to the files, wherein the metadata information is stored in a database, and the files are stored on a local disk;
querying the database and, according to the metadata information, acquiring from the local disk a target file awaiting automatic processing;
parsing the target file to obtain document information;
performing word segmentation processing on the document information to obtain search terms of the document information, and generating index information of the document information according to a preset retrieval framework;
the processor, when executing the computer program, further implements the steps of:
before receiving the files sent by the various types of clients and the metadata information corresponding to the files, sending file extraction information to the various types of clients, wherein the file extraction information comprises, for each type of client, the file formats allowed to be extracted and scanning path information;
wherein the file formats allowed to be extracted and the scanning path information are used to instruct each type of client to extract files, and the metadata information corresponding to the files, according to its corresponding allowed file formats and scanning path information.
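On the client side, the file extraction information described above could be applied roughly as in the following sketch. The dictionary keys `allowed_formats` and `scan_path`, the function name `extract_files`, and the metadata fields are all hypothetical. The claims do not specify how formats are identified, so matching on file extensions is an assumption.

```python
import os


def extract_files(extraction_info):
    """Walk the scan path and collect metadata for files in an allowed format."""
    allowed = {ext.lower() for ext in extraction_info["allowed_formats"]}
    records = []
    for root, _dirs, names in os.walk(extraction_info["scan_path"]):
        for name in sorted(names):
            extension = os.path.splitext(name)[1].lower()
            if extension not in allowed:
                continue  # format not allowed for this client type
            path = os.path.join(root, name)
            records.append({
                "file_name": name,
                "file_path": path,
                "file_format": extension,
                "size_bytes": os.path.getsize(path),
            })
    return records
```

Each record is the kind of metadata a client might send along with the file, for example `extract_files({"allowed_formats": [".pdf", ".txt"], "scan_path": "/data/docs"})`, where the path shown is a placeholder.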
7. The apparatus of claim 6, wherein parsing the target file to obtain document information comprises:
acquiring the file format of the target file, and determining whether the target file exists;
if the target file exists and its file format is a compressed file format, decompressing the target file to obtain a decompressed file, and parsing the decompressed file to generate the document information of the target file;
and if the target file exists and its file format is a non-compressed file format, parsing the target file to generate the document information of the target file.
8. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement the document index generation method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910383600.3A CN110096478B (en) | 2019-05-09 | 2019-05-09 | Document index generation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110096478A (en) | 2019-08-06 |
CN110096478B (en) | 2021-06-29 |
Family
ID=67447334
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910383600.3A Active CN110096478B (en) | 2019-05-09 | 2019-05-09 | Document index generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110096478B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112035409B (en) * | 2020-11-03 | 2021-07-27 | 杭州蚁首网络科技有限公司 | Entity file management method, system and computer storage medium |
CN113312441A (en) * | 2021-06-10 | 2021-08-27 | 中寰卫星导航通信有限公司 | Map operation method and device |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104376067A (en) * | 2014-11-13 | 2015-02-25 | 北京海泰方圆科技有限公司 | Index file inputting method and retrieval method based on index file |
CN104715068A (en) * | 2015-03-31 | 2015-06-17 | 北京奇虎科技有限公司 | Method and device for generating document indexes and searching method and device |
CN105205104A (en) * | 2015-08-26 | 2015-12-30 | 成都布林特信息技术有限公司 | Cloud platform data acquisition method |
CN106599041A (en) * | 2016-11-07 | 2017-04-26 | 中国电子科技集团公司第三十二研究所 | Text processing and retrieval system based on big data platform |
CN106649426A (en) * | 2016-08-05 | 2017-05-10 | 浪潮软件股份有限公司 | Data analysis method, data analysis platform and server |
CN106776746A (en) * | 2016-11-14 | 2017-05-31 | 天津南大通用数据技术股份有限公司 | A kind of creation method and device of full-text index data |
CN107016047A (en) * | 2017-02-20 | 2017-08-04 | 阿里巴巴集团控股有限公司 | Document query, document storing method and device |
CN107038225A (en) * | 2017-03-31 | 2017-08-11 | 江苏飞搏软件股份有限公司 | The search method of information intelligent retrieval system |
CN108228743A (en) * | 2017-12-18 | 2018-06-29 | 深圳供电局有限公司 | A kind of real-time big data search engine system |
CN108241713A (en) * | 2016-12-27 | 2018-07-03 | 南京烽火软件科技有限公司 | A kind of inverted index search method based on polynary cutting |
CN109254967A (en) * | 2018-08-29 | 2019-01-22 | 河南智慧云大数据有限公司 | A kind of depth analysis method and device based on multi-source heterogeneous mass data |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8983920B2 (en) * | 2011-08-30 | 2015-03-17 | Open Text S.A. | System and method of quality assessment of a search index |
US8909615B2 (en) * | 2011-08-30 | 2014-12-09 | Open Text S.A. | System and method of managing capacity of search index partitions |
CN103853832B (en) * | 2014-03-11 | 2017-07-28 | 上海爱数信息技术股份有限公司 | Customizable data grasping means in a kind of text retrieval system |
CN105808615A (en) * | 2014-12-31 | 2016-07-27 | 北京奇虎科技有限公司 | Document index generation method and device based on word segment weights |
CN108874956A (en) * | 2018-06-05 | 2018-11-23 | 中国平安人寿保险股份有限公司 | Mass file search method, device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110096478A (en) | 2019-08-06 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110096478B (en) | Document index generation method and device | |
CN112199344B (en) | Log classification method and device | |
CN110532449B (en) | Method, device, equipment and storage medium for processing service document | |
CN108038441B (en) | System and method based on image recognition | |
CN110888791A (en) | Log processing method, device, equipment and storage medium | |
CN115098440A (en) | Electronic archive query method, device, storage medium and equipment | |
CN110874526B (en) | File similarity detection method and device, electronic equipment and storage medium | |
CN112418813A (en) | AEO qualification intelligent rating management system and method based on intelligent analysis and identification and storage medium | |
CN110472121B (en) | Business card information searching method and device, electronic equipment and computer readable storage medium | |
CN109815243B (en) | Structured storage method and device during document interface modification | |
CN111124470A (en) | Automatic optimization method and device for program package based on cloud platform | |
CN113138974B (en) | Method and device for detecting database compliance | |
CN111047657A (en) | Picture compression method, device, medium and electronic equipment | |
CN116204428A (en) | Test case generation method and device | |
CN114281761A (en) | Data file loading method and device, computer equipment and storage medium | |
CN115658127A (en) | Data processing method and device, electronic equipment and storage medium | |
CN114090673A (en) | Data processing method, equipment and storage medium for multiple data sources | |
CN109491699B (en) | Resource checking method, device, equipment and storage medium of application program | |
CN113111200A (en) | Method and device for auditing picture file, electronic equipment and storage medium | |
US20200186675A1 (en) | System and method for determining compression rates for images comprising text | |
CN110674395B (en) | Information pushing method, device and equipment | |
CN113033832B (en) | Method and device for inputting automobile repair data, terminal equipment and readable storage medium | |
CN112597109B (en) | Data storage method, device, electronic equipment and storage medium | |
CN117389769B (en) | Browser-end rich text copying method and system based on cloud service and cloud platform | |
CN117112846B (en) | Multi-information source license information management method, system and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||