CN110096478B - Document index generation method and device - Google Patents

Publication number
CN110096478B
Authority
CN
China
Prior art keywords
file
information
target file
document
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910383600.3A
Other languages
Chinese (zh)
Other versions
CN110096478A (en)
Inventor
徐凯
丛新法
侯青军
杨通军
杨哲
高翔
张健钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Application filed by China United Network Communications Group Co Ltd
Priority to CN201910383600.3A
Publication of CN110096478A
Application granted
Publication of CN110096478B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/116 Details of conversion of file system types or formats
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention provide a document index generation method and device. The method comprises: receiving files sent by various types of clients and metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk; querying the database and, according to the metadata information, acquiring from the local disk a target file awaiting automatic processing; parsing the target file to obtain document information; and performing word segmentation on the document information to obtain search terms of the document information, and generating index information of the document information according to a preset retrieval framework. Document indexes can thus be generated automatically and efficiently, a large amount of document information can be processed, and labor cost can be saved.

Description

Document index generation method and device
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a document index generation method and device.
Background
With the rapid development of the internet, the document information on the internet is rapidly increasing. In order to find all required document information in the massive information, a user generally finds relevant document information according to the index information.
At present, in order to establish a document index, a common method is to manually sort documents from different sources to obtain index information of different documents, and then upload the index information to a database server for retrieval by a user.
However, the inventor finds that in the existing process of manually sorting and establishing index information for documents from different sources, when the number of documents is large, the operation is complicated, a large amount of manpower is required, and the cost is high.
Disclosure of Invention
The embodiment of the invention provides a document index generation method and device, and aims to solve the problems that in the prior art, documents from different sources are manually sorted and index information is established, when the number of the documents is large, the operation is complex, a large amount of manpower is consumed, and the cost is high.
In a first aspect, an embodiment of the present invention provides a method for generating a document index, including:
receiving files sent by various types of clients and metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk;
querying the database and, according to the metadata information, acquiring from the local disk a target file awaiting automatic processing;
parsing the target file to obtain document information;
and performing word segmentation on the document information to obtain search terms of the document information, and generating index information of the document information according to a preset retrieval framework.
In a possible design, before the receiving the file sent by each type of client and the metadata information corresponding to the file, the method further includes:
sending file extraction information to various types of clients, wherein the file extraction information comprises file formats and scanning path information which are allowed to be extracted and correspond to the various types of clients;
the file format and the scanning path information which are allowed to be extracted are used for indicating the clients of each type to extract the file and the metadata information corresponding to the file according to the corresponding file format and the scanning path information which are allowed to be extracted.
In one possible design, the parsing the target file to obtain document information includes:
acquiring the file format of the target file, and judging whether the target file exists or not;
if the target file exists and the file format of the target file is a compressed file format, decompressing the target file to obtain a decompressed file, and analyzing the decompressed file to generate document information of the target file;
and if the target file exists and the file format of the target file is a non-compressed file format, analyzing the target file to generate the document information of the target file.
In one possible design, determining whether the target file exists includes:
querying for the file by the file path in the metadata information; if the file is found, determining that the target file exists; and if the file cannot be found, determining that the target file does not exist.
In one possible design, the parsing the target file to generate the document information of the target file includes:
parsing the target file with the open-source tool Apache Tika to obtain the document information of the target file.
In one possible design, the preset retrieval framework is the Apache Lucene framework.
In a second aspect, an embodiment of the present invention provides a document index generating device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer program:
receiving files sent by various types of clients and metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk;
querying the database and, according to the metadata information, acquiring from the local disk a target file awaiting automatic processing;
parsing the target file to obtain document information;
and performing word segmentation on the document information to obtain search terms of the document information, and generating index information of the document information according to a preset retrieval framework.
The processor, when executing the computer program, further implements the steps of:
before receiving the files sent by the various types of clients and the metadata information corresponding to the files, sending file extraction information to the various types of clients, wherein the file extraction information comprises file formats and scanning path information which are allowed to be extracted and correspond to the various types of clients;
the file format and the scanning path information which are allowed to be extracted are used for indicating the clients of each type to extract the file and the metadata information corresponding to the file according to the corresponding file format and the scanning path information which are allowed to be extracted.
In one possible design, the parsing the target file to obtain document information includes:
acquiring the file format of the target file, and judging whether the target file exists or not;
if the target file exists and the file format of the target file is a compressed file format, decompressing the target file to obtain a decompressed file, and analyzing the decompressed file to generate document information of the target file;
and if the target file exists and the file format of the target file is a non-compressed file format, analyzing the target file to generate the document information of the target file.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the document index generation method according to the first aspect or any possible design of the first aspect.
According to the document index generation method and device provided by the embodiments of the invention, files sent by various types of clients and metadata information corresponding to the files are received, a target file is obtained according to the metadata information, and the target file is parsed to obtain document information; word segmentation is performed on the document information to obtain search terms of the document information, and index information of the document information is generated according to a preset retrieval framework. Document indexes can thus be generated automatically and efficiently, a large amount of document information can be processed, and labor cost can be saved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic diagram of a system architecture for generating a document index according to an embodiment of the present invention;
FIG. 2 is a first flowchart of a document index generation method according to an embodiment of the present invention;
FIG. 3 is a second flowchart of a document index generation method according to an embodiment of the present invention;
FIG. 4 is a first schematic structural diagram of a document index generating device according to an embodiment of the present invention;
FIG. 5 is a second schematic structural diagram of a document index generating device according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a hardware structure of a document index generating device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture for generating a document index according to an embodiment of the present invention. As shown in fig. 1, the system provided by the present embodiment includes various types of clients 101 and a server 102. The client 101 may be a mobile phone, a tablet, a personal computer, or the like. The present embodiment does not particularly limit the implementation of the client 101 and the server 102, as long as the client 101 can interact with the server 102. The server 102 is used for managing document retrieval services, and may be a single server or a cluster of multiple servers.
Referring to fig. 2, fig. 2 is a flowchart illustrating a document index generating method according to an embodiment of the present invention, where an executing subject of the embodiment may be a server according to the embodiment shown in fig. 1, and the embodiment is not limited herein. As shown in fig. 2, the method includes:
S201: Receive files sent by various types of clients and metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk.
In this embodiment, the clients of various types may be a mail client, a Text Services Framework (TSF) client, a chat application client, and the like.
Specifically, file extraction information may be sent to each type of client, where the file extraction information includes file formats and scanning path information that are allowed to be extracted and correspond to the each type of client;
the file format and the scanning path information which are allowed to be extracted are used for indicating the clients of each type to extract the file and the metadata information corresponding to the file according to the corresponding file format and the scanning path information which are allowed to be extracted.
For example, the mail client scans the file in the mail format according to the corresponding scanning path information, and the TSF client scans the file in the TSF format according to the corresponding scanning path information.
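As an illustration only, the client-side extraction described above may be sketched as follows. The `ExtractionInfo` structure, the function names, and the `.eml` extension used for the mail format are assumptions introduced for this sketch; the embodiment does not specify them.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class ExtractionInfo:
    """File extraction information sent by the server to one client type (hypothetical shape)."""
    client_type: str
    allowed_formats: tuple  # permitted extensions, lower case, including the dot
    scan_path: str          # directory the client is told to scan


def scan_for_files(info: ExtractionInfo) -> list:
    """Walk the scan path and return the paths whose extension is permitted."""
    matches = []
    for root, _dirs, names in os.walk(info.scan_path):
        for name in names:
            ext = os.path.splitext(name)[1].lower()
            if ext in info.allowed_formats:
                matches.append(os.path.join(root, name))
    return sorted(matches)


def demo_scan() -> list:
    """Create three files and scan them as a mail client that may extract only .eml files."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.eml", "b.txt", "c.EML"):
            with open(os.path.join(tmp, name), "w") as fh:
                fh.write("x")
        info = ExtractionInfo("mail", (".eml",), tmp)
        return [os.path.basename(p) for p in scan_for_files(info)]
```

Matching on the lower-cased extension is one possible way to decide whether a file is in an allowed format; a client could equally inspect file contents.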
The metadata information corresponding to the file may include detailed information as shown in table 1.
TABLE 1 Details of the metadata information corresponding to a file
Field          Field interpretation
File_id        Auto-increment field, no specific meaning
File_tile      Name of the file
File_path      Storage location of the file after extraction
Client_type    Specific type of the custom client
In_time        Extraction time of the file
author         Creator of the extracted document
Status         Processing state of the extracted file
Parent_md5     MD5 code of the associated parent file
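For illustration, the metadata fields of Table 1 could be held in a relational table such as the one sketched below. The table name `file_metadata`, the column types, and the use of SQLite are assumptions made for this sketch; the embodiment only states that the metadata information is stored in a database.

```python
import sqlite3

# Column names follow Table 1; the types are assumed.
SCHEMA = """
CREATE TABLE file_metadata (
    File_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    File_tile   TEXT,
    File_path   TEXT,
    Client_type TEXT,
    In_time     TEXT,
    author      TEXT,
    Status      TEXT,
    Parent_md5  TEXT
)
"""


def open_metadata_db() -> sqlite3.Connection:
    """Open an in-memory database containing an empty metadata table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def record_file(conn, title, path, client_type, in_time, author, status,
                parent_md5=None) -> int:
    """Insert one metadata row and return the generated File_id."""
    cur = conn.execute(
        "INSERT INTO file_metadata "
        "(File_tile, File_path, Client_type, In_time, author, Status, Parent_md5) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (title, path, client_type, in_time, author, status, parent_md5),
    )
    conn.commit()
    return cur.lastrowid
```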
S202: Query the database and, according to the metadata information, acquire from the local disk a target file awaiting automatic processing.
In this embodiment, the metadata information includes the processing state of the extracted file (see Table 1). The processing state may be marked according to the file format when the file is extracted. For example, a file that cannot be parsed is marked "waiting for manual processing"; a file that can be parsed is marked "waiting for automatic processing"; a file whose manual parsing is complete is marked "manual processing completed"; a file whose automatic parsing is complete is marked "automatic processing completed"; a file that cannot be found when scanned is marked "file does not exist error"; a file that already exists when scanned is marked "file already exists error"; a file whose parent-level associated file does not exist is marked "parent-level associated file does not exist error"; a file whose information cannot be written after parsing is marked "information write failure after file parsing"; and a file found to have no content is marked "file content does not exist". The correspondence between the state codes of the extracted files and the processing states is shown in Table 2.
TABLE 2 Correspondence between the state codes and the processing states of extracted files
State code                State interpretation
MANUAL_ANALYSE            Waiting for manual processing
AUTO_ANALYSE              Waiting for automatic processing
MANUAL_ANALYSED           Manual processing completed
AUTO_ANALYSED             Automatic processing completed
ERROR_NO_PATH             File does not exist (error)
ERROR_MD5_EXIST           File already exists (error)
ERROR_PARENT_NOT_EXIST    Parent-level associated file does not exist (error)
ERROR_DB_INSERT_ERROR     Information write failure after file parsing
WARN_NO_CONTENT           File content does not exist
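As a minimal sketch of step S202, the state codes of Table 2 can be modelled as an enumeration and used to select the files awaiting automatic processing. The table name `file_metadata`, the reduced column set, and the use of SQLite are assumptions made for this sketch.

```python
import sqlite3
from enum import Enum


class Status(Enum):
    """State codes from Table 2; each value is the code as stored in the database."""
    MANUAL_ANALYSE = "MANUAL_ANALYSE"
    AUTO_ANALYSE = "AUTO_ANALYSE"
    MANUAL_ANALYSED = "MANUAL_ANALYSED"
    AUTO_ANALYSED = "AUTO_ANALYSED"
    ERROR_NO_PATH = "ERROR_NO_PATH"
    ERROR_MD5_EXIST = "ERROR_MD5_EXIST"
    ERROR_PARENT_NOT_EXIST = "ERROR_PARENT_NOT_EXIST"
    ERROR_DB_INSERT_ERROR = "ERROR_DB_INSERT_ERROR"
    WARN_NO_CONTENT = "WARN_NO_CONTENT"


def pending_auto_paths(conn: sqlite3.Connection) -> list:
    """Return the storage paths of all files marked as awaiting automatic processing."""
    rows = conn.execute(
        "SELECT File_path FROM file_metadata WHERE Status = ? ORDER BY File_id",
        (Status.AUTO_ANALYSE.value,),
    )
    return [row[0] for row in rows]


def demo_pending() -> list:
    """Build a small in-memory metadata table and query it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_metadata "
        "(File_id INTEGER PRIMARY KEY, File_path TEXT, Status TEXT)"
    )
    conn.executemany(
        "INSERT INTO file_metadata (File_path, Status) VALUES (?, ?)",
        [
            ("/data/a.docx", Status.AUTO_ANALYSE.value),
            ("/data/b.bin", Status.MANUAL_ANALYSE.value),
            ("/data/c.pdf", Status.AUTO_ANALYSE.value),
        ],
    )
    return pending_auto_paths(conn)
```

The returned paths correspond to the File_path field of Table 1, from which the target files can then be read from the local disk.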
S203: Parse the target file to obtain document information.
In this embodiment, if the target file is a compressed file, the compressed file is first decompressed to obtain a decompressed file, and the decompressed file is parsed to obtain the document information; if the target file is not a compressed file, the target file is parsed directly to generate the document information of the target file.
The compressed file may be a rar compressed format file or a zip compressed format file.
S204: Perform word segmentation on the document information to obtain search terms of the document information, and generate index information of the document information according to a preset retrieval framework.
In this embodiment, word segmentation may be performed on the document information by the Chinese word segmenter SmartChineseAnalyzer to obtain the search terms of the document information.
The preset retrieval framework may be the Apache Lucene framework.
Specifically, the index information of the document information may be generated from the search terms of the document information by using the Apache Lucene framework. The retrieval range of the Apache Lucene framework may include the title of a document, the path of the document, the content of the document, the metadata information of the document, and the like.
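The relationship between word segmentation and index generation in step S204 can be illustrated with a simplified stand-in. The sketch below is not Apache Lucene and does not use a real Chinese word segmenter: the `tokenize` function splits Latin text into words and CJK text into single characters, and `build_index` builds a small in-memory inverted index over the title, path, and content fields. All names are assumptions made for this sketch.

```python
import re
from collections import defaultdict

# Fields covered by the retrieval range in this sketch.
SEARCHED_FIELDS = ("title", "path", "content")


def tokenize(text: str) -> list:
    """Crude segmentation: Latin/digit runs as words, each CJK character as one term."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())


def build_index(documents: dict) -> dict:
    """Map each search term to the set of (doc_id, field) pairs in which it occurs."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for field in SEARCHED_FIELDS:
            for term in tokenize(fields.get(field, "")):
                index[term].add((doc_id, field))
    return index


def search(index: dict, query: str) -> list:
    """Return the ids of documents that contain every query term in any field."""
    result = None
    for term in tokenize(query):
        ids = {doc_id for doc_id, _field in index.get(term, set())}
        result = ids if result is None else result & ids
    return sorted(result or [])
```

A production system would delegate both steps to the retrieval framework, which additionally handles scoring, persistence, and highlighting.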
In this embodiment, the search may be performed through a user interface, including category search, highlighting, document detail viewing, document preview, document association information, and so on.
As can be seen from the above description, the files sent by the various types of clients and the metadata information corresponding to the files are received, the target file is obtained according to the metadata information, and the target file is parsed to obtain the document information; word segmentation is performed on the document information to obtain search terms of the document information, and index information of the document information is generated according to a preset retrieval framework. Document indexes can thus be generated automatically and efficiently, a large amount of document information can be processed, and labor cost can be saved.
Referring to fig. 3, fig. 3 is a second schematic flow chart of a document index generating method according to an embodiment of the present invention, and on the basis of the embodiment corresponding to fig. 2, this embodiment describes in detail a specific process of analyzing the target file to obtain document information in step S203, which is detailed as follows:
S301: Acquire the file format of the target file, and determine whether the target file exists.
In the present embodiment, the file format is classified into a compressed file format and an uncompressed file format.
Specifically, determining whether the target file exists includes: querying for the file by the file path in the metadata information; if the file is found, determining that the target file exists; and if the file cannot be found, determining that the target file does not exist.
S302: If the target file exists and the file format of the target file is a compressed file format, decompress the target file to obtain a decompressed file, and parse the decompressed file to generate the document information of the target file.
S303: If the target file exists and the file format of the target file is a non-compressed file format, parse the target file to generate the document information of the target file.
In this embodiment, the target file may be parsed by the open-source tool Apache Tika to obtain the document information of the target file. Similarly, the decompressed file may be parsed by Apache Tika to generate the document information of the target file.
When the target file is parsed by Apache Tika, the parsed document information may include the contents shown in Table 3.
TABLE 3 content corresponding to parsed document information
[Table 3 is provided as an image in the original publication and is not reproduced here.]
From the above description, it can be known that the target file in the compressed format is decompressed without manual decompression, thereby improving the parsing efficiency of the document.
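Steps S301 to S303 can be sketched as follows, under stated assumptions. The function `parse_plain` is a hypothetical stand-in for the parser (the embodiment uses Apache Tika) and simply reads text; only the zip format is handled, because rar archives require a third-party library.

```python
import os
import tempfile
import zipfile

COMPRESSED_FORMATS = (".zip",)  # rar is omitted in this sketch


def parse_plain(path: str) -> dict:
    """Hypothetical stand-in for the document parser: read the file as text."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return {"name": os.path.basename(path), "content": fh.read()}


def parse_target(file_path: str):
    """Return a list of document-information dicts, or None if the file is missing."""
    # S301: determine whether the target file exists at the recorded path.
    if not os.path.isfile(file_path):
        return None  # the caller may record ERROR_NO_PATH
    ext = os.path.splitext(file_path)[1].lower()
    # S302: compressed format, so decompress first and parse each extracted file.
    if ext in COMPRESSED_FORMATS:
        docs = []
        with tempfile.TemporaryDirectory() as tmp, zipfile.ZipFile(file_path) as zf:
            zf.extractall(tmp)
            for root, _dirs, names in os.walk(tmp):
                for name in sorted(names):
                    docs.append(parse_plain(os.path.join(root, name)))
        return docs
    # S303: non-compressed format, so parse the target file directly.
    return [parse_plain(file_path)]


def demo_parse():
    """Parse a plain file, a zip archive with two members, and a missing file."""
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "note.txt")
        with open(plain, "w", encoding="utf-8") as fh:
            fh.write("hello")
        archive = os.path.join(tmp, "bundle.zip")
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr("a.txt", "alpha")
            zf.writestr("b.txt", "beta")
        missing = os.path.join(tmp, "gone.txt")
        return parse_target(plain), parse_target(archive), parse_target(missing)
```

Returning `None` for a missing file keeps the existence check separate from parsing errors, so the caller can update the processing state accordingly.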
Fig. 4 is a first schematic structural diagram of a document index generating device according to an embodiment of the present invention. As shown in fig. 4, the document index generating apparatus 40 includes: a receiving module 401, an obtaining module 402, a parsing module 403 and an index generating module 404.
The receiving module 401 is configured to receive files sent by various types of clients and metadata information corresponding to the files, where the metadata information is stored in a database and the files are stored on a local disk;
the obtaining module 402 is configured to query the database and, according to the metadata information, obtain from the local disk a target file awaiting automatic processing;
the parsing module 403 is configured to parse the target file to obtain document information;
the index generating module 404 is configured to perform word segmentation on the document information to obtain search terms of the document information, and to generate index information of the document information according to a preset retrieval framework.
The device provided in this embodiment may be used to implement the technical solution of the above method embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Referring to fig. 5, fig. 5 is a second schematic structural diagram of a document index generating device according to an embodiment of the present invention. As shown in fig. 5, this embodiment further includes, on the basis of the embodiment in fig. 4: a sending module 405.
The sending module 405 is configured to send file extraction information to each type of client before receiving a file sent by each type of client and metadata information corresponding to the file, where the file extraction information includes file formats and scanning path information that are allowed to be extracted and correspond to each type of client; the file format and the scanning path information which are allowed to be extracted are used for indicating the clients of each type to extract the file and the metadata information corresponding to the file according to the corresponding file format and the scanning path information which are allowed to be extracted.
In an embodiment of the present invention, the parsing module 403 is specifically configured to obtain a file format of the target file, and determine whether the target file exists;
if the target file exists and the file format of the target file is a compressed file format, decompressing the target file to obtain a decompressed file, and analyzing the decompressed file to generate document information of the target file;
and if the target file exists and the file format of the target file is a non-compressed file format, analyzing the target file to generate the document information of the target file.
In an embodiment of the present invention, the parsing module 403 is further configured to determine whether the target file exists, including:
querying for the file by the file path in the metadata information; if the file is found, determining that the target file exists; and if the file cannot be found, determining that the target file does not exist.
In an embodiment of the present invention, the index generating module 404 is specifically configured to parse the target file with the open-source tool Apache Tika to obtain the document information of the target file.
In an embodiment of the present invention, the preset retrieval framework is the Apache Lucene framework.
The device provided in this embodiment may be used to implement the technical solution of the above method embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Fig. 6 is a schematic diagram of a hardware structure of a document index generating device according to an embodiment of the present invention. As shown in fig. 6, the document index generating device 60 of the present embodiment includes: a processor 601 and a memory 602; wherein
A memory 602 for storing computer-executable instructions;
the processor 601 is configured to execute the computer-executable instructions stored in the memory to implement the steps performed by the server in the above embodiments. For details, reference may be made to the description of the method embodiments above.
Alternatively, the memory 602 may be separate or integrated with the processor 601.
When the memory 602 is provided separately, the document index generating apparatus further includes a bus 603 for connecting the memory 602 and the processor 601.
An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the document index generation method described above.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit. The unit formed by the modules can be realized in a hardware form, and can also be realized in a form of hardware and a software functional unit.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a usb disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above-described method embodiments may be performed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, and magnetic or optical disks.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A document index generation method, characterized by comprising the following steps:
receiving files sent by clients of various types and metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk;
querying the database to acquire, from the local disk and according to the metadata information, a target file waiting for automatic processing;
parsing the target file to obtain document information;
performing word segmentation on the document information to obtain search words of the document information, and generating index information of the document information according to a preset retrieval framework;
wherein, before the receiving of the files sent by the clients of various types and the metadata information corresponding to the files, the method further comprises:
sending file extraction information to the clients of various types, wherein the file extraction information comprises the file formats allowed to be extracted and the scanning path information corresponding to each type of client;
wherein the file formats allowed to be extracted and the scanning path information are used to instruct each type of client to extract the files and the metadata information corresponding to the files according to the corresponding allowed file formats and scanning path information.
2. The method of claim 1, wherein parsing the target file to obtain document information comprises:
acquiring the file format of the target file, and determining whether the target file exists;
if the target file exists and its file format is a compressed file format, decompressing the target file to obtain a decompressed file, and parsing the decompressed file to generate the document information of the target file;
if the target file exists and its file format is a non-compressed file format, parsing the target file to generate the document information of the target file.
3. The method of claim 2, wherein determining whether the target file exists comprises:
querying for the file by the file path in the metadata information; if the file is found, determining that the target file exists; if the file cannot be found, determining that the target file does not exist.
4. The method of claim 2, wherein parsing the target file to generate document information of the target file comprises:
parsing the target file with the open-source tool Apache Tika to obtain the document information of the target file.
5. The method of any one of claims 1 to 4, wherein the preset retrieval framework is the Apache Lucene framework.
6. A document index generation device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
receiving files sent by clients of various types and metadata information corresponding to the files, wherein the metadata information is stored in a database and the files are stored on a local disk;
querying the database to acquire, from the local disk and according to the metadata information, a target file waiting for automatic processing;
parsing the target file to obtain document information;
performing word segmentation on the document information to obtain search words of the document information, and generating index information of the document information according to a preset retrieval framework;
the processor, when executing the computer program, further implements the steps of:
before receiving the files sent by the clients of various types and the metadata information corresponding to the files, sending file extraction information to the clients of various types, wherein the file extraction information comprises the file formats allowed to be extracted and the scanning path information corresponding to each type of client;
wherein the file formats allowed to be extracted and the scanning path information are used to instruct each type of client to extract the files and the metadata information corresponding to the files according to the corresponding allowed file formats and scanning path information.
7. The device of claim 6, wherein parsing the target file to obtain document information comprises:
acquiring the file format of the target file, and determining whether the target file exists;
if the target file exists and its file format is a compressed file format, decompressing the target file to obtain a decompressed file, and parsing the decompressed file to generate the document information of the target file;
if the target file exists and its file format is a non-compressed file format, parsing the target file to generate the document information of the target file.
8. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement the document index generation method of any one of claims 1 to 5.
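
By way of non-limiting illustration only, the server-side flow recited in claims 1 to 3 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the table `file_meta(path, format, processed)` and the function names are hypothetical, plain-text reading stands in for the parsing tool, a regular expression stands in for word segmentation (it does not segment Chinese text properly), and an in-memory inverted index stands in for the preset retrieval framework.

```python
import re
import sqlite3
import zipfile
from collections import defaultdict
from pathlib import Path


def pending_files(db):
    """Query the metadata database for files still waiting to be processed.

    The table file_meta(path, format, processed) is a hypothetical schema.
    """
    cursor = db.execute("SELECT path, format FROM file_meta WHERE processed = 0")
    return cursor.fetchall()


def parse_target(path_str, fmt):
    """Return the document text, or None if the target file does not exist."""
    path = Path(path_str)
    if not path.is_file():  # existence check via the path held in the metadata
        return None
    if fmt == "zip":  # compressed format: decompress, then parse each member
        with zipfile.ZipFile(path) as archive:
            members = [n for n in archive.namelist() if not n.endswith("/")]
            return "\n".join(
                archive.read(n).decode("utf-8", errors="replace") for n in members
            )
    # non-compressed format: parse the file directly (plain text assumed here)
    return path.read_text(encoding="utf-8", errors="replace")


def segment(text):
    """Naive stand-in for word segmentation: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def build_index(db):
    """Build an inverted index mapping each search word to the paths containing it."""
    index = defaultdict(set)
    for path_str, fmt in pending_files(db):
        text = parse_target(path_str, fmt)
        if text is None:
            continue  # file missing on disk: leave it unprocessed
        for word in segment(text):
            index[word].add(path_str)
        db.execute("UPDATE file_meta SET processed = 1 WHERE path = ?", (path_str,))
    db.commit()
    return index
```

In this sketch, `build_index(db)` expects an open `sqlite3` connection whose `file_meta` table has already been populated from the metadata sent by the clients; looking up `index[word]` then yields the set of file paths whose text contains that search word. A deployment following claims 4 and 5 would replace `parse_target` and the dictionary with Apache Tika and Apache Lucene respectively.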

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910383600.3A CN110096478B (en) 2019-05-09 2019-05-09 Document index generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910383600.3A CN110096478B (en) 2019-05-09 2019-05-09 Document index generation method and device

Publications (2)

Publication Number Publication Date
CN110096478A CN110096478A (en) 2019-08-06
CN110096478B true CN110096478B (en) 2021-06-29

Family

ID=67447334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910383600.3A Active CN110096478B (en) 2019-05-09 2019-05-09 Document index generation method and device

Country Status (1)

Country Link
CN (1) CN110096478B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035409B (en) * 2020-11-03 2021-07-27 杭州蚁首网络科技有限公司 Entity file management method, system and computer storage medium
CN113312441A (en) * 2021-06-10 2021-08-27 中寰卫星导航通信有限公司 Map operation method and device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376067A (en) * 2014-11-13 2015-02-25 北京海泰方圆科技有限公司 Index file inputting method and retrieval method based on index file
CN104715068A (en) * 2015-03-31 2015-06-17 北京奇虎科技有限公司 Method and device for generating document indexes and searching method and device
CN105205104A (en) * 2015-08-26 2015-12-30 成都布林特信息技术有限公司 Cloud platform data acquisition method
CN106599041A (en) * 2016-11-07 2017-04-26 中国电子科技集团公司第三十二研究所 Text processing and retrieval system based on big data platform
CN106649426A (en) * 2016-08-05 2017-05-10 浪潮软件股份有限公司 Data analysis method, data analysis platform and server
CN106776746A (en) * 2016-11-14 2017-05-31 天津南大通用数据技术股份有限公司 A kind of creation method and device of full-text index data
CN107016047A (en) * 2017-02-20 2017-08-04 阿里巴巴集团控股有限公司 Document query, document storing method and device
CN107038225A (en) * 2017-03-31 2017-08-11 江苏飞搏软件股份有限公司 The search method of information intelligent retrieval system
CN108228743A (en) * 2017-12-18 2018-06-29 深圳供电局有限公司 A kind of real-time big data search engine system
CN108241713A (en) * 2016-12-27 2018-07-03 南京烽火软件科技有限公司 A kind of inverted index search method based on polynary cutting
CN109254967A (en) * 2018-08-29 2019-01-22 河南智慧云大数据有限公司 A kind of depth analysis method and device based on multi-source heterogeneous mass data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8983920B2 (en) * 2011-08-30 2015-03-17 Open Text S.A. System and method of quality assessment of a search index
US8909615B2 (en) * 2011-08-30 2014-12-09 Open Text S.A. System and method of managing capacity of search index partitions
CN103853832B (en) * 2014-03-11 2017-07-28 上海爱数信息技术股份有限公司 Customizable data grasping means in a kind of text retrieval system
CN105808615A (en) * 2014-12-31 2016-07-27 北京奇虎科技有限公司 Document index generation method and device based on word segment weights
CN108874956A (en) * 2018-06-05 2018-11-23 中国平安人寿保险股份有限公司 Mass file search method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110096478A (en) 2019-08-06

Similar Documents

Publication Publication Date Title
CN110096478B (en) Document index generation method and device
CN112199344B (en) Log classification method and device
CN110532449B (en) Method, device, equipment and storage medium for processing service document
CN108038441B (en) System and method based on image recognition
CN110888791A (en) Log processing method, device, equipment and storage medium
CN115098440A (en) Electronic archive query method, device, storage medium and equipment
CN110874526B (en) File similarity detection method and device, electronic equipment and storage medium
CN112418813A (en) AEO qualification intelligent rating management system and method based on intelligent analysis and identification and storage medium
CN110472121B (en) Business card information searching method and device, electronic equipment and computer readable storage medium
CN109815243B (en) Structured storage method and device during document interface modification
CN111124470A (en) Automatic optimization method and device for program package based on cloud platform
CN113138974B (en) Method and device for detecting database compliance
CN111047657A (en) Picture compression method, device, medium and electronic equipment
CN116204428A (en) Test case generation method and device
CN114281761A (en) Data file loading method and device, computer equipment and storage medium
CN115658127A (en) Data processing method and device, electronic equipment and storage medium
CN114090673A (en) Data processing method, equipment and storage medium for multiple data sources
CN109491699B (en) Resource checking method, device, equipment and storage medium of application program
CN113111200A (en) Method and device for auditing picture file, electronic equipment and storage medium
US20200186675A1 (en) System and method for determining compression rates for images comprising text
CN110674395B (en) Information pushing method, device and equipment
CN113033832B (en) Method and device for inputting automobile repair data, terminal equipment and readable storage medium
CN112597109B (en) Data storage method, device, electronic equipment and storage medium
CN117389769B (en) Browser-end rich text copying method and system based on cloud service and cloud platform
CN117112846B (en) Multi-information source license information management method, system and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant