CN110096478B - Document index generation method and device

Info

Publication number: CN110096478B
Application number: CN201910383600.3A
Authority: CN
Inventors: 徐凯; 丛新法; 侯青军; 杨通军; 杨哲; 高翔; 张健钊
Original assignee: China United Network Communications Group Co Ltd
Current assignee: China United Network Communications Group Co Ltd
Priority date: 2019-05-09
Filing date: 2019-05-09
Publication date: 2021-06-29
Anticipated expiration: 2039-05-09
Also published as: CN110096478A

Abstract

The embodiment of the invention provides a document index generation method and a device, wherein the method comprises the following steps: receiving files sent by various types of clients and metadata information corresponding to the files, wherein the metadata information is stored in a database, and the files are stored in a local disk; inquiring the database to acquire a target file waiting for automatic processing from the local disk according to the metadata information; analyzing the target file to obtain document information; the document information is subjected to word segmentation processing to obtain the search words of the document information, and the index information of the document information is generated according to a preset search frame, so that automatic generation of document indexes can be realized, the efficiency is high, a large amount of document information can be processed, and the labor cost can be saved.

Description

Document index generation method and device

Technical Field

The embodiment of the invention relates to the technical field of computers, in particular to a document index generation method and device.

Background

With the rapid development of the internet, the document information on the internet is rapidly increasing. In order to find all required document information in the massive information, a user generally finds relevant document information according to the index information.

At present, in order to establish a document index, a common method is to manually sort documents from different sources to obtain index information of different documents, and then upload the index information to a database server for retrieval by a user.

However, the inventor finds that in the existing process of manually sorting and establishing index information for documents from different sources, when the number of documents is large, the operation is complicated, a large amount of manpower is required, and the cost is high.

Disclosure of Invention

The embodiment of the invention provides a document index generation method and device, and aims to solve the problems that in the prior art, documents from different sources are manually sorted and index information is established, when the number of the documents is large, the operation is complex, a large amount of manpower is consumed, and the cost is high.

In a first aspect, an embodiment of the present invention provides a method for generating a document index, including:

receiving files sent by various types of clients and metadata information corresponding to the files, wherein the metadata information is stored in a database, and the files are stored in a local disk;

inquiring the database to acquire a target file waiting for automatic processing from the local disk according to the metadata information;

analyzing the target file to obtain document information;

and performing word segmentation processing on the document information to obtain a search word of the document information, and generating index information of the document information according to a preset search frame.

In a possible design, before the receiving the file sent by each type of client and the metadata information corresponding to the file, the method further includes:

sending file extraction information to various types of clients, wherein the file extraction information comprises file formats and scanning path information which are allowed to be extracted and correspond to the various types of clients;

the file format and the scanning path information which are allowed to be extracted are used for indicating the clients of each type to extract the file and the metadata information corresponding to the file according to the corresponding file format and the scanning path information which are allowed to be extracted.

In one possible design, the parsing the target file to obtain document information includes:

acquiring the file format of the target file, and judging whether the target file exists or not;

if the target file exists and the file format of the target file is a compressed file format, decompressing the target file to obtain a decompressed file, and analyzing the decompressed file to generate document information of the target file;

and if the target file exists and the file format of the target file is a non-compressed file format, analyzing the target file to generate the document information of the target file.

In one possible design, determining whether the target file exists includes:

and inquiring the file through the file path in the metadata information, if the file is inquired, determining that the target file exists, and if the file cannot be inquired, determining that the target file does not exist.

In one possible design, the parsing the target file to generate the document information of the target file includes:

and analyzing the target file through the open source tool Apache Tika to acquire the document information of the target file.

In one possible design, the preset retrieval framework is the Apache Lucene framework.

In a second aspect, an embodiment of the present invention provides a document index generating device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer program:

analyzing the target file to obtain document information;

The processor, when executing the computer program, further implements the steps of:

before receiving the files sent by the various types of clients and the metadata information corresponding to the files, sending file extraction information to the various types of clients, wherein the file extraction information comprises file formats and scanning path information which are allowed to be extracted and correspond to the various types of clients;

In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the method for generating a document index according to the first aspect or any possible design of the first aspect is implemented.

According to the document index generation method and the document index generation equipment provided by the embodiment of the invention, files sent by various types of clients and metadata information corresponding to the files are received, a target file is obtained according to the metadata information, and the target file is analyzed to obtain document information; the document information is subjected to word segmentation processing to obtain the search words of the document information, and the index information of the document information is generated according to a preset search frame, so that automatic generation of document indexes can be realized, the efficiency is high, a large amount of document information can be processed, and the labor cost can be saved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic diagram of a system architecture for generating a document index according to an embodiment of the present invention;

FIG. 2 is a first flowchart illustrating a document index generating method according to an embodiment of the present invention;

FIG. 3 is a second flowchart illustrating a document index generating method according to an embodiment of the present invention;

FIG. 4 is a first schematic structural diagram of a document index generating device according to an embodiment of the present invention;

FIG. 5 is a second schematic structural diagram of a document index generating device according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a hardware structure of a document index generating device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture for generating a document index according to an embodiment of the present invention. As shown in fig. 1, the system provided by the present embodiment includes various types of clients 101 and a server 102. The client 101 may be a mobile phone, a tablet, a personal computer, or the like. The present embodiment does not particularly limit the implementation of the client 101 and the server 102, as long as the client 101 can interact with the server 102. The server 102 is used for managing document retrieval services, and the server 102 may be a single server or a cluster of multiple servers.

Referring to fig. 2, fig. 2 is a flowchart illustrating a document index generating method according to an embodiment of the present invention, where an executing subject of the embodiment may be a server according to the embodiment shown in fig. 1, and the embodiment is not limited herein. As shown in fig. 2, the method includes:

S201: receiving files sent by various types of clients and metadata information corresponding to the files, wherein the metadata information is stored in a database, and the files are stored in a local disk.

In this embodiment, the clients of various types may be a mail client, a Text Services Framework (TSF) client, a chat application client, and the like.

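Specifically, before the files are received, file extraction information may be sent to each type of client, where the file extraction information includes the file formats that are allowed to be extracted and the scanning path information corresponding to each type of client. Each type of client then extracts the files and the metadata information corresponding to the files according to its allowed file formats and scanning path information.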

For example, the mail client scans the file in the mail format according to the corresponding scanning path information, and the TSF client scans the file in the TSF format according to the corresponding scanning path information.
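The client-side scan described above can be sketched as follows. This is a minimal illustration only: the `ExtractionInfo` class, its field names, and the `scan_for_files` function are hypothetical and do not appear in the source, and the file format is assumed to be recognizable from the file extension.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ExtractionInfo:
    """File extraction information for one client type (hypothetical shape)."""
    client_type: str            # e.g. "mail"
    allowed_formats: frozenset  # lower-case extensions, e.g. {".eml"}
    scan_path: Path             # directory the client is told to scan


def scan_for_files(info: ExtractionInfo) -> list:
    """Return the files under info.scan_path whose extension is allowed."""
    return sorted(
        p for p in info.scan_path.rglob("*")
        if p.is_file() and p.suffix.lower() in info.allowed_formats
    )
```

A mail client would call `scan_for_files` with its own `ExtractionInfo` and then send each returned file, together with its metadata, to the server.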

The metadata information corresponding to the file may include detailed information as shown in table 1.

TABLE 1 Details of the metadata information corresponding to a file

Field	Field interpretation
File_id	Self-increment field, no specific meaning
File_tile	Name of the file
File_path	Storage position after file extraction
Client_type	Specific type of the custom client
In_time	Extraction time of the file
author	Creator of the extracted document
Status	Processing state of the extracted file
Parent_md5	MD5 encoding of the parent file with an association relation
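As an illustration of how the fields in Table 1 might be stored, the following sketch creates a metadata table in SQLite. The source does not name a database product, so SQLite, the table name `file_metadata`, the column types, and the helper names `open_metadata_db` and `record_file` are all assumptions; only the column names come from Table 1.

```python
import sqlite3

# Column names follow Table 1; the types are assumptions for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_metadata (
    File_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    File_tile   TEXT NOT NULL,
    File_path   TEXT NOT NULL,
    Client_type TEXT NOT NULL,
    In_time     TEXT NOT NULL,
    author      TEXT,
    Status      TEXT NOT NULL,
    Parent_md5  TEXT
)
"""


def open_metadata_db(path=":memory:"):
    """Open (or create) the metadata database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_file(conn, title, file_path, client_type, in_time, author,
                status, parent_md5=None):
    """Insert one metadata row and return its self-increment File_id."""
    cur = conn.execute(
        "INSERT INTO file_metadata "
        "(File_tile, File_path, Client_type, In_time, author, Status, Parent_md5) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (title, file_path, client_type, in_time, author, status, parent_md5),
    )
    conn.commit()
    return cur.lastrowid
```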

S202: and inquiring the database to acquire a target file waiting for automatic processing from the local disk according to the metadata information.

In this embodiment, the metadata information includes the processing state of the extracted file (see Table 1). The processing state may be marked according to the file format during file extraction. For example, a file that cannot be parsed automatically is marked as "waiting for manual processing"; a file that can be parsed is marked as "waiting for automatic processing"; a file whose manual parsing has finished is marked as "manual processing completed"; a file whose automatic parsing has finished is marked as "automatic processing completed"; a file that cannot be found when scanned is marked as "file does not exist error"; a file that already exists when scanned is marked as "file already exists error"; a file whose parent-level associated file does not exist is marked as "parent-level associated file does not exist error"; a file whose information cannot be written after parsing is marked as "information write failure after file parsing"; and a file found to have no content after scanning is marked as "file content does not exist". The correspondence between the state codes of the extracted files and the processing states can be found in Table 2.

TABLE 2 Correspondence between state codes and processing states of extracted files

State code	State interpretation
MANUAL_ANALYSE	Waiting for manual processing
AUTO_ANALYSE	Waiting for automatic processing
MANUAL_ANALYSED	Manual processing completed
AUTO_ANALYSED	Automatic processing completed
ERROR_NO_PATH	File does not exist error
ERROR_MD5_EXIST	File already exists error
ERROR_PARENT_NOT_EXIST	Parent-level associated file does not exist error
ERROR_DB_INSERT_ERROR	Information write failure after file parsing
WARN_NO_CONTENT	File content does not exist
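The query in step S202 can be sketched as follows, using the state codes from Table 2. The sketch assumes a SQLite table named `file_metadata` with at least the `File_id`, `File_path`, and `Status` columns from Table 1; the table name and the functions `pending_auto_files` and `mark_status` are hypothetical.

```python
import sqlite3
from enum import Enum


class FileStatus(str, Enum):
    """State codes listed in Table 2."""
    MANUAL_ANALYSE = "MANUAL_ANALYSE"
    AUTO_ANALYSE = "AUTO_ANALYSE"
    MANUAL_ANALYSED = "MANUAL_ANALYSED"
    AUTO_ANALYSED = "AUTO_ANALYSED"
    ERROR_NO_PATH = "ERROR_NO_PATH"
    ERROR_MD5_EXIST = "ERROR_MD5_EXIST"
    ERROR_PARENT_NOT_EXIST = "ERROR_PARENT_NOT_EXIST"
    ERROR_DB_INSERT_ERROR = "ERROR_DB_INSERT_ERROR"
    WARN_NO_CONTENT = "WARN_NO_CONTENT"


def pending_auto_files(conn: sqlite3.Connection) -> list:
    """Return (File_id, File_path) for every file waiting for automatic processing."""
    rows = conn.execute(
        "SELECT File_id, File_path FROM file_metadata "
        "WHERE Status = ? ORDER BY File_id",
        (FileStatus.AUTO_ANALYSE.value,),
    )
    return rows.fetchall()


def mark_status(conn: sqlite3.Connection, file_id: int, status: FileStatus) -> None:
    """Update the processing state of one file."""
    conn.execute(
        "UPDATE file_metadata SET Status = ? WHERE File_id = ?",
        (status.value, file_id),
    )
    conn.commit()
```

Each returned `File_path` identifies a target file on the local disk; after parsing, the server would call `mark_status` with `AUTO_ANALYSED` or one of the error codes.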

S203: and analyzing the target file to obtain document information.

In this embodiment, if the target file is a compressed file, the compressed file is first decompressed to obtain a decompressed file, and the decompressed file is parsed to obtain the document information; if the target file is not a compressed file, the target file is parsed directly to generate the document information of the target file.

The compressed file may be a rar compressed format file or a zip compressed format file.

S204: and performing word segmentation processing on the document information to obtain a search word of the document information, and generating index information of the document information according to a preset search frame.

In this embodiment, the document information may be subjected to word segmentation processing by the Chinese word segmenter SmartChineseAnalyzer to obtain the search words of the document information.

The preset retrieval framework may be the Apache Lucene framework.

Specifically, the index information of the document information may be generated by using the search words of the document information and the Apache Lucene framework. The retrieval range of the Apache Lucene framework may include the title of a document, the path of the document, the contents of the document, the metadata information of the document, and the like.
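To make the idea of step S204 concrete, the following sketch builds a small in-memory inverted index over named fields. It is not Apache Lucene and does not use the SmartChineseAnalyzer: the `segment` function is a deliberately naive stand-in that splits Latin words and treats each Chinese character as one search word, and the `InvertedIndex` class and its method names are hypothetical.

```python
import re
from collections import defaultdict

# Naive stand-in for a real word segmenter: Latin/digit runs become one
# term each, and every CJK character becomes its own term.
_TOKEN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def segment(text: str) -> list:
    """Split text into lower-cased search words."""
    return [t.lower() for t in _TOKEN.findall(text)]


class InvertedIndex:
    """Maps (field, search word) to the set of document ids containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id: int, fields: dict) -> None:
        """Index one document; fields maps names such as 'title' to text."""
        for field, text in fields.items():
            for term in segment(text):
                self._postings[(field, term)].add(doc_id)

    def search(self, field: str, query: str) -> set:
        """Return ids of documents whose field contains every query word."""
        terms = segment(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self._postings.get((field, term), set())
            result = set(ids) if result is None else result & ids
        return result
```

Indexing the title, path, content, and metadata as separate fields mirrors the retrieval range described above; a production system would rely on the retrieval framework's own analyzers and index files instead of this dictionary.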

In this embodiment, the search may be performed through a user interface, including category search, highlighting, document detail viewing, document preview, document association information, and so on.

According to the description, the files sent by the clients of various types and the metadata information corresponding to the files are received, the target files are obtained according to the metadata information, and the target files are analyzed to obtain the document information; the document information is subjected to word segmentation processing to obtain the search words of the document information, and the index information of the document information is generated according to a preset search frame, so that automatic generation of document indexes can be realized, the efficiency is high, a large amount of document information can be processed, and the labor cost can be saved.

Referring to fig. 3, fig. 3 is a second schematic flow chart of a document index generating method according to an embodiment of the present invention, and on the basis of the embodiment corresponding to fig. 2, this embodiment describes in detail a specific process of analyzing the target file to obtain document information in step S203, which is detailed as follows:

S301: and acquiring the file format of the target file, and judging whether the target file exists.

In the present embodiment, the file format is classified into a compressed file format and an uncompressed file format.

Specifically, the determining whether the target file exists includes: and inquiring the file through the file path in the metadata information, if the file is inquired, determining that the target file exists, and if the file cannot be inquired, determining that the target file does not exist.

S302: and if the target file exists and the file format of the target file is a compressed file format, decompressing the target file to obtain a decompressed file, and analyzing the decompressed file to generate the document information of the target file.

S303: and if the target file exists and the file format of the target file is a non-compressed file format, analyzing the target file to generate the document information of the target file.

In this embodiment, the target file may be parsed by the open source tool Apache Tika to obtain the document information of the target file. Similarly, the decompressed file may be parsed by Apache Tika to generate the document information of the target file.
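The flow of steps S301 to S303 can be sketched as follows. Several simplifications are assumed: only zip archives are handled (the Python standard library has no rar support), the compressed format is detected from the file content rather than from stored metadata, and `parse_plain` is a placeholder that decodes UTF-8 text in place of Apache Tika. The function names are hypothetical.

```python
import zipfile
from pathlib import Path
from typing import Optional


def parse_plain(data: bytes) -> str:
    """Placeholder parser: decode bytes as UTF-8 text (stand-in for Apache Tika)."""
    return data.decode("utf-8", errors="replace")


def parse_target(file_path: str) -> Optional[list]:
    """Return the parsed text of each document in the target file.

    Returns None when the file path from the metadata does not resolve
    to a file, in which case the caller could mark ERROR_NO_PATH.
    """
    path = Path(file_path)
    if not path.is_file():          # S301: the target file does not exist
        return None
    if zipfile.is_zipfile(path):    # S302: compressed format, decompress first
        docs = []
        with zipfile.ZipFile(path) as archive:
            for name in archive.namelist():
                if not name.endswith("/"):   # skip directory entries
                    docs.append(parse_plain(archive.read(name)))
        return docs
    return [parse_plain(path.read_bytes())]  # S303: non-compressed format
```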

The target file is parsed by the open source tool Apache Tika, and the parsed document information may include the contents shown in Table 3.

TABLE 3 content corresponding to parsed document information

From the above description, it can be seen that a target file in a compressed format is decompressed automatically, without manual decompression, which improves the parsing efficiency of the documents.

Fig. 4 is a first schematic structural diagram of a document index generating device according to an embodiment of the present invention. As shown in fig. 4, the document index generating apparatus 40 includes: a receiving module 401, an obtaining module 402, a parsing module 403 and an index generating module 404.

The receiving module 401 is configured to receive files sent by various types of clients and metadata information corresponding to the files, where the metadata information is stored in a database, and the files are stored in a local disk;

an obtaining module 402, configured to query the database, according to the metadata information, to obtain a target file to be automatically processed from the local disk;

a parsing module 403, configured to parse the target file to obtain document information;

an index generating module 404, configured to perform word segmentation processing on the document information to obtain a search word of the document information, and generate index information of the document information according to a preset search frame.

The device provided in this embodiment may be used to implement the technical solution of the above method embodiment, and the implementation principle and technical effect are similar, which are not described herein again.

Referring to fig. 5, fig. 5 is a schematic structural diagram of a document index generating device according to an embodiment of the present invention. As shown in fig. 5, this embodiment further includes, on the basis of the embodiment in fig. 4: a sending module 405.

The sending module 405 is configured to send file extraction information to each type of client before receiving a file sent by each type of client and metadata information corresponding to the file, where the file extraction information includes file formats and scanning path information that are allowed to be extracted and correspond to each type of client; the file format and the scanning path information which are allowed to be extracted are used for indicating the clients of each type to extract the file and the metadata information corresponding to the file according to the corresponding file format and the scanning path information which are allowed to be extracted.

In an embodiment of the present invention, the parsing module 403 is specifically configured to obtain a file format of the target file, and determine whether the target file exists;

In an embodiment of the present invention, the parsing module 403 is further configured to determine whether the target file exists, including:

In an embodiment of the present invention, the parsing module 403 is specifically configured to parse the target file through the open source tool Apache Tika to obtain the document information of the target file.

In an embodiment of the present invention, the preset retrieval framework is the Apache Lucene framework.

Fig. 6 is a schematic diagram of a hardware structure of a document index generating device according to an embodiment of the present invention. As shown in fig. 6, the document index generating device 60 of the present embodiment includes: a processor 601 and a memory 602; wherein

A memory 602 for storing computer-executable instructions;

the processor 601 is configured to execute the computer execution instructions stored in the memory to implement the steps performed by the server in the above embodiments. Reference may be made in particular to the description relating to the method embodiments described above.

Alternatively, the memory 602 may be separate or integrated with the processor 601.

When the memory 602 is provided separately, the document index generating apparatus further includes a bus 603 for connecting the memory 602 and the processor 601.

The embodiment of the present invention further provides a computer-readable storage medium, where a computer execution instruction is stored in the computer-readable storage medium, and when a processor executes the computer execution instruction, the document index generation method as described above is implemented.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit. The unit formed by the modules can be realized in a hardware form, and can also be realized in a form of hardware and a software functional unit.

The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.

It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.

The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a usb disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.

The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.

The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.

An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.

Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A document index generation method is characterized by comprising the following steps:

analyzing the target file to obtain document information;

performing word segmentation processing on the document information to obtain a search word of the document information, and generating index information of the document information according to a preset search frame;

before the receiving the file sent by each type of client and the metadata information corresponding to the file, the method further comprises the following steps:

2. The method of claim 1, wherein parsing the target file to obtain document information comprises:

3. The method of claim 2, wherein determining whether the target file exists comprises:

4. The method of claim 2, wherein parsing the target file to generate document information of the target file comprises:

5. The method of any one of claims 1 to 4, wherein the preset retrieval framework is an Apache Lucene framework.

6. A document index generation device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

analyzing the target file to obtain document information;

7. The apparatus of claim 6, wherein parsing the target file to obtain document information comprises:

8. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement the document index generation method of any one of claims 1 to 5.