CN116701813A - Data retrieval method, system, terminal and storage medium

Info

Publication number: CN116701813A
Application number: CN202310974098.XA
Authority: CN
Inventors: 龙泽灵; 李五妍; 姜忠群; 林勇; 安莹玉
Original assignee: Beijing Enterprises Water China Investment Co Ltd
Current assignee: Beijing Enterprises Water China Investment Co Ltd
Priority date: 2023-08-04
Filing date: 2023-08-04
Publication date: 2023-09-05

Abstract

The application relates to a data retrieval method, system, terminal and storage medium, and belongs to the field of data retrieval. The data retrieval method comprises: obtaining a website list; traversing the website list to obtain target file information; establishing a preset keyword list; and retrieving target keyword information from the target file information according to the preset keyword list, and storing the target keyword information. The application improves the efficiency of retrieving keywords in files.

Description

Data retrieval method, system, terminal and storage medium

Technical Field

The present application relates to the field of data retrieval, and in particular, to a data retrieval method, system, terminal, and storage medium.

Background

At present, to find out whether a certain type of file mentions particular keywords, people usually search for the files on database-type websites, download them, and then search the downloaded files for the keywords by hand. This approach has two drawbacks: first, because such databases hold many files, locating the relevant files is time-consuming; second, manually searching the many retrieved files for keywords is both time-consuming and laborious.

Disclosure of Invention

The application provides a data retrieval method, system, terminal and storage medium that improve the efficiency of retrieving keywords in files.

A first object of the present application is to provide a data retrieval method.

The first object of the present application is achieved by the following technical solutions:

a data retrieval method comprising: acquiring a website list; traversing the website list to obtain target file information; establishing a preset keyword list;

and retrieving target keyword information from the target file information according to a preset keyword list, and storing the target keyword information.

In a preferred example, acquiring the website list includes:

determining a file name keyword list according to preset file name information;

traversing the file name keyword list and querying a search engine with each keyword to obtain website link information;

converting the website link information to obtain original website link information;

and obtaining a website list according to the original website link information.

In a preferred example, before the website list is obtained from the original website link information, the method further includes deduplicating and filtering the original website link information.

In a preferred example, traversing the website list to obtain the target file information includes:

identifying the website list to obtain webpage content type information;

and obtaining target file information according to preset webpage processing rules and the webpage content type information.

In a preferred example, the preset web page processing rules include:

if the web page format is an HTML format and the web page has doc/docx/pdf file downloading links, downloading the doc/docx/pdf file to a first preset path;

if the web page format is an HTML format and the web page does not have doc/docx/pdf file downloading links, acquiring web page text information, and storing the web page text information to a second preset path;

if the web page format is PDF/WORD format, downloading the web page file to a third preset path;

if the web page format is neither HTML nor PDF/WORD, exporting the web address and storing the web page content of that address to a fourth preset path.

In a preferred example, the preset keyword list comprises a primary keyword list and a secondary keyword list.

In a preferred example, retrieving target keyword information from the target file information according to the preset keyword list and storing the target keyword information includes:

extracting text information and punctuation information from the target file information;

traversing the text information according to the primary keyword list to obtain the position information of the primary keywords;

determining the sentences containing the primary keywords according to the text information, the position information and the punctuation information;

judging, according to the secondary keyword list, whether the sentences containing the primary keywords also contain secondary keywords;

if yes, storing the sentences containing the primary keywords into a table file;

otherwise, not storing the sentences containing the primary keywords.

A second object of the present application is to provide a data retrieval system.

The second object of the present application is achieved by the following technical solutions:

a data retrieval system, comprising:

the acquisition module is used for acquiring the website list;

the traversing module is used for traversing the website list to obtain target file information;

the establishing module is used for establishing a preset keyword list;

and the retrieval module is used for retrieving the target keyword information from the target file information according to a preset keyword list and storing the target keyword information.

A third object of the present application is to provide a terminal.

The third object of the present application is achieved by the following technical solutions:

a terminal comprising a memory and a processor, the memory storing computer program instructions of the above data retrieval method that can be loaded and executed by the processor.

A fourth object of the present application is to provide a computer storage medium capable of storing a corresponding program.

The fourth object of the present application is achieved by the following technical solutions:

a computer readable storage medium storing a computer program that can be loaded by a processor to execute any one of the data retrieval methods described above.

In summary, the present application includes at least one of the following beneficial technical effects:

The website list is traversed to obtain target file information, and target keywords are then retrieved from the target file information according to the preset keyword list. Traversing the website list ensures that enough target files are acquired, so that target files are not missed. The target files are then screened a first time using the primary keywords and a second time using the secondary keywords, which ensures that the extracted keywords meet the requirements and reduces the possibility of keyword retrieval errors. In this way, data are retrieved quickly, conveniently and efficiently, the possibility of missed and false detections is reduced, and the efficiency of retrieving keywords in files is improved.

Drawings

Fig. 1 is a flow chart of a data retrieval method in an embodiment of the application.

Fig. 2 is a schematic diagram of a data retrieval system according to an embodiment of the present application.

Description of reference numerals: 1. acquisition module; 2. traversing module; 3. establishing module; 4. retrieval module.

Detailed Description

This embodiment merely explains the present application and does not limit it. After reading this specification, those skilled in the art may modify the embodiment as needed without making any inventive contribution, and such modifications are protected by patent law as long as they fall within the scope of the claims of the present application.

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

Embodiments of the application are described in further detail below with reference to the drawings.

The application provides a data retrieval method, and the main flow of the method is described as follows.

As shown in fig. 1:

Step S101: acquiring a website list.

Specifically, a file name keyword list is determined according to preset file name information; traversing the file name keyword list to obtain file name keyword information; acquiring website link information according to the file name keyword information and a search engine; converting the website link information to obtain original website link information; and obtaining a website list according to the original website link information.

In the embodiment of the application, determining the file name keyword list according to preset file name information means selecting keywords from the preset file names and arranging them into a keyword list. Related website links are found by querying a search engine with the file name keyword information; the website links are converted to obtain the original website link information; and finally the original website link information is organized into a website list.

It should be noted that, in the above process, the original website link information must be deduplicated and screened after it is obtained. For example, if the file source is restricted to government websites, only gov website links are retained, to ensure that the information source is valid. When screening website links, it is first determined whether the link source range is limited; if so, the links are screened according to the website keyword information of the given source range, and if not, no screening is performed.
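
The deduplication and screening step can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `dedupe_and_filter`, the parameter `required_host_part` and the sample links are hypothetical, and treating "no source restriction" as "deduplicate only" is an assumption.

```python
from urllib.parse import urlsplit


def dedupe_and_filter(links, required_host_part=None):
    """Drop duplicate links (keeping first-seen order) and, if a source
    restriction is given, keep only links whose host name contains it."""
    seen = set()
    result = []
    for link in links:
        normalized = link.strip()
        if not normalized or normalized in seen:
            continue  # skip empty entries and links already seen
        seen.add(normalized)
        host = urlsplit(normalized).hostname or ""
        if required_host_part is not None and required_host_part not in host:
            continue  # link is outside the configured source range
        result.append(normalized)
    return result


# Hypothetical sample data: one duplicate and one non-government link.
links = [
    "https://www.example.gov.cn/policy/1.html",
    "https://www.example.com/news/2.html",
    "https://www.example.gov.cn/policy/1.html",
]
print(dedupe_and_filter(links, required_host_part=".gov."))
```

Matching on a substring of the host name is a deliberately simple rule; a stricter check on the domain suffix could be substituted.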

In the embodiment of the application, associated website links are found using the search engine and the file name keyword list as follows: file name keyword information is determined from the file name keyword list; each file name keyword is then spliced onto the URL of the search engine, and the search page content is obtained using a Request library; all website links in the page content are extracted; for each link, it is determined whether the link jumps (redirects) to another address, and if so, the redirected link is also acquired. This ensures that the acquired website links are comprehensive and reduces the possibility of missing website links.
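
The URL splicing and link extraction described above can be sketched with the Python standard library. Everything named here is an assumption for illustration: `SEARCH_URL` is a hypothetical endpoint, and the query parameter name `q` depends on the search engine actually used. Fetching the page and resolving jump links require network access and are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

SEARCH_URL = "https://search.example.com/s"  # hypothetical search engine endpoint


def build_search_url(keyword, base=SEARCH_URL, param="q"):
    """Splice a file name keyword onto the search engine URL."""
    return base + "?" + urlencode({param: keyword})


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.page_url, href))


def extract_links(page_html, page_url):
    """Return all website links found in the search page content."""
    collector = LinkCollector(page_url)
    collector.feed(page_html)
    return collector.links


page = '<a href="/link?id=1">A</a><a href="https://b.example.org/x.pdf">B</a>'
print(extract_links(page, build_search_url("water policy")))
```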

Step S102: traversing the website list to obtain target file information.

Specifically, identifying the website list to obtain webpage content type information; and obtaining target file information according to preset webpage processing rules and the webpage content type information.

In the embodiment of the application, the web page processing rules are as follows: if the web page format is HTML and the web page contains download links to doc/docx/pdf files, the doc/docx/pdf files are downloaded to a first preset path; if the web page format is HTML and the web page contains no download links to doc/docx/pdf files, the web page text information is acquired and stored to a second preset path; if the web page format is PDF/WORD, the web page file is downloaded to a third preset path; if the web page format is neither HTML nor PDF/WORD, the web address is exported and the web page content of that address is stored to a fourth preset path.

The files and text in web pages are downloaded by analyzing the web page format and content, and the files and text from different types of web pages are stored in different paths, which makes them easy to access and manage and facilitates subsequent processing. It can be appreciated that, if the web page format is neither HTML nor PDF/WORD, the web page content of the exported web address may be stored to the fourth preset path by manual processing.
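
The four processing rules amount to a dispatch on the page type. The sketch below assumes the page type is identified from an HTTP Content-Type value, which the application does not specify; `classify_page`, the action names and the `path_1` to `path_4` keys are hypothetical stand-ins for the four preset paths.

```python
import re

# MIME types treated as PDF/WORD pages (identification by MIME type is an assumption).
PDF_WORD_TYPES = {
    "application/pdf",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}

# Matches href values that end in .doc, .docx or .pdf.
FILE_LINK = re.compile(r'href=["\']([^"\']+\.(?:docx?|pdf))["\']', re.IGNORECASE)


def classify_page(content_type, body_text=""):
    """Return (action, preset_path_key, file_links) for one web page."""
    ctype = content_type.split(";")[0].strip().lower()
    if ctype == "text/html":
        file_links = FILE_LINK.findall(body_text)
        if file_links:
            return ("download_linked_files", "path_1", file_links)
        return ("save_page_text", "path_2", [])
    if ctype in PDF_WORD_TYPES:
        return ("download_page_file", "path_3", [])
    # Any other format: export the address for later (possibly manual) handling.
    return ("export_url", "path_4", [])


print(classify_page("text/html; charset=utf-8", '<a href="/files/plan.pdf">plan</a>'))
```

Keeping the decision separate from the download itself means the rules can be checked without network access.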

Step S103: establishing a preset keyword list.

In the embodiment of the application, the preset keyword list comprises a primary keyword list and a secondary keyword list, so establishing the preset keyword list means establishing each of the two lists. The primary keyword list is used for a first screening that finds the sentences containing primary keywords, and the secondary keyword list is used for a second screening of those sentences. For example, to determine whether a file contains a reclaimed-water policy, the primary keyword in the target file should be "reclaimed water". When retrieving "reclaimed water", however, interfering sentences that contain the term but lack useful information must be excluded. Sentences related to reclaimed-water policy commonly contain terms such as "industry", "greening", "rate" and "percent", so these terms are placed in the secondary keyword list; the secondary screening with these keywords improves keyword screening efficiency.

Step S104: retrieving target keyword information from the target file information according to the preset keyword list, and storing the target keyword information.

Specifically, text information and punctuation information are extracted from the target file information; the text information is traversed according to the primary keyword list to obtain the position information of the primary keywords; the sentences containing the primary keywords are determined according to the text information, the position information and the punctuation information; whether the sentences containing the primary keywords also contain secondary keywords is judged according to the secondary keyword list; if yes, the sentences containing the primary keywords are stored into a table file; otherwise, they are not stored.

It should be noted that, when the primary keywords are searched, all sentences in a single target file that contain both a primary keyword and a secondary keyword can be stored separately in a storage space of a preset format, where the storage space can be regarded as a list. All the storage spaces are then combined into one table file, which contains the retrieval results of the primary keywords for all files.

In the embodiment of the application, after the preset keyword list and the target file information are obtained, each primary keyword in the preset keyword list is searched for in the corresponding target file; the sentences in which the primary keywords are located are then extracted and stored in a new file. Specifically, the target file is first converted to docx format and read; the text information and punctuation information in the file are extracted and spliced into a single character string, where text information refers to non-space text. The text is then searched, the positions of the primary keywords are detected, and these positions are stored in a keyword position list.

After the position of a primary keyword is determined, the positions of the periods to its left and right can be found from the extracted text and punctuation information, and the sentence containing the primary keyword is extracted according to these period positions. Note that only periods, not other punctuation marks, delimit the sentence: each time a punctuation mark is detected, it is checked whether the mark is a period; if so, its position is marked, and otherwise the search continues with the next punctuation mark.
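
A minimal sketch of the position search and period-based sentence extraction follows. The function names are hypothetical, and the full-width period "。" is assumed as the default sentence delimiter because the source documents are Chinese; other punctuation is ignored simply because only the period character is searched for.

```python
def find_keyword_positions(text, keyword):
    """Return the start index of every occurrence of keyword in text."""
    positions = []
    start = text.find(keyword)
    while start != -1:
        positions.append(start)
        start = text.find(keyword, start + 1)
    return positions


def sentence_at(text, position, period="。"):
    """Extract the sentence around position, bounded by the nearest
    period on the left and the nearest period on the right."""
    left = text.rfind(period, 0, position)   # -1 if no period to the left
    right = text.find(period, position)      # -1 if no period to the right
    begin = left + 1
    end = right + 1 if right != -1 else len(text)
    return text[begin:end]


text = "再生水利用率达到25%。今天天气晴朗，适合出行。推广再生水用于工业。"
for pos in find_keyword_positions(text, "再生水"):
    print(sentence_at(text, pos))
```

A keyword before the first period or after the last one is handled by falling back to the start or end of the text.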

Whether the sentence containing the primary keyword also contains a secondary keyword is judged according to the secondary keyword list; if so, the sentence is stored in a table file in a preset format. In the table file, the header of each column is the name of the retrieved file, and each row of that column holds one sentence containing a primary keyword.
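
The secondary screening and the column-per-file table layout can be sketched as below. CSV is assumed as the table format, which the application does not specify, and `filter_by_secondary`, `write_table` and the sample file names are hypothetical.

```python
import csv
import io
from itertools import zip_longest


def filter_by_secondary(sentences, secondary_keywords):
    """Keep only sentences that contain at least one secondary keyword."""
    return [s for s in sentences if any(k in s for k in secondary_keywords)]


def write_table(results, out):
    """Write results (file name -> list of sentences) with one column per
    file: the header row holds the file names, later rows hold sentences."""
    writer = csv.writer(out)
    names = list(results)
    writer.writerow(names)
    # Columns may differ in length, so shorter ones are padded with "".
    for row in zip_longest(*(results[n] for n in names), fillvalue=""):
        writer.writerow(row)


sentences = ["Reclaimed water use rate reaches 25%.", "Reclaimed water was discussed."]
kept = filter_by_secondary(sentences, ["rate", "industry"])

buffer = io.StringIO()
write_table({"a.docx": kept, "b.docx": []}, buffer)
print(buffer.getvalue())
```

Writing to any file-like object keeps the sketch testable; in practice `out` would be a file opened with `newline=""`, as the `csv` module recommends.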

It can be understood that collecting and sorting the text and punctuation information of a file allows the sentences containing the primary keywords to be extracted; the secondary screening of these sentences with the secondary keyword list then filters out sentences lacking useful information, which improves both the accuracy and the efficiency of keyword retrieval.

It will be appreciated that, when keywords are retrieved from files, some files do not contain the keyword itself but use other words with the same meaning as the keyword to be retrieved; such similar words therefore also need to be retrieved.

Specifically, similar words are determined according to the keywords; the file is traversed to obtain the number of occurrences and the positions of each similar word; the total word count of the file is obtained; the word count of the similar word itself is obtained; the total word count occupied by the similar word in the file is determined from its number of occurrences and its word count; the word-count ratio of the similar word is determined from this total and the total word count of the file; the occurrence frequency of the similar word is determined from the word-count ratio and a preset proportion range; the frequency level of the similar word is determined from its occurrence frequency and preset frequency levels, the frequency levels being low, medium and high; and the importance level of the similar word is determined from its frequency level and a preset similarity level, where the preset similarity levels and the importance levels are each low, medium and high.
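
The word-count ratio and frequency level can be illustrated with a short sketch. Several assumptions are made: characters are counted as words (reasonable for Chinese text), the threshold values `low_upper` and `medium_upper` are invented placeholders for the preset proportion range, and the intermediate "occurrence frequency" step is folded into the level lookup.

```python
def word_count_ratio(text, similar_word):
    """Share of the file's characters occupied by the similar word:
    (occurrences * length of the word) / total length of the text."""
    total = len(text)
    if total == 0 or not similar_word:
        return 0.0
    occurrences = text.count(similar_word)  # non-overlapping occurrences
    return occurrences * len(similar_word) / total


def frequency_level(ratio, low_upper=0.01, medium_upper=0.05):
    """Map a ratio to low / medium / high using assumed preset thresholds."""
    if ratio < low_upper:
        return "low"
    if ratio < medium_upper:
        return "medium"
    return "high"


sample = "ab" * 10 + "x" * 80  # 100 characters, "ab" appears 10 times
ratio = word_count_ratio(sample, "ab")
print(ratio, frequency_level(ratio))
```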

If the importance level of the similar words is high, the similar words are core words of the file; and determining the positions of periods on two sides of the similar words according to the position information of the similar words, and extracting sentences of the similar words between the two periods.

If the importance level of the similar words is medium, similarity analysis is carried out on the similar words and the keywords: it is judged whether the similar words and the keywords share any characters, and if so, the word senses of the similar words and the keywords are analyzed respectively; it is then judged whether the word senses are the same, and if so, sentences containing the similar words are randomly selected and the meaning of each sentence is determined; the similar words are replaced with the keywords, and the meaning of each replaced sentence is determined; if a sentence has the same meaning before and after the replacement, it is marked as a key sentence, and the key sentences are extracted and stored.

If the importance level of the similar word is low, the similar word is abandoned.

In the embodiment of the application, each similar word has a similarity level with respect to the keyword, and the similarity levels are low, medium and high. If the frequency level and the similarity level of a similar word are both high, its importance level is high; if both are low, its importance level is low; otherwise, its importance level is medium.
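
The combination rule above is small enough to state directly in code. The function name `importance_level` is hypothetical; the rule itself follows the text.

```python
LEVELS = ("low", "medium", "high")


def importance_level(frequency_level, similarity_level):
    """Combine frequency and similarity levels into an importance level:
    both high -> high, both low -> low, anything else -> medium."""
    if frequency_level not in LEVELS or similarity_level not in LEVELS:
        raise ValueError("levels must be 'low', 'medium' or 'high'")
    if frequency_level == similarity_level == "high":
        return "high"
    if frequency_level == similarity_level == "low":
        return "low"
    return "medium"


for freq in LEVELS:
    print(freq, [importance_level(freq, sim) for sim in LEVELS])
```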

For similar words with high importance level, after extracting the sentences in which the similar words are located, the sentences on two sides of the sentences need to be further analyzed.

Specifically, it is judged whether a conjunction appears at the beginning or the end of the sentence containing the similar word; if so, the sentences on both sides of that sentence are extracted and marked as special sentences. The special sentences are bound to and stored with the sentence containing the corresponding similar word. Semantic analysis is carried out on the special sentences to determine whether they explain the sentence containing the similar word. Special keywords are extracted from the special sentences and are bound to and stored with the similar word. In this way, when a keyword is searched, the related information bound to it can be viewed at the same time, which makes information queries more convenient.
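
The conjunction check and neighbour extraction can be sketched as follows. The conjunction list, the function name `special_sentences` and the English sample text are all hypothetical, and matching is a naive prefix/suffix test rather than real linguistic analysis; the semantic analysis step is not shown.

```python
CONJUNCTIONS = ("therefore", "however", "because")  # hypothetical preset list


def special_sentences(sentences, index, conjunctions=CONJUNCTIONS):
    """If the sentence at index begins or ends with a conjunction, return
    its neighbouring sentences (the 'special sentences'); else return []."""
    core = sentences[index].strip().rstrip(".。").lower()
    if not (core.startswith(conjunctions) or core.endswith(conjunctions)):
        return []
    neighbours = []
    if index > 0:
        neighbours.append(sentences[index - 1])
    if index + 1 < len(sentences):
        neighbours.append(sentences[index + 1])
    return neighbours


sentences = [
    "Fresh water is scarce.",
    "Therefore recycled water is promoted.",
    "It is used for greening.",
]
# Bind the special sentences to the sentence containing the similar word.
bound = {sentences[1]: special_sentences(sentences, 1)}
print(bound)
```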

The application also provides a data retrieval system, as shown in fig. 2, which comprises an acquisition module 1 for acquiring a website list; the traversing module 2 is used for traversing the website list to obtain target file information; the establishing module 3 is used for establishing a preset keyword list; and the retrieval module 4 is used for retrieving the target keyword information from the target file information according to a preset keyword list and storing the target keyword information.

In order to better execute the program of the method, the application also provides a terminal, which comprises a memory and a processor.

Wherein the memory may be used to store instructions, programs, code, sets of codes, or sets of instructions. The memory may include a storage program area and a storage data area, wherein the storage program area may store instructions for implementing an operating system, instructions for at least one function, instructions for implementing the above-described data retrieval method, and the like; the storage data area may store data and the like involved in the above-described data retrieval method.

The processor may include one or more processing cores. The processor performs the various functions of the application and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory and by calling the data stored in the memory. The processor may be at least one of an application specific integrated circuit, a digital signal processor, a digital signal processing device, a programmable logic device, a field programmable gate array, a central processing unit, a controller, a microcontroller, and a microprocessor. It will be appreciated that, for different devices, other electronic components may implement the above processor functions, and the embodiments of the present application are not particularly limited in this respect.

The present application also provides a computer-readable storage medium, for example a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code. The computer-readable storage medium stores a computer program that can be loaded by a processor to perform the data retrieval method described above.

The above description is only illustrative of the preferred embodiments of the present application and the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in the present application is not limited to the specific combinations of technical features described above, but also covers other technical solutions formed by any combination of the technical features described above or their equivalents without departing from the spirit of the disclosure. One example is a technical solution formed by replacing the above features with technical features of similar function disclosed in (but not limited to) the present application.

Claims

1. A data retrieval method, comprising:

acquiring a website list;

traversing the website list to obtain target file information;

establishing a preset keyword list;

and retrieving target keyword information from the target file information according to the preset keyword list, and storing the target keyword information.

2. The data retrieval method of claim 1, wherein the obtaining a list of web sites comprises:

determining a file name keyword list according to preset file name information;

traversing the file name keyword list and a search engine to obtain website link information;

converting the website link information to obtain original website link information;

and obtaining a website list according to the original website link information.

3. The data retrieval method according to claim 2, further comprising, before obtaining a web site list from the original web site link information, de-duplicating and filtering the original web site link information.

4. The method of claim 1, wherein traversing the website list to obtain the target file information comprises:

identifying the website list to obtain webpage content type information;

and obtaining target file information according to preset web page processing rules and the webpage content type information.

5. The data retrieval method according to claim 4, wherein the preset web page processing rule includes:

if the web page format is HTML and the web page contains download links to doc/docx/pdf files, downloading the doc/docx/pdf files to a first preset path;

if the web page format is HTML and the web page contains no download links to doc/docx/pdf files, acquiring web page text information and storing the web page text information to a second preset path;

if the web page format is PDF/WORD, downloading the web page file to a third preset path;

if the web page format is neither HTML nor PDF/WORD, exporting the web address and storing the web page content of that address to a fourth preset path.

6. The data retrieval method according to claim 1, wherein the preset keyword list includes a primary keyword list and a secondary keyword list.

7. The data retrieval method according to claim 6, wherein retrieving target keyword information from the target file information according to a preset keyword list, and storing the target keyword information comprises:

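extracting text information and punctuation information from the target file information;

traversing the text information according to the primary keyword list to obtain the position information of the primary keywords;

determining the sentences containing the primary keywords according to the text information, the position information and the punctuation information;

judging, according to the secondary keyword list, whether the sentences containing the primary keywords also contain secondary keywords;

if yes, storing the sentences containing the primary keywords into a table file;

otherwise, not storing the sentences containing the primary keywords.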

8. A data retrieval system, comprising:

the acquisition module (1) is used for acquiring a website list;

the traversing module (2) is used for traversing the website list to obtain target file information;

the establishing module (3) is used for establishing a preset keyword list;

and the retrieval module (4) is used for retrieving target keyword information from the target file information according to a preset keyword list and storing the target keyword information.

9. A terminal comprising a memory and a processor, the memory having stored thereon computer program instructions capable of being loaded by the processor and performing the method according to any of claims 1-7.

10. A computer readable storage medium, characterized in that a computer program is stored which can be loaded by a processor and which performs the method according to any of claims 1-7.