CN107861943B - Method for quickly extracting useful data from document set

Info

Publication number: CN107861943B
Application number: CN201710985840.1A
Authority: CN
Inventors: 刘军旗; 苏爱军; 唐辉明; 吴冲龙; 姚梦辉; 滕伟福; 王亮清; 封瑞雪; 赵剑雄; 陈根深; 邹宗兴; 王菁莪; 曾雯; 张抒
Original assignee: China University of Geosciences
Current assignee: China University of Geosciences
Priority date: 2017-10-20
Filing date: 2017-10-20
Publication date: 2020-03-24
Anticipated expiration: 2037-10-20
Also published as: CN107861943A

Abstract

The invention provides a method for quickly extracting useful data from a document set, comprising the following steps: 1: performing word segmentation to obtain the potential search words in each document and in each paragraph of the document; 2: performing word frequency statistics to obtain a word frequency statistical result for each potential search word in each paragraph and for the potential search words of the document as a whole; 3: storing the results using an unstructured database technology, so that all documents in the document set are converted into an ordered set in an unstructured database; 4: inputting a search word and searching the unstructured database holding the ordered set; 5: outputting a retrieval result. The method has the advantages that retrieval is simple and use is convenient.

Description

Method for quickly extracting useful data from document set

Technical Field

The invention relates to the technical field of information retrieval, in particular to a method for quickly extracting useful data from a document set.

Background

Unstructured database: generally speaking, unstructured data is data whose structure is irregular or incomplete, which has no predefined data model, and which cannot conveniently be expressed in the two-dimensional tables of a relational database. Examples include document data such as Word and PDF files, as well as picture, image, audio and video data. Unstructured data accounts for a large share of all data. When unstructured data is managed with traditional structured databases such as relational databases, it is difficult to conveniently mine the valuable information it contains.

Chinese word segmentation technology: Chinese word segmentation refers to the process of segmenting a continuous sequence of characters in a text into individual words according to a certain specification, that is, recombining the character sequence into a word sequence.

Word frequency statistics technology: the number of times a word appears in a document is referred to as the word frequency of that word in the document. At present, the TF-IDF (term frequency-inverse document frequency) method is generally adopted for word frequency statistics. It is a weighting technique commonly used in information retrieval and text mining to assess how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus.
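
The weighting described above can be sketched in a few lines of Python. This is only an illustration: the source does not say which TF-IDF variant is used, so the sketch assumes the common form tf = count / document length and idf = log(N / df). The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: weight} dict per document.

    corpus: list of documents, each a list of already-segmented words.
    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in corpus for word in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical): three tiny, already-segmented documents.
corpus = [
    ["landslide", "monitoring", "landslide", "data"],
    ["rainfall", "data"],
    ["landslide", "rainfall", "data"],
]
weights = tf_idf(corpus)
```

In this toy corpus the word `data` occurs in every document, so its weight is zero, which matches the statement that importance falls as corpus-wide frequency rises.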

Document retrieval refers to the process of finding, in a document database, the best documents containing a search word when that search word is input. As the pace of social life and work quickens and the number and length of documents grow, a search over massive data returns only whole documents; even once the relevant documents are found, a great deal of time is still needed to locate the relevant data within them manually, which is extremely inefficient and quite difficult. For example, geological disaster work has accumulated a large amount of document data, which is generally stored with the whole document as the unit. Extracting a specific piece of data or information from one or more documents, or determining which paragraph of one or more documents contains it, and doing so quickly, has so far been difficult.

Disclosure of Invention

In view of this, the embodiment of the present invention provides a method for quickly extracting useful data from a document set, which is simple to retrieve and convenient to use.

The embodiment of the invention provides a method for quickly extracting useful data from a document set, which comprises the following steps:

step 1: using a Chinese word segmentation tool to preprocess each document in a document set, the preprocessing including word segmentation, part-of-speech tagging and word segmentation screening, so as to obtain the potential search words in each document and the potential search words in each paragraph of the document;

step 2: performing word frequency statistics on the potential search words in each paragraph of each document in the document set to obtain a word frequency statistical result for each potential search word in each paragraph, and obtaining a word frequency statistical result for the potential search words of the corresponding document as a whole based on the word frequency statistical results of its paragraphs;

step 3: adopting an unstructured database technology to store the document set processed in steps 1 and 2, and establishing a storage set for each document in the document set, wherein the storage content of each storage set comprises: { the name of the document, the content of the document, the potential search words of each paragraph in the document and the word frequency statistical result of each of them, the potential search words of the document and the word frequency statistical result of each of them, and the storage time }, so that all documents in the document set are converted into an ordered set in an unstructured database;

step 4: inputting a search word, and performing a search in the unstructured database holding the ordered set;

step 5: outputting a retrieval result according to the match between the search word and the potential search words and the word frequency statistical results of the potential search words.

Further, the potential search terms include nouns, verbs and quantifiers.

Further, the word segmentation screening in step 1 removes non-potential search words from the words obtained after segmentation and part-of-speech tagging, where the non-potential search words include conjunctions, adverbs and modal particles.

Further, in step 5, the output content of the search result includes at least one result set, and the content of each result set includes: { name of document, storage time, contents of each paragraph in the document with a search term }.

Further, the result sets are sorted in descending order according to the word frequency statistical results of the potential search words of the documents.

Further, in each of the result sets, the paragraphs with the search terms are arranged according to the paragraph order of each paragraph in the document.

Further, the content of the result set further includes: { storage location, number of search words in each paragraph containing the search word }.

Further, the document set is a large document set of geological disasters.

Further, the Chinese word segmentation tool is a word segmentation dictionary, and the word segmentation algorithm adopted for the word segmentation in step 1 is a Chinese word segmentation algorithm, the Word segmentation algorithm or the Pangu segmentation algorithm.

Further, the method for performing word frequency statistics in step 2 is a TF-IDF method, and the unstructured database is a MongoDB, HBase or Redis database.

The technical solution provided by the embodiment of the invention has the following beneficial effects: the method for quickly extracting useful data from a document set overcomes the difficulty of determining whether useful data or information exists in a large number of documents, where it is located, and how to extract it quickly. It enables a user to quickly extract the required data or information from a large number of geological disaster documents, and provides strong support and service for geological disaster data management, data analysis, data mining, data fusion, big data processing and the like.

Drawings

FIG. 1 is a step diagram of the method of the present invention for quickly extracting useful data from a document set;

FIG. 2 is an exemplary graph of a result set.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be further described with reference to the accompanying drawings.

Referring to fig. 1, an embodiment of the present invention provides a method for rapidly extracting useful data from a document set, including the following steps:

Step 1: preprocessing each document in the document set using a Chinese word segmentation tool, the preprocessing comprising word segmentation, part-of-speech tagging and word segmentation screening, to obtain the potential search words in each document and the potential search words in each paragraph of the document.

The Chinese word segmentation tool is a word segmentation dictionary, for example a Chinese dictionary with part-of-speech tags compiled from the People's Daily. The word segmentation algorithm adopted for the word segmentation in step 1 is a Chinese word segmentation algorithm, the Word segmentation algorithm or the Pangu segmentation algorithm; a suitable algorithm is selected as required.
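
To illustrate how a word segmentation dictionary can drive segmentation, the following sketch uses forward maximum matching, a simple dictionary-based technique. It is a stand-in chosen for brevity and is not claimed to be the algorithm used by any of the tools named above. The function name `forward_max_match`, the `max_len` parameter and the mini-dictionary are assumptions made for this example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text greedily, preferring the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character if no dictionary word matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical mini-dictionary; a real one would hold many thousands of entries.
dictionary = {"地质", "灾害", "地质灾害", "监测", "数据"}
segments = forward_max_match("地质灾害监测数据", dictionary)
```

Because the longest match is tried first, the four-character entry 地质灾害 is kept whole rather than split into 地质 and 灾害.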

The potential search words include nouns, verbs and quantifiers, and may even include adjectives and other words that a user might search for. The word segmentation screening in step 1 removes non-potential search words from the words obtained after segmentation and part-of-speech tagging; non-potential search words are words that users cannot search for, or are unlikely to search for, such as conjunctions, adverbs and modal particles. In this embodiment, the document set is a large document set on geological disasters, but it is not limited thereto.
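
The screening part of step 1 can be sketched as a filter over tagged words. The sketch assumes segmentation and part-of-speech tagging have already produced (word, tag) pairs per paragraph. The tag labels, the names `KEPT_POS`, `screen_paragraph` and `screen_document`, and the sample input are all hypothetical.

```python
# Parts of speech kept as potential search words, per the description above.
KEPT_POS = {"noun", "verb", "quantifier"}


def screen_paragraph(tagged_words):
    """Keep only words whose part of speech is in KEPT_POS."""
    return [word for word, pos in tagged_words if pos in KEPT_POS]


def screen_document(tagged_paragraphs):
    """Return (per-paragraph search words, document-level search words)."""
    per_paragraph = [screen_paragraph(p) for p in tagged_paragraphs]
    # Document-level words: union across paragraphs, first-seen order kept.
    document_words = list(dict.fromkeys(w for p in per_paragraph for w in p))
    return per_paragraph, document_words


# Hypothetical tagger output for a two-paragraph document.
tagged = [
    [("landslide", "noun"), ("and", "conjunction"), ("monitor", "verb")],
    [("very", "adverb"), ("landslide", "noun"), ("data", "noun")],
]
per_paragraph, document_words = screen_document(tagged)
```

Adding `"adjective"` to `KEPT_POS` would cover the case in which adjectives are also treated as potential search words.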

Step 2: performing word frequency statistics on the potential search words in each paragraph of each document in the document set to obtain a word frequency statistical result for each potential search word in each paragraph, and obtaining a word frequency statistical result for the potential search words of the corresponding document as a whole based on the word frequency statistical results of its paragraphs.

Preferably, the method for performing word frequency statistics in step 2 is a TF-IDF method, and the unstructured database is a MongoDB, HBase or Redis database.

For example, suppose the potential search words of document 1 are computer, data and keyboard, and the potential search words of the Nth paragraph of document 1 are computer and data. After the word frequency statistics of step 2, the potential search words computer and data occur 5 and 9 times respectively in the Nth paragraph of document 1, and the potential search words computer, data and keyboard occur 15, 31 and 92 times respectively in document 1 as a whole. The word frequency statistical result of the Nth paragraph of document 1 is therefore: [ document 1, paragraph N, computer, 5 ]; [ document 1, paragraph N, data, 9 ]. The word frequency statistical result of document 1 is: [ document 1, computer, 15 ]; [ document 1, data, 31 ]; [ document 1, keyboard, 92 ].
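
A minimal sketch of the paragraph-to-document aggregation in step 2 follows. It uses raw occurrence counts, as in the example above, rather than TF-IDF weights. The function names and the sample paragraphs are hypothetical and much smaller than the example's figures.

```python
from collections import Counter


def paragraph_frequencies(paragraphs):
    """Count each potential search word within each paragraph."""
    return [Counter(words) for words in paragraphs]


def document_frequencies(per_paragraph_counts):
    """Sum the paragraph-level counts into one document-level count."""
    total = Counter()
    for counts in per_paragraph_counts:
        total.update(counts)
    return total


# Hypothetical document: each paragraph is its list of potential search words.
paragraphs = [
    ["computer", "data", "data"],
    ["keyboard", "computer", "data"],
]
per_paragraph = paragraph_frequencies(paragraphs)
whole_document = document_frequencies(per_paragraph)

# Records shaped like the example: [document, paragraph, word, count].
records = [
    ["document 1", index + 1, word, count]
    for index, counts in enumerate(per_paragraph)
    for word, count in counts.items()
]
```

Deriving the document-level result by summing the paragraph-level results means each document only needs to be scanned once.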

Step 3: adopting an unstructured database technology to store the document set processed in steps 1 and 2, and establishing a storage set for each document in the document set, wherein the storage content of each storage set comprises: { the name of the document, the content of the document, the potential search words of each paragraph in the document and the word frequency statistical result of each of them, the potential search words of the document and the word frequency statistical result of each of them, and the storage time }, so that all the documents in the document set are converted into an ordered set in an unstructured database.
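
One possible shape for a storage set is a nested, JSON-serializable record of the kind a document-oriented database can hold. The sketch below only builds and serializes such a record and makes no database calls. All field names (`name`, `content`, `paragraph_terms`, `document_terms`, `stored_at`) and the function `build_storage_set` are assumptions, not names from the source.

```python
import json
from datetime import datetime, timezone


def build_storage_set(name, paragraphs, paragraph_counts, document_counts):
    """Assemble one storage set (one record per document)."""
    return {
        "name": name,
        "content": paragraphs,  # document content, kept paragraph by paragraph
        "paragraph_terms": [
            {"paragraph": index + 1, "terms": counts}
            for index, counts in enumerate(paragraph_counts)
        ],
        "document_terms": document_counts,
        "stored_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical two-paragraph document with precomputed word frequencies.
record = build_storage_set(
    name="document 1",
    paragraphs=[
        "The computer stores data.",
        "The keyboard sends data to the computer.",
    ],
    paragraph_counts=[
        {"computer": 1, "data": 1},
        {"keyboard": 1, "data": 1, "computer": 1},
    ],
    document_counts={"computer": 2, "data": 2, "keyboard": 1},
)
serialized = json.dumps(record, ensure_ascii=False)
```

Keeping the paragraph-level and document-level frequencies in the same record means a later search can return the matching paragraphs without segmenting the document again.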

Step 4: inputting a search word, and performing a search in the unstructured database holding the ordered set.

Step 5: referring to FIG. 2, a retrieval result is output according to the match between the search word and the potential search words and the word frequency statistical results of the potential search words. The output content of the retrieval result includes at least one result set, and the content of each result set includes: { name of document, storage time, content of each paragraph in the document containing the search word }. The result sets are sorted in descending order according to the word frequency statistical results of the potential search words of the documents. Within each result set, the paragraphs containing the search word are arranged in the order in which they appear in the document. The content of the result set may further include: { storage location, number of search words in each paragraph containing the search word }.
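
Steps 4 and 5 can be sketched as a search over in-memory records. The sketch assumes each storage set is a dictionary with the hypothetical fields `name`, `content`, `paragraph_terms`, `document_terms` and `stored_at`, and that ranking uses the raw document-level count of the search word. A real system would query the unstructured database instead of a Python list.

```python
def search(storage_sets, search_word):
    """Return result sets for the documents that contain search_word."""
    results = []
    for record in storage_sets:
        doc_count = record["document_terms"].get(search_word, 0)
        if doc_count == 0:
            continue  # not a potential search word of this document
        # Walk paragraphs in stored order so hits stay in document order.
        hits = [
            {
                "paragraph": entry["paragraph"],
                "text": record["content"][entry["paragraph"] - 1],
                "count": entry["terms"][search_word],
            }
            for entry in record["paragraph_terms"]
            if search_word in entry["terms"]
        ]
        results.append({
            "name": record["name"],
            "stored_at": record["stored_at"],
            "document_count": doc_count,
            "paragraphs": hits,
        })
    # Highest document-level frequency first.
    results.sort(key=lambda r: r["document_count"], reverse=True)
    return results


# Hypothetical storage sets for two short reports.
storage_sets = [
    {
        "name": "report A",
        "content": ["Rainfall data.", "Landslide data and more data."],
        "paragraph_terms": [
            {"paragraph": 1, "terms": {"rainfall": 1, "data": 1}},
            {"paragraph": 2, "terms": {"landslide": 1, "data": 2}},
        ],
        "document_terms": {"rainfall": 1, "data": 3, "landslide": 1},
        "stored_at": "2017-10-20",
    },
    {
        "name": "report B",
        "content": ["Landslide survey.", "Landslide after landslide.", "Landslide data."],
        "paragraph_terms": [
            {"paragraph": 1, "terms": {"landslide": 1, "survey": 1}},
            {"paragraph": 2, "terms": {"landslide": 2}},
            {"paragraph": 3, "terms": {"landslide": 1, "data": 1}},
        ],
        "document_terms": {"landslide": 4, "survey": 1, "data": 1},
        "stored_at": "2017-10-21",
    },
]
results = search(storage_sets, "landslide")
```

Here report B is listed before report A because its document-level count for the search word is higher, and within each result set the matching paragraphs keep their original order.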

In this document, the terms front, back, upper and lower are used to define the components in the drawings and the positions of the components relative to each other, and are used for clarity and convenience of the technical solution. It is to be understood that the use of the directional terms should not be taken to limit the scope of the claims.

The embodiments described above, and the features of those embodiments, may be combined with each other provided there is no conflict.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for rapidly extracting useful data from a document set, characterized in that the method comprises the following steps:

step 1: using a Chinese word segmentation tool to perform preprocessing, including word segmentation, part-of-speech tagging and word segmentation screening, on each document in a document set, the screening removing non-potential search words from the words obtained after segmentation and part-of-speech tagging, wherein the non-potential search words comprise conjunctions, adverbs and modal particles, so as to obtain the potential search words in each document and the potential search words in each paragraph of the document, wherein the potential search words comprise nouns, verbs and quantifiers;

step 2: performing word frequency statistics on the potential search words in each paragraph of each document in the document set to obtain a word frequency statistical result for each potential search word in each paragraph, and obtaining a word frequency statistical result for the potential search words of the corresponding document as a whole based on the word frequency statistical results of its paragraphs;

step 3: adopting an unstructured database technology to store the document set processed in steps 1 and 2, and establishing a storage set for each document in the document set, wherein the storage content of each storage set comprises: { the name of the document, the content of the document, the potential search words of each paragraph in the document and the word frequency statistical result of each of them, the potential search words of the document and the word frequency statistical result of each of them, and the storage time }, so that all documents in the document set are converted into an ordered set in an unstructured database;

step 4: inputting a search word, and performing a search in the unstructured database holding the ordered set;

step 5: outputting a retrieval result according to the match between the search word and the potential search words and the word frequency statistical results of the potential search words, wherein the output content of the retrieval result comprises at least one result set, and the content of each result set comprises: { name of document, storage time, content of each paragraph in the document containing the search word }; the content of the result set further comprises: { storage location, number of search words in each paragraph containing the search word }; the result sets are sorted in descending order according to the word frequency statistical results of the potential search words of the documents, and the paragraphs containing the search word in each result set are arranged according to the order of the paragraphs in the document.

2. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that: the document set is a large document set of geological disasters.

3. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that: the Chinese word segmentation tool is a word segmentation dictionary, and the word segmentation algorithm adopted for the word segmentation in step 1 is a Chinese word segmentation algorithm, the Word segmentation algorithm or the Pangu segmentation algorithm.

4. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that: the method for carrying out word frequency statistics in the step 2 is a TF-IDF method, and the unstructured database is a MongoDB, HBase or Redis database.