CN107861943A

CN107861943A - Method for rapidly extracting useful data from a document set

Info

Publication number: CN107861943A
Application number: CN201710985840.1A
Authority: CN
Inventors: 刘军旗; 苏爱军; 唐辉明; 吴冲龙; 姚梦辉; 滕伟福; 王亮清; 封瑞雪; 赵剑雄; 陈根深; 邹宗兴; 王菁莪; 曾雯; 张抒
Original assignee: China University of Geosciences
Current assignee: China University of Geosciences
Priority date: 2017-10-20
Filing date: 2017-10-20
Publication date: 2018-03-30
Anticipated expiration: 2037-10-20
Also published as: CN107861943B

Abstract

The present invention provides a method for rapidly extracting useful data from a document set, comprising the following steps. 1: perform word segmentation to obtain the potential search terms of each document and of each paragraph in that document. 2: perform word frequency statistics to obtain the word frequency of each potential search term in each paragraph, and the word frequency of the potential search terms for the document as a whole. 3: store the results using unstructured database technology, so that all documents in the document set are converted into one ordered set in the unstructured database. 4: input a search term and perform retrieval in the unstructured database holding the ordered set. 5: output the retrieval result. Beneficial effects: retrieval is simple and the method is easy to use.

Description

Method for rapidly extracting useful data from a document set

Technical field

The present invention relates to the technical field of information retrieval, and in particular to a method for rapidly extracting useful data from a document set.

Background art

Unstructured database: In general, unstructured data is data whose structure is irregular or incomplete, which has no predefined data model, and which cannot conveniently be represented in the two-dimensional tables of a relational database. Examples include document data such as Word and PDF files, pictures, images, audio, and video. Unstructured data accounts for a very large proportion of all data. When unstructured data is managed with a traditional structured database such as a relational database, it is difficult to mine the valuable information the data contains.

Chinese word segmentation: Chinese word segmentation is the process of cutting a continuous sequence of characters in a text into individual words according to a given specification and recombining them into a word sequence.

Word frequency statistics: The number of times a word occurs in a file is called the word frequency of that word in the file. At present, word frequency statistics generally uses the TF-IDF (term frequency-inverse document frequency) method. This is a commonly used weighting technique for information retrieval and text mining that assesses how important a word is to one file within a file set or corpus. The importance of a word increases in proportion to the number of times it occurs in the file, but decreases in inverse proportion to the frequency with which it occurs in the corpus.

Document retrieval is the process of searching a document database, given an input search term, for the documents that best match that term. As the pace of life and work keeps accelerating and the number of documents and words keeps growing, searching only at the level of whole files in massive data is not enough: even after the relevant documents are found, a great deal of time must still be spent manually looking for the relevant data inside them, which is extremely inefficient and difficult. For example, geological disaster work has accumulated a large volume of documents, and these are generally stored as whole documents. Extracting specific data or information from one or more documents, or determining which paragraphs of which documents contain that data or information and then extracting it quickly, has so far been very difficult.
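As a minimal sketch of the TF-IDF weighting described above, the following Python scores a word against a toy corpus. The function name `tf_idf`, the normalisation of term frequency by document length, and the logarithmic form log(N/df) are assumptions chosen for illustration; the text does not fix a particular TF-IDF variant.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Score `term` for one tokenised document against a corpus (a list of token lists)."""
    if not doc_tokens:
        return 0.0
    # Term frequency: occurrences of the term, normalised by document length (assumed variant).
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: number of documents in the corpus that contain the term.
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0
    # Inverse document frequency: words that are rarer across the corpus weigh more.
    idf = math.log(len(corpus) / df)
    return tf * idf


# Toy corpus of three already-segmented documents (illustrative data only).
corpus = [
    ["landslide", "data", "slope", "data"],
    ["rainfall", "data", "monitoring"],
    ["landslide", "warning", "threshold"],
]
scores = {t: tf_idf(t, corpus[0], corpus) for t in ("landslide", "data", "slope")}
```

In this toy corpus "slope" occurs in only one document, so it scores higher for the first document than "data", even though "data" occurs there more often.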

Summary of the invention

In view of this, embodiments of the present invention provide a method for rapidly extracting useful data from a document set that is simple to search and easy to use.

Embodiments of the present invention provide a method for rapidly extracting useful data from a document set, comprising the following steps:

Step 1: Using a Chinese word segmentation tool, preprocess each document in the document set, the preprocessing including word segmentation, part-of-speech tagging, and segmentation screening, to obtain the potential search terms of each document and of each paragraph in that document;

Step 2: Perform word frequency statistics on the potential search terms in each paragraph of each document in the document set to obtain the word frequency of each potential search term in each paragraph, and derive from the paragraph-level results the word frequency of the potential search terms for the corresponding document as a whole;

Step 3: Using unstructured database technology, store the document set processed in steps 1 and 2 by establishing one storage set for each document in the document set, the content of each storage set including: {title of the document, content of the document, the potential search terms of each paragraph in the document and the word frequency of each of them, the potential search terms of the document and the word frequency of each of them, storage time}, so that all documents in the document set are converted into one ordered set in the unstructured database;

Step 4: Input a search term and perform retrieval in the unstructured database holding the ordered set;

Step 5: Output the retrieval result according to the match between the search term and the potential search terms and according to the word frequency of the potential search terms.

Further, the potential search terms include nouns, verbs, and numeral-classifier compounds.

Further, the segmentation screening in step 1 removes non-potential search terms from the words obtained after segmentation and part-of-speech tagging, the non-potential search terms including conjunctions, adverbs, and modal particles.

Further, in step 5, the output of the retrieval result includes at least one result set, and the content of each result set includes: {title of the document, storage time, content of each paragraph in the document that contains the search term}.

Further, the result sets are arranged in descending order of the word frequency of the document's potential search terms.

Further, within each result set, the paragraphs containing the search term are arranged in the order in which the paragraphs appear in the document.

Further, the content of the result set also includes: {storage location, number of occurrences of the search term in each paragraph that contains it}.

Further, the document set is a large geological disaster document set.

Further, the Chinese word segmentation tool is a segmentation dictionary, and the segmentation algorithm used in step 1 is Jieba segmentation, Word segmentation, or Pangu segmentation.

Further, the word frequency statistics in step 2 uses the TF-IDF method, and the unstructured database is a MongoDB, HBase, or Redis database.

The technical solution provided by embodiments of the present invention has the following beneficial effects: the method overcomes the difficulty of determining whether useful data or information exists in a large volume of documents, where it is located, and how to extract it quickly. It allows a user to quickly extract the required data or information from a large number of geological disaster documents, and provides strong support for geological disaster data management, data analysis, data mining, data fusion, and big data processing.

Brief description of the drawings

Fig. 1 is a block diagram of the method for rapidly extracting useful data from a document set according to an embodiment of the present invention;

Fig. 2 is an example of a result set.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are further described below with reference to the accompanying drawings.

Referring to Fig. 1, an embodiment of the present invention provides a method for rapidly extracting useful data from a document set, comprising the following steps:

Step 1: Using a Chinese word segmentation tool, preprocess each document in the document set, the preprocessing including word segmentation, part-of-speech tagging, and segmentation screening, to obtain the potential search terms of each document and of each paragraph in that document.

The Chinese word segmentation tool is a segmentation dictionary, for example a part-of-speech-tagged Chinese dictionary compiled from People's Daily statistics. The segmentation algorithm used in step 1 is Jieba segmentation, Word segmentation, or Pangu segmentation; a suitable algorithm is selected as needed.

The potential search terms include nouns, verbs, and numeral-classifier compounds, and may also include adjectives and other words that a user might use for retrieval. The segmentation screening in step 1 removes non-potential search terms from the words obtained after segmentation and part-of-speech tagging. Non-potential search terms include conjunctions, adverbs, modal particles, and other words that a user cannot, or is unlikely to, use for retrieval. In this embodiment the document set is a large geological disaster document set, but the method is not limited to this.
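The screening in step 1 can be sketched as a filter over words that have already been segmented and tagged. This is only an illustration under stated assumptions: the single-letter tags, the names `KEEP_TAGS`, `screen_paragraph`, and `screen_document`, and the choice to take the document's terms as the union of its paragraphs' terms are hypothetical, and a real segmentation tool may use a different tag set.

```python
# Assumed part-of-speech labels for illustration; real tools may use other tag sets.
# n = noun, v = verb, m = numeral-classifier compound, a = adjective.
KEEP_TAGS = {"n", "v", "m", "a"}
# Conjunctions ("c"), adverbs ("d"), and modal particles ("y") are not in KEEP_TAGS,
# so they are rejected as non-potential search terms.


def screen_paragraph(tagged_words):
    """Return the potential search terms of one paragraph, in order of appearance."""
    return [word for word, tag in tagged_words if tag in KEEP_TAGS]


def screen_document(tagged_paragraphs):
    """Return (terms per paragraph, de-duplicated terms for the whole document)."""
    per_paragraph = [screen_paragraph(p) for p in tagged_paragraphs]
    # dict.fromkeys keeps first-seen order while dropping repeats.
    doc_terms = list(dict.fromkeys(w for para in per_paragraph for w in para))
    return per_paragraph, doc_terms


# Illustrative input: two paragraphs of (word, tag) pairs.
tagged = [
    [("landslide", "n"), ("and", "c"), ("rainfall", "n"), ("often", "d"), ("occur", "v")],
    [("rainfall", "n"), ("exceed", "v"), ("50 mm", "m")],
]
per_paragraph, doc_terms = screen_document(tagged)
```

Keeping an explicit allow-list of tags makes it easy to widen the set, for instance to admit or exclude adjectives, without touching the filtering logic.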

Step 2: Perform word frequency statistics on the potential search terms in each paragraph of each document in the document set to obtain the word frequency of each potential search term in each paragraph, and derive from the paragraph-level results the word frequency of the potential search terms for the corresponding document as a whole.

Preferably, the word frequency statistics in step 2 uses the TF-IDF method, and the unstructured database is a MongoDB, HBase, or Redis database.

For example, the potential search terms of document 1 are "computer", "data", and "keyboard", and the potential search terms of paragraph N of document 1 are "computer" and "data". After the word frequency statistics of step 2, "computer" and "data" are found to occur 5 times and 9 times respectively in paragraph N of document 1, and "computer", "data", and "keyboard" are found to occur 15 times, 31 times, and 92 times respectively in document 1. The word frequency results for paragraph N of document 1 are therefore: [document 1 paragraph N, computer, 5]; [document 1 paragraph N, data, 9]. The word frequency results for document 1 are: [document 1, computer, 15]; [document 1, data, 31]; [document 1, keyboard, 92].
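The paragraph-then-document counting in this example can be sketched with Python's `collections.Counter`. The sketch uses raw occurrence counts, as the worked example does, rather than TF-IDF weights, and it assumes the document-level count is the sum of the paragraph-level counts. The names `document_counts` and `as_records` and the sample data are hypothetical.

```python
from collections import Counter


def document_counts(per_paragraph_counts):
    """Sum paragraph-level counts into one document-level count (assumed aggregation)."""
    total = Counter()
    for counts in per_paragraph_counts:
        total.update(counts)
    return total


def as_records(label, counts):
    """Render counts as [label, term, count] triples, mirroring the bracketed results."""
    return [[label, term, n] for term, n in counts.items()]


# Illustrative potential search terms of a two-paragraph document.
paragraphs = [
    ["computer", "data", "data"],
    ["keyboard", "computer", "data"],
]
per_paragraph = [Counter(p) for p in paragraphs]
whole_document = document_counts(per_paragraph)
paragraph_1_records = as_records("document 1 paragraph 1", per_paragraph[0])
```

Counting once per paragraph and summing avoids a second pass over the document text, and it keeps the paragraph-level results available for later paragraph lookup.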

Step 3: Using unstructured database technology, store the document set processed in steps 1 and 2 by establishing one storage set for each document in the document set, the content of each storage set including: {title of the document, content of the document, the potential search terms of each paragraph in the document and the word frequency of each of them, the potential search terms of the document and the word frequency of each of them, storage time}, so that all documents in the document set are converted into one ordered set in the unstructured database.
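One way to picture a storage set is as a nested record of the kind a document-oriented database holds. The sketch below builds such a record as a plain Python dictionary and makes no database calls. The field names (`title`, `content`, `paragraph_terms`, `document_terms`, `stored_at`), the function `build_storage_set`, and the use of a Python list in insertion order to stand in for the ordered set are all assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def build_storage_set(title, paragraphs, paragraph_terms, stored_at=None):
    """Build one record per document.

    paragraphs: list of paragraph texts.
    paragraph_terms: list of potential-search-term lists, aligned with `paragraphs`.
    """
    # Word frequency of each potential search term, per paragraph.
    para_stats = [dict(Counter(terms)) for terms in paragraph_terms]
    # Document-level frequencies, derived from the paragraph-level ones.
    doc_stats = Counter()
    for stats in para_stats:
        doc_stats.update(stats)
    return {
        "title": title,
        "content": paragraphs,
        "paragraph_terms": para_stats,
        "document_terms": dict(doc_stats),
        "stored_at": (stored_at or datetime.now(timezone.utc)).isoformat(),
    }


# A list in insertion order stands in for the ordered set (assumption).
collection = []
record = build_storage_set(
    "Report A",
    ["Landslide data were collected.", "Rainfall data and gauge data were compared."],
    [["landslide", "data"], ["rainfall", "data", "data"]],
    stored_at=datetime(2017, 10, 20, tzinfo=timezone.utc),
)
collection.append(record)
```

Storing the paragraph-level and document-level frequencies side by side means a later query can rank documents and locate paragraphs without re-reading the document text.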

Step 4: Input a search term and perform retrieval in the unstructured database holding the ordered set.

Referring to Fig. 2, in step 5 the output of the retrieval result includes at least one result set, and the content of each result set includes: {title of the document, storage time, content of each paragraph in the document that contains the search term}. The result sets are arranged in descending order of the word frequency of the document's potential search terms. Within each result set, the paragraphs containing the search term are arranged in the order in which the paragraphs appear in the document. The content of the result set may also include: {storage location, number of occurrences of the search term in each paragraph that contains it}.
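Steps 4 and 5 can be sketched as a lookup over stored records followed by a sort. The record layout, the field names, the function `search`, and the sample data are hypothetical; the sketch matches the search term exactly against stored potential search terms and ranks documents by raw document-level count, which is one possible reading of the word frequency ordering.

```python
def search(collection, term):
    """Return result sets for `term`, highest document-level frequency first."""
    results = []
    for doc in collection:
        doc_count = doc["document_terms"].get(term, 0)
        if doc_count == 0:
            continue  # the term is not a potential search term of this document
        # Matching paragraphs, kept in their order of appearance in the document.
        hits = [
            {"paragraph": i, "text": text, "count": stats[term]}
            for i, (text, stats) in enumerate(
                zip(doc["content"], doc["paragraph_terms"]), start=1
            )
            if term in stats
        ]
        results.append({
            "title": doc["title"],
            "stored_at": doc["stored_at"],
            "document_count": doc_count,
            "paragraphs": hits,
        })
    # Descending order of the document-level word frequency.
    results.sort(key=lambda r: r["document_count"], reverse=True)
    return results


# Illustrative stored records (hypothetical data).
collection = [
    {"title": "Report A", "stored_at": "2017-10-20",
     "content": ["Slope survey.", "Rainfall data and more data."],
     "paragraph_terms": [{"slope": 1, "survey": 1}, {"rainfall": 1, "data": 2}],
     "document_terms": {"slope": 1, "survey": 1, "rainfall": 1, "data": 2}},
    {"title": "Report B", "stored_at": "2017-10-21",
     "content": ["Data, data, data.", "Landslide data."],
     "paragraph_terms": [{"data": 3}, {"landslide": 1, "data": 1}],
     "document_terms": {"data": 4, "landslide": 1}},
]
results = search(collection, "data")
```

Here "Report B" is listed before "Report A" because "data" occurs more often in it, and each result carries only the paragraphs that contain the term together with their per-paragraph counts.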

In this document, orientation terms such as front, rear, upper, and lower are defined with reference to the positions of the parts in the drawings and their positions relative to one another, solely for clarity and convenience in describing the technical solution. It should be understood that the use of these orientation terms does not limit the scope of protection claimed by this application.

Where no conflict arises, the embodiments described above and the features in those embodiments may be combined with one another.

The foregoing are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for rapidly extracting useful data from a document set, characterized by comprising the following steps:

Step 1: Using a Chinese word segmentation tool, preprocess each document in the document set, the preprocessing including word segmentation, part-of-speech tagging, and segmentation screening, to obtain the potential search terms of each document and of each paragraph in that document;

Step 2: Perform word frequency statistics on the potential search terms in each paragraph of each document in the document set to obtain the word frequency of each potential search term in each paragraph, and derive from the paragraph-level results the word frequency of the potential search terms for the corresponding document as a whole;

Step 3: Using unstructured database technology, store the document set processed in steps 1 and 2 by establishing one storage set for each document in the document set, the content of each storage set including: {title of the document, content of the document, the potential search terms of each paragraph in the document and the word frequency of each of them, the potential search terms of the document and the word frequency of each of them, storage time}, so that all documents in the document set are converted into one ordered set in the unstructured database;

Step 4: Input a search term and perform retrieval in the unstructured database holding the ordered set;

Step 5: Output the retrieval result according to the match between the search term and the potential search terms and according to the word frequency of the potential search terms.

2. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that the potential search terms include nouns, verbs, and numeral-classifier compounds.

3. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that the segmentation screening in step 1 removes non-potential search terms from the words obtained after segmentation and part-of-speech tagging, the non-potential search terms including conjunctions, adverbs, and modal particles.

4. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that, in step 5, the output of the retrieval result includes at least one result set, and the content of each result set includes: {title of the document, storage time, content of each paragraph in the document that contains the search term}.

5. The method for rapidly extracting useful data from a document set according to claim 4, characterized in that the result sets are arranged in descending order of the word frequency of the document's potential search terms.

6. The method for rapidly extracting useful data from a document set according to claim 4, characterized in that, within each result set, the paragraphs containing the search term are arranged in the order in which the paragraphs appear in the document.

7. The method for rapidly extracting useful data from a document set according to claim 4, characterized in that the content of the result set also includes: {storage location, number of occurrences of the search term in each paragraph that contains it}.

8. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that the document set is a large geological disaster document set.

9. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that the Chinese word segmentation tool is a segmentation dictionary, and the segmentation algorithm used in step 1 is Jieba segmentation, Word segmentation, or Pangu segmentation.

10. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that the word frequency statistics in step 2 uses the TF-IDF method, and the unstructured database is a MongoDB, HBase, or Redis database.