CN107861943A - Method for quickly extracting useful data from a document set - Google Patents

Method for quickly extracting useful data from a document set

Info

Publication number
CN107861943A
Authority
CN
China
Prior art keywords
document
term
word
paragraph
document sets
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710985840.1A
Other languages
Chinese (zh)
Other versions
CN107861943B (en)
Inventor
刘军旗
苏爱军
唐辉明
吴冲龙
姚梦辉
滕伟福
王亮清
封瑞雪
赵剑雄
陈根深
邹宗兴
王菁莪
曾雯
张抒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences
Priority to CN201710985840.1A
Publication of CN107861943A
Application granted
Publication of CN107861943B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing

Abstract

The present invention provides a method for quickly extracting useful data from a document set, comprising the following steps. Step 1: perform word segmentation to obtain the potential search terms in each document and in each paragraph of that document. Step 2: perform word frequency statistics to obtain the word frequency result of each potential search term in each paragraph and the word frequency result of each potential search term for the document as a whole. Step 3: store the results using unstructured database technology, so that all documents in the document set are converted into an ordered collection in the unstructured database. Step 4: input a search term and run the search against the ordered collection in the unstructured database. Step 5: output the search results. Beneficial effect: retrieval is simple and easy to use.

Description

Method for quickly extracting useful data from a document set
Technical field
The present invention relates to the technical field of information retrieval, and more particularly to a method for quickly extracting useful data from a document set.
Background art
Unstructured database: In general, unstructured data is data whose structure is irregular or incomplete, which has no predefined data model, and which is inconvenient to represent in the two-dimensional tables of a relational database. Examples include document data such as Word and PDF files, pictures, images, audio and video. Unstructured data accounts for a very large share of all data. When unstructured data is managed with a traditional structured database such as a relational database, it is difficult to mine the valuable information it contains.
Chinese word segmentation: Chinese word segmentation is the process of cutting the continuous character sequence of a text into individual words according to a given specification and recombining them into a word sequence.
Word frequency statistics: The number of times a word occurs in a file is called the frequency of that word in the file. At present, word frequency statistics generally use the TF-IDF (term frequency-inverse document frequency) method. TF-IDF is a commonly used weighting technique for information retrieval and text mining that assesses how important a word is to one file in a file set or corpus. The importance of a word increases in proportion to the number of times it appears in the file, but decreases in inverse proportion to the frequency with which it appears across the corpus.
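The weighting described above can be illustrated with a minimal Python sketch. It assumes one common definition, tf = count / document length and idf = log(N / df); the patent does not state which TF-IDF variant is intended. The function name `tf_idf` and the sample corpus are illustrative only.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document in `corpus`.

    `corpus` is a list of token lists. TF is the term count divided by the
    document length; IDF is log(N / df), where df is the number of
    documents that contain the term (an assumed, commonly used variant).
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Illustrative corpus of three already-segmented documents.
corpus = [
    ["landslide", "rainfall", "landslide"],
    ["rainfall", "monitoring"],
    ["landslide", "monitoring", "report"],
]
weights = tf_idf(corpus)
```

With this definition, a word that appears in every document gets weight 0, while a word confined to a single document is weighted most heavily in that document.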
Document retrieval is the process of searching a document database, given an input search term, for the documents that best match that term. As the pace of life and work accelerates and the number of documents and the volume of text keep growing, searching a mass of data at the file level alone is not enough: even after the relevant documents have been found, a great deal of time is still needed to look up the relevant data in them manually, which is inefficient and difficult. For example, geological hazard work has accumulated a large body of documents, which are generally stored whole, with the entire document as the unit. Extracting specific data or information from one or more documents, or determining which paragraphs of which documents contain that data or information and extracting it quickly, has so far been very difficult.
Summary of the invention
In view of this, embodiments of the present invention provide a method for quickly extracting useful data from a document set that is simple to search and easy to use.
An embodiment of the present invention provides a method for quickly extracting useful data from a document set, comprising the following steps:
Step 1: Using a Chinese word segmentation tool, preprocess each document in the document set, where the preprocessing includes word segmentation, part-of-speech tagging and segmentation screening, to obtain the potential search terms in each document and the potential search terms in each paragraph of that document;
Step 2: Perform word frequency statistics on the potential search terms in each paragraph of each document in the document set to obtain the word frequency result of each potential search term in each paragraph, and derive from the paragraph-level results the word frequency result of each potential search term for the corresponding document as a whole;
Step 3: Store the document set processed in steps 1 and 2 using unstructured database technology, establishing one storage collection for each document in the document set. The stored content of each storage collection includes: {the title of the document, the content of the document, the potential search terms of each paragraph in the document and the word frequency result of each of those terms, the potential search terms of the document and the word frequency result of each of those terms, the storage time}. All documents in the document set are thereby converted into an ordered collection in the unstructured database;
Step 4: Input a search term and run the search against the ordered collection in the unstructured database;
Step 5: Output the search results according to the match between the search term and the potential search terms and the word frequency results of the potential search terms.
Further, the potential search terms include nouns, verbs and quantifiers.
Further, the segmentation screening in step 1 removes the non-potential search terms from the words obtained after segmentation and part-of-speech tagging; the non-potential search terms include conjunctions, adverbs and modal particles.
Further, in step 5, the output search results include at least one result set, and the content of each result set includes: {the title of the document, the storage time, the content of each paragraph in the document that contains the search term}.
Further, the result sets are arranged in descending order of the document-level word frequency result of the potential search term.
Further, within each result set, the paragraphs containing the search term are arranged in the order in which they appear in the document.
Further, the content of the result set also includes: {the storage location, the number of occurrences of the search term in each paragraph that contains it}.
Further, the document set is a large document set on geological hazards.
Further, the Chinese word segmentation tool is a segmentation dictionary, and the segmentation algorithm used in step 1 is Jieba segmentation, Word segmentation or Pangu segmentation.
Further, the word frequency statistics in step 2 use the TF-IDF method, and the unstructured database is a MongoDB, HBase or Redis database.
The beneficial effects of the technical solution provided by the embodiments of the present invention are as follows: the method overcomes the difficulty of determining whether useful data or information exists in a large body of documents, where it is located, and how to extract it quickly. It enables users to quickly extract the data or information they need from a large number of geological hazard documents, and provides strong support and service for geological hazard data management, data analysis, data mining, data fusion and big data processing.
Brief description of the drawings
Fig. 1 is a block diagram of the method of an embodiment of the present invention for quickly extracting useful data from a document set;
Fig. 2 is an example diagram of a result set.
Detailed description
To make the objectives, technical solutions and advantages of the present invention clearer, embodiments of the invention are described further below with reference to the accompanying drawings.
Referring to Fig. 1, an embodiment of the present invention provides a method for quickly extracting useful data from a document set, comprising the following steps:
Step 1: Using a Chinese word segmentation tool, preprocess each document in the document set, where the preprocessing includes word segmentation, part-of-speech tagging and segmentation screening, to obtain the potential search terms in each document and the potential search terms in each paragraph of that document.
The Chinese word segmentation tool is a segmentation dictionary, for example a Chinese dictionary with part-of-speech tags compiled from People's Daily statistics. The segmentation algorithm used in step 1 is Jieba segmentation, Word segmentation or Pangu segmentation; a suitable algorithm is selected as needed.
The potential search terms include nouns, verbs and quantifiers, and may also include adjectives and other words that a user might search for. The segmentation screening in step 1 removes the non-potential search terms from the words obtained after segmentation and part-of-speech tagging; the non-potential search terms include conjunctions, adverbs, modal particles and other words that users cannot, or are unlikely to, use for searching. In this embodiment the document set is a large document set on geological hazards, but the method is not limited to this.
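The segmentation screening of step 1 can be sketched in Python as a filter over tokens that have already been segmented and tagged. This is a sketch under assumptions: the input is taken to be (word, tag) pairs from whichever segmenter is chosen, and the tag prefixes (n, v, m, q, a kept; c, d, y dropped) follow a commonly used ICTCLAS-style convention that the patent itself does not specify. `KEEP_PREFIXES`, `potential_terms` and the English sample words are illustrative.

```python
# Assumed tag prefixes: n = noun, v = verb, m = numeral, q = measure word,
# a = adjective. Conjunctions (c), adverbs (d) and modal particles (y)
# are not listed and are therefore screened out.
KEEP_PREFIXES = ("n", "v", "m", "q", "a")

def potential_terms(tagged_paragraphs):
    """Screen tagged tokens paragraph by paragraph.

    `tagged_paragraphs` is a list of paragraphs, each a list of
    (word, tag) pairs. Returns (per_paragraph_terms, document_terms).
    """
    per_paragraph = []
    for paragraph in tagged_paragraphs:
        kept = [word for word, tag in paragraph if tag.startswith(KEEP_PREFIXES)]
        per_paragraph.append(kept)
    # Document-level terms: unique words in order of first appearance.
    document_terms = list(dict.fromkeys(w for kept in per_paragraph for w in kept))
    return per_paragraph, document_terms

# Illustrative input: two paragraphs of one document, already tagged.
tagged = [
    [("landslide", "n"), ("and", "c"), ("rainfall", "n"), ("often", "d"), ("occur", "v")],
    [("rainfall", "n"), ("reached", "v"), ("120", "m"), ("mm", "q"), ("ah", "y")],
]
per_paragraph, document_terms = potential_terms(tagged)
```

Repeated words are kept in the per-paragraph lists on purpose, so that the word frequency statistics of step 2 can count them.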
Step 2: Perform word frequency statistics on the potential search terms in each paragraph of each document in the document set to obtain the word frequency result of each potential search term in each paragraph, and derive from the paragraph-level results the word frequency result of each potential search term for the corresponding document as a whole.
Preferably, the word frequency statistics in step 2 use the TF-IDF method, and the unstructured database is a MongoDB, HBase or Redis database.
For example, suppose the potential search terms of document 1 are "computer", "data" and "keyboard", and the potential search terms of paragraph N of document 1 are "computer" and "data". After the word frequency statistics of step 2, "computer" and "data" occur 5 and 9 times respectively in paragraph N, and "computer", "data" and "keyboard" occur 15, 31 and 92 times respectively in document 1. The word frequency result for paragraph N of document 1 is therefore: [document 1 paragraph N, computer, 5]; [document 1 paragraph N, data, 9]. The word frequency result for document 1 is: [document 1, computer, 15]; [document 1, data, 31]; [document 1, keyboard, 92].
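The paragraph-to-document aggregation in this example can be sketched in Python with raw counts. The first paragraph below reproduces the counts given for paragraph N (computer 5, data 9); the second paragraph and the function name `count_terms` are invented for illustration, so the document totals differ from the 15, 31 and 92 quoted above.

```python
from collections import Counter

def count_terms(paragraph_terms):
    """Count potential search terms per paragraph, then sum per document.

    `paragraph_terms` is a list of term lists, one per paragraph.
    Returns (per_paragraph_counts, document_counts).
    """
    per_paragraph = [Counter(terms) for terms in paragraph_terms]
    # The document-level result is derived from the paragraph-level results.
    document_total = sum(per_paragraph, Counter())
    return per_paragraph, document_total

paragraph_terms = [
    ["computer"] * 5 + ["data"] * 9,   # paragraph N from the example
    ["keyboard"] * 3 + ["data"] * 2,   # a second, made-up paragraph
]
per_paragraph, document_total = count_terms(paragraph_terms)
```

Computing the document-level result as a sum of the paragraph-level results means each document only has to be scanned once.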
Step 3: Store the document set processed in steps 1 and 2 using unstructured database technology, establishing one storage collection for each document in the document set. The stored content of each storage collection includes: {the title of the document, the content of the document, the potential search terms of each paragraph in the document and the word frequency result of each of those terms, the potential search terms of the document and the word frequency result of each of those terms, the storage time}. All documents in the document set are thereby converted into an ordered collection in the unstructured database.
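One possible shape for the per-document storage collection of step 3 is sketched below as a plain Python dictionary, which is the kind of record a document store such as MongoDB accepts. The field names (`title`, `content`, `paragraphs`, `term_counts`, `stored_at`) and the function `build_record` are assumptions made for illustration; the patent lists only the stored content, not a schema.

```python
from datetime import datetime, timezone

def build_record(title, paragraphs, paragraph_counts, stored_at=None):
    """Assemble one storage record for a document.

    `paragraphs` is a list of paragraph texts and `paragraph_counts` the
    matching list of {term: count} dicts from the word frequency step.
    """
    doc_counts = {}
    for counts in paragraph_counts:
        for term, n in counts.items():
            doc_counts[term] = doc_counts.get(term, 0) + n
    return {
        "title": title,
        "content": "\n".join(paragraphs),
        # Paragraph-level terms and counts, kept in document order.
        "paragraphs": [
            {"index": i, "text": text, "term_counts": dict(counts)}
            for i, (text, counts) in enumerate(zip(paragraphs, paragraph_counts))
        ],
        # Document-level terms and counts.
        "term_counts": doc_counts,
        "stored_at": stored_at or datetime.now(timezone.utc).isoformat(),
    }

record = build_record(
    "Report A",
    ["first paragraph text", "second paragraph text"],
    [{"computer": 5, "data": 9}, {"data": 2}],
    stored_at="2017-10-20T00:00:00+00:00",
)
```

Keeping the paragraph index inside each paragraph entry lets a later search return matching paragraphs in their original order without re-reading the document content.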
Step 4: Input a search term and run the search against the ordered collection in the unstructured database.
Step 5: Output the search results according to the match between the search term and the potential search terms and the word frequency results of the potential search terms.
Referring to Fig. 2, in step 5 the output search results include at least one result set, and the content of each result set includes: {the title of the document, the storage time, the content of each paragraph in the document that contains the search term}. The result sets are arranged in descending order of the document-level word frequency result of the potential search term. Within each result set, the paragraphs containing the search term are arranged in the order in which they appear in the document. The content of the result set may also include: {the storage location, the number of occurrences of the search term in each paragraph that contains it}.
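Steps 4 and 5 can be sketched in Python over an in-memory list that stands in for the ordered collection. The record layout, the function name `search` and the sample data are hypothetical, and the sketch assumes exact matching between the search term and the stored potential search terms and raw counts as the word frequency result.

```python
def search(records, term):
    """Return result sets for `term`, best-matching documents first."""
    results = []
    for record in records:
        doc_count = record["term_counts"].get(term, 0)
        if doc_count == 0:
            continue  # the term is not a potential search term of this document
        # Matching paragraphs stay in document order because the stored
        # paragraph list is already ordered.
        hits = [
            {"index": p["index"], "text": p["text"], "count": p["term_counts"][term]}
            for p in record["paragraphs"]
            if p["term_counts"].get(term, 0) > 0
        ]
        results.append({
            "title": record["title"],
            "stored_at": record["stored_at"],
            "doc_count": doc_count,
            "paragraphs": hits,
        })
    # Descending order of the document-level word frequency result.
    results.sort(key=lambda r: r["doc_count"], reverse=True)
    return results

records = [
    {"title": "Report A", "stored_at": "2017-10-20",
     "term_counts": {"landslide": 3, "rainfall": 1},
     "paragraphs": [
         {"index": 0, "text": "A0", "term_counts": {"landslide": 1}},
         {"index": 1, "text": "A1", "term_counts": {"rainfall": 1}},
         {"index": 2, "text": "A2", "term_counts": {"landslide": 2}},
     ]},
    {"title": "Report B", "stored_at": "2017-10-21",
     "term_counts": {"landslide": 7},
     "paragraphs": [{"index": 0, "text": "B0", "term_counts": {"landslide": 7}}]},
    {"title": "Report C", "stored_at": "2017-10-22",
     "term_counts": {"rainfall": 4},
     "paragraphs": [{"index": 0, "text": "C0", "term_counts": {"rainfall": 4}}]},
]
results = search(records, "landslide")
```

Each returned entry carries the title, the storage time, the matching paragraphs and the per-paragraph count, which corresponds to the result set content described above.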
The beneficial effects of the technical solution provided by the embodiments of the present invention are as follows: the method overcomes the difficulty of determining whether useful data or information exists in a large body of documents, where it is located, and how to extract it quickly. It enables users to quickly extract the data or information they need from a large number of geological hazard documents, and provides strong support and service for geological hazard data management, data analysis, data mining, data fusion and big data processing.
Herein, directional terms such as front, rear, upper and lower are defined by the positions of the parts in the drawings and their positions relative to one another, and are used only for clarity and convenience in describing the technical solution. It should be understood that the use of these directional terms does not limit the scope claimed by this application.
Where no conflict arises, the embodiments described above and the features within them may be combined with one another.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

  1. A method for quickly extracting useful data from a document set, characterized by comprising the following steps:
    Step 1: Using a Chinese word segmentation tool, preprocess each document in the document set, where the preprocessing includes word segmentation, part-of-speech tagging and segmentation screening, to obtain the potential search terms in each document and the potential search terms in each paragraph of that document;
    Step 2: Perform word frequency statistics on the potential search terms in each paragraph of each document in the document set to obtain the word frequency result of each potential search term in each paragraph, and derive from the paragraph-level results the word frequency result of each potential search term for the corresponding document as a whole;
    Step 3: Store the document set processed in steps 1 and 2 using unstructured database technology, establishing one storage collection for each document in the document set, the stored content of each storage collection including: {the title of the document, the content of the document, the potential search terms of each paragraph in the document and the word frequency result of each of those terms, the potential search terms of the document and the word frequency result of each of those terms, the storage time}, so that all documents in the document set are converted into an ordered collection in the unstructured database;
    Step 4: Input a search term and run the search against the ordered collection in the unstructured database;
    Step 5: Output the search results according to the match between the search term and the potential search terms and the word frequency results of the potential search terms.
  2. The method for quickly extracting useful data from a document set according to claim 1, characterized in that the potential search terms include nouns, verbs and quantifiers.
  3. The method for quickly extracting useful data from a document set according to claim 1, characterized in that the segmentation screening in step 1 removes the non-potential search terms from the words obtained after segmentation and part-of-speech tagging, the non-potential search terms including conjunctions, adverbs and modal particles.
  4. The method for quickly extracting useful data from a document set according to claim 1, characterized in that in step 5 the output search results include at least one result set, and the content of each result set includes: {the title of the document, the storage time, the content of each paragraph in the document that contains the search term}.
  5. The method for quickly extracting useful data from a document set according to claim 4, characterized in that the result sets are arranged in descending order of the document-level word frequency result of the potential search term.
  6. The method for quickly extracting useful data from a document set according to claim 4, characterized in that within each result set, the paragraphs containing the search term are arranged in the order in which they appear in the document.
  7. The method for quickly extracting useful data from a document set according to claim 4, characterized in that the content of the result set also includes: {the storage location, the number of occurrences of the search term in each paragraph that contains it}.
  8. The method for quickly extracting useful data from a document set according to claim 1, characterized in that the document set is a large document set on geological hazards.
  9. The method for quickly extracting useful data from a document set according to claim 1, characterized in that the Chinese word segmentation tool is a segmentation dictionary, and the segmentation algorithm used in step 1 is Jieba segmentation, Word segmentation or Pangu segmentation.
  10. The method for quickly extracting useful data from a document set according to claim 1, characterized in that the word frequency statistics in step 2 use the TF-IDF method, and the unstructured database is a MongoDB, HBase or Redis database.
CN201710985840.1A 2017-10-20 2017-10-20 Method for quickly extracting useful data from document set Active CN107861943B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710985840.1A CN107861943B (en) 2017-10-20 2017-10-20 Method for quickly extracting useful data from document set


Publications (2)

Publication Number Publication Date
CN107861943A true CN107861943A (en) 2018-03-30
CN107861943B CN107861943B (en) 2020-03-24

Family

ID=61696544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710985840.1A Active CN107861943B (en) 2017-10-20 2017-10-20 Method for quickly extracting useful data from document set

Country Status (1)

Country Link
CN (1) CN107861943B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101377777A (en) * 2007-09-03 2009-03-04 北京百问百答网络技术有限公司 Automatic inquiring and answering method and system
CN102541901A (en) * 2010-12-26 2012-07-04 上海量明科技发展有限公司 Method and system for identifying and outputting information during document reading
CN103136352A (en) * 2013-02-27 2013-06-05 华中师范大学 Full-text retrieval system based on two-level semantic analysis
CN104679778A (en) * 2013-11-29 2015-06-03 腾讯科技(深圳)有限公司 Search result generating method and device
CN105005556A (en) * 2015-07-29 2015-10-28 成都理工大学 Index keyword extraction method and system based on big geological data
CN105760474A (en) * 2016-02-14 2016-07-13 Tcl集团股份有限公司 Document collection feature word extracting method and system based on position information
CN106649863A (en) * 2016-12-30 2017-05-10 天津市测绘院 Non-structured data management method and apparatus


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
常璐 等: "搜索引擎的几种常用排序算法", 《图书情报工作》 *
王存宇 等: "面向云存储的非结构化数据存储研究", 《计算机时代》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784689A (en) * 2018-12-28 2019-05-21 远光软件股份有限公司 A kind of power grid infrastructure project method for processing report data
CN109784689B (en) * 2018-12-28 2022-03-15 远光软件股份有限公司 Power grid infrastructure project report data processing method

Also Published As

Publication number Publication date
CN107861943B (en) 2020-03-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant