CN107861943A - Method for quickly extracting useful data from a document set - Google Patents
Method for quickly extracting useful data from a document set
- Publication number
- CN107861943A (application CN201710985840.1A)
- Authority
- CN
- China
- Prior art keywords
- document
- term
- word
- paragraph
- document sets
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
Abstract
The present invention provides a method for quickly extracting useful data from a document set, comprising the following steps: 1: perform word segmentation to obtain the potential search terms in each document and in each paragraph of the document; 2: perform word-frequency statistics to obtain the word-frequency result of each potential search term in each paragraph and the word-frequency result of the potential search terms of the document as a whole; 3: store the results using unstructured database technology, so that all documents in the document set are converted into an ordered set in the unstructured database; 4: input a search term and run the search in the unstructured database holding the ordered set; 5: output the search results. Beneficial effects: retrieval is simple and the method is easy to use.
Description
Technical field
The present invention relates to the technical field of information retrieval, and in particular to a method for quickly extracting useful data from a document set.
Background
Unstructured database: In general, unstructured data is data whose structure is irregular or incomplete, which has no predefined data model, and which is inconvenient to represent in the two-dimensional tables of a relational database. Examples include Word and PDF documents, pictures, images, audio, and video. Unstructured data accounts for a very large share of all data. When unstructured data is managed with traditional structured databases such as relational databases, it is difficult to mine the valuable information it contains.
Chinese word segmentation: Chinese word segmentation is the process of cutting the continuous character sequence of a text into individual words according to a given specification and recombining them into a word sequence.
Word-frequency statistics: The number of times a word occurs in a file is called the word frequency of that word in the file. At present, word-frequency statistics generally use the TF-IDF (term frequency-inverse document frequency) method. This is a common weighting technique in information retrieval and text mining, used to assess how important a word is to one file within a file set or corpus. The importance of a word increases in proportion to the number of times it occurs in the file, but decreases as the number of files in the corpus that contain it grows.
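The weighting described above can be illustrated with a minimal Python sketch. It assumes the common textbook form of TF-IDF (term count divided by document length, multiplied by the logarithm of the number of documents over the number of documents containing the term); the patent does not specify which variant is used, and the function name `tf_idf` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document.

    docs is a list of token lists (text that has already been segmented).
    tf  = occurrences of the word in the document / tokens in the document
    idf = log(number of documents / number of documents containing the word)
    This is one common variant, chosen here as an assumption.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears at least once.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical, already-segmented sample documents.
docs = [
    ["landslide", "rainfall", "landslide", "slope"],
    ["rainfall", "monitoring", "data"],
    ["slope", "rainfall", "survey"],
]
weights = tf_idf(docs)
```

In this sample, "rainfall" occurs in every document, so its weight is zero everywhere, while "landslide" occurs twice in the first document and nowhere else, so it receives the highest weight there.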
Document retrieval is the process of searching a document database for the documents that best match an input search term. As the pace of life and work keeps accelerating and the number of documents and the volume of text keep growing, searching only at the file level in massive data means that, even once relevant documents are found, a great deal of time must still be spent manually locating the relevant data inside them, which is extremely inefficient and difficult. For example, geological disaster work has accumulated a large amount of documentation, which is generally stored whole, with the entire document as the unit. Extracting specific data or information from one or more documents, or determining which specific paragraphs of one or several documents contain that data or information and quickly extracting it, has so far been very difficult.
Summary of the invention
In view of this, embodiments of the present invention provide a method for quickly extracting useful data from a document set that is simple to search with and easy to use.
An embodiment of the present invention provides a method for quickly extracting useful data from a document set, comprising the following steps:
Step 1: Using a Chinese word segmentation tool, preprocess each document in the document set, the preprocessing including word segmentation, part-of-speech tagging, and segmentation screening, to obtain the potential search terms in each document and in each paragraph of the document;
Step 2: Perform word-frequency statistics on the potential search terms in each paragraph of each document in the document set to obtain the word-frequency result of each potential search term in each paragraph, and derive from the paragraph-level results the word-frequency result of the potential search terms of the corresponding document as a whole;
Step 3: Using unstructured database technology, store the document set processed in Step 1 and Step 2, establishing one storage set for each document in the document set. The content of each storage set includes: {document title, document content, potential search terms of each paragraph in the document and their word-frequency results, potential search terms of the document and their word-frequency results, storage time}. All documents in the document set are thereby converted into an ordered set in the unstructured database;
Step 4: Input a search term and run the search in the unstructured database holding the ordered set;
Step 5: Output the search results according to the match between the search term and the potential search terms, and the word-frequency results of the potential search terms.
Further, the potential search terms include nouns, verbs, and quantifiers.
Further, the segmentation screening in Step 1 removes non-potential search terms from the words produced by segmentation and part-of-speech tagging; the non-potential search terms include conjunctions, adverbs, and modal particles.
Further, in Step 5, the output of the search results includes at least one result set, and the content of each result set includes: {document title, storage time, content of each paragraph in the document that contains the search term}.
Further, the result sets are arranged in descending order of the word-frequency result of the potential search term in the document.
Further, within each result set, the paragraphs containing the search term are arranged in the order in which they appear in the document.
Further, the content of the result set also includes: {storage location, number of occurrences of the search term in each paragraph containing it}.
Further, the document set is a large geological-disaster document set.
Further, the Chinese word segmentation tool is a segmentation dictionary, and the segmentation algorithm used in Step 1 is Jieba segmentation, Word segmentation, or Pangu segmentation.
Further, the word-frequency statistics in Step 2 use the TF-IDF method, and the unstructured database is MongoDB, HBase, or Redis.
The beneficial effects of the technical solution provided by the embodiments of the present invention are as follows: the method overcomes the difficulty of determining whether useful data or information exists in a large number of documents, where it is located, and how to extract it quickly. It lets users quickly extract the useful data or information they need from a large number of geological disaster documents, and provides strong support and services for geological disaster data management, data analysis, data mining, data fusion, big data processing, and the like.
Brief description of the drawings
Fig. 1 is a block diagram of the method for quickly extracting useful data from a document set according to an embodiment of the present invention;
Fig. 2 is an example of a result set.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described further below with reference to the accompanying drawings.
Referring to Fig. 1, an embodiment of the present invention provides a method for quickly extracting useful data from a document set, comprising the following steps:
Step 1: Using a Chinese word segmentation tool, preprocess each document in the document set, the preprocessing including word segmentation, part-of-speech tagging, and segmentation screening, to obtain the potential search terms in each document and in each paragraph of the document.
The Chinese word segmentation tool is a segmentation dictionary, for example the part-of-speech-tagged Chinese dictionary compiled from People's Daily. The segmentation algorithm used in Step 1 is Jieba segmentation, Word segmentation, or Pangu segmentation; a suitable algorithm is selected as needed.
The potential search terms include nouns, verbs, and quantifiers, and may also include adjectives and other words a user might search for. The segmentation screening in Step 1 removes non-potential search terms from the words produced by segmentation and part-of-speech tagging; non-potential search terms include conjunctions, adverbs, modal particles, and other words that users cannot, or are unlikely to, use for searching. In this embodiment the document set is a large geological-disaster document set, but the method is not limited to this.
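Step 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: instead of one of the named segmenters it uses a toy forward-maximum-matching segmenter over a tiny hand-made dictionary, the part-of-speech tag letters are a convention chosen for the example, paragraphs are assumed to be separated by newlines, and all names (`DICTIONARY`, `segment`, `potential_terms`) are hypothetical.

```python
# Toy segmentation dictionary mapping words to part-of-speech tags.
# All entries and tag letters are assumptions made for this sketch:
# n = noun, v = verb, m = quantifier, c = conjunction, d = adverb, y = modal particle.
DICTIONARY = {
    "滑坡": "n", "泥石流": "n", "监测": "v", "发生": "v",
    "三处": "m", "和": "c", "经常": "d", "吗": "y",
}
KEEP_TAGS = {"n", "v", "m"}  # parts of speech kept as potential search terms
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def segment(text):
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in DICTIONARY:
                tokens.append((word, DICTIONARY[word]))
                break
        else:
            # No dictionary word starts here: emit one character tagged "x",
            # which the screening step below discards.
            word = text[i]
            tokens.append((word, "x"))
        i += len(word)
    return tokens


def potential_terms(document):
    """Return (terms per paragraph, terms of the whole document), duplicates kept."""
    per_paragraph = [
        [word for word, tag in segment(paragraph) if tag in KEEP_TAGS]
        for paragraph in document.split("\n")
    ]
    whole_document = [word for terms in per_paragraph for word in terms]
    return per_paragraph, whole_document


doc = "滑坡和泥石流经常发生\n监测三处滑坡"
per_paragraph, whole_document = potential_terms(doc)
```

Duplicates are deliberately kept in the returned lists so that a later counting step can derive word frequencies from them.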
Step 2: Perform word-frequency statistics on the potential search terms in each paragraph of each document in the document set to obtain the word-frequency result of each potential search term in each paragraph, and derive from the paragraph-level results the word-frequency result of the potential search terms of the corresponding document as a whole.
Preferably, the word-frequency statistics in Step 2 use the TF-IDF method, and the unstructured database is MongoDB, HBase, or Redis.
For example, the potential search terms of document 1 are computer, data, and keyboard, and the potential search terms of paragraph N of document 1 are computer and data. After the word-frequency statistics of Step 2, the potential search terms computer and data occur 5 and 9 times respectively in paragraph N of document 1, and the potential search terms computer, data, and keyboard occur 15, 31, and 92 times respectively in document 1. The word-frequency result for paragraph N of document 1 is therefore: [document 1 paragraph N, computer, 5]; [document 1 paragraph N, data, 9]. The word-frequency result for document 1 is: [document 1, computer, 15]; [document 1, data, 31]; [document 1, keyboard, 92].
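The paragraph-level and document-level counts in this example can be sketched with Python's `collections.Counter`. The sketch counts raw occurrences only (the optional TF-IDF weighting is left out), and the function names and the small sample input are hypothetical; the record layout mirrors the bracketed triples used in the example above.

```python
from collections import Counter


def count_terms(paragraph_terms):
    """paragraph_terms: one list of potential search terms per paragraph.

    Returns (list of per-paragraph Counters, Counter for the whole document).
    The document-level result is obtained by summing the paragraph-level results.
    """
    per_paragraph = [Counter(terms) for terms in paragraph_terms]
    document_total = sum(per_paragraph, Counter())
    return per_paragraph, document_total


def as_records(doc_name, per_paragraph, document_total):
    """Format the counts as [location, term, count] triples."""
    paragraph_records = [
        [f"{doc_name} paragraph {n}", term, count]
        for n, counts in enumerate(per_paragraph, start=1)
        for term, count in counts.items()
    ]
    document_records = [[doc_name, term, count] for term, count in document_total.items()]
    return paragraph_records, document_records


# Hypothetical potential search terms of a two-paragraph document.
paragraph_terms = [
    ["computer", "data", "computer"],
    ["keyboard", "data", "keyboard", "keyboard"],
]
per_paragraph, document_total = count_terms(paragraph_terms)
paragraph_records, document_records = as_records("document 1", per_paragraph, document_total)
```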
Step 3: Using unstructured database technology, store the document set processed in Step 1 and Step 2, establishing one storage set for each document in the document set. The content of each storage set includes: {document title, document content, potential search terms of each paragraph in the document and their word-frequency results, potential search terms of the document and their word-frequency results, storage time}. All documents in the document set are thereby converted into an ordered set in the unstructured database.
Step 4: Input a search term and run the search in the unstructured database holding the ordered set.
Step 5: Output the search results according to the match between the search term and the potential search terms, and the word-frequency results of the potential search terms.
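The storage set of Step 3 can be pictured as one JSON-like record per document. The following Python sketch builds such records with plain dictionaries and keeps them in a list that stands in for the ordered set; the field names, the `build_record` helper, and the sample data are assumptions for illustration, and no real database driver is used. In a document store such as MongoDB, a record of this shape would typically become one document in a collection.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def build_record(title, paragraphs, paragraph_terms):
    """Build one storage set for a document (all field names are hypothetical).

    paragraphs:      list of paragraph texts, in document order
    paragraph_terms: list of potential-search-term lists, one per paragraph
    """
    document_counts = Counter()
    for terms in paragraph_terms:
        document_counts.update(terms)
    return {
        "title": title,
        "content": paragraphs,
        # Per-paragraph potential search terms with their word frequencies.
        "paragraph_terms": [dict(Counter(terms)) for terms in paragraph_terms],
        # Document-level potential search terms with their word frequencies.
        "document_terms": dict(document_counts),
        "stored_at": datetime.now(timezone.utc).isoformat(),
    }


# A plain list stands in for the ordered set held by the unstructured database.
collection = [
    build_record(
        "Report A",
        ["Rainfall triggered a landslide.", "The landslide was surveyed."],
        [["rainfall", "landslide"], ["landslide"]],
    ),
    build_record("Report B", ["Rainfall data only."], [["rainfall", "data"]]),
]

# Every field is JSON-serializable, so the record can be handed to a document store.
serialized = json.dumps(collection[0], ensure_ascii=False)
```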
Referring to Fig. 2, in Step 5 the output of the search results includes at least one result set, and the content of each result set includes: {document title, storage time, content of each paragraph in the document that contains the search term}. The result sets are arranged in descending order of the word-frequency result of the potential search term in the document. Within each result set, the paragraphs containing the search term are arranged in the order in which they appear in the document. The content of the result set may also include: {storage location, number of occurrences of the search term in each paragraph containing it}.
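Steps 4 and 5 can be sketched as a lookup over stored records followed by sorting. The sketch below assumes exact matching between the search term and the stored potential search terms, raw counts as the word-frequency result, and an in-memory list in place of the database; the record fields and the `search` function are hypothetical.

```python
def search(collection, term):
    """Return result sets for documents containing term, highest document count first."""
    results = []
    for record in collection:
        doc_count = record["document_terms"].get(term, 0)
        if doc_count == 0:
            continue  # the term is not a potential search term of this document
        # Paragraphs are visited in document order, so hits keep that order.
        hits = [
            {"paragraph": n, "text": text, "count": counts[term]}
            for n, (text, counts) in enumerate(
                zip(record["content"], record["paragraph_terms"]), start=1
            )
            if term in counts
        ]
        results.append({
            "title": record["title"],
            "stored_at": record["stored_at"],
            "term_count": doc_count,
            "paragraphs": hits,
        })
    # Descending order of the document-level word frequency.
    results.sort(key=lambda result: result["term_count"], reverse=True)
    return results


# Hypothetical stored records with precomputed word frequencies.
collection = [
    {
        "title": "Survey A",
        "stored_at": "2017-10-20T00:00:00+00:00",
        "content": ["Landslide risk rose after rainfall.", "Monitoring continued.",
                    "A second landslide followed."],
        "paragraph_terms": [{"landslide": 1, "rainfall": 1}, {"monitoring": 1},
                            {"landslide": 1}],
        "document_terms": {"landslide": 2, "rainfall": 1, "monitoring": 1},
    },
    {
        "title": "Report B",
        "stored_at": "2017-10-21T00:00:00+00:00",
        "content": ["Each landslide was mapped, and the largest landslide and the "
                    "newest landslide were fenced."],
        "paragraph_terms": [{"landslide": 3}],
        "document_terms": {"landslide": 3},
    },
    {
        "title": "Note C",
        "stored_at": "2017-10-22T00:00:00+00:00",
        "content": ["Rainfall data only."],
        "paragraph_terms": [{"rainfall": 1, "data": 1}],
        "document_terms": {"rainfall": 1, "data": 1},
    },
]

results = search(collection, "landslide")
```

Here "Report B" is listed before "Survey A" because the term occurs more often in it, "Note C" is omitted, and within "Survey A" only paragraphs 1 and 3 are returned, in that order.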
The beneficial effects of the technical solution provided by the embodiments of the present invention are as follows: the method overcomes the difficulty of determining whether useful data or information exists in a large number of documents, where it is located, and how to extract it quickly. It lets users quickly extract the useful data or information they need from a large number of geological disaster documents, and provides strong support and services for geological disaster data management, data analysis, data mining, data fusion, big data processing, and the like.
In this document, directional terms such as front, rear, upper, and lower are defined by the positions of the components in the drawings and their positions relative to one another, solely for clarity and convenience in describing the technical solution. It should be understood that the use of these directional terms does not limit the scope claimed by this application.
Where no conflict arises, the embodiments described above and the features in those embodiments may be combined with one another.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
- 1. A method for quickly extracting useful data from a document set, characterized by comprising the following steps: Step 1: using a Chinese word segmentation tool, preprocessing each document in the document set, the preprocessing including word segmentation, part-of-speech tagging, and segmentation screening, to obtain the potential search terms in each document and in each paragraph of the document; Step 2: performing word-frequency statistics on the potential search terms in each paragraph of each document in the document set to obtain the word-frequency result of each potential search term in each paragraph, and deriving from the paragraph-level results the word-frequency result of the potential search terms of the corresponding document as a whole; Step 3: using unstructured database technology, storing the document set processed in Step 1 and Step 2, and establishing one storage set for each document in the document set, the content of each storage set including: {document title, document content, potential search terms of each paragraph in the document and their word-frequency results, potential search terms of the document and their word-frequency results, storage time}, so that all documents in the document set are converted into an ordered set in the unstructured database; Step 4: inputting a search term and running the search in the unstructured database holding the ordered set; Step 5: outputting the search results according to the match between the search term and the potential search terms, and the word-frequency results of the potential search terms.
- 2. The method for quickly extracting useful data from a document set according to claim 1, characterized in that the potential search terms include nouns, verbs, and quantifiers.
- 3. The method for quickly extracting useful data from a document set according to claim 1, characterized in that the segmentation screening in Step 1 removes non-potential search terms from the words produced by segmentation and part-of-speech tagging, the non-potential search terms including conjunctions, adverbs, and modal particles.
- 4. The method for quickly extracting useful data from a document set according to claim 1, characterized in that in Step 5 the output of the search results includes at least one result set, and the content of each result set includes: {document title, storage time, content of each paragraph in the document that contains the search term}.
- 5. The method for quickly extracting useful data from a document set according to claim 4, characterized in that the result sets are arranged in descending order of the word-frequency result of the potential search term in the document.
- 6. The method for quickly extracting useful data from a document set according to claim 4, characterized in that within each result set the paragraphs containing the search term are arranged in the order in which they appear in the document.
- 7. The method for quickly extracting useful data from a document set according to claim 4, characterized in that the content of the result set also includes: {storage location, number of occurrences of the search term in each paragraph containing it}.
- 8. The method for quickly extracting useful data from a document set according to claim 1, characterized in that the document set is a large geological-disaster document set.
- 9. The method for quickly extracting useful data from a document set according to claim 1, characterized in that the Chinese word segmentation tool is a segmentation dictionary, and the segmentation algorithm used in Step 1 is Jieba segmentation, Word segmentation, or Pangu segmentation.
- 10. The method for quickly extracting useful data from a document set according to claim 1, characterized in that the word-frequency statistics in Step 2 use the TF-IDF method, and the unstructured database is MongoDB, HBase, or Redis.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710985840.1A CN107861943B (en) | 2017-10-20 | 2017-10-20 | Method for quickly extracting useful data from document set |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710985840.1A CN107861943B (en) | 2017-10-20 | 2017-10-20 | Method for quickly extracting useful data from document set |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107861943A (en) | 2018-03-30 |
CN107861943B (en) | 2020-03-24 |
Family
ID=61696544
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710985840.1A Active CN107861943B (en) | 2017-10-20 | 2017-10-20 | Method for quickly extracting useful data from document set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107861943B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109784689A (en) * | 2018-12-28 | 2019-05-21 | 远光软件股份有限公司 | A kind of power grid infrastructure project method for processing report data |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101377777A (en) * | 2007-09-03 | 2009-03-04 | 北京百问百答网络技术有限公司 | Automatic inquiring and answering method and system |
CN102541901A (en) * | 2010-12-26 | 2012-07-04 | 上海量明科技发展有限公司 | Method and system for identifying and outputting information during document reading |
CN103136352A (en) * | 2013-02-27 | 2013-06-05 | 华中师范大学 | Full-text retrieval system based on two-level semantic analysis |
CN104679778A (en) * | 2013-11-29 | 2015-06-03 | 腾讯科技(深圳)有限公司 | Search result generating method and device |
CN105005556A (en) * | 2015-07-29 | 2015-10-28 | 成都理工大学 | Index keyword extraction method and system based on big geological data |
CN105760474A (en) * | 2016-02-14 | 2016-07-13 | Tcl集团股份有限公司 | Document collection feature word extracting method and system based on position information |
CN106649863A (en) * | 2016-12-30 | 2017-05-10 | 天津市测绘院 | Non-structured data management method and apparatus |
Non-Patent Citations (2)
Title |
---|
常璐 等: "搜索引擎的几种常用排序算法", 《图书情报工作》 * |
王存宇 等: "面向云存储的非结构化数据存储研究", 《计算机时代》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109784689A (en) * | 2018-12-28 | 2019-05-21 | 远光软件股份有限公司 | A kind of power grid infrastructure project method for processing report data |
CN109784689B (en) * | 2018-12-28 | 2022-03-15 | 远光软件股份有限公司 | Power grid infrastructure project report data processing method |
Also Published As
Publication number | Publication date |
---|---|
CN107861943B (en) | 2020-03-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106294320B (en) | A kind of terminology extraction method and system towards academic paper | |
CN103559233B (en) | Network neologisms abstracting method and microblog emotional analysis method and system in microblogging | |
CN104281653B (en) | A kind of opining mining method for millions scale microblogging text | |
CN110059311A (en) | A kind of keyword extracting method and system towards judicial style data | |
CN109960756B (en) | News event information induction method | |
CN106650943A (en) | Auxiliary writing method and apparatus based on artificial intelligence | |
CN107967290A (en) | A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data | |
CN108763348A (en) | A kind of classification improved method of extension short text word feature vector | |
CN106294473B (en) | Entity word mining method, information recommendation method and device | |
CN110750995A (en) | File management method based on user-defined map | |
Hammarfelt | Harvesting footnotes in a rural field: Citation patterns in Swedish literary studies | |
CN103034656B (en) | Chapters and sections content layered approach and device, article content layered approach and device | |
CN107145476A (en) | One kind is based on improvement TF IDF keyword extraction algorithms | |
Maciołek et al. | Cluo: Web-scale text mining system for open source intelligence purposes | |
CN105574004B (en) | A kind of removing duplicate webpages method and apparatus | |
CN107861943A (en) | A kind of method of the rapid extraction useful data from document sets | |
CN103034657B (en) | Documentation summary generates method and apparatus | |
Joseph et al. | An approach to selecting keywords to track on twitter during a disaster. | |
CN110020034B (en) | Information quotation analysis method and system | |
CN104572628B (en) | A kind of science based on syntactic feature defines automatic extraction system and method | |
CN106934007B (en) | Associated information pushing method and device | |
US9886488B2 (en) | Conceptual document analysis and characterization | |
CN108595593A (en) | Meeting research hotspot based on topic model and development trend information analysis method | |
Moradi et al. | Clustering of deep contextualized representations for summarization of biomedical texts | |
Fuller et al. | Structuring, recording, and analyzing historical networks in the china biographical database |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||