CN107861943B - Method for quickly extracting useful data from document set - Google Patents

Publication number
CN107861943B
Authority
CN
China
Prior art keywords
document
word
potential
paragraph
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710985840.1A
Other languages
Chinese (zh)
Other versions
CN107861943A (en)
Inventor
刘军旗
苏爱军
唐辉明
吴冲龙
姚梦辉
滕伟福
王亮清
封瑞雪
赵剑雄
陈根深
邹宗兴
王菁莪
曾雯
张抒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences
Priority to CN201710985840.1A
Publication of CN107861943A
Application granted
Publication of CN107861943B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for quickly extracting useful data from a document set, comprising the following steps: 1: performing word segmentation to obtain the potential search words in each document and in each paragraph of the document; 2: performing word frequency statistics to obtain a word frequency statistical result for each potential search word in each paragraph and a word frequency statistical result for the potential search words of the whole document; 3: storing the results using an unstructured database technology, so that all documents in the document set are converted into an ordered set in an unstructured database; 4: inputting a search word and searching the unstructured database that holds the ordered set; 5: outputting a retrieval result. The advantages are that retrieval is simple and the method is convenient to use.

Description

Method for quickly extracting useful data from document set
Technical Field
The invention relates to the technical field of information retrieval, in particular to a method for quickly extracting useful data from a document set.
Background
Unstructured database: generally speaking, unstructured data is data whose structure is irregular or incomplete, which has no predefined data model, and which is inconvenient to represent in the two-dimensional tables of a relational database. Examples include document data such as Word and PDF files, pictures, images, audio and video. Unstructured data accounts for a large share of all data. When unstructured data is managed with traditional structured databases such as relational databases, it is difficult to conveniently mine the valuable information it contains.
Chinese word segmentation technology: Chinese word segmentation refers to segmenting a sequence of Chinese characters into individual words, that is, recombining a continuous character sequence into a word sequence according to a certain specification.
Word frequency statistics: the number of times a word appears in a document is called the word frequency of that word in the document. At present, word frequency statistics generally use the TF-IDF (term frequency-inverse document frequency) method, a weighting technique commonly used in information retrieval and text mining to assess how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus.
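The TF-IDF weighting described above can be sketched as follows. This is a minimal illustration, assuming the common textbook form (term count divided by document length, multiplied by the logarithm of the number of documents over the number of documents containing the word); the patent does not specify which TF-IDF variant is used, and the function name `tf_idf` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document.

    `documents` is a list of token lists. Assumed variant:
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical, already-segmented documents.
docs = [
    ["landslide", "rainfall", "landslide", "slope"],
    ["rainfall", "reservoir", "slope"],
    ["rainfall", "monitoring"],
]
scores = tf_idf(docs)
```

In this sample, "rainfall" appears in every document and so receives a weight of zero, while "landslide", which is frequent in the first document and absent elsewhere, receives the highest weight there.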
Document retrieval refers to the process of finding, in a document database, the documents that best match an input search term. As the pace of life and work quickens and the number and length of documents grow, a search over mass data returns only whole documents; even once the relevant documents are found, a large amount of time is still needed to locate the relevant data inside them manually, which is extremely inefficient and difficult. For example, geological disaster work has accumulated a large amount of document data, which is generally stored with the whole document as the unit. Quickly extracting a specific piece of data or information from one or more documents, or determining which specific paragraph of those documents contains it, has so far been difficult.
Disclosure of Invention
In view of this, the embodiment of the present invention provides a method for quickly extracting useful data from a document set, which is simple to retrieve and convenient to use.
The embodiment of the invention provides a method for quickly extracting useful data from a document set, which comprises the following steps:
step 1: using a Chinese word segmentation tool to carry out preprocessing including word segmentation, part-of-speech tagging and word segmentation screening on each document in a document set so as to obtain a potential search word in each document and a potential search word in each paragraph in the document;
step 2: performing word frequency statistics on the potential search words in each paragraph in each document in the document set to obtain a word frequency statistical result of each potential search word in each paragraph, and obtaining a word frequency statistical result of the potential search words of the corresponding document whole based on the word frequency statistical result of the paragraphs;
step 3: adopting an unstructured database technology to store the document sets processed in the steps 1 and 2, and establishing a storage set for each document in the document sets, wherein the storage content of each storage set comprises: { the name of a document, the content of the document, the potential search word of each paragraph in the document, the word frequency statistical result of each potential search word, the potential search word of the document, the word frequency statistical result of each potential search word, and the storage time }, so that all documents in the document set are converted into an ordered set in an unstructured database;
step 4: inputting a search word, and implementing search in an unstructured database with an ordered set;
step 5: outputting a retrieval result according to the matching of the retrieval word and the potential retrieval word and the word frequency statistical result of the potential retrieval word.
Further, the potential search terms include nouns, verbs and quantifiers.
Further, the word segmentation screening in step 1 removes non-potential search words from the words obtained after word segmentation and part-of-speech tagging, where the non-potential search words include conjunctions, adverbs and modal particles.
Further, in step 5, the output content of the search result includes at least one result set, and the content of each result set includes: { name of document, storage time, contents of each paragraph in the document with a search term }.
Further, the result sets are sorted in descending order according to the word frequency statistical result of the potential search words of the document.
Further, in each of the result sets, the paragraphs with the search terms are arranged according to the paragraph order of each paragraph in the document.
Further, the content of the result set further includes: { storage location, number of terms per paragraph with term }.
Further, the document set is a large document set of geological disasters.
Further, the Chinese word segmentation tool is a word segmentation dictionary, and the word segmentation algorithm adopted in step 1 is a Chinese Word segmentation algorithm, a Word segmentation algorithm or a Pangu segmentation algorithm.
Further, the method for performing word frequency statistics in step 2 is a TF-IDF method, and the unstructured database is a MongoDB, HBase or Redis database.
The technical scheme provided by the embodiment of the invention has the following beneficial effects: the method for quickly extracting useful data from a document set overcomes the difficulty of determining whether useful data or information exists in a large number of documents, where it is located and how to extract it quickly; it enables a user to quickly extract the required data or information from a large number of geological disaster documents, and it provides strong support for geological disaster data management, data analysis, data mining, data fusion, big data processing and the like.
Drawings
FIG. 1 is a step diagram of the method of the present invention for quickly extracting useful data from a document set;
FIG. 2 is an exemplary graph of a result set.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be further described with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present invention provides a method for rapidly extracting useful data from a document set, including the following steps:
Step 1: preprocess each document in the document set using a Chinese word segmentation tool, the preprocessing comprising word segmentation, part-of-speech tagging and word segmentation screening, to obtain the potential search words in each document and in each paragraph of the document.
The Chinese word segmentation tool is a word segmentation dictionary, such as a Chinese dictionary with part-of-speech tags compiled from the People's Daily. The word segmentation algorithm used in step 1 is a Chinese Word segmentation algorithm, a Word segmentation algorithm or a Pangu segmentation algorithm; a suitable algorithm is selected as required.
The potential search words include nouns, verbs and quantifiers, and may also include adjectives and other words that a user might search for. The word segmentation screening in step 1 removes non-potential search words from the words obtained after word segmentation and part-of-speech tagging. Non-potential search words are words that users cannot search with or are unlikely to search with, such as conjunctions, adverbs and modal particles. In this embodiment, the document set is a large document set on geological disasters, but it is not limited thereto.
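The screening part of step 1 can be sketched as a filter over words that have already been segmented and tagged. This is only an illustration: the input format (lists of `(word, tag)` pairs), the tag names and the function names are assumptions, and a real segmentation tool would supply its own tag set.

```python
# Hypothetical part-of-speech tags; a real segmenter supplies its own tag set.
KEPT_TAGS = {"noun", "verb", "quantifier", "adjective"}


def screen_paragraph(tagged_words):
    """Keep only words whose tag marks them as potential search words."""
    return [word for word, tag in tagged_words if tag in KEPT_TAGS]


def screen_document(tagged_paragraphs):
    """Return (per-paragraph potential search words, document-level word list)."""
    per_paragraph = [screen_paragraph(p) for p in tagged_paragraphs]
    document_words = sorted({w for words in per_paragraph for w in words})
    return per_paragraph, document_words


# Hypothetical output of segmentation and tagging for a two-paragraph document.
tagged = [
    [("landslide", "noun"), ("and", "conjunction"), ("rainfall", "noun"),
     ("often", "adverb"), ("trigger", "verb")],
    [("ten", "numeral"), ("metres", "quantifier"), ("slope", "noun"),
     ("very", "adverb")],
]
per_paragraph, document_words = screen_document(tagged)
```

Conjunctions and adverbs are dropped because their tags are not in `KEPT_TAGS`; the document-level list is simply the union of the per-paragraph lists.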
Step 2: perform word frequency statistics on the potential search words in each paragraph of each document in the document set to obtain a word frequency statistical result for each potential search word in each paragraph, and obtain the word frequency statistical result of the potential search words of the whole document from the word frequency statistical results of its paragraphs.
Preferably, the method for performing word frequency statistics in step 2 is a TF-IDF method, and the unstructured database is a MongoDB, HBase or Redis database.
For example, suppose the potential search words of document 1 are computer, data and keyboard, and the potential search words of the Nth paragraph of document 1 are computer and data. After the word frequency statistics of step 2, the potential search words computer and data occur 5 times and 9 times respectively in the Nth paragraph of document 1, and the potential search words computer, data and keyboard occur 15 times, 31 times and 92 times respectively in document 1 as a whole. The word frequency statistical result of the Nth paragraph of document 1 is therefore: [document 1, paragraph N, computer, 5]; [document 1, paragraph N, data, 9]. The word frequency statistical result of document 1 is: [document 1, computer, 15]; [document 1, data, 31]; [document 1, keyboard, 92].
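The paragraph-then-document counting in this example can be sketched with plain occurrence counts. The sketch assumes each paragraph has already been reduced to its list of potential search words; the function name `count_words` and the small sample data are hypothetical and do not reproduce the figures above.

```python
from collections import Counter


def count_words(paragraphs):
    """Count potential search words per paragraph, then sum them per document."""
    paragraph_counts = [Counter(words) for words in paragraphs]
    document_counts = Counter()
    for counts in paragraph_counts:
        # The document-level result is derived from the paragraph results.
        document_counts.update(counts)
    return paragraph_counts, document_counts


# Hypothetical potential search words of a two-paragraph document.
paragraphs = [
    ["computer", "data", "data"],
    ["keyboard", "data", "computer", "keyboard"],
]
paragraph_counts, document_counts = count_words(paragraphs)

# Print records in the same shape as the example: [document, paragraph, word, count].
for number, counts in enumerate(paragraph_counts, start=1):
    for word, count in counts.items():
        print(["document 1", f"paragraph {number}", word, count])
```

Summing the paragraph counters guarantees that the document-level figures are always consistent with the paragraph-level ones.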
Step 3: store the document set processed in steps 1 and 2 using an unstructured database technology, establishing a storage set for each document in the document set, where the content of each storage set comprises: { the name of the document, the content of the document, the potential search words of each paragraph in the document and the word frequency statistical result of each, the potential search words of the document and the word frequency statistical result of each, and the storage time }, so that all documents in the document set are converted into an ordered set in an unstructured database.
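One way to picture a storage set is as a single nested record per document, of the kind a document-oriented database can store directly. The sketch below builds such a record as a plain Python dictionary and checks that it serializes to JSON; the field names (`name`, `content`, `paragraphs`, `word_counts`, `stored_at`) and the function name are assumptions, not taken from the patent, and no database client is involved.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def build_storage_set(name, paragraph_texts, paragraph_words, stored_at=None):
    """Assemble one storage set (hypothetical field names) for one document."""
    paragraph_counts = [Counter(words) for words in paragraph_words]
    document_counts = Counter()
    for counts in paragraph_counts:
        document_counts.update(counts)
    stored_at = stored_at or datetime.now(timezone.utc)
    return {
        "name": name,
        "content": "\n".join(paragraph_texts),
        # Paragraphs are kept in document order, each with its own word counts.
        "paragraphs": [
            {"index": i, "text": text, "word_counts": dict(counts)}
            for i, (text, counts) in enumerate(
                zip(paragraph_texts, paragraph_counts), start=1)
        ],
        "word_counts": dict(document_counts),
        "stored_at": stored_at.isoformat(),
    }


record = build_storage_set(
    "report.docx",
    ["Rain triggered the landslide.", "The landslide blocked the road."],
    [["rain", "landslide"], ["landslide", "road"]],
    stored_at=datetime(2017, 10, 20, tzinfo=timezone.utc),
)
serialized = json.dumps(record, ensure_ascii=False)
```

Keeping the paragraph-level and document-level counts in the same record means a single lookup returns everything needed to decide both whether a document matches and which of its paragraphs to show.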
Step 4: input a search word and search the unstructured database that holds the ordered set.
Step 5: output a retrieval result according to the match between the search word and the potential search words and the word frequency statistical results of the potential search words.
Referring to FIG. 2, in step 5 the output retrieval result includes at least one result set, and the content of each result set includes: { name of document, storage time, content of each paragraph in the document that contains the search word }. The result sets are sorted in descending order according to the word frequency statistical results of the potential search words of the documents. Within each result set, the paragraphs containing the search word are arranged in the order in which they appear in the document. The content of the result set may further include: { storage location, number of occurrences of the search word in each paragraph containing it }.
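Steps 4 and 5 can be sketched over in-memory records as follows. The record layout and field names (`name`, `stored_at`, `location`, `word_counts`, `paragraphs`) are hypothetical, an exact match between the search word and a stored potential search word is assumed, and a list of dictionaries stands in for the unstructured database.

```python
def search(storage_sets, term):
    """Return result sets for documents whose potential search words include `term`."""
    results = []
    for doc in storage_sets:
        doc_count = doc["word_counts"].get(term, 0)
        if doc_count == 0:
            continue  # the search word is not a potential search word here
        hits = [
            {"paragraph": p["index"], "text": p["text"],
             "count": p["word_counts"][term]}
            for p in doc["paragraphs"]  # stored order is paragraph order
            if term in p["word_counts"]
        ]
        results.append({
            "name": doc["name"],
            "stored_at": doc["stored_at"],
            "location": doc["location"],
            "term_count": doc_count,
            "paragraphs": hits,
        })
    # Descending order by the document-level word frequency of the search word.
    results.sort(key=lambda r: r["term_count"], reverse=True)
    return results


# Hypothetical stored records.
storage_sets = [
    {"name": "survey_b.pdf", "stored_at": "2017-10-20", "location": "docs/2",
     "word_counts": {"landslide": 1, "road": 1},
     "paragraphs": [
         {"index": 1, "text": "The landslide blocked the road.",
          "word_counts": {"landslide": 1, "road": 1}},
     ]},
    {"name": "report_a.docx", "stored_at": "2017-10-20", "location": "docs/1",
     "word_counts": {"landslide": 3, "rain": 1, "monitoring": 1},
     "paragraphs": [
         {"index": 1, "text": "Rain triggered the landslide.",
          "word_counts": {"rain": 1, "landslide": 1}},
         {"index": 2, "text": "Monitoring started in May.",
          "word_counts": {"monitoring": 1}},
         {"index": 3, "text": "The landslide reactivated; the landslide front advanced.",
          "word_counts": {"landslide": 2}},
     ]},
]
results = search(storage_sets, "landslide")
```

Because only matching paragraphs are returned, a user sees the relevant passages directly instead of opening each document and searching it by hand.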
In this document, the terms front, back, upper and lower are used to define the components in the drawings and the positions of the components relative to each other, and are used for clarity and convenience of the technical solution. It is to be understood that the use of the directional terms should not be taken to limit the scope of the claims.
The features of the embodiments and embodiments described herein above may be combined with each other without conflict.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (4)

1. A method for rapidly extracting useful data from a document set is characterized in that: the method comprises the following steps:
step 1: using a Chinese word segmentation tool to perform preprocessing including word segmentation, part-of-speech tagging and word segmentation screening on each document in a document set, and eliminating non-potential search words in words after the word segmentation and the part-of-speech tagging, wherein the non-potential search words comprise conjunctions, adverbs and modal particles, so as to obtain potential search words in each document and potential search words in each paragraph in the document, and the potential search words comprise nouns, verbs and quantifiers;
step 2: performing word frequency statistics on the potential search words in each paragraph in each document in the document set to obtain a word frequency statistical result of each potential search word in each paragraph, and obtaining a word frequency statistical result of the potential search words of the corresponding document whole based on the word frequency statistical result of the paragraphs;
step 3: adopting an unstructured database technology to store the document sets processed in the steps 1 and 2, and establishing a storage set for each document in the document sets, wherein the storage content of each storage set comprises: { the name of a document, the content of the document, the potential search word of each paragraph in the document, the word frequency statistical result of each potential search word, the potential search word of the document, the word frequency statistical result of each potential search word, and the storage time }, so that all documents in the document set are converted into an ordered set in an unstructured database;
step 4: inputting a search word, and implementing search in an unstructured database with an ordered set;
step 5: outputting a retrieval result according to the matching of the retrieval word and the potential retrieval word and the word frequency statistical result of the potential retrieval word, wherein the output content of the retrieval result comprises at least one result set, and the content of each result set comprises: { name of document, storage time, contents of each paragraph in the document with a search term }, the contents of the result set further include: { storage location, number of terms of each paragraph having a term }, and sorting the result sets in descending order according to the word frequency statistical result of the potential terms of the document, and sorting the paragraphs having a term in each of the result sets according to the paragraph order of each paragraph in the document.
2. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that: the document set is a large document set of geological disasters.
3. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that: the Chinese word segmentation tool is a word segmentation dictionary, and the word segmentation algorithm adopted in the word segmentation in the step 1 is a Chinese Word segmentation algorithm, a Word segmentation algorithm or a Pangu segmentation algorithm.
4. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that: the method for carrying out word frequency statistics in the step 2 is a TF-IDF method, and the unstructured database is a MongoDB, HBase or Redis database.
CN201710985840.1A 2017-10-20 2017-10-20 Method for quickly extracting useful data from document set Active CN107861943B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710985840.1A CN107861943B (en) 2017-10-20 2017-10-20 Method for quickly extracting useful data from document set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710985840.1A CN107861943B (en) 2017-10-20 2017-10-20 Method for quickly extracting useful data from document set

Publications (2)

Publication Number Publication Date
CN107861943A CN107861943A (en) 2018-03-30
CN107861943B true CN107861943B (en) 2020-03-24

Family

ID=61696544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710985840.1A Active CN107861943B (en) 2017-10-20 2017-10-20 Method for quickly extracting useful data from document set

Country Status (1)

Country Link
CN (1) CN107861943B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784689B (en) * 2018-12-28 2022-03-15 远光软件股份有限公司 Power grid infrastructure project report data processing method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101377777A (en) * 2007-09-03 2009-03-04 北京百问百答网络技术有限公司 Automatic inquiring and answering method and system
CN105005556A (en) * 2015-07-29 2015-10-28 成都理工大学 Index keyword extraction method and system based on big geological data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541901A (en) * 2010-12-26 2012-07-04 上海量明科技发展有限公司 Method and system for identifying and outputting information during document reading
CN103136352B (en) * 2013-02-27 2016-02-03 华中师范大学 Text retrieval system based on double-deck semantic analysis
CN104679778B (en) * 2013-11-29 2019-03-26 腾讯科技(深圳)有限公司 A kind of generation method and device of search result
CN105760474B (en) * 2016-02-14 2021-02-19 Tcl科技集团股份有限公司 Method and system for extracting feature words of document set based on position information
CN106649863A (en) * 2016-12-30 2017-05-10 天津市测绘院 Non-structured data management method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on unstructured data storage for cloud storage (面向云存储的非结构化数据存储研究); 王存宇 et al.; 《计算机时代》 (Computer Era); 2015-05-31 (No. 5); pp. 13-15 *

Also Published As

Publication number Publication date
CN107861943A (en) 2018-03-30

Similar Documents

Publication Publication Date Title
CN109992645B (en) Data management system and method based on text data
CN103218444B (en) Based on semantic method of Tibetan language webpage text classification
CN106776574B (en) User comment text mining method and device
KR101681109B1 (en) An automatic method for classifying documents by using presentative words and similarity
CN106708940B (en) Method and device for processing pictures
CN108875065B (en) Indonesia news webpage recommendation method based on content
KR102373884B1 (en) Image data processing method for searching images by text
Imani et al. Focus location extraction from political news reports with bias correction
CN115270738A (en) Method and system for generating newspaper and computer storage medium
CN110866086A (en) Article matching system
CN109902152B (en) Method and apparatus for retrieving information
KR101753768B1 (en) A knowledge management system of searching documents on categories by using weights
Karumudi et al. Information retrieval and processing system for news articles in English
Fodil et al. Theme classification of Arabic text: A statistical approach
CN107861943B (en) Method for quickly extracting useful data from document set
Shah et al. An automatic text summarization on Naive Bayes classifier using latent semantic analysis
CN109325096B (en) Knowledge resource search system based on knowledge resource classification
Cvitaš Relation extraction from text documents
Sahani et al. Automatic text categorization of Marathi language documents
US9886488B2 (en) Conceptual document analysis and characterization
Zhu et al. Chinese texts classification system
CN112949287A (en) Hot word mining method, system, computer device and storage medium
Ashqar et al. A Comparative Assessment of Various Embeddings for Keyword Extraction
KR20200078170A (en) Apparatus for classifying products by hierarchical category and method thereof
CN117150046B (en) Automatic task decomposition method and system based on context semantics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant