CN107861943B - Method for quickly extracting useful data from document set - Google Patents
Method for quickly extracting useful data from document set
- Publication number
- CN107861943B (application CN201710985840.1A)
- Authority
- CN
- China
- Prior art keywords
- document
- word
- potential
- paragraph
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for quickly extracting useful data from a document set, comprising the following steps: 1: performing word segmentation to obtain the potential search words in each document and in each paragraph of the document; 2: performing word frequency statistics to obtain the word frequency statistical result of each potential search word in each paragraph and the word frequency statistical result of the potential search words of the whole document; 3: storing the results using an unstructured database technology, so that all documents in the document set are converted into an ordered set in an unstructured database; 4: inputting a search word and searching the unstructured database holding the ordered set; 5: outputting a retrieval result. The advantages are that retrieval is simple and the method is convenient to use.
Description
Technical Field
The invention relates to the technical field of information retrieval, in particular to a method for quickly extracting useful data from a document set.
Background
Unstructured database: generally speaking, unstructured data is data whose structure is irregular or incomplete, which has no predefined data model, and which is inconvenient to represent in the two-dimensional tables of a relational database. Examples include documents such as Word and PDF files, pictures, images, audio and video. Unstructured data accounts for a large proportion of all data. When unstructured data is managed with traditional structured databases such as relational databases, it is difficult to conveniently mine the valuable information it contains.
Chinese word segmentation technology: Chinese word segmentation is the process of segmenting a continuous character sequence in a text into individual words according to a certain specification, that is, recombining the character sequence into a word sequence.
Word frequency statistics technology: the number of times a word appears in a document is called the word frequency of that word in the document. At present, the TF-IDF (term frequency-inverse document frequency) method is generally adopted for word frequency statistics. TF-IDF is a weighting technique commonly used in information retrieval and text mining to assess how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus.
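The patent does not state which TF-IDF variant is meant. One common formulation, given here only as an illustrative assumption, defines the weight of term $t$ in document $d$ of corpus $D$ as follows, where $n_{t,d}$ is the number of occurrences of $t$ in $d$:

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log \frac{|D|}{\left|\{\, d' \in D : t \in d' \,\}\right|}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)
```

Under this definition a term that appears in every document of the corpus has $\mathrm{idf}(t) = \log 1 = 0$, so it carries no weight however often it occurs in a single document.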
Document retrieval refers to the process of finding, in a document database, the documents that best match an input search term. As the pace of social life and work increases and the number of documents and the amount of text grow, retrieval from mass data usually returns only whole documents; even after relevant documents are found, a large amount of time is still needed to manually locate the relevant data within them, which is extremely inefficient and quite difficult. For example, geological disaster work has accumulated a large amount of document data, which is generally stored with the whole document as the unit. Extracting a specific piece of data or information from one or more documents, or determining the specific paragraph of one or more documents in which it is located, and doing so quickly, has so far been difficult.
Disclosure of Invention
In view of this, the embodiment of the present invention provides a method for quickly extracting useful data from a document set, with which retrieval is simple and use is convenient.
The embodiment of the invention provides a method for quickly extracting useful data from a document set, which comprises the following steps:
Step 1: using a Chinese word segmentation tool to carry out preprocessing including word segmentation, part-of-speech tagging and word segmentation screening on each document in a document set so as to obtain a potential search word in each document and a potential search word in each paragraph in the document;
Step 2: performing word frequency statistics on the potential search words in each paragraph in each document in the document set to obtain a word frequency statistical result of each potential search word in each paragraph, and obtaining a word frequency statistical result of the potential search words of the corresponding document as a whole based on the word frequency statistical result of the paragraphs;
Step 3: adopting an unstructured database technology to store the document sets processed in the steps 1 and 2, and establishing a storage set for each document in the document sets, wherein the storage content of each storage set comprises: { the name of a document, the content of the document, the potential search word of each paragraph in the document, the word frequency statistical result of each potential search word, the potential search word of the document, the word frequency statistical result of each potential search word, and the storage time }, so that all documents in the document set are converted into an ordered set in an unstructured database;
Step 4: inputting a search word, and implementing search in an unstructured database with an ordered set;
Step 5: outputting a retrieval result according to the matching of the retrieval word and the potential retrieval word and the word frequency statistical result of the potential retrieval word.
Further, the potential search terms include nouns, verbs and quantifiers.
Further, the word segmentation screening in step 1 removes non-potential search words from the words obtained by word segmentation and part-of-speech tagging, where the non-potential search words include conjunctions, adverbs, and modal particles.
Further, in step 5, the output content of the search result includes at least one result set, and the content of each result set includes: { name of document, storage time, contents of each paragraph in the document with a search term }.
Further, the result sets are sorted in descending order according to the word frequency statistical result of the potential search words of the document.
Further, in each of the result sets, the paragraphs with the search terms are arranged according to the paragraph order of each paragraph in the document.
Further, the content of the result set further includes: { storage location, number of occurrences of the search term in each paragraph that contains it }.
Further, the document set is a large document set of geological disasters.
Further, the Chinese word segmentation tool is a word segmentation dictionary, and the word segmentation algorithm adopted in step 1 is a Chinese word segmentation algorithm, a Word segmentation algorithm or a Pangu segmentation algorithm.
Further, the method for performing word frequency statistics in step 2 is a TF-IDF method, and the unstructured database is a MongoDB, HBase or Redis database.
The technical scheme provided by the embodiment of the invention has the following beneficial effects: the method overcomes the difficulty of determining whether useful data or information exists in a large number of documents, where it is located, and how to extract it quickly; it enables a user to quickly extract the required data or information from a large number of geological disaster documents; and it provides strong support and service for geological disaster data management, data analysis, data mining, data fusion, big data processing and the like.
Drawings
FIG. 1 is a step diagram of the method of the present invention for rapid extraction of useful data from a document set;
FIG. 2 is an exemplary graph of a result set.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be further described with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present invention provides a method for rapidly extracting useful data from a document set, including the following steps:
Step 1: preprocessing each document in the document set using a Chinese word segmentation tool, wherein the preprocessing comprises word segmentation, part-of-speech tagging and word segmentation screening, so as to obtain the potential search words in each document and the potential search words in each paragraph of the document.
The Chinese word segmentation tool is a word segmentation dictionary, such as a Chinese dictionary with part-of-speech labels compiled from the People's Daily corpus. The word segmentation algorithm adopted in step 1 is a Chinese word segmentation algorithm, a Word segmentation algorithm or a Pangu segmentation algorithm. A suitable word segmentation algorithm is selected as required.
The potential search words comprise nouns, verbs and quantifiers, and may also comprise adjectives and other words that a user might use for searching. The word segmentation screening in step 1 removes the non-potential search words from the words obtained by word segmentation and part-of-speech tagging. Non-potential search words are words that users cannot search with or are unlikely to search with, such as conjunctions, adverbs, and modal particles. In this embodiment, the document set is a large document set of geological disasters, but the method is not limited thereto.
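The screening described above can be sketched as a filter over already-tagged tokens. This is a minimal illustration, not the patent's implementation: the single-letter tags (`n`, `v`, `q`, `a` kept; `c`, `d`, `y` dropped) are an assumed ICTCLAS-style tag set, and the function names and the sample tokens are hypothetical. The segmentation and tagging step itself is assumed to have been done by whatever tool is chosen.

```python
# Assumed tag set (not specified in the patent):
# n = noun, v = verb, q = quantifier, a = adjective are kept;
# c = conjunction, d = adverb, y = modal particle are screened out.
KEEP_TAGS = {"n", "v", "q", "a"}


def screen_tokens(tagged_tokens):
    """Keep only the words whose part-of-speech tag marks a potential search word."""
    return [word for word, tag in tagged_tokens if tag in KEEP_TAGS]


def potential_terms(tagged_paragraphs):
    """Return (potential terms per paragraph, potential terms of the whole document)."""
    per_paragraph = [screen_tokens(paragraph) for paragraph in tagged_paragraphs]
    document_terms = {word for terms in per_paragraph for word in terms}
    return per_paragraph, document_terms


# Hypothetical, already segmented and tagged input: one list per paragraph.
tagged_paragraphs = [
    [("滑坡", "n"), ("和", "c"), ("泥石流", "n"), ("发生", "v")],
    [("非常", "d"), ("危险", "a"), ("吗", "y")],
]
per_paragraph, document_terms = potential_terms(tagged_paragraphs)
```

Keeping an allow-list of tags, rather than a deny-list, means that any tag the tool emits but the list does not mention is screened out by default; the opposite choice would be equally consistent with the text.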
Step 2: performing word frequency statistics on the potential search words in each paragraph of each document in the document set to obtain a word frequency statistical result of each potential search word in each paragraph, and obtaining a word frequency statistical result of the potential search words of the corresponding document as a whole based on the word frequency statistical results of the paragraphs.
Preferably, the method for performing word frequency statistics in step 2 is a TF-IDF method, and the unstructured database is a MongoDB, HBase or Redis database.
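As a rough sketch of what TF-IDF weighting could look like here, the following computes a weight for every term of every document. The formulation used (relative term frequency multiplied by the natural logarithm of the inverse document frequency) is one common variant chosen as an assumption; the patent does not specify which variant it uses, and the function name and sample data are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = sum(counts.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical corpus of two already-screened documents.
corpus = [
    ["landslide", "data", "data"],
    ["data", "rainfall"],
]
weights = tf_idf(corpus)
```

In this toy corpus "data" occurs in both documents, so its weight is zero everywhere, while "landslide" and "rainfall" each receive a positive weight in the one document that contains them.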
For example, suppose the potential search terms of document 1 are computer, data and keyboard, and the potential search terms of the Nth paragraph of document 1 are computer and data. After the word frequency statistics of step 2, the potential search terms computer and data occur 5 times and 9 times, respectively, in the Nth paragraph of document 1, and the potential search terms computer, data and keyboard occur 15 times, 31 times and 92 times, respectively, in document 1 as a whole. The word frequency statistical result of the Nth paragraph of document 1 is therefore: [document 1, paragraph N, computer, 5]; [document 1, paragraph N, data, 9]. The word frequency statistical result of document 1 is: [document 1, computer, 15]; [document 1, data, 31]; [document 1, keyboard, 92].
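The paragraph-then-document counting in this example can be sketched with plain counters. This is an illustrative assumption of how raw counts might be derived, using a small made-up document rather than the figures above; the function names are hypothetical.

```python
from collections import Counter


def paragraph_frequencies(paragraph_terms):
    """Count each potential search term separately within each paragraph."""
    return [Counter(terms) for terms in paragraph_terms]


def document_frequencies(per_paragraph_counts):
    """Derive the whole-document counts by summing the paragraph counts."""
    total = Counter()
    for counts in per_paragraph_counts:
        total.update(counts)
    return total


def as_records(doc_name, per_paragraph_counts):
    """Flatten paragraph counts into [document, paragraph number, term, count] records."""
    return [
        [doc_name, number, term, count]
        for number, counts in enumerate(per_paragraph_counts, start=1)
        for term, count in sorted(counts.items())
    ]


# Hypothetical screened terms of a two-paragraph document.
paragraphs = [
    ["computer", "data", "data"],
    ["data", "keyboard"],
]
per_paragraph = paragraph_frequencies(paragraphs)
whole_document = document_frequencies(per_paragraph)
records = as_records("document 1", per_paragraph)
```

Because the document-level result is computed from the paragraph-level results, each paragraph only needs to be scanned once.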
Step 3: adopting an unstructured database technology to store the document sets processed in the steps 1 and 2, and establishing a storage set for each document in the document sets, wherein the storage content of each storage set comprises: { the name of a document, the content of the document, the potential search terms of each paragraph in the document and the word frequency statistical result of each potential search term, the potential search terms of the document and the word frequency statistical result of each potential search term, and the storage time }, so that all the documents in the document set are converted into an ordered set in an unstructured database.
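One way to picture a storage set is as a single nested record per document. The sketch below builds such a record as a plain Python dictionary; the field names (`name`, `content`, `paragraph_terms`, `document_terms`, `stored_at`) and the function name are hypothetical, and no database call is made. In a document store such as MongoDB a dictionary of this shape could be saved as one document, but the patent does not prescribe a schema.

```python
from datetime import datetime, timezone


def build_storage_record(name, paragraphs, paragraph_counts, document_counts):
    """Assemble the storage set of one document as a nested dictionary."""
    return {
        "name": name,                                         # name of the document
        "content": list(paragraphs),                          # paragraph texts, in order
        "paragraph_terms": [dict(c) for c in paragraph_counts],  # term -> count, per paragraph
        "document_terms": dict(document_counts),              # term -> count, whole document
        "stored_at": datetime.now(timezone.utc).isoformat(),  # storage time
    }


# Hypothetical input produced by the earlier preprocessing and counting steps.
record = build_storage_record(
    "document 1",
    ["computer data data", "data keyboard"],
    [{"computer": 1, "data": 2}, {"data": 1, "keyboard": 1}],
    {"computer": 1, "data": 3, "keyboard": 1},
)
```

Storing the paragraph texts and their counts in parallel lists keeps the paragraph order of the original document, which the later output step relies on.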
Step 4: inputting a search term, and implementing search in the unstructured database with the ordered set.
Step 5: outputting a retrieval result according to the matching of the retrieval word and the potential retrieval word and the word frequency statistical result of the potential retrieval word.
Referring to fig. 2, in step 5, the output content of the search result includes at least one result set, and the content of each result set includes: { name of document, storage time, contents of each paragraph in the document with a search term }. The result sets are sorted in descending order according to the word frequency statistical results of the potential search words of the documents. In each result set, the paragraphs with search terms are arranged according to their order in the document. The contents of the result set may further include: { storage location, number of occurrences of the search term in each paragraph that contains it }.
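Steps 4 and 5 can be sketched as a lookup over stored records followed by a sort. This is a simplified illustration under several assumptions: records are plain dictionaries with hypothetical field names, matching is exact string equality between the search term and a stored potential search term, and ranking uses the raw document-level count. The patent leaves all three choices open.

```python
def search(records, term):
    """Return one result set per matching document, highest document count first."""
    results = []
    for record in records:
        doc_count = record["document_terms"].get(term, 0)
        if doc_count == 0:
            continue  # the term is not a potential search term of this document
        hits = []
        # Walk the paragraphs in document order so the output keeps that order.
        for number, (text, counts) in enumerate(
            zip(record["content"], record["paragraph_terms"]), start=1
        ):
            if counts.get(term, 0) > 0:
                hits.append({"paragraph": number, "text": text, "count": counts[term]})
        results.append({
            "name": record["name"],
            "stored_at": record["stored_at"],
            "document_count": doc_count,
            "paragraphs": hits,
        })
    # Descending order by the document-level word frequency of the search term.
    results.sort(key=lambda r: r["document_count"], reverse=True)
    return results


# Hypothetical stored records.
records = [
    {"name": "report_a", "stored_at": "2017-10-20",
     "content": ["a1", "a2"],
     "paragraph_terms": [{"landslide": 1}, {"rainfall": 2}],
     "document_terms": {"landslide": 1, "rainfall": 2}},
    {"name": "report_b", "stored_at": "2017-10-21",
     "content": ["b1", "b2", "b3"],
     "paragraph_terms": [{"landslide": 2}, {}, {"landslide": 3, "rainfall": 1}],
     "document_terms": {"landslide": 5, "rainfall": 1}},
]
results = search(records, "landslide")
```

Here `report_b` is listed before `report_a` because the term occurs five times in it, and within `report_b` the matching paragraphs 1 and 3 appear in document order together with their per-paragraph counts.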
In this document, the terms front, back, upper and lower are used to define the components in the drawings and the positions of the components relative to each other, and are used for clarity and convenience of the technical solution. It is to be understood that the use of the directional terms should not be taken to limit the scope of the claims.
The features of the embodiments and embodiments described herein above may be combined with each other without conflict.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (4)
1. A method for rapidly extracting useful data from a document set is characterized in that: the method comprises the following steps:
step 1: using a Chinese word segmentation tool to perform preprocessing including word segmentation, part-of-speech tagging and word segmentation screening on each document in a document set, and eliminating non-potential search words from the words obtained by the word segmentation and the part-of-speech tagging, wherein the non-potential search words comprise conjunctions, adverbs and modal particles, so as to obtain potential search words in each document and potential search words in each paragraph in the document, and the potential search words comprise nouns, verbs and quantifiers;
step 2: performing word frequency statistics on the potential search words in each paragraph in each document in the document set to obtain a word frequency statistical result of each potential search word in each paragraph, and obtaining a word frequency statistical result of the potential search words of the corresponding document whole based on the word frequency statistical result of the paragraphs;
step 3: adopting an unstructured database technology to store the document sets processed in the steps 1 and 2, and establishing a storage set for each document in the document sets, wherein the storage content of each storage set comprises: { the name of a document, the content of the document, the potential search word of each paragraph in the document, the word frequency statistical result of each potential search word, the potential search word of the document, the word frequency statistical result of each potential search word, and the storage time }, so that all documents in the document set are converted into an ordered set in an unstructured database;
step 4: inputting a search word, and implementing search in an unstructured database with an ordered set;
step 5: outputting a retrieval result according to the matching of the retrieval word and the potential retrieval word and the word frequency statistical result of the potential retrieval word, wherein the output content of the retrieval result comprises at least one result set, and the content of each result set comprises: { name of document, storage time, contents of each paragraph in the document with a search term }; the contents of the result set further include: { storage location, number of search terms in each paragraph having a search term }; the result sets are sorted in descending order according to the word frequency statistical result of the potential search words of the document; and the paragraphs having a search term in each of the result sets are arranged according to the paragraph order of each paragraph in the document.
2. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that: the document set is a large document set of geological disasters.
3. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that: the Chinese word segmentation tool is a word segmentation dictionary, and the word segmentation algorithm adopted in the word segmentation in the step 1 is a Chinese word segmentation, Word segmentation or Pangu segmentation algorithm.
4. The method for rapidly extracting useful data from a document set according to claim 1, characterized in that: the method for carrying out word frequency statistics in the step 2 is a TF-IDF method, and the unstructured database is a MongoDB, HBase or Redis database.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710985840.1A CN107861943B (en) | 2017-10-20 | 2017-10-20 | Method for quickly extracting useful data from document set |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107861943A (en) | 2018-03-30 |
CN107861943B (en) | 2020-03-24 |
Family
ID=61696544
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710985840.1A Active CN107861943B (en) | 2017-10-20 | 2017-10-20 | Method for quickly extracting useful data from document set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107861943B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109784689B (en) * | 2018-12-28 | 2022-03-15 | 远光软件股份有限公司 | Power grid infrastructure project report data processing method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101377777A (en) * | 2007-09-03 | 2009-03-04 | 北京百问百答网络技术有限公司 | Automatic inquiring and answering method and system |
CN105005556A (en) * | 2015-07-29 | 2015-10-28 | 成都理工大学 | Index keyword extraction method and system based on big geological data |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102541901A (en) * | 2010-12-26 | 2012-07-04 | 上海量明科技发展有限公司 | Method and system for identifying and outputting information during document reading |
CN103136352B (en) * | 2013-02-27 | 2016-02-03 | 华中师范大学 | Text retrieval system based on double-deck semantic analysis |
CN104679778B (en) * | 2013-11-29 | 2019-03-26 | 腾讯科技(深圳)有限公司 | A kind of generation method and device of search result |
CN105760474B (en) * | 2016-02-14 | 2021-02-19 | Tcl科技集团股份有限公司 | Method and system for extracting feature words of document set based on position information |
CN106649863A (en) * | 2016-12-30 | 2017-05-10 | 天津市测绘院 | Non-structured data management method and apparatus |
Non-Patent Citations (1)
Title |
---|
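面向云存储的非结构化数据存储研究 [Research on unstructured data storage for cloud storage];王存宇 等 [Wang Cunyu et al.];《计算机时代》 [Computer Era];20150531(第5期) [2015-05-31, No. 5];第13-15页 [pp. 13-15] * |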
Also Published As
Publication number | Publication date |
---|---|
CN107861943A (en) | 2018-03-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109992645B (en) | Data management system and method based on text data | |
CN103218444B (en) | Based on semantic method of Tibetan language webpage text classification | |
CN106776574B (en) | User comment text mining method and device | |
KR101681109B1 (en) | An automatic method for classifying documents by using presentative words and similarity | |
CN106708940B (en) | Method and device for processing pictures | |
CN108875065B (en) | Indonesia news webpage recommendation method based on content | |
KR102373884B1 (en) | Image data processing method for searching images by text | |
Imani et al. | Focus location extraction from political news reports with bias correction | |
CN115270738A (en) | Method and system for generating newspaper and computer storage medium | |
CN110866086A (en) | Article matching system | |
CN109902152B (en) | Method and apparatus for retrieving information | |
KR101753768B1 (en) | A knowledge management system of searching documents on categories by using weights | |
Karumudi et al. | Information retrieval and processing system for news articles in English | |
Fodil et al. | Theme classification of Arabic text: A statistical approach | |
CN107861943B (en) | Method for quickly extracting useful data from document set | |
Shah et al. | An automatic text summarization on Naive Bayes classifier using latent semantic analysis | |
CN109325096B (en) | Knowledge resource search system based on knowledge resource classification | |
Cvitaš | Relation extraction from text documents | |
Sahani et al. | Automatic text categorization of Marathi language documents | |
US9886488B2 (en) | Conceptual document analysis and characterization | |
Zhu et al. | Chinese texts classification system | |
CN112949287A (en) | Hot word mining method, system, computer device and storage medium | |
Ashqar et al. | A Comparative Assessment of Various Embeddings for Keyword Extraction | |
KR20200078170A (en) | Apparatus for classifying products by hierarchical category and method thereof | |
CN117150046B (en) | Automatic task decomposition method and system based on context semantics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||