CN104376044A - Information retrieval optimization method based on information granularity - Google Patents
Information retrieval optimization method based on information granularity
- Publication number
- CN104376044A
- Authority
- CN
- China
- Prior art keywords
- topic
- information
- pattern
- content
- event
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3325—Reformulation based on results of preceding query
- G06F16/3326—Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an information retrieval optimization method based on information granularity and relates to the technical field of information retrieval optimization. The method comprises the following steps: judging, by means of the HowNet system, whether the content of a document is consistent with an expanded set of topic keywords; extracting all sentences, titles, and subtitles from the training text set of a specific category to generate the pattern instance set of that category; traversing a graph breadth-first to generate a pattern set; and dividing the pattern set into several pattern subsets corresponding to different event topics, according to the activation degree of the pattern elements in the pattern set with respect to the different event topics in the training set. Features obtained by machine-based automatic pattern extraction may, in some respects, be better than features chosen from human experience. When massive text is retrieved, a preliminary partition by content topic effectively removes the interference of irrelevant content and increases retrieval speed.
Description
Technical field:
The present invention relates to the technical field of information retrieval optimization, and in particular to an information retrieval optimization method based on information granularity.
Background technology:
Information retrieval (Information Retrieval) refers to the process and techniques of organizing information in a certain way and finding relevant information according to the needs of information users. Information retrieval in the narrow sense is the second half of this process, namely finding the required information in an information collection, which is what is commonly called information search (Information Search or Information Seek).
Information retrieval has a broad sense and a narrow sense. In the broad sense its full name is "information storage and retrieval", which refers to organizing and storing information in a certain way and finding relevant information according to the needs of users. In the narrow sense it is the second half of "information storage and retrieval", commonly called "information searching" or "information search", which refers to finding the information a user requires in an information collection. Information retrieval in the narrow sense covers three aspects: understanding the information needs of the user, the techniques or methods of information retrieval, and meeting the needs of the information user.
In terms of its principle, the storage of information is the basis of information retrieval. The information to be stored includes not only original document data but also pictures, audio, video, and so on. This raw information must first be converted into a machine-readable form and stored in a database; otherwise it cannot be recognized by a machine. After the user enters a query request expressing an intent, the retrieval system searches the database for information related to the query, computes the similarity of each piece of information through some matching mechanism, and outputs the converted information in descending order of similarity.
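The match-and-rank behaviour described above can be illustrated with a minimal sketch. The source does not name a matching mechanism, so the bag-of-words cosine similarity used here is an assumption, and the names `cosine_similarity`, `rank_documents`, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query: str, documents: dict[str, str]) -> list[tuple[str, float]]:
    """Return (doc_id, score) pairs sorted from most to least similar."""
    q = Counter(query.lower().split())
    scored = [
        (doc_id, cosine_similarity(q, Counter(text.lower().split())))
        for doc_id, text in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample collection, for illustration only.
docs = {
    "d1": "earthquake rescue teams arrive",
    "d2": "stock market closes higher",
    "d3": "earthquake damage report",
}
ranking = rank_documents("earthquake rescue", docs)
print([doc_id for doc_id, _ in ranking])
```

Any other scoring function could be substituted for `cosine_similarity`; the only part taken from the description is that results are returned in descending order of similarity.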
" granularity " (granularity) refers to relative size or the degree of roughness of message unit.
Information retrieval originated in library reference services and in abstracting and indexing work. It first developed in the second half of the 19th century, and by the 1940s indexing and retrieval had become independent library tools and user services.
With the appearance of the world's first electronic computer in 1946, computer technology gradually entered the field of information retrieval and became closely combined with information retrieval theory. Offline batch retrieval systems and online real-time retrieval systems were successively developed and commercialized. From the 1960s to the 1980s, driven by information processing technology, communication technology, computers, and database technology, information retrieval developed rapidly and came into wide use in fields such as education, military affairs, and business. The Dialog international online information retrieval system is representative of the field in this period and remains one of the best-known systems in the world.
Topic retrieval remains a weak point in information retrieval research: with existing algorithms, retrieval results on large data volumes are often unsatisfactory. First, the retrieval results differ greatly from what users expect; second, retrieval time increases sharply as the information granularity is refined.
Summary of the invention:
The object of this invention is to provide an information retrieval optimization method based on information granularity which, when retrieving massive text, effectively eliminates the interference of irrelevant content through a preliminary partition by content topic and thereby speeds up retrieval.
In order to solve the problems described in the background, the present invention adopts the following technical solution. It exploits the property that content identification and topic identification can be computed at different granularities, coarse and fine, to design a new topic identification model. The steps are as follows:
1. Expand the topic keywords to form an N-layer topic identification tree.
2. Judge, according to the HowNet system, whether the content of the document is consistent with the expanded topic keyword set.
3. In step 2, judge whether the event topic involved in the document is consistent.
4. Extract all sentences, text titles, and subtitles from the training text set of a specific category, and generate the pattern instance set of that category.
5. Use the HowNet system to map the word or phrase sequence of each instance in the pattern instance set to concepts.
6. Traverse the graph breadth-first to generate the pattern set.
7. According to the activation degree of the pattern elements in the pattern set with respect to the different event topics in the training set, divide the pattern set into several pattern subsets corresponding to different event topics.
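Steps 1 and 2 can be sketched as follows, under stated assumptions. The source does not describe how HowNet is queried, so a small hand-written table `RELATED` stands in for the knowledge base; `expand_keywords`, `matches_topic`, and the `min_hits` threshold are hypothetical names and parameters chosen for illustration.

```python
# Hypothetical related-term table standing in for a knowledge base such as HowNet.
RELATED = {
    "disaster": ["earthquake", "flood"],
    "earthquake": ["aftershock", "epicenter"],
    "flood": ["levee"],
}


def expand_keywords(root: str, n_layers: int) -> list[set[str]]:
    """Build the layers of an N-layer keyword expansion tree, one layer at a time."""
    layers = [{root}]
    seen = {root}
    for _ in range(n_layers - 1):
        next_layer = set()
        for term in layers[-1]:
            for child in RELATED.get(term, []):
                if child not in seen:  # keep each term in one layer only
                    seen.add(child)
                    next_layer.add(child)
        if not next_layer:  # nothing left to expand
            break
        layers.append(next_layer)
    return layers


def matches_topic(document: str, layers: list[set[str]], min_hits: int = 1) -> bool:
    """Assumed consistency test: the document shares at least min_hits terms
    with the expanded keyword set."""
    expanded = set().union(*layers)
    hits = sum(1 for token in document.lower().split() if token in expanded)
    return hits >= min_hits


layers = expand_keywords("disaster", 3)
print(matches_topic("aftershock felt near the coast", layers))  # True
print(matches_topic("stock market closes higher", layers))      # False
```

Documents that fail this coarse test would be dropped before any finer-grained processing, which is where the claimed reduction in interference from irrelevant content would come from.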
Working principle of the present invention: the quality of the pattern set is closely tied to the precision of the content and topic identification algorithms. Automatic pattern extraction is used, and the pattern set is generated automatically by learning, which avoids the pattern expansion problem that arises with open text collections. The texts of a particular event topic correspond to the finest granular world, which is a refinement of the event topic granular world. Content topic identification is carried out first, followed by event topic identification. This not only allows the knowledge and experience of traditional topic identification to be used to improve identification efficiency, but also limits the scope of event topic discrimination, thereby greatly improving the accuracy of event topic identification.
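The coarse-to-fine order described above (content topic first, event topic second) can be illustrated with a small sketch. The two-level `HIERARCHY`, its cue words, and the overlap-count scoring are all assumptions made for illustration; the source does not specify how either level is scored.

```python
# Hypothetical hierarchy: content topic -> event topics -> cue words.
HIERARCHY = {
    "disaster": {
        "earthquake": {"quake", "aftershock", "epicenter"},
        "flood": {"flood", "levee", "rainfall"},
    },
    "finance": {
        "merger": {"merger", "acquisition"},
        "ipo": {"ipo", "listing"},
    },
}


def best_label(tokens, candidates):
    """Pick the label whose cue words overlap the tokens most; None if no overlap."""
    scores = {label: len(tokens & cues) for label, cues in candidates.items()}
    label = max(scores, key=scores.get)
    return label if scores[label] > 0 else None


def identify(document):
    """Return (content_topic, event_topic), identifying the coarse level first."""
    tokens = set(document.lower().split())
    # Coarse granularity: a content topic's cue set is the union of its events' cues.
    coarse = {
        topic: set().union(*events.values()) for topic, events in HIERARCHY.items()
    }
    content_topic = best_label(tokens, coarse)
    if content_topic is None:
        return None, None  # irrelevant document, no fine-grained work needed
    # Fine granularity: only event topics under the chosen content topic compete.
    event_topic = best_label(tokens, HIERARCHY[content_topic])
    return content_topic, event_topic


print(identify("aftershock shakes town near epicenter"))
print(identify("weather is nice"))
```

Because the second call to `best_label` only sees the event topics under one content topic, the fine-grained comparison never touches the cue sets of unrelated topics.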
The present invention has the following beneficial effects: features obtained by machine-based automatic pattern extraction may, in some respects, be better than features fixed by human experience; and when massive text is retrieved, a preliminary partition by content topic effectively eliminates the interference of irrelevant content and speeds up retrieval.
Embodiment:
This embodiment adopts the following technical solution. It exploits the property that content identification and topic identification can be computed at different granularities, coarse and fine, to design a new topic identification model. The steps are as follows:
1. Expand the topic keywords to form an N-layer topic identification tree.
2. Judge, according to the HowNet system, whether the content of the document is consistent with the expanded topic keyword set.
3. In step 2, judge whether the event topic involved in the document is consistent.
4. Extract all sentences, text titles, and subtitles from the training text set of a specific category, and generate the pattern instance set of that category.
5. Use the HowNet system to map the word or phrase sequence of each instance in the pattern instance set to concepts.
6. Traverse the graph breadth-first to generate the pattern set.
7. According to the activation degree of the pattern elements in the pattern set with respect to the different event topics in the training set, divide the pattern set into several pattern subsets corresponding to different event topics.
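Steps 6 and 7 can be sketched as follows. The source does not define the graph or the activation degree, so both are assumptions here: the graph is taken to be a directed concept graph whose paths are candidate patterns, and the activation degree of a pattern for an event topic is taken to be the fraction of that topic's training texts that contain the pattern as an ordered subsequence. `GRAPH`, `TRAIN`, and all function names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical concept graph: an edge A -> B means concept B may follow concept A.
GRAPH = {
    "START": ["quake", "river"],
    "quake": ["damage"],
    "river": ["overflow"],
}

# Hypothetical training set of (event_topic, concept sequence) pairs.
TRAIN = [
    ("earthquake", ["quake", "causes", "damage"]),
    ("flood", ["river", "banks", "overflow"]),
    ("flood", ["river", "rises"]),
]


def generate_patterns(graph, start="START", max_len=3):
    """Breadth-first enumeration of concept paths leaving the start node."""
    patterns = []
    queue = deque([(start, ())])
    while queue:
        node, path = queue.popleft()
        for nxt in graph.get(node, []):
            new_path = path + (nxt,)
            patterns.append(new_path)
            if len(new_path) < max_len:  # bound path length so cycles terminate
                queue.append((nxt, new_path))
    return patterns


def contains(tokens, pattern):
    """True if the pattern's concepts occur in tokens in order."""
    it = iter(tokens)
    return all(concept in it for concept in pattern)


def partition_by_activation(patterns, training_set):
    """Assign each pattern to the event topic it activates most strongly."""
    totals = defaultdict(int)
    for topic, _ in training_set:
        totals[topic] += 1
    subsets = defaultdict(list)
    for pattern in patterns:
        hits = defaultdict(int)
        for topic, tokens in training_set:
            if contains(tokens, pattern):
                hits[topic] += 1
        degrees = {topic: hits[topic] / totals[topic] for topic in totals}
        best = max(degrees, key=degrees.get)
        if degrees[best] > 0:  # drop patterns that never fire
            subsets[best].append(pattern)
    return dict(subsets)


patterns = generate_patterns(GRAPH)
subsets = partition_by_activation(patterns, TRAIN)
print(subsets)
```

Shorter patterns are emitted before longer ones because of the breadth-first order; whether the original method relies on that ordering is not stated in the source.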
Working principle of this embodiment: the quality of the pattern set is closely tied to the precision of the content and topic identification algorithms. Automatic pattern extraction is used, and the pattern set is generated automatically by learning, which avoids the pattern expansion problem that arises with open text collections. The texts of a particular event topic correspond to the finest granular world, which is a refinement of the event topic granular world. Content topic identification is carried out first, followed by event topic identification. This not only allows the knowledge and experience of traditional topic identification to be used to improve identification efficiency, but also limits the scope of event topic discrimination, thereby greatly improving the accuracy of event topic identification.
In this embodiment, features obtained by machine-based automatic pattern extraction may, in some respects, be better than features fixed by human experience; and when massive text is retrieved, a preliminary partition by content topic effectively eliminates the interference of irrelevant content and speeds up retrieval.
Claims (3)
1. An information retrieval optimization method based on information granularity, characterized in that it exploits the property that content identification and topic identification can be computed at different granularities, coarse and fine, to design a new topic identification model, with the following steps: (1) expanding the topic keywords to form an N-layer topic identification tree; (2) judging, according to the HowNet system, whether the content of the document is consistent with the expanded topic keyword set; (3) in step (2), judging whether the event topic involved in the document is consistent; (4) extracting all sentences, text titles, and subtitles from the training text set of a specific category, and generating the pattern instance set of that category; (5) using the HowNet system to map the word or phrase sequence of each instance in the pattern instance set to concepts; (6) traversing the graph breadth-first to generate the pattern set; (7) according to the activation degree of the pattern elements in the pattern set with respect to the different event topics in the training set, dividing the pattern set into several pattern subsets corresponding to different event topics.
2. The information retrieval optimization method based on information granularity according to claim 1, characterized in that the quality of the pattern set is closely tied to the precision of the content and topic identification algorithms; automatic pattern extraction is used, and the pattern set is generated automatically by learning, which avoids the pattern expansion problem that arises with open text collections; the texts of a particular event topic correspond to the finest granular world, which is a refinement of the event topic granular world; content topic identification is carried out first, followed by event topic identification, which not only allows the knowledge and experience of traditional topic identification to be used to improve identification efficiency, but also limits the scope of event topic discrimination, thereby greatly improving the accuracy of event topic identification.
3. The information retrieval optimization method based on information granularity according to claim 1, characterized in that features obtained by machine-based automatic pattern extraction may, in some respects, be better than features fixed by human experience, and in that, when massive text is retrieved, a preliminary partition by content topic effectively eliminates the interference of irrelevant content and speeds up retrieval.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410550066.8A CN104376044A (en) | 2014-10-16 | 2014-10-16 | Information retrieval optimization method based on information granularity |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410550066.8A CN104376044A (en) | 2014-10-16 | 2014-10-16 | Information retrieval optimization method based on information granularity |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104376044A true CN104376044A (en) | 2015-02-25 |
Family
ID=52554951
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410550066.8A Pending CN104376044A (en) | 2014-10-16 | 2014-10-16 | Information retrieval optimization method based on information granularity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104376044A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102207945A (en) * | 2010-05-11 | 2011-10-05 | 天津海量信息技术有限公司 | Knowledge network-based text indexing system and method |
Non-Patent Citations (1)
Title |
---|
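Guo Junrong (郭俊荣) et al.: "An Information Retrieval Optimization Method Based on Information Granularity" (一种基于信息粒度的信息检索优化方法), Computer Simulation (《计算机仿真》) * |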
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109101479B (en) | Clustering method and device for Chinese sentences | |
CN104765769B (en) | The short text query expansion and search method of a kind of word-based vector | |
CN110162591B (en) | Entity alignment method and system for digital education resources | |
US20040249808A1 (en) | Query expansion using query logs | |
CN109710792B (en) | Index-based rapid face retrieval system application | |
CN103577416A (en) | Query expansion method and system | |
CN106708929B (en) | Video program searching method and device | |
CN102693299A (en) | System and method for parallel video copy detection | |
GB2583679A (en) | Searching multilingual documents based on document structure extraction | |
CN111026710A (en) | Data set retrieval method and system | |
CN103778206A (en) | Method for providing network service resources | |
CN103761286B (en) | A kind of Service Source search method based on user interest | |
CN103927177A (en) | Characteristic-interface digraph establishment method based on LDA model and PageRank algorithm | |
CN105404677A (en) | Tree structure based retrieval method | |
CN104657376A (en) | Searching method and searching device for video programs based on program relationship | |
CN106570196B (en) | Video program searching method and device | |
CN103226601A (en) | Method and device for image search | |
CN112148938A (en) | Cross-domain heterogeneous data retrieval system and retrieval method | |
Nagavi et al. | Content based audio retrieval with MFCC feature extraction, clustering and sort-merge techniques | |
CN102508920B (en) | Information retrieval method based on Boosting sorting algorithm | |
CN105426490A (en) | Tree structure based indexing method | |
Swe | Intelligent information retrieval within digital library using domain ontology | |
KR101592670B1 (en) | Apparatus for searching data using index and method for using the apparatus | |
CN104376044A (en) | Information retrieval optimization method based on information granularity | |
CN108345605B (en) | Text search method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20150225 |