CN104376044A - Information retrieval optimization method based on information granularity - Google Patents

Information retrieval optimization method based on information granularity

Info

Publication number
CN104376044A
Authority
CN
China
Prior art keywords
topic
information
pattern
content
event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410550066.8A
Other languages
Chinese (zh)
Inventor
傅涛
傅德胜
经正俊
孙文静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGSU BOZHI SOFTWARE TECHNOLOGY Co Ltd
Original Assignee
JIANGSU BOZHI SOFTWARE TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGSU BOZHI SOFTWARE TECHNOLOGY Co Ltd
Priority to CN201410550066.8A
Publication of CN104376044A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3325 Reformulation based on results of preceding query
    • G06F16/3326 Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an information retrieval optimization method based on information granularity and relates to the technical field of information retrieval optimization. The method comprises the following steps: judging, according to the HowNet system, whether the content of a document is consistent with an expanded set of topic keywords; extracting all sentences, titles and subtitles from a training text set of a specific category and generating a pattern instance set for that category; traversing a graph breadth-first to generate a pattern set; and dividing the pattern set into several pattern subsets corresponding to different event topics, according to how strongly the pattern elements in the pattern set activate the different event topics in the training set. Features obtained by machine-based automatic pattern extraction may, in some respects, be better than features determined by human experience. When retrieving massive text, a preliminary separation by content topic effectively removes the interference of irrelevant content and increases retrieval speed.

Description

An information retrieval optimization method based on information granularity
Technical field:
The present invention relates to the technical field of information retrieval optimization, and specifically to an information retrieval optimization method based on information granularity.
Background:
Information retrieval refers to the processes and techniques by which information is organized in a certain way and relevant information is found according to the needs of information users. Information retrieval in the narrow sense is the second half of that process, namely finding the required information in an information collection; this is what is commonly called information search (or information seeking).
Information retrieval has a broad sense and a narrow sense. In the broad sense its full name is "information storage and retrieval": organizing and storing information in a certain way and finding relevant information according to the needs of users. In the narrow sense it is the second half of "information storage and retrieval", usually called "information search" or "information seeking", meaning the process of finding in an information collection the information a user needs. Information retrieval in the narrow sense covers three aspects: understanding the user's information need, the techniques or methods of retrieval, and satisfying the user's need.
In terms of retrieval principles, storing information is the basis of retrieving it. The information to be stored includes not only original document data but also pictures, audio, video and so on. This raw information must first be converted into a machine-readable form and stored in a database; otherwise it cannot be recognized by machine. After the user enters a query expressing an intent, the retrieval system searches the database for information related to the query, computes the similarity of each item through some matching mechanism, and outputs the results in descending order of similarity.
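The query-matching step described above can be illustrated with a minimal sketch. It assumes a bag-of-words representation and cosine similarity as the "matching mechanism"; the patent does not name a specific measure, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, documents):
    """Score every stored document against the query and return the
    matching ones in descending order of similarity."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in documents]
    scored = [(s, d) for s, d in scored if s > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical stored documents
DOCS = [
    "earthquake rescue teams arrive",
    "stock market falls sharply",
    "earthquake damages city earthquake warning issued",
]
results = search("earthquake warning", DOCS)
```

Documents that share no term with the query are dropped, and the rest are returned most similar first.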
"Granularity" refers to the relative size or coarseness of an information unit.
Information retrieval originated in the reference and abstracting-and-indexing services of libraries. It first developed in the second half of the 19th century, and by the 1940s indexing and retrieval had become independent tools and user services of libraries.
With the appearance of the world's first electronic computer in 1946, computer technology gradually entered the field of information retrieval and became closely combined with retrieval theory. Offline batch retrieval systems and online real-time retrieval systems were successively developed and commercialized. From the 1960s to the 1980s, driven by information processing, communications, computer and database technology, information retrieval developed rapidly and was widely applied in education, the military, business and other fields. The Dialog international online information retrieval system is representative of this period and remains one of the best-known systems in the world.
Topic retrieval is a weak point in information retrieval research. With existing algorithms, retrieval results on large data volumes are often unsatisfactory: first, the results differ greatly from what users expect; second, retrieval time increases sharply as the information granularity becomes finer.
Summary of the invention:
The object of the present invention is to provide an information retrieval optimization method based on information granularity which, when retrieving massive text, effectively removes the interference of irrelevant content through a preliminary separation by content topic, and thereby speeds up retrieval.
To solve the problems described in the background, the present invention adopts the following technical solution: it exploits the fact that content identification and topic identification are computed at different granularities, coarse and fine, to design a new topic identification model. The steps are as follows:
1. Expand the topic keywords to form a topic identification tree of N layers.
2. Judge, according to the HowNet system, whether the content of the document is consistent with the expanded topic keyword set.
3. In step 2, also judge whether the event topic addressed by the document is consistent.
4. Extract all sentences, text titles and subtitles from the training text set of a specified category, and generate the pattern instance set of that category.
5. Use the HowNet system to map the word or phrase sequence of each instance in the pattern instance set to concepts.
6. Traverse the graph breadth-first to generate the pattern set.
7. According to how strongly the pattern elements in the pattern set activate the different event topics in the training set, divide the pattern set into several pattern subsets corresponding to different event topics.
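Steps 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions: the `RELATED` dictionary is a hypothetical stand-in for a lexical knowledge base such as HowNet, and the depth-weighted score and the 0.5 threshold are invented for the example, since the patent does not specify how consistency is measured.

```python
# Hypothetical related-term table standing in for a knowledge base such as HowNet
RELATED = {
    "earthquake": ["quake", "aftershock"],
    "quake": ["tremor"],
    "aftershock": ["seismic"],
    "tremor": ["shaking"],
}


def expand_topic(keyword, n_layers):
    """Step 1: expand a topic keyword layer by layer into a tree of at most
    n_layers levels. Layer 0 holds the root keyword."""
    layers = [[keyword]]
    seen = {keyword}
    for _ in range(n_layers - 1):
        next_layer = []
        for term in layers[-1]:
            for rel in RELATED.get(term, []):
                if rel not in seen:
                    seen.add(rel)
                    next_layer.append(rel)
        if not next_layer:
            break
        layers.append(next_layer)
    return layers


def is_consistent(document, layers, threshold=0.5):
    """Step 2: judge whether a document agrees with the expanded keyword set.
    Assumed scoring: a hit in layer d contributes 1 / (d + 1)."""
    words = set(document.lower().split())
    score = 0.0
    for depth, terms in enumerate(layers):
        if words & set(terms):
            score += 1.0 / (depth + 1)
    return score >= threshold


layers = expand_topic("earthquake", 3)
```

With these toy tables, a document that only mentions a deep-layer term falls below the threshold, while one that also mentions a shallower term passes. Weighting shallow layers more heavily is one plausible design choice, not something the patent prescribes.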
Working principle of the present invention: the quality of the pattern set is closely tied to the precision of the content and topic identification algorithms. Automatic pattern extraction is used, and its learning approach generates the pattern set automatically, which avoids the pattern expansion problem that arises with open text collections. The texts of a particular event topic correspond to the finest granular world, which is a refinement of the event topic granular world. Performing content topic identification first and event topic identification second not only allows the knowledge and experience of traditional topic identification to be used to improve identification efficiency, but also limits the scope of event topic discrimination, thereby greatly improving the accuracy of event topic identification.
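The coarse-to-fine order described here (content topic first, event topic second) can be sketched as a two-stage lookup. The topic tables, the keyword-overlap score and all names below are hypothetical; the point is only that the second stage searches the event topics of one content topic rather than all of them.

```python
# Hypothetical coarse-grained content topics
CONTENT_TOPICS = {
    "disaster": {"earthquake", "flood", "rescue"},
    "finance": {"stock", "market", "bank"},
}

# Hypothetical fine-grained event topics, grouped under their content topic
EVENT_TOPICS = {
    "disaster": {
        "earthquake_event": {"earthquake", "aftershock"},
        "flood_event": {"flood", "levee"},
    },
    "finance": {
        "market_crash": {"stock", "crash"},
    },
}


def best_match(words, candidates):
    """Return the candidate with the largest keyword overlap, or None."""
    scores = {name: len(words & kws) for name, kws in candidates.items()}
    name = max(scores, key=scores.get)
    return name if scores[name] > 0 else None


def identify(document):
    """Coarse stage picks the content topic; fine stage only considers
    event topics under that content topic."""
    words = set(document.lower().split())
    content = best_match(words, CONTENT_TOPICS)
    if content is None:
        return None, None
    event = best_match(words, EVENT_TOPICS[content])
    return content, event


result = identify("rescue teams respond to flood near levee")
```

A document that matches no content topic is rejected before any event topic is examined, which is how the coarse stage limits the scope of the fine stage.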
The present invention has the following beneficial effects: features obtained by machine-based automatic pattern extraction may, in some respects, be better than features determined by human experience; and when retrieving massive text, a preliminary separation by content topic effectively removes the interference of irrelevant content and speeds up retrieval.
Specific embodiment:
This embodiment adopts the following technical solution: it exploits the fact that content identification and topic identification are computed at different granularities, coarse and fine, to design a new topic identification model. The steps are as follows:
1. Expand the topic keywords to form a topic identification tree of N layers.
2. Judge, according to the HowNet system, whether the content of the document is consistent with the expanded topic keyword set.
3. In step 2, also judge whether the event topic addressed by the document is consistent.
4. Extract all sentences, text titles and subtitles from the training text set of a specified category, and generate the pattern instance set of that category.
5. Use the HowNet system to map the word or phrase sequence of each instance in the pattern instance set to concepts.
6. Traverse the graph breadth-first to generate the pattern set.
7. According to how strongly the pattern elements in the pattern set activate the different event topics in the training set, divide the pattern set into several pattern subsets corresponding to different event topics.
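Steps 4 to 6 can be sketched as below. The patent does not say how the graph is built, so this example assumes a directed graph whose edges link consecutive concepts in each pattern instance; the `CONCEPT_OF` table is a hypothetical stand-in for HowNet's word-to-concept mapping, and `max_len` is an invented cap on pattern length.

```python
from collections import deque

# Hypothetical word -> concept table standing in for HowNet
CONCEPT_OF = {
    "earthquake": "DISASTER", "quake": "DISASTER",
    "hits": "AFFECT", "strikes": "AFFECT",
    "city": "PLACE", "province": "PLACE",
}


def to_concepts(sentence):
    """Step 5: map a word sequence to a concept sequence, skipping unknown words."""
    return [CONCEPT_OF[w] for w in sentence.lower().split() if w in CONCEPT_OF]


def build_graph(instances):
    """Assumed construction: one edge per pair of consecutive concepts."""
    graph = {}
    for seq in instances:
        for a, b in zip(seq, seq[1:]):
            graph.setdefault(a, set()).add(b)
    return graph


def bfs_patterns(graph, start, max_len=3):
    """Step 6: breadth-first traversal from `start`; every path of two or
    more concepts visited along the way is emitted as a pattern."""
    patterns = []
    queue = deque([(start,)])
    while queue:
        path = queue.popleft()
        if len(path) > 1:
            patterns.append(path)
        if len(path) < max_len:
            for nxt in sorted(graph.get(path[-1], ())):
                if nxt not in path:  # avoid cycles
                    queue.append(path + (nxt,))
    return patterns


# Step 4 (simplified): sentences and titles of one training category
TRAINING = ["Earthquake hits city", "Quake strikes province", "Earthquake strikes the city"]
instances = [to_concepts(s) for s in TRAINING]
graph = build_graph(instances)
patterns = bfs_patterns(graph, "DISASTER")
```

Because the traversal is breadth-first, shorter patterns are produced before longer ones, so a length cap simply truncates the search frontier.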
Working principle of this embodiment: the quality of the pattern set is closely tied to the precision of the content and topic identification algorithms. Automatic pattern extraction is used, and its learning approach generates the pattern set automatically, which avoids the pattern expansion problem that arises with open text collections. The texts of a particular event topic correspond to the finest granular world, which is a refinement of the event topic granular world. Performing content topic identification first and event topic identification second not only allows the knowledge and experience of traditional topic identification to be used to improve identification efficiency, but also limits the scope of event topic discrimination, thereby greatly improving the accuracy of event topic identification.
In this embodiment, features obtained by machine-based automatic pattern extraction may, in some respects, be better than features determined by human experience; and when retrieving massive text, a preliminary separation by content topic effectively removes the interference of irrelevant content and speeds up retrieval.
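Step 7 of the embodiment, dividing the pattern set by event topic, can be sketched as follows. The patent does not define the activation degree, so this example assumes it is the fraction of a topic's training documents in which a pattern occurs; the training data and pattern names are hypothetical.

```python
from collections import defaultdict


def activation(pattern, docs):
    """Assumed activation degree: fraction of a topic's training documents
    (each represented as a set of patterns) that contain the pattern."""
    if not docs:
        return 0.0
    return sum(1 for d in docs if pattern in d) / len(docs)


def partition(patterns, training):
    """Assign each pattern to the event topic it activates most strongly.
    Patterns that activate no topic are left out of every subset."""
    subsets = defaultdict(list)
    for p in patterns:
        scores = {topic: activation(p, docs) for topic, docs in training.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            subsets[best].append(p)
    return dict(subsets)


# Hypothetical training set: event topic -> documents as sets of patterns
TRAINING = {
    "earthquake": [{"DISASTER>AFFECT", "RESCUE>ARRIVE"}, {"DISASTER>AFFECT"}],
    "market": [{"PRICE>FALL"}, {"PRICE>FALL", "DISASTER>AFFECT"}],
}
PATTERNS = ["DISASTER>AFFECT", "PRICE>FALL", "RESCUE>ARRIVE", "UNSEEN>PATTERN"]
subsets = partition(PATTERNS, TRAINING)
```

A pattern that appears under several topics goes to the one where it is most frequent, so each resulting subset characterizes a single event topic.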

Claims (3)

1. An information retrieval optimization method based on information granularity, characterized in that it exploits the fact that content identification and topic identification are computed at different granularities, coarse and fine, to design a new topic identification model, with the following steps: (1) expanding the topic keywords to form a topic identification tree of N layers; (2) judging, according to the HowNet system, whether the content of the document is consistent with the expanded topic keyword set; (3) in step (2), judging whether the event topic addressed by the document is consistent; (4) extracting all sentences, text titles and subtitles from the training text set of a specified category, and generating the pattern instance set of that category; (5) using the HowNet system to map the word or phrase sequence of each instance in the pattern instance set to concepts; (6) traversing the graph breadth-first to generate the pattern set; (7) according to how strongly the pattern elements in the pattern set activate the different event topics in the training set, dividing the pattern set into several pattern subsets corresponding to different event topics.
2. The information retrieval optimization method based on information granularity according to claim 1, characterized in that the quality of the pattern set is closely tied to the precision of the content and topic identification algorithms; automatic pattern extraction is used, and its learning approach generates the pattern set automatically, which avoids the pattern expansion problem that arises with open text collections; the texts of a particular event topic correspond to the finest granular world, which is a refinement of the event topic granular world; content topic identification is performed first and event topic identification second, which not only allows the knowledge and experience of traditional topic identification to be used to improve identification efficiency, but also limits the scope of event topic discrimination, thereby greatly improving the accuracy of event topic identification.
3. The information retrieval optimization method based on information granularity according to claim 1, characterized in that features obtained by machine-based automatic pattern extraction may, in some respects, be better than features determined by human experience, and that when retrieving massive text, a preliminary separation by content topic effectively removes the interference of irrelevant content and speeds up retrieval.
CN201410550066.8A 2014-10-16 2014-10-16 Information retrieval optimization method based on information granularity Pending CN104376044A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410550066.8A CN104376044A (en) 2014-10-16 2014-10-16 Information retrieval optimization method based on information granularity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410550066.8A CN104376044A (en) 2014-10-16 2014-10-16 Information retrieval optimization method based on information granularity

Publications (1)

Publication Number Publication Date
CN104376044A true CN104376044A (en) 2015-02-25

Family

ID=52554951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410550066.8A Pending CN104376044A (en) 2014-10-16 2014-10-16 Information retrieval optimization method based on information granularity

Country Status (1)

Country Link
CN (1) CN104376044A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102207945A (en) * 2010-05-11 2011-10-05 天津海量信息技术有限公司 Knowledge network-based text indexing system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
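郭俊荣 (Guo Junrong) et al.: "An information retrieval optimization method based on information granularity" (一种基于信息粒度的信息检索优化方法), 《计算机仿真》 (Computer Simulation) *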

Similar Documents

Publication Publication Date Title
CN109101479B (en) Clustering method and device for Chinese sentences
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
CN110162591B (en) Entity alignment method and system for digital education resources
US20040249808A1 (en) Query expansion using query logs
CN109710792B (en) Index-based rapid face retrieval system application
CN103577416A (en) Query expansion method and system
CN106708929B (en) Video program searching method and device
CN102693299A (en) System and method for parallel video copy detection
GB2583679A (en) Searching multilingual documents based on document structure extraction
CN111026710A (en) Data set retrieval method and system
CN103778206A (en) Method for providing network service resources
CN103761286B (en) A kind of Service Source search method based on user interest
CN103927177A (en) Characteristic-interface digraph establishment method based on LDA model and PageRank algorithm
CN105404677A (en) Tree structure based retrieval method
CN104657376A (en) Searching method and searching device for video programs based on program relationship
CN106570196B (en) Video program searching method and device
CN103226601A (en) Method and device for image search
CN112148938A (en) Cross-domain heterogeneous data retrieval system and retrieval method
Nagavi et al. Content based audio retrieval with MFCC feature extraction, clustering and sort-merge techniques
CN102508920B (en) Information retrieval method based on Boosting sorting algorithm
CN105426490A (en) Tree structure based indexing method
Swe Intelligent information retrieval within digital library using domain ontology
KR101592670B1 (en) Apparatus for searching data using index and method for using the apparatus
CN104376044A (en) Information retrieval optimization method based on information granularity
CN108345605B (en) Text search method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 2015-02-25