CN104376044A

CN104376044A - Information retrieval optimization method based on information granularity

Info

Publication number: CN104376044A
Application number: CN201410550066.8A
Authority: CN
Inventors: 傅涛; 傅德胜; 经正俊; 孙文静
Original assignee: JIANGSU BOZHI SOFTWARE TECHNOLOGY Co Ltd
Current assignee: JIANGSU BOZHI SOFTWARE TECHNOLOGY Co Ltd
Priority date: 2014-10-16
Filing date: 2014-10-16
Publication date: 2015-02-25

Abstract

The invention provides an information retrieval optimization method based on information granularity and relates to the technical field of information retrieval optimization. The method comprises the following steps: judging, by means of the HowNet system, whether the content of a document is consistent with a topic keyword expansion set; extracting all sentences, titles and subtitles from the training text set of a specific category to generate a pattern instance set for that category; traversing a graph breadth-first to generate a pattern set; and dividing the pattern set into several pattern subsets corresponding to different event topics, according to the excitation degree of the pattern elements in the pattern set with respect to the different event topics in the training set. Features extracted automatically by machine from patterns may, in certain respects, be superior to features determined by human experience. When retrieving massive amounts of text, the preliminary division by content topic effectively removes the interference of unrelated content and increases retrieval speed.

Description

An information retrieval optimization method based on information granularity

Technical field:

The present invention relates to the technical field of information retrieval optimization, and specifically to an information retrieval optimization method based on information granularity.

Background technology:

Information retrieval refers to the process and technology of organizing information in a certain way and finding relevant information according to the needs of information users. Information retrieval in the narrow sense is the second half of that process, namely finding the required information in an information collection, which is what is commonly called information searching (information search or information seeking).

Information retrieval has a broad sense and a narrow sense. In the broad sense its full name is "information storage and retrieval", which refers to organizing and storing information in a certain way and finding relevant information according to the needs of users. In the narrow sense it is the second half of "information storage and retrieval", commonly called "information searching" or "information search", which refers to finding the information a user requires in an information collection. Information retrieval in the narrow sense covers three aspects: understanding the user's information need, the techniques or methods of information retrieval, and satisfying the information user's need.

In terms of the principle of information retrieval, the storage of information is the basis of retrieval. The information to be stored includes not only original document data but also pictures, audio, video and so on. This raw information must first be converted into a machine-readable form and stored in a database; otherwise it cannot be recognized by machine. After the user enters a query request according to their intent, the retrieval system searches the database for information related to the query, computes the similarity of each piece of information to the query by some matching mechanism, and outputs the information in descending order of similarity.
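The query-matching step described above can be sketched as follows. This is a minimal illustration only: the text does not name a matching mechanism, so cosine similarity over bag-of-words counts is assumed here, and the function names `cosine_similarity` and `rank_documents` and the toy database `docs` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query: str, documents: dict[str, str]) -> list[tuple[str, float]]:
    """Return (doc_id, score) pairs in descending order of similarity to the query."""
    q = Counter(query.lower().split())
    scored = [
        (doc_id, cosine_similarity(q, Counter(text.lower().split())))
        for doc_id, text in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy database, for illustration only.
docs = {
    "d1": "earthquake relief efforts continue",
    "d2": "stock market closes higher",
    "d3": "earthquake damages coastal city",
}
ranking = rank_documents("earthquake city", docs)
```

Here `ranking` lists `d3` first because it shares two terms with the query, then `d1`, then the unrelated `d2` with a score of zero.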

"Granularity" refers to the relative size or coarseness of an information unit.

Information retrieval originated in the reference and abstracting-and-indexing services of libraries. It first developed in the second half of the 19th century, and by the 1940s indexing and retrieval had become independent tools and user services of libraries.

With the appearance of the world's first electronic computer in 1946, computer technology gradually entered the field of information retrieval and became closely integrated with information retrieval theory. Offline batch retrieval systems and online real-time retrieval systems were successively developed and commercialized. From the 1960s to the 1980s, driven by information processing, communication, computer and database technologies, information retrieval developed rapidly and came into wide use in fields such as education, the military and business. The Dialog international online information retrieval system is representative of this period and remains one of the best-known systems in the world.

Topic retrieval is a weak point of information retrieval research. With existing algorithms, the results of retrieval over large volumes of data are often unsatisfactory: first, the retrieval results differ greatly from what users expect; second, retrieval time increases sharply as the information granularity becomes finer.

Summary of the invention:

The object of the present invention is to provide an information retrieval optimization method based on information granularity which, when retrieving massive amounts of text, effectively eliminates the interference of irrelevant content through a preliminary division by content topic and thereby speeds up retrieval.

To solve the problems described in the background, the present invention adopts the following technical solution. It exploits the fact that content recognition and topic identification can be computed at different granularities, coarse and fine, to design a new topic identification model. The steps are as follows:
1. Expand the topic keywords to form an N-layer topic identification tree.
2. Judge, by means of the HowNet system, whether the content of the document is consistent with the topic keyword expansion set.
3. On the basis of step 2, judge whether the event topic involved in the document is consistent.
4. Extract all sentences, text titles and subtitles from the training text set of a specified category to generate the pattern instance set of that category.
5. Use the HowNet system to map the word or phrase sequences of the instances in the pattern instance set to concepts.
6. Traverse the graph breadth-first to generate the pattern set.
7. According to the excitation degree of the pattern elements in the pattern set with respect to the different event topics in the training set, divide the pattern set into several pattern subsets corresponding to different event topics.
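Steps 1 and 2 can be sketched as below. This is an illustration under stated assumptions, not the patented implementation: the table `RELATED_TERMS` is a hypothetical stand-in for HowNet lookups, the functions `expand_keywords` and `is_consistent` are invented names, and "consistent" is assumed to mean that the document shares at least `min_hits` tokens with the expansion set.

```python
# Hypothetical stand-in for HowNet: each term maps to its related terms.
RELATED_TERMS = {
    "earthquake": ["seismic", "aftershock"],
    "seismic": ["tremor"],
    "aftershock": [],
    "tremor": ["magnitude"],
}


def expand_keywords(seed: str, n_layers: int) -> list[set[str]]:
    """Build an expansion tree of at most n_layers layers; layer 0 holds the seed."""
    layers = [{seed}]
    seen = {seed}
    for _ in range(n_layers - 1):
        next_layer = set()
        for term in layers[-1]:
            for related in RELATED_TERMS.get(term, []):
                if related not in seen:
                    seen.add(related)
                    next_layer.add(related)
        if not next_layer:
            break  # nothing left to expand
        layers.append(next_layer)
    return layers


def is_consistent(document: str, layers: list[set[str]], min_hits: int = 1) -> bool:
    """Assumed consistency test: the document shares enough tokens with the expansion set."""
    expansion_set = set().union(*layers)
    tokens = set(document.lower().split())
    return len(tokens & expansion_set) >= min_hits


layers = expand_keywords("earthquake", 3)
```

With `n_layers=3` the term "magnitude" is not reached, so the depth N directly controls how broad the coarse filter is.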

Working principle of the present invention: the quality of the pattern set is directly related to the accuracy of the content and topic identification algorithms. An automatic extraction technique is used, and the pattern set is generated automatically by learning, which avoids the pattern expansion problem that arises with open text collections. The text of a specific event topic corresponds to the finest granular world and is a refinement of the event topic granular world. Content topic identification is carried out first and event topic identification second. This not only uses the knowledge and experience of traditional topic identification to improve identification efficiency, but also limits the range over which event topics must be discriminated, thereby greatly improving the accuracy of event topic identification.
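The coarse-then-fine order described above can be illustrated with a small sketch. Everything named here is an assumption for illustration: the function `identify_event_topic`, the keyword and pattern sets, and the use of simple token overlap as the score are not taken from the source.

```python
from typing import Optional


def identify_event_topic(
    document: str,
    content_keywords: set[str],
    event_patterns: dict[str, set[str]],
) -> Optional[str]:
    """Two-stage identification: coarse content-topic filter, then fine event-topic match."""
    tokens = set(document.lower().split())

    # Coarse granularity: discard documents unrelated to the content topic.
    if not tokens & content_keywords:
        return None

    # Fine granularity: only documents that passed the filter are scored
    # against the per-event pattern subsets.
    best_topic, best_score = None, 0
    for topic, patterns in event_patterns.items():
        score = len(tokens & patterns)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Hypothetical keyword and pattern sets, for illustration only.
content_keywords = {"earthquake", "seismic", "tremor"}
event_patterns = {
    "rescue": {"rescue", "relief", "survivors"},
    "damage": {"collapsed", "damage", "destroyed"},
}
on_topic = identify_event_topic(
    "earthquake relief teams rescue survivors", content_keywords, event_patterns
)
off_topic = identify_event_topic(
    "stock market damage report", content_keywords, event_patterns
)
```

The second document contains the word "damage" but is rejected at the coarse stage, which is how the preliminary division keeps unrelated text from reaching the more expensive event-topic comparison.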

The present invention has the following beneficial effects: features extracted automatically by machine from patterns may in some respects be better than features determined by human experience; and, when retrieving massive amounts of text, the preliminary division by content topic effectively eliminates the interference of irrelevant content and speeds up retrieval.

Embodiment:

This embodiment adopts the following technical solution. It exploits the fact that content recognition and topic identification can be computed at different granularities, coarse and fine, to design a new topic identification model. The steps are as follows:
1. Expand the topic keywords to form an N-layer topic identification tree.
2. Judge, by means of the HowNet system, whether the content of the document is consistent with the topic keyword expansion set.
3. On the basis of step 2, judge whether the event topic involved in the document is consistent.
4. Extract all sentences, text titles and subtitles from the training text set of a specified category to generate the pattern instance set of that category.
5. Use the HowNet system to map the word or phrase sequences of the instances in the pattern instance set to concepts.
6. Traverse the graph breadth-first to generate the pattern set.
7. According to the excitation degree of the pattern elements in the pattern set with respect to the different event topics in the training set, divide the pattern set into several pattern subsets corresponding to different event topics.
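Steps 4 to 6 might look like the sketch below. The source does not define the graph that is traversed, so this example assumes one: nodes are concepts and an edge links two concepts that appear next to each other in a pattern instance. The table `CONCEPT_OF` is a hypothetical stand-in for HowNet, only sentences (not titles or subtitles) are used, and treating each concept path of bounded length as a pattern is likewise an assumption.

```python
from collections import deque

# Hypothetical word-to-concept table standing in for HowNet.
CONCEPT_OF = {
    "earthquake": "DISASTER", "flood": "DISASTER",
    "hits": "AFFECT", "strikes": "AFFECT",
    "city": "PLACE", "village": "PLACE",
}


def to_concepts(sentence: str) -> list[str]:
    """Step 5: map each known word to its concept, dropping unknown words."""
    return [CONCEPT_OF[w] for w in sentence.lower().split() if w in CONCEPT_OF]


def build_graph(instances: list[list[str]]) -> dict[str, set[str]]:
    """Assumed graph: a directed edge joins concepts adjacent in an instance."""
    graph: dict[str, set[str]] = {}
    for concepts in instances:
        for left, right in zip(concepts, concepts[1:]):
            graph.setdefault(left, set()).add(right)
            graph.setdefault(right, set())
    return graph


def bfs_patterns(graph: dict[str, set[str]], start: str, max_len: int = 3) -> list[tuple[str, ...]]:
    """Step 6: breadth-first traversal; every path of 2..max_len concepts is a pattern."""
    patterns = []
    queue = deque([(start,)])
    while queue:
        path = queue.popleft()
        if len(path) > 1:
            patterns.append(path)
        if len(path) < max_len:
            for nxt in sorted(graph.get(path[-1], ())):
                if nxt not in path:  # avoid cycles
                    queue.append(path + (nxt,))
    return patterns


# Step 4 (simplified): sentences taken from a hypothetical training text set.
training_sentences = ["Earthquake hits city", "Flood strikes village"]
instances = [to_concepts(s) for s in training_sentences]
graph = build_graph(instances)
patterns = bfs_patterns(graph, "DISASTER")
```

Both training sentences collapse to the same concept sequence, so the traversal yields shorter patterns before longer ones, which is the property breadth-first order provides.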

Working principle of this embodiment: the quality of the pattern set is directly related to the accuracy of the content and topic identification algorithms. An automatic extraction technique is used, and the pattern set is generated automatically by learning, which avoids the pattern expansion problem that arises with open text collections. The text of a specific event topic corresponds to the finest granular world and is a refinement of the event topic granular world. Content topic identification is carried out first and event topic identification second. This not only uses the knowledge and experience of traditional topic identification to improve identification efficiency, but also limits the range over which event topics must be discriminated, thereby greatly improving the accuracy of event topic identification.
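The learned division of the pattern set into per-topic subsets (step 7) can be sketched as follows. The source gives no formula for the excitation degree, so this example assumes it is the fraction of a topic's training documents in which the pattern occurs, and assigns each pattern to the topic it excites most. The function name `partition_patterns` and the sample data are hypothetical.

```python
from collections import defaultdict


def partition_patterns(
    pattern_set: set[str],
    training_set: list[tuple[str, set[str]]],
) -> dict[str, set[str]]:
    """Assign each pattern to the event topic with the highest assumed excitation degree.

    training_set holds (event_topic, patterns_found_in_document) pairs.
    """
    docs_per_topic: dict[str, int] = defaultdict(int)
    hits: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for topic, found in training_set:
        docs_per_topic[topic] += 1
        for pattern in found:
            hits[pattern][topic] += 1

    if not docs_per_topic:
        return {}

    subsets: dict[str, set[str]] = defaultdict(set)
    for pattern in pattern_set:
        # Assumed excitation degree: share of the topic's documents containing the pattern.
        excitation = {t: hits[pattern][t] / n for t, n in docs_per_topic.items()}
        best = max(excitation, key=excitation.get)
        if excitation[best] > 0:  # patterns never observed are left unassigned
            subsets[best].add(pattern)
    return dict(subsets)


# Hypothetical patterns and labeled training documents, for illustration only.
pattern_set = {"DISASTER>AFFECT", "RESCUE>PERSON", "UNUSED"}
training_set = [
    ("damage", {"DISASTER>AFFECT"}),
    ("damage", {"DISASTER>AFFECT", "RESCUE>PERSON"}),
    ("rescue", {"RESCUE>PERSON"}),
]
subsets = partition_patterns(pattern_set, training_set)
```

Normalizing by the number of documents per topic keeps a topic with many training documents from absorbing every pattern; other definitions of excitation degree would slot into the same structure.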

In this embodiment, features extracted automatically by machine from patterns may in some respects be better than features determined by human experience; and, when retrieving massive amounts of text, the preliminary division by content topic effectively eliminates the interference of irrelevant content and speeds up retrieval.

Claims

1. An information retrieval optimization method based on information granularity, characterized in that it exploits the fact that content recognition and topic identification can be computed at different granularities, coarse and fine, to design a new topic identification model, with the following steps: (1) expanding the topic keywords to form an N-layer topic identification tree; (2) judging, by means of the HowNet system, whether the content of the document is consistent with the topic keyword expansion set; (3) on the basis of step (2), judging whether the event topic involved in the document is consistent; (4) extracting all sentences, text titles and subtitles from the training text set of a specified category to generate the pattern instance set of that category; (5) using the HowNet system to map the word or phrase sequences of the instances in the pattern instance set to concepts; (6) traversing the graph breadth-first to generate the pattern set; (7) according to the excitation degree of the pattern elements in the pattern set with respect to the different event topics in the training set, dividing the pattern set into several pattern subsets corresponding to different event topics.

2. The information retrieval optimization method based on information granularity according to claim 1, characterized in that the quality of the pattern set is directly related to the accuracy of the content and topic identification algorithms; an automatic extraction technique is used, and the pattern set is generated automatically by learning, which avoids the pattern expansion problem that arises with open text collections; the text of a specific event topic corresponds to the finest granular world and is a refinement of the event topic granular world; content topic identification is carried out first and event topic identification second, which not only uses the knowledge and experience of traditional topic identification to improve identification efficiency, but also limits the range over which event topics must be discriminated, thereby greatly improving the accuracy of event topic identification.

3. The information retrieval optimization method based on information granularity according to claim 1, characterized in that features extracted automatically by machine from patterns may in some respects be better than features determined by human experience, and, when retrieving massive amounts of text, the preliminary division by content topic effectively eliminates the interference of irrelevant content and speeds up retrieval.