CN101546331A

CN101546331A - System and method for acquiring characteristics favorable for retrieval and evaluating value of related things

Info

Publication number: CN101546331A
Application number: CN200910050761A
Authority: CN
Inventors: 刘健
Original assignee: Individual
Current assignee: Individual
Priority date: 2009-05-07
Filing date: 2009-05-07
Publication date: 2009-09-30

Abstract

The invention relates to a system and a method for acquiring features that are useful for text retrieval. Using the system and the method, a user can acquire features that relate to the user's own retrieval needs and are useful for retrieval, such as keywords, sequences, grammatical patterns and semantic roles. With these features the user can construct more effective queries and thereby search documents more efficiently. The invention also relates to a system and a method for evaluating the value of things related to an input text, whereby a user who submits an input text obtains an evaluation of the various things related to it. The invention further relates to a system and a method for evaluating career advantages from a person's resume, which use a job posting library and/or a resume library to grade the career advantage of the various skills and experiences related to the resume. Finally, the invention relates to a system and a method for document retrieval that take a text as the query input, which can quickly narrow the retrieval range while avoiding the loss of potentially valuable search results.

Description

System and method for obtaining features that help retrieval and for evaluating the value of related things

Technical field

The present invention relates to the field of information processing, in particular to text retrieval and text processing. Specifically, it relates to a system and method for obtaining features that help text retrieval, a system and method for evaluating the value to a user of things related to an input text, a system and method for assessing a person's career advantages from that person's resume, and a system and method for document retrieval that take a text as the query input.

Background technology

In one aspect of the present invention, improving retrieval effectiveness is a major problem of the Internet age. The basic goal of retrieval is to help the user separate the documents that are needed from the irrelevant ones within a massive document collection.

In existing retrieval systems, the user provides a query composed of several features (usually keywords); the retrieval system evaluates how well each document in the document library matches the query and outputs the documents, or the document identifiers, whose matching degree reaches a preset standard.

However, users often do not know the retrieval effectiveness of each feature, which makes query construction somewhat blind. A user may put a large number of features into a query and still fail to improve the result. The consequence is either that the result set cannot be narrowed, or that results of potential importance to the user's retrieval need are lost. In addition, a complex query composed of many features greatly increases the computational cost of the retrieval system.

Therefore, when a massive document collection must be searched, how to help the user find suitable features, and thereby narrow the retrieval range quickly without losing important results, is a major problem in improving retrieval effectiveness.

On the other hand, how to use information technology to estimate the importance of various things to people is also a major problem of the Internet age, and it bears on many areas such as e-commerce and online communities. Existing techniques essentially analyze online interaction behavior (such as clicks, links between web pages, and the query statements used for retrieval) to assess how much attention is paid to the things represented by various network resources (links, search keywords and so on). The amount of information contained in such behavior is limited, however, which affects the accuracy of the evaluation.

The background documents related to the present invention are as follows:

(1) Patent documents relating to keyword extraction

● Chinese patent application CN200710177074, a news keyword extraction method based on word frequency and n-grams;

● U.S. Patent application US2008/0195595, Keyword Extracting Device;

● U.S. Patent application US2008/0319746, KEYWORD OUTPUTTING APPARATUS AND METHOD;

● U.S. Patent application US2008/0033938, Keyword outputting apparatus, keyword outputting method, and keyword outputting computer program product;

● U.S. Patent 6470307, Method and apparatus for automatically identifying keywords within a document.

(2) Techniques for evaluating retrieval features

● U.S. Patent application US2009/0049036, Systems and methods for keyword selection in a web-based social network, which discloses how to calculate a keyword's score from the difference in the keyword's distribution between two text collections;

● U.S. Patent applications US2007/0288514, System and method for keyword extraction, and US2009/0083262, SYSTEM FOR ENTITY SEARCH AND A METHOD FOR ENTITY SCORING IN A LINKED DOCUMENT DATABASE, which disclose how to take a keyword and an entity type provided by the user as the retrieval input, find the documents that contain the keyword and entities of that type, and then calculate a score for each entity from those documents;

● U.S. Patent application US2007/0061320, Multi-document keyphrase exctraction using partial mutual information, which discloses a method of extracting keywords from a subset of a document collection and scoring the keywords against the document collection;

● U.S. Patent 6502065, Teletext broadcast receiving apparatus using keyword extraction and weighting, which discloses a method of finding the keywords common to a document collection for use as a text summary, involving statistics on each word's within-document frequency and across-document frequency in the collection.

(3) Similarity (finding similar texts from a given text)

● U.S. Patent application US2007/0192310, INFORMATION PROCESSING APPARATUS AND METHOD, AND PROGRAM, which discloses a method of assessing the relevance between a query and a document to be retrieved using the keywords that the two share.

(4) Expansion and contraction of the retrieval feature set

● U.S. Patent 7191177, Keyword extracting device, which discloses how to extract candidate keywords from a query text and then prune them by filtering against a blacklist;

● U.S. Patent applications US2008/0243820, Semantic analysis documents to rank term, and US20080133509, Selecting Keywords Representative of a Document, which disclose extracting candidate keywords from a query text and scoring the keywords with an ontology so as to expand the candidate keywords.

(5) Documents relating to both feature extraction and feature evaluation

● Chinese patent application CN200580044686, full-text query and search system and method of use thereof, which discloses a method of calculating the matching degree between a query text and a retrieval result but does not address evaluating the discriminating power of retrieval features;

● Chinese patent application CN200510117001, a method for fast similarity retrieval of massive text, which discloses a fast document retrieval method that uses important features to narrow the retrieval range but does not disclose how the important features are selected.

● U.S. Patent application US2007/0288433, DETERMINING RELEVANCY AND DESIRABILITY OF TERMS, which discloses a method of scoring the keywords in a query according to their distribution in other users' queries.

● U.S. Patents and patent applications US6064952, Information abstracting method, information abstracting apparatus, and weighting method; US6240378, Weighting method for use in information extraction and abstracting, based on the frequency of occurrence of keywords and similarity calculations; and US2002/0072895, Weighting method for use in information extraction and abstracting, based on the frequency of occurrence of keywords and similarity calculations, which disclose dividing an article into several sections, extracting keywords from each section, and calculating each keyword's score according to its occurrence in the other sections.

● U.S. Patent 5297039, Text search system for locating on the basis of keyword matching and keyword relationship matching, which involves calculating the score of relevant terms in the document library according to their relevance to the query.

● U.S. Patent application US2008/0243811, SYSTEM AND METHOD FOR RANKED KEYWORD SEARCH ON GRAPHS, which discloses, for the case where the retrieval model is a directed graph, a method of performing document retrieval by matching candidate retrieval features from the retrieval text against candidate features from the documents to be retrieved.

Summary of the invention

The object of the present invention is to overcome the above shortcomings of the prior art by providing a system and method for obtaining features that help text retrieval, a system and method for evaluating the value to a user of things related to an input text, a system and method for assessing a person's career advantages from that person's resume, and a system and method for document retrieval that take a text as the query input. These help the user find retrieval features that are useful for the user's own retrieval needs, construct effective queries, narrow the retrieval range quickly and avoid losing potentially valuable results, while being simple and convenient, stable and reliable, and widely applicable.

To achieve the above object, the system and method for obtaining features that help text retrieval, the system and method for evaluating the value to a user of things related to an input text, the system and method for assessing career advantages from a resume, and the system and method for document retrieval with a text as the query input are as follows:

The system for obtaining features that help text retrieval is mainly characterized in that it comprises:

An input device for receiving the input text submitted by the user;

A feature generating device for generating at least one candidate feature from the input text;

A scoring device for calculating at least one score of each candidate feature with respect to retrieval effectiveness;

A result generating device for producing at least one result feature from the scored candidate features; and

An output device for outputting the result features to the user in a form that the user can process or understand;

wherein the calculation of the score depends at least in part on the distribution characteristics of the candidate feature in a reference document collection.

The output device of this system also outputs the score of each result feature, where the score of a result feature is the score that the scoring device assigned to the candidate feature, in the candidate feature data, that is identical to that result feature.

The feature generating device of this system also operates at least one candidate adjusting device, which deletes and/or adds at least one candidate feature on the basis of the original candidate features.

The calculation of the score in this system also depends on the distribution characteristics of the candidate feature in the input text, and the feature generating device also generates, for each candidate feature, data on its distribution characteristics in the input text.

The calculation of the score in this system also depends on the distribution characteristics of the candidate feature in at least one second reference document collection.

The method for obtaining features that help document retrieval, based on the above devices, is mainly characterized in that it comprises the following steps:

(1) an input step: receiving the input text submitted by the user;

(2) a feature generation step: generating at least one candidate feature from the input text;

(3) a scoring step: calculating at least one score of each candidate feature with respect to retrieval effectiveness;

(4) a result generation step: producing at least one result feature from the scored candidate features;

(5) an output step: outputting the result features to the user in a form that the user can process or understand.
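The five steps above can be sketched roughly as follows. This is a minimal illustration, not the claimed implementation: the word-level tokenization, the logarithmic inverse-document-frequency score and all function names (`generate_candidates`, `score_candidates`, `generate_results`, `acquire_features`) are assumptions made for the example. The text only requires that the score depend at least in part on how a candidate is distributed in the reference document collection.

```python
import math
import re


def tokenize(text):
    """Split text into lowercase word tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def generate_candidates(input_text):
    """Feature generation step: each distinct word of the input is a candidate."""
    return sorted(set(tokenize(input_text)))


def score_candidates(candidates, reference_docs):
    """Scoring step: score each candidate from its distribution in the references."""
    doc_words = [set(tokenize(doc)) for doc in reference_docs]
    scores = {}
    for feature in candidates:
        doc_freq = sum(1 for words in doc_words if feature in words)
        if doc_freq == 0:
            continue  # absent from the reference collection: no valid score
        # Assumed scoring rule: rarer candidates receive higher scores.
        scores[feature] = math.log(len(reference_docs) / doc_freq)
    return scores


def generate_results(scores, top_k=3):
    """Result generation step: keep the top_k scored candidates."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


def acquire_features(input_text, reference_docs, top_k=3):
    """Input step through result generation; the caller performs the output step."""
    candidates = generate_candidates(input_text)
    scores = score_candidates(candidates, reference_docs)
    return generate_results(scores, top_k)
```

Returning the score alongside each result feature corresponds to the optional output of scores described for the output step.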

The output step of this method further comprises the following step:

Outputting the score of each result feature, where the score of a result feature is the score of the candidate feature, in the candidate feature data, that is identical to that result feature.

The feature generation step of this method further comprises the following step:

At least one candidate adjustment step, which deletes and/or adds at least one candidate feature on the basis of the original candidate features.

The calculation of the score in this method also depends on the distribution characteristics of the candidate feature in the input text, and the feature generation step comprises the following step:

Generating, for each candidate feature, data on its distribution characteristics in the input text.

The calculation of the score in this method also depends on the distribution characteristics of the candidate feature in at least one second reference document collection.

The system for evaluating the value of things related to an input text submitted by a user is mainly characterized in that it comprises:

An input device, which receives the input text submitted by the user;

A keyword generating device, which generates at least one candidate feature in keyword form from the input text;

A scoring device, which calculates at least one score of each candidate feature;

A result generating device, which produces at least one result feature from the scored candidate features; and

An output device, which outputs the result features to the user in a form that the user can process or understand.

The keyword generating device of this system also operates at least one candidate adjusting device to adjust the candidate features, that is, to delete some candidate features from the original candidate features and/or to add some features as new candidate features.

The output device of this system also outputs the score of each result feature, where the score of a result feature is the score of the candidate feature, in the candidate feature data, that is identical to that result feature.

The calculation of the score in this system also depends on the distribution characteristics of the candidate feature in at least one second reference document collection.

Further, in this system the input text contains a description of a first party's demand, and the reference document collection contains descriptions of second parties' supply corresponding to that demand; or the input text contains a description of a first party's supply, and the reference document collection contains descriptions of second parties' demand corresponding to that supply; or the input text contains a description of a first party's supply or demand, and the reference document collection contains descriptions of second parties' demand or supply of the same type as that of the first party.

Further, the second reference document collection contains descriptions of third parties' demand or supply of the same type as the demand or supply of the first party.

The input text may be added to the second reference document collection.

The method for evaluating the value of things related to an input text submitted by a user, based on the above system, is mainly characterized in that it comprises the following steps:

(1) an input step: receiving the input text submitted by the user;

(2) a keyword generation step: generating at least one candidate feature in keyword form from the input text;

(3) a scoring step: calculating at least one score of each candidate feature;

(4) a result generation step: producing at least one result feature from the scored candidate features;

(5) an output step: outputting the result features to the user in a form that the user can process or understand.

The keyword generation step of this method further comprises the following step:

At least one candidate adjustment sub-step, which deletes some candidate features from the original candidate features and/or adds some features as new candidate features.

The output step of this method further comprises the following step:

Outputting the score of each result feature, where the score of a result feature is the score of the candidate feature, in the candidate feature data, that is identical to that result feature.

In this method, the calculation of the score of a candidate feature depends at least in part on the distribution characteristics of the candidate feature in the reference document collection.

In this method, the calculation of the score also depends at least in part on the distribution characteristics of the candidate feature in the input text, and/or at least in part on its distribution characteristics in at least one second reference document collection.

The system for assessing a person's career advantages from that person's resume, based on the above system, is mainly characterized in that the input text and the reference document collection may be in one of the following configurations:

Input text	Reference document collection
Resume text	Job posting library
Resume text	Resume library

In this system, the calculation of the score also depends on the distribution characteristics of the candidate feature in at least one second reference document collection, and the input text, the reference document collection and the second reference document collection are in one of the following configurations:

Input text	Reference document collection	Second reference document collection
Resume text	Job posting library	Resume library
Resume text	Resume library	Job posting library
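As a rough illustration of the first three-column configuration (resume text, job posting library, resume library), the sketch below grades each word of a resume by how often job postings ask for it relative to how often resumes already offer it. The demand-to-supply ratio and the names `doc_fraction` and `career_advantage` are assumptions for illustration only; the text states that the score depends on the distributions in both collections but gives no formula here. Adding the input resume to the second collection follows the option, stated earlier, that the input text may be added to the second reference document collection.

```python
def doc_fraction(term, docs):
    """Fraction of documents whose whitespace-separated words include the term."""
    if not docs:
        return 0.0
    return sum(1 for doc in docs if term in doc.lower().split()) / len(docs)


def career_advantage(resume_text, job_postings, resume_library):
    """Grade each resume word by assumed demand/supply across the two collections."""
    # Adding the input resume to the second collection keeps supply above zero.
    supply_docs = list(resume_library) + [resume_text]
    scores = {}
    for skill in set(resume_text.lower().split()):
        demand = doc_fraction(skill, job_postings)
        if demand == 0:
            continue  # no posting mentions it: no valid score
        supply = doc_fraction(skill, supply_docs)
        scores[skill] = demand / supply  # hypothetical scoring rule
    return scores
```

Under this assumed rule, a skill that many postings request but few resumes list receives a higher grade than one that is equally requested but widely held.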

The method for assessing a person's career advantages from that person's resume, based on the above method, is mainly characterized in that the input text and the reference document collection are in one of the configurations listed above for the corresponding system.

In this method, the calculation of the score also depends on the distribution characteristics of the candidate feature in at least one second reference document collection, and the input text, the reference document collection and the second reference document collection are in one of the three-column configurations listed above for the corresponding system.

The system for document retrieval with a text as the query input is mainly characterized in that it comprises:

An input device, which receives the input text submitted by the user;

The above system for obtaining features that help text retrieval, which obtains from the input text an output result containing result features;

A retrieval device, which feeds the output result into a retrieval system to obtain a retrieval result;

A retrieval output device, which outputs the retrieval result.

The method for document retrieval with a text as the query input is mainly characterized in that it comprises the following steps:

(1) an input step: receiving the input text submitted by the user;

(2) a feature obtaining step: obtaining result features using the above method for obtaining features that help document retrieval;

(3) a retrieval step: producing a retrieval result that depends on the result features;

(4) a retrieval output step: outputting the retrieval result.
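A minimal sketch of these four steps is given below. It assumes whitespace tokenization, an inverse-document-frequency style score for choosing result features, and a toy retrieval step that returns every document containing at least one result feature. The names `top_features` and `retrieve_by_text` are hypothetical, and a real deployment would pass the result features to an existing retrieval system instead.

```python
import math


def top_features(input_text, docs, k=2):
    """Feature obtaining step: pick the k input words that are rarest in docs."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    scored = []
    for word in set(input_text.lower().split()):
        doc_freq = sum(1 for words in tokenized if word in words)
        if doc_freq:  # words absent from every document have no valid score
            scored.append((math.log(len(docs) / doc_freq), word))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:k]]


def retrieve_by_text(input_text, docs, k=2):
    """Retrieval step: return documents containing any of the result features."""
    features = top_features(input_text, docs, k)
    hits = [doc for doc in docs if any(f in doc.lower().split() for f in features)]
    return features, hits
```

The user supplies only a descriptive text; which words become query terms is decided by the scoring, not by the user.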

With the system and method of this invention for obtaining features that help text retrieval, the user can find retrieval features that are useful for the user's own retrieval needs. When facing a massive document collection, the user can construct effective queries from these features and narrow the retrieval range quickly while avoiding the loss of potentially valuable results. The system and method are simple and convenient, stable and reliable, and widely applicable. Combined with an existing retrieval system, they yield a more convenient and easier-to-use retrieval system: the user only needs to enter a descriptive text to retrieve related material, which avoids the drop in retrieval effectiveness caused by poorly chosen keywords.

With the system and method of this invention for evaluating the value to a user of things related to an input text, the user submits a descriptive text and obtains an evaluation of the value of the various related things to that user. The system and method are simple, effective and easy to understand, and they suit many uses, such as job seeking, paper submission and online dating.

Description of drawings

Fig. 1 is a schematic diagram of the functional modules of the system of the present invention for obtaining features that help text retrieval.

Fig. 2 is a schematic diagram of the functional modules of the system of the present invention for obtaining features that help text retrieval, with a candidate adjusting device.

Fig. 3 is a schematic diagram of the functional modules of the system of the present invention for obtaining features that help text retrieval, involving a second reference document collection.

Fig. 4 is a schematic diagram of the functional modules of the system of the present invention for evaluating the value to a user of things related to an input text.

Fig. 5 is a schematic diagram of the functional modules of the system of the present invention for assessing a person's career advantages from that person's resume.

Fig. 6 is a schematic diagram of the functional modules of the system of the present invention for document retrieval with a text as the query input.

Embodiment

To make the technical content of the present invention clearer, the following embodiments are described in detail.

The basic strategy of the present invention is introduced first:

When the user provides an input text that reflects the user's retrieval need, a sufficient number of candidate features are identified from the input text, a score is calculated for the retrieval effectiveness of each candidate feature, and finally at least one result feature is produced from the scored candidate features and output to the user. The calculation of the score depends at least in part on the distribution characteristics of the candidate feature in a reference document collection that serves as the reference. Further, the calculation of the score also depends on the distribution characteristics of the candidate feature in the input text and/or in at least one second reference document collection. In this way, the present invention provides a system for obtaining features that help retrieval.

At the same time, the present invention can also use the retrieval effectiveness of a keyword-form feature that represents a thing to evaluate the value of that thing. The invention rests on the following fact: a keyword represents a thing, so information on the keyword's distribution characteristics in the input text, the reference document collection and the second reference document collection reflects the degree to which the thing is valued, is in demand, or is commonly possessed in the input text, the reference document collection or the second reference document collection. A score for the keyword that is derived from this distribution information therefore also reflects the value of the thing to the author of the input text. In this way, the invention provides a system and method for evaluating the value of things related to an input text submitted by a user.

In the present invention, the retrieval effectiveness of a candidate feature refers to the degree to which, for a retrieval system that uses a reference document collection and/or a second reference document collection as its data source, adding the candidate feature to an original query improves the new retrieval result relative to the original one.

The more reference documents in a reference document collection contain a given candidate feature, the weaker that feature's retrieval effectiveness is for the collection. If a retrieval system forms a new query by adding a candidate feature with weak retrieval effectiveness to the original query, the results returned by the new query will not narrow the result set effectively. Conversely, the fewer reference documents contain the candidate feature, the stronger its retrieval effectiveness is for the collection, and adding it to the original query will shrink the result set significantly. Of course, if a candidate feature appears in no reference document of the collection, its retrieval effectiveness in the collection is mathematically meaningless; the scoring step marks the score of such a candidate feature as invalid, and the feature will not be used as a result feature. In the present invention, the reference document collection and the second reference document collection are the document collections used as references when calculating scores. In practice, a document collection can exist in many forms, such as records in a database, web pages on a website, a directory in a file system together with the files under it, or other forms of document collection.
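The relationship between document frequency and result-set shrinkage can be illustrated with a small sketch. It assumes a purely conjunctive keyword query over whitespace-tokenized documents and measures effectiveness as the fraction of the original result set removed when the candidate is added. The measure and the names `matching_docs` and `shrink_ratio` are illustrative assumptions, not a formula given in the text.

```python
def matching_docs(terms, docs):
    """Return the documents that contain every term (a simple conjunctive query)."""
    return [doc for doc in docs if all(t in doc.lower().split() for t in terms)]


def shrink_ratio(query_terms, candidate, docs):
    """Fraction of the original results removed by adding the candidate.

    Returns None when the score is invalid: the candidate occurs in no
    reference document, or the original query already returns nothing.
    """
    if not matching_docs([candidate], docs):
        return None
    before = matching_docs(query_terms, docs)
    if not before:
        return None
    after = matching_docs(list(query_terms) + [candidate], docs)
    return 1 - len(after) / len(before)
```

A candidate contained in every matching document yields a ratio of 0 (no narrowing), while a candidate contained in few of them yields a ratio close to 1.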

The distribution characteristics of a candidate feature in a document set refer to, for each document of the document set, the existence of the candidate feature, its number of occurrences, the position of each occurrence, the size of the text region it covers, and/or other information related to how the candidate feature is distributed. Data about the distribution characteristics of a candidate feature across the documents of a document set are called the distribution characteristic data of the candidate feature with respect to that document set. In implementation, the distribution characteristic data of a feature with respect to a document set may include, but are not limited to, the following:

● the number of documents in the document set that contain the feature;

● the total number of occurrences of the feature in the document set;

● the average first-occurrence position of the feature in the document set. For some types of text (such as natural language text), the smaller the relative position of a feature's first occurrence (that is, the ratio of the first-occurrence position to the text length), the earlier the feature is mentioned in the text, and the more important it is considered to be for that text. Therefore, for a given feature, finding the documents of the document set in which it occurs and averaging its first-occurrence position over those documents can provide valuable information for calculating the feature's retrieval effectiveness;

● the total size of the text regions covered by the feature across the documents of the document set. The strings matched by successive occurrences of a feature in a document need not have equal length; an obvious example is a linguistic unit (a linguistic unit can represent a syntactic structure, a semantic role, and so on). Under the approximate text analysis method, several linguistic units can be reduced to obtain a new linguistic unit. The farther apart the linguistic units taking part in the reduction, the larger the text region covered by the new linguistic unit (for details, see the Chinese patent document "Apparatus and method of analyzing approximate texts", patent No. 200510023589.8).
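
As an illustration only, the four kinds of distribution characteristic data listed above might be computed as in the following Python sketch. It assumes that a feature is a plain substring and that each document is a string; the names `CollectionStats` and `collection_stats` are hypothetical and are not part of the original disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CollectionStats:
    doc_count: int                        # documents containing the feature
    total_occurrences: int                # occurrences summed over all documents
    mean_first_position: Optional[float]  # mean relative first-occurrence position
    covered_size: int                     # characters covered by all occurrences


def collection_stats(feature: str, documents: List[str]) -> CollectionStats:
    """Compute the four distribution statistics for a substring feature."""
    if not feature:
        raise ValueError("feature must be a non-empty string")
    doc_count = 0
    total = 0
    first_positions = []
    for text in documents:
        first = text.find(feature)
        if first < 0:
            continue  # the feature does not occur in this document
        doc_count += 1
        total += text.count(feature)  # counts non-overlapping occurrences
        first_positions.append(first / len(text))
    mean_first = (
        sum(first_positions) / len(first_positions) if first_positions else None
    )
    # Assumption: a fixed substring covers len(feature) characters per occurrence.
    return CollectionStats(doc_count, total, mean_first, total * len(feature))
```

For features of variable length, such as linguistic units, the covered size would have to be accumulated per match rather than derived from the feature length.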

With reference to the foregoing definition of distribution characteristics, a skilled person can readily construct various concrete forms of distribution characteristic data to meet the specific engineering requirements of an implementation.

Similarly, the distribution characteristics of a candidate feature in a text or document refer to the existence of the candidate feature in the text or document, its number of occurrences, the position of each occurrence, the size of the text region it covers, and/or other information related to how the candidate feature is distributed. Data about the distribution characteristics of a candidate feature in a text or document are called the distribution characteristic data of the candidate feature with respect to that text or document.

In the present invention, the reference documents and the input text are not limited to natural language text written in a human written language. They may also be computer code text, markup language text (such as HTML text or XML text), a digitized signal sequence (such as a speech signal), or a sequence expressed in coded form (such as a DNA sequence).

The features referred to in the present invention (candidate features or result features) are not limited to patterns that match a particular text subsequence (such as character strings, keywords, patterns of character strings, or patterns of keywords). They may also be matching patterns over other features (such as a syntactic structure, a semantic role, or the tags or block layout features of a markup language text), as well as any other feature that a retrieval system can exploit when searching the reference document set. A feature that is a matching pattern over other features falls into one of two cases: (1) the feature matches a single feature; for example, a feature in the form of a semantic role can match any keyword feature that carries that semantic role; (2) the feature matches a combination of features; for example, a grammatical feature representing a subject-predicate phrase can match the combination of a semantic role feature representing an objective entity and a grammatical feature representing a verb.

A keyword in the present invention is a feature that characterizes a thing. The thing may be an objective object, action, or event in the ordinary sense, or another conceptual notion such as a property, state, or degree of a thing.

Any device and any second device described in the present invention may be physically distinct computing devices, the same computing device executing different operation sequences, or the same computing device executing the same operation sequence with different operating parameters. Operating parameters are the data that a computing device needs to obtain in order to execute an operation sequence.

A user in the present invention is an entity that operates the system of the present invention using the method of the present invention. A user may be a natural person, an organization, or an automatic device. "The present invention" refers to everything disclosed in this specification and in the claims corresponding to it. A computing device in the present invention may be, but is not limited to, a computer, an embedded device, a circuit, an integrated circuit chip, an artificially constructed macromolecular structure capable of carrying out computing tasks, a quantum computer, or any other artifact that can complete computing tasks.

Referring to Figure 1, the system of the present invention for obtaining features that help text retrieval includes, but is not limited to:

Input device, which receives the input text submitted by the user;

Feature generating device, which generates at least one candidate feature from the input text;

Scoring device, which calculates at least one score of the retrieval effectiveness of each candidate feature;

Result generating device, which produces at least one result feature from the scored candidate features;

Output device, which outputs the result feature data to the user.

Referring further to Figure 2, the feature generating device may also operate at least one candidate adjusting device to adjust the candidate features, that is, to delete some candidate features from the original candidate features and/or add some features as new candidate features.

In addition, the output device also outputs the score of each result feature. The score of a result feature is the score that the scoring device assigned to the candidate feature, in the candidate feature data, that is identical to that result feature.

At the same time, the method of the present invention for obtaining features that help text retrieval includes, but is not limited to:

(1) input step: receive the input text submitted by the user;

(2) feature generation step: generate at least one candidate feature from the input text;

(3) scoring step: calculate at least one score of the retrieval effectiveness of each candidate feature;

(4) result generation step: produce at least one result feature from the scored candidate features;

(5) output step: output the result features to the user in a form of expression that the user can process or understand.

The calculation of a candidate feature's retrieval effectiveness score depends at least in part on the distribution characteristics of the candidate feature in the reference document set.

The feature generation step may also include at least one candidate adjustment sub-step, which deletes some candidate features from the original candidate features and/or adds some features as new candidate features.

In addition, the output step also outputs the score of each result feature. The score of a result feature is the score that the scoring device assigned to the candidate feature, in the candidate feature data, that is identical to that result feature.

Referring further to Figure 3, the calculation of the score may also depend at least in part on the distribution characteristics of the candidate feature in the input text. Furthermore, the calculation of the score may also depend at least in part on the distribution characteristics of the candidate feature in at least one second reference document set.
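
The flow of the method described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: candidate features are taken to be distinct lowercase words, the score is assumed to be log2(N / number of reference documents containing the feature), and all function names are hypothetical.

```python
import math
import re
from typing import Dict, List, Tuple

WORD = re.compile(r"[a-z0-9+#]+")  # assumed tokenization rule


def generate_candidates(input_text: str) -> List[str]:
    """Feature generation step: distinct words, in order of first occurrence."""
    candidates: List[str] = []
    for word in WORD.findall(input_text.lower()):
        if word not in candidates:
            candidates.append(word)
    return candidates


def score_candidates(candidates: List[str], reference_docs: List[str]) -> Dict[str, float]:
    """Scoring step: features found in fewer reference documents score higher."""
    n_docs = len(reference_docs)
    doc_words = [set(WORD.findall(doc.lower())) for doc in reference_docs]
    scores: Dict[str, float] = {}
    for candidate in candidates:
        doc_count = sum(1 for words in doc_words if candidate in words)
        if doc_count > 0:  # a feature absent from every document has no valid score
            scores[candidate] = math.log2(n_docs / doc_count)
    return scores


def generate_results(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Result generation step: sort by descending score, then alphabetically."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def obtain_features(input_text: str, reference_docs: List[str]) -> List[Tuple[str, float]]:
    """Run feature generation, scoring, and result generation in sequence."""
    candidates = generate_candidates(input_text)
    scores = score_candidates(candidates, reference_docs)
    return generate_results(scores)
```

The caller would pass the returned list to whatever output device is in use; features that never occur in the reference document set are silently dropped because their score is undefined.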

Referring further to Figure 4, the system of the present invention for evaluating the value of things related to the input text submitted by a user comprises:

Input device, which receives the input text submitted by the user;

Keyword generating device, which generates at least one candidate feature in keyword form from the input text;

Scoring device, which calculates at least one score of each candidate feature;

Output device, which outputs the result features to the user in a form of expression that the user can process or understand.

The keyword generating device may also operate at least one candidate adjusting device to adjust the candidate features, that is, to delete some candidate features from the original candidate features and/or add some features as new candidate features.

In addition, the output device also outputs the score of each result feature. The score of a result feature is the score of the candidate feature, in the candidate feature data, that is identical to that result feature.

Simultaneously, the method for the value of the things that involved in the present invention this kind of evaluation is relevant with the input text that the user submits to, including but not limited to:

(1) input step receives the input text that the user submits to;

(2) keyword generates step, generates the candidate feature of at least one keyword form according to input text;

(5) output step is exported to the user with the form of expression that can be handled by the user or understand with feature as a result.

Wherein, generate at described keyword and can also comprise at least one candidate in the step and adjust substep, from original candidate feature some candidate feature of deletion and (or) add some features as new candidate feature.

On the other hand, described output step has also been exported each scoring of feature as a result, and the scoring of a described feature as a result is exactly the scoring of a candidate feature being equal to described feature as a result in the candidate feature data.

Simultaneously, the computation process of the scoring of a described candidate feature, partly depend at least described candidate feature described with reference to the distribution character in the collection of document.Further, the computation process of described scoring also partly depend at least described candidate feature in input text distribution character and (or) partly depend at least described candidate feature at least one second with reference to the distribution character in the collection of document.

Moreover, referring further to Figure 5, the system and method of the present invention for assessing a person's career advantages from that person's resume are based on the foregoing system and method for evaluating the value of things related to the input text submitted by a user, and are characterized in that the input text, the reference document set, and the second reference document set take one of the following configurations:

Input text	Reference document set	Second reference document set
Resume text	Job notice library	(none)
Resume text	Job notice library	Resume library
Resume text	Resume library	(none)
Resume text	Resume library	Job notice library

Referring further to Figure 6, the system of the present invention for document retrieval using a text as the query input includes, but is not limited to:

Input device, which receives the input text submitted by the user;

The above-described system for obtaining features that help text retrieval, which produces from the input text an output result containing result features;

Retrieval device, which feeds the output result into a retrieval system to obtain retrieval results;

Retrieval output device, which outputs the retrieval results.
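
To make the role of the retrieval device concrete, the following Python sketch stands in for the retrieval system. It is an assumption made for illustration: documents are ranked by the summed scores of the result features they contain (matched as lowercase substrings), so that a document lacking one feature is ranked lower rather than discarded. The function name `retrieve` is hypothetical.

```python
from typing import Dict, List, Tuple


def retrieve(result_features: Dict[str, float], documents: List[str]) -> List[Tuple[int, float]]:
    """Rank documents by the total score of the result features they contain.

    result_features maps a lowercase feature string to its score.
    Returns (document index, total score) pairs, best first.
    """
    ranked: List[Tuple[int, float]] = []
    for index, text in enumerate(documents):
        lowered = text.lower()
        total = sum(
            score for feature, score in result_features.items() if feature in lowered
        )
        if total > 0:  # keep only documents matching at least one scored feature
            ranked.append((index, total))
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked
```

A stricter variant could require every result feature to be present, which narrows the result set faster at the risk of losing potentially valuable documents.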

The steps of the methods involved in the foregoing technical solution of the present invention are described in detail below:

1, input step

The method of the present invention includes an input step that uses the input device to obtain the input text submitted by the user. The input device can be implemented in many ways, including but not limited to: an interface in hardware form (such as a network interface, USB interface, RS232 interface, or chip pin) or an interface in software form (such as a human-computer interaction interface, a storage medium access interface in an operating system, a database ODBC interface, or a network access interface).

In implementation, the input step may be designed to accept a text identifier rather than the whole text; this should be regarded as equivalent to accepting the text itself as input. For example, a skilled person constructs an additional input subsystem that receives the user's input text and saves it into an input text library on a storage medium. In the input step, the system then receives an identifier submitted by the user and, by accessing the storage medium, finds the text in the input text library that matches the identifier and uses it as the input text.

2, feature generation step

This step uses the feature generating device to generate candidate features from the input text. The feature generating device identifies the features that occur in the input text and takes them as candidate features. Furthermore, if the scoring device's calculation of a candidate feature's score also depends on the distribution characteristic data of the candidate feature with respect to the input text, the feature generating device also generates those distribution characteristic data.

A feature generating device that identifies candidate features in the input text can be implemented in several ways:

(1) Static recognition capability mode. In implementation, a skilled person uses a generator tool to convert given recognition data into part of the processing logic of the feature generating device. The recognition data describe the conditions that a recognizable candidate feature must satisfy.

If only retrieval features that can be identified correctly without observing the input text as a whole need to be recognized, such as character sequences and keywords, the feature generating device can be produced as a lexical analyzer by a lexical analyzer generator (such as LEX). In this case the recognition data contain the rules for matching retrieval features (usually regular expressions) that are used when the lexical analyzer is constructed. If the retrieval features to be recognized can only be identified correctly by observing the text as a whole, such as grammatical patterns, semantic roles, or layout features of a markup language, the feature generating device can be produced as a syntax analyzer by a syntax analyzer generator (such as YACC). In this case the recognition data contain the rules for matching retrieval features (usually grammar rules) that are used when the syntax analyzer is constructed. In particular, the Chinese patent ZL200510023589.8, "Apparatus and method of analyzing approximate texts", discloses a text analyzer that, by means of a loose form of reduction, can capture grammatical patterns or semantic roles conforming to particular rules without performing a complete text analysis. Of course, a skilled person can also combine at least one lexical analyzer with at least one syntax analyzer to construct a more powerful feature generating device. How to combine a lexical analyzer with a syntax analyzer is well known in computer science and is not repeated here.

(2) Dynamic recognition capability mode. In implementation, a skilled person adds to the feature generating device processing logic that accesses the recognition data. The recognition data describe the conditions that a recognizable candidate feature must satisfy.

The simplest implementation is software that scans the input text with a sliding window and compares the character fragment in the window against the retrieval features in a lookup table, thereby performing recognition. In this example the lookup table is the implementation of the recognition data.

(3) Combined static and dynamic recognition capability mode. In implementation, a skilled person converts part of the recognition data into part of the processing logic of the feature generating device, and adds to the feature generating device processing logic that accesses the remainder of the recognition data.
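
The difference between the static and dynamic modes can be illustrated with the Python sketch below. It is only an approximation under stated assumptions: in place of a LEX-generated analyzer, the static recognizer compiles a fixed rule list into a single regular expression, and the dynamic recognizer consults a lookup table supplied at call time. The rule names, the sample rules, and both function names are hypothetical.

```python
import re
from typing import List, Set, Tuple

# Static mode: the recognition data are compiled once into the matcher itself.
STATIC_RULES = [
    ("SKILL", r"\b(?:java|python|sql)\b"),
    ("YEARS", r"\b\d+\s+years?\b"),
]
STATIC_MATCHER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in STATIC_RULES),
    re.IGNORECASE,
)


def recognize_static(text: str) -> List[Tuple[str, str, int]]:
    """Return (rule name, matched text in lowercase, start offset) per match."""
    return [
        (match.lastgroup, match.group().lower(), match.start())
        for match in STATIC_MATCHER.finditer(text)
    ]


def recognize_dynamic(text: str, lookup_table: Set[str], max_window: int) -> List[Tuple[str, int]]:
    """Dynamic mode: slide a window over the text and consult the lookup table.

    lookup_table holds lowercase feature strings and can change between calls.
    Matching is by raw substring, so "java" is also found inside "javascript".
    """
    lowered = text.lower()
    hits: List[Tuple[str, int]] = []
    for start in range(len(lowered)):
        for width in range(1, max_window + 1):
            if start + width > len(lowered):
                break  # the window would run past the end of the text
            fragment = lowered[start:start + width]
            if fragment in lookup_table:
                hits.append((fragment, start))
    return hits
```

A combined mode would run both recognizers and merge their results, keeping stable rules in the compiled matcher and frequently changing ones in the lookup table.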

Furthermore, in the present invention the feature generating device also generates, as part of the candidate feature data required by the scoring device, the distribution characteristic data of each candidate feature with respect to the input text. Each time the feature generating device identifies an occurrence of a candidate feature in the input text, it records the context data of that occurrence to a storage medium accessible to the computing device; finally, from all the context data recorded for the candidate feature, it calculates the distribution characteristic data of the candidate feature with respect to the input text. Context data are the data (such as variable values, stacks, buffers, and temporary files) that the feature generating device produces at some step of processing in order to preserve state.

For example, suppose that in calculating scores the scoring device needs the mean offset, relative to the start of the text, of the positions at which a candidate feature occurs in the input text. Corresponding control logic can then be added to the feature generating device so that whenever an occurrence of a candidate feature X is identified, the context data <X, Y_i> are recorded in a storage medium readable by the computing device, where Y_i is the position of this occurrence in the text and i indicates that it is the i-th occurrence. After the feature generating device finishes extraction, all the context data in the storage medium are grouped by candidate feature. Each candidate feature X then corresponds to <X, Y_1>, <X, Y_2>, ..., <X, Y_n>, and the mean offset of X in the input text,

\bar{Y} = \frac{1}{n} \sum_{i = 1}^{n} Y_{i}

can be calculated. In implementation, a skilled person can readily obtain other distribution characteristic data of a candidate feature with respect to an input text by following this example.
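
The grouping and averaging just described can be sketched as follows. The sketch assumes the context data are available as (feature, position) pairs in memory; the function name `mean_offsets` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mean_offsets(records: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Group (feature, position) records by feature and average the positions."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for feature, position in records:
        grouped[feature].append(position)
    # Every group has at least one position, so the division is always defined.
    return {
        feature: sum(positions) / len(positions)
        for feature, positions in grouped.items()
    }
```

For instance, the records ("java", 0), ("sql", 9), ("java", 20) give a mean offset of 10.0 for "java" and 9.0 for "sql".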

Because they are very similar to the feature generating device described above, which processes the input text to produce candidate features and their distribution characteristic data in the input text, the first feature data generating device, which processes the reference document set to produce the first feature data, and the second feature data generating device, which processes the second reference document set to produce the second feature data, are not described further.

3, keyword generation step

This step uses the keyword generating device to generate at least one candidate feature in keyword form from the input text. With reference to the feature generation step and feature generating device described above, a skilled person can readily implement the keyword generation step and the keyword generating device. As to how to ensure that what the keyword generation step outputs from the input text are candidate features in keyword form, the literature on keyword extraction may be consulted. If this step is to be implemented without introducing additional data, the simplest method is to extract single Chinese characters or single words as candidate keywords. Going further, N consecutive Chinese characters or N consecutive words (that is, N-grams) can be extracted as candidate keywords. Although some of the results produced by these simple methods may be strings with no practical meaning, the methods greatly simplify the engineering implementation. Moreover, features that are not keywords usually receive low scores because they occur rarely, so they can generally be distinguished from legitimate keywords.
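
The simple N-gram approach mentioned above can be sketched in Python as follows. The sketch assumes whitespace-separated words for languages written with spaces and individual non-space characters for Chinese text; both function names are hypothetical.

```python
from typing import List


def word_ngrams(text: str, n: int) -> List[str]:
    """Return every run of n consecutive whitespace-separated words."""
    if n < 1:
        raise ValueError("n must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text: str, n: int) -> List[str]:
    """Return every run of n consecutive non-space characters."""
    if n < 1:
        raise ValueError("n must be at least 1")
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]
```

Many of the generated strings (for example "件工" from "软件工程师") are not real keywords; the approach relies on the scoring step to separate them from meaningful candidates.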

4, candidate adjustment sub-step

Within the feature generation step, the method of the present invention may also include at least one candidate adjustment sub-step that uses the candidate adjusting device to adjust the candidate features.

The candidate adjustment sub-step may be a sub-step that deletes at least one candidate feature. Such a deletion sub-step deletes at least one candidate feature that meets a first preset criterion. The first preset criterion can be implemented in many ways. For example, it may be a predefined blacklist, in which case the candidate features listed on the blacklist are removed. If the score calculation depends on the distribution characteristic data of the candidate feature with respect to the input text, the first preset criterion may also contain restrictions on the distribution characteristics, such as a minimum number of occurrences that a candidate feature must reach in the input text. United States Patent US7191177, "Keyword extracting device", discloses how to extract candidate keywords from an input text and then filter them with a blacklist to reduce the candidate keywords.
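
A deletion sub-step combining the two example criteria above might look like the following sketch. It assumes the candidate features are held in a dictionary mapping each feature to its number of occurrences in the input text; the function name and parameters are hypothetical.

```python
from typing import Dict, Set


def delete_candidates(
    occurrences: Dict[str, int],
    blacklist: Set[str],
    min_occurrences: int = 1,
) -> Dict[str, int]:
    """Drop blacklisted features and features occurring too rarely in the input."""
    return {
        feature: count
        for feature, count in occurrences.items()
        if feature not in blacklist and count >= min_occurrences
    }
```

With a blacklist of {"the"} and a minimum of 2 occurrences, the candidates {"java": 3, "the": 7, "sql": 1} reduce to {"java": 3}.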

The candidate adjustment sub-step may also be a sub-step that adds at least one candidate feature. Such an addition sub-step visits at least one candidate feature; for the candidate feature A currently visited, it uses data reflecting feature associations to find at least one retrieval feature B associated with A, and takes B as a new candidate feature. This produces an effect of intelligent association, so that the user can obtain features that do not appear in the input text but are relevant to the user's needs. Furthermore, the distribution characteristic data of A may be used as the distribution characteristic data of B. The data reflecting feature associations may be specified by the user; may come from a manually maintained knowledge base describing the associations between features (for example, an ontology library); may be feature association knowledge obtained by an automatic process (for example, co-occurrence between retrieval features, obtained by statistical linguistic analysis of a corpus, treated as association); or may be feature association knowledge obtained in a semi-automatic, partly manual way (for example, co-occurrence between retrieval features found by supervised machine learning, treated as association). United States patent applications US2008/0243820, "Semantic analysis documents to rank term", and US2008/0133509, "Selecting Keywords Representative of a Document", disclose extracting candidate keywords from an input text and using an ontology to score the candidate keywords, thereby expanding the candidate keywords. Those implementing the present invention may refer to these and other related documents to implement the candidate adjustment sub-step.
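
The addition sub-step can be sketched as follows, assuming the feature association data are given as a dictionary from a feature to its associated features and that each candidate carries a single piece of distribution data (here an occurrence count). The function name `expand_candidates` is hypothetical.

```python
from typing import Dict, List


def expand_candidates(
    stats: Dict[str, int],
    associations: Dict[str, List[str]],
) -> Dict[str, int]:
    """Add features associated with existing candidates.

    A newly added feature B inherits the distribution data of the candidate A
    that introduced it; features already present keep their own data.
    """
    expanded = dict(stats)
    for feature, data in stats.items():
        for related in associations.get(feature, []):
            expanded.setdefault(related, data)
    return expanded
```

Only one level of association is followed here; features reached through a newly added feature are not expanded again.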

5, scoring step

The method of the present invention includes a scoring step that uses the scoring device to calculate at least one score of each candidate feature. The score of a candidate feature depends at least in part on the distribution characteristics of the candidate feature in the reference document set. Furthermore, the score may also depend at least in part on the distribution characteristics of the candidate feature in the input text and/or on its distribution characteristics in at least one second reference document set. In the present invention, a reference document may belong to the reference document set and to a second reference document set at the same time, and/or a reference document may belong to more than one second reference document set at the same time.

This step involves three technical issues:

(1) How the distribution characteristic data of a candidate feature with respect to the input text, the reference document set, and/or the second reference document set are produced. In the present invention, a first feature data generating device generates the distribution characteristic data of a candidate feature with respect to the reference document set. Furthermore, the feature generating device, in addition to generating candidate features, may also generate the distribution characteristic data of the candidate features with respect to the input text, and/or at least one second feature data generating device may generate the distribution characteristic data of the candidate features with respect to at least one second reference document set.

In implementation, a feature generating device and a first feature data generating device may be the same device; a first feature data generating device and a second feature data generating device may be the same device; and/or a feature generating device and a second feature data generating device may be the same device.

(2) How the scoring device obtains the distribution characteristic data of a candidate feature with respect to the input text, the reference document set, and/or the second reference document set.

The distribution characteristic data of candidate features with respect to the input text that the scoring device requires come from the candidate feature data.

The scoring device may obtain the distribution characteristic data of a candidate feature with respect to the reference document set in ways including, but not limited to, the following:

(A) Static way. The scoring device obtains the distribution characteristic data from first feature data. The first feature data are produced from the reference document set either by another system or by the present system. In the former scheme, the present system only needs to obtain the necessary distribution characteristic data and perform the calculation, which simplifies the design. In the latter scheme, a preparation step is included before the input step, in which the first feature data generating device generates first reference features from the reference document set, generates the distribution characteristic data of the first reference features with respect to the reference document set, and saves them in the first feature data. In the static way, the system prepares in advance all the distribution characteristic data that might be used and stores them in a specific data structure. When the system of the present invention is accessed by a user, the reference document set does not need to be processed, which saves time. This way is suitable when the content of the reference document set does not change frequently.

(B) Dynamic way. The scoring device obtains the distribution characteristic data directly by calling the first feature data generating device. In this way, each time the system of the present invention is accessed by a user, the reference document set must be processed again to obtain the distribution characteristic data. This way is suitable when the reference document set changes frequently or is small.
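
The two ways of obtaining the data can be contrasted with the sketch below. It assumes, for illustration only, that the distribution data of interest are document counts and that features are whitespace-separated lowercase words; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Assumed tokenization: lowercase, split on whitespace."""
    return text.lower().split()


def build_first_feature_data(reference_docs: List[str]) -> Dict[str, int]:
    """Preparation step (static way): count the documents containing each word."""
    doc_counts: Counter = Counter()
    for doc in reference_docs:
        doc_counts.update(set(tokenize(doc)))  # each document counts once per word
    return dict(doc_counts)


def doc_count_static(feature: str, first_feature_data: Dict[str, int]) -> int:
    """Static way: a dictionary lookup, with no access to the documents."""
    return first_feature_data.get(feature, 0)


def doc_count_dynamic(feature: str, reference_docs: List[str]) -> int:
    """Dynamic way: scan the whole reference document set on every request."""
    return sum(1 for doc in reference_docs if feature in set(tokenize(doc)))
```

The static lookup costs constant time per feature but goes stale when the reference document set changes; the dynamic scan is always current but costs time proportional to the size of the set.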

The scoring device may obtain the distribution characteristic data of a candidate feature with respect to at least one second reference document set in ways including, but not limited to, the following:

(A) Static way. The scoring device obtains the distribution characteristic data from at least one set of second feature data. The second feature data are generated from the second reference document set; for the specific implementation, refer to the generation of the first feature data.

(B) Dynamic way. The scoring device obtains the distribution characteristic data by calling the second feature data generating device.

(3) How the scoring device calculates a score for a candidate feature from the distribution characteristic data of the candidate feature with respect to the input text, the reference document set, and/or the second reference document set.

Given a reference document set A (containing N reference documents) and a candidate feature X, a person skilled in the art can construct many scoring formulas. For example:

s_{1} (X) = f_{2} (X) \log_{2} [\frac{N}{F_{1} (A, X)}],

s_{2} (X) = f_{0} (X) \cdot \log_{2} [\frac{N}{F_{1} (A, X)}],

s_{3} (X) = F_{2} (A, X) \log_{2} \frac{N}{F_{1} (A, X)},

s_{4} (X) = - \frac{\log_{2} F_{1} (A, X)}{F_{3} (A, X)},

s_{5} (X) = \frac{f_{1} (X) F_{4} (A, X)}{F_{3} (A, X)} \log_{2} \frac{N}{F_{1} (A, X)},

and so on.
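
The formulas s_1 to s_5 can be evaluated as in the following Python sketch. The container classes and the function `scores` are hypothetical; the sketch additionally assumes that a candidate with F_1(A, X) = 0 has no valid score and that s_4 and s_5 are skipped when F_3(A, X) = 0, since both would otherwise divide by zero.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class InputStats:
    exists: int        # f_0(X): 1 if X occurs in the input text, else 0
    occurrences: int   # f_1(X)
    covered_size: int  # f_2(X)


@dataclass
class SetStats:
    doc_count: int              # F_1(A, X)
    total_occurrences: int      # F_2(A, X)
    mean_first_position: float  # F_3(A, X)
    covered_size: int           # F_4(A, X)


def scores(inp: InputStats, ref: SetStats, n_docs: int) -> Optional[Dict[str, float]]:
    """Evaluate s_1..s_5; return None when the score is meaningless."""
    if ref.doc_count == 0:
        return None  # X occurs in no reference document
    log_ratio = math.log2(n_docs / ref.doc_count)
    result = {
        "s1": inp.covered_size * log_ratio,
        "s2": inp.exists * log_ratio,
        "s3": ref.total_occurrences * log_ratio,
    }
    if ref.mean_first_position > 0:  # avoid division by zero in s_4 and s_5
        result["s4"] = -math.log2(ref.doc_count) / ref.mean_first_position
        result["s5"] = (
            inp.occurrences * ref.covered_size / ref.mean_first_position * log_ratio
        )
    return result
```

For example, with N = 8, f_0 = 1, f_1 = 2, f_2 = 8, F_1 = 2, F_2 = 4, F_3 = 0.25, and F_4 = 16, the logarithmic factor is log2(8/2) = 2, so s_1 = 16, s_2 = 2, s_3 = 8, s_4 = -4, and s_5 = 256.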

Given a reference document set A, M second reference document sets B_j (the j-th containing P_j second reference documents), and a candidate feature X, a person skilled in the art can construct many scoring formulas. For example:

s_{6} (X) = \frac{1}{F_{1} (A, X)} \sum_{j = 1}^{M} \frac{1}{M} [\log_{2} \frac{F_{1} (B_{j}, X)}{P_{j}}],

s_{7} (X) = \log_{2} [\frac{N}{F_{1} (A, X)}] \sum_{j = 1}^{M} \frac{F_{2} (B_{j}, X)}{M},

s_{8} (X) = f_{1} (X) \log_{2} [\frac{N}{F_{1} (A, X)}] \sum_{j = 1}^{M} \frac{1}{M} [\log_{2} \frac{F_{1} (B_{j}, X)}{P_{j}}],

and so on.
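
As a sketch of how scores involving second reference document sets might be evaluated, the following Python functions implement s_6 and s_7. The function names and argument layout are hypothetical, and the sketch assumes that a score is undefined (returned as None) whenever a logarithm or a division would be applied to zero.

```python
import math
from typing import List, Optional, Tuple


def s6(f1_a: int, second_sets: List[Tuple[int, int]]) -> Optional[float]:
    """s_6: second_sets holds one (F_1(B_j, X), P_j) pair per second set."""
    if f1_a == 0 or not second_sets:
        return None
    if any(f1_b == 0 or p == 0 for f1_b, p in second_sets):
        return None  # log2 of zero, or division by zero, is undefined
    m = len(second_sets)
    mean_log = sum(math.log2(f1_b / p) for f1_b, p in second_sets) / m
    return mean_log / f1_a


def s7(n_docs: int, f1_a: int, f2_b: List[int]) -> Optional[float]:
    """s_7: f2_b holds F_2(B_j, X) for each second reference document set."""
    if f1_a == 0 or not f2_b:
        return None
    return math.log2(n_docs / f1_a) * sum(f2_b) / len(f2_b)
```

For example, with F_1(A, X) = 2 and two second sets with (F_1, P) = (1, 4) and (2, 8), each logarithm equals -2, so s_6 = -2 / 2 = -1. With N = 8, F_1(A, X) = 2, and F_2(B_j, X) = 3 and 5, s_7 = 2 × 4 = 8.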

The functions used in the formulas above are explained in the following table:

Function	Meaning
f_0(X)	Whether X exists in the input text (1: exists; 0: does not exist)
f_1(X)	The number of occurrences of X in the input text
f_2(X)	The total size of the text regions covered by all instances of X in the input text
F_1(D, X)	The number of documents in document set D that contain X
F_2(D, X)	The total number of occurrences of X in document set D
F_3(D, X)	The mean, over the documents a of document set D that contain X, of the ratio of the offset of the first occurrence of X in a to the length of a
F_4(D, X)	The total size of the text regions covered by the instances of X in the documents a of document set D

With reference to the functions in the table above, a skilled person can construct new functions in implementation, so as to calculate a score for a candidate feature from its distribution characteristic data with respect to the input text, the reference document set, and/or the second reference document set.

6. Result generation step

The method of the present invention further comprises a result generation step, in which a result generating apparatus produces at least one result feature from the scored candidate features. The result features may be produced either by adjusting the candidate features and taking the adjusted candidate features as the result features, or by taking the candidate features directly as the result features.

The adjustment of the scored candidate features may use any one of the following, or a combination of several:

(1) removing candidate features whose score is meaningless (this occurs when a candidate feature is not found in the reference document collection and/or the second reference document collection);

(2) removing the candidate features that meet a second preset criterion;

(3) sorting the candidate features;

(4) any other operation that changes the candidate features.

The second preset criterion is set in order to reduce the size of the returned results and improve the quality of the feedback. It can be implemented in several ways, for example: removing candidate features whose score is below a threshold; computing the mean E and the standard deviation δ of the scores of all candidate features and removing those whose score is below E − 3δ; or computing the median of the scores and removing candidate features whose score is below the median. With reference to these examples, a person skilled in the art can construct other second preset criteria that meet specific engineering requirements.
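
The three example criteria above might be sketched as follows. This is a minimal illustration only: the function names are hypothetical, scores are assumed to be held in a dictionary keyed by candidate feature, and the population standard deviation is used for δ, which the description does not specify.

```python
import statistics
from typing import Dict


def drop_below_threshold(scores: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Remove candidate features whose score is below a fixed threshold."""
    return {x: s for x, s in scores.items() if s >= threshold}


def drop_below_three_sigma(scores: Dict[str, float]) -> Dict[str, float]:
    """Remove candidate features whose score is below E - 3*delta."""
    values = list(scores.values())
    mean = statistics.fmean(values)
    delta = statistics.pstdev(values)  # assumption: population standard deviation
    return {x: s for x, s in scores.items() if s >= mean - 3 * delta}


def drop_below_median(scores: Dict[str, float]) -> Dict[str, float]:
    """Remove candidate features whose score is below the median score."""
    median = statistics.median(list(scores.values()))
    return {x: s for x, s in scores.items() if s >= median}
```

The E − 3δ rule removes only extreme low outliers, while the median rule always discards roughly half of the candidates, so the choice depends on how aggressively the result set should shrink.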

7. Output step

Finally, the method of the present invention comprises an output step, in which an output device outputs the result features and/or the score of each result feature in a form of expression that the user can process or understand. The form of expression may be, but is not limited to: a binary data file; a table; a chart; an animation; the input text in hypertext (HTML) form, in which result features whose scores fall in different grades are marked with different colors; the input text in hypertext form, in which result features are marked with links, so that when the user views the hypertext in a browser and clicks a link, a retrieval system searches for documents containing the result feature corresponding to that link; and/or any other form of expression that the user can process or understand.

The output device can be implemented in many ways, including but not limited to: hardware interfaces (such as a network interface, USB interface, RS232 interface, or chip pins) and software interfaces (such as a human-computer interaction interface, a storage medium access interface of the operating system, a database ODBC interface, or a network access interface). In some implementations, the output device may share the same physical or logical interface with the input device.

Some specific embodiments of the invention are given below. It should be understood that the invention is not limited to these specific embodiments.

Embodiment one

A system that helps the user find features with retrieval value in a submitted input text.

Referring to Figure 1, a system for obtaining features that help retrieval according to an embodiment of the invention is shown. This system 100 runs on a computer system as follows:

In the preparation step, the first characteristic generating apparatus 106 scans each reference document in the reference document collection 152. While scanning a reference document, the first characteristic generating apparatus 106 identifies each word made up of consecutive English letters and saves it in a trie, incrementing the count of that word in the trie by 1. When scanning is complete, the trie contains every English word in the reference document collection 152 together with the total number of occurrences of each word in the collection. Each distinct English word is a first contrast feature, and the total number of occurrences of each first contrast feature is its distribution characteristic data about the reference document collection 152. This trie serves as the first characteristic data. The first characteristic generating apparatus 106 can be generated by compiling a LEX file with LEX; the LEX file uses a regular expression to describe runs of consecutive English characters, so that the first characteristic generating apparatus 106 captures the words formed by such runs.
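
The preparation step can be illustrated with the following Python sketch, in which a regular expression stands in for the LEX rule and a small hand-written trie holds the per-word counts. The class and function names are hypothetical, and the embodiment itself generates its scanner with LEX rather than Python.

```python
import re
from typing import Dict, Iterable

# Stands in for the LEX rule that matches runs of consecutive English letters.
WORD = re.compile(r"[A-Za-z]+")


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count = 0  # total occurrences of the word ending at this node


class CountingTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, word: str) -> None:
        """Insert the word if needed and increment its count by 1."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.count += 1

    def count(self, word: str) -> int:
        """Return the stored count of the word, or 0 if it was never added."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                return 0
            node = node.children[ch]
        return node.count


def build_first_characteristic(reference_docs: Iterable[str]) -> CountingTrie:
    """Scan every reference document and count each English word."""
    trie = CountingTrie()
    for doc in reference_docs:
        for word in WORD.findall(doc):
            trie.add(word)
    return trie
```

A plain dictionary would give the same counts; the trie is used here only because the embodiment names that data structure.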

When the user submits input text:

(1) In the input step, the input device 101 obtains the input text 151 typed by the user on the keyboard and stores it in memory.

(2) In the feature generation step, the feature generating apparatus 102 accesses the input text 151 in memory and generates candidate features X = {x_1, x_2, ..., x_n} from it, where x_i is a candidate feature; these candidate features form an array. The feature generating apparatus 102 is generated by compiling a LEX file with LEX. The rule in this LEX file describes runs of consecutive English characters, so the feature generating apparatus 102 captures the words formed by such runs. The action attached to the rule puts each matched string into a string array, and the LEX file also specifies that, once the input text has been scanned, the string array is sorted, duplicates are removed, and the result is saved as a candidate feature array 154. Thus, after the feature generating apparatus 102 has scanned the input text 151, all English words contained in the input text 151 are saved in the candidate feature array 154.

(3) In the scoring step, the scoring apparatus 103 accesses each member of the candidate feature array 154 in turn. For the current member x_i, it looks up the first characteristic data to obtain the distribution characteristic data y_i of x_i, and computes the score of the candidate feature x_i with respect to retrieval effectiveness. The scoring function is

s_{\alpha}(x_{i}) = -\log_{2} F_{2}(A, x_{i}) = -\log_{2} y_{i},

where the function F_2(D, x) denotes the total number of occurrences of the candidate feature x in document collection D, that is, y_i, and A is the reference document collection. After the computation, the scores of all candidate features also form an array, whose members correspond one-to-one with the members of the candidate feature array 154.

(4) In the result generation step, the result generating apparatus 104 sorts the candidate features of the candidate feature array 154 according to their scores in the scoring array, and takes the sorted candidate features as the result features.

(5) In the output step, the output device 105 outputs the result features to the user as output result 153, in a form of expression that the user can process or understand.
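
Steps (2) to (4) of system 100 might be sketched in Python as follows. The function names are hypothetical, a `Counter` replaces the trie for brevity, candidates that never occur in the reference collection are dropped because their logarithm is undefined, and the descending sort order is an assumption, since the description does not state the direction.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple

WORD = re.compile(r"[A-Za-z]+")


def candidate_features(input_text: str) -> List[str]:
    """Feature generation: sorted, de-duplicated English words of the input."""
    return sorted(set(WORD.findall(input_text)))


def score_alpha(x: str, total_counts: Counter) -> Optional[float]:
    """s_alpha(x) = -log2 F_2(A, x); None when x never occurs in A."""
    y = total_counts.get(x, 0)
    return -math.log2(y) if y > 0 else None


def result_features(input_text: str, reference_docs: List[str]) -> List[Tuple[str, float]]:
    """Score every candidate and return (feature, score) pairs, best first."""
    total_counts = Counter(w for d in reference_docs for w in WORD.findall(d))
    scored = [(x, score_alpha(x, total_counts)) for x in candidate_features(input_text)]
    scored = [(x, s) for x, s in scored if s is not None]
    # Assumption: higher score (rarer word) first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Under this scoring a word that is rare in the reference collection receives a higher score than a common one, which matches the intent of favouring discriminative retrieval terms.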

Referring to Figure 2, another system for obtaining features that help retrieval according to an embodiment of the invention is shown. This system 200 differs from the aforementioned system 100 in several respects:

(1) There is no preparation step.

(2) In the feature generation step, the feature generating apparatus 202 produces the candidate feature array 154. After performing all the functions of the feature generating apparatus 102, the feature generating apparatus 202 also performs a candidate adjustment sub-step, in which the candidate adjusting apparatus 206 adjusts the candidate features: it accesses each candidate feature in the candidate feature array 154 in turn and checks whether the candidate feature is on a preset blacklist; if so, the candidate feature is removed from the array 154.

(3) In the scoring step, the scoring apparatus 203 accesses each member of the candidate feature array 154 in turn. For the current member x_i, it searches each document of the reference document collection 152 in turn and counts the total number z_i of documents containing x_i, which serves as the distribution characteristic data of x_i; it then computes the score of the candidate feature x_i. The scoring function is

s_{\beta}(x_{i}) = -\log_{2} F_{1}(A, x_{i}) = -\log_{2} z_{i},

where the function F_1(D, x) denotes the total number of documents in document collection D that contain the candidate feature x, that is, z_i, and A is the reference document collection. After the computation, the scores of all candidate features also form an array, whose members correspond one-to-one with the members of the candidate feature array 154.
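
The two points in which system 200 differs, the blacklist adjustment and the on-the-fly document count, might be sketched as follows. The function names and the example blacklist are hypothetical; the description only says that the blacklist is set in advance.

```python
import math
import re
from typing import List, Optional, Set

WORD = re.compile(r"[A-Za-z]+")


def adjust_candidates(candidates: List[str], blacklist: Set[str]) -> List[str]:
    """Candidate adjustment sub-step: drop every blacklisted candidate."""
    return [x for x in candidates if x not in blacklist]


def document_count(x: str, reference_docs: List[str]) -> int:
    """z_i: number of reference documents that contain x (no preparation step)."""
    return sum(1 for d in reference_docs if x in set(WORD.findall(d)))


def score_beta(x: str, reference_docs: List[str]) -> Optional[float]:
    """s_beta(x) = -log2 F_1(A, x); None when no document contains x."""
    z = document_count(x, reference_docs)
    return -math.log2(z) if z > 0 else None
```

Because the reference collection is rescanned for every candidate, this variant trades query-time cost for the absence of a preparation step.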

Referring to Figure 3, another system for obtaining features that help retrieval according to an embodiment of the invention is shown. This system 300 differs from the aforementioned system 100 in many respects:

In the preparation step, the first characteristic generating apparatus 106 scans each document in the reference document collection 152 in turn. For the document currently being processed, it scans the document and identifies each English word. Whenever an English word is identified, it checks whether the word is present in a first trie. If not, the word is added to the first trie and also to a second trie, and the document count of the word in the second trie is incremented by 1. Whenever a document has been scanned, the first trie is emptied. After all documents have been scanned, all words in the second trie and their counts are taken out and saved as the first characteristic data. Each first contrast feature in the first characteristic data is a word, and the distribution characteristic data of each first contrast feature is the number of documents containing that word (its document count). Similarly, the second characteristic generating apparatus 307 generates second contrast features from the second reference document collection 353, takes the total number of documents of the second reference document collection 353 in which each second contrast feature occurs as the distribution characteristic data of that second contrast feature about the second reference document collection 353, and saves them as the second characteristic data.
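
The two-trie counting scheme can be illustrated with the sketch below, in which a set plays the role of the first trie (words already seen in the current document) and a dictionary plays the role of the second trie (document counts). These substitutions and the function name are assumptions made to keep the example short.

```python
import re
from typing import Dict, Iterable

WORD = re.compile(r"[A-Za-z]+")


def build_document_counts(reference_docs: Iterable[str]) -> Dict[str, int]:
    """Return, for each word, the number of documents in which it occurs."""
    seen_in_doc = set()             # role of the first trie
    doc_count: Dict[str, int] = {}  # role of the second trie
    for doc in reference_docs:
        for word in WORD.findall(doc):
            if word not in seen_in_doc:
                # First time this word appears in the current document.
                seen_in_doc.add(word)
                doc_count[word] = doc_count.get(word, 0) + 1
        seen_in_doc.clear()         # empty the first trie after each document
    return doc_count
```

Clearing the per-document structure after each document is what ensures a word repeated within one document is counted only once.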

When the user submits input text, the differences are:

(1) In the feature generation step, the feature generating apparatus 302 accesses the input text 151 in memory, identifies the English words made up of consecutive English characters, and counts the occurrences of each distinct word. The feature generating apparatus 302 is generated by compiling a LEX file with LEX; a person skilled in the art can readily implement these functions by constructing a suitable LEX file, so the details are omitted here. When the feature generating apparatus 302 has finished scanning the input text 151, each distinct word occurring in the input text 151 is taken as a candidate feature, the number of occurrences of that word in the input text 151 is taken as the distribution characteristic data of the candidate feature about the input text 151, and both are stored in the array 354.

(2) In the scoring step, the scoring apparatus 303 accesses the array 354 and reads each candidate feature from it. For each candidate feature x_i, the scoring apparatus 303 reads from the array 354 the distribution characteristic data w_i of x_i about the input text 151, reads from the first characteristic data the distribution characteristic data y_i of x_i about the reference document collection 152, reads from the second characteristic data the distribution characteristic data z_i of x_i about the second reference document collection 353, and computes the scores of the candidate feature x_i. The scoring functions are

s_{\gamma}(x_{i}) = \log_{2} \frac{F_{1}(B, x_{i})}{F_{1}(A, x_{i})} = \log_{2} \frac{z_{i}}{y_{i}},

s_{\tau}(x_{i}) = f_{1}(x_{i}) = w_{i},

where the function f_1(x) denotes the number of occurrences of the candidate feature x in the input text, and the function F_1(D, x) denotes the number of documents in document collection D that contain the candidate feature x; A is the reference document collection and B is the second reference document collection. After the computation, the scores of all candidate features are appended to a scoring array 355; the members of this array correspond one-to-one with the candidate features of the candidate feature array 354, and each member holds the two scores s_γ(x_i) and s_τ(x_i) of the candidate feature x_i.

(3) In the result generation step, the result generating apparatus 304 takes the candidate feature array 354 and the scoring array 355, sorts the candidate features according to the score s_τ(x_i) of each candidate feature x_i in the scoring array, takes the candidate features as result features, and appends each result feature x_i together with its score s_γ(x_i) to the array 356 as one array element.

(4) In the output step, the output device 305 generates the following table from the array 356 and outputs it to the user as output result 153.

Result feature	Score
x_1	s_γ(x_1)
x_2	s_γ(x_2)
……	……
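
The scoring and result generation of system 300 might be sketched as follows. The three inputs are assumed to be plain dictionaries of counts (w_i, y_i and z_i respectively), the function name is hypothetical, candidates missing from either collection are skipped because the ratio is then undefined, and sorting by s_τ in descending order is an assumption.

```python
import math
from typing import Dict, List, Tuple


def score_and_rank(
    input_counts: Dict[str, int],   # w_i: occurrences in the input text
    first_counts: Dict[str, int],   # y_i: document count in collection A
    second_counts: Dict[str, int],  # z_i: document count in collection B
) -> List[Tuple[str, float]]:
    """Return (result feature, s_gamma) pairs ordered by s_tau = w_i."""
    rows = []
    for x, w in input_counts.items():
        y = first_counts.get(x, 0)
        z = second_counts.get(x, 0)
        if y == 0 or z == 0:
            continue  # score is meaningless when x is absent from A or B
        s_gamma = math.log2(z / y)
        s_tau = w
        rows.append((x, s_gamma, s_tau))
    rows.sort(key=lambda row: row[2], reverse=True)  # assumption: most frequent first
    return [(x, s_gamma) for x, s_gamma, _ in rows]
```

The returned list corresponds to the rows of the table above: s_τ decides the order, while s_γ is the score shown to the user.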

Embodiment two

A system for evaluating the value of things related to the input text submitted by a user.

Referring to Figure 4, a system for evaluating the value of things related to the input text submitted by a user according to an embodiment of the invention is shown. The operation of this system 400 comprises the following steps:

In the preparation step, the first characteristic generating apparatus 406 scans each document in the reference document collection 152 in turn. For the document currently being processed, it scans the document and identifies each English word. Whenever an English word is identified, it checks whether the word is present in a first trie. If not, the word is added to the first trie and also to a second trie, and the document count of the word in the second trie is incremented by 1. Whenever a document has been scanned, the first trie is emptied. After all documents have been scanned, all words in the second trie and their document counts are taken out and saved as the first characteristic data. Each first contrast feature in the first characteristic data is a word, and the distribution characteristic data of each first contrast feature is the number of documents containing that word (its document count). Similarly, the second characteristic generating apparatus 407 generates second contrast features from the second reference document collection 353, takes the total number of documents of the second reference document collection 353 in which each second contrast feature occurs as the distribution characteristic data of that second contrast feature about the second reference document collection 353, and saves them as the second characteristic data.

When the user submits input text:

(1) In the input step, the input device 101 obtains the input text 151 and stores it in memory.

(2) In the keyword generation step, the keyword generating apparatus 402 accesses the input text 151 in memory and generates candidate keywords X = {x_1, x_2, ..., x_n} from it, where x_i is a candidate keyword; these candidate keywords form an array. The keyword generating apparatus 402 is generated by compiling a LEX file with LEX. The rule in this LEX file describes runs of consecutive English characters, so the keyword generating apparatus 402 captures the words formed by such runs. The action attached to the rule puts each matched string into a string array, and the LEX file also specifies that, once the input text has been scanned, the string array is sorted, duplicates are removed, and the result is saved as a candidate keyword array 454. Thus, after the keyword generating apparatus 402 has scanned the input text 151, all English words contained in the input text 151 are saved in the candidate keyword array 454.

(3) In the scoring step, the scoring apparatus 403 accesses the array 454 and reads each candidate keyword from it. For each candidate keyword x_i, the scoring apparatus 403 reads from the array 454 the distribution characteristic data w_i of x_i about the input text 151, reads from the first characteristic data the distribution characteristic data y_i of x_i about the reference document collection 152, reads from the second characteristic data the distribution characteristic data z_i of x_i about the second reference document collection 353, and computes the scores of the candidate keyword x_i. The scoring functions are:

s_{\gamma}(x_{i}) = \log_{2} \frac{F_{1}(B, x_{i})}{F_{1}(A, x_{i})} = \log_{2} \frac{z_{i}}{y_{i}},

s_{\tau}(x_{i}) = f_{1}(x_{i}) = w_{i},

where the function f_1(x) denotes the number of occurrences of the candidate keyword x in the input text, and the function F_1(D, x) denotes the number of documents in document collection D that contain the candidate keyword x. After the computation, the scores of all candidate keywords are appended to a scoring array 355; the members of this array correspond one-to-one with the candidate keywords of the candidate keyword array 454, and each member of the array 355 holds the two scores s_γ(x_i) and s_τ(x_i) of the candidate keyword.

(4) In the result generation step, the result generating apparatus 404 takes the candidate keyword array 454 and the scoring array, sorts the candidate keywords according to the score s_τ(x_i) of each candidate keyword x_i in the scoring array, takes the candidate keywords as result keywords, and appends each result keyword x_i together with its score s_γ(x_i) to the array 456 as one unit.

(5) In the output step, the output device 405 outputs the input text in HTML form, in which each result keyword of the array 456 is marked with a font color chosen according to the magnitude of its score.
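
One possible rendering of the output step is sketched below. The three color grades, the `low` and `high` boundaries, the use of `<span>` elements, and the function names are all assumptions of this sketch; the description only requires that keywords be colored according to the magnitude of their score.

```python
import html
import re
from typing import Dict


def color_for(score: float, low: float, high: float) -> str:
    """Map a score to one of three hypothetical color grades."""
    if score >= high:
        return "red"
    if score >= low:
        return "orange"
    return "gray"


def to_html(input_text: str, scores: Dict[str, float],
            low: float = 0.0, high: float = 1.0) -> str:
    """Return the input text as HTML with each scored keyword colored."""
    out = []
    # Splitting with a capturing group keeps the words and the text between them.
    for part in re.split(r"([A-Za-z]+)", input_text):
        if part in scores:
            color = color_for(scores[part], low, high)
            out.append('<span style="color:%s">%s</span>' % (color, part))
        else:
            out.append(html.escape(part))  # escape everything that is not a keyword
    return "".join(out)
```

For example, `to_html("SQL & Java", {"SQL": 2.0, "Java": -1.0})` colors "SQL" red and "Java" gray while escaping the ampersand between them.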

This system can be used to solve a number of specific problems, including but not limited to:

(1) Evaluating the value of the various skills related to a submitted resume (see Embodiment three for details): the input text is a resume; the related things are the skills the resume involves; the value is the career advantage that a skill carries because it is valued by enterprises and/or possessed by job seekers. The reference document collection is an enterprise job notice library. Further, the second reference document collection is a library of the resumes of many job seekers. The distribution characteristics of skill keywords in resumes and job notices reflect the supply-demand relationship of these skills in job hunting.

(2) Evaluating the novelty of the various academic topics related to a submitted research paper: the input text is a paper; the related things are the academic topics the paper involves; the value is the novelty that an academic topic carries because it is of interest to journals and conferences and/or discussed by other papers. The reference document collection consists of the calls for papers of journals and conferences. Further, the second reference document collection is a paper library. The distribution characteristics of academic topic keywords in papers and calls for papers reflect the supply-demand relationship of these topics in paper publication.

(3) Evaluating the popularity of the various product features related to a submitted product introduction: the input text is a product introduction; the related things are the product features the product introduction involves; the value is the popularity that a product feature carries because it is of interest in customer reviews and/or possessed by other products. The reference document collection consists of the reviews that customers have published about various products. Further, the second reference document collection is a library of product introductions for many products. The distribution characteristics of product feature keywords in product introductions and customer reviews reflect the supply-demand relationship of these product features in customer experience.

(4) In an online community system, evaluating the degree of individuality of the various hobbies related to a submitted personal profile of a community member: the related things are the hobbies the personal profile involves; the value is the degree of individuality that a hobby carries because community members have it and expect others to have it. The reference document collection is a personal profile library. A hobby mentioned in a personal profile indicates not only that the corresponding community member has the hobby, but potentially also that the member expects others to have it; it therefore reflects supply and demand at the same time.

Embodiment three

A system for assessing a person's career advantages according to his or her resume.

Referring to Figure 5, a system for assessing a person's career advantages according to his or her resume according to an embodiment of the invention is shown. This system 500 is based on the aforementioned system 400; more specifically:

(1) the input text is a resume text 551;

(2) the reference document collection is a resume library 552, which stores the resume texts of a number of people;

(3) the second reference document collection is a job notice library 553.

The first characteristic data stores the keywords that occur in the resume library and, for each keyword, the number of resumes in which it occurs; the second characteristic data stores the keywords that occur in the job notice library and, for each keyword, the number of job notices in which it occurs. The array 454 stores the keywords that occur in the resume text and their numbers of occurrences. Because the scoring of the aforementioned system 300 is adopted, if a keyword in a person's resume text occurs in few resumes of the resume library but in many job notices, the score of that keyword is high. This means that the skill or experience the keyword represents is mastered by few applicants yet sought by many enterprises, so this skill or experience gives the person a larger career advantage. With system 500, the score of each keyword contained in the resume text can thus be obtained, reflecting the career advantage of the skill or experience corresponding to the keyword. System 500 can therefore assess a person's career advantages from his or her resume text.
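
As a worked illustration with hypothetical counts (the numbers are invented for the example and do not come from any real library), take A to be the resume library and B the job notice library, as in this embodiment:

```latex
% Hypothetical case 1: keyword x occurs in y = 50 resumes and z = 400 job notices.
s_{\gamma}(x) = \log_{2} \frac{F_{1}(B, x)}{F_{1}(A, x)} = \log_{2} \frac{400}{50} = \log_{2} 8 = 3

% Hypothetical case 2: the counts are reversed, y = 400 resumes and z = 50 job notices.
s_{\gamma}(x) = \log_{2} \frac{50}{400} = \log_{2} \frac{1}{8} = -3
```

A positive score indicates a skill that is requested more often than it is offered, and a negative score the opposite; a score of 0 corresponds to equal document counts in the two libraries.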

Embodiment four

Referring to Figure 6, a document retrieval system that takes a text as the query input is shown. The operation of this system 600 comprises the following steps:

(1) a query input step, in which the input text 151 is obtained;

(2) a feature obtaining step, in which system 300 produces the output result 357 from the input text 151;

(3) a retrieval step, in which a query that the retrieval system 602 can understand is constructed from the result features in the output result 357 and/or their scores, the query is submitted to the system 602, and the retrieval result 657 of the system 602 is obtained;

(4) a retrieval output step, in which the retrieval result 657 is output.

The retrieval system 602 scores each document it can access: it identifies the result features of the output result 357 that the document contains, obtains the scores of these result features from the output result 357, and takes the sum of these scores as the score of the document. The retrieval system 602 then sorts the retrievable documents in descending order of document score and outputs them page by page as the retrieval result 657. The score of a document reflects the similarity between the input text and the document.
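
The document scoring described for the retrieval system 602 might be sketched as follows. The function names, the in-memory list of documents, and the `page` and `page_size` parameters are assumptions of this sketch; a production retrieval system would normally use an inverted index rather than scanning every document.

```python
import re
from typing import Dict, List

WORD = re.compile(r"[A-Za-z]+")


def score_document(doc: str, feature_scores: Dict[str, float]) -> float:
    """Sum the scores of the result features that the document contains."""
    present = set(WORD.findall(doc))
    return sum(s for x, s in feature_scores.items() if x in present)


def retrieve(docs: List[str], feature_scores: Dict[str, float],
             page: int = 0, page_size: int = 10) -> List[str]:
    """Sort documents by descending score and return one page of results."""
    ranked = sorted(docs, key=lambda d: score_document(d, feature_scores), reverse=True)
    start = page * page_size
    return ranked[start:start + page_size]
```

Each feature contributes at most once per document here, regardless of how often it occurs, which follows the description's wording that the document score is the sum of the scores of the result features it contains.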

Such a retrieval system, which receives an output result containing result features and produces a retrieval result, can be implemented with the prior art. For example, some retrieval systems (such as Google) accept a set of several retrieval features as the query input and return retrieval results; some retrieval systems (such as the patent search system of the USPTO) accept as the query input a query expression made up of several retrieval features and logical operators such as AND and OR, and return retrieval results; and U.S. patent application US20060122997, "System and method for text searching using weighted keywords", discloses a system that can retrieve documents according to keywords and their weights.

As one implementation of the invention, when this system is used to retrieve job notices, the user submits a resume; the system obtains keywords from the resume, determines the scores of the keywords according to their distribution characteristics in the job notice library, and then uses a retrieval system to obtain the relevant job notices from the job notice library according to these keywords and returns them. The user thus obtains the job notices relevant to the resume simply by submitting the resume. Further, the job notices are sorted in descending order of similarity to the resume. Compared with the traditional method of searching job notices by keyword, the convenience of the invention is evident.

As an easily conceived variation, this system can also be used to retrieve resumes: after the user submits a job notice, the system returns the resumes relevant to that job notice. An enterprise user of this system can thus quickly narrow down a large number of resumes and find job seekers who fit the needs of the position.

In summary, with the above system and method for obtaining features that help text retrieval, users can find retrieval features that are meaningful for their own search needs. When facing a massive number of documents, users can construct effective queries from these retrieval features and thereby quickly narrow the retrieval range while avoiding the loss of potentially valuable retrieval results. The system and method are simple and convenient, stable and reliable in performance, and widely applicable. Moreover, combining this system and method with an existing retrieval system yields a more convenient and easy-to-use retrieval system: the user only needs to enter descriptive text to retrieve relevant material, which avoids the loss of retrieval effectiveness caused by poorly chosen keywords.

In this specification, the invention has been described with reference to specific embodiments. However, it is evident that various modifications and changes can still be made without departing from the spirit and scope of the invention. The specification and drawings are therefore to be regarded as illustrative rather than restrictive.

Claims

1. A system for obtaining features that help text retrieval, characterized in that the system comprises:

an input device, for receiving the input text submitted by the user;

2. The system for obtaining features that help text retrieval according to claim 1, characterized in that the output device also outputs the score of each result feature, the score of a result feature being the score that the scoring apparatus gave to the candidate feature in the candidate feature data that is identical to that result feature.

3. The system for obtaining features that help text retrieval according to claim 1, characterized in that the feature generating apparatus also operates at least one candidate adjusting apparatus, for deleting at least one candidate feature from, and/or adding at least one candidate feature to, the original candidate features.

4. The system for obtaining features that help text retrieval according to claim 1, characterized in that the computation of the score also depends on the distribution characteristics of the candidate features in the input text, and the feature generating apparatus also generates data on the distribution characteristics of each candidate feature about the input text.

5. The system for obtaining features that help text retrieval according to claim 1, characterized in that the computation of the score also depends on the distribution characteristics of the candidate features in at least one second reference document collection.

6. A method for obtaining features that help document retrieval based on the system according to claim 1, characterized in that the method comprises the following steps:

(1) an input step, receiving the input text submitted by the user;

7. The method for obtaining features that help document retrieval according to claim 6, characterized in that the output step further comprises the following steps:

8. The method for obtaining features that help document retrieval according to claim 6, characterized in that the feature generation step further comprises the following steps:

9. The method for obtaining features that help document retrieval according to claim 6, characterized in that the computation of the score also depends on the distribution characteristics of the candidate features in the input text, and the feature generation step comprises the following steps:

10. The method for obtaining features that help document retrieval according to claim 6, characterized in that the computation of the score also depends on the distribution characteristics of the candidate features in at least one second reference document collection.

11. A system for evaluating the value of things related to the input text submitted by a user, characterized in that the system comprises:

an input device, for receiving the input text submitted by the user;

12, the system of the value of the things that evaluation according to claim 11 is relevant with the input text of user's submission, it is characterized in that, described keyword generating apparatus is also operated at least one candidate's adjusting gear, in order to adjust described candidate feature, promptly from original candidate feature, delete some candidate feature and/or add some features as new candidate feature.

13, the system of the value of the things that evaluation according to claim 11 is relevant with the input text of user's submission, it is characterized in that, described output unit has also been exported the scoring of each described feature as a result, and the scoring of a described feature as a result is the scoring of a described candidate feature being equal to mutually with described feature as a result in the candidate feature data.

14, the system of the value of the things that evaluation according to claim 11 is relevant with the input text of user's submission, it is characterized in that, described input text contains the description of first party demand, describedly contains the description of supplying with the corresponding second party of first party demand with reference to collection of document; Perhaps described input text contains the description that first party is supplied with, and describedly contains the description of supplying with corresponding second party demand with first party with reference to collection of document; Perhaps described input text contains the description of first party supply or demand, and described containing with first party demand or supply with reference to collection of document belongs to the second party demand of same type or the description of supply.

15. The system for evaluating the value of things related to an input text submitted by a user according to claim 11, characterized in that the computation of said score also depends on the distribution characteristics of said candidate features in at least one second reference document collection.

16. The system for evaluating the value of things related to an input text submitted by a user according to claim 15, characterized in that said second reference document collection contains descriptions of third-party demand or supply of the same type as the first party's demand or supply.

17. The system for evaluating the value of things related to an input text submitted by a user according to claim 15, characterized in that said input text is added to the second reference document collection.

18, the method for the value of the things that a kind of input text that realization is estimated and the user submits to based on the described system of claim 11 is relevant is characterized in that described method may further comprise the steps:

(1) input step receives the input text that the user submits to;

19, the method for the value of the things that realization evaluation according to claim 18 is relevant with the input text of user's submission is characterized in that described keyword generates in the step further comprising the steps of:

20, the method for the value of the things that realization evaluation according to claim 18 is relevant with the input text of user's submission is characterized in that, and is further comprising the steps of in the described output step:

21, the method for the value of the things that realization evaluation according to claim 18 is relevant with the input text of user's submission, it is characterized in that, the computation process of the scoring of described candidate feature, partly depend at least described candidate feature described with reference to the distribution character in the collection of document.

22, the method for the value of the things that realization evaluation according to claim 18 is relevant with the input text of user's submission, it is characterized in that, the computation process of described scoring also partly depends on the distribution character of described candidate feature in described input text at least, and/or partly depend at least described candidate feature at least one second with reference to the distribution character in the collection of document.

23. A system, implemented on the basis of the system of claim 11, for assessing a person's career advantages according to the person's resume, characterized in that said input text and said reference document collection are in one of the following configurations:

Input text | Reference document collection
Resume text | Job notice library
Resume text | Resume library.

24. The system for assessing a person's career advantages according to the person's resume according to claim 23, characterized in that the computation of said score also depends on the distribution characteristics of said candidate features in at least one second reference document collection, and said input text, said reference document collection and said second reference document collection are in one of the following configurations:

Input text | Reference document collection | Second reference document collection
Resume text | Job notice library | Resume library
Resume text | Resume library | Job notice library.
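The configurations listed in claims 23 and 24 can be written down as plain data, which makes it easy to check whether a given setup is one of the listed combinations. This is a sketch only; the `Configuration` type, the list names, and `is_listed` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional


class Configuration(NamedTuple):
    input_text: str
    reference: str
    second_reference: Optional[str] = None  # only used in claim 24


# Rows of the claim 23 table (no second reference collection).
CLAIM_23_CONFIGS = [
    Configuration("resume text", "job notice library"),
    Configuration("resume text", "resume library"),
]

# Rows of the claim 24 table (with a second reference collection).
CLAIM_24_CONFIGS = [
    Configuration("resume text", "job notice library", "resume library"),
    Configuration("resume text", "resume library", "job notice library"),
]


def is_listed(config):
    """True if the configuration matches a row of either table."""
    return config in CLAIM_23_CONFIGS or config in CLAIM_24_CONFIGS


print(is_listed(Configuration("resume text", "resume library", "job notice library")))
```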

25. A method, implemented on the basis of the method of claim 18, for assessing a person's career advantages according to the person's resume, characterized in that said input text and said reference document collection are in one of the following configurations:

。

26, the method for its professional advantage is assessed in realization according to claim 25 according to personnel's resume, it is characterized in that, the computation process of described scoring also depend on described candidate feature at least one second with reference to the distribution character in the collection of document, described input text, be one of following configuration with reference to collection of document with reference to collection of document and second:

。

27. A system for document retrieval using a text as the query input, characterized in that said system comprises:

an input device that receives the input text submitted by the user;

the system for obtaining features that facilitate text retrieval according to claim 1, which obtains, according to said input text, an output result containing result features;

a retrieval output device that outputs said retrieval result.

28. A method for document retrieval using a text as the query input, characterized in that said method comprises the following steps:

(1) an input step of receiving the input text submitted by the user;

(2) a feature obtaining step of obtaining result features using the method for obtaining features that facilitate document retrieval according to claim 6;

(4) a retrieval output step of outputting said retrieval result.
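The overall flow of retrieval with a text as the query input might look like the sketch below. Two parts are assumptions and are not taken from the claims: `obtain_result_features` is a crude stand-in (most frequent tokens) for the feature-obtaining method of claim 6, and the retrieval step between feature obtaining and output is not reproduced in the text above, so `retrieve` simply ranks documents by how many result features they contain.

```python
from collections import Counter


def obtain_result_features(input_text, top_k=3):
    """Stand-in for step (2): take the most frequent tokens as result features."""
    tokens = input_text.lower().split()
    return [token for token, _ in Counter(tokens).most_common(top_k)]


def retrieve(features, documents):
    """Assumed retrieval step: rank documents by the number of matching features."""
    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        hits = sum(1 for feature in features if feature in words)
        if hits:
            scored.append((hits, doc_id))
    # More hits first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def search_by_text(input_text, documents, top_k=3):
    features = obtain_result_features(input_text, top_k)  # step (2)
    return retrieve(features, documents)  # assumed retrieval, then output


documents = {
    "d1": "senior python developer",
    "d2": "sql analyst",
    "d3": "pastry chef",
}
results = search_by_text("python developer python sql developer python", documents)
print(results)
```

Documents that match none of the result features are dropped, which is one simple way a text query could narrow the retrieval range.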