CN103699625B - Method and device for retrieving based on keyword - Google Patents
- Publication number
- CN103699625B CN201310710834.7A
- Authority
- CN
- China
- Prior art keywords
- document
- theme
- keyword
- vector
- weight
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and a device for retrieving based on keywords. The method comprises the steps of: determining candidate keywords in a retrieval request based on predicted weights of basic keywords in a document library, wherein the predicted weight of a basic keyword is determined based on structure information of the basic keyword in the documents of the document library; determining other extended keywords based on the theme to which the candidate keywords belong in the document library; and retrieving in the document library based on the candidate keywords and the extended keywords. The technical scheme provided by the invention can improve the accuracy rate and recall rate of the retrieval result and better satisfy user demands.
Description
Technical field
The embodiments of the present invention relate to the technical field of data retrieval, and in particular to a method and device for retrieving based on keywords.
Background
At present, retrieval systems generally search a document library for relevant information according to a retrieval request entered by a user and a certain strategy, thereby providing the user with a document retrieval service. For example, when the retrieval system is the Kingsoft example-sentence retrieval service system, after receiving a query statement entered by the user, the system performs keyword matching against each document in the document library according to the query statement, and then provides the user with the excellent example sentences or model essays recorded in the documents.
In the prior art, after receiving a retrieval request, a retrieval system first performs word segmentation on the search statement contained in the request, then uses the resulting segmented words as keywords to perform literal retrieval in the document library, and finally merges the retrieval results and returns them to the user.
The prior art has the following defects:
On the one hand, the accuracy of the retrieval results is low and the results deviate considerably from the user's intent. For example, when the search statement entered by the user is "sentences describing a snowing scene", an existing retrieval system ranks the documents containing segmented words such as "snowing", "scene" and "description" according to how many times these words occur in the documents, so the documents containing "snowing", the segmented word that reflects the user's real need, often fail to occupy the leading positions.
On the other hand, other documents that reflect the user's needs cannot be extracted comprehensively, so recall is low. For example, when the search statement entered by the user is "spring", an existing retrieval system can find only documents containing "spring". Some example sentences describe spring scenery and often meet the user's needs better, yet the existing technology cannot find such text, which semantically describes spring but does not literally contain "spring".
Summary of the invention
The embodiments of the present invention provide a method and device for retrieving based on keywords, so as to improve the accuracy and recall of retrieval results and better meet user needs.
In a first aspect, an embodiment of the present invention provides a method for retrieving based on keywords, the method comprising:
determining candidate keywords in a retrieval request according to prediction weights of basic keywords in a document library, wherein the prediction weight of a basic keyword is determined according to structural information of the basic keyword in the documents of the document library;
determining other expanded keywords according to the theme to which the candidate keywords belong in the document library;
retrieving in the document library according to the candidate keywords and the expanded keywords.
In a second aspect, an embodiment of the present invention further provides a device for retrieving based on keywords, the device comprising:
a candidate keyword determining module, configured to determine candidate keywords in a retrieval request according to prediction weights of basic keywords in a document library, wherein the prediction weight of a basic keyword is determined according to structural information of the basic keyword in the documents of the document library;
an expanded keyword determining module, configured to determine other expanded keywords according to the theme to which the candidate keywords belong in the document library;
a retrieval module, configured to retrieve in the document library according to the candidate keywords and the expanded keywords.
In the technical solution proposed by the embodiments of the present invention, the prediction weights of the basic keywords are obtained from the structural information of the basic keywords in the document library, and the candidate keywords in a retrieval request are determined according to these prediction weights. The segmented words in the retrieval request can thus be treated differently, and the candidate keywords that express the user's intent can be extracted, so the accuracy of the retrieval results is higher. Other expanded keywords are determined according to the theme to which the candidate keywords belong in the document library, and retrieval is performed in the document library according to the candidate keywords and the expanded keywords. The retrieval request is thereby handled at the semantic level, the documents that reflect the user's needs can be extracted accurately and comprehensively, and recall is higher.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a method for retrieving based on keywords provided by Embodiment one of the present invention;
Fig. 2 is a schematic flowchart of a method for retrieving based on keywords provided by Embodiment two of the present invention;
Fig. 3 is a schematic flowchart of a method for retrieving based on keywords provided by Embodiment three of the present invention;
Fig. 4 is a schematic structural diagram of a device for retrieving based on keywords provided by Embodiment four of the present invention;
Fig. 5 is a schematic structural diagram of a device for retrieving based on keywords provided by Embodiment five of the present invention;
Fig. 6 is a schematic structural diagram of a device for retrieving based on keywords provided by Embodiment six of the present invention;
Fig. 7 is a schematic diagram of determining the candidate keywords in a retrieval request, provided by Embodiment seven of the present invention;
Fig. 8 is a schematic diagram of determining expanded keywords and retrieving, provided by Embodiment seven of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment one
Fig. 1 is a schematic flowchart of a method for retrieving based on keywords provided by Embodiment one of the present invention. This embodiment is applicable to the situation where, after a retrieval request entered by a user is received, relevant information is retrieved according to the request so as to provide a service to the user. The method can be performed by a device with a retrieval function, and specifically includes:
101. Determine the candidate keywords in the retrieval request according to the prediction weights of the basic keywords in the document library.
The retrieval device can determine each basic keyword in the document library in advance and compute, with a preset algorithm, a prediction weight corresponding to each basic keyword. The documents in the document library may be stored locally by the retrieval device, or may be obtained from relevant website servers over the Internet. The prediction weight of a basic keyword in the document library is determined according to the structural information of that keyword in the documents of the document library. The structural information of a basic keyword in a document may include the position of the basic keyword in the document, the part of speech of the basic keyword, the part of speech of the preceding word and/or the part of speech of the following word. For example, when the retrieval device performs model-essay retrieval and users mostly search for modifiers rather than verbs, if a word in a document is a noun and the preceding word is a verb, then the word is more likely to be a basic keyword and is given a relatively large prediction weight.
After receiving a retrieval request that contains the search statement entered by the user, the retrieval device can perform word segmentation on the search statement, analyze each segmented word according to the precomputed prediction weights of the basic keywords in the document library, and take the segmented words that meet a set condition as the candidate keywords in the retrieval request. Specifically, after the search statement has been segmented, the device can look up, in the basic keyword set of the document library, the basic keyword that matches a segmented word together with its prediction weight; if the prediction weight reaches a set threshold, the segmented word is taken as a candidate keyword.
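The lookup-and-threshold step described above can be sketched as follows. This is only an illustrative sketch: the function name `select_candidates`, the threshold value 0.5 and the example weights are assumptions made for illustration and do not come from the patent.

```python
def select_candidates(segmented_words, prediction_weights, threshold=0.5):
    """Keep the segmented words whose basic-keyword prediction weight reaches the threshold."""
    candidates = []
    for word in segmented_words:
        # None means the word is not in the basic keyword set of the document library
        weight = prediction_weights.get(word)
        if weight is not None and weight >= threshold:
            candidates.append(word)
    return candidates


# Hypothetical precomputed prediction weights (illustrative values only)
prediction_weights = {"spring": 0.9, "vigorous": 0.8, "scene": 0.3, "sentence": 0.2}
query_words = ["describing", "spring", "vigorous", "scene", "sentence"]

candidates = select_candidates(query_words, prediction_weights)
print(candidates)
```

With these assumed weights, only the modifiers pass the threshold, while "scene" and "sentence" are filtered out and "describing" is ignored because it is not a basic keyword.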
102. Determine other expanded keywords according to the theme to which the candidate keywords belong in the document library.
103. Retrieve in the document library according to the candidate keywords and the expanded keywords.
After obtaining the candidate keywords in the retrieval request, the retrieval device could retrieve directly in the document library. Preferably, however, it further determines other expanded keywords according to the theme to which the candidate keywords belong in the document library, and then retrieves according to all or some of the expanded keywords together with the candidate keywords. Specifically, other keywords in the documents that have a high distribution probability on the same theme as a candidate keyword can be used as expanded keywords, because these keywords appear in similar contexts in the documents and have semantically similar characteristics when describing a scene.
In this embodiment, on the one hand, analyzing each segmented word in the retrieval request to extract the candidate keywords that express the user's intent improves the accuracy of the retrieval results. For example, when the search statement in the retrieval request is "sentences describing a vigorous spring scene", an existing technique such as a candidate keyword extraction strategy based on part of speech and word IDF (Inverse Document Frequency) weight information would yield the keywords "spring", "vigorous", "scene" and "sentence". Retrieving with these keywords usually returns many documents that describe "scene", "sentence" and the like and do not match the user's expectation. Although "scene" and "sentence" are the main noun phrases of the search statement, in description-type retrieval it is the modifiers that express more of the user's intent, and treating all extracted keywords equally leads to confused results. In the present invention, by contrast, modifiers can be given larger prediction weights through the structural information of keywords in the documents of the document library, and the candidate keywords "spring" and "vigorous", which express the user's intent in the retrieval request, are then determined according to the prediction weights, which improves the accuracy of the retrieval results.
In this embodiment, on the other hand, other expanded keywords are determined according to the theme to which the candidate keywords belong in the document library, and retrieval is performed in the document library according to all candidate keywords and expanded keywords. The retrieval request is thereby handled at the semantic level, the documents that reflect the user's needs can be extracted accurately and comprehensively, and the recall of the retrieval results is improved. For example, when the candidate keyword obtained from the retrieval request is "spring", the existing technology can find only documents containing "spring", or can apply synonym expansion to "spring" and find some other documents containing synonyms such as "spring tide". In the embodiments of the present invention, documents containing expanded keywords such as "greenweed" and "spring breeze" can be obtained according to the theme to which "spring" belongs in the document library, so the recall of the retrieval results is improved.
Embodiment two
Fig. 2 is a schematic flowchart of a method for retrieving based on keywords provided by Embodiment two of the present invention. On the basis of Embodiment one, this embodiment adds the operations of obtaining the prediction weights of the basic keywords in the documents with a graph-based unsupervised keyword extraction method and a preset binary classification algorithm, so that the keywords in the document library are analyzed in a semi-supervised manner and the candidate keywords that express the user's intent in the retrieval request are determined effectively and quickly. Referring to Fig. 2, the method includes:
201. Using a graph-based unsupervised keyword extraction method, perform keyword extraction on the documents in the document library to obtain the basic keyword set of the document library, and generate the statistical weight of each basic keyword in the set and its structural information in the documents;
202. Using a preset binary classification algorithm, obtain the prediction weight of each basic keyword according to the obtained basic keywords and their structural information in the documents;
203. Determine the candidate keywords in the retrieval request according to the prediction weights of the basic keywords in the document library;
204. Determine other expanded keywords according to the theme to which the candidate keywords belong in the document library;
205. Retrieve in the document library according to the candidate keywords and the expanded keywords.
In this embodiment, a graph-based unsupervised keyword extraction method is first used to extract basic keywords from some or all of the documents in the document library. The extracted basic keywords are then used to build the statistical weight information of the basic keywords, and the structural information of the extracted basic keywords is analyzed at the same time. The statistical weight is the ratio of the number of documents in the document library that contain the basic keyword to the total number of documents. Then, according to the extracted basic keywords and features such as their corresponding structural information, the prediction weights of the basic keywords are obtained with a preset binary classification method. The preset binary classification method is, for example, a binary classification method based on a support vector machine model, a binary classification method based on maximum entropy, or a binary classification method based on a logistic regression model. For example, using features such as the extracted structural information, the basic keywords extracted from the document library are labeled as positive examples and non-basic keywords as negative examples, a support vector machine model is trained, and the prediction weight of each basic keyword is then obtained. Here, the embodiment of the present invention modifies the prediction mode of the support vector machine, changing the original 0/1 output into prediction weights of belonging to the two classes.
In this embodiment, the basic keywords in the document library are analyzed in a semi-supervised manner, so that the candidate keywords that express the user's intent in a retrieval request are determined effectively and quickly. This overcomes both the drawback that a purely unsupervised approach cannot combine multiple kinds of information to extract candidate keywords, and the problem that a purely supervised approach to extracting candidate keywords is time-consuming and laborious.
It should be noted here that the operations that generate the statistical weights and prediction weights of the basic keywords of the document library have no strict sequential relationship with the retrieval operations performed for a retrieval request. When a retrieval request is received and handled for the first time, operations 201 and 202 must be performed once before operations 203-205. As time goes on, when a new retrieval request is received, operations 201 and 202 may be repeated, or they may be performed again when the degree of document updates in the document library is detected to have reached a set threshold.
On the basis of the above technical solution, determining the candidate keywords in the retrieval request according to the prediction weights of the basic keywords in the document library is preferably performed as follows:
search the basic keyword set for the basic keywords that match the segmented words in the retrieval request, and obtain the prediction weight and statistical weight of each matched basic keyword;
weight the statistical weight and the prediction weight of each matched basic keyword to generate a new weight for that basic keyword;
take the basic keywords whose new weight meets a set condition as the candidate keywords.
For example, the segmented words in the retrieval request that correspond to basic keywords with relatively high prediction weights and statistical weights in the document library can be taken as the candidate keywords. In this preferred technical solution, both the prediction weight and the statistical weight of each basic keyword in the document library are considered when determining the candidate keywords, which can further improve the accuracy of the retrieval results.
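The preferred weighting scheme can be sketched as a linear combination followed by a threshold test. The patent does not specify how the two weights are combined, so the mixing factor `alpha`, the threshold and the example values below are assumptions made purely for illustration.

```python
def combine_weights(prediction, statistical, alpha=0.7):
    """Assumed linear weighting of the prediction weight and the statistical weight."""
    return alpha * prediction + (1 - alpha) * statistical


def candidates_by_new_weight(words, pred_w, stat_w, alpha=0.7, threshold=0.5):
    """Return (word, new_weight) pairs for matched basic keywords that meet the threshold."""
    result = []
    for word in words:
        if word in pred_w and word in stat_w:  # the word matches a basic keyword
            new_weight = combine_weights(pred_w[word], stat_w[word], alpha)
            if new_weight >= threshold:
                result.append((word, new_weight))
    return result


# Hypothetical weights (illustrative values only)
pred_w = {"spring": 0.9, "scene": 0.3}
stat_w = {"spring": 0.4, "scene": 0.6}

selected = candidates_by_new_weight(["spring", "scene", "of"], pred_w, stat_w)
print([word for word, _ in selected])
```

Here "scene" has a high statistical weight but a low prediction weight, so its combined weight stays below the assumed threshold and it is not selected.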
Embodiment three
Fig. 3 is a schematic flowchart of a method for retrieving based on keywords provided by Embodiment three of the present invention. On the basis of the above embodiments, this embodiment adds the technical feature of obtaining the theme vectors of documents and of segmented words. Under this feature, it further optimizes the technical features of determining other expanded keywords according to the theme to which the candidate keywords belong in the document library, and of retrieving in the document library according to the candidate keywords and the expanded keywords. Referring to Fig. 3, the method includes:
301. Determine the candidate keywords in the retrieval request according to the prediction weights of the basic keywords in the document library;
302. Perform word segmentation on the documents in the document library, and generate a matrix composed of the weights of the segmented words of the document library in the documents;
303. Train on the documents in the document library with a topic model, decomposing the matrix into the product of a first matrix composed of the theme vectors of the segmented words and a second matrix composed of the theme vectors of the documents;
304. Query the theme vector of each candidate keyword from the first matrix, and determine from the queried theme vector the top M themes on which the candidate keyword has the largest distribution weights;
305. Query the segmented-word vectors of the M themes from the first matrix, and determine from the queried segmented-word vectors the top N segmented words with the largest distribution weights in each of the M themes, as the expanded keywords of the corresponding theme;
306. Determine a new theme vector according to the theme vectors of the candidate keywords and the expanded keywords in the first matrix;
307. Determine the target documents in the document library according to the new theme vector and the theme vectors of the documents.
Here, the theme vector of a segmented word is composed of the weights of that segmented word in each theme, and the theme vector of a document is composed of the weights of each theme in that document; M and N are natural numbers.
In this embodiment, after the candidate keywords in the retrieval request have been determined according to the prediction weights of the basic keywords in the document library: first, the themes with larger weights to which the candidate keywords belong are found from the theme vectors of the documents obtained by topic-model training; then, from the theme vectors of the segmented words obtained by topic-model training, the segmented words with larger weights under those themes are found; and these segmented words are then used as the expanded keywords. Unlike synonym-based expansion, the expanded keywords obtained with the expansion method of the embodiments of the present invention semantically match the user's search intent as reflected in the search statement of the retrieval request. Retrieving with these expanded keywords together with the candidate keywords can greatly improve the recall of documents and meet the user's needs.
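Operations 304 and 305 can be sketched as two ranking steps over the first matrix. In this sketch the first matrix is represented as a hypothetical dictionary `word_theme` that maps each segmented word to its theme vector; the values are invented for illustration, and excluding the candidate keyword itself from its own expansion is an assumption based on the phrase "other expanded keywords".

```python
def top_themes(word_theme, candidate, m):
    """Indices of the M themes on which the candidate keyword has the largest weights."""
    vector = word_theme[candidate]
    return sorted(range(len(vector)), key=lambda t: vector[t], reverse=True)[:m]


def expand(word_theme, candidate, m, n):
    """For each of the top M themes, pick the N other words with the largest weight in it."""
    expanded = {}
    for theme in top_themes(word_theme, candidate, m):
        others = [word for word in word_theme if word != candidate]
        ranked = sorted(others, key=lambda w: word_theme[w][theme], reverse=True)
        expanded[theme] = ranked[:n]
    return expanded


# Hypothetical first matrix: one theme vector (3 themes) per segmented word
word_theme = {
    "spring": [0.8, 0.1, 0.1],
    "greenweed": [0.7, 0.2, 0.1],
    "spring breeze": [0.6, 0.1, 0.3],
    "snowing": [0.1, 0.8, 0.1],
    "scene": [0.3, 0.3, 0.4],
}

expanded = expand(word_theme, "spring", m=1, n=2)
print(expanded)
```

With these assumed vectors, the strongest theme of "spring" is theme 0, and the two words with the largest weights in that theme become its expanded keywords.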
As in Embodiment two, it should be noted here that operations 302-303, which generate the first matrix and the second matrix, have no strict sequential relationship with operation 301 and operations 304-307; this embodiment illustrates only one possible order. When a retrieval request is received and handled for the first time, operations 302-303 must be performed once before operations 304-307, and may also be performed before operation 301. As time goes on, when a new retrieval request is received, operations 302-303 may be repeated, or they may be performed again when the degree of document updates in the document library is detected to have reached a set threshold.
On the basis of the above embodiments, determining a new theme vector according to the theme vectors of the candidate keywords and the expanded keywords in the first matrix, and determining the target documents in the document library according to the new theme vector and the theme vector of each document, is preferably performed as follows:
weight the theme vector of the candidate keyword corresponding to each of the M themes and the theme vectors of the expanded keywords corresponding to that theme, to obtain a theme vector set;
add the theme vectors in the theme vector set and normalize the sum, to obtain the new theme vector;
compute the similarity between the new theme vector and the theme vectors of the documents in the second matrix, and determine the target documents in the document library according to the result of the similarity computation.
Here, the weighting factor is obtained from the weight, in a given theme among the M themes, of the candidate keyword corresponding to that theme.
In this preferred technical solution, the K-L distance or the cosine similarity can be computed between the new theme vector and the theme vector of each document generated with the topic model. The higher the similarity, the more similar the distributions of the two vectors over the themes, and the documents corresponding to such similarity can be taken as the target documents.
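The weighting, summation, normalization and similarity steps can be sketched as follows. This sketch makes several assumptions that the patent leaves open: the normalization divides by the sum of the components so that the new vector sums to 1, cosine similarity is used rather than K-L distance, and the vectors, weighting factors and document identifiers are invented for illustration.

```python
import math


def new_theme_vector(weighted_vectors):
    """Sum factor * vector over all (factor, vector) pairs, then normalize to sum to 1."""
    dim = len(weighted_vectors[0][1])
    summed = [0.0] * dim
    for factor, vector in weighted_vectors:
        for i, value in enumerate(vector):
            summed[i] += factor * value
    total = sum(summed)
    return [value / total for value in summed] if total else summed


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(query_vector, doc_theme):
    """Document ids ordered from most to least similar to the query theme vector."""
    return sorted(doc_theme, key=lambda d: cosine(query_vector, doc_theme[d]), reverse=True)


# Hypothetical theme vectors; 0.8 is the assumed weight of "spring" in its top theme
weighted = [
    (0.8, [0.8, 0.1, 0.1]),  # candidate keyword "spring"
    (0.8, [0.7, 0.2, 0.1]),  # expanded keyword "greenweed"
    (0.8, [0.6, 0.1, 0.3]),  # expanded keyword "spring breeze"
]
query = new_theme_vector(weighted)

# Hypothetical second matrix: one theme vector per document
doc_theme = {"doc_a": [0.9, 0.05, 0.05], "doc_b": [0.1, 0.8, 0.1]}
ranking = rank_documents(query, doc_theme)
print(ranking)
```

A document whose theme distribution concentrates on the same theme as the query vector ranks first, even if it never contains the literal candidate keyword.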
On the basis of the above embodiments, after the target documents in the document library have been determined according to the new theme vector and the theme vectors of the documents, the method may further include:
calculating the relevance between the search statement in the retrieval request and the sentences in the determined target documents that contain the candidate keywords and expanded keywords;
outputting and displaying the sentences in the corresponding documents whose relevance meets a set threshold;
when a trigger operation on a displayed sentence is received, outputting and displaying the document corresponding to that displayed sentence.
In this technical solution, only the relevant sentences in the target documents are output and displayed, which saves the user's reading time. When a trigger operation on a displayed sentence is received, the document corresponding to that sentence is then output and displayed, which helps the user locate the specific document quickly.
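The sentence-level filtering described above can be sketched as follows. The patent does not say how the relevance is computed, so the Jaccard word overlap used in `relevance`, the threshold value and all names and data below are assumptions made for illustration only.

```python
def relevance(sentence_words, query_words):
    """Assumed relevance measure: Jaccard overlap between sentence words and query words."""
    a, b = set(sentence_words), set(query_words)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sentences_to_display(doc_sentences, keywords, query_words, threshold=0.2):
    """Return (doc_id, sentence) pairs that contain a keyword and meet the relevance threshold."""
    shown = []
    for doc_id, sentences in doc_sentences.items():
        for sentence in sentences:
            has_keyword = any(keyword in sentence for keyword in keywords)
            if has_keyword and relevance(sentence, query_words) >= threshold:
                shown.append((doc_id, sentence))
    return shown


# Hypothetical target document, already split into segmented sentences
doc_sentences = {
    "doc_a": [
        ["spring", "breeze", "blows"],
        ["the", "river", "flows"],
    ],
}
keywords = {"spring", "breeze"}   # candidate and expanded keywords
query_words = ["spring", "scene"]  # segmented search statement

shown = sentences_to_display(doc_sentences, keywords, query_words)
print(shown)
```

Keeping the document identifier next to each displayed sentence is what would allow a user interface to open the full document when the sentence is triggered.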
Embodiment four
Fig. 4 is a schematic structural diagram of a device for retrieving based on keywords provided by Embodiment four of the present invention. This embodiment is applicable to the situation where, after a retrieval request entered by a user is received, relevant information is retrieved according to the request so as to provide a service to the user. Referring to Fig. 4, the device includes:
a candidate keyword determining module 401, configured to determine the candidate keywords in a retrieval request according to the prediction weights of the basic keywords in the document library, wherein the prediction weight of a basic keyword is determined according to the structural information of the basic keyword in the documents of the document library;
an expanded keyword determining module 402, configured to determine other expanded keywords according to the theme to which the candidate keywords belong in the document library;
a retrieval module 403, configured to retrieve in the document library according to the candidate keywords and the expanded keywords.
Here, the structural information of a basic keyword in a document includes the position of the basic keyword in the document, the part of speech of the basic keyword, the part of speech of the preceding word and/or the part of speech of the following word.
Embodiment five
Fig. 5 is a schematic structural diagram of a device for retrieving based on keywords provided by Embodiment five of the present invention. On the basis of the above technical solution, this technical solution adds a keyword extraction module 501 and a prediction weight determining module 502. Referring to Fig. 5, in the device:
the keyword extraction module 501 is configured to perform keyword extraction on the documents in the document library using a graph-based unsupervised keyword extraction method, obtain the basic keyword set of the document library, and generate the statistical weight of each basic keyword in the set and its structural information in the documents;
the prediction weight determining module 502 is configured to obtain, using a preset binary classification algorithm, the prediction weight of each basic keyword according to the obtained basic keywords and their structural information in the documents;
preferably, the candidate keyword determining module 503 is specifically configured to:
search the basic keyword set for the basic keywords that match the segmented words in the retrieval request, and obtain the prediction weight and statistical weight of each matched basic keyword;
weight the statistical weight and the prediction weight of each matched basic keyword to generate a new weight for that basic keyword;
take the matched basic keywords whose new weight meets a set condition as the candidate keywords;
the expanded keyword determining module 504 is configured to determine other expanded keywords according to the theme to which the candidate keywords belong in the document library;
the retrieval module 505 is configured to retrieve in the document library according to the candidate keywords and the expanded keywords.
Here, the statistical weight is the ratio of the number of documents in the document library that contain the basic keyword to the total number of documents. The preset binary classification method is, for example, a binary classification method based on a support vector machine model, a binary classification method based on maximum entropy, or a binary classification method based on a logistic regression model.
Embodiment six
Fig. 6 is a schematic structural diagram of a device for retrieving based on keywords provided by Embodiment six of the present invention. On the basis of the above technical solutions, the device adds a weight matrix generation module 602 and a theme vector generation module 603. Referring to Fig. 6, in the device:
Candidate keywords determining module 601, for determining that retrieval please according to the prediction weight of basic key word in document library
Candidate keywords in asking, wherein the prediction weight of the basic key word be according to basic key word in the document of document library
Structural information determine;
Weight matrix generation module 602, for carrying out participle to the document in the document library, generates by the document library
The matrix of middle participle weight composition within said document;
Theme vector generation module 603, for being trained to the document in the document library using topic model, by institute
It is the first matrix being made up of the theme vector of participle and the second matrix being made up of the theme vector of document to state matrix decomposition
Product, wherein, the weight of the theme vector of participle by participle in theme is constituted, the theme vector of document by theme in a document
Weight composition.
Expanded keyword determining module 604, specifically configured to: query the theme vector of a candidate keyword from the first matrix, and determine, according to the queried theme vector, the top M themes in which the candidate keyword has the largest distribution weight; query the segmented-word vectors of the themes from the first matrix, and determine, according to the queried segmented-word vectors, the top N segmented words with the largest theme distribution weight in the M themes, as the expanded keywords of the corresponding themes; wherein M and N are natural numbers.
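The two lookups performed by this module can be sketched in Python. The dictionary layout of the first matrix, the function names, and the choice to exclude the candidate keyword itself from its own expansion are assumptions for illustration.

```python
def top_themes(word_theme, keyword, m):
    """Indices of the M themes in which the keyword has the largest weight."""
    vec = word_theme[keyword]
    order = sorted(range(len(vec)), key=lambda t: vec[t], reverse=True)
    return order[:m]


def expand_keywords(word_theme, keyword, m, n):
    """For each of the top M themes, pick the N words with the largest weight."""
    expanded = {}
    for theme in top_themes(word_theme, keyword, m):
        ranked = sorted(
            (w for w in word_theme if w != keyword),  # assumption: skip the keyword itself
            key=lambda w: word_theme[w][theme],
            reverse=True,
        )
        expanded[theme] = ranked[:n]
    return expanded


# Hypothetical first matrix: word -> weights over three themes.
word_theme = {
    "spring": [0.7, 0.2, 0.1],
    "flower": [0.6, 0.1, 0.3],
    "rain": [0.3, 0.6, 0.1],
    "snow": [0.05, 0.15, 0.8],
    "garden": [0.5, 0.3, 0.2],
}

expanded = expand_keywords(word_theme, "spring", m=2, n=2)
print(expanded)  # {0: ['flower', 'garden'], 1: ['rain', 'garden']}
```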
Retrieval module 605, specifically configured to: determine a new theme vector according to the theme vectors of the candidate keywords and the expanded keywords in the first matrix; and determine the target documents in the document library according to the new theme vector and the theme vectors of the documents.
Preferably, the retrieval module 605 further includes:
A theme vector set generating unit, configured to weight the theme vector of the candidate keyword corresponding to a theme among the M themes and the theme vectors of the expanded keywords corresponding to that theme, to obtain a theme vector set, wherein the weighting factor is obtained according to the weight, in that theme, of the candidate keyword corresponding to the theme among the M themes;
A new theme vector generating unit, configured to add the theme vectors in the theme vector set and then normalize the sum, to obtain the new theme vector;
A similarity calculation unit, configured to calculate the similarity between the new theme vector and the theme vectors of the documents in the second matrix, and determine the target documents in the document library according to the result of the similarity calculation.
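The three units above can be sketched together in Python. The text does not fix the exact weighting formula, the normalization, or the similarity measure, so this sketch assumes that the weighting factor is the candidate keyword's weight in the theme, that normalization is to unit Euclidean length, and that similarity is cosine similarity. All names and values are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def query_theme_vector(word_theme, candidate, expanded_by_theme):
    """Weight the keyword vectors per theme, add them, then normalize."""
    total = [0.0] * len(word_theme[candidate])
    for theme, expanded in expanded_by_theme.items():
        factor = word_theme[candidate][theme]  # assumed weighting factor
        for word in [candidate] + list(expanded):
            for i, x in enumerate(word_theme[word]):
                total[i] += factor * x
    return normalize(total)


def cosine(a, b):
    """Cosine similarity of two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_documents(query_vec, doc_theme):
    """Document ids ordered from most to least similar to the query vector."""
    return sorted(doc_theme, key=lambda d: cosine(query_vec, doc_theme[d]), reverse=True)


word_theme = {"spring": [0.8, 0.2], "flower": [0.7, 0.3], "rain": [0.4, 0.6]}
expanded_by_theme = {0: ["flower"], 1: ["rain"]}
doc_theme = {"doc_a": [0.9, 0.1], "doc_b": [0.1, 0.9]}

new_vector = query_theme_vector(word_theme, "spring", expanded_by_theme)
ranking = rank_documents(new_vector, doc_theme)
print(ranking)  # ['doc_a', 'doc_b']
```

The target documents would then be the top-ranked entries, or those whose similarity exceeds a chosen threshold; the cut-off rule is not specified here.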
On the basis of the above technical solutions, the device further includes a display processing module (not shown), configured to, after the retrieval module determines the target documents in the document library according to the new theme vector and the theme vectors of the documents:
calculate the degree of association between the sentences in the determined target documents that contain the candidate keywords and the expanded keywords and the search sentence in the retrieval request;
output and display the sentences, in the corresponding documents, whose degree of association meets a set threshold;
when a trigger operation on a displayed sentence is received, output and display the document corresponding to the displayed sentence.
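The sentence-level filtering performed by the display processing module can be sketched as follows. The text does not say how the degree of association is computed, so this sketch assumes a Jaccard overlap between word sets. The function names, the sample documents, and the threshold are hypothetical.

```python
def association(sentence_words, query_words):
    """Jaccard overlap between two word sets (assumed association measure)."""
    a, b = set(sentence_words), set(query_words)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sentences_to_display(doc_sentences, keywords, query_words, threshold):
    """Return (document id, sentence) pairs that contain a keyword and meet the threshold."""
    hits = []
    for doc_id, sentences in doc_sentences.items():
        for words in sentences:
            has_keyword = bool(keywords & set(words))
            if has_keyword and association(words, query_words) >= threshold:
                hits.append((doc_id, " ".join(words)))
    return hits


# Hypothetical target documents, already split into segmented sentences.
target_documents = {
    "doc_a": [["spring", "flower", "bloom"], ["the", "market", "opens"]],
    "doc_b": [["spring", "rain", "falls"]],
}
keywords = {"spring", "flower", "rain"}  # candidate plus expanded keywords
query = ["spring", "flower"]

hits = sentences_to_display(target_documents, keywords, query, threshold=0.5)
print(hits)  # [('doc_a', 'spring flower bloom')]
```

Keeping the document id next to each displayed sentence is what lets a later trigger operation on that sentence open the corresponding document.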
In the above embodiments of the present invention, the method embodiments and the device embodiments belong to the same inventive concept. For details not described in the device embodiments, refer to Embodiments one to three, which describe the keyword-based retrieval method. The above product can execute the method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
Embodiment seven
Fig. 7 is a schematic diagram of determining the candidate keywords in a retrieval request, provided by Embodiment seven of the present invention. Fig. 8 is a schematic diagram of determining expanded keywords and retrieving, provided by Embodiment seven of the present invention. This embodiment provides a preferred implementation based on the above embodiments.
Referring to Fig. 7, the process of determining the candidate keywords in a retrieval request includes:
(1) A model essay resource 701 is provided; this resource is all the documents in the document library;
(2) A graph-based unsupervised keyword extraction model 702 extracts the model essay keywords 703 (the basic keywords);
(3) A descriptor weight analysis module 704 obtains, for each word in the model essay keywords 703, the information 705 that the word is a keyword, i.e. the statistical weight of each word in the model essay keywords 703;
(4) A feature extraction module and support vector machine training module 706 generates a keyword judgment model 707, which yields the prediction weight of each word in the model essay keywords 703;
(5) The candidate keywords in the retrieval request are determined according to the prediction weight and the statistical weight of each word in the model essay keywords 703.
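Step (5) can be sketched in Python as follows. The linear combination with a mixing factor `alpha`, the threshold used as the set condition, and all sample values are assumptions; the text only states that the statistical weight and the prediction weight are weighted into a new weight and that keywords meeting a set condition become candidate keywords.

```python
def select_candidates(query_words, basic_keywords, alpha=0.5, threshold=0.4):
    """Combine statistical and prediction weights; keep keywords above the threshold."""
    candidates = {}
    for word in query_words:
        if word not in basic_keywords:
            continue  # only segmented words that match a basic keyword are considered
        stat_weight, pred_weight = basic_keywords[word]
        new_weight = alpha * stat_weight + (1 - alpha) * pred_weight
        if new_weight >= threshold:
            candidates[word] = new_weight
    return candidates


# Hypothetical basic keyword set: keyword -> (statistical weight, prediction weight).
basic_keywords = {
    "spring": (0.5, 0.9),
    "garden": (0.5, 0.1),
    "snow": (0.25, 0.8),
}

# Segmented words of a hypothetical retrieval request.
query_words = ["spring", "garden", "picnic"]

candidates = select_candidates(query_words, basic_keywords)
print(candidates)
```

With these values only "spring" is kept: "garden" falls below the threshold and "picnic" is not a basic keyword.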
Referring to Fig. 8, the process of determining expanded keywords and retrieving includes:
(1) A topic model training module 802 trains on the model essay resource 801 to obtain a word-theme distribution matrix 803 (the first matrix described in Embodiment three) and a document-theme distribution matrix 804 (the second matrix described in Embodiment three);
(2) According to the descriptors 805 obtained from the query (the candidate keywords extracted from the retrieval request) and the word-theme distribution matrix 803, a descriptor expansion module 806 obtains expanded keywords and candidate keywords that comprehensively and accurately express the user's intent;
(3) A query semantic vector 807 (the new theme vector described in Embodiment three) is obtained according to the theme vectors of the expanded keywords and the candidate keywords in the word-theme distribution matrix 803;
(4) A semantic similarity calculation module 808 calculates the similarity between the query semantic vector 807 and the theme vector of each document in the document-theme distribution matrix 804, and the target model essays in the model essay resource 801 are determined according to the result of the similarity calculation.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, the present invention is not limited to the above embodiments and may include other equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.
Claims (18)
1. A method for retrieving based on keywords, characterized by comprising:
determining candidate keywords in a retrieval request according to prediction weights of basic keywords in a document library, wherein the prediction weight of a basic keyword is determined according to structural information of the basic keyword in the documents of the document library;
determining other expanded keywords according to the themes to which the candidate keywords belong in the document library;
retrieving in the document library according to the candidate keywords and the expanded keywords;
wherein determining the candidate keywords in the retrieval request according to the prediction weights of the basic keywords in the document library comprises:
searching a basic keyword set for the basic keywords that match the segmented words in the retrieval request, and obtaining the prediction weight and the statistical weight of each matched basic keyword;
weighting the statistical weight and the prediction weight of each matched basic keyword to generate a new weight of the matched basic keyword;
taking the basic keywords whose new weights meet a set condition as the candidate keywords.
2. The method for retrieving based on keywords according to claim 1, characterized by further comprising:
performing keyword extraction on the documents in the document library based on a graph-based unsupervised keyword extraction method, to obtain the basic keyword set of the document library, and generating the statistical weight and the in-document structural information of each basic keyword in the basic keyword set;
obtaining the prediction weights of the basic keywords using a set binary classification algorithm, according to the obtained basic keywords and their structural information in the documents;
wherein the statistical weight is the ratio of the number of documents in the document library that contain the basic keyword to the total number of documents.
3. The method for retrieving based on keywords according to claim 2, characterized in that the set binary classification method is a binary classification method based on a support vector machine model, a binary classification method based on maximum entropy, or a binary classification method based on a logistic regression model.
4. The method for retrieving based on keywords according to claim 1, characterized in that the structural information of a basic keyword in a document includes the position of the basic keyword in the document, the part of speech of the basic keyword, the part of speech of the preceding word and/or the part of speech of the following word.
5. The method for retrieving based on keywords according to claim 1, characterized by further comprising:
performing word segmentation on the documents in the document library, and generating a matrix composed of the weights, within the documents, of the segmented words in the document library;
training on the documents in the document library using a topic model, and decomposing the matrix into the product of a first matrix composed of the theme vectors of the segmented words and a second matrix composed of the theme vectors of the documents, wherein the theme vector of a segmented word is composed of the weights of that word in the themes, and the theme vector of a document is composed of the weights of the themes in that document.
6. The method for retrieving based on keywords according to claim 5, characterized in that determining other expanded keywords according to the themes to which the candidate keywords belong in the document library comprises:
querying the theme vector of a candidate keyword from the first matrix, and determining, according to the queried theme vector, the top M themes in which the candidate keyword has the largest distribution weight;
querying the segmented-word vectors of the M themes from the first matrix, and determining, according to the queried segmented-word vectors, the top N segmented words with the largest theme distribution weight in the M themes, as the expanded keywords of the corresponding themes;
wherein M and N are natural numbers.
7. The method for retrieving based on keywords according to claim 6, characterized in that retrieving in the document library according to the candidate keywords and the expanded keywords comprises:
determining a new theme vector according to the theme vectors of the candidate keywords and the expanded keywords in the first matrix;
determining the target documents in the document library according to the new theme vector and the theme vectors of the documents.
8. The method for retrieving based on keywords according to claim 7, characterized in that determining a new theme vector according to the theme vectors of the candidate keywords and the expanded keywords in the first matrix, and determining the target documents in the document library according to the new theme vector and the theme vectors of the documents, comprises:
weighting the theme vector of the candidate keyword corresponding to a theme among the M themes and the theme vectors of the expanded keywords corresponding to that theme, to obtain a theme vector set, wherein the weighting factor is obtained according to the weight, in that theme, of the candidate keyword corresponding to the theme among the M themes;
adding the theme vectors in the theme vector set and then normalizing the sum, to obtain the new theme vector;
calculating the similarity between the new theme vector and the theme vectors of the documents in the second matrix, and determining the target documents in the document library according to the result of the similarity calculation.
9. The method for retrieving based on keywords according to claim 7, characterized in that, after determining the target documents in the document library according to the new theme vector and the theme vectors of the documents, the method further comprises:
calculating the degree of association between the sentences in the determined target documents that contain the candidate keywords and the expanded keywords and the search sentence in the retrieval request;
outputting and displaying the sentences, in the corresponding documents, whose degree of association meets a set threshold;
when a trigger operation on a displayed sentence is received, outputting and displaying the document corresponding to the displayed sentence.
10. A device for retrieving based on keywords, characterized by comprising:
a candidate keyword determining module, configured to determine candidate keywords in a retrieval request according to prediction weights of basic keywords in a document library, wherein the prediction weight of a basic keyword is determined according to structural information of the basic keyword in the documents of the document library;
an expanded keyword determining module, configured to determine other expanded keywords according to the themes to which the candidate keywords belong in the document library;
a retrieval module, configured to retrieve in the document library according to the candidate keywords and the expanded keywords;
wherein the candidate keyword determining module is specifically configured to:
search the basic keyword set for the basic keywords that match the segmented words in the retrieval request, and obtain the prediction weight and the statistical weight of each matched basic keyword;
weight the statistical weight and the prediction weight of each matched basic keyword to generate a new weight of the matched basic keyword;
take the matched basic keywords whose new weights meet a set condition as the candidate keywords.
11. The device for retrieving based on keywords according to claim 10, characterized by further comprising:
a keyword extraction module, configured to perform keyword extraction on the documents in the document library based on a graph-based unsupervised keyword extraction method, to obtain the basic keyword set of the document library, and to generate the statistical weight and the in-document structural information of each basic keyword in the basic keyword set;
a prediction weight determining module, configured to obtain the prediction weights of the basic keywords using a set binary classification algorithm, according to the obtained basic keywords and their structural information in the documents;
wherein the statistical weight is the ratio of the number of documents in the document library that contain the basic keyword to the total number of documents.
12. The device for retrieving based on keywords according to claim 11, characterized in that the set binary classification method is a binary classification method based on a support vector machine model, a binary classification method based on maximum entropy, or a binary classification method based on a logistic regression model.
13. The device for retrieving based on keywords according to claim 10, characterized in that the structural information of a basic keyword in a document includes the position of the basic keyword in the document, the part of speech of the basic keyword, the part of speech of the preceding word and/or the part of speech of the following word.
14. The device for retrieving based on keywords according to claim 10, characterized by further comprising:
a weight matrix generation module, configured to perform word segmentation on the documents in the document library and generate a matrix composed of the weights, within the documents, of the segmented words in the document library;
a theme vector generation module, configured to train on the documents in the document library using a topic model, decomposing the matrix into the product of a first matrix composed of the theme vectors of the segmented words and a second matrix composed of the theme vectors of the documents, wherein the theme vector of a segmented word is composed of the weights of that word in the themes, and the theme vector of a document is composed of the weights of the themes in that document.
15. The device for retrieving based on keywords according to claim 14, characterized in that the expanded keyword determining module is specifically configured to:
query the theme vector of a candidate keyword from the first matrix, and determine, according to the queried theme vector, the top M themes in which the candidate keyword has the largest distribution weight;
query the segmented-word vectors of the M themes from the first matrix, and determine, according to the queried segmented-word vectors, the top N segmented words with the largest theme distribution weight in the M themes, as the expanded keywords of the corresponding themes;
wherein M and N are natural numbers.
16. The device for retrieving based on keywords according to claim 14, characterized in that the retrieval module is specifically configured to determine a new theme vector according to the theme vectors of the candidate keywords and the expanded keywords in the first matrix, and to determine the target documents in the document library according to the new theme vector and the theme vectors of the documents.
17. The device for retrieving based on keywords according to claim 16, characterized in that the retrieval module comprises:
a theme vector set generating unit, configured to weight the theme vector of the candidate keyword corresponding to a theme among the M themes and the theme vectors of the expanded keywords corresponding to that theme, to obtain a theme vector set, wherein the weighting factor is obtained according to the weight, in that theme, of the candidate keyword corresponding to the theme among the M themes;
a new theme vector generating unit, configured to add the theme vectors in the theme vector set and then normalize the sum, to obtain the new theme vector;
a similarity calculation unit, configured to calculate the similarity between the new theme vector and the theme vectors of the documents in the second matrix, and determine the target documents in the document library according to the result of the similarity calculation.
18. The device for retrieving based on keywords according to claim 16, characterized by further comprising a display processing module, configured to, after the retrieval module determines the target documents in the document library according to the new theme vector and the theme vectors of the documents:
calculate the degree of association between the sentences in the determined target documents that contain the candidate keywords and the expanded keywords and the search sentence in the retrieval request;
output and display the sentences, in the corresponding documents, whose degree of association meets a set threshold;
when a trigger operation on a displayed sentence is received, output and display the document corresponding to the displayed sentence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310710834.7A CN103699625B (en) | 2013-12-20 | 2013-12-20 | Method and device for retrieving based on keyword |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103699625A CN103699625A (en) | 2014-04-02 |
CN103699625B true CN103699625B (en) | 2017-05-10 |
Family
ID=50361153
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310710834.7A Active CN103699625B (en) | 2013-12-20 | 2013-12-20 | Method and device for retrieving based on keyword |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103699625B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN103699625A (en) | 2014-04-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||