CN103699625B - Method and device for retrieving based on keyword - Google Patents

Info

Publication number
CN103699625B
Authority
CN
China
Prior art keywords
document
theme
key word
vector
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310710834.7A
Other languages
Chinese (zh)
Other versions
CN103699625A (en)
Inventor
姜宇
吴华
胡晓光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201310710834.7A priority Critical patent/CN103699625B/en
Publication of CN103699625A publication Critical patent/CN103699625A/en
Application granted granted Critical
Publication of CN103699625B publication Critical patent/CN103699625B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for retrieving based on keywords. The method comprises: determining candidate keywords in a retrieval request based on the predicted weights of basic keywords in a document library, wherein the predicted weight of a basic keyword is determined based on structure information of the basic keyword in the documents of the document library; determining other extended keywords based on the theme (topic) to which the candidate keywords belong in the document library; and retrieving in the document library based on the candidate keywords and the extended keywords. The technical scheme provided by the invention can improve the accuracy and recall of the retrieval results and better satisfy user needs.

Description

Method and device for retrieving based on keywords
Technical field
Embodiments of the present invention relate to the field of data search technology, and in particular to a method and device for retrieving based on keywords.
Background
At present, a retrieval system typically searches a document library for relevant information according to a retrieval request entered by the user and some search strategy, thereby providing the user with a document retrieval service. One example is the Kingsoft example-sentence retrieval service: after receiving a query entered by the user, the system performs keyword-matching lookup against each document in the document library and then returns the well-written example sentences or model essays recorded in those documents.
In the prior art, after receiving a retrieval request, the retrieval system first segments the search statement contained in the request into words, then uses these segmented words as keywords to perform literal-match retrieval in the document library, and finally merges the retrieval results and returns them to the user.
The prior art has the following defects:
On the one hand, the accuracy of the retrieval results is low and they diverge considerably from the user's intent. For example, if the search statement entered by the user is "sentences describing a snowing scene", an existing retrieval system ranks the documents containing segmented words such as "snowing", "scene" and "describing" by how often those words occur in them, so the documents containing "snowing", the word that reflects the user's real need, often fail to occupy the top positions.
On the other hand, the prior art cannot comprehensively extract other documents that reflect the user's need, so recall is low. For example, if the search statement entered by the user is "spring", an existing retrieval system can only find documents that contain the word "spring". Yet some example sentences describe spring scenery without using the word, and such sentences often meet the user's need better; existing technology cannot find text that is semantically about spring but does not literally contain "spring".
Summary of the invention
Embodiments of the present invention provide a method and device for retrieving based on keywords, so as to improve the accuracy and recall of retrieval results and better meet user needs.
In a first aspect, an embodiment of the present invention provides a method for retrieving based on keywords, the method comprising:
determining candidate keywords in a retrieval request according to the predicted weights of basic keywords in a document library, wherein the predicted weight of a basic keyword is determined according to structure information of the basic keyword in the documents of the document library;
determining other extended keywords according to the topic to which the candidate keywords belong in the document library;
retrieving in the document library according to the candidate keywords and the extended keywords.
In a second aspect, an embodiment of the present invention further provides a device for retrieving based on keywords, the device comprising:
a candidate keyword determination module, configured to determine candidate keywords in a retrieval request according to the predicted weights of basic keywords in a document library, wherein the predicted weight of a basic keyword is determined according to structure information of the basic keyword in the documents of the document library;
an extended keyword determination module, configured to determine other extended keywords according to the topic to which the candidate keywords belong in the document library;
a retrieval module, configured to retrieve in the document library according to the candidate keywords and the extended keywords.
In the technical solution proposed by the embodiments of the present invention, the predicted weights of the basic keywords are obtained from the structure information of the basic keywords in the document library, and the candidate keywords in the retrieval request are determined from these predicted weights. The segmented words in the retrieval request can thus be treated differently, and the candidate keywords that express the user's intent can be extracted, so the retrieval results are more accurate. Other extended keywords are then determined according to the topic to which the candidate keywords belong in the document library, and retrieval is performed in the document library with both the candidate keywords and the extended keywords. This realizes retrieval at the semantic level, so the documents that reflect the user's need can be extracted accurately and comprehensively, and recall is higher.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a method for retrieving based on keywords provided by Embodiment one of the present invention;
Fig. 2 is a schematic flowchart of a method for retrieving based on keywords provided by Embodiment two of the present invention;
Fig. 3 is a schematic flowchart of a method for retrieving based on keywords provided by Embodiment three of the present invention;
Fig. 4 is a schematic structural diagram of a device for retrieving based on keywords provided by Embodiment four of the present invention;
Fig. 5 is a schematic structural diagram of a device for retrieving based on keywords provided by Embodiment five of the present invention;
Fig. 6 is a schematic structural diagram of a device for retrieving based on keywords provided by Embodiment six of the present invention;
Fig. 7 is a schematic diagram of determining the candidate keywords in a retrieval request provided by Embodiment seven of the present invention;
Fig. 8 is a schematic diagram of determining extended keywords and retrieving provided by Embodiment seven of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment one
Fig. 1 is a schematic flowchart of a method for retrieving based on keywords provided by Embodiment one of the present invention. This embodiment is applicable to the situation in which, after a retrieval request entered by a user is received, relevant information is retrieved according to the request so as to provide a service to the user. The method may be performed by a device with a retrieval function and specifically includes:
101. Determine candidate keywords in the retrieval request according to the predicted weights of basic keywords in the document library.
The retrieval device may determine each basic keyword in the document library in advance and use a preset algorithm to calculate a predicted weight for each basic keyword. The documents in the document library may be stored locally on the retrieval device or obtained from relevant website servers over the Internet. The predicted weight of a basic keyword in the document library is determined according to the structure information of that keyword in the documents of the document library. The structure information of a basic keyword in a document may include the position of the basic keyword in the document, the part of speech of the basic keyword, the part of speech of the preceding word and/or the part of speech of the following word. For example, when the retrieval device performs model-essay retrieval and users mostly search for modifiers rather than verbs, if a word in a document is a noun and the preceding word is a verb, the word is more likely to be a basic keyword and is given a relatively large predicted weight.
After receiving a retrieval request containing a search statement entered by the user, the retrieval device may segment the search statement into words and then analyze each segmented word according to the precomputed predicted weights of the basic keywords in the document library, taking the segmented words that meet a set condition as the candidate keywords of the retrieval request. Specifically, after the search statement is segmented, the basic keyword that matches each segmented word, together with its predicted weight, can be looked up in the basic keyword set of the document library; if the predicted weight reaches a set threshold, the segmented word is taken as a candidate keyword.
102. Determine other extended keywords according to the topic to which the candidate keywords belong in the document library.
103. Retrieve in the document library according to the candidate keywords and the extended keywords.
After obtaining the candidate keywords in the retrieval request, the retrieval device could retrieve directly in the document library, but preferably it further determines other extended keywords according to the topic to which the candidate keywords belong in the document library and then retrieves according to the candidate keywords and all or some of the extended keywords. Specifically, other keywords that have a high distribution probability on the same topic as a candidate keyword can be taken as extended keywords, because these keywords appear in similar contexts in the documents and are semantically similar when describing a scene.
In this embodiment, on the one hand, analyzing each segmented word in the retrieval request to extract the candidate keywords that express the user's intent improves the accuracy of the retrieval results. For example, suppose the search statement is "sentences describing the vigorous scene of spring". An existing technique, such as a candidate keyword extraction strategy based on part of speech and IDF (Inverse Document Frequency) weights, would yield the keywords "spring", "vigorous", "scene" and "sentence" and retrieve with all of them, so the results usually contain many documents about "scene" or "sentence" that do not match the user's expectation. Although "scene" and "sentence" are the main noun phrases of the search statement, in description-oriented retrieval the modifiers express more of the user's intent, and treating all extracted keywords equally muddles the results. In the present invention, by contrast, modifiers can be given larger predicted weights based on the structure information of keywords in the documents of the document library, and the candidate keywords that express the user's intent, "spring" and "vigorous", are then determined from these predicted weights, which improves the accuracy of the retrieval results.
On the other hand, determining other extended keywords according to the topic to which the candidate keywords belong in the document library, and retrieving with all candidate and extended keywords, realizes retrieval at the semantic level, so the documents that reflect the user's need can be extracted accurately and comprehensively and the recall of the retrieval results is improved. For example, when the candidate keyword obtained from the retrieval request is "spring", an existing technique can only find documents containing "spring", or can apply synonym expansion to "spring" and find documents containing synonyms such as "spring tide". In the embodiment of the present invention, documents containing extended keywords such as "greenweed" and "spring breeze" can also be obtained according to the topic to which "spring" belongs in the document library, which raises the recall of the retrieval results.
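The threshold test described for operation 101 can be illustrated with a minimal sketch. The weight table, the threshold value of 0.5 and the function name `select_candidates` are hypothetical and chosen only for illustration; the text does not fix any of them.

```python
# Hypothetical predicted weights for basic keywords, as if precomputed
# offline from the document library (all values are made up).
PREDICTED_WEIGHT = {
    "spring": 0.92,
    "vigorous": 0.85,
    "scene": 0.30,
    "sentence": 0.12,
}


def select_candidates(tokens, weights, threshold=0.5):
    """Keep the segmented words whose predicted weight reaches the threshold.

    Words that match no basic keyword get weight 0.0 and are dropped.
    The threshold of 0.5 is an assumed value, not one given in the text.
    """
    return [t for t in tokens if weights.get(t, 0.0) >= threshold]


# Segmented words of the query "sentences describing the vigorous scene of spring".
query_tokens = ["describe", "spring", "vigorous", "scene", "sentence"]
candidates = select_candidates(query_tokens, PREDICTED_WEIGHT)
```

With these made-up weights, the modifiers "spring" and "vigorous" pass the threshold while "scene" and "sentence" do not, which mirrors the example in the text.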
Embodiment two
Fig. 2 is a schematic flowchart of a method for retrieving based on keywords provided by Embodiment two of the present invention. On the basis of Embodiment one, this embodiment adds operations that obtain the predicted weights of the basic keywords in the documents using a graph-based unsupervised keyword extraction method and a set binary classification algorithm, so that the keywords in the document library are analyzed in a semi-supervised manner and the candidate keywords that express the user's intent in the retrieval request are determined effectively and quickly. Referring to Fig. 2, the method includes:
201. Using a graph-based unsupervised keyword extraction method, extract keywords from the documents in the document library to obtain the basic keyword set of the document library, and generate the statistical weight of each basic keyword in the set and its structure information in the documents;
202. Using a set binary classification algorithm, obtain the predicted weight of each basic keyword according to the obtained basic keywords and their structure information in the documents;
203. Determine candidate keywords in the retrieval request according to the predicted weights of basic keywords in the document library;
204. Determine other extended keywords according to the topic to which the candidate keywords belong in the document library;
205. Retrieve in the document library according to the candidate keywords and the extended keywords.
In this embodiment, a graph-based unsupervised keyword extraction method is first used to extract basic keywords from some or all of the documents in the document library. The extracted basic keywords are then used to build statistical weight information, and the structure information of the extracted basic keywords is analyzed at the same time. The statistical weight is the ratio of the number of documents in the document library that contain the basic keyword to the total number of documents. Next, a set binary classification method is applied to the extracted basic keywords and their structure information, used as features, to obtain the predicted weights of the basic keywords. The set binary classification method may be one based on a support vector machine model, on maximum entropy, or on a logistic regression model. For example, using features such as the extracted structure information, the basic keywords extracted from the document library are labeled as positive examples and non-basic keywords as negative examples, and a support vector machine model is trained to produce a predicted weight for each basic keyword. In the embodiment of the present invention, the prediction mode of the support vector machine is modified so that the original 0/1 output is replaced by a predicted weight of belonging to each of the two classes.
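The statistical weight defined above, the share of documents that contain a basic keyword, can be sketched as follows. The toy documents and the function name `statistical_weights` are hypothetical; documents are assumed to be already segmented into word lists.

```python
def statistical_weights(docs, basic_keywords):
    """Return keyword -> (documents containing it) / (total documents).

    docs is a list of documents, each already segmented into a list of words.
    """
    total = len(docs)
    weights = {}
    for kw in basic_keywords:
        containing = sum(1 for doc in docs if kw in doc)
        weights[kw] = containing / total if total else 0.0
    return weights


# Toy document library (made-up content).
docs = [
    ["spring", "breeze", "green", "grass"],
    ["snow", "falls", "winter"],
    ["spring", "flowers", "bloom"],
    ["winter", "wind"],
]
stat = statistical_weights(docs, {"spring", "snow"})
```

Here "spring" appears in two of four documents and "snow" in one, so their statistical weights are 0.5 and 0.25 respectively.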
In this embodiment, analyzing the basic keywords in the document library in a semi-supervised manner makes it possible to determine effectively and quickly the candidate keywords that express the user's intent in the retrieval request. This overcomes both the drawback of a purely unsupervised approach, which cannot combine multiple kinds of information when extracting candidate keywords, and the problem of a purely supervised approach, which is time-consuming and labor-intensive.
It should be noted that there is no strict ordering between the operations that generate the statistical weights and predicted weights of the basic keywords of the document library and the retrieval operations performed for a retrieval request. When a retrieval request is received for the first time, operations 201 and 202 must be performed once before operations 203 to 205. As time passes, however, operations 201 and 202 may be repeated when a new retrieval request is received, or may be performed again when the degree to which the documents in the document library have been updated is detected to reach a set threshold.
On the basis of the above technical solution, determining the candidate keywords in the retrieval request according to the predicted weights of the basic keywords in the document library is preferably performed as follows:
searching the basic keyword set for the basic keywords that match the segmented words in the retrieval request, and obtaining the predicted weights and statistical weights of the matched basic keywords;
weighting the statistical weight and the predicted weight of each matched basic keyword to generate a new weight for it;
taking the basic keywords whose new weight meets a set condition as candidate keywords.
For example, the segmented words in the retrieval request that correspond to basic keywords with relatively high predicted weights and statistical weights in the document library can be taken as candidate keywords. Under this preferred technical solution, considering both the predicted weight and the statistical weight of each basic keyword when determining the candidate keywords can further improve the accuracy of the retrieval results.
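The three steps above can be sketched as follows. The text does not specify how the two weights are combined or what the set condition is, so the linear blend with mixing factor `alpha` and the threshold of 0.5 are assumptions made only for illustration, as are the weight values.

```python
def combined_weight(stat_w, pred_w, alpha=0.5):
    """Linear blend of the two weights; alpha is an assumed mixing factor."""
    return alpha * stat_w + (1.0 - alpha) * pred_w


def candidates_by_new_weight(tokens, stat, pred, alpha=0.5, threshold=0.5):
    """Keep segmented words that match a basic keyword and whose new weight
    reaches the (assumed) threshold."""
    result = []
    for tok in tokens:
        if tok in stat and tok in pred:  # the word matches a basic keyword
            if combined_weight(stat[tok], pred[tok], alpha) >= threshold:
                result.append(tok)
    return result


# Made-up statistical and predicted weights for three basic keywords.
stat = {"spring": 0.5, "scene": 0.75, "vigorous": 0.25}
pred = {"spring": 0.9, "scene": 0.125, "vigorous": 0.875}
picked = candidates_by_new_weight(
    ["describe", "spring", "vigorous", "scene"], stat, pred
)
```

In this toy data "scene" occurs in many documents but has a low predicted weight, so its blended weight (0.4375) falls below the threshold, while "spring" (0.7) and "vigorous" (0.5625) are kept.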
Embodiment three
Fig. 3 is a schematic flowchart of a method for retrieving based on keywords provided by Embodiment three of the present invention. On the basis of the above embodiments, this embodiment adds the technical feature of obtaining the topic vectors of documents and of their segmented words and, building on that feature, further refines the steps of determining other extended keywords according to the topic to which the candidate keywords belong in the document library and of retrieving in the document library according to the candidate keywords and the extended keywords. Referring to Fig. 3, the method includes:
301. Determine candidate keywords in the retrieval request according to the predicted weights of basic keywords in the document library;
302. Segment the documents in the document library into words and generate a matrix composed of the weights of the segmented words of the document library in the documents;
303. Train on the documents in the document library using a topic model, decomposing the matrix into the product of a first matrix composed of the topic vectors of the segmented words and a second matrix composed of the topic vectors of the documents;
304. Query the topic vector of each candidate keyword from the first matrix and, from the topic vector obtained, determine the top M topics on which the candidate keyword has the largest distribution weights;
305. Query the word vectors of the M topics from the first matrix and, from the word vectors obtained, determine for each of the M topics the top N segmented words with the largest distribution weights on that topic, as the extended keywords of that topic;
306. Determine a new topic vector according to the topic vectors of the candidate keywords and the extended keywords in the first matrix;
307. Determine the target documents in the document library according to the new topic vector and the topic vectors of the documents.
Here, the topic vector of a segmented word is composed of the weights of that word in each topic, the topic vector of a document is composed of the weights of each topic in that document, and M and N are natural numbers.
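Operations 304 and 305 can be sketched on a small hand-written first matrix. The matrix values, the three-topic layout and the function names are hypothetical; in practice the matrix would come from topic model training. Excluding the candidate keyword itself from its own expansion list is also an assumption.

```python
# Hypothetical first matrix: one topic vector per segmented word
# (rows are words, columns are topics; all values are made up).
WORD_TOPICS = {
    "spring": [0.70, 0.05, 0.25],
    "grass":  [0.60, 0.10, 0.30],
    "breeze": [0.50, 0.20, 0.30],
    "snow":   [0.05, 0.90, 0.05],
    "winter": [0.10, 0.80, 0.10],
    "scene":  [0.30, 0.30, 0.40],
}


def top_topics(word, word_topics, m):
    """Operation 304: indices of the M topics where the word's weight is largest."""
    vec = word_topics[word]
    return sorted(range(len(vec)), key=lambda t: vec[t], reverse=True)[:m]


def expansion_keywords(word, word_topics, m=1, n=2):
    """Operation 305: for each of the top M topics, the N other words with the
    largest weight on that topic (the candidate itself is skipped, an assumption)."""
    expanded = {}
    for topic in top_topics(word, word_topics, m):
        others = [w for w in word_topics if w != word]
        others.sort(key=lambda w: word_topics[w][topic], reverse=True)
        expanded[topic] = others[:n]
    return expanded


expanded = expansion_keywords("spring", WORD_TOPICS, m=1, n=2)
```

With this toy matrix the strongest topic for "spring" is topic 0, and the two words weighted most heavily on that topic are "grass" and "breeze", neither of which is a literal synonym of "spring".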
In this embodiment, after the candidate keywords in the retrieval request are determined according to the predicted weights of the basic keywords in the document library, the following steps are performed. First, from the topic vectors obtained by topic model training, the topics on which a candidate keyword has larger weights are found. Then, from the topic vectors of the segmented words obtained by topic model training, the segmented words with larger weights under those topics are found. Finally, these segmented words are used as extended keywords. Unlike synonym-based expansion, the extended keywords obtained in this way semantically match the retrieval intent expressed by the search statement in the retrieval request. Retrieving with these extended keywords and the candidate keywords can greatly improve document recall and meet user needs.
As in Embodiment two, it should be noted that there is no strict ordering between operations 302 and 303, which generate the first and second matrices, and operations 301 and 304 to 307; this embodiment illustrates only one possible ordering. When a retrieval request is received for the first time, operations 302 and 303 must be performed once before operations 304 to 307, and may also be performed before operation 301. As time passes, however, operations 302 and 303 may be repeated when a new retrieval request is received, or may be performed again when the degree to which the documents in the document library have been updated is detected to reach a set threshold.
On the basis of the above embodiments, determining a new topic vector according to the topic vectors of the candidate keywords and the extended keywords in the first matrix, and determining the target documents in the document library according to the new topic vector and the topic vector of each document, is preferably performed as follows:
weighting the topic vector of the candidate keyword corresponding to each of the M topics and the topic vectors of the extended keywords corresponding to that topic to obtain a topic vector set;
adding the topic vectors in the topic vector set and normalizing the sum to obtain the new topic vector;
calculating the similarity between the new topic vector and the topic vectors of the documents in the second matrix, and determining the target documents in the document library according to the similarity results.
Here, the weighting factor is obtained from the weight, in each of the M topics, of the candidate keyword corresponding to that topic.
Under this preferred technical solution, the K-L distance or cosine similarity can be calculated between the new topic vector and the topic vector of each document generated by the topic model. The higher the similarity, the more alike the distributions of the two vectors over the topics, and the documents with such similarity can be taken as target documents.
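The weighting, normalization and similarity steps can be sketched as follows, using cosine similarity. Several details are assumptions, since the text leaves them open: the weighting factor is taken to be the candidate keyword's weight in the topic itself, normalization divides by the sum of the components, and the two small matrices hold made-up values.

```python
import math


def new_topic_vector(candidate_vec, expansions_by_topic, word_topics):
    """Scale the candidate's and the extended keywords' topic vectors by the
    candidate's weight in each selected topic, add them up, then normalize."""
    dim = len(candidate_vec)
    total = [0.0] * dim
    for topic, words in expansions_by_topic.items():
        factor = candidate_vec[topic]  # assumed form of the weighting factor
        for vec in [candidate_vec] + [word_topics[w] for w in words]:
            for i in range(dim):
                total[i] += factor * vec[i]
    norm = sum(total)
    return [x / norm for x in total] if norm else total


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


# Hypothetical first matrix (words) and second matrix (documents), two topics.
WORD_TOPICS = {"spring": [0.7, 0.3], "grass": [0.6, 0.4], "breeze": [0.8, 0.2]}
DOC_TOPICS = {"doc_spring": [0.9, 0.1], "doc_winter": [0.1, 0.9]}

query_vec = new_topic_vector(
    WORD_TOPICS["spring"], {0: ["grass", "breeze"]}, WORD_TOPICS
)
ranked = sorted(
    DOC_TOPICS, key=lambda d: cosine(query_vec, DOC_TOPICS[d]), reverse=True
)
```

The document whose topic distribution is closest to the combined query vector is ranked first; the K-L distance mentioned in the text could be substituted for `cosine` with the sort order reversed.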
On the basis of the above embodiments, after the target documents in the document library are determined according to the new topic vector and the topic vectors of the documents, the method may further include:
calculating the relevance between the sentences in the determined target documents that contain the candidate keywords and extended keywords and the search statement in the retrieval request;
outputting for display the sentences in the corresponding documents whose relevance meets a set threshold;
when a trigger operation on a displayed sentence is received, outputting for display the document to which that sentence belongs.
In this technical solution, only the relevant sentences in the target documents are output for display, which saves the user reading time. When a trigger operation on a displayed sentence is received, the document to which the sentence belongs is then output for display, which helps the user quickly locate the specific document.
Embodiment four
Fig. 4 is a schematic structural diagram of a device for retrieving based on keywords provided by Embodiment four of the present invention. This embodiment is applicable to the situation in which, after a retrieval request entered by a user is received, relevant information is retrieved according to the request so as to provide a service to the user. Referring to Fig. 4, the device includes:
a candidate keyword determination module 401, configured to determine candidate keywords in a retrieval request according to the predicted weights of basic keywords in a document library, wherein the predicted weight of a basic keyword is determined according to structure information of the basic keyword in the documents of the document library;
an extended keyword determination module 402, configured to determine other extended keywords according to the topic to which the candidate keywords belong in the document library;
a retrieval module 403, configured to retrieve in the document library according to the candidate keywords and the extended keywords.
Here, the structure information of a basic keyword in a document includes the position of the basic keyword in the document, the part of speech of the basic keyword, the part of speech of the preceding word and/or the part of speech of the following word.
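The structure information listed above can be gathered into a feature record for a classifier, as in the following sketch. The input format (a list of word and part-of-speech pairs), the boundary markers `<BOS>` and `<EOS>`, and the encoding of position as a relative offset are all assumptions; the text names the features but not their representation.

```python
def structural_features(tagged_doc, index):
    """Collect the structure information of the word at the given index.

    tagged_doc is a list of (word, pos_tag) pairs for one segmented document.
    """
    word, pos = tagged_doc[index]
    prev_pos = tagged_doc[index - 1][1] if index > 0 else "<BOS>"
    next_pos = tagged_doc[index + 1][1] if index + 1 < len(tagged_doc) else "<EOS>"
    return {
        "word": word,
        "position": index / len(tagged_doc),  # relative position (assumed encoding)
        "pos": pos,
        "prev_pos": prev_pos,
        "next_pos": next_pos,
    }


# Toy tagged document: "v" marks a verb and "n" a noun (hypothetical tag set).
doc = [("describe", "v"), ("spring", "n"), ("scenery", "n")]
feats = structural_features(doc, 1)
```

For "spring" in this toy document the record notes a noun preceded by a verb, the pattern that the example in Embodiment one associates with a higher chance of being a basic keyword.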
Embodiment five
Fig. 5 is a schematic structural diagram of a device for retrieving based on keywords provided by Embodiment five of the present invention. On the basis of the above technical solution, this technical solution adds a keyword extraction module 501 and a predicted weight determination module 502. Referring to Fig. 5, in the device:
the keyword extraction module 501 is configured to extract keywords from the documents in the document library using a graph-based unsupervised keyword extraction method, obtain the basic keyword set of the document library, and generate the statistical weight of each basic keyword in the set and its structure information in the documents;
the predicted weight determination module 502 is configured to use a set binary classification algorithm to obtain the predicted weight of each basic keyword according to the obtained basic keywords and their structure information in the documents;
preferably, the candidate keyword determination module 503 is specifically configured to:
search the basic keyword set for the basic keywords that match the segmented words in the retrieval request, and obtain the predicted weights and statistical weights of the matched basic keywords;
weight the statistical weight and the predicted weight of each matched basic keyword to generate a new weight for it;
take the matched basic keywords whose new weight meets a set condition as candidate keywords;
the extended keyword determination module 504 is configured to determine other extended keywords according to the topic to which the candidate keywords belong in the document library;
the retrieval module 505 is configured to retrieve in the document library according to the candidate keywords and the extended keywords.
Here, the statistical weight is the ratio of the number of documents in the document library that contain the basic keyword to the total number of documents; the set binary classification method may be one based on a support vector machine model, on maximum entropy, or on a logistic regression model.
Embodiment six
Fig. 6 is a schematic structural diagram of a keyword-based retrieval device provided by Embodiment six of the present invention. On the basis of the above technical solutions, the device adds a weight matrix generation module 602 and a topic vector generation module 603. Referring to Fig. 6, in the device:
Candidate keyword determination module 601, configured to determine the candidate keywords in a retrieval request according to the prediction weights of the base keywords in the document library, where the prediction weight of a base keyword is determined from the structural information of the base keyword within the documents of the document library;
Weight matrix generation module 602, configured to perform word segmentation on the documents in the document library and generate a matrix composed of the weights of the segmented words of the document library within the documents;
Topic vector generation module 603, configured to train on the documents in the document library using a topic model and decompose the matrix into the product of a first matrix composed of the topic vectors of the words and a second matrix composed of the topic vectors of the documents, where the topic vector of a word consists of the weights of the word in the topics, and the topic vector of a document consists of the weights of the topics in the document.
Expanded keyword determination module 604, specifically configured to: look up the topic vector of the candidate keyword in the first matrix, and determine from it the top M topics in which the candidate keyword has the largest distribution weight; look up the word vectors of those topics in the first matrix, and determine from them the top N words with the largest topic distribution weights in the M topics, as the expanded keywords of the corresponding topics; where M and N are natural numbers.
Retrieval module 605, specifically configured to: determine a new topic vector from the topic vectors of the candidate keywords and expanded keywords in the first matrix; and determine the target documents in the document library from the new topic vector and the topic vectors of the documents.
Preferably, retrieval module 605 further includes:
Topic vector set generation unit, configured to weight, for each of the M topics, the topic vector of the candidate keyword corresponding to the topic together with the topic vectors of the expanded keywords corresponding to that topic, to obtain a topic vector set, where the weighting factor is derived from the weight of the corresponding candidate keyword in that topic;
New topic vector generation unit, configured to add the topic vectors in the topic vector set and normalize the sum to obtain the new topic vector;
Similarity calculation unit, configured to calculate the similarity between the new topic vector and the topic vectors of the documents in the second matrix, and to determine the target documents in the document library from the similarity results.
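The three units of retrieval module 605 can be sketched as follows. This is a hedged illustration only: the text does not fix the exact weighting formula, the normalization, or the similarity measure, so the sketch assumes that each expanded keyword's vector is scaled by the candidate keyword's weight in the topic it came from, that normalization is to unit L2 length, and that similarity is cosine similarity. The function names and the toy vectors are hypothetical.

```python
import math

def query_topic_vector(cand_vec, expansions):
    """Build the new topic vector.

    cand_vec: topic vector of the candidate keyword.
    expansions: {topic index: [topic vectors of expanded keywords for that topic]}.
    """
    total = list(cand_vec)
    for topic, vectors in expansions.items():
        factor = cand_vec[topic]  # candidate keyword's weight in this topic
        for vec in vectors:
            total = [s + factor * v for s, v in zip(total, vec)]
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm else total

def cosine(a, b):
    """Cosine similarity, assumed here as the similarity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_documents(query_vec, doc_vectors):
    """Return document ids ordered from most to least similar."""
    scores = {d: cosine(query_vec, v) for d, v in doc_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)

cand_vec = [0.5, 0.1]                            # candidate keyword, two topics
expansions = {0: [[0.3, 0.0], [0.2, 0.1]]}       # expanded keywords of topic 0
doc_vectors = {"d0": [1.0, 0.0], "d1": [0.5, 0.5], "d2": [0.0, 1.0]}

query_vec = query_topic_vector(cand_vec, expansions)
ranking = rank_documents(query_vec, doc_vectors)
print(ranking)
```

In practice a cut-off on the similarity score or a fixed number of top results would decide which ranked documents count as target documents; the text leaves that choice open.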
On the basis of the above technical solutions, the device further includes a display processing module (not shown), configured to perform the following after the retrieval module determines the target documents in the document library from the new topic vector and the topic vectors of the documents:
Calculate the relevance between the sentences in the determined target documents that contain the candidate keywords and expanded keywords and the search sentence in the retrieval request;
Output and display the sentences in the corresponding documents whose relevance meets a set threshold;
When a trigger operation on a displayed sentence is received, output and display the document to which the displayed sentence belongs.
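The sentence-level filtering done by the display processing module could be sketched as below. The text does not say how relevance is computed, so this sketch substitutes Jaccard overlap of word sets as a stand-in measure; the `threshold` value, the regular-expression sentence splitter, and the English toy document are likewise assumptions made for illustration.

```python
import re

def tokens(text):
    """Lower-cased word set; a real system would use a word segmenter."""
    return set(re.findall(r"\w+", text.lower()))

def relevance(sentence, query):
    """Jaccard overlap, used only as a stand-in relevance measure."""
    a, b = tokens(sentence), tokens(query)
    return len(a & b) / len(a | b) if a | b else 0.0

def sentences_to_display(document, keywords, query, threshold=0.2):
    """Keep sentences that contain a keyword and are relevant enough to the query."""
    hits = []
    for sentence in re.split(r"(?<=[.!?])\s+", document.strip()):
        if tokens(sentence) & keywords and relevance(sentence, query) >= threshold:
            hits.append(sentence)
    return hits

doc = ("My teacher stayed late to help me. The weather was cold. "
       "I thank my teacher every day.")
shown = sentences_to_display(doc, {"teacher", "classroom"}, "thank my teacher")
print(shown)
```

Each displayed sentence would then act as the entry point that, when triggered, opens its full document.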
In the embodiments of the present invention described above, the method embodiments and the device embodiments belong to the same inventive concept. For details not described in the device embodiments, refer to Embodiments one to three, which describe the keyword-based retrieval method. The above product can execute the method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
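The matrix decomposition used by the weight matrix generation module 602 and the topic vector generation module 603 can be pictured with a small hand-written example. The sketch assumes two topics, four words and three documents; the numbers are illustrative and are not the output of a trained topic model, and the names `word_topic`, `topic_doc` and `matmul` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

words = ["teacher", "classroom", "beach", "ticket"]
doc_ids = ["d0", "d1", "d2"]

# First matrix: one row per word, one column per topic
# (weight of the word in each topic).
word_topic = [
    [0.6, 0.0],
    [0.4, 0.0],
    [0.0, 0.7],
    [0.0, 0.3],
]

# Second matrix: one row per topic, one column per document
# (weight of each topic in the document).
topic_doc = [
    [1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0],
]

# Their product approximates the word-by-document weight matrix.
word_doc = matmul(word_topic, topic_doc)

def word_topic_vector(word):
    """Topic vector of a word: its row in the first matrix."""
    return word_topic[words.index(word)]

def doc_topic_vector(doc_id):
    """Topic vector of a document: its column in the second matrix."""
    j = doc_ids.index(doc_id)
    return [row[j] for row in topic_doc]

print(word_topic_vector("teacher"), doc_topic_vector("d1"))
```

A topic model fitted to the document library would estimate both factor matrices from the observed word weights instead of having them written by hand.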
Embodiment seven
Fig. 7 is a schematic diagram of determining the candidate keywords in a retrieval request, provided by Embodiment seven of the present invention. Fig. 8 is a schematic diagram of determining expanded keywords and performing retrieval, provided by Embodiment seven of the present invention. This embodiment builds on the above embodiments and provides a preferred example.
Referring to Fig. 7, the process of determining the candidate keywords in the retrieval request includes:
(1) A sample essay resource 701 is provided; this resource is the set of all documents in the document library; (2) the graph-based unsupervised keyword extraction model 702 extracts the sample essay keywords 703 (the base keywords); (3) the topic word weight analysis module 704 obtains, for each word in the sample essay keywords 703, the information 705 on that word being a keyword, i.e. the statistical weight of each word in the sample essay keywords 703; (4) the feature extraction module and support vector machine training module 706 generate a keyword judgment model 707, which yields the prediction weight of each word in the sample essay keywords 703; (5) the candidate keywords in the retrieval request are determined from the prediction weight and the statistical weight of each word in the sample essay keywords 703.
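Step (2) names only "a graph-based unsupervised keyword extraction model" without fixing the algorithm. One common technique of that kind is a TextRank-style ranking, and the sketch below uses it purely as an assumed stand-in: words are nodes, co-occurrence within a small window creates edges, and a PageRank iteration scores the nodes. The parameters `window`, `damping`, `iterations` and `top_k` are illustrative defaults, not values from the text.

```python
from collections import defaultdict

def graph_keywords(tokens, window=2, damping=0.85, iterations=30, top_k=3):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, w in enumerate(tokens):
        # Link each word to the words that follow it inside the window.
        for u in tokens[i + 1:i + window + 1]:
            if u != w:
                neighbors[w].add(u)
                neighbors[u].add(w)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        score = {
            w: (1 - damping) + damping * sum(
                score[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    return sorted(score, key=score.get, reverse=True)[:top_k]

sample = ["teacher", "helps", "student", "teacher",
          "grades", "homework", "teacher", "student"]
top_two = graph_keywords(sample, top_k=2)
print(top_two)
```

The input is assumed to be an already segmented and filtered word sequence; for Chinese text a word segmenter and a stop-word filter would run first.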
Referring to Fig. 8, the process of determining expanded keywords and performing retrieval includes:
(1) The topic model training module 802 trains on the sample essay resource 801 to obtain the word-topic distribution matrix 803 (the first matrix described in Embodiment three) and the document-topic distribution matrix 804 (the second matrix described in Embodiment three); (2) from the topic words 805 obtained from the query (the candidate keywords extracted from the retrieval request) and the word-topic distribution matrix 803, the topic word expansion module 806 obtains expanded keywords and candidate keywords that express the user's intent comprehensively and accurately; (3) a query semantic vector 807 (the new vector described in Embodiment three) is obtained from the topic vectors of the expanded keywords and candidate keywords in the word-topic distribution matrix 803; (4) the semantic similarity calculation module 808 calculates the similarity between the query semantic vector 807 and the topic vector of each document in the document-topic distribution matrix 804, and the target sample essays in the sample essay resource 801 are determined from the similarity results.
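Step (2), the topic word expansion, might look roughly like the following sketch. It assumes the word-topic matrix is stored as one row per word and one column per topic, and it excludes the candidate keyword itself from its own expansion list, which the text does not state explicitly. The helper names and the toy matrix are hypothetical.

```python
def top_indices(values, k):
    """Indices of the k largest values, largest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

def expand_keywords(candidate, words, word_topic, m=1, n=2):
    """Return {topic index: [expanded keywords]} for the candidate's top-m topics."""
    cand_vec = word_topic[words.index(candidate)]
    expansions = {}
    for t in top_indices(cand_vec, m):
        column = [row[t] for row in word_topic]  # weight of every word in topic t
        ranked = [words[i] for i in top_indices(column, len(words))
                  if words[i] != candidate]
        expansions[t] = ranked[:n]
    return expansions

words = ["teacher", "classroom", "homework", "beach", "ticket"]
word_topic = [
    [0.5, 0.1],
    [0.3, 0.0],
    [0.2, 0.1],
    [0.0, 0.5],
    [0.0, 0.3],
]

expanded = expand_keywords("teacher", words, word_topic, m=1, n=2)
print(expanded)  # {0: ['classroom', 'homework']}
```

Here `m` and `n` play the roles of M and N: the number of topics kept per candidate keyword and the number of expanded keywords taken from each topic.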
Note that the above are only the preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; it may include further equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.

Claims (18)

1. A keyword-based retrieval method, characterized by comprising:
Determining the candidate keywords in a retrieval request according to the prediction weights of the base keywords in a document library, wherein the prediction weight of a base keyword is determined from the structural information of the base keyword within the documents of the document library;
Determining further expanded keywords according to the topics to which the candidate keywords belong in the document library;
Performing retrieval in the document library according to the candidate keywords and the expanded keywords;
Wherein determining the candidate keywords in the retrieval request according to the prediction weights of the base keywords in the document library comprises:
Searching a base keyword set for base keywords that match the segmented words of the retrieval request, and obtaining the prediction weight and the statistical weight of each matched base keyword;
Weighting the statistical weight and the prediction weight of each matched base keyword to generate a new weight for the matched base keyword;
Taking the base keywords whose new weight satisfies a set condition as the candidate keywords.
2. The keyword-based retrieval method according to claim 1, characterized by further comprising:
Performing keyword extraction on the documents in the document library using a graph-based unsupervised keyword extraction method, to obtain the base keyword set of the document library, and generating the statistical weight of each base keyword in the base keyword set and its structural information within the documents;
Using a set binary classification algorithm to obtain the prediction weight of each base keyword from the obtained base keywords and their structural information within the documents;
Wherein the statistical weight is the ratio of the number of documents in the document library that contain the base keyword to the total number of documents.
3. The keyword-based retrieval method according to claim 2, characterized in that the set binary classification method is a binary classification method based on a support vector machine model, a binary classification method based on maximum entropy, or a binary classification method based on a logistic regression model.
4. The keyword-based retrieval method according to claim 1, characterized in that the structural information of a base keyword within a document includes the position of the base keyword in the document, the part of speech of the base keyword, the part of speech of the preceding word, and/or the part of speech of the following word.
5. The keyword-based retrieval method according to claim 1, characterized by further comprising:
Performing word segmentation on the documents in the document library, and generating a matrix composed of the weights of the segmented words of the document library within the documents;
Training on the documents in the document library using a topic model, and decomposing the matrix into the product of a first matrix composed of the topic vectors of the words and a second matrix composed of the topic vectors of the documents, wherein the topic vector of a word consists of the weights of the word in the topics, and the topic vector of a document consists of the weights of the topics in the document.
6. The keyword-based retrieval method according to claim 5, characterized in that determining further expanded keywords according to the topics to which the candidate keywords belong in the document library comprises:
Looking up the topic vector of the candidate keyword in the first matrix, and determining from the topic vector obtained the top M topics in which the candidate keyword has the largest distribution weight;
Looking up the word vectors of the M topics in the first matrix, and determining from the word vectors obtained the top N words with the largest topic distribution weights in the M topics, as the expanded keywords of the corresponding topics;
Wherein M and N are natural numbers.
7. The keyword-based retrieval method according to claim 6, characterized in that performing retrieval in the document library according to the candidate keywords and the expanded keywords comprises:
Determining a new topic vector from the topic vectors of the candidate keywords and the expanded keywords in the first matrix;
Determining the target documents in the document library from the new topic vector and the topic vectors of the documents.
8. The keyword-based retrieval method according to claim 7, characterized in that determining a new topic vector from the topic vectors of the candidate keywords and the expanded keywords in the first matrix, and determining the target documents in the document library from the new topic vector and the topic vectors of the documents, comprises:
Weighting, for each of the M topics, the topic vector of the candidate keyword corresponding to the topic together with the topic vectors of the expanded keywords corresponding to that topic, to obtain a topic vector set, wherein the weighting factor is derived from the weight of the corresponding candidate keyword in that topic;
Adding the topic vectors in the topic vector set and normalizing the sum to obtain the new topic vector;
Calculating the similarity between the new topic vector and the topic vectors of the documents in the second matrix, and determining the target documents in the document library from the similarity results.
9. The keyword-based retrieval method according to claim 7, characterized in that, after the target documents in the document library are determined from the new topic vector and the topic vectors of the documents, the method further comprises:
Calculating the relevance between the sentences in the determined target documents that contain the candidate keywords and the expanded keywords and the search sentence in the retrieval request;
Outputting and displaying the sentences in the corresponding documents whose relevance meets a set threshold;
When a trigger operation on a displayed sentence is received, outputting and displaying the document to which the displayed sentence belongs.
10. A keyword-based retrieval device, characterized by comprising:
A candidate keyword determination module, configured to determine the candidate keywords in a retrieval request according to the prediction weights of the base keywords in a document library, wherein the prediction weight of a base keyword is determined from the structural information of the base keyword within the documents of the document library;
An expanded keyword determination module, configured to determine further expanded keywords according to the topics to which the candidate keywords belong in the document library;
A retrieval module, configured to perform retrieval in the document library according to the candidate keywords and the expanded keywords;
Wherein the candidate keyword determination module is specifically configured to:
Search the base keyword set for base keywords that match the segmented words of the retrieval request, and obtain the prediction weight and the statistical weight of each matched base keyword;
Weight the statistical weight and the prediction weight of each matched base keyword to generate a new weight for the matched base keyword;
Take the matched base keywords whose new weight satisfies a set condition as the candidate keywords.
11. The keyword-based retrieval device according to claim 10, characterized by further comprising:
A keyword extraction module, configured to perform keyword extraction on the documents in the document library using a graph-based unsupervised keyword extraction method, to obtain the base keyword set of the document library, and to generate the statistical weight of each base keyword in the base keyword set and its structural information within the documents;
A prediction weight determination module, configured to use a set binary classification algorithm to obtain the prediction weight of each base keyword from the obtained base keywords and their structural information within the documents;
Wherein the statistical weight is the ratio of the number of documents in the document library that contain the base keyword to the total number of documents.
12. The keyword-based retrieval device according to claim 11, characterized in that the set binary classification method is a binary classification method based on a support vector machine model, a binary classification method based on maximum entropy, or a binary classification method based on a logistic regression model.
13. The keyword-based retrieval device according to claim 10, characterized in that the structural information of a base keyword within a document includes the position of the base keyword in the document, the part of speech of the base keyword, the part of speech of the preceding word, and/or the part of speech of the following word.
14. The keyword-based retrieval device according to claim 10, characterized by further comprising:
A weight matrix generation module, configured to perform word segmentation on the documents in the document library and generate a matrix composed of the weights of the segmented words of the document library within the documents;
A topic vector generation module, configured to train on the documents in the document library using a topic model and decompose the matrix into the product of a first matrix composed of the topic vectors of the words and a second matrix composed of the topic vectors of the documents, wherein the topic vector of a word consists of the weights of the word in the topics, and the topic vector of a document consists of the weights of the topics in the document.
15. The keyword-based retrieval device according to claim 14, characterized in that the expanded keyword determination module is specifically configured to:
Look up the topic vector of the candidate keyword in the first matrix, and determine from the topic vector obtained the top M topics in which the candidate keyword has the largest distribution weight;
Look up the word vectors of the M topics in the first matrix, and determine from the word vectors obtained the top N words with the largest topic distribution weights in the M topics, as the expanded keywords of the corresponding topics;
Wherein M and N are natural numbers.
16. The keyword-based retrieval device according to claim 14, characterized in that the retrieval module is specifically configured to determine a new topic vector from the topic vectors of the candidate keywords and the expanded keywords in the first matrix, and to determine the target documents in the document library from the new topic vector and the topic vectors of the documents.
17. The keyword-based retrieval device according to claim 16, characterized in that the retrieval module comprises:
A topic vector set generation unit, configured to weight, for each of the M topics, the topic vector of the candidate keyword corresponding to the topic together with the topic vectors of the expanded keywords corresponding to that topic, to obtain a topic vector set, wherein the weighting factor is derived from the weight of the corresponding candidate keyword in that topic;
A new topic vector generation unit, configured to add the topic vectors in the topic vector set and normalize the sum to obtain the new topic vector;
A similarity calculation unit, configured to calculate the similarity between the new topic vector and the topic vectors of the documents in the second matrix, and to determine the target documents in the document library from the similarity results.
18. The keyword-based retrieval device according to claim 16, characterized by further comprising a display processing module, configured to perform the following after the retrieval module determines the target documents in the document library from the new topic vector and the topic vectors of the documents:
Calculate the relevance between the sentences in the determined target documents that contain the candidate keywords and the expanded keywords and the search sentence in the retrieval request;
Output and display the sentences in the corresponding documents whose relevance meets a set threshold;
When a trigger operation on a displayed sentence is received, output and display the document to which the displayed sentence belongs.
CN201310710834.7A 2013-12-20 2013-12-20 Method and device for retrieving based on keyword Active CN103699625B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310710834.7A CN103699625B (en) 2013-12-20 2013-12-20 Method and device for retrieving based on keyword

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310710834.7A CN103699625B (en) 2013-12-20 2013-12-20 Method and device for retrieving based on keyword

Publications (2)

Publication Number Publication Date
CN103699625A CN103699625A (en) 2014-04-02
CN103699625B true CN103699625B (en) 2017-05-10

Family

ID=50361153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310710834.7A Active CN103699625B (en) 2013-12-20 2013-12-20 Method and device for retrieving based on keyword

Country Status (1)

Country Link
CN (1) CN103699625B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927358B (en) * 2014-04-15 2017-02-15 清华大学 text search method and system
CN104376065B (en) * 2014-11-05 2018-09-18 百度在线网络技术(北京)有限公司 The determination method and apparatus of term importance
CN104505090B (en) * 2014-12-15 2017-11-14 北京国双科技有限公司 The audio recognition method and device of sensitive word
CN104809154B (en) * 2015-03-19 2019-03-08 百度在线网络技术(北京)有限公司 The method and device recommended for information
CN106326300A (en) * 2015-07-02 2017-01-11 富士通株式会社 Information processing method and information processing device
CN105512101B (en) * 2015-11-30 2018-06-26 北大方正集团有限公司 A kind of method and device of automatic structure descriptor
CN105912563B (en) * 2016-03-23 2019-04-02 北京数字跃动科技有限公司 A method of the artificial intelligence learning of machine is assigned based on psychological knowledge
CN105930358B (en) * 2016-04-08 2019-06-04 南方电网科学研究院有限责任公司 Case retrieving method and its system based on the degree of association
CN105930527B (en) * 2016-06-01 2019-09-20 北京百度网讯科技有限公司 Searching method and device
CN107665222B (en) * 2016-07-29 2020-11-06 北京国双科技有限公司 Keyword expansion method and device
CN106355429A (en) * 2016-08-16 2017-01-25 北京小米移动软件有限公司 Image material recommendation method and device
CN107943781B (en) * 2016-10-13 2021-08-13 北京国双科技有限公司 Keyword recognition method and device
CN108427686A (en) * 2017-02-15 2018-08-21 北京国双科技有限公司 Text data querying method and device
US10747825B2 (en) * 2017-02-27 2020-08-18 Google Llc Content search engine
WO2018174397A1 (en) 2017-03-20 2018-09-27 삼성전자 주식회사 Electronic device and control method
KR102529262B1 (en) * 2017-03-20 2023-05-08 삼성전자주식회사 Electronic device and controlling method thereof
CN107330752B (en) * 2017-05-31 2020-09-29 北京京东尚科信息技术有限公司 Method and device for identifying brand words
US10824657B2 (en) * 2017-06-01 2020-11-03 Interactive Solutions Inc. Search document information storage device
CN107480879A (en) * 2017-08-09 2017-12-15 郑州星睿水利科技有限公司 Hydrology worker's professional knowledge examining method and system
CN108052520A (en) * 2017-11-01 2018-05-18 平安科技(深圳)有限公司 Conjunctive word analysis method, electronic device and storage medium based on topic model
CN109241525B (en) * 2018-08-20 2022-05-06 深圳追一科技有限公司 Keyword extraction method, device and system
CN110969018A (en) * 2018-09-30 2020-04-07 北京国双科技有限公司 Case description element extraction method, machine learning model acquisition method and device
MY195969A (en) * 2018-10-24 2023-02-27 Advanced New Technologies Co Ltd Intelligent Customer Services Based on a Vector Propagation on a Click Graph Model
JP6651189B1 (en) * 2019-03-29 2020-02-19 株式会社 情報システムエンジニアリング Data structure, learning method and information providing system for machine learning
CN110866102A (en) * 2019-11-07 2020-03-06 浪潮软件股份有限公司 Search processing method
CN111831884B (en) * 2020-07-14 2021-02-05 深圳市众创达企业咨询策划有限公司 Matching system and method based on information search
CN112507068B (en) * 2020-11-30 2023-11-14 北京百度网讯科技有限公司 Document query method, device, electronic equipment and storage medium
CN112650914A (en) * 2020-12-30 2021-04-13 深圳市世强元件网络有限公司 Long-tail keyword identification method, keyword search method and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101685455A (en) * 2008-09-28 2010-03-31 华为技术有限公司 Method and system of data retrieval
CN103164521A (en) * 2013-03-11 2013-06-19 亿赞普(北京)科技有限公司 Keyword calculation method and device based on user browse and search actions
CN103425799A (en) * 2013-09-04 2013-12-04 北京邮电大学 Personalized research direction recommending system and method based on themes

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7831595B2 (en) * 2007-12-31 2010-11-09 Yahoo! Inc. Predicting and ranking search query results

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717407A (en) * 2018-05-11 2018-10-30 北京三快在线科技有限公司 Entity vector determines method and device, information retrieval method and device
CN108717407B (en) * 2018-05-11 2022-08-09 北京三快在线科技有限公司 Entity vector determination method and device, and information retrieval method and device

Also Published As

Publication number Publication date
CN103699625A (en) 2014-04-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant