WO2018176913A1 - Search method, apparatus, and non-transitory computer readable storage medium - Google Patents

Search method, apparatus, and non-transitory computer readable storage medium

Info

Publication number
WO2018176913A1
Authority
WO
WIPO (PCT)
Prior art keywords
search
text
text index
policy
weight
Prior art date
Application number
PCT/CN2017/115680
Other languages
English (en)
French (fr)
Inventor
刘铭
陈达遥
庞盟盟
冯涛
曾之肇
魏永超
潘文彬
Original Assignee
北京三快在线科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京三快在线科技有限公司
Priority to EP17903012.7A priority Critical patent/EP3608799A4/en
Priority to JP2020502745A priority patent/JP2020512651A/ja
Priority to US16/499,858 priority patent/US11144594B2/en
Priority to CA3059929A priority patent/CA3059929C/en
Priority to KR1020197032313A priority patent/KR20190128246A/ko
Priority to SG11201909119Y priority patent/SG11201909119YA/en
Publication of WO2018176913A1 publication Critical patent/WO2018176913A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3338Query expansion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24532Query optimisation of parallel queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9032Query formulation
    • G06F16/90324Query formulation using system suggestions
    • G06F16/90328Query formulation using system suggestions using search space presentation or visualization, e.g. category or range presentation and selection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • G06F16/24578Query processing with adaptation to user needs using ranking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90348Query processing by searching ordered data, e.g. alpha-numerically ordered data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines

Definitions

  • the present application relates to computer technology, and in particular, to a search method, apparatus, and non-transitory computer readable storage medium.
  • a search engine may perform an information search based on text input by a user and perform a search service based on text relevance.
  • web pages are also the main information carrier of the Internet. Therefore, searching for web pages can basically obtain the content that users pay attention to.
  • the local living service provided by the O2O (Online-to-Offline) platform is convenient for people's lives, and the search demand on the O2O platform is gradually increasing.
  • the information description carrier of the O2O platform may have multiple text index fields for describing platform services from different perspectives.
  • the descriptive text index field on the O2O platform can sometimes be as many as fifty or more.
  • the information described by these text index fields may be unrelated to one another. Applying the web page search method to all text index fields may therefore make it difficult to obtain, comprehensively and accurately, the content that the user is interested in.
  • the present application provides a search method by which relatively accurate search results can be obtained for information having multiple text index fields.
  • an embodiment of the present application provides a search method, including:
  • each of the first search strategies corresponds to at least one first text index field and a search weight matched by the first text index field;
  • an embodiment of the present application provides a search apparatus, including: a processor and a non-transitory computer readable storage medium.
  • the non-transitory computer readable storage medium stores machine executable instructions executable by the processor, the processor being caused by the machine executable instructions to perform a search method as disclosed in the first aspect of the present application.
  • embodiments of the present application provide a non-transitory computer readable storage medium storing machine executable instructions that, when invoked and executed by a processor, cause the processor to execute the search method disclosed in the first aspect of the present application.
  • the search method disclosed in the embodiment of the present application first determines at least one first search strategy that matches the query text, wherein each of the first search strategies corresponds to at least one text index field and a search weight matched by the text index field; then performs a search operation of the query text on each of the text index fields corresponding to each of the first search strategies; and finally merges and outputs the search results of all the search operations.
  • relatively accurate search results are available.
  • FIG. 1 is a flow chart of a search method according to an embodiment of the present application.
  • FIG. 2 is a flow chart of a search method of another embodiment of the present application.
  • FIG. 3 is a flow chart of a search method according to still another embodiment of the present application.
  • FIG. 4 is a schematic diagram showing the hardware structure of a search apparatus according to an embodiment of the present application.
  • FIG. 5 is a functional block diagram of search logic provided by an embodiment of the present application.
  • FIG. 6 is a functional block diagram of search logic provided by another embodiment of the present application.
  • FIG. 7 is a functional block diagram of search logic provided by another embodiment of the present application.
  • FIG. 8 is a functional block diagram of search logic provided by still another embodiment of the present application.
  • the present application discloses a search method, as shown in FIG. 1, the method includes steps 100 to 120.
  • the search method of the present application may include two types of search strategies, namely a first search strategy and a second search strategy.
  • the first search strategy may perform a search operation on only some of the text index fields of the search item.
  • the second search strategy may perform a search operation for all text index fields of the search item.
  • Step 100 Determine at least one first search strategy that matches the query text.
  • Each of the first search strategies may correspond to at least one text index field and a search weight matched by the text index field.
  • the first search strategy can be used to define a text index field of the search item to be queried and a search weight that matches the text index field.
  • Each of the first search strategies may correspond to at least one text index field, and each of the text index fields may have the same or different search weights.
  • Each of the text index fields corresponding to the first search strategy may correspond to the same or different query texts.
  • the text index field can be used to build an index, such as an inverted index.
  • the content of a text index field is usually a meaningful text that can be used to describe an aspect of a searched item.
  • for example, a POI (Point of Interest) search material may include text index fields such as a business name, a registered company name, a brand name, a business district, an address, a main menu, and business hours.
  • the poi_name of the search material "Golden Million's branch in Wangjing Garden" can be "Golden Million Roast Duck Restaurant (Wangjing Garden Store)".
  • poi_name is the name under which the text index field is recorded in the system, here the merchant name, as in "Golden Million Roast Duck Restaurant".
  • the text that follows poi_name is the specific content of the text index field, which can be used to establish an inverted index.
  • the text index field can be used to represent the field in which the item is searched. In this way, after obtaining the query text to be searched, the first search strategy that the query text matches may be first determined.
  • a text index field of a plurality of first search strategies may be set in advance, and query text corresponding to each first search policy may be set.
  • the first search strategy may include: a merchant policy, a landmark policy, a dish name policy, and the like. The query text corresponding to each first search strategy can then be set separately.
  • the query text corresponding to the merchant policy may include: Golden Million, KFC, Quanjude, and the like.
  • the query text to be searched may be input by the user in the search bar of the client, or may be automatically generated by the client according to the historical behavior log of the user. For example, when the client detects that a female user enters the cosmetics sales page, the client may push relevant search results to the user according to the user's age information. At this point, the client may first generate a query text according to the user's information (e.g., middle-aged female), and then invoke the search engine to perform a search operation on the automatically generated query text.
  • the correspondence between the query text and the first search policy may first be established manually in advance.
  • the search strategy corresponding to the query text "KFC" and "Golden Million” can be set as the merchant strategy.
  • the text index field included in each first search policy and the search weight of each text index field may be set at the same time.
  • the text index fields included in the merchant policy can be set: business name, brand name, registered company name, and the like.
  • the search weight of each text index field corresponding to the merchant policy may be set as follows: the search weight of the business name is 50%; the search weight of the brand name is 30%; and the search weight of the registered company name is 20%.
  • the text index field corresponding to the first search strategy and the search weight of each corresponding text index field may be set according to prior knowledge.
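  • as a minimal sketch of such a manually preset configuration, the merchant policy and its query text correspondence could be held in a small lookup structure. The names `SearchStrategy`, `MERCHANT`, `QUERY_TO_STRATEGIES`, and `match_strategies`, as well as the field identifiers, are hypothetical and chosen only for illustration; the weights are the 50% / 30% / 20% example given above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SearchStrategy:
    """A first search strategy: a name plus a search weight per text index field."""
    name: str
    field_weights: Dict[str, float]


# Weights taken from the merchant policy example; field identifiers are assumed.
MERCHANT = SearchStrategy(
    name="merchant",
    field_weights={
        "business_name": 0.5,
        "brand_name": 0.3,
        "registered_company_name": 0.2,
    },
)

# Hypothetical manually established correspondence: query text -> strategies.
QUERY_TO_STRATEGIES: Dict[str, List[SearchStrategy]] = {
    "KFC": [MERCHANT],
    "Golden Million": [MERCHANT],
}


def match_strategies(query: str) -> List[SearchStrategy]:
    """Return the preset first search strategies for a query text, if any."""
    return QUERY_TO_STRATEGIES.get(query, [])
```

  • a query text with no preset correspondence returns an empty list here; in the method described, such a query could still be handled by the second search strategy over all text index fields.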
  • Determining the at least one first search strategy that matches the query text to be searched may include: determining, according to a preset correspondence between the first search policy and the query text, at least one first search strategy that matches the query text; or identifying the query text with a pre-trained classifier to determine at least one first search strategy that matches the query text.
  • the first search strategy may be manually pre-established, or may be determined by the recognition model obtained by training according to the historical behavior of the user.
  • the classifier may first be trained based on the search log. For example, after obtaining the search log for a period of time, the acquired search logs may be clustered according to the query text, the text index field, the matching text, and the like in the search log, in order to train the classifier for identifying the first search strategy.
  • a classifier based on the search log training can be used to determine at least one first search strategy that matches the query text.
  • Step 110 Perform a search operation of the query text according to a text index field corresponding to each of the first search policies.
  • a query text may correspond to a plurality of first search strategies, and each of the first search strategies may include multiple text index fields.
  • a search operation may be performed on the query text based on a text index field in each of the first search policies, respectively.
  • for example, the first search strategies determined based on the query text "Golden Million" may include a merchant strategy and a landmark strategy.
  • in the merchant strategy, the text index fields that match the query text "Golden Million" include: business name, brand name.
  • in the landmark strategy, the text index fields that match the query text "Golden Million" include: buildings.
  • the search operation for "Golden Million" can then be performed in the search material based on the three text index fields of the business name, the brand name, and the building, respectively, yielding three search result lists. When performing search operations on the query text in the search material based on different text index fields, the search weight of each text index field can be taken into account to calculate the correlation between the query text and the search material.
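  • the per-field search described above can be sketched as follows. This is an illustration under stated assumptions, not the relevance model of the application: the relevance function (share of the field text covered by the query), the sample items, the field identifiers, and the weights are all hypothetical.

```python
def field_relevance(query: str, text: str) -> float:
    """Toy relevance: share of the field text covered by the query (0 if absent)."""
    if not query or query not in text:
        return 0.0
    return len(query) / len(text)


def search_field(query, items, field, weight):
    """Search one text index field; scale each hit by that field's search weight."""
    hits = []
    for item in items:
        score = weight * field_relevance(query, item.get(field, ""))
        if score > 0:
            hits.append((item["id"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical search material and search weights, for illustration only.
ITEMS = [
    {"id": 1, "business_name": "Golden Million Roast Duck Restaurant",
     "brand_name": "Golden Million", "building": ""},
    {"id": 2, "business_name": "KFC", "brand_name": "KFC",
     "building": "Golden Million Tower"},
]
FIELD_WEIGHTS = {"business_name": 0.5, "brand_name": 0.3, "building": 1.0}

# One result list per text index field, as in the three-list example above.
result_lists = {
    field: search_field("Golden Million", ITEMS, field, weight)
    for field, weight in FIELD_WEIGHTS.items()
}
```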
  • a search operation may also be performed based on the second search strategy.
  • the second search strategy corresponds to all text index fields.
  • the second search result thus obtained may serve as a supplement to the first search result obtained by performing the search operation of the query text on the corresponding text index fields based on the first search strategy.
  • Step 120 Merge and output the search results of all the above search operations.
  • the search results can be sorted first, then the repeated search results are filtered out, and the remaining search results are output.
  • the search results may be ranked according to the priority of the search strategy; or the search results may be ranked according to the discriminant score of each search strategy; or all search results may be mixed and sorted according to the evaluation score of each search result. If the search operations performed include a search operation of the query text based on the second search policy, the second search result obtained from that search operation may be ranked last.
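  • one of the merging options above (ranking by strategy priority, filtering out repeated results, and placing second-strategy results last) can be sketched as below. The function name `merge_results` and the representation of results as ranked lists of item identifiers are assumptions made for this example.

```python
def merge_results(first_results, second_results=()):
    """Merge ranked result lists and drop repeats.

    first_results: result lists of the first search strategies, ordered by
        strategy priority; each list holds item ids that are already ranked.
    second_results: result list of the second search strategy, ranked last.
    """
    merged, seen = [], set()
    for result_list in list(first_results) + [list(second_results)]:
        for item_id in result_list:
            if item_id not in seen:  # keep only the first, highest-ranked copy
                seen.add(item_id)
                merged.append(item_id)
    return merged
```

  • for instance, `merge_results([[1, 2], [2, 3]], [3, 4, 1])` keeps item 2 at the position given by the higher-priority strategy and appends only item 4 from the second-strategy results.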
  • At least one first search policy that matches the query text to be searched may be determined first.
  • Each of the first search policies corresponds to at least one text index field, and each of the text index fields has a preset search weight.
  • then, based on the text index fields corresponding to each first search policy, the search operation of the query text is performed separately.
  • finally, the search results of all the above search operations are merged and output.
  • in this way, even when the search material has information in multiple text index fields, relatively accurate search results can be obtained.
  • a search method disclosed in this embodiment is shown in FIG. 2, and the method includes steps 200 to 250.
  • Step 200 Train a classifier for identifying the first search policy based on the search log.
  • the classifier may first be trained based on the search log.
  • Training a classifier for identifying the first search strategy based on the search log may include: clustering the search logs to generate a search strategy space definition, which may be used to represent a mapping relationship between each first search strategy and the query text in the search logs, and to determine the search logs corresponding to each first search strategy; and training, based on the search logs corresponding to each first search strategy, a classifier for identifying that first search strategy.
  • the clustering of the search logs to generate the search strategy space definition may include: using, as features, the hit scores of the query text in each text index field extracted from each search log, and clustering the search logs to obtain query text categories.
  • Each query text category can correspond to one or more search strategies.
  • a search log for performing a search operation based on the second search policy may be first acquired.
  • the search log of the order behavior can be selected for classifier training.
  • the search logs recorded by the search server will vary slightly from system to system.
  • the search log may include search time, query text, matching text, text index field, presentation result list, behavior identification such as click or order, and the like. If the proportion of search logs with ordering behavior among all search logs is too low, the click logs and the order logs can be selected to jointly train the classifier. When the click logs and the order logs are combined to train the classifier, the behavior type weight of the click log may be less than the behavior type weight of the order log.
  • the hit score for each text index field can be calculated separately based on the acquired search log.
  • the hit score score_i of each text index field in the search log can be calculated using Equation 1 below:
  • $$\mathrm{score}_i = \frac{\mathrm{len}(\mathrm{match}_i)}{\min(\mathrm{len}(\mathrm{field}_i),\ N)}$$ (Equation 1)
  • match_i represents the text matched by the query text in the i-th text index field when the search operation is performed on the query text
  • len(match_i) represents the length of the text matched by the query text in the i-th text index field.
  • field_i represents the content of the i-th text index field
  • len(field_i) represents the length of the text of the i-th text index field.
  • len(match_i) ≤ len(field_i).
  • N is a smoothing factor
  • the denominator in Equation 1 is the smaller of the text length of the text index field and the upper limit N on that length.
  • using the upper limit N to cap the denominator keeps the score from becoming too small for long text index fields.
  • N can be set to a natural number, such as 30, according to the function of the search service.
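  • a small sketch of the hit score follows. It reads the score as the matched length divided by the smaller of the field length and N, which is an interpretation based on the description of the denominator above; the function names `hit_score` and `hit_vector` are hypothetical.

```python
def hit_score(match: str, field: str, n: int = 30) -> float:
    """Hit score of one text index field: len(match) / min(len(field), n).

    n is the smoothing factor (upper limit on the denominator); 30 is the
    example value mentioned in the text.
    """
    if not field:
        return 0.0  # nothing to match against in an empty field
    return len(match) / min(len(field), n)


def hit_vector(matches, fields, n: int = 30):
    """One hit score per text index field, i.e. an M-dimensional vector."""
    return [hit_score(m, f, n) for m, f in zip(matches, fields)]
```

  • with the cap in place, a six-character match in a sixty-character field scores 6/30 rather than 6/60, so long fields are not penalized as heavily.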
  • the text index field vector can be an M-dimensional vector.
  • the hit score score i of the text index field in each search log can be calculated by Equation 1, respectively.
  • an M-dimensional vector can be obtained for each search log.
  • a plurality of M-dimensional vectors similar to [0, 0, 1.0, 0.8, 0...0], [0, 0, 0.9, 0.9, 0...0] and the like can be obtained.
  • M is the number of text index fields in the search log
  • the i-th dimension of each M-dimensional vector corresponds to the hit score of the i-th text index field in each search log.
  • the obtained M-dimensional vector may be clustered by using a multi-dimensional spatial clustering method, such as a Dbscan clustering algorithm and a k-means clustering algorithm.
  • the clustering algorithm used in the present application is not limited.
  • the center point of the cluster can be considered as the spatial definition of the first search strategy.
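  • to make the clustering step concrete, the following is a minimal pure-Python k-means sketch over M-dimensional hit score vectors. It is only one of the algorithms the text allows, and the naive initialization (first k vectors), the fixed iteration count, and the function name `kmeans` are simplifying assumptions for illustration.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Cluster hit score vectors; return (cluster centers, assignment per vector)."""
    centers = [list(v) for v in vectors[:k]]  # naive init: first k vectors
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest center.
        for idx, vec in enumerate(vectors):
            assignment[idx] = min(
                range(k), key=lambda c: math.dist(vec, centers[c])
            )
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


# Hypothetical 4-dimensional hit score vectors from four search logs.
VECTORS = [
    [0.0, 0.0, 1.0, 0.8],
    [1.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.9],
    [0.8, 1.0, 0.0, 0.0],
]
centers, assignment = kmeans(VECTORS, k=2)
```

  • each resulting center is a vector of typical hit scores per text index field, which is how a cluster center can stand in for the spatial definition of a first search strategy.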
  • the spatial definition of the first search policy may be used to represent a mapping relationship between the first search policy and the query text in the search log, such that the query text of a certain category may correspond to a specific first search policy. For example, when a user enters a query text such as "Golden Million", "Haidian Fishing", or "Nine-headed Eagle Restaurant", the user usually wants to search for a corresponding merchant. According to the aforementioned clustering method, the query texts "Golden Million", "Haidian Fishing", and "Nine-headed Eagle Restaurant" will be grouped together.
  • the process of clustering according to the search log in effect learns, from seemingly messy search results, that searching on some text index fields is more effective for a certain type of query text than searching on all text index fields. Usually the clustering should not be too fine; it is preferable to keep the number of categories within one hundred.
  • the method of automatic clustering does not need to pay attention to the specific meaning that the first search strategy wants to express, and does not need to define the first search strategy in advance, and can determine the first search strategy corresponding to the query text, and further determine the first search strategy. Corresponding text index field. This method can effectively reduce the possibility of manual development of strategy errors, and can identify potential, difficult to find data rules.
  • the classifier for identifying the first search strategy can then be trained based on the query text for each category, respectively.
  • the query text of each category may be used as positive samples, and a certain number of negative samples collected; the positive and negative samples are then used as training sample data for supervised learning to train the classifier for identifying the first search strategy.
  • each query text category may correspond to a first search strategy.
  • classification over multiple categories can be implemented in two ways: with a single multi-class classifier, or by fitting a plurality of binary classifiers. For example, a plurality of binary classifiers can be used in this embodiment.
  • the classification model can have multiple choices.
  • for example, an SVM (Support Vector Machine) classification model may be used.
  • the extracted sample features may include at least: a text feature of the query text, such as a query text, and a word segment combination obtained after segmentation of the query text.
  • the sample features extracted from the training sample data may also include: query length, prefix, suffix, POS+bigram, POS+unigram, POS, and other combined features.
  • query length is the length of the query text
  • prefix and suffix are the prefix and suffix of the query text respectively
  • unigram and bigram are the one-gram and two-gram text features of the query text, respectively
  • POS+unigram is the position of a unigram text feature in the query text.
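  • the sample features listed above can be illustrated with a short extraction sketch. It assumes the query text has already been segmented into tokens, treats POS as the token position as described above, and uses a hypothetical affix length of two characters; the function name `extract_features` and the dictionary layout are likewise assumptions.

```python
def extract_features(query: str, tokens, affix_len: int = 2):
    """Build a feature dictionary for one query text.

    query: the raw query text.
    tokens: the word segments of the query (segmentation is assumed done).
    affix_len: assumed number of characters used for prefix and suffix.
    """
    tokens = list(tokens)
    return {
        "query_length": len(query),
        "prefix": query[:affix_len],
        "suffix": query[-affix_len:],
        "unigrams": tokens,
        "bigrams": [" ".join(pair) for pair in zip(tokens, tokens[1:])],
        # POS+unigram: each unigram paired with its position in the query.
        "pos_unigrams": list(enumerate(tokens)),
    }
```

  • a dictionary of this kind would still need to be vectorized before being passed to a classifier such as an SVM.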
  • the extracted sample features can be trained using an SVM classifier to obtain a classifier for identifying the first search strategy.
  • the classifier can be trained based on sample characteristics using any technique well known to those skilled in the art and will not be described again herein.
  • a corresponding classifier for identifying the first search strategy may be obtained for subsequent recognition of the obtained query text.
  • Step 210 Determine a text index field corresponding to each of the first search policies, and a search weight matched by each text index field.
  • there are two ways to determine the text index field corresponding to each of the first search strategies and the search weight matched by each text index field. First, if the first search strategy is manually preset, the correspondence between the text index field and the query text in the first search strategy is also manually preset, and the search weight matched by the text index field corresponding to each first search policy can also be manually preset.
  • the text index field corresponding to each first search strategy and the search weight matched by each text index field may be manually set in the program code according to experience, or may be set by the user as required, and will not be described here.
  • second, the text index fields of each first search policy, and the search weight matched by each text index field, are set according to the search log. For example, for each first search policy, all search logs corresponding to the first search policy may be acquired; then, according to the hit text in the search logs corresponding to the first search policy, a hit score in each text index field is obtained, and the average weight of each text index field corresponding to the first search strategy is calculated iteratively; finally, according to the average weight of each text index field, the text index fields corresponding to the first search strategy and the search weight matched by each text index field are determined.
  • the search log may be a search log obtained when a search operation is performed on all text index fields by using the second search policy.
  • the search log corresponding to each of the first search policies may be determined by indexing a search log used when obtaining a spatial definition of the first search strategy by clustering.
  • the search log may also be a search log obtained when a search operation is performed on all text index fields by using an initial search weight of a text index field according to each first search policy.
  • the hypothesized first search strategy is then run, a search operation is performed on the query text in accordance with the assumed first search policy, and a search log of the search operation over a period of time is obtained.
  • a search log corresponding to each first search strategy can be obtained, including obtaining query text, hit text, text index field, and behavior type of each search log.
  • the hit text is the matching text of the query text on the text index field.
  • determining the text index fields corresponding to each first search strategy and the search weight of each text index field can include the following four steps.
  • the first step is to get a single log weight for each text index field in each search log. For example, if the search material includes M text index fields, each search log matches at least one text index field. Before calculating the hit score, the search weights of the M text index fields can each be initialized to 1/M. Then, the single log weight of each text index field in each search log can be calculated by Equation 2 as follows:
  • Field i represents the content of the i-th text index field, and len(field i ) represents the length of the content of the i-th text index field.
  • Match i indicates the matching content of the query text of the jth search log in the i-th text index field, which can be obtained during the search process.
  • Other formulas can also be used to calculate a single log weight for each text index field in each search log. In this example, the ratio of the index is used to control the upper limit of a single log weight in order to obtain a smooth upper limit.
  • a single log weight of all text index fields in each search log can be obtained. For example, suppose there are a total of Y ordering logs, each with M text index fields; after the single log weights of the M text index fields in each of the Y logs are obtained by Equation 2, each text index field will correspond to Y single log weights.
  • each text index field may correspond to multiple first search strategies.
  • the merchant policy can correspond to the three text index fields of the merchant name, the address, and the merchant brand; and the landmark policy can also correspond to the two text index domains of the merchant name and the address.
  • the average weight of each text index field corresponding to each first search strategy is calculated separately. For example, for each text index field, the single log weights in all search logs corresponding to a first search strategy may be averaged to obtain the average weight of that text index field under the first search strategy. Equation 3 is as follows:
  • $$\mathrm{weight\_avg}_i = \frac{\sum \mathrm{weight}_i}{\mathrm{count}_i}$$ (Equation 3)
  • weight_i is the single log weight of the i-th text index field in a certain search log corresponding to a first search strategy, and the sum runs over all search logs corresponding to that first search strategy
  • count_i is the number of non-zero single log weights of the i-th text index field in all search logs corresponding to the first search strategy, and weight_avg_i represents the average weight of the i-th text index field corresponding to the first search strategy.
  • for example, the first search strategy G1 corresponds to three text index fields, denoted T1, T2, and T3.
  • the average weight weight_avg 1 of the first search strategy G1 corresponding to the text index field T1 the average weight weight_avg 2 of the first search strategy G1 corresponding to the text index field T2, and the average weight weight_avg 3 of the first search strategy G1 corresponding to the text index field T3 are calculated.
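  • the averaging step can be sketched as below, reading the average weight as the sum of a field's single log weights divided by the count of its non-zero single log weights, as the definitions of weight_i and count_i suggest. The function name `average_weights` and the row-per-log input layout are assumptions.

```python
def average_weights(log_weights):
    """Average the single log weights per text index field.

    log_weights: one row per search log of a first search strategy; each row
        holds the single log weights of the M text index fields.
    Returns M averages, each computed over the non-zero weights only.
    """
    if not log_weights:
        return []
    averages = []
    for column in zip(*log_weights):  # one column per text index field
        count = sum(1 for w in column if w != 0)
        averages.append(sum(column) / count if count else 0.0)
    return averages
```

  • for a strategy such as G1 with fields T1, T2, and T3, each returned value would play the role of weight_avg_1, weight_avg_2, and weight_avg_3 respectively.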
  • the normalized weight value of the average weight of each text index field corresponding to each first search strategy is obtained.
  • Equation 4 is as follows:
  • $$\mathrm{weight}'_i = \frac{\mathrm{weight\_avg}_i}{\sum_{j=1}^{N} \mathrm{weight\_avg}_j}$$ (Equation 4)
  • weight_avg_j is the non-zero average weight of the j-th text index field corresponding to a certain first search strategy
  • weight'_i is the normalized weight value of the i-th text index field corresponding to the first search strategy
  • N is the number of non-zero average weights.
  • the sum of the weights of all text index fields corresponding to each first search strategy is 1.
  • a text index field having a non-zero normalized weight value is determined as a text index field corresponding to each first search strategy.
  • the non-zero normalized weight value is a search weight of the text index field under the first search policy.
  • for each first search strategy, multiple text index fields with non-zero normalized weight values are determined, so that the text index fields of the search material that the user is interested in can be selected, and the normalized weight value of each such text index field can be used as its search weight when calculating the relevance of a search item.
  • the non-zero normalized weight value corresponding to the text index field of each first search strategy may be too small.
  • the threshold may be set to remove the non-zero normalized weight value that is too small.
  • the method further includes: determining the text index fields whose normalized weight value is greater than the preset threshold as the text index fields corresponding to each first search strategy.
  • the preset threshold may be 1 divided by the number of non-zero normalized weight values.
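  • the normalization and threshold steps can be sketched together as follows. The sketch assumes the threshold means 1 divided by the number of non-zero normalized weight values, and it does not renormalize after filtering since the text does not mention doing so; the function names and field identifiers are hypothetical.

```python
def normalize_weights(average):
    """Normalize the non-zero average weights so that they sum to 1."""
    nonzero = {field: w for field, w in average.items() if w != 0}
    total = sum(nonzero.values())
    return {field: w / total for field, w in nonzero.items()}


def select_fields(normalized, threshold=None):
    """Keep the text index fields whose normalized weight exceeds the threshold.

    By default the threshold is 1 / (number of non-zero normalized weights).
    """
    if not normalized:
        return {}
    if threshold is None:
        threshold = 1 / len(normalized)
    return {field: w for field, w in normalized.items() if w > threshold}


# Hypothetical average weights of one first search strategy.
AVERAGE = {"business_name": 0.6, "brand_name": 0.3, "address": 0.1, "hours": 0.0}
normalized = normalize_weights(AVERAGE)
selected = select_fields(normalized)
```

  • note that with the default threshold, fields whose weights are all exactly equal would all be filtered out, so a deployment might prefer an explicitly chosen threshold.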
  • the entire query text may be input into each trained classifier separately, which determines whether the current first search strategy is applicable to the query text.
  • Step 220 Obtain a query text to be searched.
  • the query text to be searched may be the query text input by the user in the search bar of the client, or may be the query text automatically generated by the client according to the historical behavior log of the user. For example, after detecting that a female user enters the cosmetics sales page, the client may push relevant search results to the user according to the age information of the user. At this time, the client first generates a query text according to the user's information (for example, a middle-aged woman), and then calls the search engine to perform a search operation on the automatically generated query text.
  • Step 230: determine at least one first search strategy that matches the query text.
  • Each first search strategy corresponds to at least one text index field and a search weight matched to that text index field.
  • Determining the at least one first search strategy that matches the query text may include: determining it according to a preset correspondence between first search strategies and query texts; or identifying the query text with pre-trained classifiers and determining the matching first search strategies from their results.
  • When pre-trained classifiers are used, the query text may be input into each of a plurality of pre-trained classifiers to obtain each classifier's recognition result. When one or more classifiers recognize the query text as fitting, the first search strategies corresponding to those classifiers are taken as the first search strategies that match the query text.
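The one-classifier-per-strategy dispatch described above can be sketched as follows. The two classifier functions are hypothetical stand-ins with hard-coded rules; in the described method each would be a model trained on search logs.

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for trained binary classifiers: each returns True
# when the query text fits its strategy.
def merchant_classifier(query: str) -> bool:
    return query in {"KFC", "Pizza Hut"}


def landmark_classifier(query: str) -> bool:
    return "station" in query.lower() or query == "KFC"


CLASSIFIERS: Dict[str, Callable[[str], bool]] = {
    "merchant_strategy": merchant_classifier,
    "landmark_strategy": landmark_classifier,
}


def match_strategies(query: str) -> List[str]:
    """Run the query through every classifier; keep the strategies that accept it."""
    return [name for name, clf in CLASSIFIERS.items() if clf(query)]


print(match_strategies("KFC"))
```

A query can match several strategies at once, which is why the result is a list rather than a single label.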
  • Step 240: perform a search operation for the query text on each text index field corresponding to each first search strategy.
  • A query text may be identified as matching one or more first search strategies, each of which corresponds to its own text index fields and search weights. The search server may perform a search operation for each of these first search strategies to obtain a set of recall results per strategy.
  • Performing the search operation on each text index field includes recalling material according to the relevance between the text index fields of the search material and the query text, where the relevance may be determined based on the search weights of the text index fields.
  • The search server may use multi-threading to perform the search operations for the multiple first search strategies in parallel, obtaining a set of recall results for each first search strategy. Since each first search strategy has its own text index fields and search weights, calculating a relevance score between the search material and the query text gives more important text index fields a higher relevance score, which improves how the search server ranks the recalled results.
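A minimal sketch of running one search per matched strategy in parallel, using Python's standard `concurrent.futures.ThreadPoolExecutor`. The function `search_with_strategy` is a hypothetical placeholder for a real index lookup and simply fabricates a result label.

```python
from concurrent.futures import ThreadPoolExecutor


def search_with_strategy(strategy_name, query):
    # Placeholder for a real index lookup restricted to the strategy's fields.
    return [f"{strategy_name}:{query}:result"]


def parallel_search(strategy_names, query):
    """Run one search per strategy on a thread pool; return recall sets by strategy."""
    workers = max(1, len(strategy_names))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            name: pool.submit(search_with_strategy, name, query)
            for name in strategy_names
        }
        return {name: fut.result() for name, fut in futures.items()}


recalls = parallel_search(["merchant_strategy", "landmark_strategy"], "KFC")
print(recalls)
```

Keeping the recall sets keyed by strategy preserves the information needed later for block-wise ranking.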
  • For example, assume the search server uses a linearly weighted relevance score, given by Equation 5:
  • Relevance score = ∑ (matched length in the text index field / length of the text index field) × search weight (Equation 5).
  • Suppose the merchant "KFC" has two text index fields: the first is "merchant name", with content "KFC"; the second is "location", with content "west side of Wudaokou subway station".
  • The merchant "Pizza Hut" has the same two text index fields: "merchant name", with content "Pizza Hut", and "location", with content "east side of the KFC Wudaokou store".
  • When the query text is "KFC" and the "merchant name" text index field has the larger search weight, the merchant "KFC" receives a higher relevance score than the merchant "Pizza Hut".
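The KFC / Pizza Hut comparison can be worked through with a short sketch of Equation 5. The weights 0.7 and 0.3 are illustrative assumptions, and matching is simplified to a whole-query substring test, which the source does not prescribe.

```python
def relevance_score(query, item_fields, search_weights):
    """Sum over fields of (matched length / field length) * search weight."""
    score = 0.0
    for field, weight in search_weights.items():
        text = item_fields.get(field, "")
        if not text:
            continue
        # Simplifying assumption: the query either matches as a whole substring or not at all.
        matched = len(query) if query in text else 0
        score += (matched / len(text)) * weight
    return score


# Illustrative search weights for one strategy.
weights = {"merchant_name": 0.7, "location": 0.3}
kfc = {"merchant_name": "KFC", "location": "west side of Wudaokou subway station"}
pizza_hut = {"merchant_name": "Pizza Hut", "location": "east side of KFC Wudaokou store"}

kfc_score = relevance_score("KFC", kfc, weights)
pizza_hut_score = relevance_score("KFC", pizza_hut, weights)
print(kfc_score, pizza_hut_score)
```

The merchant "KFC" matches the whole of its heavily weighted name field, while "Pizza Hut" only matches a small fraction of its lightly weighted location field, so the first scores higher.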
  • Step 250: merge and output the search results of all the above search operations.
  • Merging and outputting the search results may include: sorting the search results obtained with the at least one first search strategy according to a preset policy; filtering out duplicate search results that rank later; and outputting the remaining search results.
  • When sorting, the search results obtained with the multiple first search strategies may be ranked in blocks according to a manually set priority; or ranked in blocks according to the relevance scores of the search results obtained with each first search strategy; or the search results of all first search strategies may be mixed and sorted by relevance score. The duplicate search results that rank later are then filtered out, and the remaining search results are output.
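One of the orderings described above, mixed sorting by relevance score followed by removal of later duplicates, can be sketched as follows. The item identifiers and scores are made up for illustration.

```python
def merge_results(results_by_strategy):
    """Mixed ordering by relevance score, then drop duplicates that rank later."""
    flat = [
        (score, item_id)
        for results in results_by_strategy.values()
        for item_id, score in results
    ]
    flat.sort(key=lambda pair: pair[0], reverse=True)
    seen, merged = set(), []
    for _score, item_id in flat:
        if item_id not in seen:  # keep only the highest-ranked copy
            seen.add(item_id)
            merged.append(item_id)
    return merged


merged = merge_results({
    "merchant_strategy": [("poi_1", 0.7), ("poi_2", 0.4)],
    "landmark_strategy": [("poi_2", 0.5), ("poi_3", 0.1)],
})
print(merged)
```

Block-wise ranking by strategy priority would instead concatenate the per-strategy lists in priority order before the same de-duplication pass.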
  • The search method disclosed in this embodiment trains, based on search logs, classifiers for identifying first search strategies, and determines the text index fields corresponding to each first search strategy and the search weight matched to each text index field.
  • During a search, at least one first search strategy matching the acquired query text is determined, a search operation for the query text is performed on the text index fields corresponding to each first search strategy, and the search results of all the search operations are merged and output.
  • Because the search is restricted to the text index fields associated with the query text, rather than all text index fields, false recalls caused by literal hits in unrelated text index fields are avoided.
  • Training the classifiers on search logs, and iteratively calculating from search logs the text index fields of each first search strategy and their search weights, reflects users' search expectations and further improves the accuracy of the search results.
  • As shown in FIG. 3, a search method disclosed in this embodiment may include steps 300 to 370.
  • Step 300: train, based on search logs, a classifier for identifying the first search strategy.
  • Step 310: determine the text index fields corresponding to each first search strategy and the search weight matched to each text index field.
  • Step 320: acquire the query text to be searched.
  • Step 330: determine at least one first search strategy that matches the query text.
  • Each first search strategy may correspond to at least one text index field and a search weight matched to that text index field.
  • Step 340: perform a search operation for the query text on each text index field corresponding to the at least one first search strategy.
  • For the specific implementation of steps 300 to 340, refer to the foregoing embodiment; details are not repeated here.
  • Step 350: perform a search operation for the query text based on the second search strategy.
  • The second search strategy corresponds to all text index fields of the search material, and every text index field has the same search weight.
  • To make the system more robust, a search operation for the query text may also be performed on all text index fields based on the second search strategy.
  • When sorting, the search results of the second search strategy are placed after those of the first search strategies, so that the search does not end with no results recalled.
  • Step 360: merge and output the search results of all the above search operations.
  • Merging and outputting the search results may include: sorting all search results obtained with the first search strategies according to the preset policy; ranking the search results obtained with the second search strategy after those obtained with the first search strategies; filtering out duplicate search results that rank later; and outputting the remaining search results.
  • For the method of sorting the search results obtained with the first search strategies, refer to the foregoing embodiment; details are not repeated here.
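The placement of second-strategy results can be sketched in a few lines. The sketch assumes the first-strategy results are already sorted, and the item identifiers are hypothetical.

```python
def merge_with_fallback(first_results, second_results):
    """Append second-strategy results after first-strategy results, dropping later duplicates."""
    merged, seen = [], set()
    for item_id in list(first_results) + list(second_results):
        if item_id not in seen:
            seen.add(item_id)
            merged.append(item_id)
    return merged


print(merge_with_fallback(["poi_1", "poi_2"], ["poi_2", "poi_9"]))
print(merge_with_fallback([], ["poi_9"]))
```

The second call shows the fallback role of the second search strategy: when no first search strategy recalls anything, its results are still returned.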
  • Step 370: when a preset condition is met, train and update the classifier for identifying the first search strategy based on the search logs corresponding to the second search strategy.
  • The preset condition may include at least one of the following: a preset update period is reached; the ratio of a first click rate to a second click rate is less than a preset threshold.
  • The first click rate is the click rate of search results obtained by performing search operations based on the first search strategy, and the second click rate is the click rate of search results obtained by performing search operations based on the second search strategy.
  • The preset update period may be determined according to how quickly the search material is updated, or set manually; for example, it may be 1 month.
  • The first click rate and the second click rate can both be obtained by statistical analysis of the search logs of the search server.
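A minimal sketch of the update check. The 30-day period stands in for the "1 month" example, and the ratio threshold of 1.0 is purely an assumption, since the source gives no value.

```python
from datetime import datetime, timedelta


def should_retrain(last_update, now, first_ctr, second_ctr,
                   period=timedelta(days=30), ratio_threshold=1.0):
    """Return True when the update period has elapsed or the CTR ratio is too low."""
    period_reached = (now - last_update) >= period
    # Guard against division by zero when the second strategy has no clicks.
    ratio_low = second_ctr > 0 and (first_ctr / second_ctr) < ratio_threshold
    return period_reached or ratio_low


t0 = datetime(2024, 1, 1)
print(should_retrain(t0, t0 + timedelta(days=5), first_ctr=0.05, second_ctr=0.10))
```

A low ratio means users are clicking the fallback results more often than the strategy-specific ones, which is the signal that the first search strategies no longer fit.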
  • When the preset update period is reached, or the ratio of the first click rate to the second click rate is less than the preset threshold, steps 300 and 310 may be performed on the search logs obtained with the second search strategy. That is, the classifier training and the determination of text index fields and search weights are repeated on these logs, and the newly trained classifiers and first search strategies are added to the existing first search strategies.
  • Combining the second search strategy with the search operation avoids the problem of no results being recalled because of missed matches.
  • Repeating the classifier training on the search results of the second search strategy also reveals when a first search strategy no longer fits because users' habits have changed, and allows new first search strategies to be discovered in time.
  • the embodiment of the present application further provides a search device.
  • FIG. 4 is a schematic diagram showing the hardware structure of a search device.
  • the search device can include a processor 401, a non-transitory computer readable storage medium 402 that stores machine executable instructions.
  • Processor 401 and non-transitory computer readable storage medium 402 can communicate via system bus 403. And, by reading and executing machine executable instructions in the non-transitory computer readable storage medium 402 corresponding to the search logic, the processor 401 can perform the search method described above.
  • the search device may be a PC, a mobile terminal, a personal digital assistant, a tablet, or the like.
  • the non-transitory computer readable storage medium 402 referred to herein can be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, and the like.
  • For example, the non-transitory computer readable storage medium may be: RAM (Random Access Memory), volatile memory, nonvolatile memory, flash memory, a storage drive (such as a hard disk drive), a solid state drive, any type of storage disk (such as an optical disc or DVD), a similar storage medium, or a combination thereof.
  • FIG. 5 is a functional block diagram of search logic according to an embodiment of the present application. As shown in FIG. 5, the above search logic may include a first search policy determination module 510, a search module 520, and a search result output module 530.
  • a first search policy determining module 510 configured to determine at least one first search policy that matches a query text to be searched, where each of the first search policies corresponds to at least one first text index field and the first text The search weight of the index field match.
  • the searching module 520 is configured to perform a search operation of the query text based on each of the first text index fields corresponding to each of the first search policies determined by the first search policy determining module 510;
  • the search result output module 530 is configured to merge and output the search results of all the above search operations.
  • The search apparatus disclosed in this embodiment determines at least one first search strategy that matches the query text, where each first search strategy corresponds to at least one first text index field and a search weight matched to that field; then performs a search operation for the query text based on each text index field corresponding to each first search strategy; and finally merges and outputs the search results of all the search operations.
  • In an embodiment, as shown in FIG. 6, the first search strategy determining module 510 includes:
  • a first determining unit 511, configured to determine, according to a preset correspondence between first search strategies and query texts, the at least one first search strategy that matches the query text.
  • In another embodiment, as shown in FIG. 7, the first search strategy determining module 510 includes:
  • a second determining unit 512, configured to identify the query text with pre-trained classifiers, one for each first search strategy, and determine at least one first search strategy that matches the query text.
  • the search logic further includes:
  • the search strategy classifier training module 540 is configured to train the classifier based on the search log.
  • the search logic further includes:
  • the text field and weight determination module 550 is configured to determine a first text index field corresponding to each first search policy, and a search weight matched by each first text index field.
  • the search strategy classifier training module 540 includes:
  • a search policy space definition determining unit 541, configured to cluster the search logs to generate a search policy space definition, where the search policy space definition is used to represent a mapping relationship between each first search policy and query text in the search log;
  • the training unit 542 is configured to acquire, according to the search strategy space definition, the search logs corresponding to each first search strategy, and to train, based on those search logs, a classifier for identifying the corresponding first search strategy.
  • the text field and weight determination module 550 includes:
  • a log obtaining unit 551, configured to acquire a search log corresponding to the first search policy
  • the weight calculation unit 552 is configured to iteratively calculate, from the hit score in each second text index field of the search material of the query texts in the search logs corresponding to the first search strategy, the average weight of each second text index field for that first search strategy.
  • the weight calculation unit 552 is further configured to obtain the single-log weight of each second text index field in each search log corresponding to the first search strategy, and to calculate, from these single-log weights, the average weight of each second text index field for the first search strategy.
  • a text field and weight determining unit 553, configured to determine, according to the average weight of each second text index field, the first text index fields corresponding to the first search strategy and the search weight matched to each first text index field.
  • the text field and weight determining unit 553 is further configured to calculate, from the average weight of each second text index field, the normalized weight value of each second text index field for the first search strategy; to determine each second text index field whose normalized weight value is greater than a preset threshold to be a first text index field corresponding to the first search strategy; and to determine the search weight matched to each first text index field according to its normalized weight value.
  • The first search strategies and their classifiers are trained on search logs, and the text index fields corresponding to each first search strategy and the search weight of each text index field are obtained by iterative calculation over the search logs. This reflects users' search expectations and improves the accuracy of the search results.
  • the search module 520 is specifically configured to:
  • the search logic further includes:
  • the supplementary search module 560 is configured to separately perform a search operation of the query text based on the second search policy; wherein the second search policy corresponds to all second text index fields of the search material, and each of the second text indexes The search weight of the domain is the same.
  • the search logic further includes:
  • the search policy update module 570 is configured to train and update a classifier for identifying the first search policy based on a search log corresponding to the second search policy when a preset condition is met.
  • the preset condition includes at least one of: reaching a preset update period; a ratio of the first click rate to the second click rate is less than a preset threshold; wherein the first click rate is based on The first search strategy performs a click rate of the search result obtained by the search operation, and the second click rate is a click rate of the search result obtained by performing the search operation based on the second search policy.
  • Repeating the training of the classifier for identifying the first search strategy reveals when a first search strategy no longer fits because users' habits have changed, and allows new first search strategies to be discovered in time.
  • the present application also discloses a non-transitory computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the search method described in the above embodiments.

Abstract

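A search method, an apparatus, and a non-transitory computer-readable storage medium. The method includes: determining at least one first search strategy that matches a query text to be searched (100), where each first search strategy corresponds to at least one first text index field and a search weight matched to that first text index field; performing a search operation for the query text based on each first text index field corresponding to each first search strategy (110); and merging and outputting the search results of all the above search operations (120).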

Description

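Search method, apparatus, and non-transitory computer-readable storage medium
Cross-reference to related applications
This patent application claims priority to Chinese patent application No. 201710209677.X, filed on March 31, 2017 and entitled "Search method and apparatus, and electronic device" (“一种搜索方法及装置,电子设备”), the entire contents of which are incorporated herein by reference.
Technical field
This application relates to computer technology, and in particular to a search method, an apparatus, and a non-transitory computer-readable storage medium.
Background
With the development of Internet technology, information on the Internet has grown explosively, and more and more users obtain the content they care about through Internet search. For example, a search engine may search for information based on text entered by a user and provide its search service based on text relevance. When search engines first emerged, web pages were the main carriers of information on the Internet, so searching web pages was largely sufficient to find the content users cared about. With the growth of the mobile Internet, however, the local-life services offered by O2O (Online-to-Offline) platforms have made daily life more convenient, and demand for search on O2O platforms has grown accordingly. Unlike a web page, an information record on an O2O platform may have multiple text index fields that describe a platform service from different angles. For example, a merchant POI (Point of Interest) that provides catering services may be described by merchant name, registered company name, brand name, business district, address, main dishes, business hours, and so on. In such cases an O2O platform may have more than fifty descriptive text index fields. Moreover, the information in these fields may be unrelated, so applying web search methods across all text index fields may make it hard to retrieve the content users care about comprehensively and accurately.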
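Summary
This application provides a search method that can obtain relatively accurate search results for information with multiple text index fields.
In a first aspect, an embodiment of this application provides a search method, including:
determining at least one first search strategy that matches a query text to be searched, where each first search strategy corresponds to at least one first text index field and a search weight matched to that first text index field;
performing a search operation for the query text based on each first text index field corresponding to each first search strategy;
merging and outputting the search results of all the above search operations.
In a second aspect, an embodiment of this application provides a search apparatus, including a processor and a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium stores machine-executable instructions that can be executed by the processor, and the machine-executable instructions cause the processor to perform the search method disclosed in the first aspect of this application.
In a third aspect, an embodiment of this application provides a non-transitory computer-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to perform the search method disclosed in the first aspect of this application.
The search method disclosed in the embodiments of this application determines at least one first search strategy that matches the query text, where each first search strategy corresponds to at least one text index field and a search weight matched to that field; then performs a search operation for the query text based on each text index field corresponding to each first search strategy; and finally merges and outputs the search results of all the search operations. For information with multiple text index fields, relatively accurate search results can be obtained. Because the search operation is performed only on the text index fields associated with the query text, rather than on all text index fields, false recalls caused by literal hits in unrelated text index fields are avoided and the relevance of the search results improves. In addition, setting search weights for the different text index fields improves the precision of the search results.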
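Brief description of the drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a search method according to an embodiment of this application.
FIG. 2 is a flowchart of a search method according to another embodiment of this application.
FIG. 3 is a flowchart of a search method according to yet another embodiment of this application.
FIG. 4 is a schematic diagram of the hardware structure of a search apparatus according to an embodiment of this application.
FIG. 5 is a functional block diagram of search logic according to an embodiment of this application.
FIG. 6 is a functional block diagram of search logic according to another embodiment of this application.
FIG. 7 is a functional block diagram of search logic according to yet another embodiment of this application.
FIG. 8 is a functional block diagram of search logic according to a further embodiment of this application.
Detailed description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this application without creative effort fall within the scope of protection of this application.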
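This application discloses a search method. As shown in FIG. 1, the method includes steps 100 to 120.
The search method of this application may involve two kinds of search strategies: a first search strategy and a second search strategy. A first search strategy may perform the search operation on only some of the text index fields of the search material, whereas the second search strategy may perform it on all of them.
Step 100: determine at least one first search strategy that matches the query text.
Each first search strategy may correspond to at least one text index field and a search weight matched to that text index field.
A first search strategy may define which text index fields of the search material are queried and the search weight matched to each of those fields. Each first search strategy may correspond to at least one text index field, and the fields may have the same or different search weights. The text index fields of a first search strategy may each correspond to the same or different query texts. A text index field can be used to build an index, such as an inverted index. Its content is usually meaningful text that describes one aspect of the search material. Taking a merchant that provides catering services as an example, the POI (Point of Interest) of the search material may include at least one of merchant name, registered company name, brand name, business district, address, main dishes, business hours, and similar fields; these text fields are the text index fields. For instance, for the search material "the Wangjing Garden branch of 金百万", poi_name may be 金百万烤鸭店(望京花园店). Here poi_name is the name of the text index field recorded in the system, for example the merchant name “金百万烤鸭店”, and the text after poi_name is the actual content of that field, which can be used to build the inverted index. A text index field can thus represent a field of the search material. After the query text to be searched is obtained, the first search strategies matching it can be determined first. For example, the text index fields of several first search strategies may be set in advance, together with the query texts corresponding to each first search strategy. First search strategies may include, for example, a merchant strategy, a landmark strategy, and a dish-name strategy. The query texts for each first search strategy can then be set separately; for instance, the query texts for the merchant strategy may include 金百万, 肯德基 (KFC), and 全聚德.
The query text to be searched may be entered by the user in the search bar of the client, or generated automatically by the client from the user's historical behavior log. For example, when the client detects that a female user has entered a cosmetics sales page, it may push relevant search results to the user according to the user's age information. In this case, the client may first generate a query text from the user's information (for example, "middle-aged woman") and then call the search engine to perform a search operation on the automatically generated query text.
When the first search strategies matching the query text are determined from a correspondence between query texts and first search strategies, that correspondence may first be established manually. For example, the query texts “肯德基” (KFC) and “金百万” may be mapped to the merchant strategy. When the correspondence is set, the text index fields in each first search strategy and the search weight of each field can be set at the same time. For example, the merchant strategy may contain the text index fields merchant name, brand name, and registered company name, with search weights of 50% for merchant name, 30% for brand name, and 20% for registered company name. The text index fields of a first search strategy and their search weights may be set based on prior knowledge.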
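The manually configured correspondence can be pictured as two lookup tables, sketched here in Python. The dictionary layout and the English field names are illustrative assumptions; only the merchant-strategy weights (50%, 30%, 20%) and the example query texts come from the text.

```python
# Fields and search weights per first search strategy (weights from the example above).
STRATEGIES = {
    "merchant_strategy": {
        "merchant_name": 0.5,
        "brand_name": 0.3,
        "registered_company_name": 0.2,
    },
}

# Manually established correspondence between query texts and strategies.
QUERY_TO_STRATEGIES = {
    "肯德基": ["merchant_strategy"],
    "金百万": ["merchant_strategy"],
}


def lookup(query):
    """Return the matched strategies with their fields and weights; empty if unmapped."""
    return {name: STRATEGIES[name] for name in QUERY_TO_STRATEGIES.get(query, [])}


print(lookup("肯德基"))
```

A query with no entry returns an empty mapping, which is the case where a search over all fields (the second search strategy) would be the only source of results.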
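Determining the at least one first search strategy that matches the query text to be searched may include: determining it according to a preset correspondence between first search strategies and query texts; or identifying the query text with a pre-trained classifier to determine the matching first search strategies. The first search strategies may be established manually in advance, or determined by a recognition model trained on users' historical behavior.
When pre-trained classifiers are used to determine the first search strategies matching the query text, the classifiers may first be trained on search logs. For example, after the search logs of a period of time are obtained, they can be clustered according to information such as the query text, text index field, and matched text in each log, in order to train classifiers for identifying first search strategies. The classifiers trained on the search logs can then be used to determine the at least one first search strategy that matches the query text.
Step 110: perform a search operation for the query text based on the text index fields corresponding to each first search strategy.
A query text may correspond to several first search strategies, and each first search strategy may include several text index fields. Once the first search strategies matching the query text are determined, a search operation for the query text can be performed on the text index fields of each of them. For example, the first search strategies determined for the query text “金百万” may include the merchant strategy and the landmark strategy. In the merchant strategy, the text index fields matching the query text “金百万” include merchant name and brand name. In the landmark strategy, the matching text index fields include building. A search operation for the query text “金百万” can then be performed on the search material using each of the three text index fields merchant name, brand name, and building, yielding three lists of search results. When a search operation is performed on different text index fields, the relevance between the query text and the search material can be calculated using the search weight of each text index field.
To avoid missing search results, a search operation may also be performed based on the second search strategy, which corresponds to all text index fields. The second search results, obtained by searching all text index fields with the second search strategy, can supplement the first search results, obtained by searching the corresponding text index fields with the first search strategies.
Step 120: merge and output the search results of all the above search operations.
When the search results of all the search operations are merged and output, the results may first be sorted, then duplicate results filtered out, and the remaining results output. When sorting, the search results may be ranked in blocks by the priority of the search strategies; or ranked in blocks by the discrimination score of each search strategy; or all search results may be mixed and sorted by their evaluation scores. If the search operations include one performed with the second search strategy, the second search results obtained from it may be ranked last.
According to the search method disclosed in this embodiment, at least one first search strategy matching the query text to be searched is determined first. Each first search strategy corresponds to at least one text index field, and each text index field has a preset search weight. A search operation for the query text is then performed on the text index fields corresponding to each first search strategy. Finally, the search results of all the search operations are merged and output. In this way, relatively accurate search results can be obtained even when the search material has multiple text index fields. Because the search operation is performed only on the text index fields associated with the query text, rather than on all text index fields, false recalls caused by literal hits in unrelated text index fields are avoided and the relevance of the search results improves. In addition, setting search weights for the different text index fields improves the precision of the search results.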
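As shown in FIG. 2, a search method disclosed in this embodiment includes steps 200 to 250.
Step 200: train, based on search logs, a classifier for identifying the first search strategy.
When classifiers are to be used to determine the first search strategies matching a query text, they may first be trained on search logs. Training the classifiers may include: clustering the search logs to generate a search strategy space definition, which represents the mapping between each first search strategy and the query texts in the search logs; obtaining, based on the search strategy space definition, the search logs corresponding to each first search strategy; and training, on the search logs of each first search strategy, a classifier for identifying that first search strategy.
Clustering the search logs to generate the search strategy space definition may include: using the hit scores of the query text in the text index fields, extracted from each search log, as features, and clustering the search logs to obtain query text categories. Each query text category may correspond to one or more search strategies.
Before the classifier is trained, search logs of search operations performed with the second search strategy may be obtained first. To make the trained classifier more accurate while keeping the training cost low, logs with order-placing behavior may be selected for training. The search logs recorded by the search server differ somewhat between systems. A search log may include, for example, the search time, query text, matched text, text index field, displayed result list, and behavior flags such as click or order. If order logs make up too small a share of all search logs, click logs and order logs may be used together. In that case the behavior-type weight of a click log may be smaller than that of an order log.
The hit score of each text index field can be calculated from the obtained search logs. For example, the hit score score_i of each text index field in a search log may be calculated with Equation 1: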
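$$\mathrm{score}_i = \mathrm{type}_j \cdot \frac{\mathrm{len}(\mathrm{match}_i)}{\min\bigl(\mathrm{len}(\mathrm{field}_i),\, N\bigr)} \qquad \text{(Equation 1)}$$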
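Here match_i denotes the text matched by the query text in the i-th text index field when the search operation is performed, and len(match_i) is the length of that matched text. field_i denotes the content of the i-th text index field, and len(field_i) is the length of its text. In general, len(match_i) <= len(field_i). N is a smoothing factor: the denominator of Equation 1 takes the smaller of the field's text length and the length cap N. Capping the denominator at N keeps the overall score from becoming too small. type_j denotes the weight of the user behavior type of the current j-th search log; for example, type = 0.8 for a click log and type = 1 for an order log. Thus, for the text index fields of every log in which a click or an order occurred, at least one non-zero value can be obtained as a field's hit score in that log. N may be set to a natural number, such as 30, according to the function of the search service.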
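A minimal Python sketch of the hit score, assuming Equation 1 has the form read off the definitions above: the behavior-type weight times the matched length, divided by the smaller of the field length and N. The function and parameter names are hypothetical.

```python
def hit_score(match_text, field_text, behavior_weight, n=30):
    """Hit score of one text index field in one search log.

    Assumed form: behavior_weight * len(match) / min(len(field), n).
    """
    if not field_text:
        return 0.0
    return behavior_weight * len(match_text) / min(len(field_text), n)


# Click log (weight 0.8): "KFC" matched inside a 12-character field.
print(hit_score("KFC", "KFC Wudaokou", behavior_weight=0.8))
# Order log (weight 1.0): a 60-character field is capped at n = 30.
print(hit_score("abc", "x" * 60, behavior_weight=1.0))
```

The cap matters for long fields such as addresses or descriptions: without it, a genuine match in a long field would contribute almost nothing.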
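A text index field vector is initialized whose dimension equals the number of text index fields in the search logs. If the search logs contain M text index fields, the text index field vector is M-dimensional. For each text index field, its hit score score_i in each search log can be calculated with Equation 1, so that each search log yields one M-dimensional vector. Multiple search logs yield multiple M-dimensional vectors such as [0,0,1.0,0.8,0...0] and [0,0,0.9,0.9,0...0]. Here M is the number of text index fields in the search logs, and the i-th component of each M-dimensional vector is the hit score of the i-th text index field in the corresponding search log.
After multiple non-zero M-dimensional vectors have been obtained from the order logs and/or click logs, they are clustered, so that searches with similar matching patterns across the text index fields fall into the same category. This establishes the mapping between each first search strategy and the query texts in the search logs. In an embodiment, a numerical clustering method for multidimensional spaces may be used to cluster the M-dimensional vectors, such as the DBSCAN or k-means clustering algorithm; this application does not limit the clustering algorithm used.
After clustering, the cluster centers can be regarded as the space definitions of the first search strategies. The space definition of a first search strategy represents the mapping between that first search strategy and the query texts in the search logs, so that query texts of a given category correspond to a specific first search strategy. For example, when a user enters query texts such as “金百万”, “海底捞”, or “九头鹰酒家”, the user usually wants to find the corresponding merchant. With the clustering method described above, the query texts “金百万”, “海底捞”, and “九头鹰酒家” are grouped into one category. Clustering the search logs is therefore, in effect, a process of learning from seemingly disordered search results that searching certain text index fields for a certain category of query texts is more efficient than searching all text index fields. The clustering should generally not be too fine-grained; keeping the number of clusters under one hundred is preferable. With automatic clustering, there is no need to consider what a first search strategy is meant to express, or to define first search strategies in advance, in order to determine the first search strategy corresponding to a query text and, further, the text index fields corresponding to that first search strategy. This reduces the chance of errors in manually defined strategies and can reveal latent data patterns that are hard to discover.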
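Then, a classifier for identifying a first search strategy can be trained on the query texts of each category.
In an embodiment, the query texts of each category may be used as positive samples and a certain number of negative samples collected; the positive and negative samples together form the training data for supervised learning, which trains the classifier for identifying a first search strategy. Each query text category may correspond to one first search strategy. In an embodiment, multi-class classification can be implemented in two ways: with a single multi-class classifier, or by combining multiple binary classifiers. This embodiment may, for example, combine multiple binary classifiers. Many classification models are possible; this embodiment uses an SVM (Support Vector Machine) classifier for supervised learning on the training data as an example to explain the training process. First, sample features are extracted from the training data. The extracted features may include at least text features of the query text, such as the query text itself and the combinations of segments obtained by word segmentation. They may also include query length, prefix, suffix, POS+bigram, POS+unigram, POS, and other combined features. Here query length is the length of the query text, prefix and suffix are its prefix and suffix, unigram and bigram are text features of the query text, and POS+unigram is the position of a text feature of the query text.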
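A rough Python sketch of extracting some of the listed features from one query text. It assumes the query has already been segmented into tokens, treats prefix and suffix as the first and last token (the source does not say whether they are token-level or character-level), and omits the POS-based features. The function name is hypothetical.

```python
def extract_features(query, tokens):
    """Build a feature dictionary for one query text from its pre-segmented tokens."""
    return {
        "query": query,
        "query_length": len(query),
        "prefix": tokens[0] if tokens else "",    # assumption: first token
        "suffix": tokens[-1] if tokens else "",   # assumption: last token
        "unigrams": list(tokens),
        "bigrams": list(zip(tokens, tokens[1:])),
    }


features = extract_features("KFC Wudaokou store", ["KFC", "Wudaokou", "store"])
print(features)
```

Before training an SVM, a dictionary like this would still need to be turned into a numeric vector, for example by one-hot encoding the categorical entries.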
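The extracted sample features can be used to train an SVM classifier, yielding a classifier for identifying a first search strategy. Any technique well known to those skilled in the art may be used to train a classifier on the sample features; details are not repeated here.
After training, each query text category has a corresponding classifier for identifying a first search strategy, which is later used to identify acquired query texts.
Step 210: determine the text index fields corresponding to each first search strategy and the search weight matched to each text index field.
There are two ways to determine the text index fields corresponding to each first search strategy and the search weight matched to each field. In the first, if the first search strategies are set manually in advance, and the correspondence between their text index fields and query texts is also set manually in advance, then the search weights matched to the text index fields of each first search strategy can be set manually in advance as well. The text index fields of each first search strategy and their search weights may be hard-coded based on experience, or set by the user as needed through a user interface; details are not repeated here.
In the second, the text index fields of each first search strategy and the search weight matched to each field are set according to search logs. For example, for each first search strategy, all search logs corresponding to it can be obtained; then, from the hit scores in each text index field of the query texts in those search logs, the average weight of each text index field for that first search strategy is calculated iteratively; and from these average weights, the text index fields corresponding to the first search strategy and the search weight matched to each field are determined. The search logs may be those obtained when the second search strategy is used to search all text index fields. For example, the search logs corresponding to each first search strategy can be determined by labeling the search logs that were used when clustering produced the space definitions of the first search strategies.
The search logs may also be those obtained when, for each first search strategy, a search operation is performed on all text index fields using initialized search weights. Taking search material with M text index fields as an example, assume each first search strategy corresponds to all M text index fields and the search weight matched to each field is 1/M. The assumed first search strategy is then run: search operations are performed on query texts according to it, and the search logs of these operations over a period of time are collected.
From the search server, the search logs corresponding to each first search strategy can be obtained, including the query text, hit text, text index field, and behavior type of each search log. The hit text is the text matched by the query text in a text index field. In an embodiment of this application, for each first search strategy, iteratively calculating the search weights of its text index fields from the hit scores in each text index field of the query texts in its search logs may include the following four steps.
In the first step, the single-log weight of every text index field in each search log is obtained. Taking search material with M text index fields as an example, each search log matches at least one text index field. Before the hit scores are calculated, the search weights of the M text index fields may each be initialized to 1/M. The single-log weight of every text index field in each search log can then be calculated with Equation 2: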
Figure PCTCN2017115680-appb-000002
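Here type_j is the behavior-type weight of the j-th search log. For example, type_j = 0.8 if the j-th search log is a click log, and type_j = 1 if it is an order log. type_j may take other values, provided that the behavior-type weight of a click log is smaller than that of an order log. field_i denotes the content of the i-th text index field, and len(field_i) is the length of that content. match_i denotes the content matched by the query text of the j-th search log in the i-th text index field, which is available from the search process. Other formulas may also be used to calculate the single-log weight of each text index field in each search log. In this embodiment, a ratio of exponentials is used in order to bound the single-log weight from above, which gives a smooth upper limit.
Equation 2 gives the single-log weight of every text index field in each search log. For example, suppose there are Y order logs, each with M text index fields. After the single-log weights of all M text index fields in the Y order logs have been obtained with Equation 2, each text index field has Y single-log weights.
Each first search strategy may correspond to at least one text index field, and each text index field may correspond to several first search strategies. For example, the merchant strategy may correspond to the three text index fields merchant name, address, and merchant brand, while the landmark strategy may also correspond to the two text index fields merchant name and address. Applying Equation 2 to all the search logs of each first search strategy gives the single-log weight of every text index field in each search log of each first search strategy.
In the second step, the average weight of each text index field for each first search strategy is calculated from the single-log weights of all text index fields in each search log of that first search strategy. For example, the single-log weights of a text index field across the search logs of a first search strategy can be averaged to obtain the average weight of that field for the first search strategy, as in Equation 3: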
$$\mathrm{weight\_avg}_i = \frac{\sum_{\text{logs}} \mathrm{weight}_i}{\mathrm{count}_i} \qquad \text{(Equation 3)}$$
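Here weight_i is the single-log weight of the i-th text index field in one of the search logs corresponding to a first search strategy, count_i is the number of non-zero single-log weights of the i-th text index field across all search logs corresponding to that first search strategy, and weight_avg_i is the average weight of the i-th text index field for that first search strategy.
Suppose clustering yields P first search strategies (denoted, for example, G1, G2, …, Gp), and first search strategy G1 corresponds to three text index fields, denoted T1, T2, and T3. The average weight weight_avg_1 of text index field T1, the average weight weight_avg_2 of text index field T2, and the average weight weight_avg_3 of text index field T3 are then calculated for first search strategy G1.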
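The averaging step can be sketched in Python as follows, assuming the single-log weights are already available (Equation 2 itself is not reproduced here). The input layout, one dictionary of field weights per search log, and the sample values are illustrative assumptions.

```python
def average_weights(single_log_weights):
    """Average each field's non-zero single-log weights over a strategy's search logs."""
    fields = sorted({f for log in single_log_weights for f in log})
    totals = {f: 0.0 for f in fields}
    counts = {f: 0 for f in fields}
    for log in single_log_weights:
        for field, weight in log.items():
            if weight != 0:  # count_i only counts non-zero single-log weights
                totals[field] += weight
                counts[field] += 1
    return {f: (totals[f] / counts[f] if counts[f] else 0.0) for f in fields}


# Hypothetical single-log weights for strategy G1 over two search logs.
logs_g1 = [
    {"T1": 0.8, "T2": 0.0, "T3": 0.2},
    {"T1": 0.4, "T2": 0.0, "T3": 0.0},
]
avg = average_weights(logs_g1)
print(avg)
```

Dividing by the count of non-zero weights, rather than by the number of logs, means a field that matches rarely but strongly keeps a high average.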
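In the third step, the normalized weight value of the average weight of each text index field is obtained for each first search strategy.
The two preceding steps give, for each first search strategy, the average weights of the M text index fields, some of which are non-zero and the rest zero. The non-zero average weights can be normalized with the following formula to obtain the normalized weight values, as in Equation 4: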
$$\mathrm{weight}'_i = \frac{\mathrm{weight\_avg}_i}{\sum_{j=1}^{N} \mathrm{weight\_avg}_j} \qquad \text{(Equation 4)}$$
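Here weight_avg_j is the non-zero average weight of the j-th text index field for a first search strategy, weight′_i is the normalized weight value of the i-th text index field for that first search strategy, and N is the number of non-zero average weights. For example, the average weights weight_avg_1, weight_avg_2, and weight_avg_3 of text index fields T1, T2, and T3 for first search strategy G1 are normalized to obtain the normalized weight values weight′_1, weight′_2, and weight′_3. After normalization, the weights of all text index fields of each first search strategy sum to 1.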
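The normalization step is a division by the sum of the non-zero average weights. A minimal sketch, with hypothetical field names and values:

```python
def normalize(avg_weights):
    """Divide each average weight by the sum of the non-zero average weights."""
    total = sum(w for w in avg_weights.values() if w != 0)
    if total == 0:
        return {f: 0.0 for f in avg_weights}
    return {f: w / total for f, w in avg_weights.items()}


normalized = normalize({"T1": 0.6, "T2": 0.0, "T3": 0.2})
print(normalized)
```

Fields with a zero average weight stay at zero, so they drop out when the fields with non-zero normalized weights are selected for the strategy.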
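In the fourth step, each text index field with a non-zero normalized weight value is determined to be a text index field corresponding to the first search strategy. The non-zero normalized weight value is the search weight of that text index field under the first search strategy.
After the above iterative calculation, multiple text index fields with non-zero normalized weight values are determined for each first search strategy. This selects the text index fields of the search material that users are interested in, and the normalized weight value of each selected field can serve as the search weight used when calculating the relevance of the search material.
The non-zero normalized weight values obtained for the text index fields of a first search strategy may be very small. To avoid noise, a threshold may be set to discard such values. When the search weights of the text index fields of a first search strategy are calculated iteratively from the hit scores of the query texts in its search logs, the method may further include: determining the text index fields whose normalized weight value is greater than a preset threshold to be the text index fields corresponding to each first search strategy. The preset threshold may be 1 divided by the number of non-zero normalized weight values.
When identifying the first search strategy, the entire query text may be input into each trained classifier, which outputs whether the query text fits that classifier's first search strategy.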
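Step 220: obtain the query text to be searched.
The query text to be searched may be entered by the user in the search bar of the client, or generated automatically by the client from the user's historical behavior log. For example, after detecting that a female user has entered a cosmetics sales page, the client may push relevant search results to the user according to the user's age information. In this case, the client first generates a query text from the user's information (for example, "middle-aged woman") and then calls the search engine to perform a search operation on the automatically generated query text.
Step 230: determine at least one first search strategy that matches the query text.
Each first search strategy corresponds to at least one text index field and a search weight matched to that text index field.
Determining the at least one first search strategy that matches the query text may include: determining it according to a preset correspondence between first search strategies and query texts; or identifying the query text with pre-trained classifiers to determine the matching first search strategies. In the latter case, the query text may be input into each of a plurality of pre-trained classifiers to obtain each classifier's recognition result. When one or more classifiers recognize the query text as fitting, the first search strategies corresponding to those classifiers are taken as the first search strategies that match the query text.
Step 240: perform a search operation for the query text based on each text index field corresponding to each first search strategy.
A query text may be identified as matching one or more first search strategies, each of which corresponds to its own text index fields and search weights. The search server may perform a search operation for each of these first search strategies to obtain a set of recall results per strategy.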
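Performing a search operation for the query text based on each text index field corresponding to each first search strategy includes recalling material according to the relevance between the text index fields of the search material and the query text, where the relevance may be determined based on the search weights of the text index fields. The search server may use multi-threading to perform the search operations for the multiple first search strategies in parallel, obtaining a set of recall results for each first search strategy. Since each first search strategy has its own text index fields and search weights, calculating a relevance score between the search material and the query text gives more important text index fields a higher relevance score, which improves how the search server ranks the recalled results.
For example, assume the search server uses a linearly weighted relevance score, given by Equation 5:
Relevance score = ∑ (matched length in the text index field / length of the text index field) × search weight (Equation 5).
Suppose the merchant “肯德基” (KFC) has two text index fields: the first is "merchant name", with content “肯德基”; the second is "location", with content “五道口地铁站西侧” (west side of Wudaokou subway station). The merchant “必胜客” (Pizza Hut) has the same two text index fields: "merchant name", with content “必胜客”, and "location", with content “肯德基五道口店东侧” (east side of the KFC Wudaokou store). When the query text is “肯德基” and the "merchant name" text index field has the larger search weight, the merchant “肯德基” receives a higher relevance score than the merchant “必胜客”.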
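Step 250: merge and output the search results of all the above search operations.
Merging and outputting the search results may include: sorting the search results obtained with the at least one first search strategy according to a preset policy; filtering out duplicate search results that rank later; and outputting the remaining search results. When the search results of all the search operations are merged and output, they may first be sorted according to the preset policy. When sorting, the search results obtained with the multiple first search strategies may be ranked in blocks according to a manually set priority; or ranked in blocks according to the relevance scores of the search results obtained with each first search strategy; or the search results of all first search strategies may be mixed and sorted by relevance score. The duplicate search results that rank later are then filtered out, and the remaining search results are output.
The search method disclosed in this embodiment trains, based on search logs, classifiers for identifying first search strategies, and determines the text index fields corresponding to each first search strategy and the search weight matched to each field. During a search, at least one first search strategy matching the acquired query text is determined, a search operation for the query text is performed on the text index fields of each first search strategy, and the search results of all the search operations are merged and output. Because the search operation is performed on the text index fields associated with the query text, a given query text is searched only in its corresponding text index fields rather than in all of them. This avoids false recalls caused by literal hits in unrelated text index fields and improves the relevance of search results for information with multiple text index fields. In addition, using the search weights matched to the different text index fields to optimize the ranking improves the precision of the search results.
Training the classifiers for identifying first search strategies on search logs, and iteratively calculating from search logs the text index fields of each first search strategy and the search weight matched to each field, reflects users' search expectations and further improves the accuracy of the search results.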
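As shown in FIG. 3, a search method disclosed in this embodiment may include steps 300 to 370.
Step 300: train, based on search logs, a classifier for identifying the first search strategy.
For the specific implementation of training the classifier based on search logs, refer to the foregoing embodiment; details are not repeated here.
Step 310: determine the text index fields corresponding to each first search strategy and the search weight matched to each text index field.
For the specific implementation of determining the text index fields and their search weights, refer to the foregoing embodiment; details are not repeated here.
Step 320: acquire the query text to be searched.
For the specific implementation of acquiring the query text, refer to the above; details are not repeated here.
Step 330: determine at least one first search strategy that matches the query text.
Each first search strategy may correspond to at least one text index field and a search weight matched to that text index field.
For the specific implementation of determining the matching first search strategies, refer to the foregoing embodiment; details are not repeated here.
Step 340: perform a search operation for the query text on each text index field corresponding to the at least one first search strategy.
For the specific implementation of performing the search operation on each text index field, refer to the foregoing embodiment; details are not repeated here.
Step 350: perform a search operation for the query text based on the second search strategy.
The second search strategy corresponds to all text index fields of the search material, and every text index field has the same search weight.
To make the system more robust, a search operation for the query text may also be performed on all text index fields based on the second search strategy. When sorting, the search results of the second search strategy are placed after those of the first search strategies, so that the search does not end with no results recalled.
Step 360: merge and output the search results of all the above search operations.
Merging and outputting the search results may include: sorting all search results obtained with the first search strategies according to the preset policy; ranking the search results obtained with the second search strategy after those obtained with the first search strategies; filtering out duplicate search results that rank later; and outputting the remaining search results. For the method of sorting the search results obtained with the first search strategies, refer to the foregoing embodiment; details are not repeated here.
Step 370: when a preset condition is met, train and update the classifier for identifying the first search strategy based on the search logs corresponding to the second search strategy.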
随着用户使用习惯的改变或者搜索物料的不断增加,第一搜索策略可能会出现无法适应用户搜索需求的问题。在这种情况下,用户可能会频繁选择基于第二搜索策略执行搜索操作返回的搜索结果。这时,则需要基于用户对所展现的搜索结果的选择行为日志,更新第一搜索策略。所述预设条件可包括以下至少一项:达到预设的更新周期,第一点击率与第二点击率的比值小于预设阈值。其中,所述第一点击率为对基于所述第一搜索策略执行搜索操作得到的搜索结果的点击率,所述第二点击率为对基于第二搜索策略执行搜索操作得到的搜索结果的点击率。
所述预设的更新周期可根据搜索物料的更新速度确定,或者人为设定,例如,可以为1个月。用户对基于所述第一搜索策略执行搜索操作得到的搜索结果的第一点击率以及用户对基于所述第二搜索策略执行搜索操作得到的搜索结果的第二点击率,可以通过对搜索服务器的搜索日志进行统计分析获得。
当达到预设的更新周期,或第一点击率与第二点击率的比值小于预设阈值时,可基于第二搜索策略执行搜索操作得到的搜索日志执行步骤300和步骤310,基于搜索日志重复执行训练用于识别第一搜索策略的分类器以及确定第一搜索策略对应的文本索引域和文本索引域匹配的搜索权重的操作,并将训练得到的分类器及第一搜索策略补充至原有第一搜索策略中。
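The trigger condition of step 370 can be sketched as a small predicate. The 30-day period follows the "one month" example in the text; the ratio threshold of 1.0, the function name `should_retrain`, and the guard for a zero second click-through rate are assumptions made for this sketch.

```python
from datetime import date, timedelta


def should_retrain(last_update: date, today: date,
                   first_ctr: float, second_ctr: float,
                   period: timedelta = timedelta(days=30),
                   ratio_threshold: float = 1.0) -> bool:
    """Return True when at least one preset condition holds."""
    # Condition 1: the preset update period has elapsed.
    period_elapsed = today - last_update >= period
    # Condition 2: first CTR / second CTR is below the threshold.
    # Assumed guard: skip the ratio test if fallback results were never clicked.
    ratio_low = second_ctr > 0 and first_ctr / second_ctr < ratio_threshold
    return period_elapsed or ratio_low
```

Both click-through rates would in practice come from statistics over the search server's logs, as described above.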
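Performing a search operation with the second search strategy as well avoids the problem of no results being recalled because of missed matches. At the same time, repeating the training of the classifier for identifying first search strategies on the search results of the second search strategy makes it possible to detect that a first search strategy has become unsuitable because user habits have changed, and to discover new first search strategies in time.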
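Corresponding to the above search method, the embodiments of this application further provide a search apparatus. FIG. 4 is a schematic diagram of the hardware structure of a search apparatus. The search apparatus may include a processor 401 and a non-transitory computer-readable storage medium 402 storing machine-executable instructions. The processor 401 and the non-transitory computer-readable storage medium 402 may communicate via a system bus 403. By reading and executing the machine-executable instructions in the non-transitory computer-readable storage medium 402 that correspond to the search logic, the processor 401 can perform the search method described above. The search apparatus may be a PC, a mobile terminal, a personal digital assistant, a tablet computer, or the like.
The non-transitory computer-readable storage medium 402 mentioned herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the non-transitory computer-readable storage medium may be RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disc (such as an optical disc or DVD), a similar storage medium, or a combination thereof.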
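FIG. 5 is a functional module diagram of the search logic provided by an embodiment of this application. As shown in FIG. 5, divided by function, the search logic may include a first search strategy determining module 510, a search module 520, and a search result output module 530.
The first search strategy determining module 510 is configured to determine at least one first search strategy matching the query text to be searched, where each first search strategy corresponds to at least one first text index field and the search weight matched to that first text index field.
The search module 520 is configured to perform a search operation for the query text based on each first text index field corresponding to each first search strategy determined by the first search strategy determining module 510.
The search result output module 530 is configured to merge and output the search results of all the above search operations.
The search apparatus disclosed in the embodiments of this application determines at least one first search strategy matching the query text, where each first search strategy corresponds to at least one first text index field and the search weight matched to that first text index field; then performs a search operation for the query text based on each text index field corresponding to each first search strategy; and finally merges and outputs the search results of all these search operations. In this way, relatively accurate search results can be obtained for information having multiple text index fields. Because the search operation is performed only in the text index fields associated with the query text, rather than in all text index fields, false recalls caused by literal hits in irrelevant text index fields are avoided, which effectively improves the relevance of the search results. In addition, setting search weights for the different text index fields can effectively improve the accuracy of the search results.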
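In one embodiment, as shown in FIG. 6, the first search strategy determining module 510 includes:
a first determining unit 511, configured to determine at least one first search strategy matching the query text according to a preset correspondence between first search strategies and query texts.
In another embodiment, as shown in FIG. 7, the first search strategy determining module 510 includes:
a second determining unit 512, configured to recognize the query text with each of the pre-trained classifiers used to identify the individual first search strategies, and thereby determine at least one first search strategy matching the query text.
In one embodiment, if the at least one first search strategy matching the query text is determined by the second determining unit 512, then, as shown in FIG. 7, the search logic further includes:
a search strategy classifier training module 540, configured to train the classifiers based on search logs.
In one embodiment, if the at least one first search strategy matching the query text is determined by the second determining unit 512, then, as shown in FIG. 7, the search logic further includes:
a text field and weight determining module 550, configured to determine the first text index fields corresponding to each first search strategy, and the search weight matched to each first text index field.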
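In one embodiment, as shown in FIG. 7, the search strategy classifier training module 540 includes:
a search strategy space definition determining unit 541, configured to cluster the search logs and generate a search strategy space definition, where the search strategy space definition represents the mapping between each first search strategy and the query texts in the search logs;
a training unit 542, configured to obtain, based on the search strategy space definition, the search logs corresponding to each first search strategy, and to train, based on the search logs corresponding to each first search strategy, a classifier for identifying the respective first search strategy.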
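One way to picture the work of training unit 542 is the preparation of one labeled training set per first search strategy from the search strategy space definition. The sketch below assumes a one-vs-rest labeling, in which queries mapped to a strategy are positives and all other logged queries are negatives; that labeling scheme, the function name `build_training_sets`, and the strategy names are assumptions, and the clustering that produces the definition is not shown.

```python
def build_training_sets(strategy_space, logged_queries):
    """For each strategy, label every logged query as positive or negative.

    strategy_space: dict mapping a strategy name to the query texts that the
                    search strategy space definition assigns to it.
    logged_queries: all query texts found in the search logs.
    """
    training_sets = {}
    for strategy, queries in strategy_space.items():
        positives = set(queries)
        # Assumed one-vs-rest labeling: True for this strategy's queries.
        training_sets[strategy] = [(q, q in positives) for q in logged_queries]
    return training_sets


# Hypothetical strategy space with two first search strategies.
space = {"by_name": ["KFC", "Pizza Hut"], "by_location": ["Wudaokou"]}
sets = build_training_sets(space, ["KFC", "Wudaokou", "Pizza Hut"])
```

Each resulting list could then be fed to whatever binary classifier is chosen to identify the corresponding first search strategy.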
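In one embodiment, as shown in FIG. 7, the text field and weight determining module 550 includes:
a log obtaining unit 551, configured to obtain the search logs corresponding to a first search strategy;
a weight calculating unit 552, configured to iteratively compute the average weight of each second text index field for the first search strategy, according to the hit scores, in each second text index field of the search material, of the query texts in the search logs corresponding to the first search strategy. In one embodiment, the weight calculating unit 552 may further be configured to obtain the single-log weight of each second text index field in each search log corresponding to the first search strategy, and to compute, from these single-log weights, the average weight of each second text index field for the first search strategy;
a text field and weight determining unit 553, configured to determine, according to the average weight of each second text index field for the first search strategy, the first text index fields corresponding to the first search strategy and the search weight matched to each first text index field. In one embodiment, the text field and weight determining unit 553 may further be configured to compute, from the average weight of each second text index field for the first search strategy, a normalized weight value of each second text index field for the first search strategy; to determine each second text index field whose normalized weight value is greater than a preset threshold as a first text index field corresponding to the first search strategy; and to determine the normalized weight value of each such first text index field as the search weight matched to it.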
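The averaging, normalization, and thresholding performed by units 552 and 553 can be sketched as follows. The sketch assumes that a single-log weight is already available for each field in each log, that the average is an arithmetic mean, and that normalization divides by the sum of the averages; the function name `select_fields`, the field names, and the threshold of 0.2 are illustrative assumptions.

```python
def select_fields(per_log_weights, threshold=0.2):
    """Return {field: search weight} for fields whose normalized
    average weight exceeds the threshold.

    per_log_weights: one dict per search log, mapping each second text
                     index field to its single-log weight.
    """
    if not per_log_weights:
        return {}
    fields = {f for log in per_log_weights for f in log}
    n = len(per_log_weights)
    # Average weight of each field over all logs of this strategy.
    avg = {f: sum(log.get(f, 0.0) for log in per_log_weights) / n
           for f in fields}
    total = sum(avg.values())
    if total == 0:
        return {}
    # Normalized weight values sum to 1 across all second text index fields.
    norm = {f: w / total for f, w in avg.items()}
    # Fields above the threshold become the strategy's first text index fields.
    return {f: w for f, w in norm.items() if w > threshold}


logs = [
    {"merchant_name": 1.0, "location": 0.0, "dish": 0.2},
    {"merchant_name": 0.8, "location": 0.2, "dish": 0.0},
]
selected = select_fields(logs)
```

In this example only `merchant_name` passes the threshold, so a strategy learned from these logs would search that field alone, using its normalized value as the search weight.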
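Training the first search strategies and their classifiers on search logs, and iteratively computing from the search logs the text index fields corresponding to each first search strategy and the search weight matched to each text index field, fully reflects users' search expectations and effectively improves the accuracy of the search results.
In one embodiment, the search module 520 is specifically configured to:
recall material according to the relevance between the content of each first text index field in the search material and the query text, where the relevance is determined based on the search weight of the first text index field.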
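In one embodiment, as shown in FIG. 8, the search logic further includes:
a supplementary search module 560, configured to perform a search operation for the query text based on a second search strategy, where the second search strategy corresponds to all second text index fields of the search material, and every second text index field has the same search weight.
In one embodiment, as shown in FIG. 8, the search logic further includes:
a search strategy updating module 570, configured to train and update the classifier for identifying the first search strategies based on the search logs corresponding to the second search strategy when a preset condition is met.
In one embodiment, the preset condition includes at least one of the following: a preset update period has elapsed; the ratio of a first click-through rate to a second click-through rate is less than a preset threshold. The first click-through rate is the click-through rate on search results obtained by performing search operations based on the first search strategies, and the second click-through rate is the click-through rate on search results obtained by performing the search operation based on the second search strategy.
Performing a search operation with the second search strategy as well avoids the problem of no results being recalled because of missed matches. At the same time, repeating the training of the classifier for identifying first search strategies on the search results of the second search strategy makes it possible to detect that a first search strategy has become unsuitable because user habits have changed, and to discover new first search strategies in time.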
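This application further discloses a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the search method described in the above embodiments.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Because the apparatus embodiments are basically similar to the method embodiments, they are described briefly; for related details, refer to the corresponding description of the method embodiments.
The search method and apparatus provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is only intended to help in understanding the method of this application and its core idea. Those of ordinary skill in the art may, following the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.
From the description of the above implementations, those skilled in the art can clearly understand that each implementation may be realized by software plus a necessary general-purpose hardware platform, or by hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.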

Claims (15)

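  1. A search method, comprising:
    determining at least one first search strategy matching a query text to be searched, wherein each first search strategy corresponds to at least one first text index field and a search weight matched to the first text index field;
    performing a search operation for the query text based on each first text index field corresponding to each first search strategy; and
    merging and outputting the search results of all the above search operations.
  2. The method according to claim 1, wherein determining the at least one first search strategy matching the query text to be searched comprises:
    determining the at least one first search strategy matching the query text according to a preset correspondence between first search strategies and query texts.
  3. The method according to claim 1, wherein determining the at least one first search strategy matching the query text to be searched comprises:
    recognizing the query text with each of pre-trained classifiers used to identify the individual first search strategies, to determine the at least one first search strategy matching the query text.
  4. The method according to claim 3, further comprising:
    training the classifiers based on search logs.
  5. The method according to claim 4, wherein training the classifiers based on the search logs comprises:
    clustering the search logs to generate a search strategy space definition, wherein the search strategy space definition represents the mapping between each first search strategy and the query texts in the search logs;
    obtaining, based on the search strategy space definition, the search logs corresponding to each first search strategy; and
    training, based on the search logs corresponding to each first search strategy, a classifier for identifying the respective first search strategy.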
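  6. The method according to claim 1, further comprising:
    determining the first text index fields corresponding to each first search strategy and the search weight matched to each first text index field.
  7. The method according to claim 6, wherein determining the first text index fields corresponding to the first search strategy and the search weight matched to each first text index field comprises:
    obtaining the search logs corresponding to the first search strategy;
    iteratively computing the average weight of each second text index field for the first search strategy, according to the hit scores, in each second text index field of the search material, of the query texts in the search logs corresponding to the first search strategy; and
    determining, according to the average weight of each second text index field for the first search strategy, the first text index fields corresponding to the first search strategy and the search weight matched to each first text index field.
  8. The method according to claim 7, wherein iteratively computing the average weight of each second text index field for the first search strategy, according to the hit scores, in each second text index field of the search material, of the query texts in the search logs corresponding to the first search strategy, comprises:
    obtaining the single-log weight of each second text index field in each search log corresponding to the first search strategy; and
    computing, based on the single-log weight of each second text index field in each search log corresponding to the first search strategy, the average weight of each second text index field for the first search strategy.
  9. The method according to claim 7, wherein determining, according to the average weight of each second text index field for the first search strategy, the first text index fields corresponding to the first search strategy and the search weight matched to each first text index field comprises:
    computing, based on the average weight of each second text index field for the first search strategy, a normalized weight value of each second text index field for the first search strategy;
    determining each second text index field whose normalized weight value is greater than a preset threshold as a first text index field corresponding to the first search strategy; and
    determining the normalized weight value corresponding to the first text index field as the search weight matched to the first text index field.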
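  10. The method according to claim 1, wherein performing the search operation for the query text based on each first text index field corresponding to each first search strategy comprises:
    recalling material according to the relevance between the content of each first text index field in the search material and the query text, wherein the relevance is determined based on the search weight of the first text index field.
  11. The method according to claim 1, further comprising:
    performing a search operation for the query text based on a second search strategy, wherein the second search strategy corresponds to all second text index fields of the search material, and every second text index field has the same search weight.
  12. The method according to claim 11, further comprising:
    when a preset condition is met, training and updating the classifier for identifying the first search strategy based on the search logs corresponding to the second search strategy.
  13. The method according to claim 12, wherein the preset condition comprises at least one of the following:
    a preset update period has elapsed; and
    the ratio of a first click-through rate to a second click-through rate is less than a preset threshold, wherein the first click-through rate is the click-through rate on search results obtained by performing search operations based on the first search strategy, and the second click-through rate is the click-through rate on search results obtained by performing the search operation based on the second search strategy.
  14. A search apparatus, comprising:
    a processor; and
    a non-transitory computer-readable storage medium;
    wherein the non-transitory computer-readable storage medium stores machine-executable instructions executable by the processor, and the machine-executable instructions cause the processor to perform the search method according to any one of claims 1 to 13.
  15. A non-transitory computer-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to perform the search method according to any one of claims 1 to 13.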
PCT/CN2017/115680 2017-03-31 2017-12-12 搜索方法、装置及非临时性计算机可读存储介质 WO2018176913A1 (zh)

Priority Applications (6)

Application Number Priority Date Filing Date Title
EP17903012.7A EP3608799A4 (en) 2017-03-31 2017-12-12 RESEARCH PROCESS AND APPARATUS, AND INFORMATION MEDIA READABLE BY NON-TEMPORARY COMPUTER
JP2020502745A JP2020512651A (ja) 2017-03-31 2017-12-12 検索方法、装置及び非一時的コンピュータ読取可能記憶媒体
US16/499,858 US11144594B2 (en) 2017-03-31 2017-12-12 Search method, search apparatus and non-temporary computer-readable storage medium for text search
CA3059929A CA3059929C (en) 2017-03-31 2017-12-12 Text searching method, apparatus, and non-transitory computer-readable storage medium
KR1020197032313A KR20190128246A (ko) 2017-03-31 2017-12-12 검색 방법 및 장치 및 비-일시적 컴퓨터-판독가능 저장 매체
SG11201909119Y SG11201909119YA (en) 2017-03-31 2017-12-12 Search method and apparatus and non-temporary computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710209677.XA CN108664515B (zh) 2017-03-31 2017-03-31 一种搜索方法及装置,电子设备
CN201710209677.X 2017-03-31

Publications (1)

Publication Number Publication Date
WO2018176913A1 true WO2018176913A1 (zh) 2018-10-04

Family

ID=63674133

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/115680 WO2018176913A1 (zh) 2017-03-31 2017-12-12 搜索方法、装置及非临时性计算机可读存储介质

Country Status (8)

Country Link
US (1) US11144594B2 (zh)
EP (1) EP3608799A4 (zh)
JP (1) JP2020512651A (zh)
KR (1) KR20190128246A (zh)
CN (1) CN108664515B (zh)
CA (1) CA3059929C (zh)
SG (1) SG11201909119YA (zh)
WO (1) WO2018176913A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256070B (zh) * 2018-01-17 2022-07-15 北京百度网讯科技有限公司 用于生成信息的方法和装置
CN111897807A (zh) * 2020-07-01 2020-11-06 拉扎斯网络科技(上海)有限公司 一种数据处理方法以及策略引擎系统
CN111984689B (zh) 2020-08-21 2023-07-25 北京百度网讯科技有限公司 信息检索的方法、装置、设备以及存储介质
CN112989164B (zh) * 2021-03-26 2023-11-03 北京金堤征信服务有限公司 搜索结果处理方法、装置及电子设备
CN113032549B (zh) * 2021-05-31 2021-09-10 北京明略昭辉科技有限公司 一种文档排序方法、装置、电子设备及存储介质
CN116776869A (zh) * 2023-06-30 2023-09-19 荣耀终端有限公司 文档评分方法和电子设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270222A (zh) * 2010-06-03 2011-12-07 微软公司 使用搜索策略确定搜索结果
CN104462143A (zh) * 2013-09-24 2015-03-25 高德软件有限公司 连锁品牌词词库、类别词词库建立方法和装置
CN105488113A (zh) * 2015-11-23 2016-04-13 百度在线网络技术(北京)有限公司 论文的搜索方法、装置及搜索引擎
CN105955991A (zh) * 2016-04-19 2016-09-21 乐视控股(北京)有限公司 一种搜索结果聚合及定位的方法和装置
US20170068712A1 (en) * 2015-09-04 2017-03-09 Palantir Technologies Inc. Systems and methods for database investigation tool

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6480843B2 (en) 1998-11-03 2002-11-12 Nec Usa, Inc. Supporting web-query expansion efficiently using multi-granularity indexing and query processing
US6438539B1 (en) 2000-02-25 2002-08-20 Agents-4All.Com, Inc. Method for retrieving data from an information network through linking search criteria to search strategy
GB2449501A (en) 2007-05-25 2008-11-26 Univ Sheffield Searching method and system
JP2010237721A (ja) 2007-07-02 2010-10-21 Nec Corp 検索システム、検索方法および検索用プログラム
KR100898458B1 (ko) 2007-08-10 2009-05-21 엔에이치엔(주) 정보 검색 방법 및 그 시스템
US7945571B2 (en) * 2007-11-26 2011-05-17 Legit Services Corporation Application of weights to online search request
WO2009107628A1 (ja) 2008-02-27 2009-09-03 日本電気株式会社 検索システム、検索方法およびプログラム
CN102236663B (zh) 2010-04-30 2014-04-09 阿里巴巴集团控股有限公司 一种基于垂直搜索的查询方法、系统和装置
US9152674B2 (en) * 2012-04-27 2015-10-06 Quixey, Inc. Performing application searches
US8983991B2 (en) 2012-07-27 2015-03-17 Facebook, Inc. Generating logical expressions for search queries
US9384244B1 (en) * 2012-11-28 2016-07-05 BloomReach Inc. Search with autosuggest and refinements
US9727595B2 (en) * 2013-09-20 2017-08-08 Uber Technologies, Inc. Location searching with category indices
JP6167029B2 (ja) 2013-12-02 2017-07-19 株式会社Nttドコモ レコメンド情報生成装置およびレコメンド情報生成方法
CN104063497B (zh) 2014-07-04 2018-03-06 百度在线网络技术(北京)有限公司 观点处理方法和装置以及搜索方法和装置
CN105335391B (zh) * 2014-07-09 2019-02-15 阿里巴巴集团控股有限公司 基于搜索引擎的搜索请求的处理方法和装置
US10049208B2 (en) * 2015-12-03 2018-08-14 Bank Of America Corporation Intrusion assessment system
US10146815B2 (en) * 2015-12-30 2018-12-04 Oath Inc. Query-goal-mission structures

Also Published As

Publication number Publication date
US20200110778A1 (en) 2020-04-09
CN108664515B (zh) 2019-09-17
KR20190128246A (ko) 2019-11-15
SG11201909119YA (en) 2019-10-30
CA3059929C (en) 2023-08-29
EP3608799A1 (en) 2020-02-12
JP2020512651A (ja) 2020-04-23
US11144594B2 (en) 2021-10-12
CA3059929A1 (en) 2018-10-04
EP3608799A4 (en) 2020-11-04
CN108664515A (zh) 2018-10-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17903012

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 3059929

Country of ref document: CA

ENP Entry into the national phase

Ref document number: 2020502745

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20197032313

Country of ref document: KR

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2017903012

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2017903012

Country of ref document: EP

Effective date: 20191031