WO2020233344A1 - A search method, device and storage medium - Google Patents

A search method, device and storage medium

Info

Publication number
WO2020233344A1
Authority
WO
WIPO (PCT)
Prior art keywords
current search
sample data
features
search
text
Prior art date
Application number
PCT/CN2020/086677
Other languages
English (en)
French (fr)
Inventor
刘利
Original Assignee
深圳壹账通智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2020233344A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • This application relates to the field of big-data processing, and in particular to a search method, device, and storage medium.
  • Personalized search refers to customizing search results for a user based on the user's previous search records. Personalized search can provide search results for the user's next search based on the user's historical search records, browsing history, clicks, or interaction behaviors. In this process, traditional personalized search methods require manually extracting features suited to personalized search behavior, which takes considerable time and extensive domain experience.
  • The embodiments of this application provide a search method, device, and storage medium that obtain current search features by inputting the current search keyword into a target model, and then classify and sort the search features to determine the search results. This avoids the process of manually obtaining search features, simplifies the search process, and improves search efficiency.
  • the first aspect of the embodiments of the present application provides a search method, which includes:
  • the multiple current search features are classified and sorted using a two-class classification algorithm, and the current search result corresponding to the current search keyword is determined according to the ranking of the multiple current search features.
  • a second aspect of the embodiments of the present application provides a search device, and the search device includes:
  • the training unit is used to train the sample data set in combination with the convolutional neural network to obtain the target model, and the sample data set is obtained according to the historical search records of the current searching user;
  • An obtaining unit, configured to obtain a current search keyword, input the current search keyword into the target model, and obtain multiple current search features;
  • the search unit is configured to classify and sort the multiple current search features using a two-class classification algorithm, and determine the current search result corresponding to the current search keyword according to the ranking of the multiple current search features.
  • the third aspect of the embodiments of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs.
  • the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for executing a search method, wherein the method includes:
  • the multiple current search features are classified and sorted using a two-class classification algorithm, and the current search result corresponding to the current search keyword is determined according to the ranking of the multiple current search features.
  • The fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement a search method, wherein the method includes:
  • the multiple current search features are classified and sorted using a two-class classification algorithm, and the current search result corresponding to the current search keyword is determined according to the ranking of the multiple current search features.
  • The search method and device provided in the embodiments of this application first train the sample data set in combination with the convolutional neural network to obtain the target model; then obtain the current search keyword and input it into the target model to obtain multiple current search features; and finally classify and sort the multiple current search features using a two-class classification algorithm, determining the current search result corresponding to the current search keyword according to the ranking of the features. Because the sample data set is obtained from the historical search records of the current searching user, the target model is a personalized target model corresponding to that user.
  • FIG. 1A is a schematic flowchart of a search method provided by an embodiment of this application.
  • FIG. 1B is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of another search method provided by an embodiment of this application.
  • FIG. 3 is a schematic flowchart of another search method provided by an embodiment of the application.
  • FIG. 4 is a schematic flowchart of another search method provided by an embodiment of the application.
  • FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • FIG. 6 is a structural block diagram of a search device provided by an embodiment of this application.
  • FIG. 1A is a schematic flowchart of a search method in an embodiment of this application.
  • the search method includes:
  • When a user searches, a series of historical search records is generated, including the search keywords entered on the search interface, the search results obtained for those keywords, the probability and number of times the user clicks on search pages, the routes of links the user clicks, the time spent on each page, and whether interactive behaviors such as purchases or comments occur on the page. These historical search records are collected and then preprocessed and vectorized so that they can serve as a sample data set for convolutional neural network operations, training the target model for subsequent search tasks.
  • FIG. 1B is a schematic structural diagram of a convolutional neural network provided by an embodiment of the application.
  • The back-propagation algorithm and the stochastic gradient descent method are used: according to the loss value from forward propagation, back-propagation is performed iteratively to update the weights of each layer until the loss value of the model converges, at which point training stops and the deep learning model is obtained.
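The loop just described (forward-pass loss, gradient update, stop at convergence) can be sketched in miniature. The sketch below uses a toy linear model and invented data in place of the application's convolutional network, purely to illustrate the train-until-the-loss-converges pattern:

```python
# Minimal sketch of the train-until-convergence loop described above:
# compute a forward-pass loss, update weights by gradient descent, and
# stop when the loss value stabilizes. A toy linear model stands in for
# the convolutional network; all data here is invented for illustration.

def train_until_converged(xs, ys, lr=0.01, tol=1e-8, max_iters=10000):
    w, b = 0.0, 0.0
    prev_loss = float("inf")
    for _ in range(max_iters):
        # Forward pass: mean squared error over the sample set.
        preds = [w * x + b for x in xs]
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        if abs(prev_loss - loss) < tol:  # loss has converged -> stop training
            break
        prev_loss = loss
        # Backward pass: gradients of the loss w.r.t. w and b.
        gw = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        gb = sum(2 * (p - y) for p, y in zip(preds, ys)) / len(xs)
        w -= lr * gw
        b -= lr * gb
    return w, b, loss

# Toy data following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, final_loss = train_until_converged(xs, ys)
```

A real implementation would replace the linear forward pass with the convolution, pooling, and fully connected layers described below, but the stopping criterion on the loss value is the same.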
  • the convolutional neural network includes a convolutional layer, a pooling layer, and a fully connected layer.
  • The parameters of the convolutional neural network are obtained, yielding the target model, which is the target model corresponding to the sample data set.
  • the sample data set is the user's historical search records.
  • the historical input keywords are used as input values, and the historical search results are used as output values.
  • the historical search behavior is combined with the convolutional neural network for feature extraction, and then the target model corresponding to the historical search records is trained.
  • Before combining the sample data set with the convolutional neural network for training, the method further includes: collecting sample data according to the historical search records of the current searching user,
  • where the historical search records include historical input keywords, historical search behaviors, and historical search results; preprocessing the sample data to obtain clean sample data; and vectorizing the clean sample data to obtain text vector data, the text vector data forming the sample data set.
  • Standardization involves preprocessing and vectorization. Because webpages are involved in the search process, and webpages contain a large amount of text, the preprocessing and vectorization of text are required.
  • Text preprocessing methods include word segmentation, stop-word removal, low-frequency-word filtering, encoding normalization, and so on. Text vectorization represents the text with a vector space model (VSM) or a probabilistic statistical model so that a computer can process it; the methods used are based on set-theoretic models, algebraic models, frequency-statistical models, and so on.
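As an illustration of the preprocessing steps named above (word segmentation, stop-word removal, low-frequency-word filtering, encoding normalization), here is a minimal sketch; the stop-word list, the frequency threshold, and the naive regex-based segmentation are illustrative assumptions, not details from the application:

```python
# Sketch of the text preprocessing steps named above: word segmentation
# (here, naive regex token splitting), encoding normalization
# (lowercasing), stop-word removal, and low-frequency-word filtering.
# The stop-word list and the min_freq threshold are illustrative.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "is"}

def preprocess(docs, min_freq=2):
    # Word segmentation plus encoding normalization (lowercasing).
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    # Stop-word removal.
    tokenized = [[t for t in toks if t not in STOP_WORDS] for toks in tokenized]
    # Filter words appearing fewer than min_freq times across the corpus.
    freq = Counter(t for toks in tokenized for t in toks)
    return [[t for t in toks if freq[t] >= min_freq] for toks in tokenized]

docs = ["The search of the page", "search results page", "a rare word"]
clean = preprocess(docs)
```

For Chinese text, the word segmentation step would use a dedicated segmenter rather than whitespace or regex splitting.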
  • Historical search behaviors are also quantified: for example, the search validity of a webpage is determined according to the probability and number of times the user clicks on the search page, and interactive behaviors such as comments or payments are used to determine the user's satisfaction with the webpage.
  • Vectorizing the clean sample data to obtain the text vector data includes: obtaining the sample text corresponding to the clean sample data; using text weighting technology to perform weighted statistics on the sample text to obtain a sample text score; and using the sample text score as the text vector data.
  • clean sample data means sample data from which invalid data is removed after preprocessing. Assuming that the sample data is a web page, the invalid data includes invalid browsing web pages, invalid text in the web pages, etc.
  • A user may enter one search keyword but visit many webpages through multiple link clicks. These webpages are vectorized, features are extracted, and the extracted features are ranked. On the user's next search, the webpage corresponding to the highest-ranked feature can be returned directly as the search result for the entered keyword, which improves the user's search efficiency.
  • the clean sample data is a web page, and a web page corresponds to a sample text.
  • the method of vectorizing the web page includes vectorizing the sample text, which can be realized by text mining.
  • The text weighting technology is TF-IDF (term frequency - inverse document frequency), where IDF is the inverse document frequency index.
  • The word frequency (TF) represents the frequency with which a word appears in the sample text, denoted as: tf(i,j) = n(i,j) / Σ_k n(k,j),
  • where n(i,j) represents the number of occurrences of the i-th word in document j,
  • and n(k,j) represents the number of occurrences of the k-th word in document j, so the denominator is the total number of word occurrences in document j.
  • The inverse document frequency index is a measure of the universal importance of a word, recorded as: idf(i) = log(|D| / |{j : the i-th word appears in document j}|), where |D| is the total number of documents.
  • The sample text score is the TF-IDF value, recorded as: tfidf(i,j) = tf(i,j) × idf(i).
  • sample text score is used as the text vector data corresponding to the clean sample data to complete the vectorization of the clean sample data.
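The TF-IDF scoring described above can be sketched as follows; the toy tokenized documents are invented for illustration:

```python
# Sketch of the TF-IDF weighting described above: tf(i,j) is the count
# of word i in document j divided by the total word count of j, and
# idf(i) is log(|D| / number of documents containing word i). The toy
# documents are invented for illustration.
import math
from collections import Counter

def tf_idf(docs):
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            w: (c / total) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return scores

docs = [["search", "page", "search"], ["page", "rank"], ["search", "rank"]]
scores = tf_idf(docs)
```

Each resulting dictionary is the per-document score vector; a word that is frequent in one document but rare across the corpus gets the highest weight.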
  • Vectorizing the clean sample data to obtain text vector data includes: obtaining the link relationships of the webpages corresponding to the clean sample data, the link relationships including the link objects or the number of links; using a webpage ranking algorithm to calculate the ranking value of the webpage corresponding to the clean sample data; and using the ranking value as the text vector data.
  • When the user enters a search keyword, the search engine returns search results. The importance of each webpage clicked by the user differs: the number of pages referencing a page and the importance of those referencing pages vary, so the importance of each page also differs.
  • the page ranking algorithm is the PageRank algorithm
  • the page vector value obtained according to the algorithm is the ranking value, that is, the PageRank value.
  • The formula for calculating the PageRank value of a webpage is as follows: PR(pi) = (1 - a)/N + a × Σ_{pj ∈ M(pi)} PR(pj)/L(pj), where:
  • PR(pi) represents the PageRank value of webpage pi,
  • M(pi) is the set of all webpages that have links to webpage pi,
  • L(pj) is the number of outbound links of webpage pj,
  • N is the total number of webpages,
  • and a is the damping coefficient.
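The iterative computation of the PageRank values defined above can be sketched as follows; the three-page link graph and the fixed iteration count are invented for illustration:

```python
# Sketch of the PageRank computation described above, with damping
# coefficient a: PR(pi) = (1 - a)/N + a * sum(PR(pj)/L(pj)) over the
# pages pj that link to pi. The three-page link graph is invented.
def pagerank(links, a=0.85, iters=100):
    # links maps each page to the list of pages it links to.
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # uniform initial ranks
    for _ in range(iters):
        new = {}
        for p in pages:
            # Sum PR(q)/L(q) over every page q that links to p.
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - a) / n + a * incoming
        pr = new
    return pr

# A links to B and C; B links to C; C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
```

With these links, page C accumulates the most importance because it is referenced by both A and B; the resulting values are the ranking values used as text vector data.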
  • Vectorizing the clean sample data to obtain text vector data further includes: determining the search validity of the webpage according to data about the user's clicks on the webpage, the data including the click probability and stay time; representing the search validity with a binary method; and using the binary result as the text vector data.
  • Vectorizing the clean sample data to obtain text vector data further includes: determining page satisfaction according to the user's operation data on the search page, the operation data including page clicks, page comments, or page payments.
  • The data about the user clicking on the webpage includes the page click probability p and the stay time. Given a probability threshold p1:
  • when p < p1, y1 = 0;
  • when p ≥ p1, y1 = 1.
  • the user's operation data on the search page includes page clicks, page comments or page payment.
  • the number of page clicks is s1
  • the number of page comments is s2
  • the number of page payments is s3.
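A minimal sketch of turning these quantities into binary vector components follows; the threshold p1 = 0.5 and the rule that any comment or payment marks the page as satisfying are assumptions made for illustration, since the application does not fix them:

```python
# Sketch of the binary representation described above: the click
# probability p is compared against a threshold p1 to produce the 0/1
# search-validity bit y1, and the operation counts s1 (page clicks),
# s2 (page comments), and s3 (page payments) produce a 0/1 page
# satisfaction bit. The threshold value and the "any comment or
# payment" satisfaction rule are illustrative assumptions.
def search_validity(p, p1=0.5):
    # y1 = 1 when the click probability reaches the threshold, else 0.
    return 1 if p >= p1 else 0

def page_satisfaction(s1, s2, s3):
    # Satisfied (1) if the user commented on or paid on the page, else 0.
    return 1 if (s2 > 0 or s3 > 0) else 0

features = [search_validity(0.7), search_validity(0.2),
            page_satisfaction(s1=3, s2=1, s3=0),
            page_satisfaction(s1=1, s2=0, s3=0)]
```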
  • Model training can be performed directly in combination with a convolutional neural network based on each kind of text vector data described above, or in combination with all of the text vector data described above. All of the above text vector data can form a text vector, as shown in Table 1:
  • Table 1 Text vector table corresponding to webpage
  • The text vector corresponding to the webpage is obtained by vectorizing the user's historical search records, and the text vector is then used as a sample data set combined with the convolutional neural network for training to obtain the target model. Numericalizing the model training process in this way improves the efficiency and reliability of model training.
  • The convolutional layer needs to perform feature extraction. The extracted features are various values representing the text content, the correlation relationships between those values, numerical weights, and so on.
  • For each input sample datum, its corresponding feature can be obtained; for an input current search keyword, after matching with the historical input keywords corresponding to the sample data set, the features corresponding to the sample data set can be obtained as the features of the current search keyword.
  • Inputting the current search keyword into the target model to obtain multiple current search features includes: performing semantic analysis on the current search keyword to obtain at least one target word segment; inputting the multiple target word segments into the target model to perform one or more convolution operations; and obtaining the features of the last convolution operation as the multiple current search features corresponding to the multiple target word segments.
  • The current search keyword entered by the user may be a long sentence, and the historical input keywords may not include content that directly matches the long sentence. Therefore, semantic analysis of the long sentence is required, and the long sentence is split into multiple target word segments.
  • The target word segments are input into the target model in turn to match the text corresponding to the historical input keywords and obtain the corresponding text features; the text features corresponding to all target word segments together constitute the text features corresponding to the current search keyword.
  • The target model may include multiple convolutional layers, each corresponding to a convolution operation. The convolution operation performed by the last convolutional layer yields the highest-level text features, so its output is used as the final current search features.
  • The two-class classification algorithm is a support vector machine (SVM) algorithm.
  • Classifying and sorting the multiple current search features with the two-class classification algorithm includes: combining every two of the multiple current search features in pairs to obtain multiple feature groups; using the SVM algorithm to score and sort each of the multiple feature groups and determine the relative size of the two current search features in each feature group; and determining the order of the multiple current search features according to the relative sizes of the two current search features in each feature group.
  • Multiple current search features can be combined in pairs to obtain multiple feature groups. Assuming the two current search features in a feature group are dj and di: when dj > di, the group is assigned a value of 1, and otherwise a value of -1. The sorting problem over multiple current search features is thus transformed into a classification problem.
  • The support vector machine (SVM) is widely used in binary classification problems; it can perform the size comparison for each feature group and thereby complete the ordering of all current search features.
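The reduction from ranking to pairwise binary classification described above can be sketched as follows; here the pairwise +1/-1 labels come directly from comparing scalar scores, standing in for the decisions a trained SVM classifier would produce:

```python
# Sketch of the ranking-to-classification reduction described above:
# every pair (dj, di) of current search features is labeled +1 if
# dj > di and -1 otherwise, and a full ordering is recovered from the
# pairwise labels by counting wins. In the application the labels
# would come from a trained SVM rather than direct comparison.
from itertools import combinations

def rank_from_pairs(features):
    # Build the pairwise +1/-1 label for every feature group.
    labels = {(j, i): (1 if features[j] > features[i] else -1)
              for j, i in combinations(range(len(features)), 2)}
    # Count wins per feature: +1 means j beats i, -1 means i beats j.
    wins = [0] * len(features)
    for (j, i), label in labels.items():
        wins[j if label == 1 else i] += 1
    # Sort feature indices by win count, highest first.
    return sorted(range(len(features)), key=lambda k: wins[k], reverse=True)

order = rank_from_pairs([0.2, 0.9, 0.5])
top_k = order[:1]  # TOP-K selection, here with K = 1
```

The `top_k` indices correspond to the target features whose associated historical search results are returned as the recommendation, as described next.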
  • After the sizes and ranking of all current search features are determined, the top-ranked TOP-K features are selected as target features, and the output values corresponding to the target features, i.e., the historical search results, are obtained as the search result recommendations corresponding to the current search keyword.
  • K in TOP-K can be 1 or another positive integer; that is, there can be one search result or multiple search results.
  • In this method, the sample data set is first trained in combination with the convolutional neural network to obtain the target model; then the current search keyword is obtained and input into the target model to obtain multiple current search features; finally, the multiple current search features are classified and sorted using a two-class classification algorithm, and the current search result corresponding to the current search keyword is determined according to the ranking of the features. Because the sample data set is obtained from the historical search records of the current searching user, the target model is a personalized target model corresponding to that user.
  • FIG. 2 is a schematic flowchart of another search method provided by an embodiment of the present application. As shown in FIG. 2, the search method in this embodiment includes:
  • The sample text score is used as the text vector data of the clean sample data, and the text vector data constitutes the sample data set.
  • The historical search records of the current searching user are obtained, the historical search records are preprocessed, and the webpage text in the historical search records is weighted and counted by the TF-IDF method to obtain the text vector data corresponding to the webpage text. The text vector data is then used as a sample data set for convolutional neural network training to obtain the target model and make search recommendations.
  • Vectorizing the webpage text through the TF-IDF method and determining the importance of the webpage text through word-frequency statistics improves the pertinence and reliability of target model construction.
  • FIG. 3 is a schematic flowchart of another search method provided by an embodiment of the present application. As shown in FIG. 3, the search method in this embodiment includes:
  • the multiple current search features are classified and sorted using a two-class classification algorithm, and the current search result corresponding to the current search keyword is determined according to the ranking of the multiple current search features.
  • the historical search records of the current search user are obtained, the historical search records are preprocessed, and the importance of the webpage text in the historical search records is scored by the PageRank algorithm to obtain the text vector data corresponding to the webpage text. Then use the text vector data as a sample data set for convolutional neural network training to obtain the target model and perform search recommendations.
  • The PageRank algorithm is used to vectorize the webpage text, and the importance of the webpage text is determined through the link relationships of the webpages, which improves the pertinence and reliability of target model construction.
  • FIG. 4 is a schematic flowchart of another search method provided by an embodiment of the present application. As shown in FIG. 4, the search method in this embodiment includes:
  • The current search keyword is input into the target model for feature extraction. Before being input into the target model, the current search keyword is semantically analyzed and split to obtain at least one target word segment; the target word segments are then input into the target model in turn, and the features corresponding to the historical input keywords used to construct the target model are obtained as the current search features corresponding to the current search keyword. The SVM algorithm is used to score and sort the current search features to obtain the search recommendation results.
  • The current search keyword is split and the features of the corresponding historical input keywords are obtained in turn to form the current search features, which improves the accuracy of current search feature extraction.
  • The SVM algorithm is used to score and sort the current search features to obtain the recommendation results, which eliminates the steps of manually obtaining features, improves the efficiency of obtaining search recommendation results, and further improves search efficiency.
  • Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • The electronic device includes a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for executing the following steps:
  • the multiple current search features are classified and sorted using a two-class classification algorithm, and the current search result corresponding to the current search keyword is determined according to the ranking of the multiple current search features.
  • The electronic device first trains the sample data set in combination with the convolutional neural network to obtain the target model; then obtains the current search keyword and inputs it into the target model to obtain multiple current search features; and finally classifies and sorts the multiple current search features using a two-class classification algorithm, determining the current search result corresponding to the current search keyword according to the ranking of the features. Because the sample data set is obtained from the historical search records of the current searching user, the target model is a personalized target model corresponding to that user.
  • the input of the current search keyword into the target model to obtain multiple current search features includes:
  • the feature of the last convolution operation is acquired as the multiple current search features corresponding to the multiple target word segmentation.
  • The two-class classification algorithm is a support vector machine (SVM) algorithm.
  • the order of the multiple current search features is determined according to the relative size of the two current search features in each feature group.
  • the method before combining the sample data set with the convolutional neural network for training, the method further includes:
  • Collecting sample data according to the historical search records of the current searching user, the historical search records including historical input keywords, historical search behaviors, and historical search results;
  • the clean sample data is vectorized to obtain text vector data, and the text vector data constitutes the sample data set.
  • the vectorization of the clean sample data to obtain text vector data includes:
  • the sample text score is used as the text vector data of the clean sample data.
  • the vectorization of the clean sample data to obtain text vector data includes:
  • FIG. 6 is a block diagram of the functional unit composition of the search device 600 involved in an embodiment of the present application.
  • the search device 600 is applied to a business system.
  • the business system includes a server and a client.
  • the search device includes:
  • the training unit 601 is configured to train a sample data set in combination with a convolutional neural network to obtain a target model, and the sample data set is obtained according to the historical search records of the currently searching user;
  • the obtaining unit 602 is configured to obtain a current search keyword, and input the current search keyword into the target model to obtain multiple current search features;
  • the searching unit 603 is configured to classify and sort the multiple current search features using a two-class classification algorithm, and determine the current search result corresponding to the current search keyword according to the ranking of the multiple current search features.
  • The search device first trains the sample data set in combination with the convolutional neural network to obtain the target model; then obtains the current search keyword and inputs it into the target model to obtain multiple current search features; and finally classifies and sorts the multiple current search features using a two-class classification algorithm, determining the current search result corresponding to the current search keyword according to the ranking of the features. Because the sample data set is obtained from the historical search records of the current searching user, the target model is a personalized target model corresponding to that user.
  • the acquiring unit 602 is specifically configured to:
  • the feature of the last convolution operation is acquired as the multiple current search features corresponding to the multiple target word segmentation.
  • The two-class classification algorithm is a support vector machine (SVM) algorithm.
  • the searching unit 603 is specifically configured to:
  • the order of the multiple current search features is determined according to the relative size of the two current search features in each feature group.
  • the search device further includes a data processing unit 604, specifically configured to:
  • Collecting sample data according to the historical search records of the current searching user, the historical search records including historical input keywords, historical search behaviors, and historical search results;
  • the clean sample data is vectorized to obtain text vector data, and the text vector data constitutes the sample data set.
  • the data processing unit 604 is further specifically configured to:
  • the sample text score is used as the text vector data of the clean sample data.
  • the data processing unit 604 is further specifically configured to:
  • The embodiments of the present application also provide a computer storage medium, which may be a volatile or non-volatile medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program causes the computer to perform at least the following steps:
  • the multiple current search features are classified and sorted using a two-class classification algorithm, and the current search result corresponding to the current search keyword is determined according to the ranking of the multiple current search features.
  • the embodiments of the present application also provide a computer program product.
  • the above-mentioned computer program product includes a non-transitory computer-readable storage medium storing a computer program.
  • The above-mentioned computer program is operable to cause a computer to execute part or all of the steps of any method described in the above method embodiments.
  • the computer program product may be a software installation package, and the above-mentioned computer includes a mobile terminal.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • The division of the above-mentioned units is only a logical function division; in actual implementation there may be other divisions, for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the above integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable memory.
  • The technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes a number of instructions to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present application.
  • the aforementioned memory includes: U disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.
  • the program can be stored in a computer-readable memory, and the memory can include: flash disk , ROM, RAM, magnetic disk or CD, etc.


Abstract

一种搜索方法和装置,其中搜索方法包括:将样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据当前搜索用户的历史搜索记录获取(101);获取当前搜索关键字,将所述当前搜索关键字输入目标模型中,得到多个当前搜索特征(102);将多个所述当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果(103)。采用上述方法和装置能够通过将当前搜索关键字输入目标模型中,获得当前搜索特征,然后对搜索特征进行分类排序,确定搜索结果,这个过程中避免了手动获取搜索特征的过程,简化了搜索过程,提升了搜索效率。

Description

一种搜索方法、装置及存储介质
本申请要求于2019年5月21日提交中国专利局、申请号为201910421974.X,发明名称为“一种搜索方法、装置及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及大数据的数据处理领域,具体涉及一种搜索方法、装置及存储介质。
背景技术
个性化搜索是指基于用户之前的搜索记录为其定制搜索结果。通过个性化搜索可以根据用户的历史搜索记录、历史浏览情况、点击情况或交互行为,为用户的下一次搜索行为提供搜索结果。在这个过程中,采用传统的个性化搜索方法,
需要手动提取适合个性化搜索行为的特征,发明人意识到针对不同领域的搜索行为特征的提取需要花费大量的时间,并且需要大量的相关领域的经验,这使得在获取搜索结果时需要耗费大量的时间成本和运算成本,亟待发现一种优化方法以便更高效地进行个性化搜索。
发明内容
本申请实施例提供一种搜索方法、装置及存储介质,能够通过将当前搜索关键字输入目标模型中,获得当前搜索特征,然后对搜索特征进行分类排序,确定搜索结果,这个过程中避免了手动获取搜索特征的过程,简化了搜索过程,提升了搜索效率。
本申请实施例的第一方面提供了一种搜索方法,所述方法包括:
将样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据当前搜索用户的历史搜索记录获取;
获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;
将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
本申请实施例的第二方面提供了一种搜索装置,所述搜索装置包括:
训练单元,用于将样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据当前搜索用户的历史搜索记录获取;
获取单元,用于获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;
搜索单元,用于将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
本申请实施例第三方面提供了一种电子装置,包括处理器、存储器、通信接口,以及一个或多个程序,所述一个或多个程序被存储在所述存储器中,并且被配置由所述处理器执行,所述程序包括用于执行一种搜索方法的指令,其中,所述方法包括:
将样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据当前搜索用户的历史搜索记录获取;
获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;
将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
本申请实施例第四方面提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其中,所述计算机程序被处理器执行以实现一种搜索方法,其中,所述方法包括:
将样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据当前搜索用户的历史搜索记录获取;
获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;
将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
本申请实施例中提供的搜索方法和装置,首先将样本数据集结合卷积神经网络进行训练,获得目标模型;然后获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;最后将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。因为所述样本数据集是根据当前搜索用户的历史搜索记录获取的,因此目标模型是当前搜索用户对应的个性化目标模型,将当前搜索关键字输入目标模型后,可以自动提取出多个当前搜索特征而不需要人工参与,简化了特征获取的步骤,提升了特征获取的效率,最后将多个当前搜索特征进行分类排序获得当前搜索结果,提升了搜索效率。
附图说明
图1A为本申请实施例提供的一种搜索方法流程示意图;
图1B本申请实施例提供的一种卷积神经网络结构示意图;
图2为本申请实施例提供的另一种搜索方法的流程示意图;
图3为本申请实施例提供的另一种搜索方法的流程示意图;
图4为本申请实施例提供的另一种搜索方法的流程示意图;
图5为本申请实施例提供的一种电子装置的结构示意图;
图6为本申请实施例提供的一种搜索装置的结构框图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置展示该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
下面对本申请实施例进行详细介绍。
请参阅图1A,图1A为本申请实施例中一种搜索方法流程示意图,如图1A所示,所述搜索方法包括:
101、将样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据当前搜索用户的历史搜索记录获取。
用户在进行搜索的过程中,将产生一系列的历史搜索记录,包括用户在搜索界面输入的搜索关键字,根据搜索关键字获得的搜索结果,用户点击搜索页面的概率、次数,用户点击链接的路线,在每个页面停留的时间,在页面是否进行消费、评论等互动行为等。收集这些历史搜索记录,然后进行预处理或向量化,即可作为样本数据集,用于进行卷积神经网络运算,训练出目标模型,用于进行后续的搜索任务。
请参阅图1B,图1B为本申请实施例提供的一种卷积神经网络的结构示意图,卷积神经网络的训练过程中,采用反向传播算法和随机梯度下降方法,根据前向传播的loss值的大小,来进行反向传播迭代更新每一层的权重,直到模型的loss值趋向于收敛时,停止训练模型,得到深度学习模型。如图1B所示,卷积神经网络中包括卷积层、池化层和全连接层,经过不同层对样本数据集的训练,获得卷积神经网络中的参数,得出目标模型,即为样本数据集对应的目标模型。样本数据集为用户历史搜索记录,将历史输入关键字作为输入值,将历史搜索结果作为输出值,对历史搜索行为结合卷积神经网络进行特征提取,进而训练出历史搜索记录对应的目标模型。
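上述"前向计算loss、反向传播更新权重、loss趋于收敛即停止"的训练流程可以用下面的Python草图示意。为便于运行,这里用一个极简的逻辑回归模型代替卷积神经网络,学习率、收敛阈值和样本数据均为本文之外的假设取值,并非本申请的实际实现:

```python
import math
import random

def train_until_converged(samples, lr=0.1, tol=1e-6, max_epochs=1000):
    """随机梯度下降训练示意:每轮结束后计算平均loss,
    当相邻两轮loss之差小于tol(即loss趋于收敛)时停止训练。"""
    w, b = 0.0, 0.0
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_epochs):
        random.shuffle(samples)
        total = 0.0
        for x, y in samples:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # 前向传播
            total += -(y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12))
            grad = p - y        # 交叉熵损失对logit的梯度
            w -= lr * grad * x  # 反向传播,迭代更新权重
            b -= lr * grad
        loss = total / len(samples)
        if abs(prev_loss - loss) < tol:  # loss收敛,停止训练
            break
        prev_loss = loss
    return w, b, loss

random.seed(0)
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, final_loss = train_until_converged(list(data))
```

真实场景中,将此处的前向/反向计算替换为卷积层、池化层与全连接层的计算即可得到图1B所示的深度学习模型训练流程。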
可选的,在将样本数据集结合卷积神经网络进行训练之前,所述方法还包括:根据当前搜索用户的历史搜索记录采集样本数据,历史搜索记录包括历史输入关键字,历史搜索行为,历史搜索结果;对样本数据进行预处理,得到清洁样本数据;对清洁样本数据进行向量化,得到文本向量数据,文本向量数据组成样本数据集。
具体地,用户的历史搜索记录是多种多样的,包括搜索关键字、搜索行为、搜索结果等,这些数据种类繁多,并不能直接用于进行目标模型的训练,因此首先要对这些数据进行标准化。标准化的方法包括预处理和向量化,其中,因为搜索过程中涉及的都是网页,而网页中包括大量文本内容,因此涉及文本的预处理和向量化。文本的预处理方法包括分词、去除停用词、过滤低频词、编码归一化等;文本向量化即使用向量空间模型VSM或者概率统计模型对文本进行表示,使计算机能够理解和计算,常用的方法有基于集合论的模型、基于代数的模型、基于频率统计的模型等。此外,除了对网页的标准化,还包括对搜索行为的标准化,例如根据用户点击搜索页面的概率、次数,确定网页的搜索有效性;根据用户在搜索页面停留的时间,在页面是否进行消费、评论等互动行为,确定用户对网页的满意度等。
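上述去除停用词、过滤低频词等预处理步骤可以用如下Python草图示意。这里假设分词已由外部工具完成(以空格分隔),停用词表与低频阈值均为示意取值:

```python
from collections import Counter

def preprocess(docs, stopwords, min_freq=2):
    """对已分词的文档集合做预处理:去除停用词、过滤低频词,
    返回清洗后的词序列,作为后续向量化的输入。"""
    tokenized = [d.split() for d in docs]
    freq = Counter(t for doc in tokenized for t in doc)  # 全局词频统计
    return [[t for t in doc
             if t not in stopwords and freq[t] >= min_freq]
            for doc in tokenized]

docs = ["搜索 方法 的 示例", "搜索 结果 的 排序", "排序 方法"]
clean = preprocess(docs, stopwords={"的"})
```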
可选的,对清洁样本数据进行向量化,得到文本向量数据,包括:获取清洁样本数据对应的样本文本;对样本文本采用文本加权技术进行加权统计,获得样本文本评分;将样本文本评分作为文本向量数据。
具体地,清洁样本数据表示进行预处理后去除无效数据的样本数据。假设样本数据为网页,其中无效数据包括无效浏览网页,网页中的无效文本等。在用户的历史搜索记录中,可能输入一个搜索关键字,但是在这个过程中进行了多次点击链接,那么就涉及到很多网页,对这些网页进行向量化,然后提取特征,并根据提取的特征进行排序,在用户下次进行搜索的时候,可以直接根据特征的排序将排序靠前的特征对应的网页设定为用户输入关键字的搜索结果,可以提升用户的搜索效率。
在这个过程中,假设清洁样本数据为网页,一个网页对应一个样本文本,对网页进行向量化的方法包括对样本文本的向量化,可以通过文本挖掘的方法来实现。TF-IDF(term frequency–inverse document frequency)是一种用于信息检索与数据挖掘的常用加权技术。TF意思是词频(Term Frequency),IDF意思是逆文本频率指数(Inverse Document Frequency)。其中,词频表示词语在样本文本中出现的频率,记为
$$tf_{i,j}=\frac{n_{i,j}}{\sum_{k}n_{k,j}}$$
其中$n_{i,j}$表示文档$j$中第$i$个词语A出现的次数,$n_{k,j}$表示文档$j$中第$k$个词语出现的次数。逆文本频率指数是指一个词语普遍重要性的度量,记为:
$$idf_{i}=\log\frac{|D|}{|\{j:t_{i}\in d_{j}\}|}$$
其中$|D|$表示输入搜索关键字后用户浏览的所有网页个数,$|\{j:t_{i}\in d_{j}\}|$表示输入搜索关键字后用户浏览的所有网页中,包含词语A的文档个数。
样本文本评分为TF-IDF值,记做:
$$R_{1}=TF\times IDF \qquad (1)$$
最后将样本文本评分作为清洁样本数据对应的文本向量数据,完成清洁样本数据的向量化。
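按照上式,TF-IDF评分的计算可以用如下Python草图示意。其中对数底数取自然对数、示例文档均为本文之外的假设:

```python
import math
from collections import Counter

def tfidf(docs):
    """对已分词的文档列表docs计算每篇文档中每个词的TF-IDF评分:
    TF = 词在该文档中出现次数 / 该文档总词数,
    IDF = log(文档总数 / 包含该词的文档数)。"""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # 文档频率:每个词出现在多少篇文档中
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({t: (tf[t] / total) * math.log(n_docs / df[t])
                       for t in tf})
    return scores

docs = [["搜索", "方法", "搜索"], ["搜索", "排序"], ["排序", "模型"]]
scores = tfidf(docs)
```

示例中"搜索"在第一篇文档中出现2次、总词数3,且在3篇文档中的2篇出现,故其评分为(2/3)·log(3/2),高于它在第二篇文档中的评分。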
可选的,对清洁样本数据进行向量化,得到文本向量数据,包括:获取清洁样本数据对应网页的链接关系,链接关系包括链接对象或链接数量;采用网页排名算法计算清洁样本数据对应网页的排名值;将排名值作为文本向量数据。
具体地,用户在输入搜索关键字后,搜索引擎会给出搜索结果,用户根据搜索结果进行点击,然后进行链接点击。在这个过程中,用户点击的每个网页的重要程度不同,即引用该网页的网页数量和引用该网页的网页重要程度不同,因此造成每个网页的重要程度也不同。可以对网页进行排名,然后根据排名值对网页文本进行向量化。网页排名算法为PageRank算法,根据该算法得到的网页向量值为排名值,即PageRank值。一个网页的PageRank值计算公式如下:
$$PR(p_{i})=\frac{1-a}{N}+a\sum_{p_{j}\in M_{p_{i}}}\frac{PR(p_{j})}{L(p_{j})}$$
其中$PR(p_{i})$表示网页$p_{i}$的PageRank值,$M_{p_{i}}$是所有对网页$p_{i}$有出链的网页集合,$L(p_{j})$是网页$p_{j}$的出链数目,$N$是网页总数,$a$是阻尼系数。
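上式的迭代计算过程可以用如下Python草图示意。其中链接关系为假设的玩具示例,且未处理无出链的悬挂网页等情形:

```python
def pagerank(links, a=0.85, iters=50):
    """按上式迭代计算PageRank值:
    PR(pi) = (1-a)/N + a * sum(PR(pj)/L(pj)),
    links[p]为网页p指向的网页列表(出链)。"""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # 初始均匀分布
    for _ in range(iters):
        new = {p: (1 - a) / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += a * pr[p] / len(outs)  # p把自身PR值均分给出链
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
```

示例中网页C被A和B同时指向,迭代收敛后其排名值高于只被一个页面指向的B。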
另外,可选的,对所述清洁样本数据进行向量化,得到文本向量数据还包括:根据用户点击网页的相关数据,确定网页的有效性,相关数据包括点击概率和停留时长;将搜索有效性采用二值法进行表示;将二值法表示结果作为文本的向量数据。
可选的,对所述清洁样本数据进行向量化,得到文本向量数据还包括:根据用户在搜索页面的操作数据,确定页面满意度,操作数据包括页面点击次数,页面评论或页面付费。
具体地,用户点击网页的相关数据包括页面点击概率和停留时长。点击概率可以根据页面显示次数和用户点击次数确定,例如p=m/k*100%,其中k表示页面显示次数,m表示用户点击次数;设置第一概率阈值p1,将p与p1的对比结果进行二值化:当p<p1时,y1=0,当p≥p1时,y1=1。用户点击了页面后,如果在页面停留时间过短,则可能是刷量等无效操作,因此要确保用户在页面的停留时间达到一定的时长:设置第一时长阈值为t1,用户在页面停留的时长为t,当t<t1时,y2=0,当t≥t1时,y2=1。该页面对应的文本向量数据R2=y1&y2,即只有当y1和y2同时为1时,文本向量数据为1。
用户在搜索页面的操作数据包括页面点击次数、页面评论或页面付费。设页面点击次数为s1,页面评论数为s2,页面付费次数为s3,然后对这三个参数赋予对应的权值α、β、γ,并进行加权求和,可获得R3=α*s1+β*s2+γ*s3,R3即为页面满意度。其中付费发生的可能性最低,其次是评论,再次是点击,因此权值γ>β>α,或者可以直接设置为γ=10β=100α。
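上述搜索有效性R2与页面满意度R3的计算可以用如下Python草图示意。其中阈值p1、t1与权值α、β、γ均为示意取值,并非本申请限定的参数:

```python
def page_validity(shows, clicks, dwell, p1=0.1, t1=5.0):
    """计算网页搜索有效性R2:点击率p=clicks/shows与停留时长dwell
    分别与阈值p1、t1比较并二值化,R2 = y1 & y2。"""
    y1 = 1 if clicks / shows >= p1 else 0  # 点击率达标
    y2 = 1 if dwell >= t1 else 0           # 停留时长达标
    return y1 & y2

def page_satisfaction(clicks, comments, payments,
                      alpha=0.01, beta=0.1, gamma=1.0):
    """计算页面满意度R3 = α*s1 + β*s2 + γ*s3,
    权值按γ=10β=100α设置,体现付费>评论>点击的重要性。"""
    return alpha * clicks + beta * comments + gamma * payments

r2 = page_validity(shows=100, clicks=20, dwell=10.0)
r3 = page_satisfaction(clicks=10, comments=2, payments=1)
```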
根据上述实施例中确定的文本向量数据,可以直接根据上述每一个文本向量数据结合卷积神经网络进行模型训练,也可以结合上述所有文本向量数据进行模型训练。上述所有文本向量数据可以组成一个文本向量,如表1所示:
表1网页对应的文本向量表
(原表内容为图片)表1示意网页对应的文本向量由上述各文本向量数据组成,包括R1(TF-IDF样本文本评分)、R2(搜索有效性,二值)和R3(页面满意度,加权求和)。
将上述文本向量结合卷积神经网络进行训练,确定模型中的各个参数,即可获得目标模型。
可见,在本申请实施例中,通过对用户历史搜索记录进行向量化,获得网页对应的文本向量,然后将本文向量作为样本数据集结合卷积神经网络进行训练,获得目标模型,使得模型训练过程数值化,提升了模型训练的效率和可靠性。
102、获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征。
在目标模型训练的过程中,将样本数据集输入卷积神经网络时,卷积层需要进行特征提取。对于文本数据,提取的特征即为各类表现文本内容的数值,以及数值之间的关联关系、数值权重等。对于输入的每一个样本数据,都可以获得其对应的特征;而对于输入的当前搜索关键字,在与样本数据集对应的历史输入关键字进行匹配后,可以获得样本数据集对应的特征作为当前搜索关键字的特征。
可选的,将当前搜索关键字输入目标模型中,得到多个当前搜索特征,包括:将当前搜索关键字进行语义分析,得到至少一个目标分词;将多个目标分词输入目标模型,进行一次或多次卷积运算;获取最后一次卷积运算的特征作为多个目标分词对应的多个当前搜索特征。
具体地,用户输入的当前搜索关键字可能是一个长句,历史输入关键字可能不包括能够与该长句直接匹配的内容,因此要对长句进行语义分析,然后将长句拆分为多个目标分词。将多个目标分词依次输入目标模型,即可匹配到历史输入关键字对应的文本,进而获得历史输入关键字对应的文本特征;获取所有目标分词对应的文本特征,即为当前搜索关键字对应的文本特征。其中,目标模型中可能包括多个卷积层,每个卷积层对应一次卷积运算,最后一个卷积层进行的最后一次卷积运算得到的文本特征层级最高,因此作为最终的当前搜索特征。
103、将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
获得当前搜索关键字对应的当前搜索特征后,对多个搜索特征进行分类排序,即可确定搜索特征对应的搜索结果的排序,进而获得当前搜索关键字对应的搜索结果推荐顺序。
可选的,二类分类算法为支持向量机SVM算法,将多个当前搜索特征采用二类分类算法进行分类排序,包括:将多个当前搜索特征中的每两个当前搜索特征进行成对组合,获得多个特征组;对多个特征组中的每个特征组采用SVM算法进行评分排序,确定每个特征组中的两个当前搜索特征的相对大小;根据每个特征组中的两个当前搜索特征的相对大小确定多个当前搜索特征的排序。
具体地,采用二分类算法,可以将多个当前搜索特征进行两两组合,获得多个特征组。假设一个特征组中的两个当前搜索特征分别为dj和di,当dj>di时,为其赋值为1,否则为其赋值为-1,这样就将多个当前搜索特征的排序问题转化为分类问题。支持向量机(Support Vector Machine,SVM)广泛用于二元分类问题,可以完成每个特征组的大小对比,进而完成所有当前搜索特征的大小排序。
确定所有当前搜索特征的大小排序后,选择排序最前的TOP-K作为目标特征,然后获取目标特征对应的输出值,即历史搜索结果,作为当前搜索关键字对应的搜索结果推荐。其中TOP-K中的K可以是1,也可以是其他正整数,即搜索结果可以是一个搜索结果,也可以是多个搜索结果。
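上述"两两组合打±1标签、再由对比结果得到整体排序并取TOP-K"的流程可以用如下Python草图示意。为便于运行,这里直接用特征评分模拟SVM对每个特征组的判别结果;真实系统中这些±1标签的样本对会用于训练SVM(即RankSVM思路):

```python
from itertools import combinations

def pairwise_labels(scores):
    """将多个搜索特征两两组合生成±1标签:
    dj > di 记为1,否则记为-1(此处以评分模拟SVM判别)。"""
    return {(i, j): 1 if scores[j] > scores[i] else -1
            for i, j in combinations(range(len(scores)), 2)}

def rank_by_pairs(scores):
    """根据两两对比结果统计每个特征"胜出"的次数,
    得到整体排序;排序最前的K个即为TOP-K目标特征。"""
    labels = pairwise_labels(scores)
    wins = [0] * len(scores)
    for (i, j), y in labels.items():
        wins[j if y == 1 else i] += 1  # 胜者计数加一
    return sorted(range(len(scores)), key=lambda k: wins[k], reverse=True)

order = rank_by_pairs([0.2, 0.9, 0.5])
```

order[0]即排序最前的特征下标,其对应的历史搜索结果即可作为搜索结果推荐。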
可见,在本申请实施例中,首先将样本数据集结合卷积神经网络进行训练,获得目标模型;然后获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;最后将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。因为所述样本数据集是根据当前搜索用户的历史搜索记录获取的,因此目标模型是当前搜索用户对应的个性化目标模型,将当前搜索关键字输入目标模型后,可以自动提取出多个当前搜索特征而不需要人工参与,简化了特征获取的步骤,提升了特征获取的效率,最后将多个当前搜索特征进行分类排序获得当前搜索结果,提升了搜索效率。
请参阅图2,图2是本申请实施例提供的另一种搜索方法的流程示意图,如图2所示,本实施例中的搜索方法包括:
201、根据当前搜索用户的历史搜索记录采集样本数据,所述历史搜索记录包括历史输入关键字,历史搜索行为,历史搜索结果;
202、对所述样本数据进行预处理,得到清洁样本数据;
203、获取所述清洁样本数据对应的样本文本,并对所述样本文本采用文本加权技术进行加权统计,获得样本文本评分;
204、将所述样本文本评分作为所述清洁样本数据的文本向量数据,所述文本向量数据组成所述样本数据集;
205、将样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据当前搜索用户的历史搜索记录获取;
206、获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;
207、将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
其中,上述步骤201-步骤207的具体描述可以参照图1A所描述的搜索方法的相应描述,在此不再赘述。
在本申请实施例中,通过获取当前搜索用户的历史搜索记录,对历史搜索记录进行预处理,并通过TF-IDF方法对历史搜索记录中的网页文本进行加权统计,获得网页文本对应的文本向量数据,再将文本向量数据作为样本数据集进行卷积神经网络训练,获得目标模型并进行搜索推荐,这个过程中,通过TF-IDF方法对网页文本进行向量化,通过词频统计确定网页文本的重要性,提升了目标模型构建的针对性和可靠性。
请参阅图3,图3是本申请实施例提供的另一种搜索方法的流程示意图,如图3所示,本实施例中的搜索方法包括:
301、根据当前搜索用户的历史搜索记录采集样本数据,所述历史搜索记录包括历史输入关键字,历史搜索行为,历史搜索结果;
302、获取所述清洁样本数据对应网页的链接关系,所述链接关系包括链接对象或链接数量;
303、采用网页排名算法计算所述清洁样本数据对应网页的排名值;
304、将所述排名值作为所述文本向量数据,所述文本向量数据组成所述样本数据集;
305、将样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据当前搜索用户的历史搜索记录获取;
306、获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;
307、将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
其中,上述步骤301-步骤307的具体描述可以参照图1A所描述的搜索方法的相应描述,在此不再赘述。
在本申请实施例中,通过获取当前搜索用户的历史搜索记录,对历史搜索记录进行预处理,并通过PageRank算法对历史搜索记录中的网页文本重要程度评分,获得网页文本对应的文本向量数据,再将文本向量数据作为样本数据集进行卷积神经网络训练,获得目标模型并进行搜索推荐,这个过程中,通过PageRank算法对网页文本进行向量化,通过网页链接关系确定网页文本的重要性,提升了目标模型构建的针对性和可靠性。
请参阅图4,图4是本申请实施例提供的另一种搜索方法的流程示意图,如图4所示,本实施例中的搜索方法包括:
401、将样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据当前搜索用户的历史搜索记录获取;
402、将所述当前搜索关键字进行语义分析,得到至少一个目标分词;
403、将所述多个目标分词输入所述目标模型,进行一次或多次卷积运算;
404、获取最后一次卷积运算的特征作为所述多个目标分词对应的多个当前搜索特征;
405、将所述多个当前搜索特征中的每两个当前搜索特征进行成对组合,获得多个特征组;
406、对所述多个特征组中的每个特征组采用SVM算法进行评分排序,确定所述每个特征组中的两个当前搜索特征的相对大小;
407、根据所述每个特征组中的两个当前搜索特征的相对大小确定所述多个当前搜索特征的排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
其中,上述步骤401-步骤407的具体描述可以参照图1A所描述的搜索方法的相应描述,在此不再赘述。
在本申请实施例中,在训练获得目标模型后,将当前搜索关键字输入目标模型中进行特征提取,在将当前搜索关键字输入目标模型之前,将其进行语义分析并拆分得到至少一个目标分词,然后将目标分词依次输入目标模型,得到与构建目标模型的历史输入关键字对应的特征作为当前搜索关键字对应的当前搜索特征,对当前搜索特征采用SVM算法进行评分排序,进而获得搜索推荐结果。这个过程中,对当前搜索关键字进行拆分并依次获得对应的历史输入关键字的特征组成当前搜索特征,提升了当前搜索特征提取的准确性,采用SVM算法对当前搜索特征进行评分排序进而获得推荐结果,简化了人工获取特征的步骤,提升了获得搜索推荐结果的效率,进而提升了搜索效率。
图5是本申请实施例提供的一种电子装置的结构示意图,如图5所示,该电子装置包括处理器、存储器、通信接口以及一个或多个程序,其中,上述一个或多个程序被存储在上述存储器中,并且被配置由上述处理器执行,上述程序包括用于执行以下步骤的指令:
将样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据当前搜索用户的历史搜索记录获取;
获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;
将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
可以看出,在本申请实施例中,电子装置首先将样本数据集结合卷积神经网络进行训练,获得目标模型;然后获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;最后将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。因为所述样本数据集是根据当前搜索用户的历史搜索记录获取的,因此目标模型是当前搜索用户对应的个性化目标模型,将当前搜索关键字输入目标模型后,可以自动提取出多个当前搜索特征而不需要人工参与,简化了特征获取的步骤,提升了特征获取的效率,最后将多个当前搜索特征进行分类排序获得当前搜索结果,提升了搜索效率。
在一个可能的示例中,所述将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征,包括:
将所述当前搜索关键字进行语义分析,得到至少一个目标分词;
将所述多个目标分词输入所述目标模型,进行一次或多次卷积运算;
获取最后一次卷积运算的特征作为所述多个目标分词对应的多个当前搜索特征。
在一个可能的示例中,所述二类分类算法为支持向量机SVM算法,所述将所述多个当前搜索特征采用二类分类算法进行分类排序,包括:
将所述多个当前搜索特征中的每两个当前搜索特征进行成对组合,获得多个特征组;
对所述多个特征组中的每个特征组采用SVM算法进行评分排序,确定所述每个特征组中的两个当前搜索特征的相对大小;
根据所述每个特征组中的两个当前搜索特征的相对大小确定所述多个当前搜索特征的排序。
在一个可能的示例中,在将样本数据集结合卷积神经网络进行训练之前,所述方法还包括:
根据当前搜索用户的历史搜索记录采集样本数据,所述历史搜索记录包括历史输入关键字,历史搜索行为,历史搜索结果;
对所述样本数据进行预处理,得到清洁样本数据;
对所述清洁样本数据进行向量化,得到文本向量数据,所述文本向量数据组成所述样本数据集。
在一个可能的示例中,所述对所述清洁样本数据进行向量化,得到文本向量数据,包括:
获取所述清洁样本数据对应的样本文本;
对所述样本文本进行TF-IDF加权统计,获得样本文本评分;
将所述样本文本评分作为所述清洁样本数据的文本向量数据。
在一个可能的示例中,所述对所述清洁样本数据进行向量化,得到文本向量数据,包括:
获取所述清洁样本数据对应网页的链接关系,所述链接关系包括链接对象或链接数量;
采用PageRank算法计算所述清洁样本数据对应网页的PageRank值;
将所述PageRank值作为所述文本向量数据。
图6是本申请实施例中所涉及的搜索装置600的功能单元组成框图。该搜索装置600应用于业务系统,业务系统包括服务器和客户端,所述搜索装置包括:
训练单元601,用于将样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据当前搜索用户的历史搜索记录获取;
获取单元602,用于获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;
搜索单元603,用于将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
在此需要说明的是,上述训练单元601、获取单元602和搜索单元603的具体工作过程参见上述步骤101-103的相关描述。在此不再赘述。
可以看出,本申请实施例中,搜索装置首先将样本数据集结合卷积神经网络进行训练,获得目标模型;然后获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;最后将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。因为所述样本数据集是根据当前搜索用户的历史搜索记录获取的,因此目标模型是当前搜索用户对应的个性化目标模型,将当前搜索关键字输入目标模型后,可以自动提取出多个当前搜索特征而不需要人工参与,简化了特征获取的步骤,提升了特征获取的效率,最后将多个当前搜索特征进行分类排序获得当前搜索结果,提升了搜索效率。
在可选情况下,所述获取单元602具体用于:
将所述当前搜索关键字进行语义分析,得到至少一个目标分词;
将所述多个目标分词输入所述目标模型,进行一次或多次卷积运算;
获取最后一次卷积运算的特征作为所述多个目标分词对应的多个当前搜索特征。
在可选情况下,所述二类分类算法为支持向量机SVM算法,所述搜索单元603具体用于:
将所述多个当前搜索特征中的每两个当前搜索特征进行成对组合,获得多个特征组;
对所述多个特征组中的每个特征组采用SVM算法进行评分排序,确定所述每个特征组中的两个当前搜索特征的相对大小;
根据所述每个特征组中的两个当前搜索特征的相对大小确定所述多个当前搜索特征的排序。
在可选情况下,所述搜索装置还包括数据处理单元604,具体用于:
根据当前搜索用户的历史搜索记录采集样本数据,所述历史搜索记录包括历史输入关键字,历史搜索行为,历史搜索结果;
对所述样本数据进行预处理,得到清洁样本数据;
对所述清洁样本数据进行向量化,得到文本向量数据,所述文本向量数据组成所述样本数据集。
在可选情况下,所述数据处理单元604还具体用于:
获取所述清洁样本数据对应的样本文本;
对所述样本文本进行TF-IDF加权统计,获得样本文本评分;
将所述样本文本评分作为所述清洁样本数据的文本向量数据。
在可选情况下,所述数据处理单元604还具体用于:
获取所述清洁样本数据对应网页的链接关系,所述链接关系包括链接对象或链接数量;
采用PageRank算法计算所述清洁样本数据对应网页的PageRank值;
将所述PageRank值作为所述文本向量数据。
本申请实施例还提供一种计算机存储介质,所述存储介质可以为易失性介质或非易失性介质,其中,该计算机存储介质存储用于电子数据交换的计算机程序,该计算机程序使得计算机至少执行以下步骤:
将样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据当前搜索用户的历史搜索记录获取;
获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;
将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
本申请实施例还提供一种计算机程序产品,上述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,上述计算机程序可操作来使计算机执行如上述方法实施例中记载的任一方法的部分或全部步骤。该计算机程序产品可以为一个软件安装包,上述计算机包括移动终端。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如上述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
上述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
上述集成的单元若以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储器中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本申请各个实施例上述方法的全部或部分步骤。而前述的存储器包括:U盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序可以存储于一计算机可读存储器中,存储器可以包括:闪存盘、ROM、RAM、磁盘或光盘等。
以上对本申请实施例进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (20)

  1. 一种搜索方法,其中,所述方法包括:
    将搜索用户的样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据所述搜索用户的历史搜索记录获取;
    获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;
    将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
  2. 根据权利要求1所述的方法,其中,所述将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征,包括:
    将所述当前搜索关键字进行语义分析,得到至少一个目标分词;
    将所述多个目标分词输入所述目标模型,进行一次或多次卷积运算;
    获取最后一次卷积运算的特征作为所述多个目标分词对应的多个当前搜索特征。
  3. 根据权利要求2所述的方法,其中,所述二类分类算法为支持向量机SVM算法,所述将所述多个当前搜索特征采用二类分类算法进行分类排序,包括:
    将所述多个当前搜索特征中的每两个当前搜索特征进行成对组合,获得多个特征组;
    对所述多个特征组中的每个特征组采用SVM算法进行评分排序,确定所述每个特征组中的两个当前搜索特征的相对大小;
    根据所述每个特征组中的两个当前搜索特征的相对大小确定所述多个当前搜索特征的排序。
  4. 根据权利要求1所述的方法,其中,在将样本数据集结合卷积神经网络进行训练之前,所述方法还包括:
    根据当前搜索用户的历史搜索记录采集样本数据,所述历史搜索记录包括历史输入关键字,历史搜索行为,历史搜索结果;
    对所述样本数据进行预处理,得到清洁样本数据;
    对所述清洁样本数据进行向量化,得到文本向量数据,所述文本向量数据组成所述样本数据集。
  5. 根据权利要求4所述的方法,其中,所述对所述清洁样本数据进行向量化,得到文本向量数据,包括:
    获取所述清洁样本数据对应的样本文本;
    对所述样本文本采用文本加权技术进行加权统计,获得样本文本评分;
    将所述样本文本评分作为所述清洁样本数据的文本向量数据。
  6. 根据权利要求4或5所述的方法,其中,所述对所述清洁样本数据进行向量化,得到文本向量数据,包括:
    获取所述清洁样本数据对应网页的链接关系,所述链接关系包括链接对象或链接数量;
    采用网页排名算法计算所述清洁样本数据对应网页的排名值;
    将所述排名值作为所述文本向量数据。
  7. 一种搜索装置,其中,所述搜索装置包括:
    训练单元,用于将样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据当前搜索用户的历史搜索记录获取;
    获取单元,用于获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;
    搜索单元,用于将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
  8. 根据权利要求7所述的装置,其中,所述获取单元具体用于:
    将所述当前搜索关键字进行语义分析,得到至少一个目标分词;
    将所述多个目标分词输入所述目标模型,进行一次或多次卷积运算;
    获取最后一次卷积运算的特征作为所述多个目标分词对应的多个当前搜索特征。
  9. 一种电子装置,包括处理器、存储器、通信接口,以及一个或多个程序,所述一个或多个程序被存储在所述存储器中,并且被配置由所述处理器执行,所述程序包括用于执行一种搜索方法,其中所述方法包括:
    将搜索用户的样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据所述搜索用户的历史搜索记录获取;
    获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;
    将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
  10. 根据权利要求9所述的电子装置,其中,所述将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征,包括:
    将所述当前搜索关键字进行语义分析,得到至少一个目标分词;
    将所述多个目标分词输入所述目标模型,进行一次或多次卷积运算;
    获取最后一次卷积运算的特征作为所述多个目标分词对应的多个当前搜索特征。
  11. 根据权利要求10所述的电子装置,其中,所述二类分类算法为支持向量机SVM算法,所述将所述多个当前搜索特征采用二类分类算法进行分类排序,包括:
    将所述多个当前搜索特征中的每两个当前搜索特征进行成对组合,获得多个特征组;
    对所述多个特征组中的每个特征组采用SVM算法进行评分排序,确定所述每个特征组中的两个当前搜索特征的相对大小;
    根据所述每个特征组中的两个当前搜索特征的相对大小确定所述多个当前搜索特征的排序。
  12. 根据权利要求9所述的电子装置,其中,在将样本数据集结合卷积神经网络进行训练之前,所述方法还包括:
    根据当前搜索用户的历史搜索记录采集样本数据,所述历史搜索记录包括历史输入关键字,历史搜索行为,历史搜索结果;
    对所述样本数据进行预处理,得到清洁样本数据;
    对所述清洁样本数据进行向量化,得到文本向量数据,所述文本向量数据组成所述样本数据集。
  13. 根据权利要求12所述的电子装置,其中,所述对所述清洁样本数据进行向量化,得到文本向量数据,包括:
    获取所述清洁样本数据对应的样本文本;
    对所述样本文本采用文本加权技术进行加权统计,获得样本文本评分;
    将所述样本文本评分作为所述清洁样本数据的文本向量数据。
  14. 根据权利要求12或13所述的电子装置,其中,所述对所述清洁样本数据进行向量化,得到文本向量数据,包括:
    获取所述清洁样本数据对应网页的链接关系,所述链接关系包括链接对象或链接数量;
    采用网页排名算法计算所述清洁样本数据对应网页的排名值;
    将所述排名值作为所述文本向量数据。
  15. 一种计算机可读存储介质,其中,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行以实现一种搜索方法,其中所述方法包括:
    将搜索用户的样本数据集结合卷积神经网络进行训练,获得目标模型,所述样本数据集根据所述搜索用户的历史搜索记录获取;
    获取当前搜索关键字,将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征;
    将所述多个当前搜索特征采用二类分类算法进行分类排序,并根据所述多个当前搜索特征的排序确定所述当前搜索关键字对应的当前搜索结果。
  16. 根据权利要求15所述的计算机可读存储介质,其中,所述将所述当前搜索关键字输入所述目标模型中,得到多个当前搜索特征,包括:
    将所述当前搜索关键字进行语义分析,得到至少一个目标分词;
    将所述多个目标分词输入所述目标模型,进行一次或多次卷积运算;
    获取最后一次卷积运算的特征作为所述多个目标分词对应的多个当前搜索特征。
  17. 根据权利要求16所述的计算机可读存储介质,其中,所述二类分类算法为支持向量机SVM算法,所述将所述多个当前搜索特征采用二类分类算法进行分类排序,包括:
    将所述多个当前搜索特征中的每两个当前搜索特征进行成对组合,获得多个特征组;
    对所述多个特征组中的每个特征组采用SVM算法进行评分排序,确定所述每个特征组中的两个当前搜索特征的相对大小;
    根据所述每个特征组中的两个当前搜索特征的相对大小确定所述多个当前搜索特征的排序。
  18. 根据权利要求15所述的计算机可读存储介质,其中,在将样本数据集结合卷积神经网络进行训练之前,还包括:
    根据当前搜索用户的历史搜索记录采集样本数据,所述历史搜索记录包括历史输入关键字,历史搜索行为,历史搜索结果;
    对所述样本数据进行预处理,得到清洁样本数据;
    对所述清洁样本数据进行向量化,得到文本向量数据,所述文本向量数据组成所述样本数据集。
  19. 根据权利要求18所述的计算机可读存储介质,其中,所述对所述清洁样本数据进行向量化,得到文本向量数据,包括:
    获取所述清洁样本数据对应的样本文本;
    对所述样本文本采用文本加权技术进行加权统计,获得样本文本评分;
    将所述样本文本评分作为所述清洁样本数据的文本向量数据。
  20. 根据权利要求18或19所述的计算机可读存储介质,其中,所述对所述清洁样本数据进行向量化,得到文本向量数据,包括:
    获取所述清洁样本数据对应网页的链接关系,所述链接关系包括链接对象或链接数量;
    采用网页排名算法计算所述清洁样本数据对应网页的排名值;
    将所述排名值作为所述文本向量数据。
PCT/CN2020/086677 2019-05-21 2020-04-24 一种搜索方法、装置及存储介质 WO2020233344A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910421974.X 2019-05-21
CN201910421974.XA CN110222260A (zh) 2019-05-21 2019-05-21 一种搜索方法、装置及存储介质

Publications (1)

Publication Number Publication Date
WO2020233344A1 true WO2020233344A1 (zh) 2020-11-26

Family

ID=67821526

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/086677 WO2020233344A1 (zh) 2019-05-21 2020-04-24 一种搜索方法、装置及存储介质

Country Status (2)

Country Link
CN (1) CN110222260A (zh)
WO (1) WO2020233344A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343082A (zh) * 2021-05-25 2021-09-03 北京字节跳动网络技术有限公司 可热字段预测模型生成方法、装置、存储介质及设备

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222260A (zh) * 2019-05-21 2019-09-10 深圳壹账通智能科技有限公司 一种搜索方法、装置及存储介质
CN111143695A (zh) * 2019-12-31 2020-05-12 腾讯科技(深圳)有限公司 一种搜索方法、装置、服务器及存储介质
CN111460264B (zh) * 2020-03-30 2023-08-01 口口相传(北京)网络技术有限公司 语义相似度匹配模型的训练方法及装置
CN112733531B (zh) * 2020-12-15 2023-08-18 平安银行股份有限公司 虚拟资源分配方法、装置、电子设备及计算机存储介质
CN116431930A (zh) * 2023-06-13 2023-07-14 天津联创科技发展有限公司 科技成果转化数据查询方法、系统、终端及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737090A (zh) * 2012-03-21 2012-10-17 袁行远 网页搜索结果排序方法及装置
CN106649760A (zh) * 2016-12-27 2017-05-10 北京百度网讯科技有限公司 基于深度问答的提问型搜索词搜索方法及装置
CN107066553A (zh) * 2017-03-24 2017-08-18 北京工业大学 一种基于卷积神经网络与随机森林的短文本分类方法
US20180239989A1 (en) * 2017-02-20 2018-08-23 Alibaba Group Holding Limited Type Prediction Method, Apparatus and Electronic Device for Recognizing an Object in an Image
CN109543190A (zh) * 2018-11-29 2019-03-29 北京羽扇智信息科技有限公司 一种意图识别方法、装置、设备及存储介质
CN110222260A (zh) * 2019-05-21 2019-09-10 深圳壹账通智能科技有限公司 一种搜索方法、装置及存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101930438B (zh) * 2009-06-19 2016-08-31 阿里巴巴集团控股有限公司 一种搜索结果生成方法及信息搜索系统
CN108804443A (zh) * 2017-04-27 2018-11-13 安徽富驰信息技术有限公司 一种基于多特征融合的司法类案搜索方法
CN107491518B (zh) * 2017-08-15 2020-08-04 北京百度网讯科技有限公司 一种搜索召回方法和装置、服务器、存储介质
CN108121814B (zh) * 2017-12-28 2022-04-22 北京百度网讯科技有限公司 搜索结果排序模型生成方法和装置
CN108536678B (zh) * 2018-04-12 2023-04-07 腾讯科技(深圳)有限公司 文本关键信息提取方法、装置、计算机设备和存储介质



Also Published As

Publication number Publication date
CN110222260A (zh) 2019-09-10


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20810787

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 01/03/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20810787

Country of ref document: EP

Kind code of ref document: A1