CN108664637B

CN108664637B - Retrieval method and system

Info

Publication number: CN108664637B
Application number: CN201810463813.2A
Authority: CN
Inventors: 施俊
Original assignee: Wellong Etown International Logistics Co ltd
Current assignee: Wellong Etown International Logistics Co ltd
Priority date: 2018-05-15
Filing date: 2018-05-15
Publication date: 2021-10-08
Anticipated expiration: 2038-05-15
Also published as: CN108664637A

Abstract

Embodiments of the present application provide a retrieval method and a retrieval system for solving the problems of existing retrieval methods, namely high matching with low precision and homogeneous retrieval results. The method comprises: acquiring historical query data of a user, and extracting commodity attributes from the title and the text of the final webpage searched; for each commodity attribute obtained offline and its related commodity attribute library, calculating the correlation with the commodity attributes contained in the historical query data and the correlation with the extracted commodity attributes; and, based on the correlations, screening the commodity attributes obtained offline and their corresponding related commodity attribute lists, and retrieving with the screened commodity attributes as query expansion words. Because the information in the final webpage that the user clicked through to effectively reflects the user's needs, using that information to determine the query expansion words improves retrieval accuracy, widens recall, effectively alleviates the problem of searches returning no results or few useful results, and greatly improves the user's search experience.

Description

Retrieval method and system

Technical Field

The present application relates to the field of internet information retrieval, and in particular, to a retrieval method and system.

Background

Information retrieval is the basis of content-driven applications, and the quality of the search results directly determines whether a user can obtain the required information quickly and in time. At present, vertical search engines for specific fields meet users' needs for specific information to some extent. However, existing full-text search engines based on text matching screen webpages according to the degree of matching, so the search results depend too heavily on the choice of search terms, and the problems of high matching with low precision and homogeneous search results often occur.

Disclosure of Invention

Embodiments of the present application provide a retrieval method and a retrieval system for solving the problems that existing retrieval methods cannot retrieve from a semantic perspective, that the retrieval results depend too heavily on the choice of search terms, that high matching with low precision often occurs, and that the retrieval results are homogeneous.

A retrieval method, comprising:

acquiring historical query data of a user, wherein the historical query data comprises commodity attributes, and the title and text identifier of the final webpage searched;

extracting commodity attributes from the obtained title and from the text identified by the text identifier;

for each commodity attribute obtained offline and the related commodity attribute library of that commodity attribute, calculating the correlation with the commodity attributes contained in the historical query data; and

for each commodity attribute obtained offline and the related commodity attribute library of that commodity attribute, calculating the correlation with the extracted commodity attributes;

screening the commodity attributes obtained offline and their corresponding related commodity attribute lists based on the obtained correlations;

and retrieving with the screened commodity attributes as query expansion words.

A retrieval system, comprising:

an acquisition unit, configured to acquire historical query data of a user, wherein the historical query data comprises commodity attributes, and the title and text identifier of the final webpage searched;

an extraction unit, configured to extract commodity attributes from the obtained title and from the text identified by the text identifier;

a commodity attribute library generating unit, configured to obtain offline the commodity attributes and the related commodity attribute libraries of the commodity attributes;

a calculation unit, configured to calculate, for each commodity attribute obtained offline and the related commodity attribute library of that commodity attribute, the correlation with the commodity attributes contained in the historical query data, and the correlation with the extracted commodity attributes;

a screening unit, configured to screen the commodity attributes obtained offline and their corresponding related commodity attribute lists based on the obtained correlations;

and a retrieval unit, configured to retrieve with the screened commodity attributes as query expansion words.

In the embodiments of the present application, the title and text identifier information of the final webpage searched by the user take part in determining the query expansion words used in online retrieval. Because the information in that final webpage effectively reflects the user's needs, using it to determine the query expansion words improves retrieval accuracy, widens recall, effectively alleviates the problem of searches returning no results or few useful results, and greatly improves the user's search experience.

Drawings

Fig. 1 is a schematic diagram of a method for determining a candidate commodity attribute library according to an embodiment of the present application;

Fig. 2 is a schematic diagram of a retrieval method according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a retrieval system according to an embodiment of the present application.

Detailed Description

To solve the problems of high matching with low precision and homogeneous retrieval results in the retrieval methods used by existing full-text search engines based on text matching, embodiments of the present application provide a retrieval method and a retrieval system.

The method and system provided by the embodiments of the present application can be used on a vertical e-commerce trading platform for selling bulk commodities. Relying on financial settlement guarantees for financing against offline credit lines and on data guarantees from a holographic-map big data platform, the platform recruits large regional industrial enterprises as seller members. It can relieve the bottlenecks in information flow, customer flow, capital flow, logistics and commodity flow that hinder the development of bulk commodity e-commerce, and realizes an innovative model of direct supply from manufacturers and online vertical sales. On the platform, users can search for categories or names of bulk commodities. Some users may forget a commodity and search for a similar product instead; in that case the retrieval method provided by the present application can display the product the user searched for, other related products, and related attributes of the product, which makes repeated searching easier for the user.

The preferred embodiments of the present application are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described herein serve only to illustrate and explain the present application and are not intended to limit it. Where no conflict arises, the embodiments of the present application and the features of the embodiments may be combined with one another.

The embodiments of the present application are divided into an offline method for determining a candidate commodity attribute library of related phrases and an online retrieval method. The offline determination method produces the candidate commodity attribute library of related phrases, and the online retrieval method uses that offline library; the related phrases are the phrases related to the commodity attributes queried by the user. The offline determination method and the online retrieval method are described in turn below.

As shown in Fig. 1, the method for determining a candidate commodity attribute library of related phrases provided in an embodiment of the present application includes the following steps:

Step 11: acquiring a query log of a user to form a commodity attribute corpus, wherein the query log comprises commodity attributes.

Specifically, the website may record the commodity attributes that the user enters into the search box or clicks, thereby obtaining the query log. The commodity attributes in the query log are then extracted to obtain the commodity attribute corpus.

Step 12: training a Word2Vec word vector model with the commodity attribute corpus as the sample.

Before the Word2Vec word vector model is trained, the commodity attribute corpus obtained in step 11 may be preprocessed; the preprocessing mainly comprises data cleaning and data description extraction.

Data cleaning mainly serves to make the data in the corpus consistent. Because the volume of logistics commodity data is limited, a MySQL environment can be set up directly to eliminate inconsistent display formats for times, dates, numerical values, full-width and half-width characters, and the like. Such inconsistencies usually originate at the input end, and the cleaning may specifically include at least one of unifying letter case, removing redundant spaces, unifying punctuation marks, and unifying full-width and half-width formats. The same problem may also arise when data from multiple sources is integrated, in which case the data is likewise processed into one consistent format. Cleaning also covers characters that should not appear in the content, which in some entries make up only part of the content; in these cases the possible problems need to be identified by a combination of semi-automatic checking and manual review, and the unwanted characters removed.
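The cleaning operations listed above can be illustrated with a minimal Python sketch. It is not the MySQL-based procedure of the embodiment; the function name `normalize_attribute`, the choice of lower case as the unified case, and the small punctuation map are assumptions made only for illustration.

```python
import re
import unicodedata

# Hypothetical map for CJK punctuation that NFKC normalization leaves unchanged.
PUNCT_MAP = str.maketrans({"。": ".", "、": ","})


def normalize_attribute(text: str) -> str:
    """Return a cleaned copy of one commodity attribute string."""
    # NFKC folds full-width letters, digits, spaces and most punctuation
    # to their half-width equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Unify the remaining punctuation variants.
    text = text.translate(PUNCT_MAP)
    # Unify letter case (lower case is an arbitrary choice here).
    text = text.lower()
    # Collapse redundant whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_attribute("  Ｑ２３５Ｂ　 Steel  "))
    print(normalize_attribute("螺纹钢，HRB400。"))
```

Running every corpus entry through one such function before word segmentation keeps variants of the same attribute from being treated as different tokens.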

Data description extraction comprises performing word segmentation on each entry (each commodity attribute) in the corpus; specifically, a user dictionary may be added and segmentation performed with the NLPIR Chinese analysis system.

The Word2Vec model can then be trained on the segmented corpus.

Word2Vec model: Word2Vec is an efficient algorithmic model that represents words as real-valued vectors. Drawing on ideas from deep learning, it reduces, through training, the processing of text content to vector operations in a vector space, so that similarity in the vector space can represent the semantic similarity between texts.

The basic idea of Word2Vec is as follows: through training, each word in the text is mapped to a vector in a K-dimensional real vector space, and the semantic similarity between words is judged from the distance between their vectors in that space, for example by cosine similarity or Euclidean distance. The algorithm uses a three-layer neural network consisting of an input layer, a hidden layer and an output layer. By Huffman-coding the words according to word frequency, words with similar frequencies activate essentially the same hidden-layer content, and the more frequent a word is, the fewer hidden-layer nodes it activates, which effectively reduces the computational complexity.
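The effect of Huffman coding by word frequency can be seen in a small sketch. The code below builds a Huffman tree over a toy vocabulary and reports the code length of each word, which in hierarchical softmax corresponds to the number of inner-node decisions needed for that word. The function name and the frequencies are hypothetical, and the sketch does not train a model.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs: dict[str, int]) -> dict[str, int]:
    """Return the Huffman code length (leaf depth) of each word."""
    tie = count()  # tie-breaker so the heap never has to compare word lists
    heap = [(freq, next(tie), [word]) for word, freq in freqs.items()]
    heapq.heapify(heap)
    depth = {word: 0 for word in freqs}
    while len(heap) > 1:
        # Merge the two least frequent subtrees under a new parent node.
        f1, _, words1 = heapq.heappop(heap)
        f2, _, words2 = heapq.heappop(heap)
        merged = words1 + words2
        for word in merged:
            depth[word] += 1  # every leaf below the new parent gets one level deeper
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return depth


if __name__ == "__main__":
    # Toy frequencies, invented for illustration only.
    toy = {"steel": 50, "coal": 20, "copper": 15, "zinc": 10, "tin": 5}
    for word, length in huffman_code_lengths(toy).items():
        print(word, length)
```

In this toy vocabulary the most frequent word sits directly under the root while the rarest words sit at the deepest level, which is why frequent words cost less to update.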

The Word2Vec model may take one of two forms, the CBOW model and the Skip-Gram model; the embodiments of the present application may use the CBOW model based on the hierarchical softmax algorithm.

Step 13: obtaining the vector representation of each commodity attribute in the commodity attribute corpus by using the trained Word2Vec word vector model.

Specifically, when word segmentation has been performed on a commodity attribute, the vector representation obtained here for the commodity attribute is the vector representation of each of its tokens, where each token is a word produced by segmenting the commodity attribute.

Step 14: calculating the correlation between commodity attributes by using their vector representations, and screening based on the calculation results to obtain the related commodity attribute library of each commodity attribute.

The correlation between commodity attributes can be measured in two ways, by distance and by similarity: the Euclidean distance may be used as the distance measure, and the cosine similarity as the similarity measure.

Qualitatively, the larger the cosine similarity, the greater the correlation between the commodity attributes, and the smaller the cosine similarity, the smaller the correlation. Conversely, the smaller the Euclidean distance, the greater the correlation between the commodity attributes, and the larger the Euclidean distance, the smaller the correlation.
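The two measures can be written out in a few lines of Python. This is a minimal sketch using plain lists as vectors; the function names are illustrative, and returning 0.0 for a zero vector is an assumption rather than something the embodiment specifies.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; larger means more related."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_u * norm_v)


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between u and v; smaller means more related."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
    print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```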

Specifically, screening based on the calculation results to obtain the related commodity attribute library of the commodity attributes may be:

for each commodity attribute, adding the commodity attributes whose similarity to that commodity attribute is greater than a first set value to the related commodity attribute list of that commodity attribute.

It may also be:

for each commodity attribute, adding the commodity attributes whose distance from that commodity attribute is smaller than a second set value to the related commodity attribute library of that commodity attribute.

Here, the first set value and the second set value may be set based on empirical values.
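The similarity-based screening can be sketched as follows. The function `build_related_library`, the toy two-dimensional vectors and the threshold value are hypothetical; in practice the vectors would come from the trained Word2Vec model and the first set value from experience.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def build_related_library(
    vectors: dict[str, list[float]], first_set_value: float
) -> dict[str, list[str]]:
    """Map each attribute to the attributes whose similarity exceeds the threshold."""
    library: dict[str, list[str]] = {}
    for attr, vec in vectors.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in vectors.items()
            if other != attr
        ]
        # Most similar first; keep only those above the first set value.
        library[attr] = [
            name for score, name in sorted(scored, reverse=True) if score > first_set_value
        ]
    return library


if __name__ == "__main__":
    # Toy vectors, invented for illustration only.
    toy_vectors = {
        "rebar": [1.0, 0.0],
        "wire rod": [0.9, 0.1],
        "thermal coal": [0.0, 1.0],
    }
    print(build_related_library(toy_vectors, first_set_value=0.8))
```

The distance-based variant would keep the attributes whose Euclidean distance is smaller than the second set value instead.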

Here, through training, the correlation between commodity attributes is converted into a vector operation in a multidimensional vector space, which expresses the semantic correlation of the text.

The offline process can build related commodity attribute libraries for a specific target field, such as the bulk commodity sales field or the logistics field. A related commodity attribute library is built for each commodity attribute in the target field (these may be selected representative commodity attributes).

At this point, the offline process has produced the related commodity attribute libraries of the commodity attributes.

The related commodity attribute library of a commodity attribute may take the form of a related commodity attribute list, or of course any other data form, which is not limited here.

The above is the offline process. The online process is described below.

As shown in Fig. 2, the online retrieval method provided in an embodiment of the present application includes the following steps:

Step 21: acquiring historical query data of a user, wherein the historical query data comprises commodity attributes, and the title and text identifier of the final webpage searched.

Step 21 requires that the user has performed at least one query operation. From the perspective of user-side operation, one query operation comprises entering a commodity attribute in the search box, clicking a webpage title returned by the search engine, and entering the webpage. From the perspective of the search engine, one query operation comprises receiving the entered commodity attribute, performing text matching according to the commodity attribute, returning the matching results as webpage title links, receiving a click command, and returning the webpage the click command points to.

The final webpage searched may be the webpage the user viewed last, a webpage on which the user's browsing time exceeds a set duration, or a webpage on which the user performed an operation such as saving, adding to favorites, taking a screenshot, or forwarding. When the user enters a commodity attribute in the search box and searches, multiple webpage titles may be returned, and the webpage each title points to (including its body text) matches the commodity attribute entered by the user.
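One way to picture the selection of final webpages is the sketch below. The `PageView` record, its field names and the action labels are hypothetical, and treating the three criteria as alternatives that are all accepted at once is an assumption; an implementation might equally apply only one of them.

```python
from dataclasses import dataclass, field


@dataclass
class PageView:
    """One clicked search result in a user's session (hypothetical record)."""
    title: str
    text_id: str                      # identifier of the stored body text
    dwell_seconds: float
    actions: set[str] = field(default_factory=set)


# Operations that signal direct interest in a page.
INTEREST_ACTIONS = {"save", "favorite", "screenshot", "copy", "forward"}


def final_pages(views: list[PageView], set_duration: float) -> list[PageView]:
    """Select final webpages from views given in chronological order."""
    selected = []
    for index, view in enumerate(views):
        viewed_last = index == len(views) - 1
        long_dwell = view.dwell_seconds > set_duration
        acted_on = bool(view.actions & INTEREST_ACTIONS)
        if viewed_last or long_dwell or acted_on:
            selected.append(view)
    return selected


if __name__ == "__main__":
    session = [
        PageView("A", "t1", 5),
        PageView("B", "t2", 120),
        PageView("C", "t3", 3, {"save"}),
        PageView("D", "t4", 2),
    ]
    print([view.title for view in final_pages(session, set_duration=60)])
```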

The user usually glances over the webpage titles returned by the search engine before clicking one, so a click indicates to some extent that the webpage meets the user's search needs. A browsing time longer than the set duration reflects, to some extent, the user's interest in the webpage and likewise indicates that the webpage meets the user's search needs. In other words, the webpage contains word information from which the product the user is interested in, other related products, and related attributes of the product can be mined. Using the webpage title and the text of the webpage, the query expansion words can be determined more accurately.

An operation such as saving, adding to favorites, taking a screenshot, or forwarding usually indicates directly that the information provided by the webpage interests the user; that is, the webpage contains word information from which the product the user is interested in, other related products, and related attributes of the product can be mined. Using the webpage title and the text of the webpage, the query expansion words can be determined more accurately.

Here, the present application may specifically provide a historical query data record that stores the user's query terms (including commodity attributes), the titles of the webpages searched, and the text identifiers of the body text of those webpages.

Step 22: extracting commodity attributes from the obtained title and from the text identified by the text identifier.

Because the final webpage searched contains word information from which the product the user is interested in, other related products, and related attributes of the product can be mined, the commodity attributes extracted from the webpage title and text in this step allow the query expansion words to be determined accurately.

Step 23: for each commodity attribute obtained offline and the related commodity attribute library of that commodity attribute, calculating the correlation with the commodity attributes contained in the historical query data.

Step 24: for each commodity attribute obtained offline and the related commodity attribute library of that commodity attribute, calculating the correlation with the extracted commodity attributes.

Specifically, the vector representations of the queried or extracted commodity attributes can be obtained with the trained Word2Vec word vector model, and the vector representations of each offline commodity attribute and of every related commodity attribute in its related commodity attribute library are already known, so the calculations of step 23 and step 24 can be carried out.

For example, assume the queried commodity attribute is K.

The commodity attributes obtained in the offline process and their corresponding related commodity attribute lists are as follows:

the related commodity attribute list corresponding to commodity attribute B is: commodity attribute A, commodity attribute C, commodity attribute M, and commodity attribute J;

the related commodity attribute list corresponding to commodity attribute B1 is: commodity attribute A1, commodity attribute C1, commodity attribute M1, and commodity attribute J1;

the related commodity attribute list corresponding to commodity attribute B2 is: commodity attribute A2, commodity attribute C2, commodity attribute M2, and commodity attribute J2.

The calculation process of step 23 or step 24 is:

calculating the correlation between commodity attribute K and commodity attribute B, and the correlations between commodity attribute K and each of commodity attributes A, C, M and J;

calculating the correlation between commodity attribute K and commodity attribute B1, and the correlations between commodity attribute K and each of commodity attributes A1, C1, M1 and J1;

calculating the correlation between commodity attribute K and commodity attribute B2, and the correlations between commodity attribute K and each of commodity attributes A2, C2, M2 and J2.

The order of step 23 and step 24 may be exchanged, or the two steps may be performed simultaneously.
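The calculation in the example can be sketched in Python using the attribute names K, B, A and C from the text. The two-dimensional vectors are invented for illustration, cosine similarity is used as the correlation measure by assumption, and `score_against_library` is a hypothetical helper name.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def score_against_library(
    query_vec: list[float],
    library: dict[str, list[str]],
    vectors: dict[str, list[float]],
) -> dict[str, dict[str, float]]:
    """Score the query against every offline attribute and each member of its related list."""
    scores: dict[str, dict[str, float]] = {}
    for key, related in library.items():
        scores[key] = {
            name: cosine(query_vec, vectors[name]) for name in [key, *related]
        }
    return scores


if __name__ == "__main__":
    # Toy vectors, invented for illustration only.
    vectors = {
        "B": [1.0, 0.0], "A": [0.8, 0.6], "C": [0.0, 1.0],
        "B1": [0.0, 1.0], "A1": [0.6, 0.8],
    }
    library = {"B": ["A", "C"], "B1": ["A1"]}
    vector_k = [1.0, 0.0]  # vector of the queried commodity attribute K
    print(score_against_library(vector_k, library, vectors))
```

The same helper serves step 23 and step 24: it is called once with the vector of the queried commodity attribute and once with the vector of each commodity attribute extracted from the final webpage.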

Step 25: screening the commodity attributes obtained offline and their corresponding related commodity attribute lists based on the correlations obtained in step 23 and step 24, and retrieving with the screened commodity attributes as query expansion words.

For the screening process, refer to step 14.
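The text does not state how the correlations from step 23 and step 24 are combined before screening. The sketch below assumes, purely for illustration, that each candidate keeps the larger of its two scores and that candidates above a set value are kept up to a limit; the function names, the threshold and the limit are all hypothetical.

```python
def select_expansion_terms(
    query_scores: dict[str, float],
    page_scores: dict[str, float],
    set_value: float,
    limit: int,
) -> list[str]:
    """Screen candidates by the larger of their step 23 and step 24 correlations."""
    combined = {
        name: max(query_scores.get(name, 0.0), page_scores.get(name, 0.0))
        for name in query_scores.keys() | page_scores.keys()
    }
    # Highest correlation first; ties broken by name for a stable order.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, score in ranked if score > set_value][:limit]


def expand_query(query: str, terms: list[str]) -> list[str]:
    """Return the original query followed by its expansion words."""
    return [query, *[term for term in terms if term != query]]


if __name__ == "__main__":
    # Toy correlations, invented for illustration only.
    step23 = {"B": 0.9, "A": 0.4, "C": 0.2}
    step24 = {"A": 0.85, "M": 0.7, "C": 0.1}
    terms = select_expansion_terms(step23, step24, set_value=0.5, limit=3)
    print(expand_query("K", terms))
```

Here attribute A is kept even though its correlation with the query alone is low, because the final webpage raises its score; this is the effect the page-derived attributes are meant to have.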

In the scheme of the embodiments of the present application, the commodity attributes associated with the user's click behavior also undergo correlation calculation against the commodity attributes obtained in the offline process and their corresponding related commodity attribute lists, and ultimately take part in the screening of step 25.

Based on the same inventive concept, the embodiment of the present application further provides a retrieval system, including:

an acquiring unit 61, configured to acquire historical query data of a user, wherein the historical query data comprises commodity attributes, and the title and text identifier of the final webpage searched;

an extracting unit 62, configured to extract commodity attributes from the obtained title and from the text identified by the text identifier;

a commodity attribute library generating unit 63, configured to obtain offline the commodity attributes and the related commodity attribute libraries of the commodity attributes;

a calculating unit 64, configured to calculate, for each commodity attribute obtained offline and the related commodity attribute library of that commodity attribute, the correlation with the commodity attributes contained in the historical query data, and the correlation with the extracted commodity attributes;

a screening unit 65, configured to screen the commodity attributes obtained offline and their corresponding related commodity attribute lists based on the obtained correlations;

and a retrieval unit 66, configured to retrieve with the screened commodity attributes as query expansion words.

Preferably, the commodity attribute library generating unit 63 is specifically configured to: acquire offline a query log of a user to form a commodity attribute corpus, wherein the query log comprises commodity attributes; train a Word2Vec word vector model with the commodity attribute corpus as the sample; obtain the vector representation of each commodity attribute in the commodity attribute corpus by using the trained Word2Vec word vector model; and calculate the correlation between commodity attributes by using their vector representations, and screen based on the calculation results to obtain the related commodity attribute library of each commodity attribute.

Preferably, the final webpage searched is a webpage that the user has saved, added to favorites, captured in a screenshot, copied, or forwarded.

Preferably, the final webpage searched is the webpage the user viewed last, or a webpage whose browsing time exceeds a set duration.

Preferably, the system is applied to a vertical e-commerce trading platform for selling bulk commodities.

From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present application may be implemented by hardware, or by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and which includes several instructions for causing a computer device (such as a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present application.

Those skilled in the art will appreciate that the drawings are merely schematic representations of one preferred embodiment and that the blocks or flow diagrams in the drawings are not necessarily required to practice the present application.

Those skilled in the art can understand that the modules in the terminal in the embodiment can be distributed in the terminal in the embodiment according to the description of the embodiment, and can also be located in one or more terminals different from the embodiment with corresponding changes. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A retrieval method, comprising:

for each commodity attribute obtained offline and the related commodity attribute library of that commodity attribute, respectively calculating the correlation with the commodity attributes contained in the historical query data and the correlation with the extracted commodity attributes;

retrieving with the screened commodity attributes as query expansion words;

wherein obtaining offline the related commodity attribute library of the commodity attributes comprises:

acquiring a query log of a user to form a commodity attribute corpus, wherein the query log comprises commodity attributes;

training a Word2Vec word vector model with the commodity attribute corpus as the sample;

obtaining the vector representation of each commodity attribute in the commodity attribute corpus by using the trained Word2Vec word vector model;

calculating the correlation between commodity attributes by using their vector representations, and screening based on the calculation results to obtain the related commodity attribute library of each commodity attribute;

wherein calculating the correlation between the commodity attributes by using the vector representations of the commodity attributes specifically comprises:

the queried commodity attribute is K;

the related commodity attribute list corresponding to commodity attribute B2 is: commodity attribute A2, commodity attribute C2, commodity attribute M2, and commodity attribute J2;

2. The method of claim 1, wherein the final webpage searched is a webpage on which the user has performed a save, favorite, screenshot, copy, or forward operation.

3. The method of claim 1, wherein the final webpage searched is the webpage the user viewed last, or a webpage whose browsing time exceeds a set duration.

4. The method of claim 1, wherein the method is applied to a vertical e-commerce trading platform for selling bulk goods.

5. A retrieval system, comprising:

the queried commodity attribute is K;

calculating the correlation between commodity attribute K and commodity attribute B2, and the correlations between commodity attribute K and each of commodity attributes A2, C2, M2 and J2;

a retrieval unit, configured to retrieve with the screened commodity attributes as query expansion words;

wherein the commodity attribute library generating unit is specifically configured to: acquire offline a query log of a user to form a commodity attribute corpus, wherein the query log comprises commodity attributes; train a Word2Vec word vector model with the commodity attribute corpus as the sample; obtain the vector representation of each commodity attribute in the commodity attribute corpus by using the trained Word2Vec word vector model; and calculate the correlation between commodity attributes by using their vector representations, and screen based on the calculation results to obtain the related commodity attribute library of each commodity attribute.

6. The system of claim 5, wherein the final webpage searched is a webpage on which the user has performed a save, favorite, screenshot, copy, or forward operation.

7. The system of claim 5, wherein the final webpage searched is the webpage the user viewed last, or a webpage whose browsing time exceeds a set duration.

8. The system of claim 5, wherein the system is implemented on a vertical e-commerce trading platform for selling bulk goods.