CN111552767A - Search method, search device and computer equipment

Info

Publication number
CN111552767A
Authority
CN
China
Prior art keywords
word
similarity
vector
document
important
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910110323.9A
Other languages
Chinese (zh)
Inventor
林方全
杨超
李越川
张京桥
杨程
马君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201910110323.9A
Publication of CN111552767A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/338 - Presentation of query results

Abstract

The application discloses a search method, a search apparatus, and a computer device. The method includes: acquiring a query word; performing word segmentation on the query word to obtain a first word segmentation set; selecting important words from the first word segmentation set; and determining documents whose similarity to the important words is greater than a preset threshold to obtain a document set, which is taken as the retrieval result. The method and the device solve the technical problem in the prior art that retrieval results are inaccurate when long query words are retrieved.

Description

Search method, search device and computer equipment
Technical Field
The present application relates to the field of machine learning, and in particular, to a search method, a search apparatus, and a computer device.
Background
With the development of computer technology, job seekers can post resumes in real time through the internet, and enterprises can likewise acquire and screen resumes in real time through the internet. However, a resume contains a large amount of content, and when retrieving with it the server generally truncates the tail of the long query word; for example, some search engines limit the query length to a preset number of Chinese characters, such as 38, and ignore any characters beyond that range. In the field of job recruitment search, querying for positions that match a given resume uses the full resume text as the query word, but a resume generally contains hundreds or even thousands of characters, so simply keeping the first few dozen characters as the query word cannot meet the user's actual requirements and yields inaccurate retrieval results. Similarly, querying for resumes that match a given job description uses the job description as the query word, and a job description is also typically more than 100 words long.
In order to solve the above problem, one prior-art approach recalls documents that contain at least M of the N keywords input by the user, where M is less than or equal to N, and then generates search results from the recalled documents. However, this scheme cannot distinguish the relative importance of the N keywords; for a long query such as a resume, treating all keywords as equally important means the search relevance cannot be ensured.
In addition, another prior-art approach decomposes document content with a topic model and maps documents and long query words into a topic space, thereby reducing the dimensionality. However, the topic-learning dimensionality reduction used in this scheme is not directly related to the objective of the search task, and dimensionality reduction is information compression, which inevitably loses information. When the dimensionality reduction is not tied to the search task, the lost information may greatly reduce search relevance and user satisfaction.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the application provides a searching method, a searching device and computer equipment, and aims to at least solve the technical problem that in the prior art, a searching result is inaccurate when a long query word is searched.
According to an aspect of an embodiment of the present application, there is provided a search method including: acquiring a query word; performing word segmentation on the query word to obtain a first word segmentation set; selecting an important word from the first word segmentation set; and determining the documents with the similarity to the important words larger than a preset threshold value to obtain a document set, and taking the document set as a retrieval result.
According to another aspect of the embodiments of the present application, there is also provided a search method, including: displaying the obtained query words; displaying a first word segmentation set obtained after the query word is segmented; displaying the important words selected from the first word segmentation set; and displaying the document set obtained by determining the documents with the similarity greater than the preset threshold value with the important words.
According to another aspect of the embodiments of the present application, there is also provided a search apparatus, including: an acquisition module, configured to acquire a query word; a word segmentation module, configured to perform word segmentation on the query word to obtain a first word segmentation set; a selection module, configured to select important words from the first word segmentation set; and a determining module, configured to determine documents whose similarity to the important words is greater than a preset threshold to obtain a document set, and to take the document set as the retrieval result.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program, wherein the apparatus in which the storage medium is located is controlled to perform the search method when the program runs.
According to another aspect of the embodiments of the present application, there is also provided a processor for executing a program, wherein the program executes to perform the search method.
According to another aspect of the embodiments of the present application, there is also provided a computer device, including: a processor; and a memory coupled to the processor for providing instructions to the processor for processing the following processing steps: acquiring a query word; performing word segmentation on the query word to obtain a first word segmentation set; selecting an important word from the first word segmentation set; and determining the documents with the similarity to the important words larger than a preset threshold value to obtain a document set, and taking the document set as a retrieval result.
In the embodiment of the application, after the query word is obtained, the important-word screening learned during matching-model training is applied: word segmentation is performed on the query word to obtain a first word segmentation set, important words are then selected from the first word segmentation set, documents whose similarity to the important words is greater than a preset threshold are determined to obtain a document set, and finally the document set is taken as the retrieval result.
In this process the query word is not tail-truncated, so its integrity is preserved. In addition, after segmenting the query word, important words are selected from the first word segmentation set and the document set is determined based on them; in other words, the words that are dropped are the unimportant segments of the query word, and dropping them has little influence on the retrieval result, so the accuracy of the retrieval result is ensured.
According to the content, the scheme provided by the application achieves the purpose of retrieving the long query word, so that the technical effect of improving the accuracy of the retrieval result of the long query word is achieved, and the technical problem that the retrieval result is inaccurate when the long query word is retrieved in the prior art is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a block diagram of the hardware configuration of an alternative computer terminal according to an embodiment of the present application;
FIG. 2 is a flow chart of a search method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative first model training in accordance with embodiments of the present application;
FIG. 4 is a schematic illustration of an alternative first model training in accordance with embodiments of the present application;
FIG. 5 is a flow chart of an alternative search method according to an embodiment of the present application;
FIG. 6 is a flow chart of a search method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a search apparatus according to an embodiment of the present application; and
fig. 8 is a block diagram of a computer terminal according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
There is also provided, in accordance with an embodiment of the present application, an embodiment of a search method, it should be noted that the steps illustrated in the flowchart of the accompanying drawings may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order different than here.
The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a hardware configuration block diagram of a computer terminal (or mobile device) for implementing the search method. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, …, 102n; the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuit acts as a processor control (e.g. selection of a variable resistance termination path connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the search method in the embodiment of the present application, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, so as to implement the search method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
It should be noted here that in some alternative embodiments, the computer device (or mobile device) shown in fig. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that fig. 1 is only one example of a particular specific example and is intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.
Under the above operating environment, the present application provides a search method as shown in fig. 2. Fig. 2 is a flowchart of a search method according to a first embodiment of the present application, where the method includes the following steps:
step S202, obtaining the query word.
It should be noted that the server may obtain the query term through a client device, where the client device may be a mobile terminal such as a mobile phone, a tablet, or a computer. The query term is a long query term, that is, its length is greater than a preset length, for example more than 100 characters. Optionally, the query term may be text input by the user through the client device, for example a job description input by an enterprise through the client device when recruiting. The query term may also be a document file input by the user through the client device or resume information crawled from a webpage, for example a resume input by the user through the client device.
Step S204, performing word segmentation on the query word to obtain a first word segmentation set.
Optionally, after obtaining the query word, the server first detects the language type of the query word and selects the word segmentation algorithm accordingly. For example, if the language is determined to be Chinese, the server may segment the query word using a word segmentation algorithm based on dictionary or lexicon matching, a word segmentation algorithm based on word frequency statistics, or a word segmentation algorithm based on knowledge understanding. If the query word is detected to be English, the server may use a stem extraction algorithm to segment the query word.
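As a concrete illustration of this step, the sketch below chooses a segmenter by language type. The jieba and NLTK libraries are assumptions made for the example; the embodiment only specifies the class of algorithm (dictionary-, word-frequency- or knowledge-based segmentation for Chinese, stemming for English), not a particular toolkit.

```python
import re
import jieba                              # dictionary / word-frequency based Chinese segmenter (assumed)
from nltk.stem import PorterStemmer       # English stem extraction (assumed)

def segment_query(query: str) -> list[str]:
    """Return the first word segmentation set of a (possibly long) query word."""
    if re.search(r"[\u4e00-\u9fff]", query):                 # language type: Chinese characters present
        return [w for w in jieba.cut(query) if w.strip()]
    stemmer = PorterStemmer()                                # language type: English, use stem extraction
    return [stemmer.stem(w) for w in re.findall(r"[A-Za-z]+", query.lower())]
```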
Step S206, selecting important words from the first word segmentation set.
It should be noted that the important word is a participle in the first participle set, where the important word is a participle whose importance degree is greater than a preset degree.
Optionally, important words are stored in the preset participle set, and the important words in the first participle set can be obtained by performing intersection operation on the participles in the first participle set and the participles in the preset participle set.
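A minimal sketch of this intersection operation, assuming the preset word segmentation set is held in memory as a Python set:

```python
# Selecting the important words: intersect the first word segmentation set
# with the offline-generated preset word segmentation set, preserving the
# order in which the words appear in the query.
def select_important_words(first_seg_set: list[str], preset_set: set[str]) -> list[str]:
    return [w for w in first_seg_set if w in preset_set]
```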
Step S208, determining the documents with similarity to the important words greater than a preset threshold to obtain a document set, and taking the document set as the retrieval result.
Optionally, the server may perform word segmentation on the document to be queried, calculate similarity between the word segmentation of the document after the word segmentation and the important word, and form a document set with the similarity greater than a preset threshold. For example, if the similarity between the document 1 and the important word is greater than a preset threshold, the similarity between the document 2 and the important word is greater than a preset threshold, and the similarity between the document 3 and the important word is less than a preset threshold, the document 1 and the document 2 are documents in the document set.
It should be noted that the server may feed back the document set as the search result to the client device, or may not feed back the search result to the client device, that is, the search result may be used in other application scenarios, for example, the search result is used for data statistical analysis, and is used for obtaining corresponding virtual resources (such as a bonus or a service fee), and the like, but is not limited thereto.
Optionally, the client device has a display, and after the server generates the document set, the server sends the document set to the client device, and the client device can display the searched document information to the user. For example, for an enterprise, after the enterprise inputs job description content through a client device, the client presents resumes of related people who meet the job to the user; and for the applicant, after the user inputs the resume through the client, the client displays the recruitment information matched with the resume of the user to the user.
Based on the scheme defined in the above steps S202 to S208, it can be seen that, after the query word is obtained, the important-word screening learned during matching-model training is applied: the query word is segmented to obtain a first word segmentation set, important words are selected from it, documents whose similarity to the important words is greater than a preset threshold are determined to obtain a document set, and the document set is finally taken as the retrieval result.
It is easy to notice that the query word is not tail-truncated, so its integrity is preserved. In addition, after segmenting the query word, important words are selected from the first word segmentation set and the document set is determined based on them; in other words, the words that are dropped are the unimportant segments of the query word, and dropping them has little influence on the retrieval result, so the accuracy of the retrieval result is ensured.
According to the content, the scheme provided by the application achieves the purpose of retrieving the long query word, so that the technical effect of improving the accuracy of the retrieval result of the long query word is achieved, and the technical problem that the retrieval result is inaccurate when the long query word is retrieved in the prior art is solved.
In an alternative scheme, before performing the word segmentation processing on the query word, the server first needs to acquire the query word. Specifically, the server acquires a document file input by a user, identifies the document file to obtain content information, and takes the content information as a query word.
Optionally, the document file is taken as a resume for explanation, and the user is a job seeker. The user uploads the resume to the server through the client device, and the server identifies the resume after receiving the resume to obtain content information of the resume, such as the content of work experience, work capacity, employment position, age, sex, contact information, expected salary, expected work city and the like of the user. And after identifying the resume, the server takes the content information as a query word. In this way, a job position matching the user resume can be realized.
Correspondingly, in some embodiments of the present application, the recruiter may also query the resume related to the job by using the method provided by the embodiments of the present application, for example: the query term obtaining comprises: acquiring job description information input by a user; and identifying the job description information to obtain identification content, and determining a query word based on the identification content. And then performing word segmentation on the query word to obtain the first word segmentation set, selecting an important word in the first word segmentation set, determining the resume with the similarity to the important word being greater than a preset threshold value, and obtaining a resume set, wherein the resume set comprises at least one resume associated with the position description information.
Further, after the query word is obtained, the server performs word segmentation processing on the query word to obtain a first word segmentation set, and then selects an important word from the first word segmentation set. Specifically, the server takes an intersection of the first participle set and the preset participle set, and takes a result obtained after the intersection is taken as an important word. The preset word segmentation set is a set of important words generated offline.
Alternatively, the preset segmentation set may be determined by: and determining historical query words from the historical query word list, and performing word segmentation processing on the historical query words to obtain first query words. Meanwhile, the server also obtains a historical behavior log which represents the historical operation of the user on the server through the client device, such as clicking, downloading, inquiring and document consulting operations, and the operation time, operation times and the like. And then the server inputs the first query word and the historical behavior log into a semantic matching model to obtain a second query word, and detects whether the second query word exists in a preset word segmentation set. And if the second query word is detected not to exist in the preset segmentation set, storing the first query word in the preset segmentation set.
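A hedged sketch of this offline maintenance of the preset word segmentation set is given below; semantic_match and segment are hypothetical callables standing in for the semantic matching model and the word segmenter, since the embodiment does not fix their interfaces.

```python
# Hedged sketch of the offline maintenance of the preset word segmentation
# set. semantic_match() stands in for the semantic matching model that maps
# a first query word plus the historical behavior log to a second query
# word; its signature and the exact control flow are assumptions made for
# illustration only.
def update_preset_set(history_queries, behavior_log, preset_set, segment, semantic_match):
    for history_query in history_queries:
        for first_query_word in segment(history_query):          # first query words
            second_query_word = semantic_match(first_query_word, behavior_log)
            if second_query_word not in preset_set:              # not yet represented offline
                preset_set.add(first_query_word)                 # store the first query word
    return preset_set
```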
After determining the important words, the server determines, from the document set to be queried, the documents whose similarity to the important words is greater than the preset threshold, thereby obtaining the document set. Specifically, the server first inputs the important words into a first model for analysis to obtain the query word vector of the query word; it then segments each document to be queried in the document set to be queried to obtain a second word segmentation set, and inputs the second word segmentation set into the first model for analysis to obtain the document vector of that document; finally, it determines the similarity between each document vector and the query word vector, selects from the document set to be queried the documents whose similarity to the important words is greater than the preset threshold, and stores the selected documents into the document set.
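The following sketch illustrates this retrieval flow under the assumption that the trained first model is exposed as an encode function returning real-valued vectors and that cosine similarity (used later in the training example) is the similarity measure; encode() and segment() are illustrative stand-ins.

```python
import numpy as np

# Sketch of retrieval with the trained first model: encode the important
# words into a query-word vector, encode each candidate document's second
# word segmentation set into a document vector, and keep documents whose
# similarity exceeds the preset threshold.
def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve(important_words, docs_to_query, encode, segment, threshold=0.5):
    q_vec = encode(important_words)                       # query-word vector
    document_set = []
    for doc in docs_to_query:
        d_vec = encode(segment(doc))                      # document vector
        if cosine(q_vec, d_vec) > threshold:              # similarity greater than preset threshold
            document_set.append(doc)
    return document_set
```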
It should be noted that, the first model is obtained by training a plurality of sets of data, and each set of data in the plurality of sets of data includes: sample tokens and vectors corresponding to the sample tokens.
Optionally, the first model may be trained by: firstly, obtaining sample data for training a first model, obtaining a bag-of-words model of the sample data, then determining a vector sequence of the sample data based on the bag-of-words model, combining the vector sequence with a real number matrix to obtain a real number vector of the sample data, and taking the real number vector as the output of the first model. The sample data is the combined data of the sample query words and the sample documents; each word in the bag-of-words model is numbered according to a preset sequence, the maximum value of the number is the length of a word list corresponding to the bag-of-words model, and the vector elements in the vector sequence are 0 or 1.
The explanation is given by taking the query word as a resume; the sample data for training the first model is the resume set. Optionally, the server may obtain the sample data via the internet. After obtaining the sample data, the server represents it with a bag-of-words model, that is, each word is numbered according to the word-list order. If the word list has length N, each text can be represented as a one-dimensional binary vector of length N whose elements are either 0 or 1, where an element of 1 indicates that the word with the corresponding number appears in the text. For example, consider three texts: "I am a teacher.", "I love you.", "You are a hero." The words are numbered a (1), am (2), are (3), hero (4), i (5), love (6), teacher (7), you (8), so the word-list length N is 8. The text "i am a teacher" contains the words numbered 1, 2, 5 and 7, so its bag-of-words vector sequence is 11001010; the other two texts can be represented by 0/1 vectors in the same way.
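The worked example above can be reproduced with a few lines of Python; the helper below is purely illustrative.

```python
# The worked example above, reproduced as a minimal bag-of-words encoding.
vocab = ["a", "am", "are", "hero", "i", "love", "teacher", "you"]   # words numbered 1..8

def bow_vector(text: str, vocab: list[str]) -> list[int]:
    words = set(text.lower().replace(".", " ").split())
    return [1 if w in words else 0 for w in vocab]

print(bow_vector("i am a teacher", vocab))   # [1, 1, 0, 0, 1, 0, 1, 0], i.e. 11001010
```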
Further, after obtaining the vector sequence of the sample data, the server uses the bag-of-words model to represent the query word as an N-dimensional 0/1 vector q and the document as an N-dimensional 0/1 vector d. Then q' = q · E is the k-dimensional real-number vector representation of the query word computed by the model, and similarly d' = d · E is the real-number vector representation of the document.
It should be noted that the real-number matrix E of size N × k maps each word to a k-dimensional real-number vector, where N is the word-list length; E must be learned through model training and is initialized with random real numbers.
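A minimal numpy sketch of this embedding step, with N = 8 from the example above and an assumed embedding dimension k = 4 chosen only for illustration:

```python
import numpy as np

# An N x k real-number matrix E maps N-dimensional 0/1 bag-of-words vectors
# to k-dimensional real vectors; E is randomly initialized and learned later.
N, k = 8, 4
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(N, k))        # random real-number initialization

q = np.array([1, 1, 0, 0, 1, 0, 1, 0])        # bag-of-words vector of a query word
d = np.array([0, 0, 1, 1, 0, 0, 0, 1])        # bag-of-words vector of a document
q_prime = q @ E                               # q' = q · E, k-dimensional query representation
d_prime = d @ E                               # d' = d · E, k-dimensional document representation
```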
Optionally, before the real number vector is used as the output of the first model, the server further obtains the similarity between the first real number vector of the sample query term and the second real number vector of the sample document, determines the probability corresponding to the similarity by using a loss function, estimates the real number matrix by using maximum likelihood estimation to obtain an estimated value, and then optimizes the real number matrix based on a stochastic gradient descent algorithm to minimize the estimated value.
Specifically, the description is given by taking the query word as a job description. For position 1 (a long query word), suppose the enterprise browses three resumes, selects resume 1 to initiate an interview, and does not select resume 2 or resume 3. The server then obtains three sample data for training the model: one positive sample <position 1, resume 1> and two negative samples <position 1, resume 2> and <position 1, resume 3>. After obtaining the positive and negative samples, the server denotes the real-number vector computed by the first model for position 1 as jd1', and the vectors obtained for resumes 1, 2 and 3 as cv1', cv2' and cv3', respectively. The server may use any vector similarity measure; taking cosine similarity, cosine(jd1', cv1') represents the matching degree of <position 1, resume 1> output by the first model, and the matching degrees of <position 1, resume 2>, <position 1, resume 3>, …, <position i, resume j> can be obtained in the same way.
In order to make the output of the first model more accurate, the first model needs to be optimized, which essentially means minimizing the loss function. Optionally, the loss function may map the matching degree to a probability in (0, 1) using a logistic function, and the parameter matrix E may be estimated by maximum likelihood estimation. During training of the first model, the parameter E can be iteratively optimized with a standard stochastic gradient descent algorithm so that the loss function value on the training data set is minimized.
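The sketch below shows one such training step. PyTorch is an assumption made for illustration (the embodiment names no framework); it maps the cosine matching degree to a probability with a logistic function and updates E by stochastic gradient descent on the negative log-likelihood.

```python
import torch

# Illustrative training step for the first model. q and d are 0/1
# bag-of-words tensors of length N; label is 1.0 for a positive sample
# (e.g. <position, selected resume>) and 0.0 for a negative sample.
N, k = 1000, 64                                   # assumed vocabulary size and vector dimension
E = torch.nn.Parameter(0.1 * torch.randn(N, k))   # embedding matrix, randomly initialized
optimizer = torch.optim.SGD([E], lr=0.1)

def training_step(q: torch.Tensor, d: torch.Tensor, label: float) -> float:
    q_prime, d_prime = q.float() @ E, d.float() @ E                   # q' = q·E, d' = d·E
    match = torch.nn.functional.cosine_similarity(q_prime, d_prime, dim=0)
    prob = torch.sigmoid(match)                                       # logistic mapping to (0, 1)
    loss = -(label * torch.log(prob) + (1 - label) * torch.log(1 - prob))
    optimizer.zero_grad()
    loss.backward()                                                   # maximum-likelihood objective
    optimizer.step()                                                  # stochastic gradient descent update
    return loss.item()
```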
It should be noted that a group lasso regularization term is set in the loss function and is used to shrink the vector of each participle in the sample data. Optionally, before the vector sequence is combined with the real-number matrix to obtain the real-number vector of the sample data, the server also deletes the participles in the sample data whose real-number vectors are smaller than a specified threshold.
In addition, to limit the size of the word list, a group lasso regularization term may be added to the target loss function to keep each word vector as small as possible, so that the vectors of unimportant words shrink toward 0 (i.e., below the above-mentioned specified threshold) while the vectors of the words that contribute to matching, the important words, remain above the specified threshold. After the first model is trained, the server sets the word vectors close to 0 to exactly 0 and deletes those words from the sample data, thereby saving prediction time of the first model and improving online computing efficiency.
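A sketch of this regularization and pruning, treating each row of E (one word's vector) as a group; the penalty weight lam and the threshold eps are illustrative values, not values taken from the application.

```python
import torch

# Group lasso penalty added to the loss, and post-training pruning of the
# near-zero word vectors.
def group_lasso_penalty(E: torch.Tensor, lam: float = 1e-3) -> torch.Tensor:
    return lam * E.norm(dim=1).sum()              # sum of per-word L2 norms

def prune_unimportant(E: torch.Tensor, vocab: list[str], eps: float = 1e-4):
    norms = E.norm(dim=1)
    keep = norms > eps                            # words whose vectors stayed above the threshold
    pruned_E = E * keep.unsqueeze(1).float()      # near-zero word vectors set exactly to 0
    important_words = [w for w, kept in zip(vocab, keep.tolist()) if kept]
    return pruned_E, important_words
```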
Optionally, fig. 3 shows a training diagram of the first model. As shown in fig. 3, after obtaining sample data, the server performs word segmentation on the standard data JD1 and the sample data CV1, CV2, and CV3 to obtain the standard data and a word segmentation vector corresponding to each sample data, then performs pooling processing of the pooling layer and processing of the full connection layer to calculate similarity between the word segmentation vector of each sample data and the word segmentation vector of the standard data, determines a probability corresponding to the similarity according to a loss function, estimates a real matrix by using maximum likelihood estimation, then performs optimization processing on the real matrix based on a random gradient descent algorithm, and finally obtains an output of the first model by using the real matrix after the optimization processing (e.g., cross entropy probability in fig. 3).
In another optional scheme, after obtaining the important words from the first participle set, the server further performs participle processing on the documents to be queried in the document set to be queried to obtain a second participle set, and inputs the participles in the second participle set to the semantic matching model respectively for analysis to obtain a vector set of the participles in the second participle set, then determines the participle similarity between each vector in the vector set and the important words to obtain a plurality of similarities, determines the similarity between the important words and the documents to be queried based on the plurality of similarities, and finally determines the documents with the similarity between the important words and the documents to be queried being greater than a preset threshold value from the document set to be queried.
In the above process, the semantic matching model is obtained by training a plurality of sets of data, and each set of data in the plurality of sets of data includes: sample tokens and vectors corresponding to the sample tokens.
Optionally, the server inputs the participles of the important words into the semantic matching model for analysis to obtain the word vectors of the important words, then determines the similarity between the word vectors of the important words and each vector in the vector set to obtain a plurality of similarities, sums these similarities, and uses the sum as the similarity between the document to be queried and the important words.
Optionally, the server may further obtain an average value of the multiple similarity degrees, and use the average value as the similarity degree between the document to be queried and the important word. For example, there are three documents to be queried, the server first performs word segmentation processing on each document to be queried, then calculates the similarity between the word vector of the word segmentation of each document to be queried and the word vector of the important word, and finally performs an operation of averaging the similarity of the document to be queried to obtain the similarity of the document to be queried and the important word.
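Both aggregation options reduce to a one-line choice, sketched here for completeness:

```python
import numpy as np

# The two aggregation options described above: the document-level similarity
# is either the sum or the average of the per-participle similarities.
def doc_similarity(token_similarities: list[float], mode: str = "mean") -> float:
    sims = np.asarray(token_similarities, dtype=float)
    return float(sims.sum()) if mode == "sum" else float(sims.mean())
```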
It should be noted that after obtaining the similarity of each document to be queried, the server determines, from the document set to be queried, the document set whose similarity to the important words is greater than the preset threshold. The server further obtains the number of documents requested by the client device, sorts the documents in the document set by their similarity to the important words, takes that many documents from the sorted set in descending order of similarity to obtain a target document set, and finally feeds the target document set back to the client device as the retrieval result. For example, if the client device requests three resumes, the server sorts the resumes in the document set by similarity and pushes the three resumes with the highest similarity to the client device. The client device thus obtains the documents most relevant to the query word.
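A sketch of this ranking-and-truncation step, assuming the per-document similarities are available as a dictionary:

```python
# Keep documents above the preset threshold, sort them by similarity in
# descending order, and return as many as the client device requested.
def top_documents(doc_sims: dict, threshold: float, num_requested: int) -> list:
    kept = [(doc, sim) for doc, sim in doc_sims.items() if sim > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)        # largest similarity first
    return [doc for doc, _ in kept[:num_requested]]
```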
Optionally, fig. 4 shows a training diagram of the first model. Specifically, the server first imports all documents from a database, where each record in the data table contains one document text. The server also imports a historical query word data set and a historical user behavior log from the database; each record of the historical query word data set contains one query word text, and each record of the historical user behavior log contains a user behavior description such as a query word, a document, and a click/download. After the data is imported, the server segments the document texts and the query words, takes the segmented documents and query words as training targets, and encodes the words into vectors with a deep learning technique; the vectors of the query words and the documents are then synthesized from the word vectors they contain by a synthesis operation, which includes but is not limited to summation and averaging. After obtaining the vectors of the query words and the documents, the server computes their similarity to obtain the semantic matching model. The vectors of the query words and documents are then screened through group lasso to obtain the important words, which are stored to form the important word list. Furthermore, document vectors are generated by the semantic matching model, and together with the document contents and the important word list they form a document vector index for the online system to query.
Further, FIG. 5 shows a flow chart of a search method for an online system to perform a search for query terms. Specifically, the user inputs a long query term (for example, job description information) through the client device, and the user may submit the long query term to the online system (i.e., the server) through an interactive form such as text pasting or file uploading. After receiving the long query word, the server performs word segmentation processing on the long query word, finds out intersection of the long query word after word segmentation processing and an important word list generated offline, and extracts the important word from the long query word after word segmentation processing. And then, the server generates a query word vector by combining the important words and the semantic matching model, and ranks according to the similarity of the query word vector and the document vector to obtain a document set with the most similar vector, namely the most semantic related document set. And finally, intercepting the document with the highest similarity with the query word according to the number of the documents required by the user and returning the document to the user.
It should be noted that, in the embodiment of the present application, a technology for synthesizing a document or a query word vector based on a word vector in a semantic matching model may also adopt technologies such as CNN, BiLSTM, and the like.
Compared with the prior art, in which words in the long query word are deleted with equal weight, the words deleted by the scheme provided by the application are the unimportant words learned during training of the matching model, so deleting them from the long query word does not affect the retrieval result and the accuracy of the retrieval result is ensured. In addition, compared with the prior art that screens words through a topic model without combining the matching training task, the important words selected by the scheme provided by the application are learned during matching-model training, so removing the other words from the long query word has the least influence on matching. Finally, the scheme can be applied to vertical search systems that retrieve with long query words, such as querying matching positions with an online-input resume or querying resumes with an input position description, avoiding the excessive loss of query-word information and the poor relevance of query results that would otherwise fail to meet the user's requirements.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the search method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
Example 2
According to an embodiment of the present application, there is also provided a search method, as shown in fig. 6, the method including the steps of:
step S602, displaying the obtained query terms.
It should be noted that the execution subject executing the embodiment may be a server, where the server has a display device, for example, a display screen.
In addition, the query term is a long query term, that is, its length is greater than a preset length, for example more than 100 characters. Optionally, the query term may be text input by the user through the client device, for example a job description input by an enterprise through the client device when recruiting. The query term may also be a document file input by the user through the client device, for example a resume input by the user through the client device.
Step S604, a first word segmentation set obtained after the word segmentation is carried out on the query word is displayed.
Optionally, after obtaining the query word, the server first detects a language type corresponding to the query word, and determines an algorithm for performing word segmentation processing on the query word according to the language type, for example, in a case where the language type is determined to be chinese, the server may perform word segmentation processing on the query word by using a word segmentation algorithm based on dictionary or lexicon matching, a word segmentation algorithm based on word frequency statistics, and a word segmentation algorithm based on knowledge understanding. Under the condition that the query word is detected to be English, the server can adopt a word stem extraction algorithm to perform word segmentation processing on the query word.
Step S606, displaying the important words selected from the first word segmentation set.
It should be noted that the important word is a participle in the first participle set, where the important word is a participle whose importance degree is greater than a preset degree.
Optionally, important words are stored in the preset participle set, and the important words in the first participle set can be obtained by performing intersection operation on the participles in the first participle set and the participles in the preset participle set.
Step S608, showing a document set obtained by determining documents with similarity to the important word greater than a preset threshold.
Optionally, the server may perform word segmentation on the document to be queried, calculate similarity between the word segmentation of the document after the word segmentation and the important word, and form a document set with the similarity greater than a preset threshold. For example, if the similarity between the document 1 and the important word is greater than a preset threshold, the similarity between the document 2 and the important word is greater than a preset threshold, and the similarity between the document 3 and the important word is less than a preset threshold, the document 1 and the document 2 are documents in the document set.
Based on the scheme defined in the above steps S602 to S608, it can be seen that, after the query term is obtained, the important-word screening learned during matching-model training is applied: the query term is segmented to obtain a first word segmentation set, important words are selected from it, documents whose similarity to the important words is greater than a preset threshold are determined to obtain a document set, and the document set is finally taken as the retrieval result.
It is easy to notice that the query term is not tail-truncated, so its integrity is preserved. In addition, after segmenting the query term, important words are selected from the first word segmentation set and the document set is determined based on them; in other words, the words that are dropped are the unimportant segments of the query term, and dropping them has little influence on the retrieval result, so the accuracy of the retrieval result is ensured.
According to the content, the scheme provided by the application achieves the purpose of retrieving the long query word, so that the technical effect of improving the accuracy of the retrieval result of the long query word is achieved, and the technical problem that the retrieval result is inaccurate when the long query word is retrieved in the prior art is solved.
It should be noted that the scheme provided in this embodiment is the same as the scheme provided in embodiment 1, and the details have been described in embodiment 1 and are not described herein again.
Example 3
According to an embodiment of the present application, there is also provided a search apparatus for implementing the above search method, as shown in fig. 7, the apparatus 70 includes: an obtaining module 701, a word segmentation module 703, a selection module 705 and a determination module 707.
The obtaining module 701 is configured to obtain a query term; a word segmentation module 703, configured to perform word segmentation on the query word to obtain a first word segmentation set; a selecting module 705, configured to select an important word from the first word set; the determining module 707 is configured to determine a document with similarity to the important word greater than a preset threshold, obtain a document set, and use the document set as a retrieval result.
Optionally, the searching apparatus may further include a feedback module 709, configured to feed back the document set as the retrieval result to the client device. It should be noted that the feedback module 709 is an optional module, that is, the above-mentioned apparatus may not feed the search result back to the client device, that is, the search result may also be used in other application scenarios, for example, the search result is used for data statistics analysis, used for obtaining corresponding virtual resources (such as a bonus or a service fee), and the like, but is not limited thereto.
It should be noted here that the obtaining module 701, the word segmentation module 703, the selection module 705, and the determination module 707 correspond to steps S202 to S208 in embodiment 1, and the four modules are the same as the corresponding steps in the implementation example and the application scenario, but are not limited to the disclosure in the first embodiment. It should be noted that the modules described above as part of the apparatus may be run in the computer terminal 10 provided in the first embodiment.
It should be noted that the search apparatus provided in this embodiment can execute the search method in embodiment 1, and related contents are already described in embodiment 1 and are not described herein again.
Example 4
According to an embodiment of the present application, there is also provided a computer device for implementing the above search method, the computer device including: a processor and a memory.
The memory is connected with the processor and used for providing instructions for the processor to process the following processing steps: acquiring a query word; performing word segmentation on the query word to obtain a first word segmentation set; selecting an important word from the first word segmentation set; and determining the documents with the similarity to the important words larger than a preset threshold value to obtain a document set, and taking the document set as a retrieval result.
As can be seen from the above, after the query word is obtained, the important-word screening learned during matching-model training is applied: the query word is segmented to obtain a first word segmentation set, important words are selected from the first word segmentation set, documents whose similarity to the important words is greater than a preset threshold are determined to obtain a document set, and finally the document set is taken as the retrieval result.
It is easy to notice that the query word is not tail-truncated, so its integrity is preserved. In addition, after segmenting the query word, important words are selected from the first word segmentation set and the document set is determined based on them; in other words, the words that are dropped are the unimportant segments of the query word, and dropping them has little influence on the retrieval result, so the accuracy of the retrieval result is ensured.
According to the content, the scheme provided by the application achieves the purpose of retrieving the long query word, so that the technical effect of improving the accuracy of the retrieval result of the long query word is achieved, and the technical problem that the retrieval result is inaccurate when the long query word is retrieved in the prior art is solved.
It should be noted that the computer device provided in this embodiment can execute the search method in embodiment 1, and relevant contents are already described in embodiment 1, and are not described herein again.
Example 5
The embodiment of the application can provide a computer terminal, and the computer terminal can be any one computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.
In this embodiment, the computer terminal may execute the program code of the following steps in the search method: acquiring a query word; performing word segmentation on the query word to obtain a first word segmentation set; selecting an important word from the first word segmentation set; and determining the documents with the similarity to the important words larger than a preset threshold value to obtain a document set, and taking the document set as a retrieval result.
Optionally, fig. 8 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 8, the computer terminal 10 may include: one or more processors 802 (only one of which is shown), a memory 804, and a transmitting device 806.
The memory may be configured to store software programs and modules, such as program instructions/modules corresponding to the searching method and apparatus in the embodiments of the present application, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, so as to implement the searching method. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memories may further include a memory located remotely from the processor, which may be connected to the terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: acquiring a query word; performing word segmentation on the query word to obtain a first word segmentation set; selecting an important word from the first word segmentation set; and determining the documents with the similarity to the important words larger than a preset threshold value to obtain a document set, and taking the document set as a retrieval result.
Optionally, the processor may further execute the program code of the following steps: acquiring the number of documents sent by client equipment; arranging the documents in the document set according to the similarity with the important words; obtaining documents of the number of the documents from the arranged document set according to the sequence of similarity from large to small to obtain a target document set; and feeding back the target document set as a retrieval result to the client device.
Optionally, the processor may further execute the program code of the following steps: inputting the important words into a first model for analysis to obtain query word vectors of the query words, wherein the first model is obtained by training a plurality of groups of data, and each group of data in the plurality of groups of data comprises: sample segmentation and a vector corresponding to the sample segmentation; performing word segmentation on documents to be queried in the document set to be queried to obtain a second word segmentation set; inputting the second word set into the first model for analysis to obtain a document vector of the document set; determining the similarity between the document vector and the query word vector; and determining the documents with the similarity to the important words larger than a preset threshold value from the document set to be queried, and storing the determined documents into the document set.
Optionally, the processor may further execute the program code of the following steps: acquiring sample data for training the first model, wherein the sample data is combined data of a sample query word and a sample document; acquiring a bag-of-words model of the sample data, wherein each word in the bag-of-words model is numbered in a preset order, and the maximum number equals the length of the word list corresponding to the bag-of-words model; determining a vector sequence of the sample data based on the bag-of-words model, wherein the vector elements in the vector sequence are 0 or 1; and combining the vector sequence with a real number matrix to obtain a real number vector of the sample data, and taking the real number vector as the output of the first model.
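The following sketch illustrates how such a training input can be built, under the assumption that "combining with the real number matrix" is a matrix product (an embedding lookup). The sample, the vocabulary and the matrix sizes are toy values; the random matrix stands in for the parameters to be learned.

import numpy as np

sample = ["java", "engineer", "java"]
vocab = {word: idx for idx, word in enumerate(sorted(set(sample)))}  # numbering of the bag-of-words model
vocab_len = len(vocab)                                               # length of the word list

one_hot_sequence = np.zeros((len(sample), vocab_len))
for pos, word in enumerate(sample):
    one_hot_sequence[pos, vocab[word]] = 1            # vector elements are 0 or 1

real_number_matrix = np.random.rand(vocab_len, 4)     # toy stand-in for the learned matrix
real_vectors = one_hot_sequence @ real_number_matrix  # one real number vector per participle
sample_vector = real_vectors.mean(axis=0)             # real number vector of the sample
print(sample_vector)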
Optionally, the processor may further execute the program code of the following steps: obtaining the similarity between a first real number vector of the sample query word and a second real number vector of the sample document; determining the probability corresponding to the similarity by using a loss function, and estimating the real number matrix by maximum likelihood estimation to obtain an estimated value; and optimizing the real number matrix based on a stochastic gradient descent algorithm to minimize the estimated value.
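A toy sketch of this training objective, under the following assumptions (none of which is fixed by the text): the similarity is the dot product of the two real-number vectors, the loss is the negative log-likelihood of a sigmoid of that similarity, and one random negative pair is drawn per positive pair. Matrix sizes, data and learning rate are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
vocab_len, dim = 6, 4
E = rng.normal(scale=0.1, size=(vocab_len, dim))   # real number matrix being estimated

positive_pairs = [(0, 1), (2, 3)]                  # (query word id, document word id)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.1
for epoch in range(100):
    for q, d in positive_pairs:
        neg = int(rng.integers(vocab_len))          # random negative document word
        if neg == d:
            neg = (neg + 1) % vocab_len
        for target, doc_id in ((1.0, d), (0.0, neg)):
            p = sigmoid(E[q] @ E[doc_id])           # probability from the similarity
            grad = p - target                       # d(negative log-likelihood)/d(similarity)
            grad_q, grad_d = grad * E[doc_id], grad * E[q]
            E[q] -= lr * grad_q                     # stochastic gradient descent update
            E[doc_id] -= lr * grad_d

print(sigmoid(E[0] @ E[1]))   # similarity probability of a positive pair after training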
Optionally, the processor may further execute the program code of the following steps: deleting, from the sample data, the real number vectors of the participles whose vectors are smaller than a specified threshold. The loss function is provided with a group lasso regularization term, which is used to shrink the vector of each participle in the sample data.
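A sketch of the group-lasso idea under the common interpretation that each participle's vector (one row of the real-number matrix) forms a group: the penalty on the per-row L2 norm drives unimportant vectors toward zero, and rows whose norm falls below a threshold are then deleted. The matrix, the regularization weight and the threshold are assumed toy values.

import numpy as np

E = np.array([[0.80, 0.61],    # one row per participle
              [0.02, 0.01],
              [0.45, 0.30]])
lam, threshold = 0.01, 0.05

group_norms = np.linalg.norm(E, axis=1)          # L2 norm of each participle's vector
group_lasso_penalty = lam * group_norms.sum()    # term added to the loss function

keep = group_norms >= threshold                  # delete participles with near-zero vectors
pruned_E = E[keep]
print(group_lasso_penalty)
print(keep)
print(pruned_E)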
Optionally, the processor may further execute the program code of the following steps: performing word segmentation on documents to be queried in the document set to be queried to obtain a second word segmentation set; respectively inputting the participles in the second participle set into a semantic matching model for analysis to obtain a vector set of the participles in the second participle set, wherein the semantic matching model is obtained through training of multiple groups of data, and each group of data in the multiple groups of data comprises: sample segmentation and a vector corresponding to the sample segmentation; determining the word segmentation similarity of each vector in the vector set and the important word to obtain a plurality of similarities; determining the similarity between the important words and the document to be queried based on the plurality of similarities; and determining the documents with the similarity to the important words larger than a preset threshold value from the document set to be queried.
Optionally, the processor may further execute the program code of the following steps: inputting the participles of the important words into a semantic matching model for analysis to obtain word vectors of the important words; and determining the similarity of the word vector of the important word and each vector in the vector set to obtain a plurality of similarities.
Optionally, the processor may further execute the program code of the following steps: summing the plurality of similarities to obtain a sum value, and taking the sum value as the similarity between the document to be queried and the important word; or obtaining the average of the plurality of similarities, and taking the average as the similarity between the document to be queried and the important word.
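The word-level matching variant of the three preceding paragraphs can be illustrated as follows. The word vectors are toy stand-ins for the semantic matching model, cosine similarity is an assumed choice of word-level similarity, and both aggregation rules (sum and average) are shown.

import math

WORD_VECTORS = {
    "java":     [0.9, 0.1, 0.2],
    "spring":   [0.8, 0.2, 0.1],
    "sales":    [0.1, 0.9, 0.3],
    "engineer": [0.7, 0.3, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

important_word_vector = WORD_VECTORS["java"]            # word vector of the important word
document_participles = ["spring", "engineer", "sales"]  # second word segmentation set

similarities = [cosine(important_word_vector, WORD_VECTORS[w])
                for w in document_participles]          # one similarity per participle

doc_similarity_sum = sum(similarities)                  # sum-based aggregation
doc_similarity_avg = doc_similarity_sum / len(similarities)  # average-based aggregation
print(similarities, doc_similarity_sum, doc_similarity_avg)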
Optionally, the processor may further execute the program code of the following steps: taking the intersection of the first word segmentation set and a preset word segmentation set, and taking the result of the intersection as the important word.
Optionally, the processor may further execute the program code of the following steps: determining historical query words from a historical query word list; and judging whether a given word appears among the historical query words, and storing the word into the preset word segmentation set when the judgment result is yes.
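A small sketch of the important-word selection in the two preceding paragraphs: the preset word segmentation set is built from words seen in historical queries, and the important words are the intersection of the query's participle set with that preset set. The historical queries and the whitespace segmentation are assumptions for illustration.

historical_queries = ["java engineer shanghai", "senior java developer"]

preset_word_set = set()
for query in historical_queries:
    for word in query.split():          # toy segmentation by whitespace
        preset_word_set.add(word)       # store words from historical queries

first_word_set = set("java engineer with five years experience".split())
important_words = first_word_set & preset_word_set   # intersection with the preset set
print(important_words)                  # {'java', 'engineer'}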
Optionally, the processor may further execute the program code of the following steps: acquiring a document file input by a user; and identifying the document file to obtain content information, and taking the content information as a query word.
Optionally, the processor may further execute the program code of the following steps: identifying the job description information to obtain identification content, and determining a query word based on the identification content; and determining the resumes whose similarity to the important word is greater than a preset threshold to obtain a resume set, wherein the resume set comprises at least one resume related to the job description information.
It can be understood by those skilled in the art that the structure shown in fig. 8 is only illustrative, and the computer terminal may also be a terminal device such as a smartphone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 8 does not limit the structure of the electronic device. For example, the computer terminal 10 may include more or fewer components (e.g., a network interface, a display device, etc.) than shown in fig. 8, or have a configuration different from that shown in fig. 8.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Example 6
Embodiments of the present application also provide a storage medium. Optionally, in this embodiment, the storage medium may be configured to store a program code executed by the search method provided in the first embodiment.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring a query word; performing word segmentation on the query word to obtain a first word segmentation set; selecting an important word from the first word segmentation set; and determining the documents with the similarity to the important words larger than a preset threshold value to obtain a document set, and taking the document set as a retrieval result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring a document count sent by the client device; sorting the documents in the document set according to their similarity to the important word; selecting that number of documents from the sorted document set in descending order of similarity to obtain a target document set; and feeding back the target document set to the client device as the retrieval result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: inputting the important word into a first model for analysis to obtain a query word vector of the query word, wherein the first model is obtained by training with a plurality of groups of data, and each group of data comprises a sample participle and the vector corresponding to the sample participle; performing word segmentation on the documents to be queried in the document set to be queried to obtain a second word segmentation set; inputting the second word segmentation set into the first model for analysis to obtain a document vector of the document set; determining the similarity between the document vector and the query word vector; and determining, from the document set to be queried, the documents whose similarity to the important word is greater than a preset threshold, and storing the determined documents into the document set.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring sample data for training the first model, wherein the sample data is combined data of a sample query word and a sample document; acquiring a bag-of-words model of the sample data, wherein each word in the bag-of-words model is numbered in a preset order, and the maximum number equals the length of the word list corresponding to the bag-of-words model; determining a vector sequence of the sample data based on the bag-of-words model, wherein the vector elements in the vector sequence are 0 or 1; and combining the vector sequence with a real number matrix to obtain a real number vector of the sample data, and taking the real number vector as the output of the first model.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining the similarity between a first real number vector of the sample query word and a second real number vector of the sample document; determining the probability corresponding to the similarity by using a loss function, and estimating the real number matrix by maximum likelihood estimation to obtain an estimated value; and optimizing the real number matrix based on a stochastic gradient descent algorithm to minimize the estimated value.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: deleting, from the sample data, the real number vectors of the participles whose vectors are smaller than a specified threshold. The loss function is provided with a group lasso regularization term, which is used to shrink the vector of each participle in the sample data.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: performing word segmentation on documents to be queried in the document set to be queried to obtain a second word segmentation set; respectively inputting the participles in the second participle set into a semantic matching model for analysis to obtain a vector set of the participles in the second participle set, wherein the semantic matching model is obtained through training of multiple groups of data, and each group of data in the multiple groups of data comprises: sample segmentation and a vector corresponding to the sample segmentation; determining the word segmentation similarity of each vector in the vector set and the important word to obtain a plurality of similarities; determining the similarity between the important words and the document to be queried based on the plurality of similarities; and determining the documents with the similarity to the important words larger than a preset threshold value from the document set to be queried.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: inputting the participles of the important words into a semantic matching model for analysis to obtain word vectors of the important words; and determining the similarity of the word vector of the important word and each vector in the vector set to obtain a plurality of similarities.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: summing the plurality of similarities to obtain a sum value, and taking the sum value as the similarity between the document to be queried and the important word; or obtaining the average of the plurality of similarities, and taking the average as the similarity between the document to be queried and the important word.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: taking the intersection of the first word segmentation set and a preset word segmentation set, and taking the result of the intersection as the important word.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: determining historical query words from a historical query word list; and judging whether a given word appears among the historical query words, and storing the word into the preset word segmentation set when the judgment result is yes.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring a document file input by a user; and identifying the document file to obtain content information, and taking the content information as a query word.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: identifying the job description information to obtain identification content, and determining a query word based on the identification content; and determining the resumes whose similarity to the important word is greater than a preset threshold to obtain a resume set, wherein the resume set comprises at least one resume related to the job description information.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be wholly or partly embodied in the form of a software product; the software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (14)

1. A method of searching, comprising:
acquiring a query word;
performing word segmentation on the query word to obtain a first word segmentation set;
selecting an important word from the first word segmentation set;
and determining documents whose similarity to the important word is greater than a preset threshold to obtain a document set, and taking the document set as a retrieval result.
2. The method of claim 1, wherein after taking the set of documents as a result of the search, the method further comprises:
acquiring a document count sent by a client device;
sorting the documents in the document set according to their similarity to the important word; and selecting that number of documents from the sorted document set in descending order of similarity to obtain a target document set;
and feeding back the target document set as the retrieval result to the client device.
3. The method of claim 1, wherein determining documents whose similarity to the important word is greater than a preset threshold to obtain the document set comprises:
inputting the important word into a first model for analysis to obtain a query word vector of the query word, wherein the first model is obtained by training with a plurality of groups of data, and each group of data in the plurality of groups of data comprises: a sample participle and a vector corresponding to the sample participle;
performing word segmentation on documents to be queried in the document set to be queried to obtain a second word segmentation set;
inputting the second word segmentation set into the first model for analysis to obtain a document vector of the document set;
determining similarity of the document vector and the query term vector;
and determining the documents with the similarity to the important words larger than a preset threshold value from the document set to be queried, and storing the determined documents into the document set.
4. The method of claim 3, wherein the first model is trained by:
acquiring sample data for training the first model, wherein the sample data is combined data of a sample query word and a sample document;
acquiring a bag-of-words model of the sample data, wherein each word in the bag-of-words model is numbered in a preset order, and the maximum number equals the length of the word list corresponding to the bag-of-words model;
determining a vector sequence of the sample data based on the bag of words model, wherein vector elements in the vector sequence are 0 or 1;
and combining the vector sequence with a real number matrix to obtain a real number vector of the sample data, and using the real number vector as the output of the first model.
5. The method of claim 4, wherein before the real vector is used as the output of the first model, the method further comprises:
obtaining the similarity of a first real number vector of the sample query term and a second real number vector of the sample document;
determining the probability corresponding to the similarity by using a loss function, and estimating the real number matrix by maximum likelihood estimation to obtain an estimated value;
and optimizing the real number matrix based on a stochastic gradient descent algorithm to minimize the estimated value.
6. The method according to claim 5, wherein a group lasso regularization term is set in the loss function, and the group lasso regularization term is used for shrinking the vector of each participle in the sample data; before the vector sequence is combined with a real number matrix to obtain the real number vector of the sample data, the method further comprises:
deleting, from the sample data, the real number vectors of the participles whose vectors are smaller than a specified threshold.
7. The method of claim 1, wherein determining documents whose similarity to the important word is greater than a preset threshold to obtain the document set comprises:
performing word segmentation on documents to be queried in the document set to be queried to obtain a second word segmentation set;
respectively inputting the participles in the second participle set into a semantic matching model for analysis to obtain a vector set of the participles in the second participle set, wherein the semantic matching model is obtained by training a plurality of groups of data, and each group of data in the plurality of groups of data comprises: sample segmentation and a vector corresponding to the sample segmentation;
determining word segmentation similarity between each vector in the vector set and the important word to obtain a plurality of similarities, and determining similarity between the important word and the document to be queried based on the similarities;
and determining the documents with the similarity to the important words larger than a preset threshold value from the document set to be queried.
8. The method of claim 7, wherein determining a word segmentation similarity between each vector in the vector set and the important word to obtain a plurality of similarities comprises:
inputting the participles of the important words into the semantic matching model for analysis to obtain word vectors of the important words;
and determining the similarity between the word vector of the important word and each vector in the vector set to obtain the multiple similarities.
9. The method of claim 7, wherein determining the similarity between the important word and the document to be queried based on the plurality of similarities comprises:
summing the plurality of similarities to obtain a sum value, and taking the sum value as the similarity between the document to be queried and the important word; or,
and obtaining an average value of the similarity, and taking the average value as the similarity between the document to be queried and the important word.
10. The method of claim 1, wherein selecting an important word from the first set of words comprises:
and taking the intersection of the first participle set and a preset participle set, and taking the result of the intersection as the important word.
11. The method according to any one of claims 1 to 10,
acquiring the query word comprises: acquiring job description information input by a user; and identifying the job description information to obtain identification content, and determining the query word based on the identification content;
and determining documents whose similarity to the important word is greater than a preset threshold to obtain a document set comprises: determining resumes whose similarity to the important word is greater than the preset threshold to obtain a resume set, wherein the resume set comprises at least one resume related to the job description information.
12. A method of searching, comprising:
displaying the obtained query words;
displaying a first word segmentation set obtained after the query word is segmented;
displaying important words selected from the first word segmentation set;
and displaying a document set obtained by determining the documents whose similarity to the important word is greater than a preset threshold.
13. A search apparatus, comprising:
the acquisition module is used for acquiring the query words;
the word segmentation module is used for segmenting the query word to obtain a first word segmentation set;
a selection module for selecting an important word from the first word segmentation set;
and the determining module is used for determining documents whose similarity to the important word is greater than a preset threshold to obtain a document set, and taking the document set as a retrieval result.
14. A computer device, comprising:
a processor; and
a memory coupled to the processor and configured to provide the processor with instructions for the following processing steps: acquiring a query word; performing word segmentation on the query word to obtain a first word segmentation set; selecting an important word from the first word segmentation set; and determining documents whose similarity to the important word is greater than a preset threshold to obtain a document set, and taking the document set as a retrieval result.
CN201910110323.9A 2019-02-11 2019-02-11 Search method, search device and computer equipment Pending CN111552767A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910110323.9A CN111552767A (en) 2019-02-11 2019-02-11 Search method, search device and computer equipment

Publications (1)

Publication Number Publication Date
CN111552767A true CN111552767A (en) 2020-08-18

Family

ID=72005473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910110323.9A Pending CN111552767A (en) 2019-02-11 2019-02-11 Search method, search device and computer equipment

Country Status (1)

Country Link
CN (1) CN111552767A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011090463A (en) * 2009-10-21 2011-05-06 Fujitsu Ltd Document retrieval system, information processing apparatus, and program
CN105045875A (en) * 2015-07-17 2015-11-11 北京林业大学 Personalized information retrieval method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
欧珠, 扎西加: "Tibetan Computational Linguistics: Tibetan Information Processing Technology" (藏语计算语言学 藏文信息处理技术), 西南交通大学出版社 (Southwest Jiaotong University Press), 31 May 2009, pages: 74 - 14 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986771A (en) * 2020-09-03 2020-11-24 平安国际智慧城市科技股份有限公司 Medical prescription query method and device, electronic equipment and storage medium
CN112182154A (en) * 2020-09-25 2021-01-05 中国人民大学 Personalized search model for eliminating keyword ambiguity by utilizing personal word vector
CN112182154B (en) * 2020-09-25 2023-10-10 中国人民大学 Personalized search model for eliminating keyword ambiguity by using personal word vector
CN113448861A (en) * 2021-07-09 2021-09-28 中国银行股份有限公司 Method and device for detecting repeated forms
CN114281944A (en) * 2021-12-27 2022-04-05 北京中科闻歌科技股份有限公司 Document matching model construction method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112632385B (en) Course recommendation method, course recommendation device, computer equipment and medium
CN110069650B (en) Searching method and processing equipment
CN109086394B (en) Search ranking method and device, computer equipment and storage medium
CN111552767A (en) Search method, search device and computer equipment
CN107193962B (en) Intelligent map matching method and device for Internet promotion information
CN106874292B (en) Topic processing method and device
CN109063108B (en) Search ranking method and device, computer equipment and storage medium
CN112199526B (en) Method and device for issuing multimedia content, electronic equipment and storage medium
CN112380331A (en) Information pushing method and device
CN112883258A (en) Information recommendation method and device, electronic equipment and storage medium
CN110674365A (en) Searching method, device, equipment and storage medium
CN111475632A (en) Question processing method and device, electronic equipment and storage medium
CN111914159A (en) Information recommendation method and terminal
CN112765364A (en) Group chat session ordering method and device, storage medium and electronic equipment
CN110968664A (en) Document retrieval method, device, equipment and medium
CN112506864B (en) File retrieval method, device, electronic equipment and readable storage medium
CN113326363A (en) Searching method and device, prediction model training method and device, and electronic device
CN110442614B (en) Metadata searching method and device, electronic equipment and storage medium
CN112765450A (en) Recommended content determining method, recommended content determining device and storage medium
CN110895555B (en) Data retrieval method and device, storage medium and electronic device
CN110362694A (en) Data in literature search method, equipment and readable storage medium storing program for executing based on artificial intelligence
CN113204697A (en) Searching method, searching device, electronic equipment and storage medium
CN112749258A (en) Data searching method and device, electronic equipment and storage medium
CN112800226A (en) Method for obtaining text classification model, method, device and equipment for text classification
CN111488510A (en) Method and device for determining related words of small program, processing equipment and search system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination