CN108595620B - Escape identification method and device, computer equipment and storage medium

Info

Publication number: CN108595620B
Application number: CN201810367116.7A
Authority: CN
Inventors: 邹红建; 方高林; 陈剑峰
Original assignee: Baidu Online Network Technology Beijing Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd
Priority date: 2018-04-23
Filing date: 2018-04-23
Publication date: 2022-04-26
Anticipated expiration: 2038-04-23
Also published as: CN108595620A

Abstract

The application provides an escape identification method, an escape identification device, computer equipment and a storage medium. The method comprises the following steps: acquiring a first target word and a second target word to be identified; determining a first feature vector and a second feature vector corresponding to the first target word, and a third feature vector and a fourth feature vector corresponding to the second target word, wherein the first feature vector is related to the second target word, the second feature vector is unrelated to the second target word, the third feature vector is related to the first target word, and the fourth feature vector is unrelated to the first target word; and determining the escape probability when the first target word and the second target word are combined according to the distance between the first feature vector and the second feature vector and the distance between the third feature vector and the fourth feature vector. The method can improve the accuracy and reliability of escape identification, and thereby the accuracy of search results.

Description

Escape identification method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of search engine technologies, and in particular, to an escape identification method, an escape identification apparatus, a computer device, and a storage medium.

Background

Retrieval is the process by which a search engine returns a certain number of search results based on a search sentence that a user inputs to express a query intention. A search result returned by the search engine may match the search sentence literally without meeting the user's actual query intention. For example, the user inputs the search sentence "diamond" and the search engine returns information about "diamond film". This phenomenon is called escape. Escape can seriously degrade the user's search experience.

To return search results that meet the user's query intention, candidate search results need to undergo escape identification. In the related art, escape identification is performed with a learned escape recognition model. In general, the more clicks a presented search result receives, the higher the probability that no escape occurs between the search sentence and the search result; conversely, a search result that is presented many times but receives few or no clicks has a higher probability of escape. Based on this, the related art uses user click data as training samples to learn an escape recognition model for escape identification.

However, training the escape recognition model on user click behavior alone is a relatively simplistic approach. It is difficult to learn escape information for keywords that do not appear in the user click data, and the recognition accuracy of the model is affected by both accidental wrong clicks and intentional cheating clicks, resulting in low accuracy of escape identification.

Disclosure of Invention

The present application is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, a first objective of the present application is to provide an escape identification method, in which a first feature vector of a first target word that is related to a second target word and a second feature vector of the first target word that is unrelated to the second target word are obtained, a third feature vector of the second target word that is related to the first target word and a fourth feature vector of the second target word that is unrelated to the first target word are obtained, and the escape probability when the first target word and the second target word are combined is then determined according to the distance between the first feature vector and the second feature vector and the distance between the third feature vector and the fourth feature vector, so as to improve the accuracy and reliability of escape identification and further improve the accuracy of search results.

A second object of the present application is to provide an escape identification apparatus.

A third object of the present application is to propose a computer device.

A fourth object of the present application is to propose a non-transitory computer-readable storage medium.

A fifth object of the present application is to propose a computer program product.

In order to achieve the above object, an embodiment of a first aspect of the present application provides an escape identification method, including:

acquiring a first target word and a second target word to be identified;

determining a first feature vector and a second feature vector corresponding to the first target word, and a third feature vector and a fourth feature vector corresponding to the second target word; wherein the first feature vector is related to the second target word, the second feature vector is unrelated to the second target word, the third feature vector is related to the first target word, and the fourth feature vector is unrelated to the first target word;

and determining the escape probability of the first target word and the second target word when the first target word and the second target word are combined according to the distance between the first feature vector and the second feature vector and the distance between the third feature vector and the fourth feature vector.

According to the escape identification method of the embodiment of the present application, the first target word and the second target word to be identified are acquired; the first feature vector of the first target word, which is related to the second target word, and the second feature vector of the first target word, which is unrelated to the second target word, are determined; the third feature vector of the second target word, which is related to the first target word, and the fourth feature vector of the second target word, which is unrelated to the first target word, are determined; and the escape probability when the first target word and the second target word are combined is determined according to the distance between the first feature vector and the second feature vector and the distance between the third feature vector and the fourth feature vector. Therefore, whether escape occurs when the two words are combined is determined according to the influence of each word on the feature vectors of the other, so that the accuracy and reliability of escape identification are improved, and the accuracy of search results is further improved.

In order to achieve the above object, a second aspect of the present application provides an escape identification apparatus, including:

the acquisition module is used for acquiring a first target word and a second target word to be identified;

the determining module is used for determining a first feature vector and a second feature vector corresponding to the first target word, and a third feature vector and a fourth feature vector corresponding to the second target word; wherein the first feature vector is related to the second target word, the second feature vector is unrelated to the second target word, the third feature vector is related to the first target word, and the fourth feature vector is unrelated to the first target word;

and the escape probability determination module is used for determining the escape probability when the first target word and the second target word are combined according to the distance between the first feature vector and the second feature vector and the distance between the third feature vector and the fourth feature vector.

The escape identification device of the embodiment of the present application acquires the first target word and the second target word to be identified; determines the first feature vector of the first target word, which is related to the second target word, and the second feature vector of the first target word, which is unrelated to the second target word; determines the third feature vector of the second target word, which is related to the first target word, and the fourth feature vector of the second target word, which is unrelated to the first target word; and determines the escape probability when the first target word and the second target word are combined according to the distance between the first feature vector and the second feature vector and the distance between the third feature vector and the fourth feature vector. Therefore, whether escape occurs when the two words are combined is determined according to the influence of each word on the feature vectors of the other, so that the accuracy and reliability of escape identification are improved, and the accuracy of search results is further improved.

To achieve the above object, a third aspect of the present application provides a computer device, including: a processor and a memory; wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the escape identification method according to the embodiment of the first aspect.

To achieve the above object, a fourth aspect of the present application provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the escape identification method according to the first aspect.

To achieve the above object, a fifth aspect of the present application provides a computer program product, where instructions of the computer program product, when executed by a processor, implement the escape identification method according to the first aspect.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

Fig. 1 is a schematic flowchart of an escape identification method according to an embodiment of the present application;

Fig. 2 is a flowchart illustrating a method for determining a first feature vector and a second feature vector according to co-occurrence words;

Fig. 3 is a flowchart illustrating a method for determining a first feature vector and a second feature vector according to web page information;

Fig. 4 is a flowchart illustrating a method for determining a first feature vector and a second feature vector according to picture content;

Fig. 5 is a flowchart illustrating another escape identification method according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of an escape identification apparatus according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of another escape identification apparatus according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of another escape identification apparatus according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of another escape identification apparatus according to an embodiment of the present application;

Fig. 10 is a schematic structural diagram of another escape identification apparatus according to an embodiment of the present application; and

Fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.

The escape identification method, apparatus, computer device, and storage medium of the embodiments of the present application are described below with reference to the accompanying drawings.

From a linguistic point of view, the semantics of a word are determined by the contextual distribution of that word. Through statistical analysis of escape examples, the applicant has found that most escape arises from adjacent context, and that distant context basically does not cause escape. However, the fact that the semantics of a word are determined by its contextual distribution does not mean that they are determined by a single isolated context; words for which escape may occur can therefore be identified by learning word semantics from big data. Further, whether escape occurs between the search sentence and the title text may also be determined from information other than text; for example, whether escape occurs for a search sentence may be judged according to the picture results retrieved with that search sentence.

Based on this, the embodiment of the application provides an escape identification method to improve the accuracy of escape identification, and further improve the accuracy of a search result.

Fig. 1 is a schematic flow chart of an escape identification method according to an embodiment of the present application.

As shown in fig. 1, the escape identification method may include the following steps:

Step 101, acquiring a first target word and a second target word to be identified.

The first target word and the second target word may be any two related words, for example, two words that appear simultaneously, or a search word and a keyword in a corresponding search result, and the like, which is not limited in this embodiment.

For example, if the escape identification method provided by the present application is implemented by a search engine, the first target word obtained by the search engine may be a keyword in a search sentence, and the second target word may be a keyword in a search result obtained according to the search sentence.

The keywords in the search result may be words that appear simultaneously with the keywords in the search sentence. For example, when the search sentence input by the user is a single word, the search sentence input by the user may be used as a first target word, an internet information set including the first target word in the network is obtained, and a word appearing together with the first target word is determined from the internet information set as a second target word. For example, when the user inputs "diamond", the first target word is "diamond", and the second target word may be "sticker", "grade", "brand", "joker", "how much money", etc.

Alternatively, when the search sentence input by the user is a phrase, the search sentence may be segmented using a relevant word segmentation method, and the words obtained by segmentation are used as the first target word and the second target word respectively. For example, when the user inputs "apple variety", the phrase may be segmented into "apple" and "variety", with "apple" as the first target word and "variety" as the second target word.

Step 102, determining a first feature vector and a second feature vector corresponding to the first target word, and a third feature vector and a fourth feature vector corresponding to the second target word.

The first feature vector is related to the second target word, the second feature vector is unrelated to the second target word, the third feature vector is related to the first target word, and the fourth feature vector is unrelated to the first target word.

In this embodiment, after the first target word and the second target word to be recognized are obtained, the feature vectors corresponding to the first target word and the second target word may be respectively determined by using a preset language model; alternatively, the feature vectors corresponding to the first target word and the second target word may also be determined by a method such as deep learning, which is not limited in this embodiment.

Step 103, determining the escape probability when the first target word and the second target word are combined according to the distance between the first feature vector and the second feature vector and the distance between the third feature vector and the fourth feature vector.

In this embodiment, after the first feature vector and the second feature vector corresponding to the first target word and the third feature vector and the fourth feature vector corresponding to the second target word are determined, the distance between the first feature vector and the second feature vector and the distance between the third feature vector and the fourth feature vector may be calculated. The distance between vectors may be calculated in various manners, for example as the Euclidean distance, the Mahalanobis distance, the Hamming distance, the Chebyshev distance, or the Manhattan distance. The present application does not limit the manner of calculating these distances. It should be noted, however, that the same calculation manner should be adopted for the distance between the first feature vector and the second feature vector and for the distance between the third feature vector and the fourth feature vector, so that the two distances are comparable and have the same calculation accuracy.
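The requirement that both distances use the same calculation manner can be illustrated with a minimal Python sketch. The function names (`euclidean`, `manhattan`, `chebyshev`, `pair_distances`) are hypothetical and chosen for illustration only; the application does not prescribe any particular metric or implementation.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Vector, b: Vector) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a: Vector, b: Vector) -> float:
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def pair_distances(
    v1: Vector,
    v2: Vector,
    v3: Vector,
    v4: Vector,
    metric: Callable[[Vector, Vector], float] = euclidean,
) -> Tuple[float, float]:
    """Return (d(v1, v2), d(v3, v4)), both computed with one shared metric."""
    return metric(v1, v2), metric(v3, v4)
```

Passing the metric once to `pair_distances` makes it impossible to mix, say, a Euclidean distance for the first word with a Manhattan distance for the second.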

Furthermore, the escape probability of the combination of the first target word and the second target word can be determined according to the calculated distance between the first feature vector and the second feature vector and the distance between the third feature vector and the fourth feature vector.

Specifically, when the distance between the first feature vector and the second feature vector is smaller than a first threshold and the distance between the third feature vector and the fourth feature vector is larger than a second threshold, determining that the escape probability when the first target word and the second target word are combined is larger than a third threshold; or when the distance between the first feature vector and the second feature vector is greater than a second threshold and the distance between the third feature vector and the fourth feature vector is less than a first threshold, determining that the escape probability when the first target word and the second target word are combined is greater than a third threshold.

The first threshold is smaller than or equal to the second threshold, and the first threshold, the second threshold and the third threshold are preset.
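The threshold rule above can be sketched as a small Python predicate. The function name `escape_likely` and the numeric values in the usage note are assumptions for illustration; the application states only that the thresholds are preset and that the first threshold does not exceed the second.

```python
def escape_likely(
    d12: float,
    d34: float,
    first_threshold: float,
    second_threshold: float,
) -> bool:
    """Return True when the escape probability is judged to exceed the third threshold.

    d12 is the distance between the first and second feature vectors,
    d34 is the distance between the third and fourth feature vectors.
    """
    if first_threshold > second_threshold:
        raise ValueError("first_threshold must not exceed second_threshold")
    # Case A: the first word is barely affected, the second word changes a lot.
    case_a = d12 < first_threshold and d34 > second_threshold
    # Case B: the mirror image of case A.
    case_b = d12 > second_threshold and d34 < first_threshold
    return case_a or case_b
```

For example, with a hypothetical first threshold of 0.3 and second threshold of 0.6, the distance pair (0.1, 0.9) satisfies the rule, while (0.1, 0.1) and (0.9, 0.9) do not, because in those cases both words are affected to a similar degree.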

In actual use, if the first target word and the second target word are respectively a word in a search sentence and a word in a search result, then when the escape probability when the two words are combined is greater than the third threshold, it can be determined that escape occurs when the first target word and the second target word are combined. The search result containing the second target word can then be screened out, so that search results matching the user's query intention are returned to the user.

In the escape identification method of this embodiment, the first target word and the second target word to be identified are acquired; the first feature vector of the first target word, which is related to the second target word, and the second feature vector of the first target word, which is unrelated to the second target word, are determined; the third feature vector of the second target word, which is related to the first target word, and the fourth feature vector of the second target word, which is unrelated to the first target word, are determined; and the escape probability when the first target word and the second target word are combined is determined according to the distance between the first feature vector and the second feature vector and the distance between the third feature vector and the fourth feature vector. Therefore, whether escape occurs when the two words are combined is determined according to the influence of each word on the feature vectors of the other, so that the accuracy and reliability of escape identification are improved, and the accuracy of search results is further improved.

In order to determine the first feature vector and the second feature vector corresponding to the first target word, three possible implementations are provided.

As one possible implementation manner, the first feature vector and the second feature vector corresponding to the first target word may be determined according to the first target word and its co-occurrence words. Fig. 2 is a flowchart illustrating a method for determining a first feature vector and a second feature vector according to co-occurrence words.

As shown in fig. 2, based on the embodiment shown in fig. 1, step 102 may include the following steps:

Step 201, performing data crawling on a network to obtain a first co-occurrence word set and a second co-occurrence word set corresponding to the first target word, wherein the first co-occurrence word set includes the second target word.

In this embodiment, according to the acquired first target word and second target word, data crawling may be performed on the network to obtain a first co-occurrence word set and a second co-occurrence word set corresponding to the first target word. For example, data containing the first target word, such as web page text, picture title text, and user query logs, may be obtained from the network data. The text data is preprocessed, for example by word segmentation and word removal, and the first target word and its co-occurrence words are extracted from it. A group of co-occurrence words that includes the second target word is selected from the co-occurrence words to form the first co-occurrence word set, and the co-occurrence words remaining after the second target word is removed from the first co-occurrence word set form the second co-occurrence word set.
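One reading of this construction can be sketched in Python as follows. The sketch assumes the crawled text has already been segmented into token lists, and the function name `cooccurrence_sets` is hypothetical. Treating every token that shares a text with the first target word as a co-occurrence word is also an assumption; the application does not fix a co-occurrence window.

```python
from typing import Iterable, Sequence, Set, Tuple


def cooccurrence_sets(
    tokenized_texts: Iterable[Sequence[str]],
    first_target: str,
    second_target: str,
) -> Tuple[Set[str], Set[str]]:
    """Build the first and second co-occurrence word sets for first_target.

    Assumption: a word co-occurs with first_target if both appear in the
    same text. If second_target never co-occurs, the two sets are equal.
    """
    first_set: Set[str] = set()
    for tokens in tokenized_texts:
        if first_target in tokens:
            first_set.update(tok for tok in tokens if tok != first_target)
    # The second set is the first set with the second target word removed.
    second_set = first_set - {second_target}
    return first_set, second_set
```

With the toy texts `[["diamond", "ring", "price"], ["diamond", "film", "sticker"], ["phone", "film"]]`, the first set for "diamond" and "film" contains "ring", "price", "film" and "sticker", and the second set contains the same words without "film".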

Step 202, determining a first feature vector corresponding to the first target word according to each co-occurrence word included in the first co-occurrence word set.

For each co-occurring word in the first set of co-occurring words, a word vector between the first target word and the co-occurring word may be determined as a first feature vector.

As an example, a deep learning method may be adopted, a word embedding vector model is obtained through pre-training, words are vectorized, and then a first target word and a co-occurrence word are input into the word embedding vector model to obtain a first feature vector.

As an example, a Vector Space Model (VSM) may be employed to convert the first target word and the co-occurrence words into word vectors, and the first feature vector is then represented by the word vectors and corresponding weights. Specifically, for the preprocessed text data containing the first target word and the co-occurrence words, the frequencies with which the first target word and the co-occurrence words appear in the text data are counted and used as initial weights, a TF-IDF (term frequency-inverse document frequency) weighting algorithm is used to calculate the final weights, and the final weights and the word vectors are used to determine the final first feature vector. For example, the first feature vector may be obtained by a weighted summation of the word vector corresponding to the first target word and the word vectors corresponding to the co-occurrence words.
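The weighted summation can be sketched in Python under several stated assumptions: the word vectors are supplied by the caller, the term frequency is normalized over the counted words, and the inverse document frequency uses a smoothed form, log((1 + N) / (1 + df)) + 1. The application names TF-IDF but does not specify a variant, and the function names `tfidf_weights` and `weighted_sum` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tfidf_weights(docs: Sequence[Sequence[str]], words: Sequence[str]) -> Dict[str, float]:
    """Compute one TF-IDF weight per word over a tokenized corpus."""
    n_docs = len(docs)
    wanted = set(words)
    # Raw frequencies serve as the initial weights.
    counts = Counter(tok for doc in docs for tok in doc if tok in wanted)
    total = sum(counts.values())
    weights: Dict[str, float] = {}
    for word in words:
        doc_freq = sum(1 for doc in docs if word in doc)
        if doc_freq == 0:
            weights[word] = 0.0  # word never appears in the corpus
            continue
        tf = counts[word] / total
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0  # smoothed IDF (assumption)
        weights[word] = tf * idf
    return weights


def weighted_sum(word_vectors: Dict[str, Sequence[float]], weights: Dict[str, float]) -> List[float]:
    """Combine word vectors into one feature vector by weighted summation."""
    dim = len(next(iter(word_vectors.values())))
    result = [0.0] * dim
    for word, vec in word_vectors.items():
        w = weights.get(word, 0.0)
        for i, x in enumerate(vec):
            result[i] += w * x
    return result
```

Calling `weighted_sum` once with the vectors of the first co-occurrence word set and once with those of the second set would yield the two feature vectors that are later compared by distance.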

Step 203, determining a second feature vector corresponding to the first target word according to each co-occurrence word included in the second co-occurrence word set.

In this embodiment, for the co-occurrence words in the second co-occurrence word set, the second feature vector corresponding to the first target word may be determined in the same manner as the first feature vector. For the specific manner, refer to the description of step 202, which is not repeated here.

To sum up, through crawling the network data, a first co-occurrence word set including a second target word and corresponding to a first target word is obtained, and a second co-occurrence word set not including the second target word is obtained, and then according to each co-occurrence word included in the first co-occurrence word set and the second co-occurrence word set, a first feature vector and a second feature vector corresponding to the first target word are respectively determined, and the first feature vector and the second feature vector corresponding to the first target word can be determined from the co-occurrence word angle, so that a foundation is laid for realizing multi-angle escape identification and improving the coverage rate of the escape identification.

As another possible implementation manner, the first feature vector and the second feature vector corresponding to the first target word may also be determined according to the webpage information obtained by the first target word and the second target word. Fig. 3 is a flowchart illustrating a method for determining a first feature vector and a second feature vector according to web page information.

As shown in fig. 3, based on the embodiment shown in fig. 1, step 102 may include the following steps:

Step 301, performing data crawling on a network to acquire a first page set and a second page set containing the first target word, wherein at least one page in the first page set includes the second target word.

In this embodiment, after the first target word and the second target word to be recognized are obtained, the first page set and the second page set including the first target word may be retrieved from the network data.

Further, to ensure the quality of the pages in the acquired first page set and second page set, and to avoid the data processing difficulty caused by an excessively large page set, in a possible implementation manner of the embodiment of the present application, the pages in the first page set and the second page set may be further filtered to obtain page sets of appropriate size and high quality. For example, the pages in a page set may be filtered according to the number of times the first target word appears in the page, the number of times the second target word appears in the page, or the total number of times the first target word and the second target word appear in the page, and pages in which the number of occurrences is smaller than a preset threshold are deleted from the page set; and/or low-quality pages containing sensitive words, advertisements, and the like may also be removed from the page set. The first feature vector and the second feature vector are then determined using the filtered first page set and the filtered second page set.
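A minimal Python sketch of this filtering follows, assuming each page is already a list of tokens. It uses the combined count of both target words, which is one of the three options mentioned above, and a caller-supplied set of blocked words. The name `filter_pages` and its parameters are hypothetical.

```python
from typing import AbstractSet, List, Sequence


def filter_pages(
    pages: Sequence[Sequence[str]],
    first_target: str,
    second_target: str,
    min_count: int,
    blocked_words: AbstractSet[str] = frozenset(),
) -> List[Sequence[str]]:
    """Keep pages with enough target-word occurrences and no blocked words."""
    kept: List[Sequence[str]] = []
    for tokens in pages:
        occurrences = list(tokens).count(first_target) + list(tokens).count(second_target)
        if occurrences < min_count:
            continue  # too few occurrences: drop the page
        if any(tok in blocked_words for tok in tokens):
            continue  # low-quality page: drop it
        kept.append(tokens)
    return kept
```

The same function could be applied separately to the first page set and the second page set before the feature vectors are computed.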

Step 302, determining a first feature vector corresponding to the first target word according to the attribute information of each page in the first page set.

The attribute information of each page includes, but is not limited to, a type of each page or a type of a site to which each page belongs.

Step 303, determining a second feature vector corresponding to the first target word according to the attribute information of each page in the second page set.

Generally, for a word, when it is understood to have a different meaning, the types of pages containing the word tend to be different. For example, for "jadeite" and "jadeite bean curd", although both words contain "jadeite", the first word generally appears in pages of the jewelry category, while the second word generally appears in pages of the gourmet category. Therefore, in this embodiment, the first feature vector and the second feature vector corresponding to the first target word may be determined according to the type of the page.

As an example, for each page in the first page set and the second page set, the type of the page or the type of the site to which the page belongs may first be obtained and represented as a vector, and the first feature vector or the second feature vector is then obtained by combining the vector with a weight. The weight can be determined by counting the frequency with which the first target word appears in the page.
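This weighting can be sketched in Python as follows, assuming each page type is represented as a one-hot vector over a fixed list of types and that the weighted vectors are summed and normalized. The one-hot encoding, the normalization, and the name `page_type_vector` are assumptions for illustration; the application does not specify how the type is vectorized.

```python
from typing import Dict, List, Sequence, Tuple


def page_type_vector(
    pages: Sequence[Tuple[str, int]],
    type_index: Dict[str, int],
) -> List[float]:
    """Aggregate (page_type, target_word_count) pairs into one feature vector.

    Each page contributes a one-hot type vector weighted by how often the
    first target word appears in that page; the sum is normalized to 1.
    """
    vec = [0.0] * len(type_index)
    for page_type, target_count in pages:
        vec[type_index[page_type]] += float(target_count)
    total = sum(vec)
    return [x / total for x in vec] if total else vec
```

For a page set in which the word appears three times on a jewelry page and once on a food page, the result over the types (jewelry, food) is (0.75, 0.25).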

In summary, a first page set and a second page set containing a first target word are obtained by performing data crawling on a network, at least one page in the first page set comprises a second target word, and then a first feature vector and a second feature vector corresponding to the first target word are determined according to attribute information of each page in the first page set and the second page set, and the first feature vector and the second feature vector of the first target word can be determined according to the type of the page to which the word to be recognized belongs, so that a foundation is laid for realizing multi-angle escape recognition and improving the coverage rate of escape recognition.

As another possible implementation manner, a corresponding picture search result may be obtained according to the first target word, and a first feature vector and a second feature vector corresponding to the first target word may be determined according to picture content. Fig. 4 is a flowchart illustrating a method for determining a first feature vector and a second feature vector according to picture content.

As shown in fig. 4, based on the embodiment shown in fig. 1, step 102 may include the following steps:

Step 401, acquiring a first picture set and a second picture set corresponding to the first target word, wherein at least one picture in the first picture set is the same as a picture in the picture set corresponding to the second target word.

In this embodiment, after the first target word and the second target word to be identified are acquired, each of them may be used as a search sentence to obtain related pictures from a picture search engine. A first picture set may then be generated from at least one picture that appears both among the pictures obtained with the first target word as the search sentence and among the pictures obtained with the second target word as the search sentence.

In a possible implementation manner of the embodiment of the present application, after the corresponding pictures are obtained according to the first target word and the second target word, the obtained pictures may be sorted, and the first N pictures (N is a positive integer whose value may be preset, for example N = 1000 or N = 10000) are selected from the sorted pictures to generate the first picture set and the second picture set. This keeps the data amount of the two picture sets at an appropriate size and avoids the longer processing time, and the resulting lower feedback efficiency of the search engine, that more difficult data processing would cause.
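The truncation to N pictures and the construction of the picture sets can be sketched in Python as follows. Pictures are represented by hypothetical identifiers in ranked order. Placing the remaining pictures of the first target word into the second picture set is an assumption, since the text above describes only how the first picture set is formed; the name `split_picture_sets` is likewise hypothetical.

```python
from typing import List, Sequence, Tuple


def split_picture_sets(
    ranked_first: Sequence[str],
    ranked_second: Sequence[str],
    n: int,
) -> Tuple[List[str], List[str]]:
    """Split the top-n pictures of the first target word into two sets.

    First set: pictures also found in the top-n results of the second word.
    Second set: the remaining pictures (assumption, not stated in the source).
    """
    top_first = list(ranked_first[:n])
    second_ids = set(ranked_second[:n])
    first_set = [pic for pic in top_first if pic in second_ids]
    second_set = [pic for pic in top_first if pic not in second_ids]
    return first_set, second_set
```

Applying the cut before the comparison bounds the work to at most N membership checks per target word.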

Step 402, determining a first feature vector corresponding to the first target word according to the content of each picture in the first picture set.

Step 403, determining a second feature vector corresponding to the first target word according to the content of each picture in the second picture set.

In this embodiment, for each picture in the first picture set and the second picture set, the first feature vector or the second feature vector corresponding to the first target word may be determined according to the picture content.

As an example, color features, texture features and shape features of a picture may be extracted, and a suitable description method may be adopted to represent each of them as a vector. For example, the color features of a picture may be quantized using a histogram method; the texture features may be quantized using a gray-level co-occurrence matrix; and the shape features may be quantized using a region invariant moment method. The color, texture and shape feature representations of the pictures are then used to determine the first feature vector or the second feature vector of the first target word.
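Of the three descriptors mentioned above, the color histogram is the simplest to illustrate. The sketch below is only one possible quantization, under assumptions not stated in the application: pictures are given as lists of 8-bit RGB pixel tuples, each channel is split into `bins_per_channel` equal ranges, and the histogram is normalized so that pictures of different sizes are comparable. The function name `color_histogram` is hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Quantize the colors of a picture into a normalized joint RGB histogram."""
    if not pixels:
        raise ValueError("picture has no pixels")
    if bins_per_channel < 1:
        raise ValueError("bins_per_channel must be at least 1")

    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each 8-bit channel value to a bin index in 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0

    # Normalize so that the entries sum to 1 regardless of picture size.
    total = float(len(pixels))
    return [count / total for count in hist]
```

The resulting vector could be concatenated with texture and shape descriptors to form the representation of one picture; how the descriptors are combined is left open by the description.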

As an example, a large number of picture samples (including pictures and their category labels) may be collected and used as input to train an initial deep neural network model, so as to obtain a trained picture classification deep neural network model. The trained picture classification deep neural network model first represents the picture content as a corresponding feature vector, and then outputs the class label of the picture at the output layer according to that feature vector. Therefore, after the first picture set and the second picture set corresponding to the first target word are obtained, the pictures in the first picture set can be input into the picture classification deep neural network model, the feature vectors representing the picture contents can be extracted from the model, and the extracted feature vectors are determined as the first feature vector. Similarly, when the pictures in the second picture set are input into the picture classification deep neural network model, the feature vectors representing the picture contents can be extracted from the model, and the extracted feature vectors are determined as the second feature vector.
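The description does not say how the per-picture feature vectors of a whole picture set are reduced to a single feature vector. One simple option, shown here purely as an assumption, is to average them. In the sketch below, `embed` is a hypothetical callable standing in for the hidden-layer output of the trained picture classification model, and `set_feature_vector` is an illustrative name.

```python
from typing import Callable, List, Sequence


def set_feature_vector(
    pictures: Sequence[object],
    embed: Callable[[object], List[float]],
) -> List[float]:
    """Average the per-picture embeddings of a picture set (mean pooling)."""
    if not pictures:
        raise ValueError("picture set is empty")

    # embed() is assumed to return the feature vector the model computes
    # for one picture before its output (classification) layer.
    vectors = [embed(p) for p in pictures]

    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must have the same dimension")

    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]
```

Calling the function once on the first picture set and once on the second picture set would yield candidates for the first and second feature vectors respectively.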

In summary, a first picture set and a second picture set corresponding to the first target word are obtained, and the first feature vector and the second feature vector corresponding to the first target word are determined according to the content of each picture in the two picture sets. Because the feature vectors are derived from picture content, this lays a foundation for realizing multi-angle escape identification and improving the coverage rate of escape identification.

It should be noted that the foregoing method for determining the first feature vector and the second feature vector corresponding to the first target word is also applicable to determining the third feature vector and the fourth feature vector corresponding to the second target word, and the detailed description of the manner for determining the third feature vector and the fourth feature vector corresponding to the second target word is not repeated in this application.

In addition, the method for determining the feature vector described in the above embodiments may be used alone or in combination, and the determination method of the feature vector is not limited in the present application. When the characteristic vector is determined by adopting at least two methods, the escape recognition can be realized from different angles, the judgment basis of the escape recognition is increased, and the coverage rate of the escape recognition is improved.

Through experiments, the applicant finds that the coverage rate of the escape identification method in the embodiment of the application is obviously improved under the condition of ensuring the accuracy rate.

In a search processing scenario, the escape identification method provided by the application may be used to perform escape identification on words, so that search results matching the user's query intention are provided and the user's search experience is improved. To this end, after the escape probability is determined, the search results can be ranked and displayed according to the escape probability. Fig. 5 is a flowchart illustrating another escape identification method according to an embodiment of the present application.

As shown in fig. 5, the escape identification method may include the following steps:

Step 501, determining a first target word and a second target word to be identified according to the query statement and the candidate result.

In this embodiment, after the user inputs a query sentence, the search engine may first obtain candidate results according to the query sentence. Each candidate result text is preprocessed, for example by word segmentation and stop-word removal, and the words that appear adjacent to the query sentence are obtained from the preprocessed candidate result text. The query sentence input by the user is determined as the first target word, and a word obtained from the candidate result text that appears adjacent to the query sentence is used as the second target word.

In a possible implementation manner of the embodiment of the present application, when a plurality of words that appear adjacent to the query sentence and are obtained from the candidate result text are present, the number of times that each word that appears adjacent to the query sentence appears in the candidate result text may be counted, and the word with the largest number of times of appearance is determined as the second target word.
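The selection of the second target word described above can be sketched as follows. The sketch assumes that the candidate result texts have already been segmented and had stop words removed, that the query sentence appears as a single token, and that "number of times of appearance" means the total count of a word across all candidate result texts; the function name `pick_second_target_word` is hypothetical.

```python
from collections import Counter
from typing import List, Optional


def pick_second_target_word(
    query: str,
    candidate_token_lists: List[List[str]],
) -> Optional[str]:
    """Return the most frequent word that appears next to the query, if any."""
    # Collect every word that occurs immediately before or after the query.
    adjacent = set()
    for tokens in candidate_token_lists:
        for i, token in enumerate(tokens):
            if token != query:
                continue
            if i > 0:
                adjacent.add(tokens[i - 1])
            if i + 1 < len(tokens):
                adjacent.add(tokens[i + 1])
    adjacent.discard(query)
    if not adjacent:
        return None

    # Count how often each adjacent word appears in the candidate texts.
    counts = Counter(
        token
        for tokens in candidate_token_lists
        for token in tokens
        if token in adjacent
    )
    return counts.most_common(1)[0][0]
```

With the query `"apple"` and the candidate texts `[["apple", "phone", "price"], ["new", "apple", "phone"], ["apple", "pie", "recipe"]]`, the adjacent words are "phone", "new" and "pie", and "phone" is returned because it appears twice.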

Step 502, determining a first feature vector and a second feature vector corresponding to the first target word, and a third feature vector and a fourth feature vector corresponding to the second target word.

Step 503, determining the escape probability when the first target word and the second target word are combined according to the distance between the first feature vector and the second feature vector and the distance between the third feature vector and the fourth feature vector.

It should be noted that, in this embodiment, the description of step 502 to step 503 may refer to the description of step 102 to step 103 in the foregoing embodiment, and is not described herein again.

Step 504, determining the display order of the candidate results according to the escape probability when the first target word and the second target word are combined.

In this embodiment, after the escape probability when the first target word is combined with the second target word is determined, the candidate results may be ranked according to the escape probability to determine their display order, and the search engine displays the search results corresponding to the query statement to the user in that order. The higher the escape probability, the more likely it is that escape occurs when the first target word and the second target word are combined. Therefore, in this embodiment, the candidate results may be sorted in order of escape probability from low to high, so that the search engine preferentially displays the search results with low escape probability.
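Step 504 amounts to an ascending sort on the escape probability. A minimal sketch, assuming each candidate result has already been paired with the escape probability computed for its second target word (the pairing and the name `rank_candidates` are illustrative, not taken from the application):

```python
from typing import List, Tuple


def rank_candidates(scored: List[Tuple[str, float]]) -> List[str]:
    """Order candidate results from lowest to highest escape probability."""
    # sorted() is stable, so candidates with equal probability keep the
    # relative order in which the search engine originally returned them.
    return [result for result, _ in sorted(scored, key=lambda item: item[1])]
```

For example, `rank_candidates([("a", 0.9), ("b", 0.1), ("c", 0.5)])` returns `["b", "c", "a"]`.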

According to the escape identification method, the first target word and the second target word to be identified are determined according to the query statement and the candidate result, so that escape identification can be limited in the range of the candidate result, the escape identification is pointed, and the data processing amount during escape identification is reduced; by determining the display sequence of the candidate results according to the escape probability when the first target word and the second target word are combined, the search engine can preferentially display the search results with low escape probability, the matching degree of the search results and the query intention of the user is ensured, and the accuracy of the search results is improved.

In order to implement the above embodiments, the present application further provides an escape identification apparatus.

Fig. 6 is a schematic structural diagram of an escape identification apparatus according to an embodiment of the present application.

As shown in fig. 6, the escape identification apparatus 50 may include: an obtaining module 510, a determining module 520, and an escape probability determining module 530, wherein:

the obtaining module 510 is configured to obtain a first target word and a second target word to be recognized.

A determining module 520, configured to determine a first feature vector and a second feature vector corresponding to the first target word, and a third feature vector and a fourth feature vector corresponding to the second target word; the first feature vector is related to the second target word, the second feature vector is unrelated to the second target word, the third feature vector is related to the first target word, and the fourth feature vector is unrelated to the first target word.

The escape probability determining module 530 is configured to determine an escape probability when the first target word is combined with the second target word according to a distance between the first feature vector and the second feature vector and a distance between the third feature vector and the fourth feature vector.

Specifically, the escape probability determining module 530 is configured to determine that the escape probability when the first target word is combined with the second target word is greater than a third threshold when the distance between the first feature vector and the second feature vector is less than a first threshold and the distance between the third feature vector and the fourth feature vector is greater than a second threshold; or when the distance between the first feature vector and the second feature vector is greater than a second threshold and the distance between the third feature vector and the fourth feature vector is less than a first threshold, determining that the escape probability when the first target word and the second target word are combined is greater than a third threshold. Wherein the first threshold is less than or equal to the second threshold.
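The two-sided threshold rule applied by the escape probability determining module 530 can be written compactly. The sketch below only decides whether the rule fires, that is, whether the escape probability should be treated as exceeding the third threshold; it does not compute a probability value. The Euclidean distance and the function names are assumptions made for illustration, and any other vector distance could be substituted.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def is_likely_escape(
    d12: float,
    d34: float,
    first_threshold: float,
    second_threshold: float,
) -> bool:
    """Apply the asymmetric-distance rule.

    d12: distance between the first and second feature vectors.
    d34: distance between the third and fourth feature vectors.
    """
    if first_threshold > second_threshold:
        raise ValueError("first threshold must not exceed second threshold")
    # One word barely changes in the presence of the other, while the
    # other word changes strongly: treat the combination as likely escaped.
    return (d12 < first_threshold and d34 > second_threshold) or (
        d12 > second_threshold and d34 < first_threshold
    )
```

With thresholds 0.3 and 0.6, the distance pairs (0.1, 0.9) and (0.9, 0.1) both satisfy the rule, whereas (0.1, 0.1) and (0.9, 0.9) do not.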

Further, in a possible implementation manner of the embodiment of the present application, as shown in fig. 7, on the basis of the embodiment shown in fig. 6, the determining module 520 includes:

the co-occurrence word set obtaining unit 5201 is configured to perform data crawling on a network, and obtain a first co-occurrence word set and a second co-occurrence word set corresponding to a first target word, where the first co-occurrence word set includes a second target word.

A first determining unit 5202, configured to determine, according to each co-occurrence word included in the first co-occurrence word set, a first feature vector corresponding to the first target word; and determining a second feature vector corresponding to the first target word according to each co-occurrence word included in the second co-occurrence word set.

By crawling network data, a first co-occurrence word set that corresponds to the first target word and includes the second target word, and a second co-occurrence word set that does not include the second target word, are obtained. The first feature vector and the second feature vector corresponding to the first target word are then determined according to the co-occurrence words in the first co-occurrence word set and the second co-occurrence word set respectively. Because the feature vectors are determined from the co-occurrence word angle, this lays a foundation for realizing multi-angle escape identification and improving the coverage rate of escape identification.

In a possible implementation manner of the embodiment of the present application, as shown in fig. 8, on the basis of the embodiment shown in fig. 6, the determining module 520 includes:

the page set obtaining unit 5211 is configured to perform data crawling on a network, and obtain a first page set and a second page set that include a first target term, where at least one page in the first page set includes a second target term.

A second determining unit 5212, configured to determine, according to attribute information of each page in the first page set, a first feature vector corresponding to the first target term, where the attribute information of each page includes, but is not limited to, a type of each page or a type of a site to which each page belongs; and determining a second feature vector corresponding to the first target word according to the attribute information of each page in the second page set.

By crawling data on the network, a first page set and a second page set containing the first target word are obtained, and the first feature vector and the second feature vector corresponding to the first target word are determined according to attribute information of each page in the first page set and the second page set. Because the feature vectors are derived from the type of page on which the word to be identified appears, this lays a foundation for realizing multi-angle escape identification and improving the coverage rate of escape identification.

In a possible implementation manner of the embodiment of the present application, as shown in fig. 9, on the basis of the embodiment shown in fig. 6, the determining module 520 includes:

the picture set obtaining unit 5221 is configured to obtain a first picture set and a second picture set corresponding to the first target word, where at least one picture in the first picture set is the same as a picture in the picture set corresponding to the second target word.

A third determining unit 5222, configured to determine, according to the content of each picture in the first picture set, a first feature vector corresponding to the first target word; and determining a second feature vector corresponding to the first target word according to the content of each picture in the second picture set.

By obtaining the first picture set and the second picture set corresponding to the first target word, the first feature vector and the second feature vector corresponding to the first target word are determined according to the content of each picture in the two picture sets. Because the feature vectors are derived from picture content, this lays a foundation for realizing multi-angle escape identification and improving the coverage rate of escape identification.

It should be noted that the manner in which the determining module 520 determines the first feature vector and the second feature vector corresponding to the first target word is also applicable to determining the third feature vector and the fourth feature vector corresponding to the second target word, and the process of determining the third feature vector and the fourth feature vector corresponding to the second target word by the determining module 520 is not described in detail herein.

In addition, the determining module 520 may determine the feature vector in only one manner, or may determine the feature vector in multiple manners, and the manner in which the determining module 520 determines the feature vector is not limited in this application. When the determining module 520 determines the feature vector in at least two ways, the escape recognition from different angles can be realized, the judgment basis of the escape recognition is increased, and the coverage rate of the escape recognition is improved.

In a possible implementation manner of the embodiment of the present application, as shown in fig. 10, on the basis of the embodiment shown in fig. 6, the escape identification apparatus 50 may further include:

and a display order determining module 540, configured to determine a display order of the candidate result according to the escape probability when the first target word is combined with the second target word.

In this embodiment, the obtaining module 510 is specifically configured to determine a first target term and a second target term to be identified according to the query statement and the candidate result.

The first target words and the second target words to be identified are determined according to the query sentences and the candidate results, so that the escape identification can be limited in the range of the candidate results, the escape identification has pertinence, and the data processing amount during the escape identification is reduced; by determining the display sequence of the candidate results according to the escape probability when the first target word and the second target word are combined, the search engine can preferentially display the search results with low escape probability, the matching degree of the search results and the query intention of the user is ensured, and the accuracy of the search results is improved.

It should be noted that the foregoing explanation of the embodiment of the escape identification method is also applicable to the escape identification apparatus of this embodiment, and the implementation principle thereof is similar and will not be described herein again.

The escape identification device of this embodiment obtains the first target word and the second target word to be identified; determines, for the first target word, a first feature vector related to the second target word and a second feature vector unrelated to the second target word; determines, for the second target word, a third feature vector related to the first target word and a fourth feature vector unrelated to the first target word; and then determines the escape probability when the first target word and the second target word are combined according to the distance between the first feature vector and the second feature vector and the distance between the third feature vector and the fourth feature vector. Therefore, whether escape occurs when the two words are combined is determined according to the influence of each word on the feature vectors of the other, so that the accuracy and the reliability of escape identification are improved, and the accuracy of the search result is further improved.

In order to implement the foregoing embodiments, the present application also provides a computer device, including: a processor and a memory. Wherein the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, for implementing the escape identification method as described in the foregoing embodiments.

FIG. 11 is a block diagram of a computer device, shown as an exemplary computer device 90, suitable for implementing embodiments of the present application. The computer device 90 shown in fig. 11 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present application.

As shown in fig. 11, the computer device 90 is in the form of a general purpose computer device. The components of computer device 90 may include, but are not limited to: one or more processors or processing units 906, a system memory 910, and a bus 908 that couples the various system components (including the system memory 910 and the processing unit 906).

Bus 908 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus, to name a few.

Computer device 90 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 90 and includes both volatile and nonvolatile media, removable and non-removable media.

The system memory 910 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 911 and/or cache memory 912. The computer device 90 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 913 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 11, and commonly referred to as a "hard disk drive"). Although not shown in FIG. 11, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact Disc Read Only Memory (CD-ROM), a Digital Versatile Disc Read Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 908 by one or more data media interfaces. System memory 910 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.

Program/utility 914 having a set (at least one) of program modules 9140 may be stored, for example, in system memory 910, such program modules 9140 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which or some combination of these examples may comprise an implementation of a network environment. Program modules 9140 generally perform the functions and/or methods of embodiments described herein.

The computer device 90 may also communicate with one or more external devices 10 (e.g., keyboard, pointing device, display 100, etc.), with one or more devices that enable a user to interact with the computer device 90, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 90 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 902. Moreover, computer device 90 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 900. As shown in FIG. 11, network adapter 900 communicates with the other modules of computer device 90 via bus 908. It should be appreciated that although not shown in FIG. 11, other hardware and/or software modules may be used in conjunction with computer device 90, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

The processing unit 906 executes various functional applications and data processing by executing programs stored in the system memory 910, for example, implementing the escape recognition method mentioned in the foregoing embodiments.

In order to implement the above embodiments, the present application also proposes a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the escape identification method as described in the foregoing embodiments.

In order to implement the above embodiments, the present application also proposes a computer program product, wherein when the instructions of the computer program product are executed by a processor, the escape identification method as described in the foregoing embodiments is implemented.

In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims

1. An escape identification method, comprising:

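acquiring a first target word and a second target word to be identified;

determining a first feature vector and a second feature vector corresponding to the first target word, and a third feature vector and a fourth feature vector corresponding to the second target word, wherein the first feature vector is related to the second target word, the second feature vector is unrelated to the second target word, the third feature vector is related to the first target word, and the fourth feature vector is unrelated to the first target word;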

determining the escape probability of the first target word and the second target word when the first target word and the second target word are combined according to the distance between the first feature vector and the second feature vector and the distance between the third feature vector and the fourth feature vector;

wherein the determining of the escape probability when the first target word and the second target word are combined comprises:

if the distance between the first feature vector and the second feature vector is smaller than a first threshold value and the distance between the third feature vector and the fourth feature vector is larger than a second threshold value, determining that the escape probability when the first target word and the second target word are combined is larger than a third threshold value;

or,

if the distance between the first feature vector and the second feature vector is greater than the second threshold value, and the distance between the third feature vector and the fourth feature vector is less than the first threshold value, determining that the escape probability when the first target word and the second target word are combined is greater than the third threshold value.
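
As an illustration only, the two alternative conditions of claim 1 can be sketched in Python. The function names, the use of cosine distance, and any numeric threshold values are assumptions made for this sketch; the claim does not prescribe a particular distance metric or particular thresholds.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Treat an all-zero vector as maximally distant from anything.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def is_likely_escape(d_word1, d_word2, first_threshold, second_threshold):
    """Check the asymmetric pattern described in claim 1.

    d_word1: distance between the first and second feature vectors.
    d_word2: distance between the third and fourth feature vectors.
    Returns True when one word barely changes in the presence of the other
    while the other word changes strongly.
    """
    case_a = d_word1 < first_threshold and d_word2 > second_threshold
    case_b = d_word1 > second_threshold and d_word2 < first_threshold
    return case_a or case_b
```

A return value of True corresponds to the case in which the escape probability is determined to exceed the third threshold; how the probability itself is computed is left open here.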

2. The method of claim 1, wherein the determining the first feature vector and the second feature vector corresponding to the first target word comprises:

performing data crawling on a network, and acquiring a first co-occurrence word set and a second co-occurrence word set corresponding to the first target word, wherein the first co-occurrence word set comprises the second target word;

determining a first feature vector corresponding to the first target word according to each co-occurrence word included in the first co-occurrence word set;

and determining a second feature vector corresponding to the first target word according to each co-occurrence word included in the second co-occurrence word set.
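
The steps of claim 2 might be sketched as follows. This is a minimal sketch under stated assumptions: the crawled data is represented by an in-memory list of tokenized documents, the first co-occurrence word set is taken from documents that contain both target words, the second set from documents that contain only the first target word, and the feature vectors are plain count vectors over a shared vocabulary. The function names `cooccurrence_sets` and `to_vectors` are hypothetical.

```python
from collections import Counter


def cooccurrence_sets(documents, target, other):
    """Split the words co-occurring with `target` into two counted sets.

    documents: iterable of token lists (stand-in for crawled text).
    Returns (with_other, without_other); the first set naturally
    contains `other` itself.
    """
    with_other, without_other = Counter(), Counter()
    for tokens in documents:
        if target not in tokens:
            continue
        bucket = with_other if other in tokens else without_other
        bucket.update(t for t in tokens if t != target)
    return with_other, without_other


def to_vectors(first_set, second_set):
    """Turn two counted word sets into count vectors over one vocabulary."""
    vocab = sorted(set(first_set) | set(second_set))
    first_vec = [first_set[w] for w in vocab]
    second_vec = [second_set[w] for w in vocab]
    return first_vec, second_vec, vocab
```

Building both vectors over the same vocabulary keeps them directly comparable by any vector distance.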

3. The method of claim 1, wherein the determining the first feature vector and the second feature vector corresponding to the first target word comprises:

performing data crawling on a network, and acquiring a first page set and a second page set, each comprising the first target word, wherein at least one page in the first page set comprises the second target word;

determining a first feature vector corresponding to the first target word according to the attribute information of each page in the first page set;

and determining a second feature vector corresponding to the first target word according to the attribute information of each page in the second page set.

4. The method of claim 3, wherein the attribute information of each page comprises: the type of each page or the type of site to which each page belongs.
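
One way to picture claims 3 and 4 is a normalized histogram over page types. The sketch below assumes that each page is a dictionary with a `words` set and a `type` label, that the first page set holds pages containing both target words, and that the second page set holds pages containing only the first target word. These representations and the names `split_pages` and `page_type_vector` are illustrative assumptions, not taken from the claims.

```python
from collections import Counter


def split_pages(pages, target, other):
    """Partition pages containing `target` by whether they also contain `other`."""
    first = [p for p in pages if target in p["words"] and other in p["words"]]
    second = [p for p in pages if target in p["words"] and other not in p["words"]]
    return first, second


def page_type_vector(pages, page_types):
    """Fraction of pages falling into each type, in the order of `page_types`."""
    counts = Counter(p["type"] for p in pages)
    total = sum(counts[t] for t in page_types)
    if total == 0:
        return [0.0] * len(page_types)
    return [counts[t] / total for t in page_types]
```

The same histogram could be built over site types instead of page types by changing the attribute that is counted.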

5. The method of claim 1, wherein the determining the first feature vector and the second feature vector corresponding to the first target word comprises:

acquiring a first picture set and a second picture set corresponding to the first target word, wherein at least one picture in the first picture set is the same as a picture in the picture set corresponding to the second target word;

determining a first feature vector corresponding to the first target word according to the content of each picture in the first picture set;

and determining a second feature vector corresponding to the first target word according to the content of each picture in the second picture set.

6. The method of any one of claims 1-5, wherein the acquiring of the first target word and the second target word to be identified comprises:

determining the first target word and the second target word to be identified according to a query statement and a candidate result.

7. The method of claim 6, wherein after determining the escape probability when the first target word is combined with the second target word, further comprising:

and determining the display sequence of the candidate results according to the escape probability when the first target word is combined with the second target word.
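
Claim 7 leaves open how the escape probability affects the display order. As one hedged possibility, candidates could simply be sorted so that results with a lower escape probability appear first. The dictionary layout and the function name `order_candidates` are assumptions for this sketch.

```python
def order_candidates(candidates):
    """Return candidates sorted by ascending escape probability.

    candidates: list of dicts, each with an "escape_probability" key.
    Python's sort is stable, so ties keep their original relative order.
    """
    return sorted(candidates, key=lambda c: c["escape_probability"])
```

In practice the escape probability would more likely be one signal among several in a ranking function; a pure sort is used here only to keep the example small.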

8. The method of any one of claims 1-5, wherein, in the determining of the escape probability when the first target word is combined with the second target word, the first threshold is less than or equal to the second threshold.

9. An escape recognition apparatus, comprising:

an escape probability determining module, configured to determine the escape probability when the first target word and the second target word are combined according to a distance between the first feature vector and the second feature vector and a distance between the third feature vector and the fourth feature vector;

or,

10. A computer device comprising a processor and a memory;

wherein the processor reads executable program code stored in the memory and runs a program corresponding to the executable program code, so as to implement the escape identification method according to any one of claims 1 to 8.

11. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the escape identification method according to any one of claims 1 to 8.