CN106599278B - Application search intention identification method and device

Info

Publication number
CN106599278B
CN106599278B
Authority
CN
China
Prior art keywords
application search
search query
intention
query string
historical
Prior art date
Legal status
Expired - Fee Related
Application number
CN201611207524.3A
Other languages
Chinese (zh)
Other versions
CN106599278A (en)
Inventor
庞伟
Current Assignee
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201611207524.3A priority Critical patent/CN106599278B/en
Publication of CN106599278A publication Critical patent/CN106599278A/en
Application granted granted Critical
Publication of CN106599278B publication Critical patent/CN106599278B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an application search intention identification method and device, relates to the technical field of data search, and can improve the accuracy of application search intention identification. The method comprises the following steps: acquiring an input query and an application search intention dictionary; when the input query is contained in the application search intention dictionary, determining the intention label corresponding to that query in the application search intention dictionary as the application search intention corresponding to the input query; when the input query is not in the application search intention dictionary, calculating the semantic similarity between the input query and each historical query in the application search intention dictionary, screening a preset number of intention labels from the intention labels corresponding to the first n historical queries with the greatest semantic similarity according to a preset screening algorithm, and determining the screened intention labels as the application search intention corresponding to the input query. The method and the device are mainly suitable for application search scenarios based on an application search engine.

Description

Application search intention identification method and device
Technical Field
The invention relates to the technical field of data search, and in particular to a method and a device for identifying application search intentions.
Background
With the development of internet technology, the kinds and the number of application software used on mobile terminals keep increasing. In order to enable a user to quickly obtain the application software to be downloaded, the prior art provides an application search engine specially used for searching for application software.
The application search engine can provide accurate search services for users only on the premise of accurately understanding the application search intentions of the users. The existing method for determining a user's application search intention is still the method used by web search engines, i.e., a classification method. Specifically, the application search intentions of users are divided through manual sorting into three types, namely a navigation type, an information type and a resource type, and then the type of application software the user intends to find is determined based on the application search query string input by the user. However, since the functions of application software are single and specific, while the three categories are wide in scope and coarse in granularity, when the application search query string input by the user is analyzed it cannot be determined which domain of application software the user wants to search for, so the accuracy of the application search intention obtained by this classification-based identification method is low.
Disclosure of Invention
In view of this, the method and the device for identifying the application search intention provided by the invention can improve the accuracy of identifying the application search intention.
The purpose of the invention is realized by adopting the following technical scheme:
in one aspect, the present invention provides a method for identifying an application search intention, the method comprising:
the method comprises the steps of obtaining an input application search query string and an application search intention dictionary, wherein the application search intention dictionary is obtained by performing machine self-learning according to a historical application search query string and a historical application download record corresponding to the historical application search query string, and comprises the historical application search query string and intention labels corresponding to the historical application search query string, wherein each intention label has a weight;
when the input application search query string is contained in the application search intention dictionary, determining an intention label corresponding to the input application search query string in the application search intention dictionary as an application search intention corresponding to the input application search query string;
when the input application search query string does not exist in the application search intention dictionary, calculating the semantic similarity between the input application search query string and each historical application search query string in the application search intention dictionary, screening a preset number of intention labels from intention labels corresponding to the first n historical application search query strings with the maximum semantic similarity according to a preset screening algorithm, and determining the screened intention labels as the application search intention corresponding to the input application search query string, wherein n is a positive integer.
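As a rough illustration only, the following Python sketch shows the lookup-or-fallback flow described above. The data layout (intention_dict, hist_vecs), the use of cosine similarity as the semantic similarity measure, and the simple summation of label weights are assumptions made for this example; the preferred screening based on Euclidean distance and Gaussian kernel smoothing is detailed in the optional embodiments below.

```python
import numpy as np

def identify_intention(query, query_vec, intention_dict, hist_vecs, n=3, preset_count=2):
    """intention_dict: historical query -> {intention label: weight};
    hist_vecs: historical query -> embedding vector in the same space as query_vec."""
    # Case 1: the input query string is already contained in the application search intention dictionary.
    if query in intention_dict:
        labels = intention_dict[query]
        return sorted(labels, key=labels.get, reverse=True)

    # Case 2: rank the historical queries by semantic similarity (cosine similarity here,
    # as a simple stand-in), then screen a preset number of labels from the top-n neighbours.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    top_n = sorted(hist_vecs, key=lambda h: cos(query_vec, hist_vecs[h]), reverse=True)[:n]
    merged = {}
    for h in top_n:
        for label, weight in intention_dict[h].items():
            merged[label] = merged.get(label, 0.0) + weight
    return sorted(merged, key=merged.get, reverse=True)[:preset_count]
```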
Optionally, the calculating semantic similarity between the input application search query string and each historical application search query string in the application search intention dictionary, and screening a preset number of intention labels from intention labels corresponding to the first n historical application search query strings with the largest semantic similarity according to a preset screening algorithm includes:
respectively calculating Euclidean distances between the input application search query string and the historical application search query string;
screening the first n Euclidean distances with the minimum distance from the calculated Euclidean distances;
determining the historical application search query strings corresponding to the first n Euclidean distances as the first n historical application search query strings with the maximum similarity to the input application search query string;
performing Gaussian kernel smoothing operation on Euclidean distances corresponding to the previous n historical application search query strings, and taking an operation result as the weight of the corresponding historical application search query string;
performing weight merging processing on the same intention label based on the weights of the first n historical application search query strings and the weight of each intention label in the first n historical application search query strings to obtain a merged intention label and the weight of the merged intention label;
and screening the first m intention labels with the largest weight from the merged intention labels, wherein m is the preset number.
Optionally, before obtaining the application search intention dictionary, the method further includes:
acquiring an original corpus required for constructing the application search intention dictionary, wherein the original corpus comprises a historical application search query string and expansion words corresponding to the historical application search query string, and the expansion words are obtained based on other historical application search query strings and historical application download records;
preprocessing the original corpus to obtain a model corpus required by document theme generation model LDA training, wherein the model corpus comprises a historical application search query string, a target noun and a target verb, the target noun is a categorical intention label required by construction of the application search intention dictionary, and the target verb is a functional intention label required by construction of the application search intention dictionary;
performing LDA model training based on the model training corpus to obtain document theme probability distribution and theme term probability distribution, wherein the document is each historical application search query string;
based on the document theme probability distribution, the theme term probability distribution and a preset probability algorithm, obtaining an initial intention label corresponding to each historical application search query string and the weight of the initial intention label, wherein the initial intention labels are the terms whose weights rank in the top p among all target nouns and target verbs, and p is a positive integer;
updating the weight of the initial intention label based on the semantic relationship between the initial intention label and the historical application search query string corresponding to the initial intention label or based on the semantic relationship between the initial intention label and the expansion word corresponding to the historical application search query string;
and screening a preset number of intention labels from the initial intention labels according to a polyline function of the search times of the historical application search query string, so as to construct the application search intention dictionary based on the correspondence between the screened intention labels and the historical application search query strings.
Optionally, the obtaining the original corpus required for constructing the application search intention dictionary includes:
obtaining a historical application search query string from a query session log and an application downloaded based on the historical application search query string;
respectively calculating Euclidean distances between each historical application search query string and the downloaded application and other historical application search query strings;
aiming at a current historical application search query string, determining a downloaded application or other historical application search query strings corresponding to the smallest top q Euclidean distances as an expansion word corresponding to the current historical application search query string, wherein q is a positive integer;
and taking the historical application search query string and the expansion words corresponding to the historical application search query string as the original training corpus.
Optionally, the preprocessing the original corpus, and acquiring a model corpus required for document theme generation model LDA training includes:
segmenting the historical application search query string in the original training corpus and the expanded words corresponding to the historical application search query string to obtain a first segmentation file only containing nouns and verbs and a second segmentation file containing all the segmentations;
calculating the compactness between two adjacent participles in the second participle file;
forming a phrase by the two word segmentations with the compactness larger than a preset threshold value, and adding the phrase into the first word segmentation file to obtain an updated first word segmentation file, wherein the phrase is a verb;
calculating the TF-IDF weight of each participle in the updated first participle file based on a TF-IDF algorithm;
determining nouns with the TF-IDF weights within a preset range as the target nouns, and determining verbs with the TF-IDF weights within a preset range as the target verbs.
Optionally, the obtaining an initial intention label corresponding to each historical application search query string and a weight of the initial intention label based on the document topic probability distribution, the topic term probability distribution and a preset probability algorithm includes:
selecting, based on the document topic probability distribution and the topic term probability distribution, the first x topics with the maximum probability under each historical application search query string and the first y terms with the maximum probability under the first x topics, wherein x and y are both positive integers;
merging the same terms based on the probability of the first x topics and the probability of the first y terms under the first x topics to obtain merged terms and the weight of the merged terms, wherein the number of the merged terms is p;
and determining the merged term as the initial intention label, and determining the weight of the merged term as the weight of the initial intention label.
Optionally, the updating the weight of the initial intention label based on the semantic relationship between the initial intention label and the historical application search query string corresponding to the initial intention label includes:
calculating the sum of the cosine similarities between the initial intention label and each term in the corresponding historical application search query string;
and taking the product of the sum of the cosine similarity and the weight of the initial intention label as the updated weight of the initial intention label.
Optionally, updating the weight of the initial intention label based on the semantic relationship between the initial intention label and the expansion word corresponding to the historical application search query string includes:
calculating a document frequency DF value of an initial intention label corresponding to a historical application search query string based on an expansion word set formed by expansion words of the historical application search query string;
and taking the product of the DF value and the weight of the initial intention label as the updated weight of the initial intention label.
Optionally, the method further includes:
and initializing the same terms into a theme in the process of carrying out LDA model training based on the model training corpus.
In another aspect, the present invention provides an apparatus for identifying an application search intention, the apparatus comprising:
an obtaining unit, configured to obtain an input application search query string and an application search intention dictionary, wherein the application search intention dictionary is obtained through machine self-learning according to historical application search query strings and historical application download records corresponding to the historical application search query strings, and comprises the historical application search query strings and intention labels corresponding to the historical application search query strings, wherein each intention label has a weight;
a first determining unit, configured to determine, when the input application search query string is included in the application search intention dictionary acquired by the acquiring unit, an intention label corresponding to the input application search query string in the application search intention dictionary as an application search intention corresponding to the input application search query string;
a second determining unit, configured to, when the application search intention dictionary acquired by the acquiring unit does not have the input application search query string, calculate semantic similarities between the input application search query string and each historical application search query string in the application search intention dictionary, and according to a preset screening algorithm, screen a preset number of intention labels from intention labels corresponding to n previous historical application search query strings with the largest semantic similarity, and determine the screened intention labels as the application search intention corresponding to the input application search query string, where n is a positive integer.
Optionally, the second determining unit includes:
the first calculation module is used for respectively calculating Euclidean distances between the input application search query string and the historical application search query string;
the screening module is used for screening the first n Euclidean distances with the minimum distance from the Euclidean distances calculated by the first calculation module;
a first determining module, configured to determine the historical application search query strings corresponding to the first n euclidean distances screened by the screening module as the first n historical application search query strings with the largest similarity to the input application search query string;
the operation module is used for performing Gaussian kernel smoothing operation on Euclidean distances corresponding to the previous n historical application search query strings determined by the first determination module and taking an operation result as the weight of the corresponding historical application search query string;
a first merging module, configured to perform weight merging processing on the same intention label based on the weights of the first n historical application search query strings obtained by the operation module and the weight of each intention label in the first n historical application search query strings, and obtain a merged intention label and the weight of the merged intention label;
the screening module is further configured to screen the top m intention labels with the largest weight from the merged intention labels obtained by the first merging module, where m is the preset number.
Optionally, the obtaining unit is further configured to obtain an original corpus required for constructing the application search intention dictionary before obtaining the application search intention dictionary, where the original corpus includes a historical application search query string and an expansion word corresponding to the historical application search query string, and the expansion word is obtained based on other historical application search query strings and historical application download records;
the device further comprises:
the preprocessing unit is used for preprocessing the original corpus to obtain a model corpus required by LDA training of a document theme generation model, wherein the model corpus comprises a historical application search query string, a target noun and a target verb, the target noun is a categorical intention label required by construction of the application search intention dictionary, and the target verb is a functional intention label required by construction of the application search intention dictionary;
the training unit is used for carrying out LDA model training based on the model training corpus to obtain document theme probability distribution and theme term probability distribution, wherein the document is each historical application search query string;
the operation unit is used for acquiring an initial intention label corresponding to each historical application search query string and the weight of the initial intention label based on the document theme probability distribution, the theme term probability distribution and a preset probability algorithm, wherein the initial intention labels are the terms whose weights rank in the top p among all target nouns and target verbs, and p is a positive integer;
an updating unit, configured to update a weight of an initial intention tag based on a semantic relationship between the initial intention tag and a history application search query string corresponding to the initial intention tag, or based on a semantic relationship between the initial intention tag and an expanded word corresponding to the history application search query string;
the screening unit is used for screening a preset number of intention labels from the initial intention labels according to a polyline function of the search times of the historical application search query string;
and the construction unit is used for constructing the application search intention dictionary based on the corresponding relation between the screened intention labels and the historical application search query strings.
Optionally, the obtaining unit includes:
an acquisition module for acquiring a historical application search query string from a query session log and an application downloaded based on the historical application search query string;
the second calculation module is used for respectively calculating Euclidean distances between each historical application search query string acquired by the acquisition module and the downloaded application and other historical application search query strings;
a second determining module, configured to determine, for a current historical application search query string, a downloaded application or other historical application search query string corresponding to a minimum previous q-term euclidean distance as an expansion word corresponding to the current historical application search query string, where q is a positive integer;
and the setting module is used for taking the historical application search query string and the expansion word corresponding to the historical application search query string determined by the second determining module as the original corpus.
Optionally, the preprocessing unit includes:
the word segmentation module is used for segmenting the historical application search query string in the original training corpus and the expanded words corresponding to the historical application search query string to obtain a first word segmentation file only containing nouns and verbs and a second word segmentation file containing all the words;
the third calculation module is used for calculating the compactness between two adjacent participles in the second participle file obtained by the participle module;
the adding module is used for forming a phrase by the two participles of which the compactness is greater than a preset threshold value and calculated by the third calculating module, adding the phrase into the first participle file to obtain an updated first participle file, wherein the phrase is a verb;
the fourth calculation module is used for calculating the TF-IDF weight of each participle in the updated first participle file obtained by the adding module based on a TF-IDF algorithm;
a third determining module, configured to determine the noun with the TF-IDF weight within a preset range obtained by the fourth calculating module as the target noun, and determine the verb with the TF-IDF weight within the preset range as the target verb.
Optionally, the operation unit includes:
a selecting module, configured to select, based on the document topic probability distribution and the topic term probability distribution, the first x topics with the highest probability under each historical application search query string and the first y terms with the highest probability under the first x topics, where x and y are both positive integers;
the second merging module is used for merging the same terms based on the probability of the first x topics selected by the selection module and the probability of the first y terms under the first x topics to obtain the merged terms and the weight of the merged terms, wherein the number of the merged terms is p;
a fourth determining module, configured to determine the merged term obtained by the second merging module as the initial intention tag, and determine a weight of the merged term as a weight of the initial intention tag.
Optionally, the updating unit includes:
a fifth calculation module, configured to calculate a sum of cosine similarities of the initial intent tag and terms in the corresponding historical application search query string;
a fifth determining module, configured to take a product of the sum of the cosine similarities obtained by the fifth calculating module and the weight of the initial intention label as the updated weight of the initial intention label.
Optionally, the updating unit includes:
the sixth calculation module is used for calculating the document frequency DF value of the initial intention label corresponding to the historical application search query string based on an expansion word set formed by expansion words of the historical application search query string;
a sixth determining module, configured to take a product of the DF value obtained by the sixth calculating module and the weight of the initial intention tag as the updated weight of the initial intention tag.
Optionally, the training unit is configured to initialize the same term into a topic in the process of performing LDA model training based on the model training corpus.
By means of the above technical scheme, the method and the device for identifying the application search intention provided by the invention can acquire the application search query string input by the user, first acquire the application search intention dictionary trained according to the historical application search query strings and the historical application download records, and then determine the application search intention of the application search query string input by the user through the historical application search query strings in the application search intention dictionary and the intention labels corresponding to the historical application search query strings. Since the application search intention dictionary of the present invention summarizes user application search intentions from the historical search records and the historical download records, and the summarized application search intentions cover both a category aspect and a function aspect, predicting the application search intention from the application search intention dictionary can improve accuracy compared with simply predicting from three broad categories.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart illustrating a method for identifying an application search intention according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating another method for identifying an application search intention according to an embodiment of the present invention;
FIG. 3 is a block diagram illustrating an apparatus for identifying an application search intention according to an embodiment of the present invention;
fig. 4 shows a block diagram of another apparatus for identifying an application search intention according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides an identification method of an application search intention, which mainly comprises the following steps as shown in figure 1:
101. an input application search query string is obtained along with an application search intent dictionary.
Wherein the application search intention dictionary is obtained by machine self-learning according to historical application search query strings and historical application download records corresponding to the historical application search query strings, and the application search intention dictionary comprises the historical application search query strings and intention labels corresponding to the historical application search query strings, wherein each intention label has a weight.
After a user inputs an Application search query string (hereinafter referred to as query) based on a client, the client reports the query to an Application search engine server, the Application search engine server determines an Application search intention of the user according to an Application search intention dictionary and the input query, and then feeds back a corresponding APP (Application) to the client according to the Application search intention.
102. When the input application search query string is contained in the application search intention dictionary, determining an intention label corresponding to the input application search query string in the application search intention dictionary as an application search intention corresponding to the input application search query string.
After obtaining the query input by the user and the application search intention dictionary constructed in advance, the application search engine server may first search whether a historical query identical to the input query is contained in the application search intention dictionary; if the application search intention dictionary contains a historical query identical to the input query, the intention label corresponding to that historical query can be directly determined as the application search intention corresponding to the input query; if the application search intention dictionary does not contain a historical query identical to the input query, the following step 103 is executed.
103. When the application search intention dictionary does not have the input application search query string, calculating the semantic similarity between the input application search query string and each historical application search query string in the application search intention dictionary, screening preset number of intention labels from intention labels corresponding to the first n historical application search query strings with the maximum semantic similarity according to a preset screening algorithm, and determining the screened intention labels as the application search intention corresponding to the input application search query string.
Wherein n is a positive integer. When the application search intention dictionary does not contain a historical query identical to the input query, historical queries similar to the input query can be selected from the application search intention dictionary, then a preset number of intention labels that are closest to the input query are selected, according to a preset screening algorithm, from all the intention labels corresponding to these historical queries, and the screened intention labels are determined as the application search intention corresponding to the input query.
According to the method for identifying the application search intention, after the application search query string input by the user is obtained, the application search intention dictionary obtained through training according to the historical application search query string and the historical application download record is obtained, and then the application search intention of the application search query string input by the user is determined through the historical application search query string in the application search intention dictionary and the intention label corresponding to the historical application search query string. Since the application search intention dictionary of the present invention is a user application search intention summarized from the history search records and the history download records, and the summarized application search intention includes a category aspect and a function aspect, predicting the application search intention from the application search intention dictionary can improve accuracy, compared to simply predicting from three broad categories.
Further, according to the method shown in fig. 1, another embodiment of the present invention further provides an identification method of an application search intention, as shown in fig. 2, the method mainly includes:
201. and acquiring original training corpora required for constructing the application search intention dictionary.
The original corpus comprises a historical application search query string and an expansion word corresponding to the historical application search query string, and the expansion word is obtained based on other historical application search query strings and historical application download records.
The specific implementation manner of this step may be:
(a1) historical application search query strings are obtained from a query session log and applications downloaded based on the historical application search query strings.
A session records a series of search and download behaviors of a user within a certain period of time, so information such as the historical queries input by the user and the APPs downloaded based on those historical queries can be obtained from the session log. In practical application, session logs generated by network-wide users within a preset time period (such as one year) can be obtained, each session is analyzed, and the information about the historical queries and downloaded APPs generated under each session is obtained.
In a specific implementation process, a file storing query and download information, and a file storing only query information can be constructed, and then the history query is expanded based on the two files.
For the file storing query and download information: the query and download sequence within each session may be constructed first, and then all sequences are stored in one file. Specifically, for each session, the historical queries input by the user are obtained according to the actual search order, then the APPs downloaded based on each historical query are obtained, and finally each obtained APP is written immediately after its corresponding historical query, thereby forming the query and download sequence of that session; after all session sequences are obtained, each session sequence is output as one line to a file (e.g., with the file name session_query-app_list.txt). For example, if a user inputs query1, query2 and query3 in sequence in a session, downloads APP1 after inputting query1, and downloads APP2 and APP3 after inputting query3, the sequence corresponding to the session may be: query1, APP1, query2, query3, APP2, APP3.
For the file that stores only query information: each history query serves as a row, and all history queries are output to another file (such as a file with the name query_all.txt).
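A minimal sketch of how the two files described in step (a1) could be produced is shown below, assuming each session has already been parsed into an ordered list of (event type, value) pairs; the in-memory structure is an assumption, while the file names follow the examples in the text.

```python
# Each session is assumed to be an ordered list of events:
# ("query", "query1"), ("download", "APP1"), ("query", "query2"), ...
sessions = [
    [("query", "query1"), ("download", "APP1"),
     ("query", "query2"),
     ("query", "query3"), ("download", "APP2"), ("download", "APP3")],
]

with open("session_query-app_list.txt", "w", encoding="utf-8") as seq_file, \
     open("query_all.txt", "w", encoding="utf-8") as query_file:
    for session in sessions:
        sequence = []
        for kind, value in session:
            sequence.append(value)          # downloads end up right after the query that produced them
            if kind == "query":
                query_file.write(value + "\n")
        seq_file.write(" ".join(sequence) + "\n")
```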
(a2) Respectively calculating Euclidean distances between each historical application search query string and the downloaded application and other historical application search query strings; and aiming at the current historical application search query string, determining the downloaded application or other historical application search query strings corresponding to the smallest top q Euclidean distances as the expansion words corresponding to the current historical application search query string, wherein q is a positive integer.
When calculating the Euclidean distance between two words, the two words need to be converted into computable vectors, so after obtaining the file storing the query and download information in step (a1), each historical query and each APP in the file needs to be converted into a multidimensional vector. In practical application, each history query in the session_query-app_list.txt file together with the APPs behind it can be taken as a whole, and the deep learning toolkit word2vec can be used to calculate a multidimensional vector for each history query and each APP; the multidimensional vectors are output to another file (for example, with the file name query_w2v.dict). The dimension of the vector may be 300, or may be another value.
After the query_w2v.dict file is obtained, query_all.txt and query_w2v.dict can be used as input, the top q items (for example, 300 items) nearest to each history query in query_all.txt are calculated by using the KNN (K-Nearest Neighbor) algorithm, and these q items are used as the expansion words of the corresponding history query. Specifically, for each history query, the Euclidean distances between the current history query and all other history queries and all APPs may be calculated; the Euclidean distances are then sorted from small to large, the smallest q distances are screened out, and the terms (history queries or APPs) corresponding to these q Euclidean distances are determined as the expansion words corresponding to the current historical application search query string.
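The expansion-word selection of step (a2) can be sketched as follows, assuming the multidimensional vectors have already been computed (e.g., by word2vec) and loaded into a dictionary; the toy random vectors and the use of scipy's cKDTree are illustrative choices, with the nearest neighbours taken by Euclidean distance as in the text.

```python
import numpy as np
from scipy.spatial import cKDTree

# Assumed: 300-dimensional vectors for every historical query and APP,
# e.g. loaded from the query_w2v.dict file produced by word2vec.
vectors = {"query1": np.random.rand(300), "query2": np.random.rand(300),
           "APP1": np.random.rand(300), "APP2": np.random.rand(300)}

names = list(vectors)
matrix = np.stack([vectors[n] for n in names])
tree = cKDTree(matrix)

def expansion_words(query, q=2):
    """Return the q items (historical queries or APPs) closest to `query` by Euclidean distance."""
    dists, idx = tree.query(vectors[query], k=q + 1)   # +1: the query is its own nearest neighbour
    return [names[i] for i in idx if names[i] != query][:q]

print(expansion_words("query1", q=2))
```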
(a3) And taking the historical application search query string and the expansion words corresponding to the historical application search query string as the original training corpus.
After the expansion words of each history query are obtained, the history query and its corresponding expansion words may be output as one line to a file (e.g., query_ex.txt), and this file is used as the original training corpus. The storage format of each line may be: "query \t expansion words of the query".
202. And preprocessing the original training corpus to obtain a model training corpus required by the LDA training of the document theme generation model.
The model corpus comprises a historical application search query string, a target noun and a target verb, wherein the target noun and the target verb correspond to the historical application search query string, the target noun is a categorical intention label required for constructing the application search intention dictionary, and the target verb is a functional intention label required for constructing the application search intention dictionary.
The specific implementation manner of this step may be:
(b1) and segmenting the historical application search query string in the original training corpus and the expanded words corresponding to the historical application search query string to obtain a first segmentation file only containing nouns and verbs and a second segmentation file containing all the segmentations.
Wherein the nouns and verbs in the first word segmentation file do not include names of persons, places or organizations. Since the intention labels representing the APP category are nouns and the intention labels representing the APP functionality are verbs, after the original corpus is segmented, the required nouns and verbs can be extracted from it as part of the data required for determining the intention labels.
(b2) And calculating the closeness between two adjacent participles in the second participle file, forming the two participles with the closeness larger than a preset threshold value into a phrase, and adding the phrase into the first participle file to obtain the updated first participle file, wherein the phrase is a verb.
After the segmentation processing is performed on the original corpus, some important functional phrases are likely to be cut apart, and the functional phrases can represent the application search intention of the user, so in order to fully acquire the application search intention, some important functional phrases can be continuously mined based on the segmentation, such as: province _ flow, make _ brick, bus _ transfer, cold _ joke, etc.
The closeness between two participles may be calculated with the cPMI algorithm, or with other algorithms. The cPMI score is calculated as:

cPMI(x, y) = log( D(x, y) · D / (D(x) · D(y)) )

where D(x, y) represents the co-occurrence frequency of the two terms x and y, D(x) represents the occurrence frequency of the term x, and D represents the total APP number.
After the cPMI values are obtained, they may be sorted in descending order, and then two adjacent participles whose cPMI value is higher than a preset threshold are combined into a phrase, and the phrase is added to the first word segmentation file so as to update it.
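A small sketch of this phrase-mining step, using the PMI-style closeness reconstructed above; the toy records and the threshold value are assumptions.

```python
import math
from collections import Counter

# Assumed corpus: one segmented record per line of the second word-segmentation file.
records = [["bus", "transfer", "query"], ["bus", "transfer", "route"], ["cold", "joke", "daily"]]

term_count = Counter(t for r in records for t in set(r))
pair_count = Counter((r[i], r[i + 1]) for r in records for i in range(len(r) - 1))
D = len(records)   # total number of records

def closeness(x, y):
    """PMI-style closeness between two adjacent segments x and y."""
    d_xy = pair_count[(x, y)]
    if d_xy == 0:
        return float("-inf")
    return math.log(d_xy * D / (term_count[x] * term_count[y]))

THRESHOLD = 0.3   # preset threshold (assumed)
phrases = {f"{x}_{y}" for (x, y), c in pair_count.items() if closeness(x, y) > THRESHOLD}
print(phrases)   # e.g. phrases such as "bus_transfer" would then be added to the first segmentation file
```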
(b3) Calculating TF-IDF weight of each participle in the updated first participle file based on TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, determining nouns with the TF-IDF weight within a preset range as the target nouns, and determining verbs with the TF-IDF weight within the preset range as the target verbs.
In practical application, terms with higher occurrence frequency or lower occurrence frequency are often terms with wide application, weak pertinence, and cannot directly reflect application search intention, so that the terms can be deleted from the updated first segmentation file, and the second updated first segmentation file is used as a model training corpus required for LDA (document topic generation model) training. For example, terms that need to be deleted may include "acknowledged, named, modified, deluxe, spectacular, existing, due, unmatched, fair, rare, bid, go in and go out".
In addition, the storage format of the second updated first word segmentation file may be: query_id \t term1 term2 … termn.
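For illustration, a hand-rolled TF-IDF filter in the spirit of step (b3); the toy documents and the preset range bounds are assumptions.

```python
import math
from collections import Counter

# Assumed: each document is one line of the updated first segmentation file
# (a historical query plus its expansion words, reduced to nouns, verbs and phrases).
docs = [["takeaway", "order", "meal"], ["order", "taxi"], ["meal", "recipe", "cook"]]

N = len(docs)
df = Counter(t for d in docs for t in set(d))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)
    idf = math.log(N / df[term])
    return tf * idf

# Keep only terms whose TF-IDF weight falls inside a preset range,
# dropping both overly common and overly rare terms (range values are assumed).
LOW, HIGH = 0.05, 0.9
kept = {t for d in docs for t in d if LOW <= tf_idf(t, d) <= HIGH}
print(kept)
```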
203. And performing LDA model training based on the model training corpus to obtain document theme probability distribution and theme term probability distribution.
Wherein the document is each historical application search query string. In practical application, the GibbsLDA++ implementation can be selected. However, GibbsLDA++ randomly initializes each term occurrence to a topic, so the same repeated term may be initialized to several different topics; in a vertical application field the possibility of term ambiguity is low, so initializing identical terms to the same topic better matches the situation in the vertical application field. Therefore, the embodiment of the invention may modify the GibbsLDA++ source code so that it initializes identical terms to the same topic.
In practical application, 120 topics can be selected, 300 rounds of iteration are performed on LDA model training, and document-topic probability distribution and topic-term probability distribution are output.
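The patent performs LDA training with a modified GibbsLDA++ (Gibbs sampling). As a stand-in for illustration, the sketch below uses gensim's LdaModel (variational inference), which does not reproduce the same-term-same-topic initialization tweak; the toy corpus is an assumption, while the 120 topics and 300 iterations follow the numbers given above.

```python
from gensim import corpora
from gensim.models import LdaModel

# Assumed model corpus: one term list per historical query (the second-updated first segmentation file).
texts = [["takeaway", "order", "meal"], ["navigate", "position", "map"], ["lottery", "two_color_ball"]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=120, iterations=300)

doc_topics = [lda.get_document_topics(bow) for bow in bow_corpus]                 # document-topic distribution
topic_terms = [lda.get_topic_terms(k, topn=10) for k in range(lda.num_topics)]    # topic-term distribution
```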
204. And acquiring an initial intention label corresponding to each historical application search query string and the weight of the initial intention label based on the document theme probability distribution, the theme term probability distribution and a preset probability algorithm.
Wherein the initial intention labels are the terms whose weights rank in the top p among all target nouns and target verbs, and p is a positive integer.
The specific implementation manner of this step may be: selecting, based on the document topic probability distribution and the topic term probability distribution, the first x topics with the maximum probability under each historical application search query string and the first y terms with the maximum probability under those topics, wherein x and y are both positive integers; merging the same terms based on the probabilities of the first x topics and the probabilities of the first y terms under the first x topics to obtain the merged terms and the weights of the merged terms; determining the merged terms as the initial intention labels, and determining the weights of the merged terms as the weights of the initial intention labels. The number of the merged terms is p.
For example, when x is 2 and y is 3: if the first 2 topics with the highest probability under a certain historical query are topic1 (probability 0.5) and topic2 (probability 0.3), the first 3 terms of topic1 are word1 (probability 0.4), word2 (probability 0.2) and word3 (probability 0.1), and the first 3 terms of topic2 are word2 (probability 0.5), word4 (probability 0.3) and word1 (probability 0.1), then the weight of word1 is 0.5 × 0.4 + 0.3 × 0.1 = 0.23, the weight of word2 is 0.5 × 0.2 + 0.3 × 0.5 = 0.25, the weight of word3 is 0.5 × 0.1 = 0.05, and the weight of word4 is 0.3 × 0.3 = 0.09.
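The merging described in step 204 can be reproduced directly from the numbers of the example above:

```python
# Top-2 topics for one historical query and the top-3 terms under each topic,
# using the probabilities from the worked example.
topic_probs = {"topic1": 0.5, "topic2": 0.3}
term_probs = {
    "topic1": {"word1": 0.4, "word2": 0.2, "word3": 0.1},
    "topic2": {"word2": 0.5, "word4": 0.3, "word1": 0.1},
}

merged = {}
for topic, p_topic in topic_probs.items():
    for term, p_term in term_probs[topic].items():
        merged[term] = merged.get(term, 0.0) + p_topic * p_term   # weight = sum of P(topic) * P(term | topic)

print({t: round(w, 2) for t, w in merged.items()})
# -> {'word1': 0.23, 'word2': 0.25, 'word3': 0.05, 'word4': 0.09}
```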
205. Updating the weight of the initial intention label based on a semantic relationship between the initial intention label and a historical application search query string corresponding to the initial intention label, or based on a semantic relationship between the initial intention label and an expansion word corresponding to the historical application search query string.
Based on the semantic relationship between the initial intention tag and the historical application search query string corresponding to the initial intention tag, the specific implementation manner of updating the weight of the initial intention tag may be: calculating the sum of the cosine similarity of each term in the initial intention label and the corresponding historical application search query string; and taking the product of the sum of the cosine similarity and the weight of the initial intention label as the updated weight of the initial intention label.
In step (a1) of step 201, the file query_all.txt storing only history queries has already been obtained. Therefore, in this step, when calculating the cosine similarity between the initial intent tag and each term of a history query, the query_all.txt file can be read directly and each history query in it segmented; then the vector of each participle and the vector of each initial intent tag are calculated, the cosine similarity between the vector of the initial intent tag and the vector of each word in the corresponding history query is calculated, these cosine similarities are added up, and finally the sum is multiplied by the original weight of the initial intent tag, the product being taken as the new weight of the initial intent tag.
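A minimal sketch of this cosine-similarity-based weight update, with toy two-dimensional vectors standing in for the word2vec vectors:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def update_weight_by_query(tag_vec, query_term_vecs, old_weight):
    """New weight = old weight * sum of cosine similarities between the tag
    and every segmented term of the corresponding historical query."""
    sim_sum = sum(cosine(tag_vec, v) for v in query_term_vecs)
    return old_weight * sim_sum

tag_vec = np.array([1.0, 0.0])                                   # toy vectors (assumed)
query_term_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(update_weight_by_query(tag_vec, query_term_vecs, old_weight=0.25))   # 0.25 * (1.0 + 0.0) = 0.25
```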
Based on the semantic relationship between the initial intention tag and the expansion words corresponding to the historical application search query string, a specific implementation manner of updating the weight of the initial intention tag may be: calculating a document frequency DF value of an initial intention label corresponding to a historical application search query string based on an expansion word set formed by expansion words of the historical application search query string; and taking the product of the DF value and the weight of the initial intention label as the updated weight of the initial intention label.
The DF value may be calculated as:

DF(W) = N(W) / D

where W represents the initial intention label, N(W) represents the number of expansion word sets that contain W, and D represents the number of expansion word sets of the historical query.
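A small sketch of the DF-based update, under the reconstructed DF definition above (each expansion word set is treated as a document):

```python
def update_weight_by_df(tag, expansion_word_sets, old_weight):
    """DF(W) = (number of expansion word sets containing the tag W) / D,
    with D the number of expansion word sets; new weight = DF * old weight."""
    D = len(expansion_word_sets)
    df = sum(1 for s in expansion_word_sets if tag in s) / D
    return df * old_weight

sets = [{"takeaway", "order"}, {"meal", "takeaway"}, {"taxi"}]
print(round(update_weight_by_df("takeaway", sets, old_weight=0.3), 2))   # (2/3) * 0.3 = 0.2
```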
It should be added that after obtaining the initial intention labels of each historical query and the updated weights of the initial intention labels, the initial intention labels may be sorted in order of the weights from large to small to reflect the importance degree of each initial intention label relative to the historical query.
206. And screening a preset number of intention labels from the initial intention labels according to a polyline function of the search times of the historical application search query string, so as to construct the application search intention dictionary based on the correspondence between the screened intention labels and the historical application search query strings.
In practical application, the searched times of each history query are different, and the more the searched times, the richer the search intention of the user, so the number of intention labels required to be reserved when constructing the application search intention dictionary can be determined according to the searched times of the history queries.
Suppose the polyline function is {s_1:c_1; …; s_i:c_i; …; s_j:c_j; …}, where s_i denotes a certain number of searches and c_i denotes the number of intention labels to retain. When the number of searches is s, with s_i ≤ s < s_j, the number of retained intention labels may be calculated by linear interpolation:

c = c_i + (s - s_i) / (s_j - s_i) × (c_j - c_i)
For example, if the polyline function is "10:1; 100:2; 500:4; 5000:5; 10000:7; 20000:10", then when the number of searches is 10, 1 intention label needs to be retained, and when the number of searches is 100, 2 intention labels need to be retained; when the number of searches is 7000, the number of intention labels to retain lies between 5 and 7:

c = 5 + (7000 - 5000) / (10000 - 5000) × (7 - 5) = 5.8, i.e., 6 intention labels after rounding.
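A sketch of the polyline lookup, assuming linear interpolation and rounding between breakpoints as reconstructed above:

```python
def retained_label_count(search_count, polyline):
    """polyline: list of (search_count, label_count) breakpoints; values between
    two breakpoints are assumed to be linearly interpolated and then rounded."""
    points = sorted(polyline)
    if search_count <= points[0][0]:
        return points[0][1]
    if search_count >= points[-1][0]:
        return points[-1][1]
    for (s_i, c_i), (s_j, c_j) in zip(points, points[1:]):
        if s_i <= search_count < s_j:
            return round(c_i + (search_count - s_i) / (s_j - s_i) * (c_j - c_i))

polyline = [(10, 1), (100, 2), (500, 4), (5000, 5), (10000, 7), (20000, 10)]
print(retained_label_count(7000, polyline))   # 5 + 2000/5000 * 2 = 5.8 -> 6 after rounding
```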
After the finally required intention labels are screened out, the application search intention dictionary can be constructed based on the correspondence between the historical queries and the intention labels. The format of the application search intention dictionary may be: query \t tag1 tag2 tag3 … tagn.
For example, as shown in table 1, the historical query in the application search intention dictionary includes "i want to order a meal", "beidou gps satellite finder", and "lottery drawing query", wherein the intention label of "i want to order a meal" includes takeaway, meal order, gourmet, etc., the intention label of "beidou gps satellite finder" includes positioning, navigation, beidou, etc., and the intention label of "lottery drawing query" includes lottery, two-color ball, tool, etc.
TABLE 1

Historical query              Intention labels
i want to order a meal        takeaway, meal order, gourmet, etc.
beidou gps satellite finder   positioning, navigation, beidou, etc.
lottery drawing query         lottery, two-color ball, tool, etc.
207. An input application search query string is obtained along with an application search intent dictionary.
The specific implementation manner of this step is the same as that of step 101 described above, and is not described herein again.
208. When the input application search query string is contained in the application search intention dictionary, determining an intention label corresponding to the input application search query string in the application search intention dictionary as an application search intention corresponding to the input application search query string.
The specific implementation manner of this step is the same as that of step 102 described above, and is not described herein again.
209. When the application search intention dictionary does not have the input application search query string, calculating the semantic similarity between the input application search query string and each historical application search query string in the application search intention dictionary, screening preset number of intention labels from intention labels corresponding to the first n historical application search query strings with the maximum semantic similarity according to a preset screening algorithm, and determining the screened intention labels as the application search intention corresponding to the input application search query string.
The specific implementation manner of this step may be: respectively calculating Euclidean distances between the input application search query string and the historical application search query string; screening the first n Euclidean distances with the minimum distance from the calculated Euclidean distances; determining the historical application search query strings corresponding to the first n Euclidean distances as the first n historical application search query strings with the maximum similarity to the input application search query string; performing Gaussian kernel smoothing operation on Euclidean distances corresponding to the previous n historical application search query strings, and taking an operation result as the weight of the corresponding historical application search query string; performing weight merging processing on the same intention label based on the weights of the first n historical application search query strings and the weight of each intention label in the first n historical application search query strings to obtain a merged intention label and the weight of the merged intention label; and screening the first m intention labels with the largest weight from the merged intention labels, wherein m is the preset number.
Wherein, when determining the top n history queries adjacent to the input query, a k-d tree (k-dimensional tree) can be used to reduce the computational complexity.
When there is only one intention tag in the current n history queries, the weight of the intention tag needs to be updated based on the weight of the history query corresponding to the intention tag.
The specific implementation method for merging the same intention labels can be seen in the following examples:
If the first 3 historical queries adjacent to the input query are query1 (weight 0.4), query2 (weight 0.3) and query3 (weight 0.3), the intention labels of query1 are tag1 (weight 0.3) and tag2 (weight 0.2), the intention labels of query2 are tag2 (weight 0.4) and tag3 (weight 0.3), and the intention labels of query3 are tag1 (weight 0.5) and tag4 (weight 0.2), then after the merging process the weight of tag1 is 0.4 × 0.3 + 0.3 × 0.5 = 0.27, the weight of tag2 is 0.4 × 0.2 + 0.3 × 0.4 = 0.20, the weight of tag3 is 0.3 × 0.3 = 0.09, and the weight of tag4 is 0.3 × 0.2 = 0.06.
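The label merging in this example can be reproduced as follows; the neighbour weights are taken as already Gaussian-smoothed, and m (the preset number of labels to keep) is an assumed value:

```python
# Neighbour weights (after Gaussian kernel smoothing) and per-neighbour label weights
# from the worked example above.
neighbour_weights = {"query1": 0.4, "query2": 0.3, "query3": 0.3}
neighbour_labels = {
    "query1": {"tag1": 0.3, "tag2": 0.2},
    "query2": {"tag2": 0.4, "tag3": 0.3},
    "query3": {"tag1": 0.5, "tag4": 0.2},
}

merged = {}
for q, w_q in neighbour_weights.items():
    for tag, w_t in neighbour_labels[q].items():
        merged[tag] = merged.get(tag, 0.0) + w_q * w_t   # weight(tag) = sum of query_weight * tag_weight

print({t: round(w, 2) for t, w in merged.items()})
# -> {'tag1': 0.27, 'tag2': 0.2, 'tag3': 0.09, 'tag4': 0.06}

m = 2   # preset number m: keep the m labels with the largest merged weight
print(sorted(merged, key=merged.get, reverse=True)[:m])   # -> ['tag1', 'tag2']
```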
Further, according to the above method embodiment, another embodiment of the present invention further provides an apparatus for identifying an application search intention, as shown in fig. 3, the apparatus mainly includes: an acquisition unit 31, a first determination unit 32, and a second determination unit 33. Wherein the content of the first and second substances,
an obtaining unit 31, configured to obtain an input application search query string and an application search intention dictionary, where the application search intention dictionary is obtained by performing machine self-learning according to a historical application search query string and a historical application download record corresponding to the historical application search query string, and the application search intention dictionary includes the historical application search query string and intention labels corresponding to the historical application search query string, where each intention label has a weight;
a first determining unit 32, configured to determine, when the input application search query string is contained in the application search intention dictionary acquired by the acquiring unit 31, an intention tag corresponding to the input application search query string in the application search intention dictionary as an application search intention corresponding to the input application search query string;
a second determining unit 33, configured to, when there is no input application search query string in the application search intention dictionary acquired by the acquiring unit 31, calculate semantic similarities between the input application search query string and each historical application search query string in the application search intention dictionary, and according to a preset filtering algorithm, filter out a preset number of intention labels from intention labels corresponding to the top n historical application search query strings with the largest semantic similarity, and determine the filtered intention labels as the application search intention corresponding to the input application search query string, where n is a positive integer.
Optionally, as shown in fig. 4, the second determining unit 33 includes:
a first calculating module 331, configured to calculate euclidean distances between the input application search query string and the historical application search query strings, respectively;
a screening module 332, configured to screen the top n euclidean distances with the smallest distance from the euclidean distances calculated by the first calculation module 331;
a first determining module 333, configured to determine, as the top n historical application search query strings with the greatest similarity to the input application search query string, the historical application search query strings corresponding to the top n euclidean distances screened by the screening module 332;
an operation module 334, configured to perform a gaussian kernel smoothing operation on the euclidean distances corresponding to the first n historical application search query strings determined by the first determining module 333, and use the operation result as the weight of the corresponding historical application search query string;
a first merging module 335, configured to perform weight merging processing on the same intention label based on the weights of the first n historical application search query strings obtained by the operation module 334 and the weight of each intention label in the first n historical application search query strings, to obtain a merged intention label and a weight of the merged intention label;
the filtering module 332 is further configured to filter the top m most weighted intention labels from the merged intention labels obtained by the first merging module 335, where m is the preset number.
Optionally, the obtaining unit 31 is further configured to obtain an original corpus required for constructing the application search intention dictionary before obtaining the application search intention dictionary, where the original corpus includes a historical application search query string and an expansion word corresponding to the historical application search query string, and the expansion word is obtained based on other historical application search query strings and historical application download records;
optionally, as shown in fig. 4, the apparatus further includes:
a preprocessing unit 34, configured to preprocess the original corpus to obtain a model corpus required for document theme generation model LDA training, where the model corpus includes a historical application search query string, a target noun and a target verb corresponding to the historical application search query string, the target noun is a categorical intent tag required for building the application search intent dictionary, and the target verb is a functional intent tag required for building the application search intent dictionary;
a training unit 35, configured to perform LDA model training based on the model training corpus to obtain a document topic probability distribution and a topic term probability distribution, where the document is each historical application search query string;
an operation unit 36, configured to obtain an initial intention tag and the weight of the initial intention tag corresponding to each historical application search query string based on the document topic probability distribution, the topic term probability distribution and a preset probability algorithm, where the initial intention tags are the terms whose weights rank in the top p among all the target nouns and target verbs, and p is a positive integer;
an updating unit 37, configured to update the weight of the initial intent tag based on a semantic relationship between the initial intent tag and a historical application search query string corresponding to the initial intent tag, or based on a semantic relationship between the initial intent tag and an expanded word corresponding to the historical application search query string;
a screening unit 38, configured to screen a preset number of intention labels from the initial intention labels according to a broken line function of the search times of the historical application search query string;
a constructing unit 39, configured to construct the application search intention dictionary based on the correspondence between the screened intention labels and the historical application search query strings.
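The patent does not spell out the exact form of the broken line function used by the screening unit 38, so the sketch below assumes a piecewise-linear mapping from the search count of a historical query to the number of intention labels retained for it, followed by the dictionary construction performed by the constructing unit 39. The breakpoints, function names and data layout are illustrative assumptions only.

    import numpy as np

    def labels_to_keep(search_count, xp=(0, 100, 1000, 10000), fp=(1, 2, 4, 6)):
        """Assumed broken line (piecewise-linear) function: more frequently searched
        queries keep more intention labels. Breakpoints are illustrative."""
        return int(round(np.interp(search_count, xp, fp)))

    def build_intent_dictionary(initial_tags, search_counts):
        """initial_tags: {query: [(tag, updated_weight), ...]}
        search_counts: {query: number of times the query was searched}."""
        dictionary = {}
        for query, tags in initial_tags.items():
            k = labels_to_keep(search_counts.get(query, 0))
            dictionary[query] = sorted(tags, key=lambda tw: tw[1], reverse=True)[:k]
        return dictionary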
Optionally, as shown in fig. 4, the obtaining unit 31 includes:
an obtaining module 311, configured to obtain a historical application search query string from a query session log and an application downloaded based on the historical application search query string;
a second calculating module 312, configured to calculate euclidean distances between each historical application search query string acquired by the acquiring module 311 and the downloaded application and the other historical application search query strings respectively;
a second determining module 313, configured to determine, for a current historical application search query string, the downloaded applications or other historical application search query strings corresponding to the q smallest euclidean distances as the expansion words corresponding to the current historical application search query string, where q is a positive integer;
a setting module 314, configured to use a historical application search query string and an expanded word corresponding to the historical application search query string determined by the second determining module 313 as the original corpus.
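A minimal sketch of the expansion-word selection performed by these modules is given below. It assumes that every historical query string and every downloaded application name has already been mapped to a numeric vector (the embedding itself is not specified here); all names and parameters are illustrative.

    import numpy as np

    def nearest_expansion_words(current_query, candidates, vectors, q=3):
        """Pick the q candidates (downloaded application names or other historical
        queries) whose vectors are closest to the current query by Euclidean distance.

        current_query: key into `vectors`
        candidates:    candidate strings, excluding current_query
        vectors:       dict mapping each string to a numpy vector (assumed embedding)
        """
        target = vectors[current_query]
        scored = [(c, np.linalg.norm(vectors[c] - target)) for c in candidates]
        scored.sort(key=lambda cd: cd[1])
        return [c for c, _ in scored[:q]]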
Optionally, as shown in fig. 4, the training unit 35 includes:
a word segmentation module 351, configured to perform word segmentation on the historical application search query string in the original corpus and the expanded words corresponding to the historical application search query string, to obtain a first word segmentation file only containing nouns and verbs and a second word segmentation file containing all words;
a third calculating module 352, configured to calculate a closeness between two adjacent participles in the second participle file obtained by the participle module 351;
the adding module 353 is configured to combine the two participles with the closeness greater than the preset threshold value, which are calculated by the third calculating module 352, into a phrase, add the phrase to the first participle file, and obtain an updated first participle file, where the phrase is a verb;
a fourth calculating module 354, configured to calculate, based on a TF-IDF algorithm, a TF-IDF weight of each participle in the updated first participle file obtained by the adding module 353;
a third determining module 355, configured to determine a noun with the TF-IDF weight within a preset range obtained by the fourth calculating module 354 as the target noun, and determine a verb with the TF-IDF weight within a preset range as the target verb.
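The sketch below illustrates the TF-IDF weighting and the range-based selection of target nouns and verbs described above. The patent fixes neither the exact TF-IDF variant nor the preset range, so both are illustrative assumptions, as is the part-of-speech lookup.

    import math
    from collections import Counter

    def tf_idf_weights(docs):
        """docs: list of token lists (the updated first word-segmentation file,
        one list per historical query and its expansion words).
        Returns {term: highest TF-IDF weight observed over all documents}."""
        n_docs = len(docs)
        df = Counter()
        for doc in docs:
            df.update(set(doc))
        weights = {}
        for doc in docs:
            tf = Counter(doc)
            for term, count in tf.items():
                w = (count / len(doc)) * math.log(n_docs / (1 + df[term]))
                weights[term] = max(weights.get(term, float("-inf")), w)
        return weights

    def select_target_words(weights, pos_of, low, high):
        """Keep nouns and verbs whose TF-IDF weight falls inside the preset range
        [low, high]; pos_of maps a term to 'n' or 'v' (output of the segmenter)."""
        nouns = [t for t, w in weights.items() if pos_of.get(t) == "n" and low <= w <= high]
        verbs = [t for t, w in weights.items() if pos_of.get(t) == "v" and low <= w <= high]
        return nouns, verbs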
Optionally, as shown in fig. 4, the operation unit 36 includes:
a selecting module 361, configured to select, based on the document topic probability distribution and the topic term probability distribution, the first x topics with the highest probability under each historical application search query string and the first y terms with the highest probability under the first x topics, where x and y are positive integers;
a second merging module 362, configured to merge the same terms based on the probability of the first x topics selected by the selection module 361 and the probability of the first y terms under the first x topics, to obtain a merged term and a weight of the merged term, where the number of the merged terms is p;
a fourth determining module 363, configured to determine the merged term obtained by the second merging module 362 as the initial intention tag, and determine the weight of the merged term as the weight of the initial intention tag.
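A hedged sketch of these modules is given below, using gensim's LdaModel. The "preset probability algorithm" is read here as weighting each term by P(topic | document) · P(term | topic), summed over the selected topics; this is one plausible interpretation rather than the patent's definitive formula, and all parameter values are illustrative.

    from collections import defaultdict
    from gensim import corpora, models

    def initial_intent_tags(token_docs, x=3, y=5, p=10, num_topics=50):
        """token_docs: list of token lists, one per historical query (the model corpus).
        Returns {document index: [(term, weight), ...]} with at most p merged terms each."""
        dictionary = corpora.Dictionary(token_docs)
        bows = [dictionary.doc2bow(doc) for doc in token_docs]
        lda = models.LdaModel(bows, id2word=dictionary, num_topics=num_topics)

        tags = {}
        for i, bow in enumerate(bows):
            # first x topics with the highest probability under this document
            doc_topics = sorted(lda.get_document_topics(bow),
                                key=lambda tp: tp[1], reverse=True)[:x]
            merged = defaultdict(float)
            for topic_id, p_topic in doc_topics:
                # first y terms with the highest probability under each selected topic
                for term_id, p_term in lda.get_topic_terms(topic_id, topn=y):
                    merged[dictionary[term_id]] += p_topic * p_term  # merge identical terms
            tags[i] = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:p]
        return tags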
Optionally, as shown in fig. 4, the updating unit 37 includes:
a fifth calculating module 371, configured to calculate a sum of cosine similarities of the initial intent tag and terms in the corresponding historical application search query string;
a fifth determining module 372, configured to take the product of the sum of the cosine similarities obtained by the fifth calculating module 371 and the weight of the initial intention label as the updated weight of the initial intention label.
Optionally, as shown in fig. 4, the updating unit 37 includes:
a sixth calculating module 373, configured to calculate a document frequency DF value of the initial intent tag corresponding to the historical application search query string based on an expanded term set formed by expanded terms of the historical application search query string;
a sixth determining module 374, configured to take the product of the DF value obtained by the sixth calculating module 373 and the weight of the initial intention label as the updated weight of the initial intention label.
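The two weight-update variants above can be sketched as follows. The sketch assumes that word vectors are available for computing cosine similarity, and it treats each expansion word as one "document" when computing the DF value; both are illustrative assumptions rather than the patent's prescribed implementation.

    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def update_by_query_similarity(tag, tag_weight, query_terms, vec):
        """Variant 1: multiply the tag weight by the sum of cosine similarities between
        the tag and each term of the corresponding historical query.
        `vec` maps a word to its vector (assumed word embeddings)."""
        if tag not in vec:
            return tag_weight
        sim_sum = sum(cosine(vec[tag], vec[t]) for t in query_terms if t in vec)
        return tag_weight * sim_sum

    def update_by_expansion_df(tag, tag_weight, expansion_words):
        """Variant 2: multiply the tag weight by the tag's document frequency within the
        expansion-word set, where each (tokenized) expansion word counts as one document."""
        docs = [set(tokens) for tokens in expansion_words]
        df = sum(1 for d in docs if tag in d) / max(len(docs), 1)
        return tag_weight * df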
Optionally, the training unit 35 is configured to initialize the same term into a topic in the process of performing LDA model training based on the model training corpus.
The device for identifying an application search intention provided by the embodiment of the invention can, after acquiring the application search query string input by the user, acquire the application search intention dictionary trained from the historical application search query strings and the historical application download records, and then determine the application search intention of the input application search query string through the historical application search query strings in the dictionary and the intention labels corresponding to them. Since the application search intention dictionary of the present invention summarizes user application search intentions from the historical search records and the historical download records, and the summarized intentions cover both a category aspect and a function aspect, predicting the application search intention from this dictionary improves accuracy compared with simply predicting among three broad categories.
It will be appreciated that the relevant features of the method and apparatus described above may be referred to in correspondence with one another. In addition, "first", "second", and the like in the above embodiments are used to distinguish the embodiments and do not represent the merits of the embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the method and apparatus for identification of application search intent according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims (16)

1. A method for identifying an application search intention, the method comprising:
the method comprises the steps of obtaining an input application search query string and an application search intention dictionary, wherein the application search intention dictionary is obtained by performing machine self-learning according to a historical application search query string and a historical application download record corresponding to the historical application search query string, and comprises the historical application search query string and intention labels corresponding to the historical application search query string, wherein each intention label has a weight;
when the input application search query string is contained in the application search intention dictionary, determining an intention label corresponding to the input application search query string in the application search intention dictionary as an application search intention corresponding to the input application search query string;
when the input application search query string does not exist in the application search intention dictionary, calculating the semantic similarity between the input application search query string and each historical application search query string in the application search intention dictionary, screening a preset number of intention labels from intention labels corresponding to the first n historical application search query strings with the maximum semantic similarity according to a preset screening algorithm, and determining the screened intention labels as the application search intention corresponding to the input application search query string, wherein n is a positive integer;
prior to obtaining the application search intent dictionary, the method further comprises:
acquiring an original corpus required for constructing the application search intention dictionary, wherein the original corpus comprises a historical application search query string and expansion words corresponding to the historical application search query string, and the expansion words are obtained based on other historical application search query strings and historical application download records;
preprocessing the original corpus to obtain a model corpus required by document theme generation model LDA training, wherein the model corpus comprises a historical application search query string, a target noun and a target verb, the target noun is a categorical intention label required by construction of the application search intention dictionary, and the target verb is a functional intention label required by construction of the application search intention dictionary;
performing LDA model training based on the model training corpus to obtain document theme probability distribution and theme term probability distribution, wherein the document is each historical application search query string;
based on the document theme probability distribution, the theme term probability distribution and a preset probability algorithm, obtaining an initial intention label corresponding to each historical application search query string and the weight of the initial intention label, wherein the initial intention labels are the terms whose weights rank in the top p among all the target nouns and target verbs, and p is a positive integer;
updating the weight of the initial intention label based on the semantic relationship between the initial intention label and the historical application search query string corresponding to the initial intention label or based on the semantic relationship between the initial intention label and the expansion word corresponding to the historical application search query string;
and screening a preset number of intention labels from the initial intention labels according to a broken line function of the search times of the historical application search query string, so as to construct the application search intention dictionary based on the corresponding relation between the screened intention labels and the historical application search query string.
2. The method of claim 1, wherein the calculating semantic similarities of the input application search query string and each historical application search query string in the application search intent dictionary, and the screening a preset number of intent labels from the intent labels corresponding to the top n historical application search query strings with the largest semantic similarity according to a preset screening algorithm comprises:
respectively calculating Euclidean distances between the input application search query string and the historical application search query string;
screening the first n Euclidean distances with the minimum distance from the calculated Euclidean distances;
determining the historical application search query strings corresponding to the first n Euclidean distances as the first n historical application search query strings with the maximum similarity to the input application search query string;
performing Gaussian kernel smoothing operation on Euclidean distances corresponding to the previous n historical application search query strings, and taking an operation result as the weight of the corresponding historical application search query string;
performing weight merging processing on the same intention label based on the weights of the first n historical application search query strings and the weight of each intention label in the first n historical application search query strings to obtain a merged intention label and the weight of the merged intention label;
and screening the first m intention labels with the largest weight from the merged intention labels, wherein m is the preset number.
3. The method of claim 1, wherein the obtaining original corpus required for constructing the application search intention dictionary comprises:
obtaining a historical application search query string from a query session log and an application downloaded based on the historical application search query string;
respectively calculating Euclidean distances between each historical application search query string and the downloaded application and other historical application search query strings;
aiming at a current historical application search query string, determining a downloaded application or other historical application search query strings corresponding to the smallest top q Euclidean distances as an expansion word corresponding to the current historical application search query string, wherein q is a positive integer;
and taking the historical application search query string and the expansion words corresponding to the historical application search query string as the original training corpus.
4. The method according to claim 1, wherein the preprocessing the original corpus to obtain a model corpus required for document topic generation model LDA training comprises:
segmenting the historical application search query string in the original training corpus and the expanded words corresponding to the historical application search query string to obtain a first segmentation file only containing nouns and verbs and a second segmentation file containing all the segmentations;
calculating the compactness between two adjacent participles in the second participle file;
forming a phrase by the two word segmentations with the compactness larger than a preset threshold value, and adding the phrase into the first word segmentation file to obtain an updated first word segmentation file, wherein the phrase is a verb;
calculating the TF-IDF weight of each participle in the updated first participle file based on a TF-IDF algorithm;
determining nouns with the TF-IDF weights within a preset range as the target nouns, and determining verbs with the TF-IDF weights within a preset range as the target verbs.
5. The method of claim 1, wherein obtaining an initial intent tag corresponding to each historical application search query string and a weight of the initial intent tag based on the document topic probability distribution, the topic term probability distribution, and a preset probability algorithm comprises:
selecting, based on the document topic probability distribution and the topic term probability distribution, the first x topics with the maximum probability under each historical application search query string and the first y terms with the maximum probability under the first x topics, wherein x and y are positive integers;
merging the same terms based on the probability of the first x topics and the probability of the first y terms under the first x topics to obtain merged terms and the weight of the merged terms, wherein the number of the merged terms is p;
and determining the merged term as the initial intention label, and determining the weight of the merged term as the weight of the initial intention label.
6. The method of claim 1, wherein updating the weight of the initial intent tag based on a semantic relationship between the initial intent tag and a historical application search query string to which the initial intent tag corresponds comprises:
calculating the sum of the cosine similarity of each term in the initial intention label and the corresponding historical application search query string;
and taking the product of the sum of the cosine similarity and the weight of the initial intention label as the updated weight of the initial intention label.
7. The method of claim 1, wherein updating the weight of the initial intent tag based on a semantic relationship between the initial intent tag and an expanded word corresponding to a historical application search query string comprises:
calculating a document frequency DF value of an initial intention label corresponding to a historical application search query string based on an expansion word set formed by expansion words of the historical application search query string;
and taking the product of the DF value and the weight of the initial intention label as the updated weight of the initial intention label.
8. The method according to any one of claims 1 to 7, further comprising:
and initializing the same terms into a theme in the process of carrying out LDA model training based on the model training corpus.
9. An apparatus for identifying an application search intention, the apparatus comprising:
an obtaining unit, configured to obtain an input application search query string and an application search intention dictionary, wherein the application search intention dictionary is obtained through machine self-learning according to historical application search query strings and historical application download records corresponding to the historical application search query strings, and comprises the historical application search query strings and intention labels corresponding to the historical application search query strings, wherein each intention label has a weight;
a first determining unit, configured to determine, when the input application search query string is included in the application search intention dictionary acquired by the acquiring unit, an intention label corresponding to the input application search query string in the application search intention dictionary as an application search intention corresponding to the input application search query string;
a second determining unit, configured to, when the application search intention dictionary acquired by the acquiring unit does not have the input application search query string, calculate semantic similarities between the input application search query string and each historical application search query string in the application search intention dictionary, screen a preset number of intention labels from intention labels corresponding to n previous historical application search query strings with the largest semantic similarity according to a preset screening algorithm, and determine the screened intention labels as application search intentions corresponding to the input application search query string, where n is a positive integer;
the obtaining unit is further configured to obtain an original corpus required for constructing the application search intention dictionary before obtaining the application search intention dictionary, where the original corpus includes a historical application search query string and an expansion word corresponding to the historical application search query string, and the expansion word is obtained based on other historical application search query strings and historical application download records;
the device further comprises:
the preprocessing unit is used for preprocessing the original corpus to obtain a model corpus required by LDA training of a document theme generation model, wherein the model corpus comprises a historical application search query string, a target noun and a target verb, the target noun is a categorical intention label required by construction of the application search intention dictionary, and the target verb is a functional intention label required by construction of the application search intention dictionary;
the training unit is used for carrying out LDA model training based on the model training corpus to obtain document theme probability distribution and theme term probability distribution, wherein the document is each historical application search query string;
the operation unit is used for acquiring an initial intention label corresponding to each historical application search query string and the weight of the initial intention label based on the document theme probability distribution, the theme term probability distribution and a preset probability algorithm, wherein the initial intention labels are the terms whose weights rank in the top p among all the target nouns and target verbs, and p is a positive integer;
an updating unit, configured to update a weight of an initial intention tag based on a semantic relationship between the initial intention tag and a history application search query string corresponding to the initial intention tag, or based on a semantic relationship between the initial intention tag and an expanded word corresponding to the history application search query string;
the screening unit is used for screening a preset number of intention labels from the initial intention labels according to a broken line function of the search times of the historical application search query string;
and the construction unit is used for constructing the application search intention dictionary based on the corresponding relation between the screened intention labels and the historical application search query strings.
10. The apparatus according to claim 9, wherein the second determining unit comprises:
the first calculation module is used for respectively calculating Euclidean distances between the input application search query string and the historical application search query string;
the screening module is used for screening the first n Euclidean distances with the minimum distance from the Euclidean distances calculated by the first calculation module;
a first determining module, configured to determine the historical application search query strings corresponding to the first n euclidean distances screened by the screening module as the first n historical application search query strings with the largest similarity to the input application search query string;
the operation module is used for performing Gaussian kernel smoothing operation on Euclidean distances corresponding to the previous n historical application search query strings determined by the first determination module and taking an operation result as the weight of the corresponding historical application search query string;
a first merging module, configured to perform weight merging processing on the same intention label based on the weights of the first n historical application search query strings obtained by the operation module and the weight of each intention label in the first n historical application search query strings, and obtain a merged intention label and the weight of the merged intention label;
the screening module is further configured to screen the top m intention labels with the largest weight from the merged intention labels obtained by the first merging module, where m is the preset number.
11. The apparatus of claim 9, wherein the obtaining unit comprises:
an acquisition module for acquiring a historical application search query string from a query session log and an application downloaded based on the historical application search query string;
the second calculation module is used for respectively calculating Euclidean distances between each historical application search query string acquired by the acquisition module and the downloaded application and other historical application search query strings;
a second determining module, configured to determine, for a current historical application search query string, a downloaded application or other historical application search query string corresponding to a minimum previous q-term euclidean distance as an expansion word corresponding to the current historical application search query string, where q is a positive integer;
and the setting module is used for taking the historical application search query string and the expansion word corresponding to the historical application search query string determined by the second determining module as the original corpus.
12. The apparatus of claim 9, wherein the training unit comprises:
the word segmentation module is used for segmenting the historical application search query string in the original training corpus and the expanded words corresponding to the historical application search query string to obtain a first word segmentation file only containing nouns and verbs and a second word segmentation file containing all the words;
the third calculation module is used for calculating the compactness between two adjacent participles in the second participle file obtained by the participle module;
the adding module is used for forming a phrase by the two participles of which the compactness is greater than a preset threshold value and calculated by the third calculating module, adding the phrase into the first participle file to obtain an updated first participle file, wherein the phrase is a verb;
the fourth calculation module is used for calculating the TF-IDF weight of each participle in the updated first participle file obtained by the adding module based on a TF-IDF algorithm;
a third determining module, configured to determine the noun with the TF-IDF weight within a preset range obtained by the fourth calculating module as the target noun, and determine the verb with the TF-IDF weight within the preset range as the target verb.
13. The apparatus according to claim 9, wherein the operation unit comprises:
a selecting module, configured to select, based on the document topic probability distribution and the topic term probability distribution, the first x topics with the highest probability under each historical application search query string and the first y terms with the highest probability under the first x topics, where x and y are both positive integers;
the second merging module is used for merging the same terms based on the probability of the first x topics selected by the selection module and the probability of the first y terms under the first x topics to obtain the merged terms and the weight of the merged terms, wherein the number of the merged terms is p;
a fourth determining module, configured to determine the merged term obtained by the second merging module as the initial intention tag, and determine a weight of the merged term as a weight of the initial intention tag.
14. The apparatus of claim 9, wherein the updating unit comprises:
a fifth calculation module, configured to calculate a sum of cosine similarities of the initial intent tag and terms in the corresponding historical application search query string;
a fifth determining module, configured to take a product of the sum of the cosine similarities obtained by the fifth calculating module and the weight of the initial intention label as the updated weight of the initial intention label.
15. The apparatus of claim 9, wherein the updating unit comprises:
the sixth calculation module is used for calculating the document frequency DF value of the initial intention label corresponding to the historical application search query string based on an expansion word set formed by expansion words of the historical application search query string;
a sixth determining module, configured to take a product of the DF value obtained by the sixth calculating module and the weight of the initial intention tag as the updated weight of the initial intention tag.
16. The apparatus according to any of the claims 9 to 15, wherein the training unit is configured to initialize the same term to a topic during LDA model training based on the model training corpus.
CN201611207524.3A 2016-12-23 2016-12-23 Application search intention identification method and device Expired - Fee Related CN106599278B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611207524.3A CN106599278B (en) 2016-12-23 2016-12-23 Application search intention identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611207524.3A CN106599278B (en) 2016-12-23 2016-12-23 Application search intention identification method and device

Publications (2)

Publication Number Publication Date
CN106599278A CN106599278A (en) 2017-04-26
CN106599278B true CN106599278B (en) 2020-06-12

Family

ID=58603494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611207524.3A Expired - Fee Related CN106599278B (en) 2016-12-23 2016-12-23 Application search intention identification method and device

Country Status (1)

Country Link
CN (1) CN106599278B (en)

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018201280A1 (en) * 2017-05-02 2018-11-08 Alibaba Group Holding Limited Method and apparatus for query auto-completion
CN107133321B (en) * 2017-05-04 2020-06-12 广东神马搜索科技有限公司 Method and device for analyzing search characteristics of page
CN107256267B (en) 2017-06-19 2020-07-24 北京百度网讯科技有限公司 Query method and device
CN107622090B (en) * 2017-08-22 2020-10-16 上海艾融软件股份有限公司 Object acquisition method, device and system
CN110020128B (en) * 2017-10-26 2023-04-28 阿里巴巴集团控股有限公司 Search result ordering method and device
CN107832414B (en) * 2017-11-07 2021-10-22 百度在线网络技术(北京)有限公司 Method and device for pushing information
CN109948036B (en) * 2017-11-15 2022-10-04 腾讯科技(深圳)有限公司 Method and device for calculating weight of participle term
CN108597503B (en) * 2018-05-09 2021-04-30 科大讯飞股份有限公司 Test corpus generation method, device and equipment and readable and writable storage medium
CN110737823B (en) * 2018-07-03 2022-06-24 百度在线网络技术(北京)有限公司 Access intention mining method and device
CN109241261A (en) * 2018-08-30 2019-01-18 武汉斗鱼网络科技有限公司 User's intension recognizing method, device, mobile terminal and storage medium
CN111324805B (en) * 2018-12-13 2024-02-13 北京搜狗科技发展有限公司 Query intention determining method and device, searching method and searching engine
CN109918479B (en) * 2019-02-28 2021-07-20 百度在线网络技术(北京)有限公司 Method and device for processing information
CN110222147A (en) * 2019-05-15 2019-09-10 北京百度网讯科技有限公司 Label extending method, device, computer equipment and storage medium
CN110489742B (en) * 2019-07-15 2021-10-01 北京三快在线科技有限公司 Word segmentation method and device, electronic equipment and storage medium
CN110472027B (en) * 2019-07-18 2024-05-14 平安科技(深圳)有限公司 Intent recognition method, apparatus, and computer-readable storage medium
CN110569433B (en) * 2019-08-20 2024-03-22 腾讯科技(深圳)有限公司 Construction method and device of search result filter, electronic equipment and storage medium
CN110598214A (en) * 2019-09-10 2019-12-20 四川长虹电器股份有限公司 Intention recognition result error correction method
CN111062211A (en) * 2019-12-27 2020-04-24 中国联合网络通信集团有限公司 Information extraction method and device, electronic equipment and storage medium
CN111324626B (en) * 2020-01-21 2022-07-12 思必驰科技股份有限公司 Search method and device based on voice recognition, computer equipment and storage medium
CN111368045B (en) * 2020-02-21 2024-05-07 平安科技(深圳)有限公司 User intention recognition method, device, equipment and computer readable storage medium
CN111737425B (en) * 2020-02-28 2024-03-01 北京汇钧科技有限公司 Response method, device, server and storage medium
CN111488426B (en) * 2020-04-17 2024-02-02 支付宝(杭州)信息技术有限公司 Query intention determining method, device and processing equipment
CN111652001B (en) * 2020-06-04 2023-01-17 联想(北京)有限公司 Data processing method and device
CN111696558A (en) * 2020-06-24 2020-09-22 深圳壹账通智能科技有限公司 Intelligent outbound method, device, computer equipment and storage medium
CN111949864B (en) 2020-08-10 2022-02-25 北京字节跳动网络技术有限公司 Searching method, searching device, electronic equipment and storage medium
CN112115342A (en) * 2020-09-22 2020-12-22 深圳市欢太科技有限公司 Search method, search device, storage medium and terminal
CN112905893B (en) * 2021-03-22 2024-01-12 北京百度网讯科技有限公司 Training method of search intention recognition model, search intention recognition method and device
CN113590919A (en) * 2021-07-29 2021-11-02 小船出海教育科技(北京)有限公司 Search request processing method and device, electronic equipment and computer readable medium
CN114138755B (en) * 2022-02-07 2022-04-08 中国建筑第五工程局有限公司 Material archive retrieval method aiming at supplier cooperation based on artificial intelligence

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810030A (en) * 2014-02-20 2014-05-21 北京奇虎科技有限公司 Application recommendation method, device and system based on mobile terminal application market
CN104239421A (en) * 2014-08-21 2014-12-24 北京奇虎科技有限公司 Method and system for pushing application to terminal
CN105095187A (en) * 2015-08-07 2015-11-25 广州神马移动信息科技有限公司 Search intention identification method and device

Also Published As

Publication number Publication date
CN106599278A (en) 2017-04-26

Similar Documents

Publication Publication Date Title
CN106599278B (en) Application search intention identification method and device
CN106649818B (en) Application search intention identification method and device, application search method and server
CN109145153B (en) Intention category identification method and device
CN109815308B (en) Method and device for determining intention recognition model and method and device for searching intention recognition
CN111046221B (en) Song recommendation method, device, terminal equipment and storage medium
US9104979B2 (en) Entity recognition using probabilities for out-of-collection data
US8538898B2 (en) Interactive framework for name disambiguation
US20040049499A1 (en) Document retrieval system and question answering system
US20120158703A1 (en) Search lexicon expansion
WO2016201511A1 (en) Methods and systems for object recognition
CN110147421B (en) Target entity linking method, device, equipment and storage medium
CN108846097B (en) User interest tag representation method, article recommendation device and equipment
CN101305368A (en) Semantic visual search engine
US9569525B2 (en) Techniques for entity-level technology recommendation
Satpal et al. Web information extraction using markov logic networks
KR101355945B1 (en) On line context aware advertising apparatus and method
CN110765761A (en) Contract sensitive word checking method and device based on artificial intelligence and storage medium
WO2021112984A1 (en) Feature and context based search result generation
Nesi et al. Ge (o) Lo (cator): Geographic information extraction from unstructured text data and Web documents
CN112685642A (en) Label recommendation method and device, electronic equipment and storage medium
Lin et al. Automatic tagging web services using machine learning techniques
Huynh et al. Vietnamese text classification with textrank and jaccard similarity coefficient
Toba et al. Enhanced unsupervised person name disambiguation to support alumni tracer study
Chang et al. Enhancing POI search on maps via online address extraction and associated information segmentation
JP5890413B2 (en) Method and search engine for searching a large number of data records

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200612