CN106682170B - Application search method and device - Google Patents

Application search method and device

Info

Publication number
CN106682170B
CN106682170B (application number CN201611229802.5A)
Authority
CN
China
Prior art keywords
application
keyword
corpus
stage
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611229802.5A
Other languages
Chinese (zh)
Other versions
CN106682170A (en)
Inventor
庞伟 (Pang Wei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201611229802.5A priority Critical patent/CN106682170B/en
Publication of CN106682170A publication Critical patent/CN106682170A/en
Application granted granted Critical
Publication of CN106682170B publication Critical patent/CN106682170B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The invention discloses an application search method and device. The method comprises the following steps: constructing a label system for each application; receiving a search term uploaded by a client; matching the search term against each application's label system; and, when the search term matches a keyword in an application's label system, returning that application's information to the client for display. By constructing a label system for each application and matching uploaded search terms against those systems, the invention returns the matched application's information to the client for display, thereby realizing intelligent application search.

Description

Application search method and device
Technical Field
The invention relates to the field of data mining and search, and in particular to an application search method and device.
Background
An application search engine is a search service for mobile-terminal software that provides application search and download on a mobile phone; examples include 360 Mobile Assistant, Tencent App, and Quixey. Taking 360 Mobile Assistant as an example, the number of applications runs into the millions, so automatically mining and constructing a label system for applications is a key technology for improving the search quality of an application search engine, and also a core technology for realizing function search.
The traditional way to generate application labels is manual annotation, which is labor-intensive, time-consuming, and yields low coverage. Alternatively, application developers submit labels themselves, which is often accompanied by cheating: hoping for more exposure, developers submit large amounts of label information irrelevant to their applications.
Disclosure of Invention
In view of the above, the present invention has been made to provide an application search method that overcomes or at least partially solves the above problems.
According to an aspect of the present invention, there is provided an application search method, the method including:
constructing a label system of each application;
receiving a search word uploaded by a client;
matching in a label system of each application according to the search terms;
and when the search word is matched with the keyword in the label system of one application, returning the relevant information of the application to the client for displaying.
Optionally, the constructing of the tag system of each application includes:
obtaining the abstract of each application;
acquiring search terms related to each application from the application search log;
and mining a label system for each application according to each application's abstract, search terms, and a preset policy.
Optionally, the mining of a label system for each application according to each application's abstract, search terms, and preset policy includes:
obtaining a training corpus set according to the abstract and the search word of each application;
inputting the training corpus set into an LDA model for training to obtain an application-theme probability distribution result and a theme-keyword probability distribution result output by the LDA model;
and calculating to obtain a label system of each application according to the application-theme probability distribution result and the theme-keyword probability distribution result.
Optionally, the obtaining a corpus set according to the abstracts and the search terms of each application includes:
for each application, extracting the first paragraph, or the text of a preset number of sentences, from the application's abstract; using the extracted text together with the application's search terms as the application's original corpus;
the original corpora of all applications form the original corpus set; and preprocessing the original corpus set to obtain the training corpus set.
Optionally, the preprocessing the original corpus set includes:
in the original corpus set, for each original corpus: performing word segmentation on the original corpus to obtain a segmentation result containing a plurality of terms; searching for phrases formed by adjacent terms in the segmentation result; and retaining the phrases, the noun terms, and the verb terms in the segmentation result as the keywords reserved for that original corpus.
Optionally, the searching for phrases composed of adjacent terms in the word segmentation result includes:
calculating the PMI (pointwise mutual information) value of every two adjacent terms in the segmentation result, and determining that two adjacent terms form a phrase when their PMI value is greater than a first preset threshold.
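The adjacency test above can be sketched with pointwise mutual information. A minimal illustration in Python (the threshold value, the list-of-token-lists corpus format, and the function name are assumptions, not the patent's implementation):

```python
from collections import Counter
from math import log

def find_phrases(token_lists, threshold=1.0):
    """Detect phrases via pointwise mutual information between adjacent terms.

    PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ); adjacent pairs whose PMI
    exceeds the threshold are treated as a single phrase.
    """
    unigrams = Counter(t for toks in token_lists for t in toks)
    bigrams = Counter(
        (toks[i], toks[i + 1])
        for toks in token_lists
        for i in range(len(toks) - 1)
    )
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    phrases = set()
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        if log(p_ab / (p_a * p_b)) > threshold:
            phrases.add((a, b))
    return phrases
```

With a real corpus, a pair like "微" + "信" that almost always co-occurs scores a high PMI, while frequent-but-independent neighbors score low.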
Optionally, the preprocessing the original corpus set further includes:
using the keywords reserved for each application's original corpus as the first-stage training corpus of that application;
the first-stage training corpora of each application form a first-stage training corpus set; and carrying out data cleaning on the keywords in the first-stage corpus set.
Optionally, the performing data cleaning on the keywords in the first-stage corpus set includes:
in the first-stage corpus set, for each first-stage training corpus: calculating the TF-IDF value of each keyword in that corpus, and deleting keywords whose TF-IDF value is higher than a second preset threshold and/or lower than a third preset threshold.
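The data-cleaning step above can be sketched as follows; the two threshold values and the list-of-token-lists corpus format are illustrative assumptions:

```python
from collections import Counter
from math import log

def clean_corpus(docs, low=0.01, high=1.0):
    """Drop keywords whose TF-IDF falls outside [low, high] in each document.

    `docs` is a list of keyword lists (one per application). `low` and `high`
    stand in for the patent's third and second preset thresholds.
    """
    n_docs = len(docs)
    df = Counter()  # document frequency of each keyword
    for doc in docs:
        df.update(set(doc))
    cleaned = []
    for doc in docs:
        tf = Counter(doc)
        kept = []
        for w in doc:
            # TF-IDF: within-document frequency scaled by inverse document frequency
            tfidf = (tf[w] / len(doc)) * log(n_docs / df[w])
            if low <= tfidf <= high:
                kept.append(w)
        cleaned.append(kept)
    return cleaned
```

Note that a keyword appearing in every document gets IDF 0 and is dropped by the lower bound, which matches the intent of removing uninformative high-frequency terms.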
Optionally, the preprocessing the original corpus set further includes:
taking the keywords remaining after data cleaning of each application's first-stage training corpus as the application's second-stage training corpus;
for each application's second-stage corpus, when a keyword in it also appears in the application's name, repeating that keyword a fourth-preset-threshold number of times to obtain the application's training corpus;
the training corpora of all applications constitute the training corpus set.
Optionally, the step of obtaining, by calculation, a tag system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result includes:
calculating to obtain an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result;
and according to the application-keyword probability distribution result, for each application, sorting the keywords by their probability for that application in descending order and selecting the top fifth-preset-threshold number of keywords.
Optionally, the calculating an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result includes:
for each application, obtaining the probability of each theme about the application according to the application-theme probability distribution result;
for each topic, obtaining the probability of each keyword about the topic according to the topic-keyword probability distribution result;
for each keyword, taking the product of the keyword's probability for a topic and that topic's probability for an application as the keyword's topic-based probability for the application; and taking the sum of these topic-based probabilities over all topics as the keyword's probability for the application.
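The marginalization described above, P(keyword | app) as a sum over topics of P(topic | app) * P(keyword | topic), can be sketched as (dictionary layout and names are illustrative assumptions):

```python
def app_keyword_probs(app_topic, topic_keyword):
    """Combine the two LDA output distributions into P(keyword | app).

    app_topic:     {app: {topic: P(topic | app)}}
    topic_keyword: {topic: {keyword: P(keyword | topic)}}
    """
    result = {}
    for app, topics in app_topic.items():
        kw_probs = {}
        for topic, p_topic in topics.items():
            for kw, p_kw in topic_keyword.get(topic, {}).items():
                # accumulate the topic-based probability over all topics
                kw_probs[kw] = kw_probs.get(kw, 0.0) + p_topic * p_kw
        result[app] = kw_probs
    return result
```

Sorting each application's `kw_probs` in descending order and truncating gives the candidate label list of the following step.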
Optionally, the step of obtaining, by calculation, a tag system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result further includes:
taking the top fifth-preset-threshold number of keywords selected for each application as the first-stage label system of that application;
for the first-stage label system of each application, calculating a semantic relation value between each keyword in that system and the application's abstract; for each keyword, taking the product of the keyword's semantic relation value and the keyword's probability for the application as the keyword's corrected probability for the application; and sorting all keywords in the application's first-stage label system by corrected probability in descending order, selecting the first K keywords to form the application's label system.
Optionally, calculating a semantic relationship value between each keyword in the first stage tagging system of the application and the abstract of the application comprises:
calculating the word vector of the keyword, and the word vector of each term in the first preset number of sentences of the application's abstract;
calculating the cosine similarity between the keyword's word vector and each term's word vector, and taking the product of each cosine similarity and the weight of the sentence containing the corresponding term as the semantic relation value between the keyword and that term;
and taking the sum of the semantic relation values between the keyword and each term as the semantic relation value between the keyword and the application's abstract.
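The weighted cosine accumulation above can be sketched as follows; the sentence-weight representation and function names are assumptions for illustration:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_relation(kw_vec, sentences):
    """Semantic relation value between one keyword and an application abstract.

    `sentences` is a list of (sentence_weight, [term_vectors]); each term
    contributes its cosine similarity to the keyword, scaled by the weight of
    the sentence it appears in, and all contributions are summed.
    """
    total = 0.0
    for weight, term_vecs in sentences:
        for term_vec in term_vecs:
            total += weight * cosine(kw_vec, term_vec)
    return total
```

In practice the word vectors would come from a model such as word2vec trained on the corpus; the toy vectors in the test below exist only to check the arithmetic.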
Optionally, the step of obtaining, by calculation, a tag system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result further includes:
taking the keywords correspondingly selected by each application as a second-stage label system of the application;
for the second-stage label system of each application, acquiring from an application search log the set of search terms associated with download operations of the application, and counting the DF (document frequency) value of each keyword of the application's second-stage label system within that search-term set; for each keyword, scaling the keyword's probability for the application up by a multiple of its DF value to obtain the keyword's secondary corrected probability for the application; and sorting all keywords in the application's second-stage label system by secondary corrected probability in descending order, selecting the first K keywords to form the application's label system.
Optionally, the selecting the first K keywords to form the tag system of the application includes:
acquiring the seasonal downloading times of the application from the application searching log;
selecting the first K keywords according to the application's quarterly download count to form the application's label system, where the value of K is a piecewise-linear (polyline) function of the application's quarterly download count.
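A piecewise-linear ("polyline") choice of K might look like the sketch below; the breakpoint values are hypothetical, since the patent does not disclose concrete numbers:

```python
def tag_count_k(quarterly_downloads,
                breakpoints=((0, 5), (10_000, 10), (1_000_000, 20))):
    """K as a piecewise-linear function of an app's quarterly download count.

    `breakpoints` is a sorted sequence of (downloads, K) vertices; between
    vertices K is linearly interpolated, and outside the range it is clamped.
    """
    xs = [x for x, _ in breakpoints]
    ys = [y for _, y in breakpoints]
    if quarterly_downloads <= xs[0]:
        return ys[0]
    if quarterly_downloads >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if quarterly_downloads <= xs[i]:
            x0, x1 = xs[i - 1], xs[i]
            y0, y1 = ys[i - 1], ys[i]
            frac = (quarterly_downloads - x0) / (x1 - x0)
            return round(y0 + frac * (y1 - y0))
```

The design intent matches the text: popular applications get more labels, while rarely downloaded ones keep a small, precise label set.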
According to another aspect of the present invention, there is provided an application search apparatus including:
the label system building unit is suitable for building a label system of each application;
the interaction unit is suitable for receiving the search terms uploaded by the client;
the search processing unit is suitable for matching in a label system of each application according to the search terms;
and the interaction unit is also suitable for returning the relevant information of the application to the client side for displaying when the search word is matched with the keyword in the label system of the application.
Optionally, the label system building unit includes:
the information acquisition unit is suitable for acquiring the abstract of each application; acquiring search terms related to each application from the application search log;
and the application label mining unit is suitable for mining a label system of each application according to the abstract, the search word and a preset strategy of each application.
Optionally, the application tag mining unit is adapted to obtain a corpus set according to the abstract and the search term of each application; inputting the training corpus set into an LDA model for training to obtain an application-theme probability distribution result and a theme-keyword probability distribution result output by the LDA model; and calculating to obtain a label system of each application according to the application-theme probability distribution result and the theme-keyword probability distribution result.
Optionally, the application tag mining unit is adapted to, for each application, extract the first paragraph, or the text of a preset number of sentences, from the application's abstract; use the extracted text together with the application's search terms as the application's original corpus; form the original corpus set from the original corpora of all applications; and preprocess the original corpus set to obtain the training corpus set.
Optionally, the application tag mining unit is adapted to perform word segmentation processing on each original corpus in the original corpus set to obtain a word segmentation result including a plurality of terms; searching phrases formed by adjacent terms in the word segmentation result; and reserving the phrases, the lexical items belonging to nouns and the lexical items belonging to verbs in the word segmentation result as the corresponding reserved keywords of the original corpus.
Optionally, the application tag mining unit is adapted to calculate the PMI value of every two adjacent terms in the segmentation result, and determine that two adjacent terms form a phrase when their PMI value is greater than the first preset threshold.
Optionally, the application tag mining unit is further adapted to use the keywords reserved for each application's original corpus as the first-stage training corpus of that application; the first-stage training corpora of all applications form the first-stage training corpus set; and data cleaning is performed on the keywords in the first-stage corpus set.
Optionally, the application label mining unit is adapted to calculate, for each first-stage training corpus, the TF-IDF value of each keyword in that corpus, and delete keywords whose TF-IDF value is higher than the second preset threshold and/or lower than the third preset threshold.
Optionally, the application tag mining unit is further adapted to use the keywords remaining after data cleaning of each application's first-stage corpus as the application's second-stage corpus; for each application's second-stage corpus, when a keyword in it also appears in the application's name, repeat that keyword a fourth-preset-threshold number of times to obtain the application's training corpus; the training corpora of all applications constitute the training corpus set.
Optionally, the application tag mining unit is adapted to calculate an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result; and according to the application-keyword probability distribution result, for each application, sorting the keywords according to the probability of the application from large to small, and selecting the keywords with the number of the top fifth preset threshold value.
Optionally, the application tag mining unit is adapted to, for each application, obtain, according to the application-topic probability distribution result, a probability of each topic about the application; for each topic, obtaining the probability of each keyword about the topic according to the topic-keyword probability distribution result; for each keyword, taking the product of the probability of the keyword about a topic and the probability of the topic about an application as the probability of the keyword based on the topic about the application; and taking the probability of the keyword about the application based on the sum of the probabilities of the topics about the application.
Optionally, the application tag mining unit is further adapted to use the first-fifth preset threshold number of keywords selected corresponding to each application as a first-stage tag system of the application; for the first-stage label system of each application, calculating a semantic relation value between each keyword in the first-stage label system of the application and the abstract of the application; for each keyword, taking the product of the semantic relation value corresponding to the keyword and the probability of the keyword relative to the application as the correction probability of the keyword relative to the application; and sorting all the keywords in the first-stage label system of the application from large to small according to the correction probability of the application, and selecting the first K keywords to form the label system of the application.
Optionally, the application tag mining unit is adapted to calculate a word vector of the keyword, and calculate a word vector of each term in a preset number of sentences before the abstract of the application; calculating cosine similarity between the word vector of the keyword and the word vector of each term, and taking the product of each cosine similarity and the weight of the sentence in which the corresponding term is located as a semantic relation value of the keyword and the corresponding term; and taking the sum of the semantic relation values of the keyword and each term as the semantic relation value between the keyword and the abstract of the application.
Optionally, the application tag mining unit is further adapted to use the selected keyword corresponding to each application as a second-stage tag system of the application; for the second-stage label system of each application, acquiring a search word set related to the downloading operation of the application from an application search log, and counting the DF value of each keyword in the second-stage label system of the application in the search word set; for each keyword, increasing the multiple of the DF value on the basis of the probability of the keyword about the application to obtain the secondary correction probability of the keyword about the application; and sorting all the keywords in the second-stage label system of the application from large to small according to the secondary correction probability of the application, and selecting the first K keywords to form the label system of the application.
Optionally, the application tag mining unit is adapted to obtain the number of times of downloading of the application in a quarter from an application search log; selecting the first K keywords according to the quarterly downloading times of the application to form a label system of the application; where the value of K is a polyline function of the number of quarterly downloads for the application.
According to the technical scheme provided by the invention, a label system is constructed for each application; when a search term uploaded by a client is received, it is matched against each application's label system, and when the search term matches a keyword in an application's label system, that application's information is returned to the client for display, thereby realizing intelligent application search.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a flow diagram of a method of application searching in accordance with one embodiment of the present invention;
FIG. 2 illustrates a schematic interface diagram for searching based on an application search method according to one embodiment of the present invention;
FIG. 3 illustrates a flow diagram for building a label hierarchy for applications, according to one embodiment of the invention;
FIG. 4 shows a schematic diagram of an application search apparatus according to an embodiment of the invention;
FIG. 5 shows a schematic diagram of a label architecture building unit, according to one embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 1 shows a flowchart of an application search method according to an embodiment of the present invention, and as shown in fig. 1, the application search method 100 includes:
and S110, constructing a label system of each application.
And S120, receiving the search terms uploaded by the client.
And S130, matching in the label system of each application according to the search terms.
S140, when the search word is matched with the keyword in the label system of one application, the relevant information of the application is returned to the client side for displaying.
As can be seen from the method shown in fig. 1, the scheme constructs a label system for each application; when a search term uploaded by a client is received, it is matched against each application's label system, and when the search term matches a keyword in an application's label system, that application's information is returned to the client for display, thereby realizing intelligent application search.
For example, when a user searches for "Didi", the application engine returns not only the exact-match application "Didi Chuxing" but also applications with similar functions, such as "Kuaidi Taxi" and "Uber China".
To make the application search method clearer, its implementation is illustrated with a specific example; fig. 2 shows an interface schematic diagram of searching based on the application search method according to an embodiment of the present invention. In this example, a user searches the keyword "order" on 360 Mobile Assistant, and the results displayed by the 360 Mobile Assistant search engine are shown in fig. 2: the engine returns all applications with a food-ordering function, such as "Meituan Takeout", "Ele.me", "Baidu Nuomi", and "Dianping". The application label system constructed by the invention thus plays a major role in retrieval and ranking, greatly improving search quality and the user's search experience.
In the application search method shown in fig. 1, the process of constructing the label system of each application in step S110 determines how good the application search results are, so step S110 is described in detail. FIG. 3 illustrates a flow diagram of a method of building a label system for applications in accordance with one embodiment of the present invention; referring to fig. 3, the method includes:
in step S310, the abstract of each application is obtained.
In step S320, search terms for each application are acquired from the application search log.
Step S330, mining a label system for each application according to each application's abstract, search terms, and the preset policy.
With the method shown in fig. 3, application labels are dynamically updated by automatically obtaining each application's abstract and obtaining each application's search terms in real time from users' historical application search logs. Meanwhile, the preset policy continuously improves the precision and recall of the application labels, and the application's label system is then mined and created. This solves the problems of traditional manually annotated label systems, such as heavy manual workload, low coverage, and serious cheating, greatly improving the search quality of the application search engine and the user's search experience.
In an embodiment of the present invention, the step S330 of mining the label system of each application according to the abstract, the search term and the preset policy of each application includes:
step S331, obtaining a training corpus set according to the abstracts and the search terms of each application.
Step S332, inputting the training corpus set into the LDA model for training, and obtaining an application-theme probability distribution result and a theme-keyword probability distribution result output by the LDA model.
And S333, calculating to obtain a label system of each application according to the application-theme probability distribution result and the theme-keyword probability distribution result.
It should be noted that LDA (Latent Dirichlet Allocation) is a document topic generation model: an unsupervised machine-learning technique that can identify topic information hidden in a large-scale document collection or corpus. It adopts the bag-of-words approach, treating each document as a word-frequency vector and thereby converting text into numerical information that is easy to model. The LDA model performs well on long texts but poorly on short texts, and an application abstract, being short, is a typical short text. To make the LDA model perform at its best, the interaction history between applications and users (i.e., the search terms) is introduced to expand the application abstract, turning the short text into a long text suitable for the LDA model. The search terms include not only the terms through which an application can be retrieved by the engine, but also other terms, and precisely these overcome problems such as the low frequency of synonyms and variant word forms caused by the shortness of the abstract.
In this embodiment, the LDA implementation is GibbsLDA++. For the mobile-application scenario, the GibbsLDA++ source code needs to be modified so that all occurrences of the same term within one application are initialized to the same topic. In the original code, each token is randomly initialized to a topic, so repeated occurrences of the same term can be initialized to several different topics.
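The described modification, initializing repeated occurrences of a term in one document to the same topic, can be sketched as follows (a simplified Python illustration, not the actual GibbsLDA++ C++ patch):

```python
import random

def init_topic_assignments(doc_terms, n_topics, seed=0):
    """Initialize Gibbs-sampling topic assignments for one document so that
    repeated occurrences of the same term share a topic.

    The stock behavior would draw a fresh random topic per token; here the
    first occurrence of a term fixes the topic for all its repetitions.
    """
    rng = random.Random(seed)
    term_topic = {}   # term -> topic chosen at its first occurrence
    assignments = []  # one topic id per token, in document order
    for term in doc_terms:
        if term not in term_topic:
            term_topic[term] = rng.randrange(n_topics)
        assignments.append(term_topic[term])
    return assignments
```

Subsequent Gibbs-sampling iterations may still move individual tokens to other topics; only the initialization is constrained.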
To make the solution of the present invention clearer, the application-topic probability distribution result and the topic-keyword probability distribution result output by the LDA model, mentioned in step S332, are illustrated in detail here. For example, the LDA training selects 120 topics and iterates for 300 rounds, generating two files. The first file is the topic-keyword probability distribution result; as shown in Table 1, it gives the probability of each of 22 keywords under the fourth topic:
TABLE 1
(Table 1 was provided as an image in the original document; it lists the probability of each of the 22 keywords under the fourth topic.)
The second file is the application-topic probability distribution result, as shown in table 2, showing the corresponding probabilities between the application with application ID 5427 and 6 topics (topic IDs 134, 189, 139, 126, 14, 18, respectively).
TABLE 2
(Table 2 was provided as an image in the original document; it lists the probabilities between application ID 5427 and the six topics.)
To make the solution of the present invention clearer, a specific example follows. The abstract of WeChat includes: "WeChat is a free application released by Tencent on January 21, 2011 that provides instant-messaging services for smart terminals. WeChat supports quickly sending free (consuming a small amount of network traffic) voice messages, videos, pictures, and text over the network, across communication carriers and operating-system platforms." The search terms of WeChat include "WeChat, free instant messaging, Tencent, Moments, official-account platform, message push, Shake, People Nearby, adding friends by scanning a QR code, and multi-person conversation".
The corpus comprises the above abstract content of "WeChat" together with all of its search terms. The corpus set is input into the LDA model for training. Suppose the topic generated by the LDA model for the "WeChat" corpus includes "social", and the generated keywords include "chat, voice, telephone, phonebook, social, friend-making, communication, address book, friends". The application-topic probability distribution output by the LDA model then includes P1.1 (WeChat-social), and the topic-keyword probability distribution includes P2.1 (social-chat), P2.2 (social-voice), P2.3 (social-telephone), P2.4 (social-phonebook), P2.5 (social-social), P2.6 (social-friend-making), P2.7 (social-communication), P2.8 (social-address book) and P2.9 (social-friends). The label system of "WeChat", calculated from P1.1 and P2.1 through P2.9, is shown in Table 3.
TABLE 3
In this way, a corpus set is obtained from the abstract and search terms of each application; the corpus set is processed by the LDA model to generate the corresponding application-topic and topic-keyword probability distribution results; and the label system of each application is then calculated from these two distributions, so that the content or functional description of each application is represented comprehensively and accurately.
Specifically, in an embodiment of the present invention, the step S331 of obtaining the corpus set according to the abstract and search terms of each application includes: for each application, extracting the first paragraph, or a preset number of sentences, from the abstract of the application; using the extracted text together with the application's search terms as the original corpus of the application; forming the original corpus set from the original corpora of all applications; and preprocessing the original corpus set to obtain the training corpus set.
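The corpus-construction step above can be sketched as follows; the sentence-splitting heuristic and function name are illustrative assumptions, not part of the patent:

```python
import re

def build_raw_corpus(summary, search_terms, num_sentences=1):
    """Take the first `num_sentences` sentences of an application's
    summary and append its search terms, forming that application's
    raw (original) corpus entry."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", summary.strip())
                 if s.strip()]
    head = " ".join(sentences[:num_sentences])
    return head + " " + " ".join(search_terms)
```

Running this per application and collecting the results yields the original corpus set to be preprocessed.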
For example, for the application of "WeChat", obtaining the summary of "WeChat" includes:
"WeChat is a piece of social software.
WeChat provides functions such as a public platform, friend circle and message push. Users can add friends by shaking the phone, searching a number, browsing people nearby or scanning a two-dimensional code, can follow public platforms, and can share content with friends or post interesting content they have seen to their WeChat friend circle. WeChat supports quickly sending free (consuming only a small amount of network traffic) voice messages, videos, pictures and text across communication operators and operating-system platforms, and also offers service plug-ins such as 'Shake', 'Drift Bottle', 'Friend Circle', 'Public Platform' and 'Voice Notepad', based on shared streaming-media content and location-based social plug-ins.
By the first quarter of 2015, WeChat covered more than 90% of smartphones in China, with 549 million monthly active users covering more than 200 countries and more than 20 languages. In addition, the total number of WeChat brand public accounts exceeded 8 million, more than 85,000 mobile applications had been connected, and WeChat payment users reached about 400 million. "
The first sentence, "WeChat is a piece of social software", is extracted from the abstract of "WeChat", and the search terms of "WeChat", namely "chat, voice, telephone, phonebook, social, friend-making, communication, address book, friends", are obtained at the same time; together these form the original corpus of "WeChat". The original corpora of the other applications are obtained in the same manner, and the original corpora of all applications form the original corpus set; this set is then preprocessed to obtain the training corpus set.
Specifically, the preprocessing the original corpus set includes: in the original corpus set, performing word segmentation processing on each original corpus to obtain a word segmentation result containing a plurality of terms; searching phrases formed by adjacent terms in the word segmentation result; and reserving the phrases, the lexical items belonging to nouns and the lexical items belonging to verbs in the word segmentation result as the corresponding reserved keywords of the original corpus.
For example, in the original corpus set, the original corpus of "WeChat" is "WeChat is a piece of social software, chat, voice, telephone, phonebook, social, friend-making, communication, address book, friends". Word segmentation of this corpus gives a result containing the terms "WeChat / is / a piece of / social / software / chat / voice / telephone / phonebook / social / friend-making / communication / address book / friends". Searching for phrases formed by adjacent terms in the segmentation result finds "social software". Keeping the phrases, the noun terms and the verb terms of the segmentation result as the retained keywords of the original corpus, the keywords of "WeChat" include "WeChat, social software, chat, voice, telephone, phonebook, social, friend-making, communication, address book, friends".
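The keep-phrases-nouns-verbs filtering could look like the following sketch, assuming the segmenter returns (term, POS-tag) pairs with "n" and "v" tags; the tag names and interface are assumptions:

```python
def keep_keywords(tagged_terms, phrases):
    """Filter a segmentation result: keep previously detected phrases plus
    terms tagged as nouns ('n') or verbs ('v'); drop everything else
    (pronouns, numerals, particles, ...)."""
    kept = list(phrases)
    kept += [term for term, pos in tagged_terms if pos in ("n", "v")]
    return kept
```

The output list is the set of keywords retained for one original corpus entry.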
To determine whether two terms form a phrase, the closeness of the preceding and following terms is calculated. In an embodiment of the present invention, searching for phrases formed by adjacent terms in the segmentation result includes: calculating the cPMId value of every two adjacent terms in the segmentation result, and determining that two adjacent terms form a phrase when their cPMId value is greater than a first preset threshold.
For example, set the first preset threshold to 5, and suppose the segmentation result of "Baidu map" is "province, flow, public transportation, transfer". The cPMId values of the pairs "province, flow", "flow, public transportation" and "public transportation, transfer" are calculated. If the cPMId values of "province, flow" and "public transportation, transfer" are greater than 5, those pairs are determined to form the phrases "province flow" and "public transportation transfer"; if the cPMId value of "flow, public transportation" is less than 5, "flow" and "public transportation" do not form a phrase.
In addition, cPMId is calculated according to Equation 1:

cPMId(x, y) = log( D(x, y) × D / ( D(x) × D(y) ) )    (Equation 1)

In Equation 1, D(x, y) represents the co-occurrence frequency of the two terms x and y, D(x) represents the occurrence frequency of term x, D(y) represents the occurrence frequency of term y, and D represents the total number of applications.
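A sketch of Equation 1 and the phrase test, assuming document-level counts and a base-2 logarithm (the log base is not specified in the text):

```python
import math

def cpmid(d_xy, d_x, d_y, d_total):
    """cPMId closeness of two terms, following Equation 1 as reconstructed
    above: log( D(x,y) * D / (D(x) * D(y)) ), base 2 assumed."""
    return math.log2(d_xy * d_total / (d_x * d_y))

def find_phrases(terms, term_df, pair_df, d_total, threshold=5.0):
    """Join adjacent terms into a phrase when their cPMId exceeds the
    first preset threshold. `term_df` / `pair_df` hold occurrence and
    co-occurrence counts over applications."""
    phrases = []
    for x, y in zip(terms, terms[1:]):
        if (x, y) in pair_df and cpmid(pair_df[(x, y)], term_df[x],
                                       term_df[y], d_total) > threshold:
            phrases.append(x + " " + y)
    return phrases
```

Independent terms give a ratio near 1 (cPMId near 0), while strongly associated adjacent terms score well above the threshold.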
Further, in an embodiment of the present invention, the preprocessing of the original corpus set further includes: taking the keywords correspondingly retained from the original corpus of each application as the first-stage training corpus of that application; the first-stage training corpora of all applications form a first-stage training corpus set; and performing data cleaning on the keywords in the first-stage corpus set.
Specifically, across millions of applications, a term that appears with very high frequency is unlikely to be a label, and a term that appears with very low frequency is also unlikely to be a label; the data cleaning process therefore filters out keywords with very high and very low frequencies of occurrence.
For example, the keywords retained from the original corpus of "WeChat" include "WeChat, social, chat, voice, telephone, phonebook, social, friend-making, communication, address book, friends", so these are used as the first-stage training corpus of "WeChat". The first-stage training corpora of all applications form the first-stage training corpus set; data cleaning is performed on the keywords in this set, filtering out terms that appear too frequently in it, thereby improving the quality of the application search engine.
In order to filter out keywords appearing with very high frequency and keywords appearing with very low frequency in the first-stage corpus set, in an embodiment of the present invention, performing data cleaning on the keywords in the first-stage corpus set includes: for each first-stage training corpus in the set, calculating the TF-IDF value of each keyword in that corpus; and deleting the keywords whose TF-IDF values are higher than a second preset threshold and/or lower than a third preset threshold.
In the process, a TF-IDF calculation formula is adopted to calculate the TF-IDF value of each keyword in the first-stage training corpus, so that further cleaning of data is realized.
For example, the first-stage corpus of "WeChat" includes "WeChat, social, chat, voice, telephone, phonebook, social, friend-making, communication, address book, friends". Using the TF-IDF formula, the TF-IDF value of each term and phrase in this corpus is calculated: TF-IDF(WeChat), TF-IDF(social), TF-IDF(chat), TF-IDF(voice), TF-IDF(telephone), TF-IDF(phonebook), TF-IDF(social), TF-IDF(friend-making), TF-IDF(communication), TF-IDF(address book) and TF-IDF(friends). If TF-IDF(communication), TF-IDF(address book) and TF-IDF(friends) are higher than the second preset threshold and/or lower than the third preset threshold, then "communication, address book, friends" are deleted. It should be noted that the second and third preset thresholds depend on the specific corpus, so concrete values are not listed here. TF-IDF is used for data cleaning because it evaluates well how important a word is to one document within a document set or corpus, which exactly matches the requirements of the cleaning step.
The calculation formula of TF-IDF is Equation 2:

TF-IDF(w, app) = ( count(w, app) / count(w, Corpus) ) × log( nCorpus / app_count(w) )    (Equation 2)

In Equation 2, count(w, app) is the frequency of term w in the application app, count(w, Corpus) is the frequency of w in the whole corpus, nCorpus is the total number of applications, and app_count(w) is the number of applications containing term w.
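Equation 2 and the threshold-based cleaning might be sketched as follows; the reconstruction of the formula and the helper names are assumptions:

```python
import math

def tf_idf(count_w_app, count_w_corpus, n_corpus, app_count_w):
    """TF-IDF per Equation 2 as reconstructed above: the term's frequency
    in the app, normalized by its corpus frequency, times the log inverse
    application frequency."""
    return (count_w_app / count_w_corpus) * math.log(n_corpus / app_count_w)

def clean_keywords(tfidf_by_keyword, low, high):
    """Data cleaning: drop keywords whose TF-IDF is above `high`
    (too common) or below `low` (too rare)."""
    return [kw for kw, v in tfidf_by_keyword.items() if low <= v <= high]
```

A very common word (large app_count) gets a small log factor and falls below the lower threshold; an overly dominant rare word exceeds the upper one.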
Further, in an embodiment of the present invention, the preprocessing the original corpus set further includes: taking the remaining keywords of the first-stage training corpus of each application after data cleaning as the second-stage training corpus of the application; for each applied second-stage corpus, when a keyword in the applied second-stage corpus appears in the name of the application, repeating the keyword in the applied second-stage corpus for a fourth preset threshold number of times to obtain the applied corpus; the corpus of each application constitutes a corpus set.
For example, the first-stage corpus of "WeChat" includes "WeChat, social, chat, voice, telephone, phonebook, social, friend-making, communication, address book, friends". Data cleaning removes "communication, address book, friends", so the remaining keywords "WeChat, social, chat, voice, telephone, phonebook, social, friend-making" constitute the second-stage training corpus of "WeChat".
When analyzing the second-stage corpora, it is found that labels expressing the function or category of an application often appear in its name, for example "takeout" in the name of a food-delivery application, "car rental" in the name of a car-rental application, and "map" in "Baidu map". To highlight this class of important labels, terms that appear in the name of an application are repeated three times in that application's corpus, and phrases with a cPMId value higher than 10.0 are likewise repeated three times, increasing the frequency of these potentially important phrase labels. At this point the corpus construction for the LDA topic model is complete, and the corpus set is stored in the file app_corp_seg_nouns_verbs_verbe_filtered_repeat.
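The repetition boost described above could be sketched like this; the repeat count of three and the 10.0 phrase threshold follow the text, while the function interface is an assumption:

```python
def boost_corpus(keywords, app_name, repeat=3, phrase_scores=None,
                 phrase_threshold=10.0):
    """Repeat keywords that occur in the application's name, and phrases
    whose cPMId score exceeds `phrase_threshold`, to raise their weight
    in the LDA training corpus."""
    phrase_scores = phrase_scores or {}
    boosted = []
    for kw in keywords:
        important = kw in app_name or phrase_scores.get(kw, 0.0) > phrase_threshold
        boosted.extend([kw] * (repeat if important else 1))
    return boosted
```

Because LDA weights terms by frequency, tripling a term is a simple way to bias its topic assignment without changing the model.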
In an embodiment of the present invention, the step S333 of calculating the label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result includes:
calculating an application-keyword probability distribution result from the application-topic probability distribution result and the topic-keyword probability distribution result; and, according to the application-keyword probability distribution result, sorting the keywords of each application in descending order of probability with respect to that application and selecting the top keywords up to the fifth preset threshold number.
For example, set the fifth preset threshold to 8. The LDA model outputs the probability distribution of topics under each application and the probability distribution of terms under each topic. To obtain the labels of each application, both distributions are sorted in descending order of probability; the first 50 topics under each application and the first 120 keywords under each topic are selected, and the keyword probabilities are weighted by the topic probabilities. Each keyword of an application thus carries a weight representing its importance to that application. Sorting in descending order of weight and selecting the first 8 keywords yields the label list generated by LDA; this list still contains considerable noise and its ordering is inaccurate, as shown in Table 4.
TABLE 4
Wherein the calculating an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result comprises:
for each application, obtaining the probability of each theme about the application according to the application-theme probability distribution result; for each topic, obtaining the probability of each keyword about the topic according to the topic-keyword probability distribution result; for each keyword, taking the product of the probability of the keyword about a topic and the probability of the topic about an application as the probability of the keyword based on the topic about the application; and taking the probability of the keyword about the application based on the sum of the probabilities of the topics about the application.
For example, suppose the keyword of an application C is a, and the topics corresponding to keyword a include B1, B2 and B3. Let the probability of keyword a with respect to topic B1 be P(a_B1), and the probability of topic B1 with respect to application C be P(B1_C); then P(a_B1) × P(B1_C) is the probability of keyword a with respect to application C based on topic B1. Likewise, P(a_B2) × P(B2_C) is the probability of keyword a with respect to application C based on topic B2, and P(a_B3) × P(B3_C) is the probability based on topic B3. The probability of keyword a with respect to application C is then P(a_B1) × P(B1_C) + P(a_B2) × P(B2_C) + P(a_B3) × P(B3_C).
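The summation over topics in this example can be sketched directly; the dictionary-based representation of the two distributions is an assumption:

```python
def keyword_prob(app, keyword, app_topic, topic_keyword):
    """P(keyword | app) = sum over topics of P(keyword | topic) * P(topic | app).
    `app_topic[app]` maps topic -> probability for that application;
    `topic_keyword[topic]` maps keyword -> probability for that topic."""
    return sum(topic_keyword.get(t, {}).get(keyword, 0.0) * p_t
               for t, p_t in app_topic[app].items())
```

Applying this to every keyword of an application gives the application-keyword distribution used for ranking.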
On this basis, in a further embodiment of the present invention, the calculating a label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result further includes:
taking the keywords with the number of the first fifth preset threshold value correspondingly selected by each application as a first-stage label system of the application; for the first-stage label system of each application, calculating a semantic relation value between each keyword in the first-stage label system of the application and the abstract of the application; for each keyword, taking the product of the semantic relation value corresponding to the keyword and the probability of the keyword relative to the application as the correction probability of the keyword relative to the application; and sorting all the keywords in the first-stage label system of the application from large to small according to the correction probability of the application, and selecting the first K keywords to form the label system of the application.
For example, assume the fifth preset threshold is 3, and the keywords selected for "Baidu map" up to this threshold number include "map, search, navigation"; then "map, search, navigation" is taken as the first-stage label system of "Baidu map".
For this first-stage label system, the semantic relation values between each of the keywords "map, search, navigation" and the abstract of "Baidu map" are calculated as R1, R2 and R3, respectively, and the probabilities of these keywords with respect to "Baidu map" are P1, P2 and P3. Then R1 × P1, R2 × P2 and R3 × P3 are the correction probabilities. If R1 × P1 > R3 × P3 > R2 × P2, the keywords of the first-stage label system of "Baidu map" are ordered "map, navigation, search"; if 2 keywords are selected to form the label system of the application, the label system of "Baidu map" includes "map, navigation".
Specifically, calculating a semantic relationship value between each keyword in the first-stage labeling system of the application and the abstract of the application includes:
calculating word vectors of the keywords, and calculating the word vectors of each lexical item in the sentences of the application abstract with the preset number; calculating cosine similarity between the word vector of the keyword and the word vector of each term, and taking the product of each cosine similarity and the weight of the sentence in which the corresponding term is located as a semantic relation value of the keyword and the corresponding term; and taking the sum of the semantic relation values of the keyword and each term as the semantic relation value between the keyword and the abstract of the application.
For example, the set of search terms obtained from the search logs of an application search engine is used as the input data for training word vectors, and a 300-dimensional word-vector dictionary file tag_query_w2v_300.ditt is obtained. If the keywords of "Baidu map" include "map, search, navigation", the word vector of "map" is calculated as M1, and the word vectors of the terms in the first 3 sentences of the abstract of "Baidu map" are calculated as N1, N2 and N3, respectively. The cosine similarities between the word vector of "map" and these term vectors are cos(M1, N1), cos(M1, N2) and cos(M1, N3), and the weights of the sentences in which the corresponding terms appear are Q1, Q2 and Q3. The semantic relation values between the keyword and the corresponding terms are then Q1 × cos(M1, N1), Q2 × cos(M1, N2) and Q3 × cos(M1, N3), and their sum Q1 × cos(M1, N1) + Q2 × cos(M1, N2) + Q3 × cos(M1, N3) is taken as the semantic relation value between "map" and the abstract of "Baidu map".
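The semantic relation value between a keyword and the abstract might be computed as in the following sketch, using plain cosine similarity over word vectors; the vector source (e.g. a trained word2vec dictionary) is assumed to be given:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_relation(kw_vec, term_vecs_with_weights):
    """Sum over abstract terms of cosine(keyword, term) times the weight
    of the sentence the term appears in."""
    return sum(w * cosine(kw_vec, tv) for tv, w in term_vecs_with_weights)
```

Sentence weights let earlier (typically more descriptive) sentences of the abstract count for more.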
Further, in an embodiment of the present invention, the calculating a label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result further includes:
taking the keywords correspondingly selected for each application as the second-stage label system of the application; for the second-stage label system of each application, acquiring from the application search log the set of search terms associated with download operations of the application, and counting the DF value of each keyword of the second-stage label system within that search-term set; for each keyword, multiplying the probability of the keyword with respect to the application by (1 + DF) to obtain the secondary correction probability of the keyword with respect to the application; and sorting all keywords of the second-stage label system in descending order of secondary correction probability and selecting the first K keywords to form the label system of the application.
For example, the set of historical search terms mined for downloads of "Baidu map" includes "map, search, navigation". The DF value of the keyword "map" in this set is DF1, the DF value of "search" is DF2, and the DF value of "navigation" is DF3; the initial probabilities of "map", "search" and "navigation" with respect to "Baidu map" are P1, P2 and P3. The secondary correction probability of "map" with respect to "Baidu map" is then P1 × (1 + DF1), that of "search" is P2 × (1 + DF2), and that of "navigation" is P3 × (1 + DF3).
If P3 × (1 + DF3) > P1 × (1 + DF1) > P2 × (1 + DF2), the keywords of "Baidu map" are reordered as "navigation, map, search", and if the first two keywords are selected to form the label system, the label system of "Baidu map" includes "navigation, map". The accuracy of the label ordering of "Baidu map" is greatly improved after this adjustment. The results of one correction for "public praise takeout" and "Baidu map" are shown in Table 5.
TABLE 5
The results of the secondary corrections to the "public praise takeaway" and "Baidu map" are shown in Table 6:
TABLE 6
Comparing Table 5 with Table 6 shows that the accuracy of the applications' label ordering is greatly improved after the secondary correction.
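The secondary correction by download-related search-term DF values might be sketched as follows; the data structures are assumptions:

```python
def secondary_correction(probs, df_values):
    """Boost each keyword's probability by (1 + DF), where DF is the
    keyword's document frequency among search terms that led to a
    download of the application, then return keywords re-ranked by the
    corrected probability."""
    corrected = {kw: p * (1.0 + df_values.get(kw, 0.0))
                 for kw, p in probs.items()}
    return sorted(corrected, key=corrected.get, reverse=True)
```

Keywords that real users actually typed before downloading the app rise in the ranking, even if LDA scored them lower.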
In a specific example, the selecting the top K keywords to form a tag system of the application includes:
acquiring the seasonal downloading times of the application from the application searching log;
selecting the first K keywords according to the quarterly downloading times of the application to form a label system of the application; where the value of K is a polyline function of the number of quarterly downloads for the application.
In practical application, it is found that the precision@k of an application's label list is related to how popular the application is, and the quarterly download count reflects that popularity. Accordingly, three to fifteen different labels are retained for each application, the number being proportional to the quarterly download count; the resulting accuracy is 92% and the recall rate is 76%. Typical examples are shown in Table 7.
TABLE 7
Fig. 4 shows a schematic diagram of an application search apparatus according to an embodiment of the present invention, and as shown in fig. 4, the application search apparatus 400 includes:
a label system construction unit 410 adapted to construct a label system for each application.
And the interaction unit 420 is adapted to receive the search terms uploaded by the client.
And the search processing unit 430 is adapted to perform matching in the tag system of each application according to the search term.
The interaction unit 420 is further adapted to return the relevant information of an application to the client for presentation when the search term matches a keyword in a tag system of the application.
As can be seen from the apparatus shown in fig. 4, in the present scheme, by constructing the tag systems of the applications, when receiving a search word uploaded by a client, matching is performed in the tag systems of the applications according to the search word, and when the search word matches a keyword in one tag system of the applications, relevant information of the applications is returned to the client for display, so as to implement intelligent search of the applications.
FIG. 5 shows a schematic diagram of a label system construction unit according to one embodiment of the invention. As shown in FIG. 5, the label system construction unit 500 includes:
an information obtaining unit 510 adapted to obtain a summary of each application; and obtaining the search terms of each application from the application search log.
And the application label mining unit 520 is suitable for mining a label system of each application according to the abstract, the search term and the preset strategy of each application.
It should be noted that the label system construction unit 500 has the same function as the label system construction unit 410 in fig. 4.
In an embodiment of the present invention, the application label mining unit 520 is adapted to obtain a corpus set according to the abstract and the search term of each application; inputting the training corpus set into an LDA model for training to obtain an application-theme probability distribution result and a theme-keyword probability distribution result output by the LDA model; and calculating to obtain a label system of each application according to the application-theme probability distribution result and the theme-keyword probability distribution result.
The application tag mining unit 520 is adapted to extract, for each application, a first segment of characters or characters of a preset number of sentences from the abstract of the application; the extracted characters and the applied search terms are jointly used as the original corpus of the application; the original linguistic data of each application form an original linguistic data set; and preprocessing the original corpus set to obtain a training corpus set.
In an embodiment, the process of preprocessing the original corpus by the application tag mining unit 520 may be: in the original corpus set, performing word segmentation processing on each original corpus to obtain a word segmentation result containing a plurality of terms; searching phrases formed by adjacent terms in the word segmentation result; and reserving the phrases, the lexical items belonging to nouns and the lexical items belonging to verbs in the word segmentation result as the corresponding reserved keywords of the original corpus.
Specifically, the application tag mining unit 520 is adapted to calculate the cPMId values of every two adjacent terms in the word segmentation result, and to determine that two adjacent terms form a phrase when their cPMId value is greater than a first preset threshold.
Further, in another embodiment, the application tag mining unit 520 is further adapted to use the keywords correspondingly retained from the original corpus of each application as the first-stage training corpus of the application; the first-stage training corpora of all applications form a first-stage training corpus set; and data cleaning is performed on the keywords in the first-stage corpus set.
Specifically, the application label mining unit 520 is adapted to calculate, for each first-stage corpus in the first-stage corpus set, a TF-IDF value of each keyword in the first-stage corpus; and deleting the keywords with the TF-IDF values higher than the second preset threshold and/or lower than the third preset threshold.
Further, in yet another embodiment, the application label mining unit 520 is further adapted to use the remaining keywords after the first-stage corpus of each application is subjected to data cleaning as the second-stage corpus of the application; for each applied second-stage corpus, when a keyword in the applied second-stage corpus appears in the name of the application, repeating the keyword in the applied second-stage corpus for a fourth preset threshold number of times to obtain the applied corpus; the corpus of each application constitutes a corpus set.
In an embodiment of the present invention, the specific way for the application tag mining unit 520 to calculate the tag system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result is as follows: calculating to obtain an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result; and according to the application-keyword probability distribution result, for each application, sorting the keywords according to the probability of the application from large to small, and selecting the keywords with the number of the top fifth preset threshold value.
The application label mining unit 520 is adapted to obtain, for each application, a probability of each topic about the application according to the application-topic probability distribution result; for each topic, obtaining the probability of each keyword about the topic according to the topic-keyword probability distribution result; for each keyword, taking the product of the probability of the keyword about a topic and the probability of the topic about an application as the probability of the keyword based on the topic about the application; and taking the probability of the keyword about the application based on the sum of the probabilities of the topics about the application.
Further, in an embodiment, the application tag mining unit 520 is further adapted to use the first fifth preset threshold number of keywords selected corresponding to each application as the first-stage tag system of the application; for the first-stage label system of each application, calculating a semantic relation value between each keyword in the first-stage label system of the application and the abstract of the application; for each keyword, taking the product of the semantic relation value corresponding to the keyword and the probability of the keyword relative to the application as the correction probability of the keyword relative to the application; and sorting all the keywords in the first-stage label system of the application from large to small according to the correction probability of the application, and selecting the first K keywords to form the label system of the application.
Specifically, the application tag mining unit 520 is adapted to compute the word vector of each keyword, and compute the word vector of each term in the first preset number of sentences of the application's abstract; compute the cosine similarity between the word vector of the keyword and the word vector of each term, taking the product of each cosine similarity and the weight of the sentence containing the corresponding term as the semantic relation value between the keyword and that term; and take the sum of the semantic relation values between the keyword and each term as the semantic relation value between the keyword and the abstract of the application.
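A sketch of this semantic relation value, assuming the word vectors are already available as a dictionary (the toy 2-d vectors and per-sentence weights below are invented for illustration; the patent does not specify how the word vectors or sentence weights are produced):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two word vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_relation(keyword_vec, sentences, word_vecs, sentence_weights):
    """Sum over all abstract terms of cos(keyword, term) * weight(term's sentence)."""
    total = 0.0
    for sent, w in zip(sentences, sentence_weights):
        for term in sent:
            if term in word_vecs:
                total += cosine(keyword_vec, word_vecs[term]) * w
    return total

# Toy example: two abstract sentences, orthogonal 2-d vectors for clarity.
word_vecs = {"chess": np.array([1.0, 0.0]),
             "game":  np.array([0.0, 1.0])}
sentences = [["chess", "game"], ["game"]]
weights = [1.0, 0.5]   # earlier sentences weighted higher (an assumption)
value = semantic_relation(np.array([1.0, 0.0]), sentences, word_vecs, weights)
```

Here the keyword vector matches "chess" exactly and is orthogonal to "game", so only the first-sentence "chess" term contributes.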
Further, in another embodiment, the application tag mining unit 520 is further adapted to take the keywords selected for each application as the second-stage tag system of the application; for the second-stage tag system of each application, obtain from the application search log the set of search terms associated with download operations of the application, and count the DF value, within that search-term set, of each keyword in the second-stage tag system of the application; for each keyword, increase the keyword's probability with respect to the application by a multiple of its DF value to obtain the secondary corrected probability of the keyword with respect to the application; and sort all keywords in the second-stage tag system of the application in descending order of their secondary corrected probability with respect to the application, selecting the top K keywords to form the tag system of the application.
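One plausible reading of this secondary correction, sketched in Python (the patent leaves the exact scaling open, so the `1 + df` boost factor and whitespace query tokenization below are assumptions):

```python
from collections import Counter

def secondary_correction(keyword_probs, download_queries):
    """Boost each keyword's probability by a multiple of its document
    frequency (DF) among search terms that led to downloads of the app."""
    # DF: in how many download-related queries does the keyword appear?
    df = Counter()
    for query in download_queries:
        for kw in set(query.split()):
            df[kw] += 1
    # Assumed scaling: p' = p * (1 + DF); any monotone DF boost fits the text.
    return {kw: p * (1 + df[kw]) for kw, p in keyword_probs.items()}

probs = {"chess": 0.30, "puzzle": 0.20}
queries = ["chess master", "chess offline", "puzzle"]
corrected = secondary_correction(probs, queries)
top = sorted(corrected, key=corrected.get, reverse=True)
```

With these toy inputs, "chess" appears in two download queries (DF = 2) and overtakes "puzzle" after correction.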
In an embodiment of the present invention, the application tag mining unit 520 is adapted to obtain the quarterly download count of the application from the application search log, and select the top K keywords according to that quarterly download count to form the tag system of the application, where the value of K is a piecewise linear function of the application's quarterly download count.
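The patent does not disclose the breakpoints of this piecewise linear function; a hypothetical version, with all thresholds invented for illustration, might keep K small for rarely downloaded applications and let it grow with popularity:

```python
def keyword_count_k(quarterly_downloads: int) -> int:
    """Hypothetical piecewise linear mapping from quarterly downloads to K."""
    if quarterly_downloads < 1_000:
        return 5                                   # flat floor for niche apps
    if quarterly_downloads < 100_000:
        # linear ramp from 5 up to 20 keywords
        return 5 + round(15 * (quarterly_downloads - 1_000) / 99_000)
    return 20                                      # flat ceiling for popular apps
```

Under these invented thresholds, an application with 500 quarterly downloads keeps the floor of 5 keywords, while one with 200,000 downloads gets the full 20.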
It should be noted that the working process 400 of the application search apparatus in this embodiment corresponds to the implementation steps of the application search method described in the first embodiment and provides the same functions; the common parts are not described again.
In summary, the technology provided by the present invention automatically obtains the abstract of each application and, in real time, obtains the search terms for each application from historical user application search logs, thereby expanding each application's short text and enabling dynamic updating of application tags. Meanwhile, preset strategies are formulated by effectively training an unsupervised LDA model, which continuously improves the accuracy and recall rate of application tags, and a tag system is mined and created for each application. The method also works for newly released applications, and overcomes the problems of traditional application tag systems that rely solely on manual labeling, such as heavy manual workload, low coverage rate, and severe cheating, thereby greatly improving the search quality of the application search engine and the user's search experience.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the application search method and apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.

Claims (24)

1. An application search method, comprising:
constructing a label system of each application;
receiving a search word uploaded by a client;
matching the search word in the label system of each application;
when the search word matches a keyword in the label system of an application, returning relevant information of the application to the client for display;
the label system for constructing each application comprises the following steps:
obtaining the abstract of each application;
acquiring search terms related to each application from the application search log;
mining a label system of each application according to the abstract, the search terms and the preset strategy of each application;
the method for mining the label system of each application according to the abstract, the search term and the preset strategy of each application comprises the following steps:
obtaining a training corpus set according to the abstract and the search word of each application;
inputting the training corpus set into an LDA model for training to obtain an application-topic probability distribution result and a topic-keyword probability distribution result output by the LDA model;
calculating to obtain a label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result;
the step of obtaining the label system of each application by calculation according to the application-topic probability distribution result and the topic-keyword probability distribution result further comprises:
taking the top fifth-preset-threshold number of keywords selected for each application as a first-stage label system of the application;
for the first-stage label system of each application, calculating a semantic relation value between each keyword in the first-stage label system of the application and the abstract of the application; for each keyword, taking the product of the semantic relation value corresponding to the keyword and the probability of the keyword with respect to the application as the corrected probability of the keyword with respect to the application; and sorting all the keywords in the first-stage label system of the application in descending order of their corrected probability with respect to the application, and selecting the top K keywords to form the label system of the application.
2. The method of claim 1, wherein the obtaining a corpus set according to the abstracts and the search terms of each application comprises:
for each application, extracting the first paragraph of text, or the text of the first preset number of sentences, from the abstract of the application; taking the extracted text together with the application's search terms as the original corpus of the application;
the original corpora of all applications form an original corpus set; and preprocessing the original corpus set to obtain the training corpus set.
3. The method of claim 2, wherein the pre-processing the original corpus comprises:
in the original corpus set,
for each original corpus, performing word segmentation processing on the original corpus to obtain a word segmentation result containing a plurality of terms; searching for phrases formed by adjacent terms in the word segmentation result; and retaining the phrases, the terms that are nouns and the terms that are verbs in the word segmentation result as the keywords retained for the original corpus.
4. The method of claim 3, wherein said finding phrases comprised of adjacent terms in said word segmentation result comprises:
calculating the cPMId value of every two adjacent terms in the word segmentation result, and determining that two adjacent terms form a phrase when their cPMId value is larger than a first preset threshold value.
5. The method of claim 4, wherein the preprocessing the original corpus further comprises:
using the keywords retained for the original corpus of each application as the first-stage training corpus of the application;
the first-stage training corpora of all applications form a first-stage training corpus set; and performing data cleaning on the keywords in the first-stage training corpus set.
6. The method as claimed in claim 5, wherein the data cleansing of the keywords in the first-stage corpus comprises:
in the first-stage corpus set,
for each first-stage training corpus, calculating the TF-IDF value of each keyword in the first-stage training corpus; and deleting the keywords whose TF-IDF values are higher than a second preset threshold and/or lower than a third preset threshold.
7. The method of claim 6, wherein the preprocessing the original corpus further comprises:
taking the keywords remaining in the first-stage training corpus of each application after data cleaning as the second-stage training corpus of the application;
for the second-stage training corpus of each application, when a keyword in the second-stage training corpus appears in the name of the application, repeating the keyword in the second-stage training corpus a fourth-preset-threshold number of times to obtain the training corpus of the application;
the training corpora of all applications form the training corpus set.
8. The method of any one of claims 1-7, wherein the calculating a label system for each application based on the application-topic probability distribution result and the topic-keyword probability distribution result comprises:
calculating to obtain an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result;
and according to the application-keyword probability distribution result, for each application, sorting the keywords in descending order of their probability with respect to the application, and selecting the top fifth-preset-threshold number of keywords.
9. The method of any of claims 1-7, wherein said calculating an application-keyword probability distribution result from the application-topic probability distribution result and the topic-keyword probability distribution result comprises:
for each application, obtaining the probability of each topic with respect to the application according to the application-topic probability distribution result;
for each topic, obtaining the probability of each keyword with respect to the topic according to the topic-keyword probability distribution result;
for each keyword, taking the product of the probability of the keyword with respect to a topic and the probability of the topic with respect to an application as the topic-based probability of the keyword with respect to the application; and taking the sum of the topic-based probabilities over all topics as the probability of the keyword with respect to the application.
10. The method of claim 9, wherein calculating the semantic relationship value between each keyword in the first stage tagging scheme for the application and the abstract of the application comprises:
calculating the word vector of the keyword, and calculating the word vector of each term in the first preset number of sentences of the application's abstract;
calculating cosine similarity between the word vector of the keyword and the word vector of each term, and taking the product of each cosine similarity and the weight of the sentence in which the corresponding term is located as a semantic relation value of the keyword and the corresponding term;
and taking the sum of the semantic relation values of the keyword and each term as the semantic relation value between the keyword and the abstract of the application.
11. The method of any one of claims 1-7, wherein the calculating a label system for each application based on the application-topic probability distribution result and the topic-keyword probability distribution result further comprises:
taking the keywords correspondingly selected by each application as a second-stage label system of the application;
for the second-stage label system of each application, acquiring from an application search log the set of search terms related to download operations of the application, and counting the DF value, within the search-term set, of each keyword in the second-stage label system of the application; for each keyword, increasing the probability of the keyword with respect to the application by a multiple of the DF value to obtain the secondary corrected probability of the keyword with respect to the application; and sorting all the keywords in the second-stage label system of the application in descending order of their secondary corrected probability with respect to the application, and selecting the top K keywords to form the label system of the application.
12. The method of claim 11, wherein the selecting the top K keywords to form the tag system of the application comprises:
acquiring the quarterly download count of the application from the application search log;
selecting the top K keywords according to the quarterly download count of the application to form the label system of the application; where the value of K is a piecewise linear function of the application's quarterly download count.
13. An application search apparatus, comprising:
the label system building unit is suitable for building a label system of each application;
the interaction unit is suitable for receiving the search terms uploaded by the client;
the search processing unit is suitable for matching in a label system of each application according to the search terms;
the interaction unit is also suitable for returning the relevant information of the application to the client side for displaying when the search word is matched with the keyword in the label system of the application;
the label system building unit comprises:
the information acquisition unit is suitable for acquiring the abstract of each application; acquiring search terms related to each application from the application search log;
the application label mining unit is suitable for mining a label system of each application according to the abstract, the search terms and the preset strategy of each application;
the application label mining unit is suitable for obtaining a training corpus set according to the abstract and the search terms of each application; inputting the training corpus set into an LDA model for training to obtain an application-topic probability distribution result and a topic-keyword probability distribution result output by the LDA model; and calculating a label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result;
the application label mining unit is also suitable for taking the top fifth-preset-threshold number of keywords selected for each application as a first-stage label system of the application; for the first-stage label system of each application, calculating a semantic relation value between each keyword in the first-stage label system of the application and the abstract of the application; for each keyword, taking the product of the semantic relation value corresponding to the keyword and the probability of the keyword with respect to the application as the corrected probability of the keyword with respect to the application; and sorting all the keywords in the first-stage label system of the application in descending order of their corrected probability with respect to the application, and selecting the top K keywords to form the label system of the application.
14. The apparatus of claim 13, wherein,
the application tag mining unit is suitable for extracting, for each application, the first paragraph of text or the text of the first preset number of sentences from the abstract of the application; taking the extracted text together with the application's search terms as the original corpus of the application; the original corpora of all applications form an original corpus set; and preprocessing the original corpus set to obtain the training corpus set.
15. The apparatus of claim 14, wherein,
the application label mining unit is suitable for performing word segmentation processing on each original corpus in the original corpus set to obtain a word segmentation result containing a plurality of terms; searching for phrases formed by adjacent terms in the word segmentation result; and retaining the phrases, the terms that are nouns and the terms that are verbs in the word segmentation result as the keywords retained for the original corpus.
16. The apparatus of claim 15, wherein,
and the application tag mining unit is suitable for calculating the cPMId value of every two adjacent terms in the word segmentation result, and determining that two adjacent terms form a phrase when their cPMId value is larger than a first preset threshold value.
17. The apparatus of claim 16, wherein,
the application label mining unit is also suitable for using the keywords retained for the original corpus of each application as the first-stage training corpus of the application; the first-stage training corpora of all applications form a first-stage training corpus set; and performing data cleaning on the keywords in the first-stage training corpus set.
18. The apparatus of claim 17, wherein,
the application label mining unit is suitable for calculating, for each first-stage training corpus in the first-stage training corpus set, the TF-IDF value of each keyword in the first-stage training corpus; and deleting the keywords whose TF-IDF values are higher than a second preset threshold and/or lower than a third preset threshold.
19. The apparatus of claim 18, wherein,
the application label mining unit is also suitable for taking the keywords remaining in the first-stage training corpus of each application after data cleaning as the second-stage training corpus of the application; for the second-stage training corpus of each application, when a keyword in the second-stage training corpus appears in the name of the application, repeating the keyword in the second-stage training corpus a fourth-preset-threshold number of times to obtain the training corpus of the application; the training corpora of all applications form the training corpus set.
20. The apparatus of any one of claims 13-19,
the application label mining unit is suitable for calculating an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result; and according to the application-keyword probability distribution result, for each application, sorting the keywords in descending order of their probability with respect to the application, and selecting the top fifth-preset-threshold number of keywords.
21. The apparatus of any one of claims 13-19,
the application label mining unit is suitable for obtaining, for each application, the probability of each topic with respect to the application according to the application-topic probability distribution result; for each topic, obtaining the probability of each keyword with respect to the topic according to the topic-keyword probability distribution result; for each keyword, taking the product of the probability of the keyword with respect to a topic and the probability of the topic with respect to an application as the topic-based probability of the keyword with respect to the application; and taking the sum of the topic-based probabilities over all topics as the probability of the keyword with respect to the application.
22. The apparatus of claim 21, wherein,
the application label mining unit is suitable for calculating the word vector of each keyword, and calculating the word vector of each term in the first preset number of sentences of the application's abstract; calculating the cosine similarity between the word vector of the keyword and the word vector of each term, and taking the product of each cosine similarity and the weight of the sentence containing the corresponding term as the semantic relation value between the keyword and that term; and taking the sum of the semantic relation values between the keyword and each term as the semantic relation value between the keyword and the abstract of the application.
23. The apparatus of any one of claims 13-19,
the application label mining unit is also suitable for taking the keywords selected for each application as a second-stage label system of the application; for the second-stage label system of each application, acquiring from an application search log the set of search terms related to download operations of the application, and counting the DF value, within the search-term set, of each keyword in the second-stage label system of the application; for each keyword, increasing the probability of the keyword with respect to the application by a multiple of the DF value to obtain the secondary corrected probability of the keyword with respect to the application; and sorting all the keywords in the second-stage label system of the application in descending order of their secondary corrected probability with respect to the application, and selecting the top K keywords to form the label system of the application.
24. The apparatus of claim 23, wherein,
the application label mining unit is suitable for acquiring the quarterly download count of the application from the application search log, and selecting the top K keywords according to the quarterly download count of the application to form the label system of the application; where the value of K is a piecewise linear function of the application's quarterly download count.
CN201611229802.5A 2016-12-27 2016-12-27 Application search method and device Active CN106682170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611229802.5A CN106682170B (en) 2016-12-27 2016-12-27 Application search method and device


Publications (2)

Publication Number Publication Date
CN106682170A CN106682170A (en) 2017-05-17
CN106682170B true CN106682170B (en) 2020-09-18

Family

ID=58871714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611229802.5A Active CN106682170B (en) 2016-12-27 2016-12-27 Application search method and device

Country Status (1)

Country Link
CN (1) CN106682170B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291962B (en) * 2017-08-10 2020-06-26 Oppo广东移动通信有限公司 Searching method, searching device, storage medium and electronic equipment
CN107613520B * 2017-08-29 2020-08-04 重庆邮电大学 Telecommunication user similarity discovery method based on LDA topic model
CN108038192A (en) * 2017-12-11 2018-05-15 广东欧珀移动通信有限公司 Application searches method and apparatus, electronic equipment, computer-readable recording medium
CN108762804B (en) * 2018-04-24 2021-11-19 创新先进技术有限公司 Method and device for gray-scale releasing new product
CN111221928B (en) * 2018-11-27 2024-02-23 上海擎感智能科技有限公司 Thematic map display method and vehicle-mounted terminal
CN109800348A (en) * 2018-12-12 2019-05-24 平安科技(深圳)有限公司 Search for information display method, device, storage medium and server
CN112052330B (en) * 2019-06-05 2021-11-26 上海游昆信息技术有限公司 Application keyword distribution method and device
CN113609380B (en) * 2021-07-12 2024-03-26 北京达佳互联信息技术有限公司 Label system updating method, searching device and electronic equipment
CN114168837A (en) * 2021-11-18 2022-03-11 深圳市梦网科技发展有限公司 Chatbot searching method, equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN101350977A (en) * 2007-07-20 2009-01-21 宁波萨基姆波导研发有限公司 Rapid searching method for mobile communication terminal
CN102760149A (en) * 2012-04-05 2012-10-31 中国人民解放军国防科学技术大学 Automatic annotating method for subjects of open source software
CN104133877A (en) * 2014-07-25 2014-11-05 百度在线网络技术(北京)有限公司 Software label generation method and device
CN104281656A (en) * 2014-09-18 2015-01-14 广州三星通信技术研究有限公司 Method and device for adding label information into application program
CN105893609A (en) * 2016-04-26 2016-08-24 南通大学 Mobile APP recommendation method based on weighted mixing


Non-Patent Citations (2)

Title
Om P. Damani et al., "Appropriately Incorporating Statistical Significance in PMI", Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013-10-21, pp. 163-169 *
Li Xiangdong et al., "A Text Feature Selection Method Based on Weighted LDA Model and Multiple Granularity" (一种基于加权LDA模型和多粒度的文本特征选择方法), New Technology of Library and Information Service (《现代图书情报技术》), 2015-05-25, No. 5, pp. 42-49 *

Also Published As

Publication number Publication date
CN106682170A (en) 2017-05-17

Similar Documents

Publication Publication Date Title
CN106682169B (en) Application label mining method and device, application searching method and server
CN106682170B (en) Application search method and device
CN106649818B (en) Application search intention identification method and device, application search method and server
CN110543574B (en) Knowledge graph construction method, device, equipment and medium
CN106970991B (en) Similar application identification method and device, application search recommendation method and server
CN108536852B (en) Question-answer interaction method and device, computer equipment and computer readable storage medium
CN106709040B (en) Application search method and server
CN109657054B (en) Abstract generation method, device, server and storage medium
CN110209897B (en) Intelligent dialogue method, device, storage medium and equipment
CN111767403B (en) Text classification method and device
CN105843962A (en) Information processing and displaying methods, information processing and displaying devices as well as information processing and displaying system
CN107657056B (en) Method and device for displaying comment information based on artificial intelligence
CN104394057A (en) Expression recommendation method and device
CN109284502B (en) Text similarity calculation method and device, electronic equipment and storage medium
CN108549723B (en) Text concept classification method and device and server
CN102314440B (en) Utilize the method and system in network operation language model storehouse
CN110209810A (en) Similar Text recognition methods and device
CN104077707B (en) A kind of optimization method and device for promoting presentation mode
KR20200087977A (en) Multimodal ducument summary system and method
CN113204953A (en) Text matching method and device based on semantic recognition and device readable storage medium
US9906588B2 (en) Server and method for extracting content for commodity
CN111813993A (en) Video content expanding method and device, terminal equipment and storage medium
CN106156262A (en) A kind of search information processing method and system
CN114881685A (en) Advertisement delivery method, device, electronic device and storage medium
CN110674388A (en) Mapping method and device for push item, storage medium and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant