CN106682169B - Application label mining method and device, application searching method and server - Google Patents

Application label mining method and device, application searching method and server

Info

Publication number
CN106682169B
CN106682169B · CN201611229785.5A · CN201611229785A
Authority
CN
China
Prior art keywords
application
keyword
corpus
stage
label system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611229785.5A
Other languages
Chinese (zh)
Other versions
CN106682169A (en
Inventor
庞伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201611229785.5A priority Critical patent/CN106682169B/en
Publication of CN106682169A publication Critical patent/CN106682169A/en
Application granted granted Critical
Publication of CN106682169B publication Critical patent/CN106682169B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an application label mining method and device, and an application searching method and server. The method comprises the following steps: obtaining the abstract of each application; obtaining search terms related to each application from an application search log; and mining a label system of each application according to the abstract of each application, the search terms and a preset strategy. The invention thus dynamically updates the application labels by automatically obtaining the abstract of each application and obtaining the search terms of each application in real time from users' historical application search logs; at the same time, the accuracy and recall of the application labels are continuously improved through the preset strategy, and the label system of each application is then mined and created. This solves the problems of heavy manual workload, low coverage and serious cheating that arise when a traditional application label system is built solely by manual labeling, greatly improves the search quality of the application search engine, and improves the user's search experience.

Description

Application label mining method and device, application searching method and server
Technical Field
The invention relates to the field of data mining and searching, in particular to an application label mining method and device, an application searching method and a server.
Background
An application search engine is a search service for mobile terminal software applications that provides application search and download on a mobile phone; examples include 360 Mobile Assistant, Tencent App and Quixey. Taking 360 Mobile Assistant as an example, the number of applications is in the millions, and automatically mining and constructing a label system for these applications is a key technology for improving the search quality of an application search engine as well as a core technology for realizing function-based search.
The traditional way to generate application labels is manual labeling, which involves a heavy workload, is time- and labor-consuming, and has low coverage. Alternatively, labels are submitted by application developers, which is often accompanied by cheating: hoping that their own applications will get more exposure, developers submit large amounts of label information that is irrelevant to the application.
Disclosure of Invention
In view of the above, the present invention is proposed in order to provide an application label mining method and device, and an application searching method and server, that overcome or at least partially solve the above problems.
According to an aspect of the present invention, there is provided an application tag mining method, including:
obtaining the abstract of each application;
acquiring search terms related to each application from the application search log;
and mining a label system of each application according to the abstract of each application, the search terms and a preset strategy.
Optionally, mining a label system of each application according to the abstract of each application, the search terms and the preset strategy includes:
obtaining a training corpus set according to the abstract and the search terms of each application;
inputting the training corpus set into an LDA model for training to obtain an application-topic probability distribution result and a topic-keyword probability distribution result output by the LDA model;
and calculating a label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result.
Optionally, obtaining a training corpus set according to the abstract and the search terms of each application includes:
for each application, extracting the first paragraph of text, or the text of a preset number of sentences, from the abstract of the application, and using the extracted text together with the search terms of the application as the original corpus of the application;
the original corpora of all applications form an original corpus set; and the original corpus set is preprocessed to obtain the training corpus set.
Optionally, preprocessing the original corpus set includes:
in the original corpus set,
for each original corpus, performing word segmentation on the original corpus to obtain a word segmentation result containing a plurality of terms; searching for phrases formed by adjacent terms in the word segmentation result; and retaining the phrases, the terms that are nouns and the terms that are verbs in the word segmentation result as the keywords retained for the original corpus.
Optionally, searching for phrases formed by adjacent terms in the word segmentation result includes:
calculating the cPMId value of every two adjacent terms in the word segmentation result, and determining that two adjacent terms form a phrase when their cPMId value is greater than a first preset threshold.
Optionally, preprocessing the original corpus set further includes:
using the keywords retained for the original corpus of each application as the first-stage training corpus of the application;
the first-stage training corpora of all applications form a first-stage training corpus set; and data cleaning is performed on the keywords in the first-stage training corpus set.
Optionally, performing data cleaning on the keywords in the first-stage training corpus set includes:
in the first-stage training corpus set,
for each first-stage training corpus, calculating the TF-IDF value of each keyword in the first-stage training corpus; and deleting the keywords whose TF-IDF values are higher than a second preset threshold and/or lower than a third preset threshold.
Optionally, preprocessing the original corpus set further includes:
taking the keywords remaining after data cleaning of the first-stage training corpus of each application as the second-stage training corpus of the application;
for the second-stage training corpus of each application, when a keyword in it appears in the name of the application, repeating the keyword a fourth preset threshold number of times in the second-stage training corpus to obtain the training corpus of the application;
the training corpora of all applications form the training corpus set.
Optionally, calculating a label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result includes:
calculating an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result;
and, according to the application-keyword probability distribution result, for each application, sorting the keywords in descending order of their probability with respect to the application and selecting the top keywords, the number of which is a fifth preset threshold.
Optionally, calculating an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result includes:
for each application, obtaining the probability of each topic with respect to the application from the application-topic probability distribution result;
for each topic, obtaining the probability of each keyword with respect to the topic from the topic-keyword probability distribution result;
for each keyword, taking the product of the probability of the keyword with respect to a topic and the probability of that topic with respect to an application as the probability of the keyword with respect to the application based on that topic; and taking the sum of these topic-based probabilities over all topics as the probability of the keyword with respect to the application.
Optionally, calculating a label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result further includes:
taking the top keywords (the fifth preset threshold number of keywords) selected for each application as the first-stage label system of the application;
for the first-stage label system of each application, calculating the semantic relation value between each keyword in the first-stage label system and the abstract of the application; for each keyword, taking the product of its semantic relation value and its probability with respect to the application as the correction probability of the keyword with respect to the application; and sorting all keywords in the first-stage label system of the application in descending order of their correction probability and selecting the top K keywords to form the label system of the application.
Optionally, calculating the semantic relation value between each keyword in the first-stage label system of the application and the abstract of the application comprises:
calculating the word vector of the keyword, and calculating the word vector of each term in the first preset number of sentences of the abstract of the application;
calculating the cosine similarity between the word vector of the keyword and the word vector of each term, and taking the product of each cosine similarity and the weight of the sentence in which the corresponding term is located as the semantic relation value between the keyword and that term;
and taking the sum of the semantic relation values between the keyword and all such terms as the semantic relation value between the keyword and the abstract of the application.
Optionally, calculating a label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result further includes:
taking the keywords selected for each application as the second-stage label system of the application;
for the second-stage label system of each application, obtaining from the application search log the set of search terms associated with download operations of the application, and counting the DF value of each keyword of the second-stage label system within that search term set; for each keyword, increasing its probability with respect to the application by a multiple of the DF value to obtain the secondary correction probability of the keyword with respect to the application; and sorting all keywords in the second-stage label system of the application in descending order of their secondary correction probability and selecting the top K keywords to form the label system of the application.
Optionally, selecting the top K keywords to form the label system of the application includes:
obtaining the quarterly download count of the application from the application search log;
and selecting the top K keywords according to the quarterly download count of the application to form the label system of the application, where the value of K is a piecewise-linear (polyline) function of the quarterly download count of the application.
According to another aspect of the present invention, there is provided an application search method, the method including:
receiving a search term uploaded by a client;
matching the search term against the label system of each application;
when the search term matches a keyword in the label system of an application, returning the relevant information of that application to the client for display;
wherein the label system of each application is constructed by the application label mining method described in any one of the above.
According to another aspect of the present invention, there is provided an application label mining device, including:
the information acquisition unit is suitable for acquiring the abstract of each application; acquiring search terms related to each application from the application search log;
and the label system construction unit is suitable for mining a label system of each application according to the abstract of each application, the search terms and a preset strategy.
Optionally, the tag system constructing unit is adapted to obtain a training corpus set according to the abstract and the search terms of each application; to input the training corpus set into an LDA model for training to obtain an application-topic probability distribution result and a topic-keyword probability distribution result output by the LDA model; and to calculate a label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result.
Optionally, the tag system constructing unit is adapted to, for each application, extract a first segment of text or text of a preset number of sentences from the abstract of the application; the extracted characters and the applied search terms are jointly used as the original corpus of the application; the original linguistic data of each application form an original linguistic data set; and preprocessing the original corpus set to obtain a training corpus set.
Optionally, the tag system constructing unit is adapted to perform word segmentation processing on each original corpus in the original corpus set to obtain a word segmentation result including a plurality of terms; searching phrases formed by adjacent terms in the word segmentation result; and reserving the phrases, the lexical items belonging to nouns and the lexical items belonging to verbs in the word segmentation result as the corresponding reserved keywords of the original corpus.
Optionally, the tag system constructing unit is adapted to calculate the cPMId value of every two adjacent terms in the word segmentation result, and to determine that two adjacent terms form a phrase when their cPMId value is greater than a first preset threshold.
Optionally, the tag system constructing unit is further adapted to use the keywords retained for the original corpus of each application as the first-stage training corpus of the application; the first-stage training corpora of all applications form a first-stage training corpus set; and to perform data cleaning on the keywords in the first-stage training corpus set.
Optionally, the tag system constructing unit is adapted to calculate, in the first-stage corpus, for each first-stage corpus, a TF-IDF value of each keyword in the first-stage corpus; and deleting the keywords with the TF-IDF values higher than the second preset threshold and/or lower than the third preset threshold.
Optionally, the tag system constructing unit is further adapted to use the keywords remaining after data cleaning of the first-stage training corpus of each application as the second-stage training corpus of the application; for the second-stage training corpus of each application, when a keyword in it appears in the name of the application, to repeat the keyword a fourth preset threshold number of times in the second-stage training corpus to obtain the training corpus of the application; the training corpora of all applications form the training corpus set.
Optionally, the tag system constructing unit is adapted to calculate an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result; and, according to the application-keyword probability distribution result, for each application, to sort the keywords in descending order of their probability with respect to the application and select the top keywords, the number of which is a fifth preset threshold.
Optionally, the tag system constructing unit is adapted to, for each application, obtain the probability of each topic with respect to the application from the application-topic probability distribution result; for each topic, obtain the probability of each keyword with respect to the topic from the topic-keyword probability distribution result; for each keyword, take the product of the probability of the keyword with respect to a topic and the probability of that topic with respect to an application as the probability of the keyword with respect to the application based on that topic; and take the sum of these topic-based probabilities over all topics as the probability of the keyword with respect to the application.
Optionally, the tag system constructing unit is further adapted to use the top keywords (the fifth preset threshold number of keywords) selected for each application as the first-stage label system of the application; for the first-stage label system of each application, to calculate the semantic relation value between each keyword in it and the abstract of the application; for each keyword, to take the product of its semantic relation value and its probability with respect to the application as the correction probability of the keyword with respect to the application; and to sort all keywords in the first-stage label system of the application in descending order of their correction probability and select the top K keywords to form the label system of the application.
Optionally, the tag system constructing unit is adapted to calculate the word vector of each keyword, and to calculate the word vector of each term in the first preset number of sentences of the abstract of the application; to calculate the cosine similarity between the word vector of the keyword and the word vector of each term, and take the product of each cosine similarity and the weight of the sentence in which the corresponding term is located as the semantic relation value between the keyword and that term; and to take the sum of the semantic relation values between the keyword and all such terms as the semantic relation value between the keyword and the abstract of the application.
Optionally, the tag system constructing unit is further adapted to use the keywords selected for each application as the second-stage label system of the application; for the second-stage label system of each application, to obtain from the application search log the set of search terms associated with download operations of the application and count the DF value of each keyword of the second-stage label system within that search term set; for each keyword, to increase its probability with respect to the application by a multiple of the DF value to obtain the secondary correction probability of the keyword with respect to the application; and to sort all keywords in the second-stage label system of the application in descending order of their secondary correction probability and select the top K keywords to form the label system of the application.
Optionally, the tag system constructing unit is adapted to obtain the quarterly download count of the application from the application search log, and to select the top K keywords according to the quarterly download count of the application to form the label system of the application, where the value of K is a piecewise-linear (polyline) function of the quarterly download count of the application.
According to another aspect of the present invention, there is provided an application search server, the server including:
the interaction unit is suitable for receiving the search terms uploaded by the client;
the search processing unit is suitable for matching in a label system of each application according to the search terms;
the interaction unit is also suitable for returning the relevant information of an application to the client for display when the search term matches a keyword in the label system of that application;
the application search server further comprises the application tag mining device as described above, and the tag system of each application is constructed by the application tag mining device.
According to the technical scheme of the invention, the abstract of each application is automatically acquired, the search terms of each application are acquired in real time from users' historical application search logs, and the application labels are dynamically updated; meanwhile, the accuracy and recall of the application labels are continuously improved through a preset strategy, and the label system of each application is then mined and created. This solves the problems of heavy manual workload, low coverage and serious cheating caused by traditional application label systems that rely solely on manual labeling, greatly improves the search quality of the application search engine, and improves the user's search experience.
The foregoing description is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be understood more clearly, and that the above and other objects, features and advantages of the present invention may become more readily apparent, embodiments of the invention are described below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates a flow diagram of a method of applying tag mining in accordance with one embodiment of the present invention;
FIG. 2 illustrates a flow diagram of a method of application searching in accordance with one embodiment of the present invention;
FIG. 3 illustrates a schematic interface diagram for searching based on an application search method according to one embodiment of the present invention;
FIG. 4 illustrates a schematic diagram of an application label mining apparatus in accordance with one embodiment of the present invention;
FIG. 5 shows a schematic diagram of an application search server, according to one embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 illustrates a flow diagram of a method for application tag mining in accordance with one embodiment of the present invention; referring to fig. 1, the method includes:
step S110, obtain the abstract of each application.
In step S120, search terms for each application are acquired from the application search log.
Step S130, mining a label system of each application according to the abstract of each application, the search terms and a preset strategy.
With the method shown in Fig. 1, the application labels are dynamically updated by automatically obtaining the abstract of each application and obtaining the search terms of each application in real time from users' historical application search logs; meanwhile, the accuracy and recall of the application labels are continuously improved through a preset strategy, and the label system of each application is then mined and created. This solves the problems of heavy manual workload, low coverage and serious cheating caused by traditional label systems that rely solely on manual labeling, greatly improves the search quality of the application search engine, and improves the user's search experience.
In an embodiment of the present invention, step S130 of mining a label system of each application according to the abstract of each application, the search terms and the preset strategy includes:
Step S131, obtaining a training corpus set according to the abstract and the search terms of each application.
Step S132, inputting the training corpus set into the LDA model for training to obtain the application-topic probability distribution result and the topic-keyword probability distribution result output by the LDA model.
Step S133, calculating a label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result.
It should be noted that LDA (Latent Dirichlet Allocation) is a document topic generation model. It is an unsupervised machine learning technique that can be used to identify topic information hidden in a large-scale document collection or corpus. It adopts the bag-of-words approach, which treats each document as a word-frequency vector, thereby converting text information into numerical information that is easy to model. The LDA model performs well on long texts but poorly on short texts, and an application abstract is short, making it a typical short text. In order to make the LDA model perform optimally, the interaction history between an application and its users (namely the search terms mentioned above, hereinafter simply "search terms") is introduced to expand the application abstract, i.e. the short text of the application abstract is expanded into a long text suitable for the LDA model. The search terms include not only the terms through which the engine can retrieve the application but also other terms, and these terms overcome problems such as the overly low frequency of synonyms with different surface forms that is caused by the shortness of the abstract text.
In this embodiment, the LDA model is the GibbsLDA++ implementation. For the application scenario of mobile terminal applications, the GibbsLDA++ source code is modified so that all occurrences of the same term within one application are initialized to the same topic. In the original code, each term occurrence is randomly initialized to a topic, so the same repeated term may be initialized to several different topics.
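For illustration only, the following minimal Python sketch reproduces this training step with the gensim library rather than the modified GibbsLDA++ (a C++ tool) used in the embodiment; the example documents, file handling and parameter choices are assumptions, while the 120 topics and 300 iterations match the example given below.

```python
# A minimal sketch (assumption: gensim LDA stands in for the modified GibbsLDA++ of the embodiment).
from gensim import corpora, models

# One document per application: the extracted abstract text plus its search terms (already tokenized).
app_docs = {
    "wechat": ["WeChat", "social", "chat", "voice", "phone", "phonebook", "making-friends"],
    "baidu_map": ["map", "navigation", "bus-transfer", "save-traffic", "search"],
}

app_ids = list(app_docs)
dictionary = corpora.Dictionary(app_docs.values())
bow = [dictionary.doc2bow(app_docs[a]) for a in app_ids]

# 120 topics and 300 iterations, matching the example in the description.
lda = models.LdaModel(bow, id2word=dictionary, num_topics=120, iterations=300, random_state=0)

# Application-topic distribution (analogue of the second output file).
app_topics = {a: lda.get_document_topics(b, minimum_probability=0.0) for a, b in zip(app_ids, bow)}
# Topic-keyword distribution (analogue of the first output file).
topic_keywords = {t: lda.get_topic_terms(t, topn=120) for t in range(lda.num_topics)}
```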
In order to make the solution of the present invention clearer, the application-topic probability distribution result and the topic-keyword probability distribution result output by the LDA model, mentioned in step S132, are illustrated in detail here. For example, the LDA training selects 120 topics and iterates for 300 rounds, generating two files. The first file is the topic-keyword probability distribution result; as shown in Table 1, it lists the probabilities between the fourth topic and each of 22 keywords:
TABLE 1
[Table 1 is presented as an image in the original document and is not reproduced here.]
The second file is the application-topic probability distribution result, as shown in table 2, showing the corresponding probabilities between the application with application ID 5427 and 6 topics (topic IDs 134, 189, 139, 126, 14, 18, respectively).
TABLE 2
[Table 2 is presented as an image in the original document and is not reproduced here.]
In order to make the solution of the present invention clearer, the following description uses a specific example. For example, the abstract of "WeChat" includes: "WeChat is a free application released by Tencent on January 21, 2011 that provides instant messaging services for intelligent terminals. WeChat supports quickly sending free voice messages, videos, pictures and text over the network (consuming only a small amount of network traffic), across communication operators and across operating system platforms." The search terms of "WeChat" include "WeChat, free instant messaging, Tencent, Moments, public platform, message push, shake, people nearby, adding friends by scanning a QR code, group chat".
The training corpus includes all of the above abstract content of "WeChat" and all of the content of the "WeChat" search terms. The training corpus set is input into the LDA model for training. If the topics generated by the LDA model for the "WeChat" corpus include "social", and the generated keywords include "chat, voice, phone, phonebook, social, making friends, communication, contacts, friends", then the application-topic probability distribution result output by the LDA model includes P1.1 (WeChat-social), and the topic-keyword distribution results output by the LDA model are P2.1 (WeChat-chat), P2.2 (WeChat-voice), P2.3 (WeChat-phone), P2.4 (WeChat-phonebook), P2.5 (WeChat-social), P2.6 (WeChat-making friends), P2.7 (WeChat-communication), P2.8 (WeChat-contacts) and P2.9 (WeChat-friends). The label system of "WeChat" calculated from P1.1 (WeChat-social) and P2.1 through P2.9 is shown in Table 3.
TABLE 3
[Table 3 is presented as an image in the original document and is not reproduced here.]
In this way, a training corpus set is obtained from the abstract and search terms of each application; the training corpus set is processed by the LDA model to generate the corresponding application-topic probability distribution result and topic-keyword probability distribution result; and the label system of each application is then calculated from these two results, so that the content and functionality of each application can be represented comprehensively and accurately.
In practice, application labels are often submitted directly by developers. In order to get their application installed and used by as many clients as possible, developers submit in the application's label description a large amount of content that is irrelevant to the application, leading to a long-standing phenomenon of false labels that seriously affects the search quality of the application search engine and greatly degrades the user's search experience. To solve this problem, in an embodiment of the present invention, step S131 of obtaining a training corpus set according to the abstract and search terms of each application includes: for each application, extracting the first paragraph of text, or the text of a preset number of sentences, from the abstract of the application, and using the extracted text together with the search terms of the application as the original corpus of the application; the original corpora of all applications form an original corpus set; and the original corpus set is preprocessed to obtain the training corpus set.
For example, for the application "WeChat", the obtained abstract of "WeChat" includes:
"WeChat is a social software application.
WeChat provides functions such as a public platform, Moments and message push. Users can add friends and follow public platforms by shaking the phone, searching for an account, viewing people nearby or scanning a QR code, and can share content with friends and post interesting content they have seen to their WeChat Moments. WeChat supports quickly sending free voice messages, videos, pictures and text over the network (consuming only a small amount of network traffic), across communication operators and across operating system platforms; it also supports location-based sharing of streaming media content and social plug-ins such as 'Shake', 'Drift Bottle', 'Moments', 'Public Platform' and 'Voice Notepad'.
By the first quarter of 2015, WeChat covered more than 90% of smartphones in China, with 549 million monthly active users in more than 200 countries and more than 20 languages. In addition, the total number of WeChat brand public accounts exceeded 8 million, more than 85,000 mobile applications had been connected, and WeChat Pay had about 400 million users."
The first sentence, "WeChat is a social software application", is extracted from this abstract, and the search terms of "WeChat", namely "chat, voice, phone, phonebook, social, making friends, communication, contacts, friends", are obtained at the same time; together these are used as the original corpus of "WeChat". The original corpora of other applications are obtained in the same way as the original corpus of "WeChat", and the original corpora of all applications form the original corpus set; the original corpus set is then preprocessed to obtain the training corpus set.
Specifically, preprocessing the original corpus set includes: for each original corpus in the original corpus set, performing word segmentation on the original corpus to obtain a word segmentation result containing a plurality of terms; searching for phrases formed by adjacent terms in the word segmentation result; and retaining the phrases, the terms that are nouns and the terms that are verbs in the word segmentation result as the keywords retained for the original corpus.
For example, in the original corpus set, the original corpus of "WeChat" is "WeChat is a social software application, chat, voice, phone, phonebook, social, making friends, communication, contacts, friends". Word segmentation of this original corpus yields a result containing a plurality of terms: "WeChat / is / a / social / software / chat / voice / phone / phonebook / social / making friends / communication / contacts / friends". Phrases formed by adjacent terms are then searched for in the segmentation result, and the phrases, the noun terms and the verb terms in the result are retained as the keywords of the original corpus; the retained keywords of "WeChat" include "WeChat, social, chat, voice, phone, phonebook, social, making friends, communication, contacts, friends".
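The embodiment does not name a particular word segmenter; purely as an illustration, the sketch below uses the jieba part-of-speech tagger to keep only noun and verb terms, as described above. The phrase-detection step is sketched separately after Formula 1.

```python
# A minimal sketch (assumption: the jieba segmenter is used here only as an illustration;
# the embodiment does not specify a word-segmentation tool).
import jieba.posseg as pseg

def keep_keywords(raw_corpus: str) -> list[str]:
    """Segment one original corpus and keep only nouns and verbs, as described above."""
    kept = []
    for word, flag in pseg.cut(raw_corpus):
        # jieba POS flags: 'n*' for nouns, 'v*' for verbs.
        if flag.startswith("n") or flag.startswith("v"):
            kept.append(word)
    return kept

print(keep_keywords("微信是一款社交软件 聊天 语音 打电话"))
```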
In order to determine whether two adjacent terms form a phrase, the closeness of the two terms is calculated. In an embodiment of the present invention, searching for phrases formed by adjacent terms in the word segmentation result includes: calculating the cPMId value of every two adjacent terms in the word segmentation result, and determining that two adjacent terms form a phrase when their cPMId value is greater than a first preset threshold.
For example, the first preset threshold is set to 5, and the word segmentation result obtained for "Baidu map" is "save, traffic, bus, transfer". The cPMId values of the adjacent pairs "save, traffic", "traffic, bus" and "bus, transfer" are calculated with the cPMId method. If the calculated cPMId values of "save, traffic" and "bus, transfer" are greater than 5, it is determined that they form the phrases "save traffic" and "bus transfer"; if the calculated cPMId value of "traffic, bus" is less than 5, it is determined that "traffic" and "bus" cannot form a phrase.
The cPMId value is calculated as shown in Formula 1:
[Formula 1 is presented as an image in the original document.]
In Formula 1, D(x, y) denotes the co-occurrence frequency of the two terms x and y (the number of applications in which they co-occur), D(x) denotes the occurrence frequency of term x, D(y) denotes the occurrence frequency of term y, and D denotes the total number of applications.
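Formula 1 appears only as an image in the original document, so the sketch below assumes a standard document-count PMI built from the quantities just defined (D(x, y), D(x), D(y) and D); the counts in the example call are hypothetical.

```python
# A hedged sketch of the phrase test: the exact Formula 1 is an image in the original document,
# so a standard document-count PMI over D(x, y), D(x), D(y) and D is assumed here.
import math

def pmi_score(d_xy: int, d_x: int, d_y: int, d_total: int) -> float:
    if d_xy == 0:
        return float("-inf")
    return math.log((d_xy * d_total) / (d_x * d_y))

def is_phrase(d_xy: int, d_x: int, d_y: int, d_total: int, first_threshold: float = 5.0) -> bool:
    # Two adjacent terms form a phrase when their score exceeds the first preset threshold.
    return pmi_score(d_xy, d_x, d_y, d_total) > first_threshold

# Example: two adjacent terms co-occur in 1,200 of 1,000,000 applications (hypothetical counts).
print(is_phrase(d_xy=1200, d_x=5000, d_y=8000, d_total=1_000_000))
```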
Further, in an embodiment of the present invention, preprocessing the original corpus set further includes: using the keywords retained for the original corpus of each application as the first-stage training corpus of the application; the first-stage training corpora of all applications form a first-stage training corpus set; and data cleaning is performed on the keywords in the first-stage training corpus set.
Specifically, among millions of applications, a term that occurs with extremely high frequency is unlikely to be a label, and a term that occurs with very low frequency is also unlikely to be a label, so the data-cleaning process filters out keywords that occur extremely frequently and keywords that occur extremely rarely.
For example, the keywords retained for the original corpus of "WeChat" include "WeChat, social, chat, voice, phone, phonebook, social, making friends, communication, contacts, friends", so these are used as the first-stage training corpus of "WeChat". The first-stage training corpora of all applications form the first-stage training corpus set; data cleaning is performed on the keywords in this set, filtering out terms that occur too frequently in it, which improves the quality of the application search engine.
In order to filter out keywords that occur at a very high frequency and keywords that occur at a very low frequency in the first-stage training corpus set, in an embodiment of the present invention, performing data cleaning on the keywords in the first-stage training corpus set includes: in the first-stage training corpus set, for each first-stage training corpus, calculating the TF-IDF value of each keyword in the first-stage training corpus; and deleting the keywords whose TF-IDF values are higher than a second preset threshold and/or lower than a third preset threshold.
In the process, a TF-IDF calculation formula is adopted to calculate the TF-IDF value of each keyword in the first-stage training corpus, so that further cleaning of data is realized.
For example, the first-stage training corpus of "WeChat" includes "WeChat, social, chat, voice, phone, phonebook, social, making friends, communication, contacts, friends". Using the TF-IDF formula, the TF-IDF value of each term and phrase in this first-stage training corpus is calculated, giving TF-IDF(WeChat), TF-IDF(social), TF-IDF(chat), TF-IDF(voice), TF-IDF(phone), TF-IDF(phonebook), TF-IDF(social), TF-IDF(making friends), TF-IDF(communication), TF-IDF(contacts) and TF-IDF(friends). If TF-IDF(communication), TF-IDF(contacts) and TF-IDF(friends) are higher than the second preset threshold and/or lower than the third preset threshold, then "communication, contacts, friends" are deleted. It should be noted that the second and third preset thresholds depend on the specific corpus, so specific values are not listed here. TF-IDF is used for the data cleaning because it evaluates well how important a word is to one document within a document collection or corpus, which is exactly what data cleaning requires.
The calculation formula of TF-IDF is as follows:
[Formula 2 is presented as an image in the original document.]
In Formula 2, count(w, app) is the term frequency of term w in the application app, count(w, Corpus) is the term frequency of w in the whole corpus, nCorpus is the total number of applications, and app_count(w) is the number of applications containing term w.
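Formula 2 likewise appears only as an image, so the following sketch assumes the classical TF-IDF combination of the quantities it defines; the threshold handling mirrors the deletion rule for the second and third preset thresholds described above.

```python
# A hedged sketch of the cleaning step: Formula 2 is an image in the original document, so the
# classical TF-IDF combination of count(w, app), nCorpus and app_count(w) is assumed here.
import math

def tf_idf(count_w_app: int, n_corpus: int, app_count_w: int) -> float:
    return count_w_app * math.log(n_corpus / (1 + app_count_w))

def clean(first_stage_corpus: list[str], counts: dict[str, int], n_corpus: int,
          app_count: dict[str, int], low: float, high: float) -> list[str]:
    """Drop keywords whose TF-IDF is above the second threshold (high) or below the third (low)."""
    kept = []
    for w in first_stage_corpus:
        score = tf_idf(counts[w], n_corpus, app_count[w])
        if low <= score <= high:
            kept.append(w)
    return kept
```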
Further, in an embodiment of the present invention, preprocessing the original corpus set further includes: taking the keywords remaining after data cleaning of the first-stage training corpus of each application as the second-stage training corpus of the application; for the second-stage training corpus of each application, when a keyword in it appears in the name of the application, repeating the keyword a fourth preset threshold number of times in the second-stage training corpus to obtain the training corpus of the application; the training corpora of all applications form the training corpus set.
For example, the first-stage training corpus of "WeChat" includes "WeChat, social, chat, voice, phone, phonebook, social, making friends, communication, contacts, friends". After "communication, contacts, friends" are removed by data cleaning, the remaining keywords "WeChat, social, chat, voice, phone, phonebook, social, making friends" form the second-stage training corpus of "WeChat".
When analyzing the second-stage training corpora, it was found that labels expressing the function or category of an application often appear in its name, such as "taxi" in the name of a taxi-hailing application, "takeout" in the name of a takeout application, "car rental" in the name of a car-rental application, and "map" in "Baidu map". In order to highlight this category of important labels, terms that appear in the name of an application are repeated three times in the training corpus of that application, and phrases with a cPMId value higher than 10.0 are also repeated three times, so as to increase the frequency of these potentially important phrase labels. At this point the corpus construction for the LDA topic model is complete, and the training corpus set is stored in a file such as app_corp_seg_nouns_verbs_filtered_repeat.
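A minimal sketch of this boosting step might look as follows; the helper name and the way phrase scores are passed in are assumptions, while the repeat factor of three and the 10.0 cut-off come from the example above.

```python
# A minimal sketch of the boosting step described above (repeat factor 3 and the 10.0 cut-off are
# taken from the example; the helper name and phrase_scores input are hypothetical).
def boost_corpus(second_stage: list[str], app_name: str, phrase_scores: dict[str, float],
                 repeat: int = 3, phrase_cutoff: float = 10.0) -> list[str]:
    boosted = []
    for kw in second_stage:
        times = 1
        if kw in app_name:                                    # keyword appears in the app's name
            times = repeat
        elif phrase_scores.get(kw, 0.0) > phrase_cutoff:      # strongly associated phrase
            times = repeat
        boosted.extend([kw] * times)
    return boosted
```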
In an embodiment of the present invention, step S133 of calculating a label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result includes:
calculating an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result; and, according to the application-keyword probability distribution result, for each application, sorting the keywords in descending order of their probability with respect to the application and selecting the top keywords, the number of which is a fifth preset threshold.
For example, the fifth preset threshold is set to 8. The LDA model outputs the probability distribution of topics under each application and the probability distribution of terms under each topic. To obtain the labels of each application, the topic probabilities and the keyword probabilities are each sorted in descending order; the top 50 topics are selected for each application and the top 120 keywords are selected for each topic, and the keyword probabilities are weighted by the topic probabilities, so that each keyword of an application obtains a weight representing its importance for that application. The keywords are sorted in descending order of this weight and the top 8 are selected, yielding the label list generated by LDA. This list still contains a lot of noise, and the order of the labels is inaccurate, as shown in Table 4.
TABLE 4
[Table 4 is presented as an image in the original document and is not reproduced here.]
Calculating the application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result comprises:
for each application, obtaining the probability of each topic with respect to the application from the application-topic probability distribution result; for each topic, obtaining the probability of each keyword with respect to the topic from the topic-keyword probability distribution result; for each keyword, taking the product of the probability of the keyword with respect to a topic and the probability of that topic with respect to an application as the probability of the keyword with respect to the application based on that topic; and taking the sum of these topic-based probabilities over all topics as the probability of the keyword with respect to the application.
For example, if a is a keyword of an application C and the topics corresponding to keyword a include B1, B2 and B3, the probability of keyword a with respect to topic B1 is P(a_B1) and the probability of topic B1 with respect to application C is P(B1_C), then P(a_B1) × P(B1_C) is the probability of keyword a with respect to application C based on topic B1; P(a_B2) × P(B2_C) is the probability of keyword a with respect to application C based on topic B2; P(a_B3) × P(B3_C) is the probability of keyword a with respect to application C based on topic B3; and the probability of keyword a with respect to application C is P(a_B1) × P(B1_C) + P(a_B2) × P(B2_C) + P(a_B3) × P(B3_C).
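The computation described in the two paragraphs above can be sketched as follows; the dictionaries stand in for the two probability-distribution files output by the LDA model, and the numbers in the example call are illustrative only.

```python
# A minimal sketch of the application-keyword probability computation described above.
from collections import defaultdict

def app_keyword_probs(topic_probs: dict[str, float],
                      keyword_probs: dict[str, dict[str, float]]) -> dict[str, float]:
    """topic_probs: P(topic | app); keyword_probs: P(keyword | topic).
    Returns P(keyword | app) = sum over topics of P(keyword | topic) * P(topic | app)."""
    result = defaultdict(float)
    for topic, p_topic in topic_probs.items():
        for keyword, p_kw in keyword_probs.get(topic, {}).items():
            result[keyword] += p_kw * p_topic
    return dict(result)

# Example with the keyword "a" and topics B1, B2, B3 from the paragraph above.
probs = app_keyword_probs(
    {"B1": 0.5, "B2": 0.3, "B3": 0.2},
    {"B1": {"a": 0.10}, "B2": {"a": 0.05}, "B3": {"a": 0.20}},
)
print(probs["a"])  # 0.5*0.10 + 0.3*0.05 + 0.2*0.20 = 0.105
```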
On this basis, in a further embodiment of the present invention, calculating a label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result further includes:
taking the top keywords (the fifth preset threshold number of keywords) selected for each application as the first-stage label system of the application; for the first-stage label system of each application, calculating the semantic relation value between each keyword in it and the abstract of the application; for each keyword, taking the product of its semantic relation value and its probability with respect to the application as the correction probability of the keyword with respect to the application; and sorting all keywords in the first-stage label system of the application in descending order of their correction probability and selecting the top K keywords to form the label system of the application.
For example, assume the fifth preset threshold is 3 and the top 3 keywords selected for "Baidu map" are "map, search, navigation"; then "map, search, navigation" is taken as the first-stage label system of "Baidu map".
For the first-stage label system of "Baidu map", the semantic relation values between the keywords "map", "search" and "navigation" and the abstract of "Baidu map" are calculated as R1, R2 and R3, respectively, and the probabilities of these keywords with respect to "Baidu map" are P1, P2 and P3; then R1×P1, R2×P2 and R3×P3 are their correction probabilities with respect to "Baidu map". If R1×P1 > R3×P3 > R2×P2, the order of the keywords in the first-stage label system of "Baidu map" becomes "map, navigation, search", and if 2 keywords are selected to form the label system of the application, the label system of "Baidu map" includes "map, navigation".
Specifically, calculating the semantic relation value between each keyword in the first-stage label system of the application and the abstract of the application includes:
calculating the word vector of the keyword, and calculating the word vector of each term in the first preset number of sentences of the abstract of the application; calculating the cosine similarity between the word vector of the keyword and the word vector of each term, and taking the product of each cosine similarity and the weight of the sentence in which the corresponding term is located as the semantic relation value between the keyword and that term; and taking the sum of the semantic relation values between the keyword and all such terms as the semantic relation value between the keyword and the abstract of the application.
For example, the set of search terms obtained from the search logs of the application search engine is used as the input data for training word vectors, and a 300-dimensional word-vector dictionary file tag_query_w2v_300.dict is obtained through training. If the keywords of "Baidu map" include "map, search, navigation", the word vector of "map" is calculated as M1, and the word vectors of the terms in the first 3 sentences of the abstract of "Baidu map" are calculated as N1, N2 and N3, respectively. The cosine similarities between the word vector of "map" and the word vectors of these terms are cos(M1, N1), cos(M1, N2) and cos(M1, N3), and the weights of the sentences in which the corresponding terms are located are Q1, Q2 and Q3. The semantic relation values between the keyword and the corresponding terms are therefore Q1×cos(M1, N1), Q2×cos(M1, N2) and Q3×cos(M1, N3), and Q1×cos(M1, N1) + Q2×cos(M1, N2) + Q3×cos(M1, N3) is taken as the semantic relation value between "map" and the abstract of "Baidu map".
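A minimal sketch of this semantic-relation computation, assuming the word vectors (for instance from a word2vec model trained on the search-term set) and the sentence weights are already available, might look like this:

```python
# A minimal sketch of the semantic relation value described above (the word-vector model and the
# sentence weights are assumed to be given; numpy is used only for the cosine similarity).
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_relation(keyword_vec: np.ndarray,
                      sentence_terms: list[tuple[np.ndarray, float]]) -> float:
    """sentence_terms: (term word vector, weight of the sentence the term belongs to) pairs
    taken from the first few sentences of the application's abstract."""
    return sum(weight * cosine(keyword_vec, term_vec) for term_vec, weight in sentence_terms)

# Corrected probability of a keyword = semantic relation value * P(keyword | app).
def corrected_prob(keyword_vec, sentence_terms, p_keyword_app: float) -> float:
    return semantic_relation(keyword_vec, sentence_terms) * p_keyword_app
```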
Further, in an embodiment of the present invention, calculating a label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result further includes:
taking the keywords selected for each application as the second-stage label system of the application; for the second-stage label system of each application, obtaining from the application search log the set of search terms associated with download operations of the application, and counting the DF value of each keyword of the second-stage label system within that search term set; for each keyword, increasing its probability with respect to the application by a multiple of the DF value to obtain the secondary correction probability of the keyword with respect to the application; and sorting all keywords in the second-stage label system of the application in descending order of their secondary correction probability and selecting the top K keywords to form the label system of the application.
For example, the set of historical search terms mined for downloads of "Baidu map" includes "map, search, navigation". The DF value of the keyword "map" in the historical search term set of "Baidu map" is DF1, the DF value of "search" is DF2, and the DF value of "navigation" is DF3, and the initial probabilities of "map", "search" and "navigation" with respect to "Baidu map" are P1, P2 and P3. Then the secondary correction probability of the keyword "map" with respect to "Baidu map" is P1×(1+DF1), that of "search" is P2×(1+DF2), and that of "navigation" is P3×(1+DF3).
If P1×(1+DF1) > P3×(1+DF3) > P2×(1+DF2), the order of the keywords of "Baidu map" is adjusted to "map, navigation, search", and if the first two keywords are selected to form the label system of "Baidu map", the label system of "Baidu map" includes "map, navigation". The accuracy of the label order of "Baidu map" is greatly improved after this adjustment.
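A sketch of the secondary correction might look as follows; note that the embodiment does not fix how the DF value is normalized, so the fraction used below is an assumption, and the probabilities and queries are illustrative.

```python
# A minimal sketch of the secondary correction described above: each keyword's probability is scaled
# by (1 + DF), where DF is the keyword's document frequency in the set of search terms that led to
# downloads of the application (computed here as a fraction; the normalization is an assumption).
def secondary_correction(label_probs: dict[str, float], download_queries: list[str]) -> dict[str, float]:
    n = len(download_queries)
    corrected = {}
    for keyword, prob in label_probs.items():
        df = sum(1 for q in download_queries if keyword in q) / n if n else 0.0
        corrected[keyword] = prob * (1.0 + df)
    return dict(sorted(corrected.items(), key=lambda kv: kv[1], reverse=True))

print(secondary_correction({"map": 0.30, "search": 0.25, "navigation": 0.20},
                           ["baidu map download", "map navigation", "navigation offline"]))
```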
The results of the first correction for "public praise takeout" and "Baidu map" are shown in Table 5.
TABLE 5
[Table 5 is presented as an image in the original document and is not reproduced here.]
The results of the secondary correction for "Koubei Takeout" and "Baidu Map" are shown in Table 6.
TABLE 6
(The table content is provided as an image in the original publication.)
Comparing Table 5 with Table 6 shows that the accuracy of each application's label ordering is greatly improved after the secondary correction.
In a specific example, the selecting the top K keywords to form a tag system of the application includes:
acquiring the quarterly download count of the application from the application search logs;
selecting the first K keywords according to the quarterly download count of the application to form the label system of the application; where the value of K is a piecewise-linear function of the application's quarterly download count (an illustrative sketch follows).
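The sketch below shows one way such a piecewise-linear mapping from quarterly downloads to K could look; the breakpoints are assumptions chosen only to match the three-to-fifteen range mentioned in the next paragraph, not values disclosed in the patent.

import numpy as np

def choose_k(quarterly_downloads):
    # K as a piecewise-linear function of the quarterly download count,
    # interpolated between assumed breakpoints and clipped to 3..15.
    x = [0, 10_000, 100_000, 1_000_000]   # download counts (assumed breakpoints)
    y = [3, 5, 8, 15]                     # labels to keep at each breakpoint
    return int(round(float(np.interp(quarterly_downloads, x, y))))

for downloads in (2_000, 50_000, 5_000_000):
    print(downloads, "->", choose_k(downloads))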
In practice it has been found that the precision@K of an application's label list is related to how popular the application is, and the quarterly download count directly reflects that popularity. Three to fifteen different labels are therefore retained per application, the number being proportional to the quarterly download count; this gives an accuracy of 92% and a recall rate of 76%. Typical examples are shown in Table 7.
TABLE 7
(The table content is provided as an image in the original publication.)
Based on the above application label mining scheme, the invention further provides an application search method.
fig. 2 is a flowchart illustrating an application search method according to an embodiment of the present invention, and as shown in fig. 2, the application search method 200 includes:
s210, receiving search terms uploaded by a client;
s220, matching in the label system of each application according to the search terms;
s230, when the search word is matched with a keyword in an application label system, returning the relevant information of the application to the client for displaying;
s240, the label system of each application is constructed by any application label mining method.
Therefore, the application label mining method constructs a label system for each application, which guarantees the recall rate of application search and improves the search quality of the application search engine; built on this label mining method, the application search method greatly improves search quality and enhances the user experience. For example, when a user searches for "Didi", the application search engine not only returns the exact application "Didi Chuxing" but also presents applications with similar functions, such as "Kuaidi Dache" and "Uber China".
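A minimal sketch of the matching step (S220/S230) follows, using an inverted index from label keyword to applications; the tag data here is invented for illustration and the ranking logic of the real engine is omitted.

from collections import defaultdict

def build_keyword_index(tag_systems):
    # tag_systems: application name -> list of label keywords.
    index = defaultdict(set)
    for app, keywords in tag_systems.items():
        for kw in keywords:
            index[kw].add(app)
    return index

def search(index, query):
    # Return every application whose label system contains the search term.
    return sorted(index.get(query, set()))

tags = {
    "Didi Chuxing": ["taxi", "ride hailing", "travel"],
    "Kuaidi Dache": ["taxi", "ride hailing"],
    "Baidu Map": ["map", "navigation"],
}
index = build_keyword_index(tags)
print(search(index, "taxi"))  # ['Didi Chuxing', 'Kuaidi Dache']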
FIG. 3 is a diagram illustrating a search interface based on the application search method according to an embodiment of the present invention. To make the application search method clearer, it is described below with reference to a specific example. In this example, the user searches for the keyword "order" in "360 Mobile Assistant", and the results displayed by the "360 Mobile Assistant" search engine are shown in fig. 3. As can be seen from fig. 3, when the user searches for "order", the "360 Mobile Assistant" search engine returns all applications with an ordering function, such as "Meituan Takeout", "Ele.me", "Baidu Nuomi", "Dianping", "Meituan" and the like. The application label system constructed by the invention thus plays a major role in retrieval and ranking, greatly improving search quality and the user's search experience.
FIG. 4 illustrates a schematic diagram of an application label mining apparatus in accordance with one embodiment of the present invention; as shown in fig. 4, the application label mining apparatus 400 includes:
an information acquisition unit 410, adapted to obtain the abstract of each application and to obtain the search terms of each application from the application search logs; and
a label system construction unit 420, adapted to mine the label system of each application according to the abstract, the search terms and the preset strategy of each application.
In this way, the information acquisition unit 410 automatically obtains the abstract of each application and obtains the search terms of each application from users' historical application search logs, so that the application labels can be updated dynamically; meanwhile, the label system construction unit 420 mines and creates the label system of each application from the abstracts, the search terms and the preset strategy, continuously improving the accuracy and recall rate of the application labels. This overcomes the problems of conventional application label systems that can only be annotated manually, such as heavy manual workload, low coverage and widespread cheating, thereby greatly improving the search quality of the application search engine and the user's search experience.
In an embodiment of the present invention, the label system constructing unit 420 is adapted to obtain a training corpus set according to the abstracts and search terms of the applications; to input the training corpus set into an LDA model for training, obtaining the application-topic probability distribution result and the topic-keyword probability distribution result output by the LDA model; and to calculate the label system of each application according to the application-topic probability distribution result and the topic-keyword probability distribution result.
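For readers who want a concrete picture of the LDA step, the sketch below uses the gensim library; the toy corpora, topic count and hyper-parameters are assumptions for illustration, not the configuration used by the invention.

from gensim import corpora, models

# Each application's training corpus is its bag of retained keywords (toy data).
app_corpora = {
    "Baidu Map":    ["map", "navigation", "route", "search", "map"],
    "Didi Chuxing": ["taxi", "ride", "travel", "driver"],
}

dictionary = corpora.Dictionary(app_corpora.values())
bows = {app: dictionary.doc2bow(tokens) for app, tokens in app_corpora.items()}

# Unsupervised LDA training; num_topics and passes are assumed values.
lda = models.LdaModel(list(bows.values()), id2word=dictionary,
                      num_topics=2, passes=10, random_state=42)

# Application-topic probability distribution.
for app, bow in bows.items():
    print(app, lda.get_document_topics(bow, minimum_probability=0.0))
# Topic-keyword probability distribution.
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=3))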
The tag system constructing unit 420 is adapted to, for each application, extract a first segment of characters or characters of a preset number of sentences from the abstract of the application; the extracted characters and the applied search terms are jointly used as the original corpus of the application; the original linguistic data of each application form an original linguistic data set; and preprocessing the original corpus set to obtain a training corpus set.
In one embodiment, the preprocessing of the original corpus set by the tag system constructing unit 420 includes: the tag system constructing unit 420 is adapted to perform word segmentation processing on each original corpus in the original corpus set to obtain a word segmentation result containing a plurality of terms; to search for phrases formed by adjacent terms in the word segmentation result; and to retain the phrases, the terms that are nouns and the terms that are verbs in the word segmentation result as the keywords retained for the original corpus.
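A small sketch of this keep-phrases-nouns-and-verbs filter is shown below using the jieba part-of-speech tagger (POS flags starting with "n" are nouns and with "v" are verbs); the sample sentence and the pre-detected phrase list are illustrative assumptions, with phrase detection itself described in the next paragraph.

import jieba.posseg as pseg

def keep_keywords(raw_corpus, phrases=()):
    # Keep already-detected adjacent-term phrases plus all nouns and verbs
    # from the segmentation result of one original corpus.
    kept = list(phrases)
    for term, flag in pseg.cut(raw_corpus):
        if flag.startswith("n") or flag.startswith("v"):
            kept.append(term)
    return kept

# Illustrative call; the real input is an app's abstract text plus its search terms.
print(keep_keywords("提供地图搜索与导航服务", phrases=["地图搜索"]))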
Specifically, the tag system constructing unit 420 is adapted to calculate the cPMId value of every two adjacent terms in the word segmentation result, and to determine that two adjacent terms form a phrase when their cPMId value is greater than a first preset threshold.
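The association measure named here (written cPMId, following the Damani paper cited in this publication) is not spelled out in this section; as a hedged stand-in, the sketch below scores adjacent term pairs with plain document-level PMI and promotes pairs above an assumed threshold to phrases.

import math
from collections import Counter

def find_phrases(segmented_docs, threshold=0.5):
    # Score each pair of adjacent terms with document-level PMI
    # (a simple stand-in for the cPMId measure) and keep pairs whose
    # score exceeds the assumed threshold as phrases.
    n_docs = len(segmented_docs)
    term_df, pair_df = Counter(), Counter()
    for doc in segmented_docs:
        term_df.update(set(doc))
        pair_df.update(set(zip(doc, doc[1:])))

    phrases = set()
    for (a, b), d_ab in pair_df.items():
        score = math.log((d_ab * n_docs) / (term_df[a] * term_df[b]))
        if score > threshold:
            phrases.add(a + b)
    return phrases

docs = [["百度", "地图"], ["百度", "地图", "导航"], ["天气", "预报"], ["天气", "查询"]]
print(find_phrases(docs))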
Further, the label system constructing unit 420 is further adapted to use the keywords retained for the original corpus of each application as the first-stage training corpus of the application; the first-stage training corpora of the applications form a first-stage training corpus set; and to perform data cleaning on the keywords in the first-stage training corpus set.
Specifically, the tag system constructing unit 420 is adapted to calculate, in the first-stage corpus set, for each first-stage corpus, a TF-IDF value of each keyword in the first-stage corpus; and deleting the keywords with the TF-IDF values higher than the second preset threshold and/or lower than the third preset threshold.
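A sketch of this TF-IDF cleaning pass follows; the low and high cut-offs stand in for the third and second preset thresholds and are illustrative assumptions.

import math
from collections import Counter

def clean_by_tfidf(stage1_corpora, low=0.02, high=0.6):
    # Drop keywords whose TF-IDF inside their corpus falls outside the
    # [low, high] band, following the second/third preset thresholds.
    n_docs = len(stage1_corpora)
    df = Counter()
    for doc in stage1_corpora:
        df.update(set(doc))

    cleaned = []
    for doc in stage1_corpora:
        tf = Counter(doc)
        kept = [kw for kw in doc
                if low <= (tf[kw] / len(doc)) * math.log(n_docs / df[kw]) <= high]
        cleaned.append(kept)
    return cleaned

# Toy first-stage corpora; "app" occurs in every corpus, so its IDF is 0 and it is removed.
corpora_stage1 = [["app", "map", "navigation"], ["app", "taxi"], ["app", "weather"]]
print(clean_by_tfidf(corpora_stage1))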
Further, the label system constructing unit 420 is further adapted to use the remaining keywords after the data cleaning of the first-stage corpus of each application as the second-stage corpus of the application; for each applied second-stage corpus, when a keyword in the applied second-stage corpus appears in the name of the application, repeating the keyword in the applied second-stage corpus for a fourth preset threshold number of times to obtain the applied corpus; the corpus of each application constitutes a corpus set.
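The name-boosting step can be pictured with the short sketch below; the repeat count stands in for the fourth preset threshold and the sample data is invented.

def boost_name_keywords(app_name, stage2_corpus, repeat_times=2):
    # If a keyword of the second-stage corpus occurs in the application name,
    # repeat it `repeat_times` extra times so the topic model weights it more.
    boosted = []
    for kw in stage2_corpus:
        boosted.append(kw)
        if kw in app_name:
            boosted.extend([kw] * repeat_times)
    return boosted

print(boost_name_keywords("百度地图", ["地图", "导航", "搜索"]))
# ['地图', '地图', '地图', '导航', '搜索']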
In an embodiment of the present invention, the tag system constructing unit 420 is adapted to calculate an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result; and according to the application-keyword probability distribution result, for each application, sorting the keywords according to the probability of the application from large to small, and selecting the keywords with the number of the top fifth preset threshold value.
The label system constructing unit 420 is adapted to, for each application, obtain the probability of each topic about the application according to the application-topic probability distribution result; for each topic, obtain the probability of each keyword about the topic according to the topic-keyword probability distribution result; for each keyword, take the product of the probability of the keyword about a topic and the probability of the topic about an application as the probability of the keyword based on that topic about the application; and take the sum of the probabilities of the keyword based on the respective topics as the probability of the keyword about the application.
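Put as code, the marginalisation described here is a one-liner per keyword; the sketch below uses invented toy distributions in place of real LDA output.

def app_keyword_probs(app_topic_probs, topic_keyword_probs):
    # P(keyword | app) = sum over topics of P(topic | app) * P(keyword | topic).
    result = {}
    for topic, p_topic in app_topic_probs.items():
        for kw, p_kw in topic_keyword_probs[topic].items():
            result[kw] = result.get(kw, 0.0) + p_topic * p_kw
    return result

app_topics = {0: 0.7, 1: 0.3}                       # toy P(topic | app)
topic_keywords = {0: {"map": 0.5, "navigation": 0.3},
                  1: {"search": 0.6, "map": 0.2}}   # toy P(keyword | topic)
probs = app_keyword_probs(app_topics, topic_keywords)
print(sorted(probs, key=probs.get, reverse=True))   # ['map', 'navigation', 'search']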
Further, the tag system constructing unit 420 is further adapted to use the first-fifth preset threshold number of keywords selected corresponding to each application as the first-stage tag system of the application; for the first-stage label system of each application, calculating a semantic relation value between each keyword in the first-stage label system of the application and the abstract of the application; for each keyword, taking the product of the semantic relation value corresponding to the keyword and the probability of the keyword relative to the application as the correction probability of the keyword relative to the application; and sorting all the keywords in the first-stage label system of the application from large to small according to the correction probability of the application, and selecting the first K keywords to form the label system of the application.
Specifically, the tag system constructing unit 420 is adapted to calculate a word vector of the keyword, and calculate a word vector of each term in a preset number of sentences before the abstract of the application; calculating cosine similarity between the word vector of the keyword and the word vector of each term, and taking the product of each cosine similarity and the weight of the sentence in which the corresponding term is located as a semantic relation value of the keyword and the corresponding term; and taking the sum of the semantic relation values of the keyword and each term as the semantic relation value between the keyword and the abstract of the application.
Further, the tag system constructing unit 420 is further adapted to use the keywords correspondingly selected for each application as the second-stage label system of the application; for the second-stage label system of each application, acquire from the application search logs a search word set related to the download operations of the application, and count the DF value of each keyword of the application's second-stage label system in that search word set; for each keyword, multiply the probability of the keyword about the application by (1 + DF) to obtain the secondary correction probability of the keyword about the application; and sort all the keywords in the second-stage label system of the application in descending order of secondary correction probability, selecting the first K keywords to form the label system of the application.
In an embodiment of the present invention, the tag system constructing unit 420 is adapted to obtain the quarterly download count of the application from the application search logs; and to select the first K keywords according to the quarterly download count of the application to form the label system of the application, where the value of K is a piecewise-linear function of the application's quarterly download count.
It should be noted that the working process of the application tag mining apparatus in this embodiment corresponds to the implementation steps of the application tag mining method, and therefore, the specific working process of the application tag mining apparatus in this embodiment may refer to the relevant description of the application tag mining method, which is not described herein again.
FIG. 5 shows a schematic diagram of an application search server, according to one embodiment of the invention. As shown in fig. 5, the application search server 500 includes:
the interaction unit 510 is adapted to receive a search term uploaded by a client.
And the search processing unit 520 is adapted to perform matching in the tag systems of the applications according to the search terms.
The interaction unit 510 is further adapted to return the relevant information of an application to the client for displaying when the search term matches a keyword in a tag system of the application.
The application search server 500 further includes the application tag mining device 530, and the tag system of each application is constructed by the application tag mining device 530.
It should be noted that the application tag mining apparatus 530 in this embodiment has the same corresponding functions as the application tag mining apparatus 400 in fig. 4 in the third embodiment, and the working process of the apparatus shown in fig. 4 is the same as the implementation steps of each embodiment of the method shown in fig. 1, and the same parts are not described again.
Therefore, in the present embodiment, the search terms uploaded by the client are received through the interaction unit 510; the search terms are matched against the label system of each application by the search processing unit 520; and the interaction unit 510 returns the relevant information of the matched application to the client for display, so that the application search server 500 greatly improves the search quality of the application search engine and enhances the user experience. For example, when the user searches for "Didi", the engine not only returns the exact application "Didi Chuxing" but also presents applications with similar functions, such as "Kuaidi Dache" and "Uber China".
In conclusion, the abstract of each application is obtained automatically and the search terms of each application are obtained in real time from users' historical application search logs, so that the short application texts are expanded and the application labels are updated dynamically. Meanwhile, the preset strategy, built around an effectively trained unsupervised LDA model, continuously improves the accuracy and recall rate of the application labels, so that a label system can be mined and created for each application, including newly released applications. This overcomes the problems of conventional application label systems that can only be annotated manually, such as heavy manual workload, low coverage and widespread cheating, and greatly improves the search quality of the application search engine and the user's search experience.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the application tag mining method, apparatus and application search method, server, according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims (26)

1. An application label mining method, comprising:
obtaining the abstract of each application;
acquiring search terms related to each application from the application search log;
excavating a label system of each application according to the abstract, the search terms and the preset strategy of each application;
the method for mining the label system of each application according to the abstract, the search term and the preset strategy of each application comprises the following steps:
obtaining a training corpus set according to the abstract and the search word of each application;
inputting the training corpus set into an LDA model for training to obtain an application-theme probability distribution result and a theme-keyword probability distribution result output by the LDA model;
calculating to obtain a label system of each application according to the application-theme probability distribution result and the theme-keyword probability distribution result;
the step of obtaining the label system of each application by calculation according to the application-topic probability distribution result and the topic-keyword probability distribution result further comprises:
taking the keywords with the number of the first fifth preset threshold value correspondingly selected by each application as a first-stage label system of the application;
for the first-stage label system of each application, calculating a semantic relation value between each keyword in the first-stage label system of the application and the abstract of the application; for each keyword, taking the product of the semantic relation value corresponding to the keyword and the probability of the keyword relative to the application as the correction probability of the keyword relative to the application; and sorting all the keywords in the first-stage label system of the application from large to small according to the correction probability of the application, and selecting the first K keywords to form the label system of the application.
2. The method of claim 1, wherein the obtaining a corpus set according to the abstracts and the search terms of each application comprises:
for each application, extracting a first segment of characters or characters of a preset number of sentences from the abstract of the application; the extracted characters and the applied search terms are jointly used as the original corpus of the application;
the original linguistic data of each application form an original linguistic data set; and preprocessing the original corpus set to obtain a training corpus set.
3. The method of claim 2, wherein the pre-processing the original corpus comprises:
in the original corpus set,
for each original corpus, performing word segmentation processing on the original corpus to obtain word segmentation results containing a plurality of lexical items; searching phrases formed by adjacent terms in the word segmentation result; and reserving the phrases, the lexical items belonging to nouns and the lexical items belonging to verbs in the word segmentation result as the corresponding reserved keywords of the original corpus.
4. The method of claim 3, wherein said finding phrases comprised of adjacent terms in said word segmentation result comprises:
calculating the cPMId value of every two adjacent terms in the word segmentation result, and determining that the two adjacent terms form a phrase when the cPMId value of the two adjacent terms is larger than a first preset threshold value.
5. The method of claim 4, wherein the preprocessing the original corpus further comprises:
using the keywords correspondingly retained for the original corpus of each application as the first-stage training corpus of the application;
the first-stage training corpora of each application form a first-stage training corpus set; and carrying out data cleaning on the keywords in the first-stage corpus set.
6. The method as claimed in claim 5, wherein the data cleansing of the keywords in the first-stage corpus comprises:
in the first-stage corpus set,
for each first-stage training corpus, calculating a TF-IDF value of each keyword in the first-stage training corpus; and deleting the keywords with the TF-IDF values higher than the second preset threshold and/or lower than the third preset threshold.
7. The method of claim 6, wherein the preprocessing the original corpus further comprises:
taking the remaining keywords of the first-stage training corpus of each application after data cleaning as the second-stage training corpus of the application;
for each applied second-stage corpus, when a keyword in the applied second-stage corpus appears in the name of the application, repeating the keyword in the applied second-stage corpus for a fourth preset threshold number of times to obtain the applied corpus;
the corpus of each application constitutes a corpus set.
8. The method of any one of claims 1-7, wherein the calculating a label system for each application based on the application-topic probability distribution result and the topic-keyword probability distribution result comprises:
calculating to obtain an application-keyword probability distribution result according to the application-topic probability distribution result and the topic-keyword probability distribution result;
and according to the application-keyword probability distribution result, for each application, sorting the keywords according to the probability of the application from large to small, and selecting the keywords with the number of the top fifth preset threshold value.
9. The method of any of claims 1-7, wherein said calculating an application-keyword probability distribution result from the application-topic probability distribution result and the topic-keyword probability distribution result comprises:
for each application, obtaining the probability of each theme about the application according to the application-theme probability distribution result;
for each topic, obtaining the probability of each keyword about the topic according to the topic-keyword probability distribution result;
for each keyword, taking the product of the probability of the keyword about a topic and the probability of the topic about an application as the probability of the keyword based on that topic about the application; and taking the sum of the probabilities of the keyword based on the respective topics as the probability of the keyword about the application.
10. The method of claim 1, wherein calculating the semantic relationship value between each keyword in the first stage tagging scheme for the application and the abstract of the application comprises:
calculating word vectors of the keywords, and calculating the word vectors of each lexical item in the sentences of the application abstract with the preset number;
calculating cosine similarity between the word vector of the keyword and the word vector of each term, and taking the product of each cosine similarity and the weight of the sentence in which the corresponding term is located as a semantic relation value of the keyword and the corresponding term;
and taking the sum of the semantic relation values of the keyword and each term as the semantic relation value between the keyword and the abstract of the application.
11. The method of any one of claims 1-7, wherein the calculating a label system for each application based on the application-topic probability distribution result and the topic-keyword probability distribution result further comprises:
taking the keywords correspondingly selected by each application as a second-stage label system of the application;
for the second-stage label system of each application, acquiring a search word set related to the downloading operation of the application from an application search log, and counting the DF value of each keyword in the second-stage label system of the application in the search word set; for each keyword, increasing the multiple of the DF value on the basis of the probability of the keyword about the application to obtain the secondary correction probability of the keyword about the application; and sorting all the keywords in the second-stage label system of the application from large to small according to the secondary correction probability of the application, and selecting the first K keywords to form the label system of the application.
12. The method of claim 11, wherein the selecting the top K keywords to form the tag system of the application comprises:
acquiring the quarterly download count of the application from the application search log;
selecting the first K keywords according to the quarterly download count of the application to form a label system of the application; where the value of K is a piecewise-linear function of the application's quarterly download count.
13. An application search method, comprising:
receiving a search word uploaded by a client;
matching in a label system of each application according to the search terms;
when the search word is matched with a keyword in an application label system, returning the relevant information of the application to the client for displaying;
the label system for each application is constructed by the method of any one of claims 1-12.
14. An application label mining apparatus, comprising:
the information acquisition unit is suitable for acquiring the abstract of each application; acquiring search terms related to each application from the application search log;
the label system construction unit is suitable for excavating a label system of each application according to the abstract, the search terms and the preset strategy of each application;
the label system construction unit is suitable for obtaining a training corpus set according to the abstracts and the search terms of all applications; inputting the training corpus set into an LDA model for training to obtain an application-theme probability distribution result and a theme-keyword probability distribution result output by the LDA model; calculating to obtain a label system of each application according to the application-theme probability distribution result and the theme-keyword probability distribution result;
the tag system construction unit is also suitable for taking the keywords of the first fifth preset threshold number correspondingly selected by each application as a first-stage tag system of the application; for the first-stage label system of each application, calculating a semantic relation value between each keyword in the first-stage label system of the application and the abstract of the application; for each keyword, taking the product of the semantic relation value corresponding to the keyword and the probability of the keyword relative to the application as the correction probability of the keyword relative to the application; and sorting all the keywords in the first-stage label system of the application from large to small according to the correction probability of the application, and selecting the first K keywords to form the label system of the application.
15. The apparatus of claim 14, wherein,
the tag system construction unit is suitable for extracting a first segment of characters or characters of a preset number of sentences from the abstract of each application; the extracted characters and the applied search terms are jointly used as the original corpus of the application; the original linguistic data of each application form an original linguistic data set; and preprocessing the original corpus set to obtain a training corpus set.
16. The apparatus of claim 15, wherein,
the tag system construction unit is suitable for performing word segmentation processing on each original corpus in the original corpus set to obtain a word segmentation result containing a plurality of terms; searching phrases formed by adjacent terms in the word segmentation result; and reserving the phrases, the lexical items belonging to nouns and the lexical items belonging to verbs in the word segmentation result as the corresponding reserved keywords of the original corpus.
17. The apparatus of claim 16, wherein,
and the tag system construction unit is suitable for calculating the cPMId value of every two adjacent terms in the word segmentation result, and when the cPMId value of the two adjacent terms is larger than a first preset threshold value, determining that the two adjacent terms form a phrase.
18. The apparatus of claim 17, wherein,
the label system construction unit is also suitable for taking the keywords correspondingly retained for the original corpus of each application as the first-stage training corpus of the application; the first-stage training corpora of each application form a first-stage training corpus set; and carrying out data cleaning on the keywords in the first-stage corpus set.
19. The apparatus of claim 18, wherein,
the label system construction unit is suitable for calculating a TF-IDF value of each keyword in the first-stage corpus for each first-stage corpus in the first-stage corpus set; and deleting the keywords with the TF-IDF values higher than the second preset threshold and/or lower than the third preset threshold.
20. The apparatus of claim 19, wherein,
the label system construction unit is also suitable for taking the residual keywords of the first-stage training corpus of each application after data cleaning as the second-stage training corpus of the application; for each applied second-stage corpus, when a keyword in the applied second-stage corpus appears in the name of the application, repeating the keyword in the applied second-stage corpus for a fourth preset threshold number of times to obtain the applied corpus; the corpus of each application constitutes a corpus set.
21. The apparatus of any one of claims 14-20,
the label system construction unit is suitable for calculating an application-keyword probability distribution result according to the application-theme probability distribution result and the theme-keyword probability distribution result; and according to the application-keyword probability distribution result, for each application, sorting the keywords according to the probability of the application from large to small, and selecting the keywords with the number of the top fifth preset threshold value.
22. The apparatus of any one of claims 14-20,
the label system construction unit is suitable for obtaining, for each application, the probability of each topic about the application according to the application-topic probability distribution result; for each topic, obtaining the probability of each keyword about the topic according to the topic-keyword probability distribution result; for each keyword, taking the product of the probability of the keyword about a topic and the probability of the topic about an application as the probability of the keyword based on that topic about the application; and taking the sum of the probabilities of the keyword based on the respective topics as the probability of the keyword about the application.
23. The apparatus of claim 14, wherein,
the tag system construction unit is suitable for calculating word vectors of the keywords and calculating the word vectors of each term in the sentences of the application abstract with the preset number; calculating cosine similarity between the word vector of the keyword and the word vector of each term, and taking the product of each cosine similarity and the weight of the sentence in which the corresponding term is located as a semantic relation value of the keyword and the corresponding term; and taking the sum of the semantic relation values of the keyword and each term as the semantic relation value between the keyword and the abstract of the application.
24. The apparatus of any one of claims 14-20,
the label system construction unit is also suitable for taking the keywords correspondingly selected by each application as a second-stage label system of the application; for the second-stage label system of each application, acquiring a search word set related to the downloading operation of the application from an application search log, and counting the DF value of each keyword in the second-stage label system of the application in the search word set; for each keyword, increasing the multiple of the DF value on the basis of the probability of the keyword about the application to obtain the secondary correction probability of the keyword about the application; and sorting all the keywords in the second-stage label system of the application from large to small according to the secondary correction probability of the application, and selecting the first K keywords to form the label system of the application.
25. The apparatus of claim 24, wherein,
the label system construction unit is suitable for acquiring the quarterly download count of the application from the application search log; selecting the first K keywords according to the quarterly download count of the application to form a label system of the application; where the value of K is a piecewise-linear function of the application's quarterly download count.
26. An application search server, comprising:
the interaction unit is suitable for receiving the search terms uploaded by the client;
the search processing unit is suitable for matching in a label system of each application according to the search terms;
the interaction unit is also suitable for returning the relevant information of the application to the client side for displaying when the search word is matched with the keyword in the label system of the application;
the application search server further comprises an application label mining device according to any one of claims 14 to 25, wherein the label system of each application is constructed by the application label mining device.
CN201611229785.5A 2016-12-27 2016-12-27 Application label mining method and device, application searching method and server Active CN106682169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611229785.5A CN106682169B (en) 2016-12-27 2016-12-27 Application label mining method and device, application searching method and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611229785.5A CN106682169B (en) 2016-12-27 2016-12-27 Application label mining method and device, application searching method and server

Publications (2)

Publication Number Publication Date
CN106682169A CN106682169A (en) 2017-05-17
CN106682169B true CN106682169B (en) 2020-09-18

Family

ID=58871712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611229785.5A Active CN106682169B (en) 2016-12-27 2016-12-27 Application label mining method and device, application searching method and server

Country Status (1)

Country Link
CN (1) CN106682169B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704572B (en) * 2017-09-30 2021-07-13 北京奇虎科技有限公司 Method and device for mining creation angle of character entity
CN110019068B (en) * 2017-10-19 2023-04-28 阿里巴巴集团控股有限公司 Log text processing method and device
CN107944946B (en) * 2017-11-03 2020-10-16 清华大学 Commodity label generation method and device
CN110147426B (en) * 2017-12-01 2021-08-13 北京搜狗科技发展有限公司 Method for determining classification label of query text and related device
CN108304457A (en) * 2017-12-22 2018-07-20 努比亚技术有限公司 A kind of application mask method, server and computer readable storage medium
CN110209763A (en) * 2018-02-12 2019-09-06 北京京东尚科信息技术有限公司 Data processing method, device and computer readable storage medium
CN108763194B (en) * 2018-04-27 2022-09-27 阿里巴巴(中国)有限公司 Method and device for applying label labeling, storage medium and computer equipment
CN109961091B (en) * 2019-03-01 2021-04-20 杭州叙简科技股份有限公司 Self-learning accident text label and abstract generation system and method thereof
CN110263153B (en) * 2019-05-15 2021-04-30 北京邮电大学 Multi-source information-oriented mixed text topic discovery method
CN112052330B (en) * 2019-06-05 2021-11-26 上海游昆信息技术有限公司 Application keyword distribution method and device
CN110347977A (en) * 2019-06-28 2019-10-18 太原理工大学 A kind of news automated tag method based on LDA model
CN110598070B (en) * 2019-09-09 2022-01-25 腾讯科技(深圳)有限公司 Application type identification method and device, server and storage medium
CN113625918A (en) * 2020-05-08 2021-11-09 百度在线网络技术(北京)有限公司 Screen display method, device, terminal and storage medium
CN112527769B (en) * 2020-12-09 2023-05-16 重庆大学 Automatic quality assurance framework for software change log generation method
CN113609380B (en) * 2021-07-12 2024-03-26 北京达佳互联信息技术有限公司 Label system updating method, searching device and electronic equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760149A (en) * 2012-04-05 2012-10-31 中国人民解放军国防科学技术大学 Automatic annotating method for subjects of open source software
CN104133877A (en) * 2014-07-25 2014-11-05 百度在线网络技术(北京)有限公司 Software label generation method and device
CN105787053A (en) * 2016-02-26 2016-07-20 维沃移动通信有限公司 Application pushing method and terminal

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Appropriately Incorporating Statistical Significance in PMI; Om P. Damani et al.; Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing; 2013-10-21; full text *
A Text Feature Selection Method Based on a Weighted LDA Model and Multiple Granularities; Li Xiangdong et al.; New Technology of Library and Information Service; 2015-05-25 (No. 5); pp. 42-49 *

Also Published As

Publication number Publication date
CN106682169A (en) 2017-05-17

Similar Documents

Publication Publication Date Title
CN106682169B (en) Application label mining method and device, application searching method and server
CN106682170B (en) Application search method and device
CN106709040B (en) Application search method and server
CN110543574B (en) Knowledge graph construction method, device, equipment and medium
CN106970991B (en) Similar application identification method and device, application search recommendation method and server
CN109657054B (en) Abstract generation method, device, server and storage medium
US8095547B2 (en) Method and apparatus for detecting spam user created content
CN106960030B (en) Information pushing method and device based on artificial intelligence
CN105843962A (en) Information processing and displaying methods, information processing and displaying devices as well as information processing and displaying system
CN107657056B (en) Method and device for displaying comment information based on artificial intelligence
US20160125028A1 (en) Systems and methods for query rewriting
CN107544988B (en) Method and device for acquiring public opinion data
CN102314440B (en) Utilize the method and system in network operation language model storehouse
CN107798622B (en) Method and device for identifying user intention
CN104951435A (en) Method and device for displaying keywords intelligently during chatting process
CN105550217B (en) Scene music searching method and scene music searching device
CN113204953A (en) Text matching method and device based on semantic recognition and device readable storage medium
CN112989824A (en) Information pushing method and device, electronic equipment and storage medium
CN107766398B (en) Method, apparatus and data processing system for matching an image with a content item
CN111813993A (en) Video content expanding method and device, terminal equipment and storage medium
CN106156262A (en) A kind of search information processing method and system
CN107665442B (en) Method and device for acquiring target user
CN110674388A (en) Mapping method and device for push item, storage medium and terminal equipment
CN111882224A (en) Method and device for classifying consumption scenes
CN111400464B (en) Text generation method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant