CN105677695A - Method for calculating similarity of mobile applications based on content - Google Patents


Info

Publication number
CN105677695A
Authority
CN
China
Prior art keywords
app
similarity
weight
key word
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510776878.9A
Other languages
Chinese (zh)
Other versions
CN105677695B (en)
Inventor
吴明晖
刘泽民
金苍宏
应晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yuancheng Technology Co Ltd
Original Assignee
Hangzhou Yuancheng Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yuancheng Technology Co Ltd
Publication of CN105677695A
Application granted
Publication of CN105677695B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/243 Natural language query formulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a content-based method for calculating the similarity of mobile applications. The method comprises the following steps: acquiring a large amount of mobile application information, including application names, types, descriptions and sizes, and extracting it; performing word segmentation on the application descriptions; after segmentation, dividing the content into two parts, one used as the training corpus for a word2vec model and the other stored as a set of files, subjected to TF-IDF calculation and then stored in an HBase data warehouse; and querying and calculating application similarity. The method has the following beneficial effects: it responds quickly to app similarity queries; because it uses app features based on content and description, apps are represented well and high accuracy is obtained; and the accuracy of app search and recommendation can be increased.

Description

A content-based method for calculating mobile application similarity
Technical field
The present invention relates to the field of information retrieval and recommender systems, and in particular to a method that uses information retrieval to calculate the similarity of mobile applications from the content of their features.
Background
With the growing prosperity of the mobile Internet and the introduction of "Internet+", the convenience and efficiency of the mobile Internet have become widely recognized. The O2O (Online To Offline) concept and the many applications that link online and offline services have not only accelerated the buying and selling of goods but also greatly enriched people's lives.
In an "Internet+" lifestyle, the huge number of mobile applications (apps for short) plays a pivotal role. The major domestic mobile application markets strongly support the public's demand for apps. In these markets, users often search for the apps they need. With so many apps available, however, ordinary non-expert users frequently get search results that are not what they wanted. A method is therefore urgently needed that, while returning the queried app, also offers the user similar apps, so that rough or imprecise queries can be satisfied. In a recommender system, the same method makes it possible to proactively recommend mobile applications similar to those already installed on the user's terminal; recommending according to the user's preferences can improve recommendation accuracy.
Existing similarity calculations for applications include those based on underlying code and interfaces. Such code-level calculations cannot directly reflect the semantic needs of ordinary users. Moreover, released mobile apps are distributed as complete .apk files, so the details of their underlying code are not available. These approaches therefore do not meet users' current needs.
There are also similarity calculation methods based on app content. Most of them rely on the app's description, because the description is relatively authoritative data about the app itself. However, existing methods generally process the description with a bag-of-words model. A bag-of-words model ignores word order and therefore loses much of the context of each word. When the similarity between vectors is calculated, two near-synonyms, for example, are treated as different words, which is likely to lower the similarity and introduce a large error.
In addition, most existing methods do not take other information, such as the app's name, category and size, into account when calculating similarity. Some methods also include information such as user reviews. In our observation, the quality of app reviews is very poor, and reviews generally do not reflect the actual content of the app.
In view of these defects in the prior art, a solution is needed that overcomes them so that the similarity calculation relies more deeply on the app's feature information.
Summary of the invention
The object of the invention is to provide a similarity calculation method for mobile apps that better finds the apps most similar to a given app in a massive app repository, thereby improving the accuracy of app search and the success rate of recommendation.
To achieve this object, the technical solution of the invention is as follows:
A content-based method for calculating mobile application similarity, comprising the following steps:
S10. Crawl a large amount of app data, organize the data into features, save the organized features in a database, and build a feature database for querying;
S20. According to the feature information of the app to be queried, query and calculate in the feature database to find apps similar to it; the feature information of the app to be queried is either provided by the user or retrieved from the feature database.
Further, step S10 comprises the following steps:
S101. Crawl a large amount of app data, structure it, and store it in a database;
S102. Write the description of each app in the database to its own file, then segment each file into words;
S103. Make two copies of the segmented data: merge one copy into a single complete corpus and train word2vec on it; keep the other copy in the original per-app file structure, calculate TF-IDF across the documents, and obtain the weight of every keyword in each document;
S104. Write the calculated keywords and their weights into HBase, where each row corresponds to an app package name, each column corresponds to a keyword, and each value is the keyword's weight, thereby building the feature database for querying;
S105. Calculate the similarity of four features, namely app name, type, description and size, and combine them with their respective weights to give the final similarity of the algorithm.
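The TF-IDF weighting in S103 can be illustrated with a minimal sketch. The patent does not state which TF-IDF variant is used, so the formula here (term count divided by document length, multiplied by log(N / document frequency)) is an assumption, and the package names and tokens are hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(docs):
    """docs maps a package name to its list of segmented tokens.

    Returns {package: {term: weight}} with tf = count / doc_length and
    idf = log(N / df). This is one common variant, assumed for illustration.
    """
    n_docs = len(docs)
    # Document frequency: number of descriptions that contain each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for package, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        weights[package] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights

# Hypothetical, already-segmented descriptions keyed by package name.
docs = {
    "com.example.maps": ["map", "navigation", "route", "map"],
    "com.example.taxi": ["taxi", "route", "payment"],
    "com.example.chat": ["chat", "message", "payment"],
}
weights = tfidf_keywords(docs)
```

In this toy corpus, a term that is frequent in one description and absent from the others ("map") receives a higher weight than a term shared across descriptions ("route").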
Further, step S20 comprises the following steps:
S201. Obtain the package name of the app to be queried as its unique identifier;
S202. In the feature database in HBase, perform a row-wise query by the app's package name to find all keywords of the app;
S203. For each keyword, use word2vec to find its top K near-synonyms and expand the keyword set with them;
S204. Merge the weights of the expanded keywords and select the top N keywords as the absolute keywords of the app;
S205. Using the absolute keywords, query the feature database in HBase by column to retrieve all apps that contain those keywords, and merge the weights for each app;
S206. Calculate the similarity of name, category and size between each of these apps and the app to be queried, then combine the description, name, category and size similarities according to their respective weights to give the similarity between each app and the app to be queried;
S207. Sort the apps by combined weight in descending order to build the similarity ranking; the greater the weight, the more similar the app.
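Steps S203 and S204 can be sketched as follows. A plain dictionary of near-synonyms (`neighbours`) stands in for a trained word2vec model, and the rule that a synonym inherits the keyword's weight scaled by its similarity is an assumption, since the text does not specify how the expanded weights are computed. All words and numbers are hypothetical.

```python
from collections import defaultdict

def expand_keywords(keywords, neighbours, k, n):
    """Expand weighted keywords with up to k near-synonyms each, merge
    duplicate words by summing their weights, and keep the n heaviest."""
    merged = defaultdict(float)
    for word, weight in keywords.items():
        merged[word] += weight
        # neighbours[word] is a list of (synonym, similarity), best first.
        for synonym, similarity in neighbours.get(word, [])[:k]:
            # Assumption: a synonym inherits weight * similarity.
            merged[synonym] += weight * similarity
    # Sort by weight descending, breaking ties alphabetically.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:n])

# Hypothetical keyword weights for the queried app.
keywords = {"taxi": 0.6, "route": 0.4}
# Hypothetical stand-in for word2vec nearest-neighbour lookups.
neighbours = {
    "taxi": [("cab", 0.9), ("ride", 0.7), ("bus", 0.4)],
    "route": [("navigation", 0.8), ("ride", 0.5)],
}
absolute = expand_keywords(keywords, neighbours, k=2, n=3)
```

Here "ride" is reached from both original keywords, so its merged weight overtakes "taxi"; this is the effect of superimposing the weights of identical words.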
The beneficial effects of the invention are as follows: it provides a similarity calculation method for mobile apps that better finds the apps most similar to a given app in a massive app repository, thereby improving the accuracy of app search and the success rate of recommendation. Specifically:
1) It uses the app's description together with word2vec to find near-synonyms, which not only reflects the concrete semantic content of the app well but also, by drawing on the context within the descriptions, better uncovers near-synonym features;
2) It combines the app's name, type, size and description to make full use of the app's features, while excluding low-quality information such as user reviews, so the result is more accurate and comprehensive;
3) It uses HBase as the data warehouse for queries, so massive app data can be processed faster.
Brief description of the drawings
Fig. 1 is a flow chart of an embodiment of the content-based method for calculating mobile application similarity of the present invention.
Detailed description of the invention
To aid understanding of the invention, preferred embodiments are described below. It should be understood that these descriptions only further illustrate the features and advantages of the invention and do not limit its claims.
The invention provides a content-based method for calculating mobile application similarity that relies on features such as an app's name, description, type and size to find the apps most similar to it, and specifically comprises the following steps:
S1. Crawl a large amount of app data, organize the data into features, save the organized features in a database, and build a feature database for querying;
S2. According to the feature information of the app to be queried, query and calculate in the feature database to find apps similar to it; the feature information of the app to be queried is either provided by the user or retrieved from the feature database.
The above is described in further detail below with reference to a specific embodiment.
Step S10: crawl the relevant information of a large number of apps from the network, including each app's name, category, size and description, and save this information in a relational database.
Step S20: extract the descriptions of all apps and process them in two separate ways, comprising the following steps:
S201 and S202: obtain all app data from the database and read out each app's name, type, size and description;
S203: split the app descriptions into independent documents, one per app; segment each document into words after removing stop words and adding reserved words; then calculate TF-IDF (term frequency-inverse document frequency) values over the whole document set to obtain the keywords of each document and their weights;
S204: combine all the segmented app descriptions into one large document and use it as the training corpus for word2vec;
S205: store the results of step S203 in the HBase data warehouse for description-based retrieval. The package name of the app corresponding to each document is used as the HBase rowkey, and the keywords are used as the HBase columns. When a processed app description is stored, its package name is the rowkey, each of its keywords is a column, and the keyword's weight is the value of that column. This layout makes it fast to look up the information for a given app and also allows the keywords to be expanded dynamically for more convenient searching;
Step S30: find a method for adjusting the weights according to the relevance of the results, so that the app's name, type, size and description can be combined; use multiple groups of test cases to calculate the weights of these four attributes that yield the best similar apps.
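The table layout described in S205 can be pictured with an in-memory stand-in. The nested dictionary below only mimics the rowkey / column / value shape; it is not an HBase client, and the package names and weights are hypothetical. In a real deployment the column-wise lookup would be a table scan restricted to one column, which is not shown here.

```python
# In-memory stand-in for the HBase table: rowkey -> {column: value}.
# Rowkey = app package name, column = keyword, value = keyword weight.
feature_table = {
    "com.example.maps": {"map": 0.55, "route": 0.10},
    "com.example.taxi": {"taxi": 0.37, "route": 0.14},
    "com.example.chat": {"chat": 0.37, "message": 0.37},
}

def keywords_of(table, package):
    """Row-wise lookup: all keywords and weights stored for one app."""
    return dict(table.get(package, {}))

def apps_with(table, keyword):
    """Column-wise lookup: every app whose row has this keyword column."""
    return {pkg: row[keyword] for pkg, row in table.items() if keyword in row}

row = keywords_of(feature_table, "com.example.taxi")
column = apps_with(feature_table, "route")
```

The row-wise lookup corresponds to finding the keywords of the app being queried, and the column-wise lookup corresponds to finding candidate apps that share a keyword.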
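The text does not say how the optimal weights in step S30 are found. One simple possibility, shown purely as an assumption, is an exhaustive grid search: try every weight combination that sums to 1 and keep the one that ranks the expected app first in the most labelled cases. The function names and the case format are hypothetical.

```python
from itertools import product

def weight_grid(step=10):
    """Yield every (l1, l2, l3, l4) on a 1/step grid that sums to 1."""
    for a, b, c in product(range(step + 1), repeat=3):
        d = step - a - b - c
        if d >= 0:
            yield (a / step, b / step, c / step, d / step)

def best_weights(cases, step=10):
    """cases is a list of (candidates, expected) pairs. candidates maps an
    app to its (name, category, size, description) similarities; expected
    is the app a human judged most similar. Returns the first weights on
    the grid that rank the expected app first in the most cases."""
    def hits(weights):
        count = 0
        for candidates, expected in cases:
            top = max(candidates, key=lambda app: sum(
                w * s for w, s in zip(weights, candidates[app])))
            count += top == expected
        return count
    return max(weight_grid(step), key=hits)

# One hypothetical labelled case: app "b" was judged the most similar.
cases = [({"a": (0.9, 0.0, 0.5, 0.2), "b": (0.1, 1.0, 0.5, 0.9)}, "b")]
best = best_weights(cases)
```

With realistic data, many cases would be needed before the search distinguishes between weightings; a single case, as here, is satisfied by many grid points.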
Step S40: once the data is ready, similar apps can be retrieved, which comprises the following steps:
S401: obtain the package name of the app to be retrieved;
S402: using that package name as the rowkey, retrieve the corresponding row from HBase and read all of its keywords and their weights;
S403: use the trained word2vec model to expand every keyword with its synonyms, calculate a weighted weight for each expanded word, then merge identical words and add their weights together;
S404: using the expanded keywords and their weights, normalize the weights within each keyword's column, then look up the corresponding apps column-wise in the HBase data warehouse. Each keyword corresponds to multiple apps; calculate and merge the weight of each app, sort in descending order, and select the apps most similar by description;
S405: for the apps obtained in S404, extract their names, types, sizes and other information;
S406: use edit distance to calculate the similarity between each app's name and the name of the app to be retrieved;
S407: use a case-by-case method to calculate the similarity between each app's type and the type of the app to be retrieved;
S408: calculate the similarity between the size of each app and the size of the app to be retrieved, using the formula:
$$\mathrm{Similarity} = 1 - \frac{|\mathrm{size}_x - \mathrm{size}_a|}{\mathrm{size}_{max} - \mathrm{size}_{min}}$$
where $a$ is the app to be retrieved, $x$ is each of the apps found in S404 to be similar by description, $\mathrm{size}_{max}$ is the size of the largest of these similar apps, and $\mathrm{size}_{min}$ is the size of the smallest of them.
S409: combine the similarities calculated for the app's name, type, size and description, weighted by each attribute's weight, into a final similarity value, given by the following formula:
$$\mathrm{Similarity} = \lambda_1 \mathrm{Sim}_{name} + \lambda_2 \mathrm{Sim}_{category} + \lambda_3 \mathrm{Sim}_{size} + \lambda_4 \mathrm{Sim}_{description}$$
where name refers to the app name, category to the app type, size to the app size, and description to the app description; $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the weights of the name, type, size and description, respectively, when the individual similarities are combined, and $\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1$.
The results are then sorted and filtered by Similarity value to obtain the final one or more most similar apps.
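Steps S406 to S409 can be sketched together as below. Several details are assumptions not given in the text: the name similarity normalizes the edit distance by the longer name's length, the type similarity is simply 1 for an identical category and 0 otherwise, and the size similarity returns 1 when all candidates have the same size. The field names, apps and weights are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def name_similarity(a, b):
    # Assumption: normalize the distance by the longer name's length.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest

def size_similarity(size_x, size_a, size_max, size_min):
    # Assumption: if all candidates share one size, treat them as equal.
    if size_max == size_min:
        return 1.0
    return 1 - abs(size_x - size_a) / (size_max - size_min)

def rank(query, candidates, weights):
    """Score each candidate with the weighted sum and sort descending."""
    sizes = [c["size"] for c in candidates]
    size_max, size_min = max(sizes), min(sizes)
    l1, l2, l3, l4 = weights
    scored = []
    for c in candidates:
        # Assumption: type similarity is 1 for an identical category, else 0.
        category_sim = 1.0 if c["category"] == query["category"] else 0.0
        score = (l1 * name_similarity(c["name"], query["name"])
                 + l2 * category_sim
                 + l3 * size_similarity(c["size"], query["size"],
                                        size_max, size_min)
                 + l4 * c["description_sim"])
        scored.append((score, c["package"]))
    return sorted(scored, reverse=True)

query = {"name": "QuickTaxi", "category": "travel", "size": 30}
candidates = [
    {"package": "com.example.cab", "name": "QuickCab", "category": "travel",
     "size": 28, "description_sim": 0.8},
    {"package": "com.example.chat", "name": "ChatNow", "category": "social",
     "size": 60, "description_sim": 0.3},
]
ranking = rank(query, candidates, weights=(0.2, 0.2, 0.1, 0.5))
```

Because size_max and size_min are taken over the candidate apps only, the size similarity can fall below 0 when the queried app's size lies far outside that range; the sketch leaves the value unclamped, as the formula does.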
The above embodiment is described only to help in understanding the method of the invention and its core idea. It should be noted that those skilled in the art can make improvements and modifications to the invention without departing from its principles, and such improvements and modifications also fall within the scope of protection of the claims.

Claims (3)

1. A content-based method for calculating mobile application similarity, characterized by comprising the following steps:
S10. Crawl a large amount of app data, organize the data into features, save the organized features in a database, and build a feature database for querying;
S20. According to the feature information of the app to be queried, query and calculate in the feature database to find apps similar to it; the feature information of the app to be queried is either provided by the user or retrieved from the feature database.
2. The content-based method for calculating mobile application similarity according to claim 1, characterized in that step S10 comprises the following steps:
S101. Crawl a large amount of app data, structure it, and store it in a database;
S102. Write the description of each app in the database to its own file, then segment each file into words;
S103. Make two copies of the segmented data: merge one copy into a single complete corpus and train word2vec on it; keep the other copy in the original per-app file structure, calculate TF-IDF across the documents, and obtain the weight of every keyword in each document;
S104. Write the calculated keywords and their weights into HBase, where each row corresponds to an app package name, each column corresponds to a keyword, and each value is the keyword's weight, thereby building the feature database for querying;
S105. Calculate the similarity of four features, namely app name, type, description and size, and combine them with their respective weights to give the final similarity of the algorithm.
3. The content-based method for calculating mobile application similarity according to claim 1, characterized in that step S20 comprises the following steps:
S201. Obtain the package name of the app to be queried as its unique identifier;
S202. In the feature database in HBase, perform a row-wise query by the app's package name to find all keywords of the app;
S203. For each keyword, use word2vec to find its top K near-synonyms and expand the keyword set with them;
S204. Merge the weights of the expanded keywords and select the top N keywords as the absolute keywords of the app;
S205. Using the absolute keywords, query the feature database in HBase by column to retrieve all apps that contain those keywords, and merge the weights for each app;
S206. Calculate the similarity of name, category and size between each of these apps and the app to be queried, then combine the description, name, category and size similarities according to their respective weights to give the similarity between each app and the app to be queried;
S207. Sort the apps by combined weight in descending order to build the similarity ranking; the greater the weight, the more similar the app.
CN201510776878.9A 2015-09-28 2015-11-13 Method for calculating similarity of mobile applications based on content Active CN105677695B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510626874 2015-09-28
CN2015106268742 2015-09-28

Publications (2)

Publication Number Publication Date
CN105677695A true CN105677695A (en) 2016-06-15
CN105677695B CN105677695B (en) 2019-03-08

Family

ID=56946915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510776878.9A Active CN105677695B (en) 2015-09-28 2015-11-13 Method for calculating similarity of mobile applications based on content

Country Status (1)

Country Link
CN (1) CN105677695B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080086511A1 (en) * 2006-10-10 2008-04-10 Canon Kabushiki Kaisha Image display controlling apparatus, method of controlling image display, and storage medium
CN104424307A (en) * 2013-09-04 2015-03-18 腾讯科技(深圳)有限公司 Intelligent terminal application classifying method, system and intelligent terminal,
CN103530339A (en) * 2013-10-08 2014-01-22 北京百度网讯科技有限公司 Mobile application information push method and device
CN104778178A (en) * 2014-01-13 2015-07-15 腾讯科技(深圳)有限公司 Application classification method, application classification device and service server
CN103955536A (en) * 2014-05-15 2014-07-30 深圳市中兴移动通信有限公司 Classification method and device of applications
CN104750798A (en) * 2015-03-19 2015-07-01 腾讯科技(深圳)有限公司 Application program recommendation method and device
CN104866526A (en) * 2015-04-21 2015-08-26 惠州Tcl移动通信有限公司 Intelligent terminal and method for recommending applications thereof

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108255522A (en) * 2016-12-27 2018-07-06 北京金山云网络技术有限公司 A kind of application program sorting technique and device
CN108319449A (en) * 2017-01-16 2018-07-24 北京金山云网络技术有限公司 A kind of application architecture determines method and device
CN108319449B (en) * 2017-01-16 2021-07-20 北京金山云网络技术有限公司 Application program architecture determining method and device
CN109002441A (en) * 2017-06-06 2018-12-14 阿里巴巴集团控股有限公司 Determination method, the exception of Apply Names similarity apply detection method and system
CN108170665A (en) * 2017-11-29 2018-06-15 有米科技股份有限公司 Keyword expanding method and device based on comprehensive similarity
CN108170664A (en) * 2017-11-29 2018-06-15 有米科技股份有限公司 Keyword expanding method and device based on emphasis keyword
CN108182201A (en) * 2017-11-29 2018-06-19 有米科技股份有限公司 Application extension method and apparatus based on emphasis keyword
CN108182201B (en) * 2017-11-29 2020-06-30 有米科技股份有限公司 Application expansion method and device based on key keywords
CN108170665B (en) * 2017-11-29 2021-06-04 有米科技股份有限公司 Keyword expansion method and device based on comprehensive similarity
CN108804492A (en) * 2018-03-27 2018-11-13 优视科技新加坡有限公司 The method and device recommended for multimedia object
CN113868533A (en) * 2021-09-30 2021-12-31 北京达佳互联信息技术有限公司 Application search method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN105677695B (en) 2019-03-08

Similar Documents

Publication Publication Date Title
CN105677695A (en) Method for calculating similarity of mobile applications based on content
US8898155B2 (en) Personalized meta-search method and application terminal thereof
CN103593425B (en) Preference-based intelligent retrieval method and system
US8015172B1 (en) Method of conducting searches on the internet to obtain selected information on local entities and provide for searching the data in a way that lists local businesses at the top of the results
CN107180093B (en) Information searching method and device and timeliness query word identification method and device
US10540365B2 (en) Federated search
CN108154425B (en) Offline merchant recommendation method combining social network and location
CN103455487B (en) The extracting method and device of a kind of search term
CN107690637B (en) Connecting semantically related data using large-table corpus
CN107180045A (en) A kind of internet text contains the abstracting method of geographical entity relation
US11061893B2 (en) Multi-domain query completion
CN103049440A (en) Recommendation processing method and processing system for related articles
CN103823893A (en) User comment-based product search method and system
CN105023178B (en) A kind of electronic commerce recommending method based on ontology
WO2016101812A1 (en) Method and equipment for processing search data
CN107992563B (en) Recommendation method and system for user browsing content
CN103020074A (en) Object-level search technique based on main body
CN106407362A (en) Keyword information retrieval method and device
WO2020215437A1 (en) Approximate search method for spatial keyword query in electronic map
CN103064907A (en) System and method for topic meta search based on unsupervised entity relation extraction
CN106227762A (en) A kind of method for vertical search assisted based on user and system
WO2018010569A1 (en) Product chain object database establishment, and query methods, devices and systems therefor
CN105677664A (en) Compactness determination method and device based on web search
CN102214209A (en) Method and equipment for identifying homonymous information entities
CN102542022A (en) Theme search algorithm based on body

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant