CN108520012A

CN108520012A - Machine-learning-based method for mining mobile Internet user reviews

Info

Publication number: CN108520012A
Application number: CN201810233877.3A
Authority: CN
Inventors: 张莉; 黄新越; 蒋竞
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2018-03-21
Filing date: 2018-03-21
Publication date: 2018-09-11
Anticipated expiration: 2038-03-21
Also published as: CN108520012B

Abstract

The present invention proposes a machine-learning-based method for mining mobile Internet user reviews, belonging to the fields of requirements engineering and data mining. The method comprises: step 1, selecting the domain of interest and the data to be annotated; step 2, defining the problem types; step 3, choosing the approach and data for analyzing and comparing applications; step 4, preprocessing the data from steps 2 and 3; step 5, setting an attribute for the application type; and step 6, building a binary classifier for each problem type. By adding a data attribute, the method enriches the features available to the classifier; by using a cost-sensitive meta-classifier, it mitigates the data imbalance problem to some extent; and by configuring the parameters of the support vector machine appropriately, it optimizes classifier performance. The method improves the accuracy of review classification, can flexibly meet users' individual needs, and achieves better mining results than the current best user review classification technique.

Description

Machine-learning-based method for mining mobile Internet user reviews

Technical field

The present invention relates to the fields of requirements engineering and data mining, and in particular to a machine-learning-based method for mining the reviews that mobile Internet users write about software.

Background art

Software requirements engineering is an indispensable part of software development and evolution. It is generally divided into five aspects: requirements elicitation, requirements modeling, requirements specification, requirements verification, and requirements management. Among these, collecting the feedback that users produce after using software and mining the various requirements it contains is of great value to software developers.

With the development of the Internet era, the ways of obtaining user feedback have become more diverse. Especially since the Web 2.0 era, user-generated content (UGC) has become a new source of user feedback. Users' online reviews of software are a feedback data source that is huge in volume and rich in information, and are a typical example of UGC. Online reviews usually express requirements that users raise directly and on their own initiative about a product; their content is comparatively authentic and credible, and highly timely.

Before mobile applications became popular, there was already much research on mining user reviews on the Internet, for example opinion mining of automobile reviews and systems that tally product strengths and weaknesses in online customer reviews. After the rise of the mobile Internet, a large volume of online reviews was also produced on mobile devices. Moreover, mobile applications (commonly called apps) typically have short development cycles and iterate rapidly; at the same time, their user base is broader, user requirements are varied and change quickly, and user feedback is richer but also more haphazard. Over recent decades, a large body of research has attempted to mine useful information from text data, but mobile app reviews are short texts compared with the texts handled by traditional text mining, so short-text understanding techniques that differ from traditional text mining may be needed. Requirements mining based on user reviews is of great value to software engineers.

Mobile application distribution platforms (the Apple store, the Google Play store, and so on) let users easily search for, purchase, and install software applications. Download volumes are huge: the Apple store alone sees roughly one billion downloads per month. These platforms let users submit feedback on the applications they download, and ratings and reviews are all publicly visible. If put to good use, this feedback can become a channel of communication between users and developers, helping developers understand user requirements faster and more fully and take them into account in software iterations.

Many case studies show that user feedback contains highly valuable information, such as bug reports, feature requests, and user experience; for developers, the user reviews in app markets can help them better understand user requirements and improve software quality. Other studies analyze how to automatically classify reviews as useful or useless, how to extract user requirements from feedback, and so on. Research also shows that because the number of user reviews is very large and their language is very free-form, it is difficult to fully mine the useful information in reviews by manual inspection; user reviews need to be mined automatically.

Some existing studies (see references [1] to [3]) discuss how to classify the content of user reviews into several different types. Classifying user reviews can reveal user intent and help developers understand user requirements. In this area, some existing studies use coarse-grained categories; some classification techniques do not make full use of review attributes, and their evaluation results leave room for improvement. The processing and classification of reviews can therefore still be improved considerably.

References:

[1] Maalej W, Nabil H. Bug report, feature request, or simply praise? On automatically classifying app reviews[C]//2015 IEEE 23rd International Requirements Engineering Conference (RE). IEEE, 2015: 116-125.

[2] Panichella S, Di Sorbo A, Guzman E, et al. How can I improve my app? Classifying user reviews for software maintenance and evolution[C]//2015 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2015: 281-290.

[3] McIlroy S, Ali N, Khalid H, et al. Analyzing and automatically labelling the types of user issues that are raised in mobile app reviews[J]. Empirical Software Engineering, 2016, 21(3): 1067-1106.

Summary of the invention

In view of the abundance, volume, and diversity of user reviews on current mobile software distribution platforms, developers' need to mine user feedback, and the problems of existing methods such as coarse classification granularity and insufficient use of review attributes, the present invention proposes a machine-learning-based method for mining mobile Internet user reviews. The method addresses these problems and improves the accuracy of review classification, so that developers can understand user requirements faster and more fully.

The machine-learning-based method for mining mobile Internet user reviews comprises the following steps:

Step 1: Sample the user reviews of applications in the domain to be studied.

The method requires a manually annotated data set as the training set. The user may sample the user reviews of applications according to the software domain of interest.

Step 2: Determine the problem types contained in the user reviews, manually annotate the sampled reviews, and check and verify the annotation results.

Annotating the sampled reviews means labeling the problem type to which each sample belongs. The problem types can be obtained by examining the reviews one by one, or can be specified by the user according to the problems of interest.

Step 3: Prepare the user review data set of the applications to be analyzed; the user may select the review data of interest.

Step 4: Preprocess the review data set verified in step 2 and the user review data set obtained in step 3. Preprocessing includes word segmentation and building word-frequency vectors using the vector space model (VSM) and the TF-IDF algorithm.

Step 5: Set an attribute that identifies the application type. The attribute distinguishes two classes of application: in one class, services and content are provided solely by the developer; in the other, users contact and communicate with other people or businesses. The attribute is added to the word-frequency vector of every review in the verified annotated review data set and in the user review data set obtained in step 4.

Step 6: Build one binary classifier for each problem type. The verified annotated review data set serves as the training set, and the user review data set obtained in step 3 serves as the prediction set, which is classified with the binary classifier of each problem type.

In step 6, the binary classifier uses a linear support vector machine with a cost-sensitive meta-classifier added. Classification is run with different cost matrix values set for the cost-sensitive meta-classifier, and the cost matrix that gives the best results is selected.

The advantages and positive effects of the present invention are as follows. (1) The method is intuitive, simple, and effective. Based on the characteristics of review data, it enriches the features used by the classifier by adding a data attribute, mitigates the data imbalance problem to some extent with a cost-sensitive meta-classifier, and optimizes classifier performance by configuring the support vector machine parameters appropriately. Combining these measures, the method outperforms the current best user review classification technique: compared with the results of the current best method (reference [3]), it improves accuracy by 14% and recall by 30%. (2) The method can flexibly meet users' individual needs. The choice of domain of interest and annotated data in step 1, the definition of problem types in step 2, and the analysis and comparison approach and data in step 3 can all be changed freely according to the user's needs, so the method generalizes to a wide variety of usage scenarios.

Description of the drawings

Fig. 1 is a schematic diagram of the overall flow of the machine-learning-based user review mining method of the present invention;

Fig. 2 is a schematic diagram of the review classification method of the present invention;

Fig. 3 shows the analysis results of applying the present invention to communication and social applications on 360 Mobile Assistant.

Detailed description of the embodiments

To help those of ordinary skill in the art understand and implement the present invention, specific embodiments are described below with reference to the drawings.

360 Mobile Assistant is one of the most popular app stores in China; it is an Android application distribution platform operated by Qihoo 360. As of February 2015, the store had more than 800 million users in total. Cumulative application downloads had reached 64 billion, and average daily distribution had reached 180 million. 360 Mobile Assistant is widely used, with far more users in China than the Google Play store. In addition, app reviews can be crawled from its website. The applications distributed by 360 Mobile Assistant are free. In this store, a user's review consists of a date, a rating (positive, neutral, or negative), and the review text. The content mined here is the neutral and negative reviews, because positive reviews are mostly praise for the application, whereas neutral and negative reviews reflect the software problems that users want to complain about.

Using 360 Mobile Assistant as the example platform, the specific steps P01 to P06 of the machine-learning-based method for mining mobile Internet user reviews are described in detail below.

Step P01: Sample the user reviews in the app market. On 360 Mobile Assistant, applications are divided into 13 categories: System and Security, Communication and Social, Music and Video, News and Reading, Lifestyle, Themes and Wallpapers, Office and Business, Photography and Video, Shopping, Maps and Travel, Education, Finance, and Health and Medical. Applications on 360 Mobile Assistant are selected for analysis according to the following criteria: 1. according to the overall ranking provided by the market, select the top-ranked application in each category as of July 3, 2016; 2. remove any application whose total number of neutral and negative reviews is less than 100. This yields 11 applications, as shown in Table 1.

Then a crawler is used to collect the user reviews of these applications, the number of neutral and negative reviews is counted, and these reviews are randomly sampled at a 95% confidence level with a 5% confidence interval (margin of error).
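The text does not state which formula was used to derive the sample sizes. As a minimal sketch, assuming Cochran's formula with a finite population correction, z = 1.96 for the 95% confidence level, p = 0.5, and rounding up, the sample sizes computed from the neutral-and-negative review counts in Table 1 agree with the sampled counts listed there. The function name and parameters are illustrative.

```python
import math

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's sample size with finite population correction, rounded up.

    Assumed formula; the source only gives the confidence level and interval.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)      # size for an infinite population
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# Neutral-and-negative review counts taken from three rows of Table 1.
for total in (12594, 8808, 36227):
    print(total, sample_size(total))
```

Rounding up rather than to the nearest integer is the choice that matches every row of the table.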

Table 1: Sampled applications

Application	Category	Total reviews	Neutral and negative reviews	Sampled reviews
Mobile Taobao	Shopping and Deals	100236	12594	373
Meitu Xiu Xiu	Photography and Video	89561	4313	353
Toutiao	News and Reading	23522	1191	291
Alipay	Finance and Wealth Management	20809	8808	369
WeChat	Communication and Social	111043	36227	381
Youku	Video and Audio	77830	17759	377
360 Security Guard	System and Security	81847	29373	380
360 Launcher	Themes and Wallpapers	22518	6881	364
360 Cloud Drive	Office and Business	10246	3950	351
Didi Chuxing	Maps and Travel	16620	1367	301
Zuoyebang	Education and Learning	82004	6234	362

Step P02: Determine the problem types contained in the user reviews. The goal is to find the problem types in the reviews that are meaningful to developers. First, a problem type list is set up, initially empty; then the content of each review sample is examined manually. If the sample content matches a type in the problem list, the sample is labeled with that type; if it does not, a new type is added to the problem list and labeling starts over against the new list. In the end, all problem types and the labeled samples are obtained, as shown in Table 2. Reviews may be rechecked several times during this process, which also reduces human error.

Table 2: List of problem types

The annotation results are then manually checked and verified again to reduce human error.

Step P03: Prepare the user review data set to be analyzed. For example, for a specific type of application, compare across different applications (including comparisons between different score bands and within the same score band) the distribution of complaint types and the differences in the share of each complaint type, and summarize the common requirements and priorities of that type of application.

For communication and social applications, the applications to analyze may be selected as follows:

1. Denote the applications of a given category on 360 Mobile Assistant, in descending order of downloads, as a_1, a_2, a_3, ..., a_n, with corresponding download counts d_1, d_2, ..., d_n;

2. Select the applications a_1, a_2, a_3, ..., a_k that meet the condition

3. Crawl a number of reviews of a_i, including positive, neutral, and negative reviews. Each positive review has weight 10, each neutral review weight 5, and each negative review weight 1; the score s_i of application a_i is computed by summing the review weights; i = 1, 2, ..., k;

4. Select two score bands, 9 points and above and 7 points and below, and in each band choose the 5 applications with the highest download counts as the objects of study.

Once the applications to analyze have been chosen, the first 2000 reviews of each are crawled for the subsequent analysis.
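The scoring and band selection in items 3 and 4 can be sketched as follows. This is only an illustration under two assumptions that the text does not confirm: the summed weights are divided by the number of reviews (which the thresholds of 9 and 7 on a 1 to 10 scale suggest), and both band boundaries are inclusive. The names `WEIGHTS`, `app_score`, and `select_apps` are hypothetical.

```python
WEIGHTS = {"positive": 10, "neutral": 5, "negative": 1}

def app_score(ratings):
    """Mean review weight (assumes normalisation by the number of reviews)."""
    return sum(WEIGHTS[r] for r in ratings) / len(ratings)

def select_apps(apps, top_n=5):
    """apps: list of (name, downloads, ratings).

    Returns (high_band, low_band): in each score band, the top_n app names
    ordered by download count, highest first.
    """
    scored = [(name, downloads, app_score(ratings)) for name, downloads, ratings in apps]
    by_downloads = sorted(scored, key=lambda a: a[1], reverse=True)
    high = [name for name, _, score in by_downloads if score >= 9][:top_n]
    low = [name for name, _, score in by_downloads if score <= 7][:top_n]
    return high, low

apps = [
    ("A", 100, ["positive"] * 9 + ["neutral"]),   # score 9.5
    ("B", 300, ["negative", "neutral"]),          # score 3.0
    ("C", 200, ["positive"] * 10),                # score 10.0
]
print(select_apps(apps))
```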

In addition, different types of applications can be compared to observe how widespread it is for complaint reviews to reflect user requirements; or, taking version updates as milestones, users' feedback on the iterations of a given application can be analyzed in order to evaluate past iterations and predict future ones; and so on.

Step P04: Preprocess the annotated review data set verified in step P02 and the user review data set from step P03. Preprocessing includes word segmentation and building word-frequency vectors using the vector space model (VSM) and the TF-IDF algorithm.

In contrast to English text, Chinese text must first be segmented into words: word segmentation is the basis of Chinese information processing, and segmented text is easier for a computer to process and understand. The present invention segments the review text with jieba, an efficient Python word segmentation component. Single digits and non-Chinese characters are deleted because they carry no useful information; stop words, however, are retained, because some of them are meaningful for determining the problem type, for example "should not".
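The cleaning rule can be sketched as a filter applied to already segmented tokens. The sketch assumes the tokens come from a segmenter such as jieba (for example its `lcut` function) and interprets the rule as "keep only tokens made up entirely of Chinese characters"; the Unicode range and the function name `clean_tokens` are assumptions, not details from the text.

```python
import re

# CJK Unified Ideographs block; treated here as "Chinese characters" (an assumption).
CHINESE_ONLY = re.compile(r"[\u4e00-\u9fff]+")

def clean_tokens(tokens):
    """Drop digits, Latin text and punctuation; stop words are deliberately kept."""
    return [t for t in tokens if CHINESE_ONLY.fullmatch(t)]

tokens = ["更新", "后", "5", "app", "不要", "闪退", "!"]
print(clean_tokens(tokens))
```

No stop-word list is applied, so function words that signal a complaint stay in the token stream.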

The vector space model (VSM) is a text representation model suited to large-scale corpora. In this model, the text space is regarded as a vector space spanned by a set of orthogonal feature vectors. Each dimension of a vector corresponds to a feature of the text, and its value is the weight of that feature in the text. The TF-IDF algorithm is a common way of computing the weights, where TF denotes term frequency and IDF denotes inverse document frequency. The main idea of TF-IDF is that if a word or phrase occurs frequently in one document or one class of documents (high TF) but rarely in other documents, then the word or phrase is considered to have good discriminative power. The present invention builds the word-frequency vectors with the StringToWordVector class provided by the data mining toolkit WEKA, which implements the TF-IDF algorithm. In addition, words that occur fewer than three times in the data set are filtered out; this filtering removes rare misspellings.

After the preprocessing in this step, every review is represented by a word-frequency vector.
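To make the vectorization concrete, here is a minimal pure-Python sketch of a textbook TF-IDF (raw term count times log(N / document frequency)) with the "fewer than three occurrences" filter. It is not WEKA's StringToWordVector and may differ from it in weighting details; counting occurrences over the whole data set is also an assumption.

```python
import math
from collections import Counter

def build_tfidf(docs, min_count=3):
    """docs: list of token lists. Returns (vocabulary, list of TF-IDF vectors)."""
    totals = Counter(t for doc in docs for t in doc)
    # Keep only words seen at least min_count times in the whole data set.
    vocab = sorted(t for t, c in totals.items() if c >= min_count)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, vectors

docs = [["crash", "update"], ["crash", "crash"], ["update", "ads"], ["update"]]
vocab, vectors = build_tfidf(docs)
print(vocab)
```

Here "ads" occurs only once and is dropped, so each review becomes a two-dimensional vector over "crash" and "update".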

Step P05: Add an attribute to the verified annotated review data set and to the user review data set from step P03. The attribute distinguishes mobile applications in which services are provided solely by the developer from mobile applications in which users contact and communicate with other people, businesses, and so on. The applications selected in step P01 can be divided into two classes, as shown in Table 3.

Table 3: Application class attribute

Class 1	Class 2
Alipay	Didi Chuxing
360 Launcher	Mobile Taobao
Meitu Xiu Xiu	WeChat
Toutiao	Youku
Zuoyebang	
360 Security Guard	
360 Cloud Drive	

This attribute is added to the WEKA word-frequency vector of every review as a nominal attribute.

Previous research on user review classification considers only the text of a review as the basis for classification. The present invention also considers whether the application's services and content are provided solely by the developer or whether users interact with other people or businesses, since this leads to differences in review content. For example, Didi Chuxing is an application for booking taxis and private cars online, and some of its user reviews focus on the service provided by drivers rather than on software functionality. For an application such as 360 Cloud Drive (which provides a cloud storage service), by contrast, user reviews contain nothing related to other people or other businesses. The present invention therefore divides applications into two classes: in one class, services and content are provided solely by the developer; in the other, users contact and communicate with other people or businesses. The attribute is added to the verified review data set and to the user review data set from step P03. Specifically, the attribute identifying the application class is added to the word-frequency vector of every review. The added attribute enriches the features used by the classifier and improves the mining results for review data from different applications.
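As a small illustration of this step, the sketch below appends a nominal application-type value to a review's word-frequency vector. The app identifiers, the two label strings, and the helper `add_app_type` are hypothetical; in WEKA the value would be a nominal attribute rather than a plain string in a list.

```python
DEVELOPER_ONLY = "developer_only"   # class 1: services and content from the developer only
INTERACTIVE = "interactive"         # class 2: users deal with other people or businesses

def add_app_type(vector, app_name, app_types):
    """Return a new feature row: the word-frequency values plus the app-type attribute."""
    return list(vector) + [app_types[app_name]]

# Hypothetical mapping from application to class.
app_types = {"cloud_storage_app": DEVELOPER_ONLY, "ride_hailing_app": INTERACTIVE}
row = add_app_type([0.0, 1.39], "ride_hailing_app", app_types)
print(row)
```

Every review of the same application receives the same value, so the attribute carries application-level context that the review text alone does not.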

Step P06: Using the verified annotated review data set as the training set and the user review data set obtained in step P03 as the prediction set, perform binary classification for each problem type.

One binary classifier is built for each problem type. A single user review may contain several problem types, so multiple binary classifiers are needed. The binary classifiers use a support vector machine (SVM). When the number of samples is small and the number of features is very large, a nonlinear kernel is usually inaccurate and may partition the feature space incorrectly, so a linear SVM is used as the classifier to optimize classification performance. In addition, for some problem types the negative samples far outnumber the positive samples, and this imbalance can bias the classifier toward predicting new samples as negative. The present invention therefore handles the problem with cost-sensitive learning: a cost-sensitive meta-classifier is added to the binary classifier and suitable cost matrix parameters are set. The data are reweighted according to the different misclassification costs; when the cost of predicting a positive sample as negative is higher, the weight of the positive samples is increased.
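The two ideas in this paragraph, one binary target per problem type and reweighting by misclassification cost, can be sketched as below. This is a simplified illustration, not WEKA's CostSensitiveClassifier: the weight of a sample is taken to be the cost of misclassifying its true class, rescaled so the weights sum to the number of samples. The problem type names and function names are hypothetical.

```python
def binary_labels(review_types, problem_type):
    """1 if the review was annotated with problem_type, else 0."""
    return [1 if problem_type in types else 0 for types in review_types]

def instance_weights(labels, cost_fn, cost_fp=1.0):
    """Weight each sample by the cost of misclassifying its true class.

    cost_fn: cost of predicting a positive sample as negative.
    cost_fp: cost of predicting a negative sample as positive.
    Weights are rescaled to sum to the number of samples.
    """
    raw = [cost_fn if y == 1 else cost_fp for y in labels]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

# Each review may carry several annotated problem types (hypothetical names).
review_types = [{"crash"}, {"update"}, {"ads"}, set()]
labels = binary_labels(review_types, "crash")
weights = instance_weights(labels, cost_fn=3.0)
print(labels, weights)
```

With a false-negative cost of 3, the single positive sample counts three times as much as each negative one during training, which offsets the imbalance.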

In the implementation, for each problem type, the embodiment of the present invention builds the binary classifier using the support vector machine and the CostSensitiveClassifier class provided by the data mining toolkit WEKA. The parameters of the SMO class are set so that it acts as a linear support vector machine, and different cost matrices are set for WEKA's CostSensitiveClassifier class to find the best classification performance. In this way the present invention optimizes classifier performance through appropriate configuration of the support vector machine parameters.

As shown in Fig. 2, after the attribute has been added in step P05: 1. The verified user reviews are split into training data and test data. The CostSensitiveClassifier class provided by WEKA is used as the meta-classifier, together with WEKA's support vector machine implementation, the SMO class, with its default parameters, to obtain a classification model and its performance. Setting different values for the cost matrix required by CostSensitiveClassifier yields different evaluation results. After repeated training runs, the cost matrix values that give the best performance are selected as the parameters for the subsequent prediction. 2. All of the previously verified user reviews are used as the training set and the user review data set prepared in step P03 is used as the prediction set. With the best-performing cost matrix values as parameters, the CostSensitiveClassifier class provided by WEKA is used as the meta-classifier, together with the SMO class with its default parameters, to build the classification model and obtain the results. These operations are performed for every problem type.
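The selection loop in part 1 can be sketched independently of any particular classifier. The text does not say which metric defines "best performance"; the sketch assumes F1 on the test split. The callback `train_and_predict` stands in for training the cost-sensitive classifier with a given cost value and predicting the test data, and is hypothetical, as are the candidate costs and the fake predictions used to exercise the loop.

```python
def f1_score(y_true, y_pred):
    """F1 of the positive class (1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_cost(candidates, train_and_predict, y_test):
    """Evaluate each candidate cost; return (best_cost, best_f1).

    On ties, the candidate that appears first wins.
    """
    scored = [(c, f1_score(y_test, train_and_predict(c))) for c in candidates]
    return max(scored, key=lambda s: s[1])

# Fake predictions per false-negative cost, standing in for a trained model.
y_test = [1, 1, 0, 0]
fake_predictions = {1: [0, 0, 0, 0], 2: [1, 0, 0, 0], 4: [1, 1, 1, 0]}
best_cost, best_f1 = select_cost([1, 2, 4], lambda c: fake_predictions[c], y_test)
print(best_cost, best_f1)
```

The chosen cost would then be reused in part 2, where the model is retrained on all verified reviews before predicting the unlabeled set.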

Taking the analysis target described in step P03 as an example, the results obtained with this method are shown in Fig. 3. The results show that: 1. For communication and social applications, the problem type with the highest share is update problems, that is, users' requirements for this type of application center on a poor experience after updating; crashes and functional complaints are also problems that deserve attention. 2. Comparing the average share of each requirement across the score bands shows that, apart from the common problems above, the lower-scoring group of applications performs worse than the higher-scoring group with respect to functional complaints, requests for new features, network problems, and response time, and the developers of these applications should make improvements in these areas. The method of the present invention was compared with the results of the current best method (reference [3]); the comparison shows that this method improves accuracy by 14% and recall by 30%.

The present invention proposes, in an intuitive, simple, and effective way, a machine-learning-based user review mining method. By adding a data attribute and by appropriately configuring the cost-sensitive meta-classifier and the support vector machine, the method outperforms existing work on user review classification. It can also flexibly meet users' individual needs: the choice of domain of interest and annotated data in step 1, the definition of problem types in step 2, and the approach and data for analyzing and comparing applications in step 3 can all be changed freely according to the user's needs, so the method generalizes to a wide variety of usage scenarios.

Claims

1. A machine-learning-based method for mining mobile Internet user reviews, comprising the following steps:

Step 1: sampling the user reviews of applications in the domain to be studied;

Step 2: determining the problem types contained in the user reviews, manually annotating the sampled reviews, and checking and verifying the annotation results;

Step 3: obtaining the review data set of the applications to be analyzed;

Step 4: preprocessing the annotated review data set verified in step 2 and the review data set obtained in step 3, the preprocessing including word segmentation and building word-frequency vectors using the vector space model and the TF-IDF algorithm, where TF denotes term frequency and IDF denotes inverse document frequency;

characterized in that

Step 5: an attribute identifying the application type is set, the attribute distinguishing two classes of application, one class in which services and content are provided solely by the developer and another class in which users contact and communicate with other people or businesses; the attribute identifying the application type is added to the word-frequency vector of every review in the verified annotated review data set and in the user review data set obtained in step 4;

Step 6: one binary classifier is built for each problem type, the review data set verified in step 2 serves as the training set, the user review data set obtained in step 3 serves as the prediction set, and the prediction set is classified with the binary classifier of each problem type;

in step 6, the binary classifier uses a linear support vector machine with a cost-sensitive meta-classifier added, classification is performed with different cost matrix values set for the cost-sensitive meta-classifier, and the cost matrix giving the best performance is selected.

2. The method according to claim 1, characterized in that in step 4, word segmentation is performed with jieba, single digits and non-Chinese characters are deleted, and stop words are retained.

3. The method according to claim 1, characterized in that in step 4, when the word-frequency vectors are built, words that occur fewer than three times in the review data set are filtered out.

4. The method according to claim 1, characterized in that in step 6, for each problem type, the binary classifier is built using the support vector machine and the CostSensitiveClassifier class provided by the data mining toolkit WEKA.

5. The method according to claim 1 or 4, characterized in that in step 6, the following operations are performed for each problem type: (1) the verified review data set is split into training data and test data, the CostSensitiveClassifier class provided by WEKA is used as the meta-classifier, and WEKA's support vector machine implementation, the SMO class, is used with its default parameters to obtain a classification model and its evaluation results, wherein setting different values for the cost matrix required by the CostSensitiveClassifier class yields different evaluation results, and after repeated training runs the cost matrix values giving the best performance are selected as the parameters for the subsequent prediction; (2) the whole verified review data set is used as the training set and the review data set of step 4 is used as the prediction set, the best-performing cost matrix values obtained in (1) are used as parameters, the CostSensitiveClassifier class provided by WEKA is used as the meta-classifier, and WEKA's support vector machine implementation, the SMO class, is used with its default parameters to build a classification model and obtain the classification results for the prediction set.