CN103793444B - Method for acquiring user requirements

Info

Publication number: CN103793444B
Application number: CN201210436032.7A
Authority: CN
Inventors: 朱利民
Original assignee: JIANGSU SUDADA DATA TECHNOLOGY Co Ltd
Current assignee: JIANGSU SUDADA DATA TECHNOLOGY Co Ltd
Priority date: 2012-11-05
Filing date: 2012-11-05
Publication date: 2017-02-08
Anticipated expiration: 2032-11-05
Also published as: CN103793444A

Abstract

The invention relates to a method for acquiring user requirements. The method sequentially includes steps of acquiring seed words provided by users; expanding keywords; searching web pages; selecting the web pages; labeling the web pages; evaluating the web pages; learning the user requirements. A user requirement model can be acquired via the steps. The method for acquiring the user requirements has the advantages that the requirement model is built according to the user requirements and is continuously improved, the user requirements can be accurately acquired according to the user requirement model, and accordingly high-correlation information can be provided for the users.

Description

Method for acquiring user requirements

Technical field

The present invention relates to the field of network technology, and more particularly to a method for acquiring user requirements.

Background

Since its inception, the Internet has grown into a vast global information repository with nearly one hundred million users and several hundred million web pages, and the volume of information it holds is still growing exponentially. Obtaining information from the Internet has become a principal way for individuals to acquire knowledge and an important channel through which enterprises obtain information. However, faced with this vast sea of online information, traditional manual collection and processing methods are inadequate, and search results usually contain a great deal of information that has little relevance to the user's requirements. How to acquire the user's requirements accurately is therefore a key problem.

At present, extensive research has been carried out in the field of information search both in China and abroad, and a number of search engines have been developed, such as Baidu, Google, and Yahoo. These search engines improve the efficiency and speed of search to a certain extent, but their methods of acquiring user requirements still have significant limitations, most notably in the following respects. First, because they rely on full-text retrieval or keyword retrieval, their literal matching mechanism causes the actual retrieval results to deviate from the user's requirements; that is, retrieval returns too little "useful" information and too much "junk" information. Second, web search engines must cover a broad range of knowledge domains and lack sufficient background knowledge for any particular domain, so they return a large number of irrelevant web pages and few highly relevant ones.

Summary of the invention

In view of this, for searching network information, it is necessary to provide a method for accurately acquiring user requirements.

A method for acquiring user requirements comprises the following steps in sequence:

Obtaining seed words provided by a user, the seed words including positive seed words and negative seed words;

A keyword expansion step: expanding the seed words using a synonym thesaurus (TongYiCi CiLin) and hypernym-hyponym relations, to obtain positively related keywords related to the positive seed words and negatively related keywords related to the negative seed words;

A search step: performing a matching search on the Internet according to the positively related keywords and the negatively related keywords to obtain web pages to be labeled, the web pages to be labeled including candidate positive examples and candidate negative examples, which are obtained by searching with the positively related keywords and the negatively related keywords, respectively;

A web page selection step: analyzing the web pages to be labeled, classifying them according to their content, and then selecting one sample web page from each class for the user to label;

A labeling step: if a sample web page meets the user's requirements, labeling the sample web page as a positive example, whereupon all other web pages to be labeled in the same class as the sample web page are regarded as positive examples; if a sample web page does not meet the user's requirements, labeling the sample web page as a negative example, whereupon all other web pages to be labeled in the same class are regarded as negative examples; and collecting the positive examples and negative examples to obtain an initial user-labeled data set;

An evaluation step: using an SVM classifier training method, taking all sample web pages selected from the candidate positive examples and candidate negative examples as the test set and all non-sample web pages as the training set, and testing the accuracy of the classification of the web pages to be labeled to obtain a classification accuracy; a threshold is preset; when the classification accuracy reaches the threshold, the evaluation step is complete; when the classification accuracy does not reach the threshold, returning to the web page selection step, adjusting the numbers of positive examples and negative examples that need to be labeled, and repeating the labeling step and the evaluation step, finally obtaining a user-labeled data set with equal numbers of positive and negative examples;

A learning step: learning the user's requirements on the basis of the user-labeled data set with equal numbers of positive and negative examples, to obtain a requirement model of the user.

In one embodiment, in the keyword expansion step, the synonym thesaurus and the hypernym-hyponym relations are provided by WordNet.

In one embodiment, the method further comprises, after the labeling step, a step of extracting feature words from the obtained positive and negative examples, generating positively related keywords and negatively related keywords, and further expanding the seed words.

In one embodiment, in the labeling step, the user labels the sample web pages through a human-computer interaction interface.

In one embodiment, in the labeling step, the initial numbers of positive examples and negative examples that need to be labeled are equal.

In one embodiment, in the evaluation step, the numbers of positive examples and negative examples that need to be labeled are as follows:

Number of positive examples = total number of user-labeled web pages × (current ratio of negative examples + ratio of negative examples among current classification errors) / 2;

Number of negative examples = total number of user-labeled web pages × (current ratio of positive examples + ratio of positive examples among current classification errors) / 2;

In the above calculation, the ratio of negative examples is the proportion of negative examples in the total number of positive and negative examples, and the ratio of positive examples is the proportion of positive examples in that total. The ratio of negative examples among classification errors is the proportion of web pages in the training set that are mistaken for negative examples after SVM classifier training, and the ratio of positive examples among classification errors is the proportion of web pages in the training set that are mistaken for positive examples after SVM classifier training.

In one embodiment, the learning step includes:

A topic-line learning step: a topic feature search tree is preset; first, topics are extracted from the user-labeled data set with equal numbers of positive and negative examples to obtain a topic data set; second, topic-line features are extracted from the topic data set; finally, topic requirement determination is performed: if the current topic feature search tree does not include an extracted topic feature, the extracted topic feature is added to the topic feature search tree, yielding a topic monitoring model of the user;

A content learning step: first, content is extracted from the user-labeled data set with equal numbers of positive and negative examples to obtain a content data set; second, content features are extracted from the content data set; finally, a binary classifier is trained to determine content requirements, yielding a content monitoring model of the user.

In one embodiment, in the topic-line learning step, when topic features are extracted from the topic data set, the topic features are constructed by reordering words.

In one embodiment, the binary classifier is a Bayes classifier.

In the above method for acquiring user requirements, the seed words provided by the user are first obtained and expanded to obtain positively related keywords and negatively related keywords. Web pages to be labeled are then found by searching with these keywords, and the web page selection step and the labeling step yield an initial user-labeled data set. The initial user-labeled data set is then evaluated to obtain a user-labeled data set with equal numbers of positive and negative examples, which is analyzed to learn the user's requirements and obtain the user's requirement model. The requirement model is built from the user's requirements and continuously refined. With this model, the user's requirements can be acquired accurately, so that more relevant information can be provided to the user.

Brief description of the drawings

Fig. 1 is a flowchart of the method for acquiring user requirements according to an embodiment;

Fig. 2 is a flowchart of the learning step according to an embodiment.

Detailed description

To solve the problem that user requirements are difficult to acquire accurately, this embodiment provides a method for accurately acquiring user requirements. The method is described in detail below with reference to specific embodiments.

Referring to Fig. 1 and Fig. 2, the method for acquiring user requirements provided by this embodiment comprises the following steps:

Step S110, obtaining the seed words provided by the user;

Step S120, keyword expansion step;

Step S130, search step;

Step S140, web page selection step;

Step S150, labeling step;

Step S160, evaluation step;

Step S170, learning step.

In step S110, the seed words provided by the user are obtained; the seed words include positive seed words and negative seed words.

Step S120 is the keyword expansion step. Keyword expansion extends the current seed words by adding their synonyms or near-synonyms. There are two approaches to keyword expansion. The first uses the synonym thesaurus and hypernym-hyponym relations provided by WordNet (an English dictionary based on cognitive linguistics that not only arranges words alphabetically but also organizes them into a "network of words" according to their meanings) to expand the seed words, obtaining positively related keywords related to the positive seed words and negatively related keywords related to the negative seed words; the positively and negatively related keywords are collected into a keyword library. The second approach, following the evaluation step of step S160, extracts feature words from the obtained positive and negative examples, generates positively and negatively related keywords, and further expands the seed words, thereby improving the keyword library and acquiring the user's requirements more accurately.
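The first expansion approach can be illustrated with a minimal sketch. It assumes a small in-memory lexicon (`TOY_LEXICON`, a hypothetical stand-in for WordNet) that maps each word to its synonyms and hyponyms; the function names and the layout of the keyword library are likewise illustrative and are not taken from the patent.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy lexicon standing in for WordNet: word -> synonyms and hyponyms.
TOY_LEXICON: Dict[str, Dict[str, Set[str]]] = {
    "car": {"synonyms": {"automobile", "auto"}, "hyponyms": {"sedan", "taxi"}},
    "advertisement": {"synonyms": {"ad", "advert"}, "hyponyms": {"banner"}},
}


def expand_keywords(seeds: Iterable[str],
                    lexicon: Dict[str, Dict[str, Set[str]]] = TOY_LEXICON) -> Set[str]:
    """Return the seeds plus every synonym and hyponym the lexicon lists for them."""
    expanded: Set[str] = set()
    for seed in seeds:
        expanded.add(seed)
        entry = lexicon.get(seed, {})  # unknown seeds are kept but not expanded
        expanded |= entry.get("synonyms", set())
        expanded |= entry.get("hyponyms", set())
    return expanded


def build_keyword_library(positive_seeds: Iterable[str],
                          negative_seeds: Iterable[str]) -> Dict[str, Set[str]]:
    """Expand both seed lists and collect the results into one keyword library."""
    return {
        "positive": expand_keywords(positive_seeds),
        "negative": expand_keywords(negative_seeds),
    }


library = build_keyword_library(["car"], ["advertisement"])
```

In a real system the lexicon lookup would be replaced by queries against WordNet or a comparable thesaurus; the dictionary here only keeps the sketch self-contained.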

Step S130 is the search step. According to the positively related keywords and negatively related keywords in the keyword library, a matching search is performed on the Internet to obtain web pages to be labeled. The web pages to be labeled include candidate positive examples and candidate negative examples, which are obtained by searching with the positively related keywords and the negatively related keywords, respectively. Candidate positive examples are web pages that the user cares about and that meet the user's requirements; candidate negative examples are so-called "error information" that does not meet the user's requirements.

Step S140 is the web page selection step. The web pages to be labeled are analyzed and divided into several classes according to their content, and one sample web page is then selected from each class for the user to label; the number of sample web pages is specified by the user. If the user labels a sample web page as a positive example, all other web pages to be labeled in the same class are regarded as positive examples; if the user labels a sample web page as a negative example, all other web pages to be labeled in the same class are regarded as negative examples. Step S140 therefore greatly reduces the user's labeling workload.

Step S150 is the labeling step. This embodiment provides a human-computer interaction interface through which the user can easily label the candidate web pages. If a sample web page meets the user's requirements, the sample web page is labeled as a positive example, and all other web pages to be labeled in the same class are regarded as positive examples; if a sample web page does not meet the user's requirements, the sample web page is labeled as a negative example, and all other web pages to be labeled in the same class are regarded as negative examples. The positive and negative examples are collected to obtain the initial user-labeled data set.
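The selection and labeling steps can be sketched as follows. The sketch assumes the web pages have already been grouped into classes (the patent does not say how the content-based classification is done), picks the first page of each class as the sample, and uses a callback `ask_user` as a hypothetical stand-in for the human-computer interaction interface.

```python
from typing import Callable, Dict, List


def select_samples(clusters: Dict[str, List[str]]) -> Dict[str, str]:
    """Pick one sample page per class (here simply the first page of each class)."""
    return {cid: pages[0] for cid, pages in clusters.items() if pages}


def propagate_labels(clusters: Dict[str, List[str]],
                     samples: Dict[str, str],
                     ask_user: Callable[[str], bool]) -> Dict[str, bool]:
    """Ask the user about each sample and copy the answer to its whole class."""
    labels: Dict[str, bool] = {}
    for cid, sample in samples.items():
        is_positive = ask_user(sample)  # True = positive example, False = negative
        for page in clusters[cid]:
            labels[page] = is_positive
    return labels


# Assumed result of the content-based classification (illustrative data only).
clusters = {
    "c1": ["page_a", "page_b", "page_c"],
    "c2": ["page_d", "page_e"],
}
samples = select_samples(clusters)
# Stand-in for the interactive interface: this "user" accepts only page_a.
labels = propagate_labels(clusters, samples, ask_user=lambda page: page == "page_a")
```

With two classes the user answers two questions, yet all five pages receive a label, which is the workload reduction the selection step aims for.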

In the first round of human-computer interaction, because the evaluation step of step S160 has not yet been performed, the user labels the sample web pages at a 1:1 ratio; that is, the user labels equal numbers of positive and negative examples from among all the sample web pages. This 1:1 ratio is only the initial setting; in subsequent steps the ratio of positive to negative examples is adjusted accordingly.

Step S160 is the evaluation step. Using an SVM (support vector machine) classifier, all sample web pages selected from the candidate positive examples and candidate negative examples are taken as the test set and all non-sample web pages as the training set, and the accuracy of the classification of the web pages to be labeled is tested to obtain a classification accuracy. A threshold is preset; when the classification accuracy reaches the threshold, the evaluation step is complete. When the classification accuracy does not reach the threshold, the method returns to the web page selection step, adjusts the numbers of positive and negative examples that need to be labeled, and repeats the labeling step and the evaluation step until the classification accuracy reaches the threshold, thereby obtaining a user-labeled data set with equal numbers of positive and negative examples.
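The control flow of one evaluation round can be sketched as below. To stay self-contained, the sketch does not implement an SVM: the trainer is passed in as a parameter, and `train_word_vote` is a deliberately simple stand-in classifier, not the patent's SVM. All names and the sample data are illustrative assumptions.

```python
from typing import Callable, Sequence, Tuple

Classifier = Callable[[str], bool]
Trainer = Callable[[Sequence[str], Sequence[bool]], Classifier]


def evaluate(train_pages: Sequence[str], train_labels: Sequence[bool],
             test_pages: Sequence[str], test_labels: Sequence[bool],
             train: Trainer, threshold: float) -> Tuple[float, bool]:
    """Train on the non-sample pages, test on the sample pages, compare to the threshold."""
    classifier = train(train_pages, train_labels)
    correct = sum(classifier(p) == y for p, y in zip(test_pages, test_labels))
    accuracy = correct / len(test_pages)
    return accuracy, accuracy >= threshold


def train_word_vote(pages: Sequence[str], labels: Sequence[bool]) -> Classifier:
    """Stand-in trainer (NOT an SVM): a page is positive if it contains a word
    that occurred only in positive training pages."""
    pos_words, neg_words = set(), set()
    for page, label in zip(pages, labels):
        (pos_words if label else neg_words).update(page.split())
    only_pos = pos_words - neg_words
    return lambda page: any(word in only_pos for word in page.split())


accuracy, done = evaluate(
    train_pages=["cheap car sale", "car review", "win a prize", "prize lottery"],
    train_labels=[True, True, False, False],
    test_pages=["new car", "lottery prize draw"],
    test_labels=[True, False],
    train=train_word_vote,
    threshold=0.9,
)
```

When `done` is false, the caller would return to the selection step with adjusted label counts and call `evaluate` again; an actual implementation would pass an SVM trainer in place of `train_word_vote`.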

If the classification accuracy does not reach the threshold, then on entering the next round of the evaluation step, the numbers of positive and negative examples for that round are adjusted according to the following calculation:

Number of positive examples = total number of user-labeled web pages × (current ratio of negative examples + ratio of negative examples among current classification errors) / 2;

Number of negative examples = total number of user-labeled web pages × (current ratio of positive examples + ratio of positive examples among current classification errors) / 2;

In the above calculation, the ratio of negative examples is the proportion of negative examples in the total number of positive and negative examples, and the ratio of positive examples is the proportion of positive examples in that total. The ratio of negative examples among classification errors is the proportion of web pages in the training set that are mistaken for negative examples after the SVM classifier training step, and the ratio of positive examples among classification errors is the proportion of web pages in the training set that are mistaken for positive examples after the SVM classifier training step. This calculation is the basis for adjusting the ratio of positive to negative examples.
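The adjustment rule stated in claim 6 can be worked through numerically. The text does not spell out the denominator of the "ratio among classification errors"; this sketch assumes it is the total number of misclassified training pages, so that the two error ratios sum to 1 and the two resulting counts sum to the total. Function and parameter names are illustrative.

```python
from typing import Tuple


def next_round_counts(total: int, n_pos: int, n_neg: int,
                      mistaken_as_pos: int, mistaken_as_neg: int) -> Tuple[float, float]:
    """Return (positive count, negative count) to label in the next round.

    total            -- total number of web pages the user labels
    n_pos, n_neg     -- current numbers of positive and negative examples
    mistaken_as_pos  -- training pages wrongly classified as positive
    mistaken_as_neg  -- training pages wrongly classified as negative
    """
    labelled = n_pos + n_neg
    errors = mistaken_as_pos + mistaken_as_neg
    if labelled == 0 or errors == 0:
        raise ValueError("need at least one labelled page and one classification error")
    neg_ratio = n_neg / labelled
    pos_ratio = n_pos / labelled
    # Assumption: error ratios are taken over all misclassified pages.
    err_neg_ratio = mistaken_as_neg / errors
    err_pos_ratio = mistaken_as_pos / errors
    n_pos_next = total * (neg_ratio + err_neg_ratio) / 2
    n_neg_next = total * (pos_ratio + err_pos_ratio) / 2
    return n_pos_next, n_neg_next


# 40 pages to label, currently balanced 10:10, with 3 of 4 errors mistaken as negative.
pos_next, neg_next = next_round_counts(total=40, n_pos=10, n_neg=10,
                                       mistaken_as_pos=1, mistaken_as_neg=3)
```

In this example the classifier mostly errs by calling pages negative, so the rule shifts the next round toward labeling more positive examples (25 against 15). Rounding to whole pages is left to the implementation.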

Step S170 is the learning step, which comprises two parts: a topic-line learning step and a content learning step.

In step S170, step S172 is performed first: topic extraction and content extraction. Topics and content are extracted from the obtained user-labeled data set with equal numbers of positive and negative examples, yielding a topic data set and a content data set, respectively.

Step S174a is topic feature extraction, and step S176a is topic requirement determination. A topic feature search tree is preset, and step S174a is performed to extract topic-line features from the topic data set. When topic features are extracted from the topic data set, the traditional practice is to build a topic transition model from word-based features; however, topic words take many variant forms, so a topic transition model built from word-based features cannot fully cover the information the user needs. To solve this problem, this embodiment constructs topic features by reordering words: a topic word is decomposed into keywords, and the keywords are recombined to obtain the various forms of the topic word, which resolves the problem caused by the many variant forms of topic words. Step S176a is then performed to carry out topic requirement determination: if the current topic feature search tree does not include an extracted topic feature, the extracted topic feature is added to the topic feature search tree, yielding the topic monitoring model of the user.
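A minimal sketch of word reordering and of the "add if missing" rule follows. The patent does not describe the structure of the topic feature search tree; the sketch assumes a simple trie keyed by keyword sequences, and it assumes that a topic word is split into keywords on whitespace. `TopicFeatureTree` and `reorder_variants` are hypothetical names.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

_END = "$end"  # marker for a complete feature; assumed never to occur as a keyword


def reorder_variants(topic_phrase: str) -> List[Tuple[str, ...]]:
    """Split a topic phrase into keywords and return every distinct ordering."""
    keywords = topic_phrase.split()
    return sorted(set(permutations(keywords)))


class TopicFeatureTree:
    """Minimal trie over keyword sequences (an assumed form of the search tree)."""

    def __init__(self) -> None:
        self.root: Dict[str, dict] = {}

    def contains(self, feature: Sequence[str]) -> bool:
        node = self.root
        for word in feature:
            if word not in node:
                return False
            node = node[word]
        return _END in node

    def add_if_missing(self, feature: Sequence[str]) -> bool:
        """Add the feature only if the tree lacks it; return True when it was added."""
        if self.contains(feature):
            return False
        node = self.root
        for word in feature:
            node = node.setdefault(word, {})
        node[_END] = {}
        return True


tree = TopicFeatureTree()
added = [tree.add_if_missing(variant) for variant in reorder_variants("used car sale")]
```

The number of orderings grows factorially with the number of keywords, so a practical system would presumably cap the phrase length or prune unlikely orderings.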

Step S174b is content feature extraction, and step S176b is content requirement determination. Step S174b is performed to extract content features from the content data set, and step S176b is then performed to carry out content requirement determination. Content requirement determination is done by training a binary classifier, which yields the content monitoring model of the user. To ensure the classification speed of the classifier, the binary classifier used in this embodiment is a Bayes classifier.
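The content learning step can be illustrated with a small word-count naive Bayes classifier. The patent says only "Bayes classifier"; the multinomial model, the add-one smoothing, the whitespace tokenization, and the training data below are all assumptions made for the sketch, which also assumes both classes occur in the training data.

```python
import math
from collections import Counter
from typing import List


class NaiveBayes:
    """Two-class multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs: List[str], labels: List[bool]) -> "NaiveBayes":
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter(labels)  # both classes must be present
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_score(self, doc: str, label: bool) -> float:
        total_words = sum(self.word_counts[label].values())
        # Log prior of the class plus the log likelihood of each known word.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in doc.split():
            if word in self.vocab:  # words never seen in training are ignored
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def predict(self, doc: str) -> bool:
        """Return True when the positive class scores higher than the negative one."""
        return self._log_score(doc, True) > self._log_score(doc, False)


docs = ["car engine review", "car price list", "lottery prize winner", "free prize click"]
labels = [True, True, False, False]
model = NaiveBayes().fit(docs, labels)
```

Training and prediction are both linear in the number of words, which is consistent with the stated motivation of keeping classification fast.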

In this embodiment, the seed words provided by the user are first obtained and expanded to obtain positively related keywords and negatively related keywords. Web pages to be labeled are then found by searching with these keywords, and the web page selection step and the labeling step yield an initial user-labeled data set. The initial user-labeled data set is then evaluated repeatedly to obtain a user-labeled data set with equal numbers of positive and negative examples, which is analyzed to learn the user's requirements and obtain the user's requirement model. The requirement model is built from the user's requirements and continuously refined. With this model, the user's requirements can be acquired accurately, so that more relevant information can be provided to the user.

Using the SVM classifier, the positive and negative examples can be evaluated quantitatively, giving a quantitative assessment of the training set, so that the ratio of positive to negative examples can be adjusted in time. Human-computer interaction is then carried out selectively on this basis, so the user's requirements can be acquired more effectively and more comprehensively.

Traditional methods build a topic transition model from word-based features, but topic words have many variants, and such variant words are likely to be filtered out as "error information", so that the user cannot obtain the required information in full. Using a topic feature filtering model based on word reordering effectively solves the problem of the many topic variants and ensures that the user obtains the required information comprehensively.

The embodiments described above express only several implementations of the present invention, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the invention, all of which fall within the scope of protection of the invention. The scope of protection of this patent shall therefore be determined by the appended claims.

Claims

1. A method for acquiring user requirements, characterized by comprising the following steps in sequence:

A keyword expansion step: expanding the seed words using a synonym thesaurus (TongYiCi CiLin) and hypernym-hyponym relations, to obtain positively related keywords related to the positive seed words and negatively related keywords related to the negative seed words;

A search step: performing a matching search on the Internet according to the positively related keywords and the negatively related keywords to obtain web pages to be labeled, the web pages to be labeled including candidate positive examples and candidate negative examples, which are obtained by searching with the positively related keywords and the negatively related keywords, respectively;

A web page selection step: analyzing the web pages to be labeled, classifying them according to their content, and then selecting one sample web page from each class for the user to label;

A labeling step: if a sample web page meets the user's requirements, labeling the sample web page as a positive example, whereupon all other web pages to be labeled in the same class as the sample web page are regarded as positive examples; if a sample web page does not meet the user's requirements, labeling the sample web page as a negative example, whereupon all other web pages to be labeled in the same class are regarded as negative examples; and collecting the positive examples and negative examples to obtain an initial user-labeled data set;

An evaluation step: using an SVM classifier training method, taking all sample web pages selected from the candidate positive examples and candidate negative examples as the test set and all non-sample web pages as the training set, and testing the accuracy of the classification of the web pages to be labeled to obtain a classification accuracy; a threshold is preset; when the classification accuracy reaches the threshold, the evaluation step is complete; when the classification accuracy does not reach the threshold, returning to the web page selection step, adjusting the numbers of positive examples and negative examples that need to be labeled, and repeating the labeling step and the evaluation step, finally obtaining a user-labeled data set with equal numbers of positive and negative examples;

A learning step, including:

A topic-line learning step: a topic feature search tree is preset; first, topics are extracted from the user-labeled data set with equal numbers of positive and negative examples to obtain a topic data set; second, topic-line features are extracted from the topic data set; finally, topic requirement determination is performed: if the current topic feature search tree does not include an extracted topic feature, the extracted topic feature is added to the topic feature search tree, yielding a topic monitoring model of the user;

A content learning step: first, content is extracted from the user-labeled data set with equal numbers of positive and negative examples to obtain a content data set; second, content features are extracted from the content data set; finally, a binary classifier is trained to determine content requirements, yielding a content monitoring model of the user.

2. The method for acquiring user requirements according to claim 1, characterized in that, in the keyword expansion step, the synonym thesaurus and the hypernym-hyponym relations are provided by WordNet.

3. The method for acquiring user requirements according to claim 1, characterized by further comprising, after the labeling step, a step of extracting feature words from the obtained positive and negative examples, generating positively related keywords and negatively related keywords, and further expanding the seed words.

4. The method for acquiring user requirements according to claim 1, characterized in that, in the labeling step, the user labels the sample web pages through a human-computer interaction interface.

5. The method for acquiring user requirements according to claim 1, characterized in that, in the labeling step, the initial numbers of positive examples and negative examples that need to be labeled are equal.

6. The method for acquiring user requirements according to claim 1, characterized in that, in the evaluation step, the numbers of positive examples and negative examples that need to be labeled are as follows:

Number of positive examples = total number of user-labeled web pages × (current ratio of negative examples + ratio of negative examples among current classification errors) / 2;

Number of negative examples = total number of user-labeled web pages × (current ratio of positive examples + ratio of positive examples among current classification errors) / 2;

In the above calculation, the ratio of negative examples is the proportion of negative examples in the total number of positive and negative examples, and the ratio of positive examples is the proportion of positive examples in that total; the ratio of negative examples among classification errors is the proportion of web pages in the training set that are mistaken for negative examples after SVM classifier training, and the ratio of positive examples among classification errors is the proportion of web pages in the training set that are mistaken for positive examples after SVM classifier training.

7. The method for acquiring user requirements according to claim 1, characterized in that, in the topic-line learning step, when topic features are extracted from the topic data set, the topic features are constructed by reordering words.

8. The method for acquiring user requirements according to claim 1, characterized in that the binary classifier is a Bayes classifier.