CN102541913A - Web-oriented VSM (vector space model) classifier training method, Web-oriented OSSP (open source software homepage) identifying method and Web-oriented OSS (open source software) resource extracting method

Info

Publication number
CN102541913A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010106097430A
Other languages
Chinese (zh)
Other versions
CN102541913B (en
Inventor
王怀民
朱沿旭
尹刚
袁霖
史殿习
米海波
滕猛
刘惠
刘波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201010609743.0A priority Critical patent/CN102541913B/en
Publication of CN102541913A publication Critical patent/CN102541913A/en
Application granted granted Critical
Publication of CN102541913B publication Critical patent/CN102541913B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a Web-oriented VSM (vector space model) classifier training method, a Web-oriented OSSP (open source software homepage) identification method, and a Web-oriented OSS (open source software) resource extraction method. The VSM classifier training method comprises training a VSM classifier with an initial sample set on the basis of an OSSP identification feature vector, wherein the feature vector is formed from seven or all eight of the following components: software version control and management keywords, mailing list keywords, bug tracking keywords, developer list keywords, certificate keywords, log modification keywords, task list keywords, and software control and management commands. The OSSP identification method comprises identifying whether a Web page is an OSSP according to the trained VSM classifier. The OSS resource extraction method comprises finding OSS resources in the identified OSSP and downloading them locally. The invention can significantly improve the accuracy of Web-oriented OSSP identification, improve the completeness of OSS resource searching and downloading, and obtain OSS resources more accurately.

Description

Web-oriented VSM classifier training method, OSSP page identification method, and OSS resource extraction method
Technical field
The present invention relates to the technical field of Web document classification and Web page information extraction, and in particular to a VSM classifier training method, an OSSP (open source software homepage) page identification method, and an OSS (open source software) resource extraction method.
Background art
I. Introduction to OSS
Open source software is abbreviated as OSS. Since the free software movement began in 1983, advocates of open source software have promoted a spirit of "freedom, participation, dedication, and cooperation". This spirit has attracted many talented people to devote themselves to it and has gradually given rise to organizations such as the Free Software Foundation (FSF) and the Open Source Initiative (OSI).
Open source software has developed for nearly 30 years and has reached a huge scale worldwide. As open source software led by Linux has grown rapidly, open source software co-development alliances (OSSF) similar to SourceForge have gradually formed. The number of software projects in each alliance ranges from tens or hundreds to hundreds of thousands, and the number of such alliances keeps increasing.
II. Methods for searching OSS resources
At present, each major open source software alliance generally embeds a dedicated search engine, which makes it convenient for users or OSS developers to find the OSS resources they need (mainly source code). However, such a dedicated search engine can usually search only within a single open source software alliance, so the amount of information is very limited and the returned search results are not complete enough.
In addition, in the prior art a general-purpose search engine (such as Google) can be used to search for OSS resources among the massive number of Web pages. Taking Google as an example, the user enters keywords for an open source software project, Google returns a list of search results, and the user obtains the OSS resource by browsing that list. The results returned in this way are relatively complete, but they are often mixed with a large number of Web pages that contain no OSS resources, so the user must browse many pages to find the desired software, which is very inconvenient. A solution for OSS resource search that improves both completeness and accuracy is therefore urgently needed.
III. Existing machine-learning-based text classification techniques
The prior art includes a machine-learning-based text classification technique that can be applied to Web page classification. However, using it to identify OSSP pages has the following defects:
1. OSSP pages differ from ordinary Web pages, so keywords cannot simply be chosen by term frequency. For example, words that are highly valuable for identifying an OSSP page, such as SVN, Git, CVS, and License, may have a low term frequency in the page and sometimes appear only once. As a result, in the classical text classification method, words that are unrelated to OSS but have a high term frequency may be fed to the machine-learning-based classifier as principal features, which lowers the accuracy of the recognition results.
2. Among the massive number of Web pages there are many OSS-related pages, such as pages that merely introduce a certain OSS. Such pages share many features of OSSP pages but lack an entry to a code version control repository; that is, the user cannot obtain source code from them. Clearly, the classical text classification method may misjudge many of these OSS-related pages as OSSP pages, which also greatly reduces the accuracy of the recognition results.
In summary, a high-accuracy Web-oriented OSSP page identification method and OSS resource extraction method are urgently needed.
Summary of the invention
The purpose of the present invention is to provide a high-accuracy Web-oriented VSM classifier training method, OSSP page identification method, and OSS resource extraction method.
To achieve the above purpose, the invention provides a VSM classifier training method that trains a VSM classifier with an initial sample set on the basis of an OSSP page identification feature vector. The OSSP page identification feature vector is a VSM classifier feature vector whose components are seven or all eight of the following: software version control and management keywords, mailing list keywords, bug tracking keywords, developer list keywords, certificate keywords, log modification keywords, task list keywords, and software control and management commands.
The present invention also provides a Web-oriented OSSP page identification method comprising the following steps (as shown in Fig. 1):
1) Train a VSM classifier with an initial sample set on the basis of an OSSP page identification feature vector, where the OSSP page identification feature vector is a VSM classifier feature vector whose components are seven or all eight of the following: software version control and management keywords, mailing list keywords, bug tracking keywords, developer list keywords, certificate keywords, log modification keywords, task list keywords, and software control and management commands;
2) For each Web page to be identified, extract its OSSP page identification feature vector, then use the trained VSM classifier to identify whether the Web page is an OSSP page.
The present invention also provides a Web-oriented OSS resource extraction method comprising the following steps:
1) Identify the OSSP pages among the Web pages according to the above Web-oriented OSSP page identification method;
2) Find the OSS resources in the identified OSSP pages and download them locally.
Compared with the prior art, the present invention has the following technical effects:
1. The present invention can significantly improve the accuracy of Web-oriented OSSP page identification;
2. The present invention can improve the completeness of OSS resource searching and downloading;
3. The present invention can obtain OSS resources more accurately.
Description of drawings
Fig. 1 shows the flowchart of the Web-oriented OSSP page identification method of Embodiment 1 of the present invention;
Fig. 2 shows the flowchart of the Web-oriented OSS resource extraction method of Embodiment 2 of the present invention.
Detailed description of the embodiments
To better explain the present invention, the definitions of OSS and the OSSP page and the existing machine-learning-based text classification technique are introduced first.
① Definitions of OSS and the OSSP page
The Open Source Initiative (OSI) defines OSS as follows (covering 10 aspects):
1. Free redistribution
The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale.
2. Source code
The program must include source code, and distribution must be allowed in source code form as well as in compiled form. Where some form of a product is not distributed with source code, the user must be clearly informed how to obtain the source code, for example by downloading it free of charge over the Internet. The source code must be provided in the form a programmer would prefer for modifying the program. Deliberately obfuscated source code is not allowed. Intermediate forms, such as the output of a preprocessor or translator, are not allowed as source code.
3. Derived works
The license must allow modifications and derived works, and must allow them to be distributed under the same license as the original software.
4. Integrity of the author's source code
The license may restrict source code from being distributed in modified form only if it allows "patch files" to be distributed with the source code for the purpose of modifying the program at build time. The license must explicitly permit distribution of programs built from modified source code. The license may require derived works to carry a name or version number different from that of the original software.
5. No discrimination against persons or groups
The license must not discriminate against any person or group.
6. No discrimination against fields of endeavor
The license must not restrict anyone from using the program in a specific field. For example, it may not restrict the program from being used in business or for genetic research.
7. Distribution of license
The rights attached to the program must apply to all to whom the program is redistributed, without the need for those parties to obtain an additional license.
8. License must not be specific to a product
If the program is part of a particular software distribution, the rights attached to the program must not depend on that distribution. If the program is extracted from the distribution and used or distributed under the program's license, all parties to whom the program is redistributed should have the same rights as those granted with the original software distribution.
9. License must not restrict other software
The license must not place restrictions on other software that is distributed along with the licensed software. For example, the license must not require that all other software distributed with it be open source software.
10. License must be technology-neutral
No provision of the license may be predicated on any individual technology or style of interface.
The above definition is quite complex. In practice, those skilled in the art can judge whether a software resource is an OSS resource simply from two aspects: whether it has a License and whether it has source code. In the present invention, a software resource is deemed an OSS resource as long as it has both a License and source code. Each OSS resource generally has a corresponding open source software development community homepage; for convenience, this homepage is referred to herein as the OSSP. Clearly, if machine learning can be used to identify which of the massive number of Web pages are OSSP pages, the search for OSS resources can be significantly improved in both efficiency and completeness. In the present invention, an OSSP page is a page that provides an entry to a code version control repository and through which OSS source code can be downloaded and uploaded. Based on this definition, when constructing the initial sample set, a person of ordinary skill in the art can intuitively and unambiguously judge whether a Web page is an OSSP page.
② Existing machine-learning-based text classification technique
The prior art includes a machine-learning-based text classification technique that can be applied to Web page classification. A classical text classification method is briefly introduced below. Text classification mainly comprises the following steps:
Text representation → construction of the training sample set → classifier training → class prediction.
The principal method of text representation is the vector space model (VSM). In the prior art, words (or phrases) are mainly used as terms, and term frequency is the basis for computing weights. Each text d can be expressed as a vector of (term, weight) pairs: d = {(t_1, w_1d), (t_2, w_2d), ..., (t_n, w_nd)}.
A training sample set is a finite set consisting of text vectors and the class of each text, in the form shown in Table 1.
Table 1
Text      Term1                                Term2                                ......    Class
Text 1    Term frequency of Term1 in text 1    Term frequency of Term2 in text 1    ......    Sports
Text 2    Term frequency of Term1 in text 2    Term frequency of Term2 in text 2    ......    Music
......    ......                               ......                               ......    ......
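As a rough illustration of the term-frequency representation in Table 1, the following minimal Python sketch builds weight vectors for two toy texts. The vocabulary, the sample sentences, the tokenizer, and the helper name `tf_vector` are all assumptions made for illustration; they do not come from the patent.

```python
import re
from collections import Counter

def tf_vector(text, vocabulary):
    """Return the term-frequency weight of each vocabulary term in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary standing in for Term1, Term2, ... in Table 1.
vocabulary = ["goal", "match", "song"]

# Hypothetical labelled texts standing in for Text 1 and Text 2.
samples = [
    ("The match ended with a late goal, the second goal of the night.", "sports"),
    ("The new song is the best song on the album.", "music"),
]

# Each training sample pairs a term-frequency vector with its class.
training_set = [(tf_vector(text, vocabulary), label) for text, label in samples]
```

Each entry of `training_set` corresponds to one row of Table 1: a vector of term frequencies followed by the class of the text.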
At present, the main machine-learning-based classifiers include SVM, Bayes, linear classification, decision tree, and k-NN. SVM has a solid theoretical foundation and is more accurate than most other algorithms in many applications, especially on high-dimensional data. In addition, many researchers consider SVM the most accurate algorithm for text classification, so SVM is generally chosen as the main classifier.
Basic working principle of the SVM classifier (the simplest case): SVM is a linear learning system used mainly for binary classification. The training sample set is {(X_1, y_1), (X_2, y_2), ..., (X_n, y_n)}, where X_i = (x_i1, x_i2, ..., x_ir) is an r-dimensional input vector and y_i is the class label of X_i. For example, for Table 1, the input vector of text 1 is X_1 = (w_11, w_21, ..., w_r1), and its class label is y_1 ∈ {sports, music}.
SVM seeks a linear function (1):
f(X) = <W·X> + b (1)
If f(X_i) > 0, X_i is assigned to the positive class; otherwise it is assigned to the negative class, i.e. (2):
y_i = +1 if f(X_i) > 0; y_i = -1 otherwise (2)
For Table 1, f(X_1) > 0 indicates that text 1 is classified as y_1 = sports, and f(X_2) < 0 indicates that text 2 is classified as y_2 = music.
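The decision rule built on the linear function f(X) = <W·X> + b can be sketched in a few lines of Python. This only evaluates a given W and b; it does not show how an SVM finds them. The weight values and input vectors below are made-up numbers used purely for illustration.

```python
def decision_value(w, x, b):
    """Compute f(X) = <W . X> + b for one input vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, x, b):
    """Return +1 (positive class) if f(X) > 0, otherwise -1 (negative class)."""
    return 1 if decision_value(w, x, b) > 0 else -1

# Hypothetical parameters and term-frequency vectors (not from the patent).
w = [1.0, -1.0]
b = -0.5
label_text_1 = classify(w, [3, 1], b)   # f = 3 - 1 - 0.5 = 1.5
label_text_2 = classify(w, [0, 2], b)   # f = 0 - 2 - 0.5 = -2.5
```

With these assumed values, text 1 falls on the positive side of the hyperplane and text 2 on the negative side.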
From the above analysis it can be seen that using the classical text classification method to identify OSSP pages has the following defects:
1. OSSP pages differ from ordinary Web pages, so keywords cannot simply be chosen by term frequency. For example, words that are highly valuable for identifying an OSSP page, such as SVN, Git, CVS, and License, may have a low term frequency in the page and sometimes appear only once. As a result, in the classical text classification method, words that are unrelated to OSS but have a high term frequency may be fed to the machine-learning-based classifier as principal features, which lowers the accuracy of the recognition results.
2. Among the massive number of Web pages there are many OSS-related pages, such as pages that merely introduce a certain OSS. Such pages share many features of OSSP pages but lack an entry to a code version control repository; that is, the user cannot obtain source code from them. Clearly, the classical text classification method may misjudge many of these OSS-related pages as OSSP pages, which also greatly reduces the accuracy of the recognition results.
The present invention is further described below with reference to specific embodiments.
Embodiment 1
According to one embodiment of the present invention, a Web-oriented OSSP page identification method based on a VSM (vector space model) classifier is provided. The method comprises the following steps:
1) Select a group of keywords as the feature vector of the VSM classifier;
2) Train the VSM classifier with the initial sample set on the basis of the feature vector of step 1);
3) Identify Web pages with the trained VSM classifier.
This embodiment also provides a corresponding OSS resource extraction method, which identifies OSSP pages according to steps 1), 2), and 3) above, then finds OSS resources (such as source code) in those OSSP pages and downloads them to a local storage device.
Each of the above steps is described below.
I. Keyword selection
In step 1), the feature vector of the VSM classifier consists of a group of keywords of different types. In this embodiment, the keywords are divided by type into: software version control and management keywords, mailing list keywords, bug tracking keywords, developer list keywords, certificate keywords, log modification keywords, and task list keywords.
The software version control and management keywords include SVN, Git, and CVS. If a Web page contains any of SVN, Git, or CVS, the page is judged to have a software version control and management keyword; otherwise it is judged not to have one.
The mailing list keywords include Mailing Lists, Mail_List, and Email_List. If a Web page contains any of Mailing Lists, Mail_List, or Email_List, the page is judged to have a mailing list keyword; otherwise it is judged not to have one.
The bug tracking keywords include Bug Trackers, Issue Tracker, and Bug Report. If a Web page contains any of Bug Trackers, Issue Tracker, or Bug Report, the page is judged to have a bug tracking keyword; otherwise it is judged not to have one.
The developer list keywords include Developer List, Member List, Project Memberlist, Blogger List, View Members, and Author. If a Web page contains any of Developer List, Member List, Project Memberlist, Blogger List, View Members, or Author, the page is judged to have a developer list keyword; otherwise it is judged not to have one.
The certificate (that is, license) keywords include GPL, Apache License, BSD License, MIT License, Mozilla Public License, Common Development and Distribution License, and Eclipse Public License. If a Web page contains any of GPL, Apache License, BSD License, MIT License, Mozilla Public License, Common Development and Distribution License, or Eclipse Public License, the page is judged to have a certificate keyword; otherwise it is judged not to have one.
The log modification keywords include Change Log, Commit Log, and Update Log. If a Web page contains any of Change Log, Commit Log, or Update Log, the page is judged to have a log modification keyword; otherwise it is judged not to have one.
The task list keywords include Task Lists.
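The keyword rules above can be sketched as a small feature extractor that maps page text to a seven-component 0/1 vector. The keyword lists are taken from this section; the group identifiers, the function name `extract_features`, and the choice of case-insensitive whole-word matching are assumptions, since the text does not say how matching is performed.

```python
import re

# Keyword groups in the order used for the feature vector (names are hypothetical).
KEYWORD_GROUPS = [
    ("version_control", ["SVN", "Git", "CVS"]),
    ("mailing_list", ["Mailing Lists", "Mail_List", "Email_List"]),
    ("bug_tracking", ["Bug Trackers", "Issue Tracker", "Bug Report"]),
    ("developer_list", ["Developer List", "Member List", "Project Memberlist",
                        "Blogger List", "View Members", "Author"]),
    ("certificate", ["GPL", "Apache License", "BSD License", "MIT License",
                     "Mozilla Public License",
                     "Common Development and Distribution License",
                     "Eclipse Public License"]),
    ("log_modification", ["Change Log", "Commit Log", "Update Log"]),
    ("task_list", ["Task Lists"]),
]

def contains_keyword(page_text, keyword):
    """Assumed matching rule: case-insensitive, not embedded in a longer word."""
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return re.search(pattern, page_text, flags=re.IGNORECASE) is not None

def extract_features(page_text):
    """Return one 0/1 component per keyword group."""
    return [
        1 if any(contains_keyword(page_text, kw) for kw in keywords) else 0
        for _, keywords in KEYWORD_GROUPS
    ]

example = extract_features(
    "Browse the SVN repository. Licensed under the GPL. See the Change Log."
)
```

For the sample sentence, only the version control, certificate, and log modification components are set.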
II. Training the VSM classifier
In step 2), the initial training sample set is built mainly from the OSSP pages of known open source software co-development alliances (OSSF). In the initial training sample set, the VSM feature vector corresponding to a page is: (software version control and management keyword, mailing list keyword, bug tracking keyword, developer list keyword, certificate keyword, log modification keyword, task list keyword). The value of each component is "0" or "1", indicating that the page does not have or has a keyword of the corresponding type. The output value of the VSM classifier is also "0" or "1", indicating that the page is not or is an OSSP page.
To increase the accuracy of the VSM classifier, manually identified Web pages can be further added to the initial training sample set. According to the definition given above, a page that provides an entry to a code version control repository and allows OSS source code to be downloaded and uploaded can be regarded as an OSSP page. Based on this definition, a person of ordinary skill in the art can intuitively and unambiguously judge whether a Web page is an OSSP page.
Specifically, one of ordinary skill in the art judges, according to the definition of the OSSP page, whether each of 100 typical OSS-related Web pages is an OSSP page. This finally forms a training sample set whose attributes are whether the page contains a software version control and management keyword, a mailing list keyword, a bug tracking keyword, a developer list keyword, a certificate keyword, a log modification keyword, and a task list keyword, and whose class is whether the page is an OSSP page.
The VSM feature vector of each OSSP page or non-OSSP Web page in the initial training sample set is input to the VSM classifier, together with the corresponding VSM output value, thereby obtaining the initially trained VSM classifier.
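To make the training step concrete, here is a minimal sketch that fits a classifier to 0/1 feature vectors with 0/1 outputs. It deliberately swaps in a plain perceptron (a simple linear classifier) so that the example stays self-contained; it is not the SVM or decision tree discussed in this document. The six training samples are invented for illustration.

```python
def predict(weights, bias, features):
    """Output 1 (OSSP page) if the linear score is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train_perceptron(samples, epochs=100):
    """Fit weights with the perceptron rule; stop once an epoch has no errors."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for features, label in samples:
            delta = label - predict(weights, bias, features)
            if delta != 0:
                errors += 1
                weights = [w + delta * x for w, x in zip(weights, features)]
                bias += delta
        if errors == 0:
            break
    return weights, bias

# Hypothetical samples: (seven keyword components, 1 = OSSP page / 0 = not).
initial_samples = [
    ([1, 1, 1, 1, 1, 1, 0], 1),
    ([1, 0, 1, 1, 1, 0, 0], 1),
    ([1, 1, 0, 0, 1, 1, 1], 1),
    ([0, 0, 0, 1, 1, 0, 0], 0),
    ([0, 1, 0, 0, 0, 0, 0], 0),
    ([0, 0, 0, 0, 1, 0, 0], 0),
]
weights, bias = train_perceptron(initial_samples)
```

Because the toy samples are linearly separable, the loop stops once every training sample is classified correctly; real page data may not be separable, which is one reason a different classifier might be preferred.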
III. Identification of Web pages
In step 3), for each Web page to be identified, the computer checks whether the page contains a software version control and management keyword, a mailing list keyword, a bug tracking keyword, a developer list keyword, a certificate keyword, a log modification keyword, and a task list keyword, thereby obtaining the VSM feature vector of the page. This feature vector is input to the trained VSM classifier to obtain the VSM output value: if the output value is "1", the Web page is an OSSP page; if it is "0", the Web page is not an OSSP page.
When the VSM output value is "1", the VSM feature vector of the current Web page and its VSM output value can additionally be added to the training sample set so that the VSM classifier can be trained further, which further improves identification accuracy.
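The identification step and the optional feedback into the training sample set can be sketched as follows. The function name `identify_and_extend` and the rule-based `toy_classifier` are hypothetical stand-ins; in practice the classifier would be the trained VSM classifier, and retraining on the enlarged set is not shown.

```python
def identify_and_extend(feature_vector, classify, training_set):
    """Classify one page; on output 1, keep it as a new training sample."""
    output = classify(feature_vector)
    if output == 1:
        training_set.append((list(feature_vector), output))
    return output

def toy_classifier(features):
    """Hypothetical stand-in for a trained classifier: requires a version
    control keyword (component 0) and a certificate keyword (component 4)."""
    return 1 if features[0] == 1 and features[4] == 1 else 0

training_set = []
first = identify_and_extend([1, 0, 1, 0, 1, 1, 0], toy_classifier, training_set)
second = identify_and_extend([0, 1, 0, 0, 1, 0, 0], toy_classifier, training_set)
```

Only the page classified as an OSSP page is appended, so the training set grows with positive examples as more pages are processed.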
The above embodiment uses keywords in the text of the OSSP page as features for training the classifier. However, using only keywords in the text may lead to false positives. For example, a page that merely introduces a certain OSS has many features of an OSSP page but lacks an entry to a code version control repository; that is, the user cannot obtain source code from it, so such a Web page is not an OSSP page. When only keywords in the text are used as the feature vector, many such introductory pages may be misjudged as OSSP pages. The present invention therefore also provides a preferred embodiment, which is essentially the same as the preceding embodiment except that it uses a different VSM feature vector. In the preferred embodiment, the VSM feature vector includes an OSSP page structure feature in addition to the keywords. The OSSP page structure feature includes the software control and management command: one more component, the software control and management command, is added to the VSM feature vector, and its value is "1" or "0" according to whether the Web page contains such a command. The rest of the preferred embodiment is identical to the first embodiment described above and is not repeated here.
In the preferred embodiment, the software control and management commands include: the command for the first download, the command for downloading the latest updates from the server, the command for checking out a particular revision, the command for adding a tracked file, the command for deleting a tracked file, and the command for committing changes. That is, if a Web page contains any of these commands, the page is judged to have a software control and management command; otherwise it is judged not to have one.
Further, the software control and management commands of an OSSP page include the following (covering the common commands of the three mainstream version control tools: svn, cvs, and git):
(1) First download, including the source code and the version repository
For SVN, the command is:
svn checkout http://path/to/repo repo_name
For CVS, the command is:
cvs checkout project_Lname
For Git, the command is:
git-clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6
(2) Download the latest updates from the server
For SVN, the command is:
svn update [-r rev] PATH
For CVS, the command is:
cvs update
For Git, the command is:
git-pull git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
(3) Check out a particular revision
For SVN, the command is:
svn checkout -r <rev>
For CVS, the command is:
cvs checkout -r rel-1-0 tc
For Git, the command is:
git reset --hard -r <rev>
(4) Add a tracked file
For SVN, the command is:
svn add PATH...
For CVS, the command is:
cvs add new_file
For Git, the command is:
git-add Documentation/Sandwiches
(5) Delete a tracked file
For SVN, the command is:
svn delete PATH
For CVS, the command is:
cvs rm file_name
For Git, the command is:
git rm /path/to/file
(6) Commit changes
For SVN, the command is:
svn status -v PATH
For CVS, the command is:
cvs commit -m "write some comments here" file_name
For Git, the command is:
git commit
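One way to derive the eighth feature component is to scan the page text for any of the command forms listed above. The sketch below uses a single regular expression; the pattern, the case-sensitive matching, and the function name `has_control_command` are assumptions, and the sample strings (including the `example.org` URL) are made up.

```python
import re

# Matches the svn, cvs, and git subcommands listed above, including the
# dashed forms such as "git-clone" and "git-pull".
COMMAND_PATTERN = re.compile(
    r"\b(?:svn\s+(?:checkout|update|add|delete|status)"
    r"|cvs\s+(?:checkout|update|add|rm|commit)"
    r"|git[\s-]+(?:clone|pull|reset|add|rm|commit))\b"
)

def has_control_command(page_text):
    """Return 1 if the page text contains a listed command, else 0."""
    return 1 if COMMAND_PATTERN.search(page_text) else 0

with_svn = has_control_command("Run: svn checkout http://path/to/repo repo_name")
with_git = has_control_command("git-clone git://example.org/project.git project")
intro_only = has_control_command("This project uses Git and SVN for version control.")
```

A page that only mentions the tool names, like the last sample, yields 0 for this component even though it would still set the version control keyword component.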
In addition to the above embodiments, the VSM feature vector of the present invention can also use other combinations. For example, any seven of the following can be selected to form the VSM feature vector: software version control and management keywords, mailing list keywords, bug tracking keywords, developer list keywords, certificate keywords, log modification keywords, task list keywords, and software control and management commands.
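The possible seven-component feature vectors can be enumerated directly. In this short sketch the component identifiers are hypothetical names for the eight feature types; the enumeration itself uses the standard `itertools.combinations`.

```python
from itertools import combinations

# Hypothetical identifiers for the eight feature components.
COMPONENTS = [
    "version_control_keyword",
    "mailing_list_keyword",
    "bug_tracking_keyword",
    "developer_list_keyword",
    "certificate_keyword",
    "log_modification_keyword",
    "task_list_keyword",
    "control_command",
]

# Every way of choosing seven of the eight components.
seven_component_vectors = list(combinations(COMPONENTS, 7))

# Adding the full eight-component vector gives all variants described.
all_variants = seven_component_vectors + [tuple(COMPONENTS)]
```

There are eight ways to drop one component, so together with the full vector nine feature-vector layouts are possible.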
The VSM classifier can be an SVM, Bayes, linear, decision tree, or k-NN classifier. Among these, SVM and decision tree are suited to binary classification. Since the above embodiments are binary classification problems, the preferred classifiers are SVM and decision tree.
Some actual test data for the present invention are given below.
Description of the test sample set: the test sample set is constructed in the same way as the classifier's training sample set. One of ordinary skill in the art judges, according to the definition of the OSSP page, whether each of 100 typical OSS-related Web pages is an OSSP page. This finally forms a sample set whose attributes are whether the page contains a software version control and management keyword, a mailing list keyword, a bug tracking keyword, a developer list keyword, a certificate keyword, a log modification keyword, and a task list keyword, and whose class is whether the page is an OSSP page.
Experimental conditions:
Hardware configuration: SONY NW series (CPU: dual-core 2.1 GHz, memory: 4 GB)
Software configuration: the operating system is Windows 7, the development and runtime environment is Eclipse Java EE IDE for Web Developers, and the database is MySQL 5.0.89.
The definition of accuracy is shown in Table 2:
Table 2
[Table 2: image BSA00000401651200111 in the original publication]
Note: TP: the number of actual positive examples correctly classified as positive (true positives)
FN: the number of actual positive examples incorrectly classified as negative (false negatives)
FP: the number of actual negative examples incorrectly classified as positive (false positives)
TN: the number of actual negative examples correctly classified as negative (true negatives)
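The four counts above can be computed and combined as in the sketch below. The formulas are the standard textbook definitions of accuracy, precision and recall; whether they coincide exactly with the definitions in Table 2 is an assumption, since the table survives only as an image reference. The function names and the sample labels are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Standard metrics; each falls back to 0.0 when its denominator is zero."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented labels for eight pages (1 = OSSP page).
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(actual, predicted)
scores = metrics(*counts)
```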
The classification results obtained by different classifiers using the method of the present invention are shown in Table 3:
Table 3
Figure BSA00000401651200112
As Table 3 shows, whether an SVM classifier or a decision tree classifier is used, the present invention improves significantly on traditional machine-learning-based text classification methods.
Embodiment 2
This embodiment provides an automatic, intelligent OSS resource extraction method. Starting from the web pages of open source software alliances, the method follows the various links on each page, learns the features of pages and links, identifies OSSP pages automatically and efficiently, and finally extracts the OSS resources and stores them in a local database. In this embodiment, the OSS resource may be OSS information, which comprises the software name, development community entry address, development team entry address, mailing list entry address, bug list entry address, and code version control system entry address.
As shown in Figure 2, the OSS information extraction method of this embodiment comprises the following steps:
Step 1: enter the URLs of the major open source software alliances and store them in the link buffer queue (the seed link list).
Step 2: automatically read one unread link from the queue and analyze the web page it points to. Follow the different types of links in the page, judge whether each web page encountered along the traversal path is an OSSP page, and capture the OSSP pages that are identified, eventually forming a set of OSSP pages while updating the link buffer queue. The OSSP page identification method is the same as in Embodiment 1 and is not repeated here. When a new OSSP page is identified, it can be added to the OSSP training sample set so that the classifier is trained continuously.
Step 3: automatically analyze each OSSP page in the set of OSSP pages, identify the relevant attributes of the OSS, and extract the OSS information corresponding to each OSSP page. The OSS information comprises the software name, development community entry address, development team entry address, mailing list entry address, bug list entry address, and code version control system entry address.
Step 4: store the extracted OSS information in a database table whose fields are <software name, development community entry address, development team entry address, mailing list entry address, bug list entry address, code version control system entry address>.
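Steps 1 to 4 can be sketched as a small crawl loop. This is an illustrative outline under several assumptions, not the implementation of this embodiment: page fetching, OSSP identification and information extraction are passed in as functions, the "web" is a hard-coded dictionary (`FAKE_WEB`) so that the example runs offline, the in-memory `sqlite3` database stands in for the MySQL database mentioned in the test setup, and all names, URLs and the `max_pages` limit are hypothetical. The continuous retraining mentioned in step 2 is omitted.

```python
import sqlite3
from collections import deque


def crawl(seed_urls, fetch, is_ossp, extract_info, max_pages=100):
    """Steps 1-3: breadth-first traversal from the seed link list."""
    queue = deque(seed_urls)          # link buffer queue (step 1)
    seen = set(seed_urls)
    ossp_pages = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()         # read one unread link (step 2)
        visited += 1
        text, links = fetch(url)
        if is_ossp(text):
            ossp_pages.append((url, text))
        for link in links:            # update the link buffer queue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    # Step 3: extract one record per identified OSSP page.
    return [extract_info(url, text) for url, text in ossp_pages]


def store(conn, records):
    """Step 4: save the extracted records in a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS oss_info ("
        "software_name TEXT, community_url TEXT, team_url TEXT, "
        "mailing_list_url TEXT, bug_list_url TEXT, vcs_url TEXT)")
    conn.executemany("INSERT INTO oss_info VALUES (?, ?, ?, ?, ?, ?)", records)
    conn.commit()


# Hypothetical offline "web": url -> (page text, outgoing links).
FAKE_WEB = {
    "forge.example/": ("Project directory",
                       ["forge.example/alpha", "forge.example/about"]),
    "forge.example/alpha": ("alpha: SVN, Mailing Lists, Bug Report",
                            ["forge.example/"]),
    "forge.example/about": ("About this site", []),
}


def fetch(url):
    return FAKE_WEB[url]


def is_ossp(text):
    # Stand-in for the trained VSM classifier of Embodiment 1.
    return "Mailing Lists" in text


def extract_info(url, text):
    # Stand-in extraction: invented entry addresses derived from the URL.
    name = text.split(":")[0]
    return (name, url, url + "/team", url + "/mail", url + "/bugs", url + "/svn")


conn = sqlite3.connect(":memory:")
records = crawl(["forge.example/"], fetch, is_ossp, extract_info)
store(conn, records)
```

Injecting `fetch`, `is_ossp` and `extract_info` keeps the traversal logic independent of any particular HTTP client or classifier, so each part can be replaced or tested separately.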
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently substituted without departing from its spirit and scope.

Claims (12)

1. A Web-oriented VSM classifier training method, comprising:
training a VSM classifier with an initial sample set based on an OSSP page identification feature vector, wherein the OSSP page identification feature vector is a VSM classifier feature vector whose components are 7 of, or all 8 of, the following: software version control management keywords, mailing list keywords, bug tracking keywords, developer list keywords, license keywords, change log keywords, task list keywords, and software version control management commands.
2. The Web-oriented VSM classifier training method according to claim 1, characterized in that the software version control management keywords comprise SVN, Git or CVS.
3. The Web-oriented VSM classifier training method according to claim 1, characterized in that the mailing list keywords comprise Mailing Lists, Mail_List or Email_List.
4. The Web-oriented VSM classifier training method according to claim 1, characterized in that the bug tracking keywords comprise Bug Trackers, Issue Tracker or Bug Report.
5. The Web-oriented VSM classifier training method according to claim 1, characterized in that the developer list keywords comprise Developer, Developer List, Member List, Project Memberlist, Blogger List, View Members or Author.
6. The Web-oriented VSM classifier training method according to claim 1, characterized in that the license keywords comprise License, GPL, Apache License, BSD License, MIT License, Mozilla Public License, Common Development and Distribution License or Eclipse Public License.
7. The Web-oriented VSM classifier training method according to claim 1, characterized in that the change log keywords comprise Change Log, Commit Log or Update Log.
8. The Web-oriented VSM classifier training method according to claim 1, characterized in that the task list keywords comprise Task Lists.
9. The Web-oriented VSM classifier training method according to claim 1, characterized in that the software version control management commands comprise a first-time download command, a command to download the latest update from the server, a command to check out a specific revision, a command to add a tracked file, a command to delete a tracked file, or a command to commit changes.
10. A Web-oriented OSSP page identification method, characterized by comprising the following steps:
1) for each Web page to be identified, extracting the OSSP page identification feature vector of that Web page, wherein the OSSP page identification feature vector is a VSM classifier feature vector whose components are 7 of, or all 8 of, the following: software version control management keywords, mailing list keywords, bug tracking keywords, developer list keywords, license keywords, change log keywords, task list keywords, and software version control management commands;
2) then using a VSM classifier trained by the VSM classifier training method according to any one of claims 1 to 9 to identify whether the Web page is an OSSP page.
11. A Web-oriented OSS resource acquisition method, characterized by comprising the following steps:
1) identifying the OSSP pages among Web pages using the Web-oriented OSSP page identification method according to claim 10;
2) searching for OSS resources in the identified OSSP pages.
12. The Web-oriented OSS resource acquisition method according to claim 11, characterized by further comprising the following step:
3) downloading the OSS resources found in step 2) to local storage.
CN201010609743.0A 2010-12-15 2010-12-15 Web-oriented VSM classifier training method, OSSP page identification method and OSS resource extraction method Expired - Fee Related CN102541913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010609743.0A CN102541913B (en) 2010-12-15 2010-12-15 Web-oriented VSM classifier training method, OSSP page identification method and OSS resource extraction method

Publications (2)

Publication Number Publication Date
CN102541913A true CN102541913A (en) 2012-07-04
CN102541913B CN102541913B (en) 2017-10-03

Family

ID=46348830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010609743.0A Expired - Fee Related CN102541913B (en) Web-oriented VSM classifier training method, OSSP page identification method and OSS resource extraction method 2010-12-15 2010-12-15

Country Status (1)

Country Link
CN (1) CN102541913B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103078897A (en) * 2012-11-29 2013-05-01 中山大学 System for implementing fine grit classification and management of Web services
CN103226509A (en) * 2013-04-08 2013-07-31 上海华力微电子有限公司 Method for automatically analyzing system log
CN110188536A (en) * 2019-05-22 2019-08-30 北京邮电大学 Application program detection method and device
CN110990035A (en) * 2019-11-01 2020-04-10 中国人民解放军63811部队 Chain type software upgrading method based on Git

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1719436A (en) * 2004-07-09 2006-01-11 中国科学院自动化研究所 A kind of method and device of new proper vector weight towards text classification
CN101055621A (en) * 2006-04-10 2007-10-17 中国科学院自动化研究所 Content based sensitive web page identification method
CN101281521A (en) * 2007-04-05 2008-10-08 中国科学院自动化研究所 Method and system for filtering sensitive web page based on multiple classifier amalgamation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
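白洁,李春平 (Bai Jie, Li Chunping): "面向软件开发信息库的数据挖掘综述" (A survey of data mining for software development repositories), 《计算机应用研究》 (Application Research of Computers), vol. 25, no. 1, 31 January 2008 (2008-01-31) *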

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103078897A (en) * 2012-11-29 2013-05-01 中山大学 System for implementing fine grit classification and management of Web services
CN103078897B (en) * 2012-11-29 2015-11-18 中山大学 A kind of system realizing Web service fine grit classification and management
CN103226509A (en) * 2013-04-08 2013-07-31 上海华力微电子有限公司 Method for automatically analyzing system log
CN103226509B (en) * 2013-04-08 2016-03-30 上海华力微电子有限公司 A kind of method of system journal automatic analysis
CN110188536A (en) * 2019-05-22 2019-08-30 北京邮电大学 Application program detection method and device
CN110990035A (en) * 2019-11-01 2020-04-10 中国人民解放军63811部队 Chain type software upgrading method based on Git

Also Published As

Publication number Publication date
CN102541913B (en) 2017-10-03

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171003

Termination date: 20201215
