CN105389330A - Cross-community matched correlation method for open source resources

Info

Publication number: CN105389330A
Application number: CN201510617004.9A
Authority: CN
Inventors: 王怀民; 尹刚; 王涛; 宋晨希; 范强; 史殿习; 刘惠; 丁博; 史佩昌; 杨程; 侯翔; 湛云
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2015-09-21
Filing date: 2015-09-21
Publication date: 2016-03-09
Anticipated expiration: 2035-09-21
Also published as: CN105389330B

Abstract

The invention relates to a cross-community matching and association method for open source software. Open source software data and online document data are crawled from the major open source project hosting communities and knowledge sharing communities on the internet using general-purpose Web crawler technology; association matching is performed using the project names and project tags of the open source software and the titles and tags of the online documents; different weights are assigned to different match types; online documents are thereby associated with open source software across communities, which improves the efficiency with which developers find information related to the open source software they use.

Description

A cross-community matching and association method for open source resources

Technical field

The present invention relates to a cross-community matching and association method for open source software, and in particular to a method for associating open source software in open source project hosting communities with online documents in knowledge sharing communities.

Background art

In software engineering in the internet era, as more and more people take part in software development, software reuse has become increasingly important, and most software development builds on existing software or platforms. Open source software provides a large body of resources for software reuse. On the one hand, the emergence of open source software gives developers a large supply of relatively reliable resources, reduces duplicated work, and lets developers devote more effort to user requirements and core technology. On the other hand, many developers and organizations have begun to contribute to the development and maintenance of open source software, making open source an ever stronger trend.

To provide developers with open source software resources and information, websites about open source software on the internet currently fall into two classes. One class consists of project hosting communities that provide online hosting and resource downloads for open source software, such as GitHub (https://github.com/), SourceForge (https://sourceforge.com/), and Open Hub (https://www.openhub.net/). The other class consists of knowledge sharing communities that give developers a platform for discussion and exchange, such as Stack Overflow (http://stackoverflow.com/) abroad, and CSDN (http://www.csdn.net/) and cnblogs (http://www.cnblogs.com/) in China.

Project hosting communities offer users abundant open source software resources; users can search for and download the open source software they need. Hosting communities such as GitHub also serve as hosting platforms: users host the open source software they create on the platform and carry out code maintenance and ongoing development there. Users can also discuss problems encountered when using the software with its developers, or even join its development. Project hosting platforms thus provide a good environment for development and maintenance based on open source software and for collaborative development of the open source software itself.

Knowledge sharing communities mainly provide users with a platform for exchange. Users publish blog posts sharing their experience with software, and when they run into problems they post questions in forums to seek answers or to discuss with developers doing similar work. Blog posts and forum discussions are valuable information and resources for other users of the same software. The content of knowledge sharing communities is rich: news and information about open source software, technical blog posts written by users, forum discussion posts, and job postings from recruitment sites. For convenience, the information in these knowledge sharing communities is referred to collectively as online documents herein. Online documents are a great help to users of open source software. In addition, because open source resources are numerous and vary widely in quality, users choosing an open source resource also need to consult evaluations of the relevant software in knowledge sharing communities; discussions of open source resources in online documents serve as a reference for that choice.

However, these two classes of communities are usually separate and independent. Users must search for and download the open source resources they need from project hosting communities, and then, during use, search knowledge sharing communities for relevant online documents. Both classes of communities hold very large amounts of resources. Online documents in knowledge sharing communities in particular contain a great deal of redundant and junk information (such as low-value filler posts and advertisements), so users spend a long time filtering out the relevant, useful information. This greatly reduces users' development efficiency.

Summary of the invention

The technical problem to be solved by the present invention is: given the large number of open source software projects and related discussions on the internet, to provide a method that efficiently associates open source software in project hosting communities with online documents in knowledge sharing communities (for convenience, forum posts, blog posts, news, job postings, and the like in knowledge sharing communities are referred to collectively as "online documents" in this description), so that a user searching for an open source resource obtains, at the same time, the information about the software itself and the related blog posts, forum discussions, news, and other online documents. The user can thus understand the open source software from many angles without wasting time on junk information, which saves time and improves development efficiency.

The technical solution of the present invention comprises the following steps:

Step 101: use general-purpose Web crawler technology to obtain open source software information from the major open source project hosting communities on the internet, collecting project data that comprises the basic attributes of the open source software. The basic attributes comprise project name, project description, development language, creation time, crawl time, project tags, and project source address.

Step 102: use general-purpose Web crawler technology to obtain online document data related to open source software from the major knowledge sharing communities on the internet. The online document data comprises document title, document content, and basic document attributes; the basic document attributes comprise document tags, document publication time, and document source address.

Step 103: use the open source full-text search tool Lucene to build a file index over the document titles and document content of the collected online documents and the project names of the open source software.

Step 104: match the project name of the open source software against the document tags of the online documents. Using the project name as the keyword, search the tag table stored in the database; if an online document has a tag identical to the project name, associate that online document with the open source software and assign weight w₁, which represents the degree of association between the online document and the open source software as measured by project name and document tag.

Step 105: search the online document titles for the project name of the open source software. Using the project name as the keyword, search the online document titles in the file index built in step 103; if a document title contains the project name, associate the document with the project and assign this association weight w₂, which represents the degree of association between the online document and the open source software as measured by project name and document title.

Step 106: for every association already established, count the number x of project tags of the open source software that appear in the online document title, and compute the weight w₃ = 0.5 × log₂(x² + 1). This weight uses the number of project tags occurring in the document title as a measure of the degree of association between the project and the online document; the computed weight is used to judge the confidence of the association result.

Step 107: for every association already established, match the project tags of the associated open source software against the document tags of the online document, count the number y of tags that occur in both, and compute the weight w₄ = 0.6 × log₂(y² + 1). This weight uses the number of tags shared by the project and the document as a measure; the computed weight is used to judge the confidence of the association result.

Step 108: compute the final weight w = w₁ + w₂ + (w₁ + w₂) × (w₃ + w₄). When the association weight w is greater than a threshold q, the online document is considered to be associated with the open source software. The association result is stored in the database as a record of the form [open source software, online document, weight], completing the cross-community association.

Further, the project tags in step 101 are stored separately in the tag table as records of the form [project id, tag], and the other attributes of the open source software are stored in the open source software table; the document tags in step 102 are likewise stored in the tag table as records of the form [document id, tag], and the document title, document content, and other basic document attributes are stored in the document table.

Further, the search in step 105 is implemented with the open source full-text search tool Lucene.

Further, the final weight w in step 108 is computed on the following basis: the weights of step 106 and step 107 are computed only when a match was found in step 104 or step 105 (that is, w₁ or w₂ is not 0), and the term (w₁ + w₂) × (w₃ + w₄) expresses that the influence of steps 106 and 107 on the final weight builds on steps 104 and 105.
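Factoring the step 108 formula makes this gating explicit. The following is only an algebraic restatement of the formula given above, not additional claim language:

```latex
\[
w \;=\; w_1 + w_2 + (w_1 + w_2)(w_3 + w_4)
  \;=\; (w_1 + w_2)\,\bigl(1 + w_3 + w_4\bigr)
\]
```

When w₁ = w₂ = 0 the product is 0 whatever the values of w₃ and w₄, so tag overlap alone can never create an association; it can only scale up an association that a name match has already established.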

The present invention achieves the following effects. It helps software developers in activities such as developing with open source software, and it also helps ordinary users understand the information about and functions of open source software. Based on the characteristics of the two current classes of open source communities and the problems users face, the invention first collects open source software and online document information from open source communities on the internet, and then associates online documents with open source software across communities using the association algorithm, so that a user searching for open source software also obtains the related blog posts, discussion posts, and other information. This method associates the data of the two classes of communities for the first time and can greatly improve the efficiency with which users obtain related data.

Therefore, when searching for open source software, the user obtains both the software information and the related online documents from open source communities, such as discussion posts and blog posts. The user can obtain the required information more quickly and efficiently, which aids understanding and use of the open source software.

Brief description of the drawings

Fig. 1 is a flowchart of the cross-community matching and association method for open source software of the present invention;

Fig. 2 is a schematic diagram of the network node interactions involved in an embodiment of the cross-community matching method for open source software of the present invention.

Embodiment

A separate list of data collection sites is maintained for open source project hosting websites and for open source community websites. A general-purpose Web crawler periodically crawls open source software data and online document data from the two site lists and stores them in the open source software data server SDS and the online document data server DDS, respectively.

Step 101: obtain open source software information. General-purpose Web crawler technology is used to collect the project data of open source software from the major open source project hosting communities on the internet. The project data comprises the basic attributes of the open source software: project name, project description, development language, creation time, crawl time, project tags, and project source address. Because one open source software project may have multiple tags, the software tags are stored separately in the tag table as records of the form [project id, tag 1], [project id, tag 2]. The other attributes of the open source software are stored in the open source software table.

Step 102: obtain online document information from each knowledge sharing community. As in step 101, a general-purpose Web crawler is used to collect online document data related to open source software, including blog posts, forum discussion posts, and news, from the major knowledge sharing communities on the internet. The online document data comprises document title, document content, and basic document attributes; the basic document attributes comprise document tags, document publication time, and document source address. The document tags are stored in the tag table as records of the form [document id, tag], and the document title, document content, and other basic document attributes are stored in the document table. The server periodically collects the latest document data from the crawl list at a configured time interval.
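The storage layout described in steps 101 and 102 can be sketched with SQLite as follows. This is a minimal sketch under stated assumptions: the table and column names are hypothetical, and splitting the tag table into one table per record shape is a choice made here for clarity (the description only fixes the [project id, tag] and [document id, tag] record forms).

```python
import sqlite3

# Hypothetical schema: names are illustrative, not taken from the patent.
SCHEMA = """
CREATE TABLE software (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    description TEXT,
    language TEXT,
    created_at TEXT,
    crawled_at TEXT,
    source_url TEXT
);
CREATE TABLE document (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT,
    published_at TEXT,
    source_url TEXT
);
-- One row per tag, because a project or document may carry several tags.
CREATE TABLE software_tag (project_id INTEGER, tag TEXT);
CREATE TABLE document_tag (document_id INTEGER, tag TEXT);
"""


def create_store(path=":memory:"):
    """Open a database and create the four tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_software(conn, project_id, name, tags):
    """Store a project row plus one [project id, tag] row per tag."""
    conn.execute("INSERT INTO software (id, name) VALUES (?, ?)", (project_id, name))
    conn.executemany(
        "INSERT INTO software_tag VALUES (?, ?)", [(project_id, t) for t in tags]
    )


def add_document(conn, document_id, title, tags):
    """Store a document row plus one [document id, tag] row per tag."""
    conn.execute("INSERT INTO document (id, title) VALUES (?, ?)", (document_id, title))
    conn.executemany(
        "INSERT INTO document_tag VALUES (?, ?)", [(document_id, t) for t in tags]
    )
```

Keeping tags in their own rows is what lets step 104 look up documents by tag with a single indexed query instead of parsing a delimited string.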

Step 103: use the open source full-text search tool Lucene to build a file index over the titles and content of the collected online documents and the names of the open source software, to speed up association matching and retrieval.

Taking the open source software Hadoop as an example, the matching process of step 104 is as follows: search the tag table stored in the database for all community documents that have the tag "Hadoop" (case-insensitive); if a document D has the tag "Hadoop", associate document D with the open source software Hadoop and assign this association weight w₁ (in actual computation, w₁ = 1).
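The Hadoop example can be sketched as a small function. This is an illustration only: the function name and the in-memory list of [document id, tag] rows are assumptions, standing in for a query against the tag table.

```python
W1 = 1.0  # value the description reports using for w1


def match_by_tag(project_name, document_tag_rows):
    """Return {document_id: W1} for documents tagged with the project name.

    document_tag_rows is an iterable of (document_id, tag) pairs.
    The comparison ignores case, as the description requires.
    """
    wanted = project_name.casefold()
    return {
        doc_id: W1
        for doc_id, tag in document_tag_rows
        if tag.casefold() == wanted
    }
```

For instance, `match_by_tag("Hadoop", [(1, "hadoop"), (1, "Java"), (2, "Spark")])` associates only document 1 with the project.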

Step 105: search the online document titles for the project name of the open source software. Using the project name as the keyword, search the online document titles in the file index built in step 103; if a document title contains the project name, associate the document with the project and assign this association weight w₂, which represents the degree of association between the online document and the open source software as measured by project name and document title. In actual computation, w₂ = 0.8.
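The idea behind the title search can be sketched with a tiny inverted index. Note that this is a plain-Python stand-in, not Lucene: the function names are hypothetical, and the regex tokenizer is a simplification that would need a proper analyzer for titles in languages written without spaces.

```python
import re
from collections import defaultdict

W2 = 0.8  # value the description reports using for w2


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.casefold())


def build_title_index(titles):
    """Map each token to the set of document ids whose title contains it.

    titles is a dict of {document_id: title}.
    """
    index = defaultdict(set)
    for doc_id, title in titles.items():
        for token in tokenize(title):
            index[token].add(doc_id)
    return index


def match_by_title(project_name, index):
    """Return {document_id: W2} for titles containing every token of the name."""
    tokens = tokenize(project_name)
    if not tokens:
        return {}
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return {doc_id: W2 for doc_id in hits}
```

Building the index once and reusing it for every project name is the reason the method indexes titles in step 103 rather than scanning all titles per project.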

Because the relation between the number of matched tags and the confidence of the association result is not linear, a logarithm is used to express the relation between the tag count x and the confidence. The coefficient 0.5 controls the range of w₃. According to data analysis, the value of x generally does not exceed 2, but for a small amount of data (when the document title is especially long or the software has many tags) x may be very large. Controlling the range of w₃ prevents a large x from making the final weight w very large, which would reduce the accuracy of the result.

For example, the open source software Hadoop has the tags "Apache", "java", and "big data", and the title of a discussion post D is "A big data processing tool based on Java". The title of post D then contains 2 of Hadoop's tags ("java" and "big data"), and the computed weight is w₃ = 0.5 × log₂5.
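The w₃ computation for this example can be sketched as follows. The function name is hypothetical, and case-insensitive substring matching is an assumption made here (the example counts "java" against "Java"); a substring test can over-count, for instance "java" inside "JavaScript".

```python
import math


def tag_in_title_weight(project_tags, title):
    """Return (x, w3): the number of distinct project tags found in the
    title, and the weight 0.5 * log2(x^2 + 1) from step 106."""
    lowered = title.casefold()
    distinct_tags = {tag.casefold() for tag in project_tags}
    x = sum(1 for tag in distinct_tags if tag in lowered)
    return x, 0.5 * math.log2(x * x + 1)


x, w3 = tag_in_title_weight(
    ["Apache", "java", "big data"],
    "A big data processing tool based on Java",
)
print(x, round(w3, 3))
```

The logarithm keeps the weight bounded in practice: even a title matching ten tags yields only about 0.5 × log₂101, roughly 3.3.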

As with the formula in step 106, the formula for w₄ is based on a nonlinear relation between the number y of matched tags and the confidence of the association result. The coefficient 0.6 likewise controls the range of w₄. Relative to w₃, the number y of tags shared by the open source software and the online document contributes more to the confidence of the association result than x does: if the tags of the open source software and of the online document intersect, the two are more likely to be related than if the document title merely contains a tag of the open source software. The coefficient of w₄ is therefore set to 0.6.

Again taking Hadoop and post D: post D has the tags "Java", "distributed", and "Mapreduce", and the open source software Hadoop has the tags "Apache", "java", and "big data". The number y of tags common to Hadoop and D is 1, and the computed weight is w₄ = 0.6 × log₂(1² + 1) = 0.6 × log₂2 = 0.6.
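A sketch of the shared-tag count and the step 107 formula, w₄ = 0.6 × log₂(y² + 1), is given below. The function name is hypothetical, and comparing tags case-insensitively is an assumption drawn from the example, which treats "Java" and "java" as the same tag.

```python
import math


def shared_tag_weight(project_tags, document_tags):
    """Return (y, w4): the number of tags present on both the project and
    the document, and the weight 0.6 * log2(y^2 + 1) from step 107."""
    project_set = {tag.casefold() for tag in project_tags}
    document_set = {tag.casefold() for tag in document_tags}
    y = len(project_set & document_set)
    return y, 0.6 * math.log2(y * y + 1)
```

Using set intersection means a tag repeated on one side is counted once, which matches the description's notion of the number of tags that occur in both.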

Step 108: compute the final weight. Steps 104 to 107 match four different kinds of data of the open source software and the online document and assign a weight to the association result of each step. After these four matching steps, the final weight is computed as w = w₁ + w₂ + (w₁ + w₂) × (w₃ + w₄).

The final weight w is computed on the following basis: because the matching of step 104 and step 105 plays a decisive role in determining whether an open source software project is associated with an online document, the weights of step 106 and step 107 are computed only when a match was found in step 104 or step 105 (w₁ or w₂ is not 0), and the term (w₁ + w₂) × (w₃ + w₄) expresses that the influence of steps 106 and 107 on the final weight builds on steps 104 and 105.

In actual computation, the threshold q is set to 1.3. Analysis of the data suggests that when an open source software project is associated with an online document, at least one of the document's title and tags contains the software name (w₁ or w₂ is not 0), and either a software tag appears in the document title or the software and the document share a tag (w₃ or w₄ is not 0). The threshold q should therefore be greater than 1. The larger q is, the higher the precision of the association results but the lower the recall, that is, the fewer matches are returned. Experimental comparison shows that with q = 1.3 the association results keep relatively high precision while recall also remains relatively high.
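The final weight and the threshold test can be sketched as below, using the formula from step 108 and the reported value q = 1.3. The function names are hypothetical.

```python
def final_weight(w1, w2, w3, w4):
    """Combine the four step weights: w = w1 + w2 + (w1 + w2) * (w3 + w4).

    When neither name match fired (w1 + w2 == 0) the result is 0, so tag
    overlap alone never produces an association.
    """
    base = w1 + w2
    if base == 0:
        return 0.0
    return base + base * (w3 + w4)


def is_associated(w, q=1.3):
    """An association is kept only when its weight exceeds the threshold."""
    return w > q
```

With the reported values, a tag-name match plus one shared tag gives 1.0 × (1 + 0.6) = 1.6 and is kept, whereas a title-only match plus one shared tag gives 0.8 × (1 + 0.6) = 1.28 and falls just below q = 1.3. This illustrates how the threshold treats a name match in the document tags as stronger evidence than a name match in the title.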

The values of the weights w₁ to w₄ and the threshold q were mainly tuned and determined experimentally. The weights and the threshold are relative to one another; each weight represents the degree of trust in the corresponding matching step. For example, in step 104, when an online document's tags contain the software name, the online document is considered very likely to be associated with that open source software.

Before the weights were determined, the confidence of each step's associations was analyzed experimentally. Taking step 104 as an example, all associations established in step 104 (those in which a tag of the online document contains the software project name) were analyzed for accuracy, which was found to be about 90%; accordingly, the weight w₁ was set to 1.0. The other weights were determined in a similar way. Once the weights and the way of computing the final weight were fixed, the threshold was determined from an analysis of the association results and the weights.

Using the association method of steps 101 to 108, the open source software data in the SDS is associated with the online document data in the DDS. Associations are stored separately in the association table of the database as records of the form [open source software, online document, association weight], where the software and the document are stored as their ids in the SDS and DDS. For example, if the open source software Hadoop has id 1234 in the database, the online document D associated with Hadoop has id 5678, and the association weight is 1.58, the stored record is [1234, 5678, 1.58]. The associations of all software are stored in a single association table, which simplifies querying and maintenance.
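Storing such a record can be sketched with SQLite. The table name, column names, and the primary key on the (software, document) pair are assumptions made for this illustration; the description specifies only the three-field record.

```python
import sqlite3


def save_association(conn, software_id, document_id, weight):
    """Insert or update one [software, document, weight] record."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS association ("
        "software_id INTEGER, document_id INTEGER, weight REAL, "
        "PRIMARY KEY (software_id, document_id))"
    )
    # INSERT OR REPLACE overwrites the weight if the pair is already stored.
    conn.execute(
        "INSERT OR REPLACE INTO association VALUES (?, ?, ?)",
        (software_id, document_id, weight),
    )


conn = sqlite3.connect(":memory:")
save_association(conn, 1234, 5678, 1.58)
```

Keying the table on the id pair means recomputing an association after new data is crawled replaces the old row instead of duplicating it.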

When the collected data is updated or grows, only the changed data is processed with this method to update the associations between the two data sets, and the association table in the database is updated accordingly.

When a user requests the data of an open source software project P, the project information is first read from the SDS, comprising the basic project attributes (project name, project description, development language, creation time, crawl time, project tags, project source address, etc.) and the development team attributes (developer list, developer mailing list, etc.). The association table is then queried for the online documents associated with this project; the query result is the set of online documents related to the open source software. For each document D associated with the project, the document's content and attributes (publication time, update time, source address, author information, etc.) are read from the DDS, and the project information is returned to the user together with the attributes of the associated documents.

The effect of the invention is illustrated below with an embodiment. Fig. 2 is the interaction diagram for this example. The example comprises an open source software data server SDS, an online document data server DDS, an association storage server, open source project websites SF1 and SF2, knowledge sharing community websites SP1 and SP2, and a user request used to illustrate the interaction flow. The invention obtains open source software data from websites SF1 and SF2 and stores it in the SDS, obtains online document data from SP1 and SP2 and stores it in the DDS, and uses the cross-community association algorithm to associate the data in the SDS and DDS; that is, for a given open source software project, it mines the online document database for the documents that concern this software. When a user requests open source software information, the server returns both the software information and the related online document information.

The invention is presented to and interacts with users in the form of web pages. Users see the list of crawled open source software on a web page; when a user clicks on or searches for an open source software project, the invention displays the requested software information on the page together with the information of its associated online documents. This implementation comprises the following steps:

Step 201: use a Web crawler to crawl open source software data from open source software websites and store the data in the data server.

Step 202: use a crawler to collect online document data (blog posts, discussion posts, news, etc.) from knowledge sharing communities related to open source software, and store the data in the data server.

Step 203: use the cross-community association algorithm for open source software shown in Fig. 1 to associate the open source software data with the online document data, and store the associations separately.

Step 204: after receiving a user's request to look up an open source software project P, the SDS looks up the attributes of this software (name, description, creation time, etc.).

Step 205: the SDS looks up the list of online documents associated with P in the association table.

Step 206: for each online document D in the association list of P, its title, content, and attributes (tags, publication time, source address, etc.) are looked up in the DDS.

Step 207: the details of P (name, description, development language, creation time, crawl time, tags, source address, etc.) and the information of all online documents D related to P (title, content, tags, publication time, source address, etc.) are displayed together on the web page.
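Steps 204 to 207 can be sketched as a single lookup function. In this illustration plain dictionaries and a list stand in for the SDS, the DDS, and the association table; the function name is hypothetical, and ordering documents by descending weight is an assumption, since the description does not say how results are ordered.

```python
def handle_request(project_id, sds, associations, dds):
    """Assemble the page data for one open source project.

    sds:          {project_id: project attribute dict}
    associations: list of (project_id, document_id, weight) records
    dds:          {document_id: document attribute dict}
    """
    project = sds[project_id]  # step 204: read project attributes
    linked = sorted(  # step 205: find associated documents
        (row for row in associations if row[0] == project_id),
        key=lambda row: row[2],
        reverse=True,
    )
    documents = [  # step 206: read each document's attributes
        dict(dds[doc_id], weight=weight) for _, doc_id, weight in linked
    ]
    return {"project": project, "documents": documents}  # step 207


page = handle_request(
    1234,
    sds={1234: {"name": "Hadoop"}},
    associations=[(1234, 5678, 1.58), (1234, 5679, 1.6), (42, 5678, 2.0)],
    dds={5678: {"title": "Hadoop notes"}, 5679: {"title": "Hadoop tags"}},
)
```

Because the associations are precomputed and stored, serving a request involves only lookups; none of the matching steps runs at request time.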

The above embodiment shows that the invention can provide software developers, across the internet, with online document information related to the open source software they need. Because the online documents in the system are collected from multiple knowledge sharing communities on the internet, the collection scope is broad and the content rich (blog posts, discussion posts, news, job postings, and so on). When a user searches for open source software, the associated discussions and technical write-ups are supplied at the same time, so that with a single request the user obtains information from every knowledge sharing community and gains a broad, thorough understanding of the software's functions and characteristics, improving the efficiency of using open source software. In addition, because the invention mainly targets software developers who use open source software, data crawling and the algorithm implementation are highly targeted. The crawl lists for open source software and online documents were carefully considered and strictly screened, aiming to make the crawled information as comprehensive as possible while ensuring data quality. For example, communities with many advertisements or low-value filler posts are excluded from the crawl list, whereas communities in which programmers are active, such as Stack Overflow, are a focus of crawling. The online documents presented to users are therefore professional discussions of open source software, so that a user who runs into a problem with open source software can obtain comprehensive, professional information in one place on the platform of the invention rather than searching item by item with a traditional search engine, which greatly improves developers' efficiency in finding information related to the open source software they use.

Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the invention may be modified or replaced by equivalents without departing from its spirit and scope.

Claims

1. A cross-community matching and association method for open source software, comprising the following steps:

Step 101: using general-purpose Web crawler technology to obtain open source software information from the major open source project hosting communities on the internet, collecting project data that comprises the basic attributes of the open source software, the basic attributes comprising project name, project description, development language, creation time, crawl time, project tags, and project source address;

Step 102: using general-purpose Web crawler technology to obtain online document data related to open source software from the major knowledge sharing communities on the internet, the online document data comprising document title, document content, and basic document attributes, and the basic document attributes comprising document tags, document publication time, and document source address;

Step 103: using the open source full-text search tool Lucene to build a file index over the document titles and document content of the collected online documents and the project names of the open source software;

Step 104: matching the project name of the open source software against the document tags of the online documents, wherein the tag table stored in the database is searched using the project name as the keyword, and if an online document has a tag identical to the project name, that online document is associated with the open source software and assigned weight w₁, which represents the degree of association between the online document and the open source software as measured by project name and document tag;

Step 105: searching the online document titles for the project name of the open source software, wherein the online document titles in the file index built in step 103 are searched using the project name as the keyword, and if a document title contains the project name, the document is associated with the project and this association is assigned weight w₂, which represents the degree of association between the online document and the open source software as measured by project name and document title;

Step 108: computing the final weight w = w₁ + w₂ + (w₁ + w₂) × (w₃ + w₄), wherein when the association weight w is greater than a threshold q, the online document is considered to be associated with the open source software, and the association result is stored in the database as a record of the form [open source software, online document, weight], completing the cross-community association.

2. The method of claim 1, wherein the project tags in step 101 are stored separately in the tag table as records of the form [project id, tag], and the other attributes of the open source software are stored in the open source software table; and the document tags in step 102 are stored in the tag table as records of the form [document id, tag], and the document title, document content, and other basic document attributes are stored in the document table.

3. The method of claim 1, wherein the search in step 105 is implemented with the open source full-text search tool Lucene.

4. The method of claim 1, wherein the final weight w in step 108 is computed on the following basis: the weights of step 106 and step 107 are computed only when a match was found in step 104 or step 105 (w₁ or w₂ is not 0), and the term (w₁ + w₂) × (w₃ + w₄) expresses that the influence of steps 106 and 107 on the final weight builds on steps 104 and 105.