CN105389327B - Method for automatically constructing a label hierarchy for large-scale open source software - Google Patents

Method for automatically constructing a label hierarchy for large-scale open source software

Info

Publication number
CN105389327B
Authority
CN
China
Prior art keywords
label
open source
preliminary
page
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510617001.5A
Other languages
Chinese (zh)
Other versions
CN105389327A (en)
Inventor
王怀民
王涛
尹刚
谷崇明
杨程
史殿习
刘惠
丁博
史佩昌
刘步权
湛云
侯翔
李翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201510617001.5A
Publication of CN105389327A
Application granted
Publication of CN105389327B
Legal status: Active
Anticipated expiration

Abstract

A method for automatically constructing a label hierarchy for open source software. Project label information is extracted with an existing extraction tool, and the extracted labels of each project are paired with one another to form undirected label pairs. The number of occurrences of each label pair is counted across all records, and a direction is assigned to each pair according to the relative frequencies of its two labels, forming directed edges of the form <label pair, frequency>. Connecting the directed edges yields several connected graphs, which constitute the preliminary label hierarchy. The existing classification hierarchy of a website is then obtained and compared edge by edge with the preliminary hierarchy, which is optimized accordingly. The method organizes the large volume of open source resources in open source communities hierarchically and improves the accuracy and efficiency of locating open source software.

Description

Method for automatically constructing a label hierarchy for large-scale open source software
Technical field
The present invention relates to a method for automatically constructing a label hierarchy, and in particular to a method for automatically constructing a label hierarchy of open source software, provided to address the retrieval difficulties caused by the large amount of open source software on the Internet.
Background art
In recent years, as the open source movement has flourished, a large amount of open source software has continuously flowed into open source communities (GitHub, SourceForge, Open Hub, etc.). This brings abundant reusable software resources but also makes resource retrieval challenging. To better manage such a vast amount of open source software, some communities have introduced labeling systems that encourage users to label the software. Some communities even use free-form labeling systems that let users attach arbitrary labels, so a single project may carry dozens of labels. Such labeling may lower label quality, but it also greatly enriches the label data, so that software is described comprehensively from different angles.
These communities classify software by label, which alleviates the problem of large-scale resource retrieval to some extent. However, because of the limitations of labels themselves, the resulting structure tends to be flat and does not fully account for the associations between labels, so label-based resource location performs poorly. If the relationships between labels are taken into account and a label hierarchy is established, the accuracy and efficiency of resource location can be greatly improved.
At present, methods for constructing a software label hierarchy fall into two kinds: manual and automatic. The manual approach (as at SourceForge) generally engages domain experts, who use their domain knowledge to analyze, study, and summarize the hierarchical structure of software labels, but this consumes a large amount of manpower and material resources. The automatic approach usually builds the hierarchy from the labels of open source software. The general method is first to measure the relationship between each pair of labels and then to build the software classification hierarchy from those relationships. In general, there are two ways to measure the relationship between labels: generality and similarity. Generality indicates how general a label is within the classification hierarchy: the larger a label's generality value, the broader the category it represents, the more content it covers, and the higher its position in the hierarchy. Similarity indicates how similar two labels are. Generality or similarity is usually measured using set theory or topic models. As far as we know, existing automatic construction methods have been shown to work only under experimental conditions and are difficult to use in real scenarios.
Therefore, how to make efficient use of domain knowledge such as existing label information and existing classification hierarchies to automatically construct the hierarchical structure of software labels, and thereby improve software retrieval efficiency, is an important problem of great concern to those skilled in the art.
Summary of the invention
In view of the above deficiencies in the prior art, the object of the present invention is to make full use of existing classification hierarchies, combine them with a label semantic measure, and propose an automated method for constructing a software label hierarchy that improves the quality of the open source software classification hierarchy and the efficiency of software retrieval.
The technical solution of the present invention comprises the following steps:
Step 101: crawl the project information pages of the major open source communities using general crawler technology, and extract the project label information using an existing extraction tool. Each project page yields one record with the fields <project name, label set>, where the label set may be stored in the format <label 1><label 2>...<label n>, with n greater than or equal to 1. The resulting series of project label records is stored in local database A.
Step 102: for each record in local database A, pair the project labels in its label set with one another to form undirected label pairs. Then, over the label pairs computed from all records, count and record the number of occurrences of each pair, forming <label pair, frequency> relations. At the same time, count the number of occurrences of every label, which serves as the measure of label generality.
Step 103: sort the <label pair, frequency> relations in descending order of frequency and take the label pairs in the first N relations (N is greater than or equal to 1 and determines the size of the hierarchy) as the edges of the hierarchy. Then add a direction to each selected edge according to label generality, pointing from the label with greater generality to the label with lesser generality.
Step 104: the directed edges thus formed are the edge representation of several connected graphs, and connecting the edges to one another forms these graphs. All of the resulting graphs are retained, and together they constitute the preliminary software label classification hierarchy. The subsequent optimization may connect some of these graphs.
Step 105: optimize the preliminary software label classification hierarchy according to an existing label classification hierarchy.
Step 106: periodically check the project information of the major open source communities for updates. If any project labels have been updated, crawl and extract the affected pages, update local database A, and execute step 102 and the subsequent steps; if there are no updates, sleep and wait for the next check.
Further, the optimization according to the existing label classification hierarchy in step 105 comprises:
Step 105.1: crawl the pages that contain the classification hierarchy in the same open source community using a general crawler. Crawling starts from a page that contains a top-level node and proceeds layer by layer through the pages that contain the lower-level node structure. Then extract the classification hierarchy information using a general extraction tool, format it as directed edges of the form start -> end, and store them in local database B.
Step 105.2: check each directed edge in the preliminarily constructed classification hierarchy. If both vertices of the edge appear in database B, and the path that connects the two vertices through several edges in database B has the same direction as in the preliminary classification hierarchy, then check whether the other vertices on that path appear in the preliminary classification hierarchy; if some do, add to the preliminary label hierarchy all the paths formed by those vertices together with the two endpoints. If the path connecting the two vertices in database B has a different direction from that in the preliminary classification hierarchy, delete the edge from the preliminary label hierarchy. If at least one of the two vertices of the edge does not appear in database B, do nothing and keep the edge in the current preliminary label hierarchy.
The present invention achieves the following technical effects:
Addressing the difficulty and high cost of constructing an open source software hierarchy, the method mines a software label hierarchy from the label information of the software and makes full use of existing classification hierarchies to optimize the result, further improving the quality of the classification hierarchy and thus the efficiency of software retrieval. It also updates the classification hierarchy automatically, which saves a large amount of resources and meets users' demands for the quality and efficiency of software retrieval.
Brief description of the drawings
Fig. 1 is a flow chart of the automated construction of the software label hierarchy according to the present invention;
Fig. 2 is a diagram of a constructed preliminary classification hierarchy;
Fig. 3 is a flow chart of the optimization of the preliminary classification hierarchy in the present invention.
Detailed description of the embodiments
Fig. 1 shows the flow chart of the automated construction of the software label hierarchy according to the present invention, which executes the following steps:
Step 101: crawl the project information pages of the major open source communities using a crawler (for example the open source crawler WebMagic, or any other general crawling technique), and extract the project label information using an existing extraction tool (for example XPath). Each project page yields one record with the fields <project name, label set>, where the label set may be stored in the format <label 1><label 2>...<label n>, with n greater than or equal to 1. The resulting series of project label records is stored in local database A.
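By way of non-limiting illustration, the record format of step 101 might be sketched in Python as follows. The function names, the table name `project_labels`, the sample project names, and the use of an in-memory SQLite database to stand in for local database A are all assumptions of this sketch; crawling and XPath extraction are not shown.

```python
import re
import sqlite3

def serialize_labels(labels):
    """Encode a label list as <label 1><label 2>...<label n>, with n >= 1."""
    if not labels:
        raise ValueError("a project record needs at least one label")
    return "".join(f"<{label}>" for label in labels)

def parse_labels(field):
    """Decode the stored string back into a list of labels."""
    return re.findall(r"<([^<>]+)>", field)

def store_records(conn, records):
    """Write (project name, labels) records into a table standing in for database A."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS project_labels "
        "(project TEXT PRIMARY KEY, labels TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO project_labels VALUES (?, ?)",
        [(name, serialize_labels(labels)) for name, labels in records],
    )
    conn.commit()

def load_records(conn):
    """Read every record back as (project name, list of labels)."""
    rows = conn.execute(
        "SELECT project, labels FROM project_labels ORDER BY project"
    )
    return [(name, parse_labels(field)) for name, field in rows]

# Hypothetical records; in the method they would come from crawled pages.
database_a = sqlite3.connect(":memory:")
store_records(database_a, [("project-x", ["java", "crawler"]),
                           ("project-y", ["python"])])
print(load_records(database_a))
```

Keying the table on the project name means that a re-crawled project simply overwrites its earlier record, which suits the periodic update of step 106.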
Step 102: for each record in local database A, pair the labels in its label set with one another. For example, if the label set of a project is <tag1><tag2><tag3>, the label pairs formed are <tag1, tag2>, <tag1, tag3>, and <tag2, tag3>. The label pairs formed in this way are undirected. Then, over the label pairs computed from all records, count and record the number of occurrences of each pair, forming <label pair, frequency> relations. At the same time, count the number of occurrences of every label, which serves as the measure of label generality.
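The counting in step 102 might be sketched in Python as follows. The function name and the choice to sort each label set so that an undirected pair always has one canonical form are assumptions of this sketch, not details given in the description.

```python
from collections import Counter
from itertools import combinations

def count_pairs_and_labels(label_sets):
    """Return (pair frequency, label frequency) over all project label sets."""
    pair_freq = Counter()
    label_freq = Counter()
    for labels in label_sets:
        # Sorting gives each undirected pair a single canonical (a, b) form.
        unique = sorted(set(labels))
        label_freq.update(unique)
        pair_freq.update(combinations(unique, 2))
    return pair_freq, label_freq

# The first label set is the example from step 102; the second is hypothetical.
pair_freq, label_freq = count_pairs_and_labels([
    ["tag1", "tag2", "tag3"],
    ["tag2", "tag1"],
])
print(pair_freq)
print(label_freq)
```

The label frequencies returned here are the generality measure used to direct the edges in step 103.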
Step 103: sort the <label pair, frequency> relations in descending order of frequency and take the label pairs in the first N relations (N is greater than or equal to 1 and determines the size of the hierarchy) as the edges of the hierarchy. Then add a direction to each selected edge according to label generality, the direction pointing from the label with greater generality to the label with lesser generality.
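A minimal Python sketch of step 103 follows. The description does not say how ties are resolved, so both tie rules below (alphabetical order among pairs of equal frequency, and keeping the stored order of a pair whose labels have equal generality) are assumptions of this sketch, as are the function name and the sample frequencies.

```python
def build_directed_edges(pair_freq, label_freq, n):
    """Select the n most frequent label pairs and direct each by generality."""
    # Descending frequency; alphabetical order breaks ties (assumed rule).
    top = sorted(pair_freq.items(), key=lambda item: (-item[1], item[0]))[:n]
    edges = []
    for (a, b), _ in top:
        # The more general (more frequent) label becomes the start point.
        if label_freq[a] >= label_freq[b]:
            edges.append((a, b))
        else:
            edges.append((b, a))
    return edges

# Hypothetical frequencies for illustration only.
pair_freq = {("linux", "os"): 5, ("email", "os"): 1, ("kernel", "linux"): 3}
label_freq = {"os": 10, "linux": 6, "kernel": 3, "email": 2}
print(build_directed_edges(pair_freq, label_freq, 2))
```

With N = 2 the least frequent pair is dropped, which shows how N bounds the size of the resulting hierarchy.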
Step 104: the directed edges formed above can be regarded as the edge representation of several connected graphs, and connecting the edges to one another forms these graphs. For example, the five edges A -> B, A -> C, B -> C, B -> D, and E -> F produce the structure shown in Fig. 2. All of the resulting graphs are retained, and together they constitute the preliminary software label classification hierarchy; the subsequent optimization may connect some of these graphs.
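The grouping of directed edges into connected graphs might be sketched in Python as follows, using the five edges of the example above. Treating the edges as undirected for the purpose of grouping (that is, computing weakly connected components) is this sketch's interpretation of "connected graph"; the function name is hypothetical.

```python
from collections import defaultdict

def connected_graphs(edges):
    """Group the vertices of a directed edge list into weakly connected components."""
    neighbours = defaultdict(set)
    for start, end in edges:
        # Direction is ignored when deciding which vertices belong together.
        neighbours[start].add(end)
        neighbours[end].add(start)
    seen = set()
    components = []
    for vertex in sorted(neighbours):
        if vertex in seen:
            continue
        component = set()
        stack = [vertex]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("E", "F")]
print(connected_graphs(edges))
```

For these edges the sketch yields two graphs, one containing A, B, C, and D and one containing E and F, and both are retained.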
Step 105: optimize the preliminary software label classification hierarchy according to an existing label classification hierarchy, as shown in Fig. 3.
Step 105.1: crawl the pages that contain the classification hierarchy in the same open source community (for example SourceForge) using a crawler, usually starting from a page that contains a top-level node and proceeding layer by layer through the pages that contain the lower-level node structure. Then extract the classification hierarchy information using a general extraction tool such as XPath. In its raw form this classification information is embedded in links such as http://sourceforge.net/directory/system-administration/distributed-computing/, in which System Administration is the parent node of Distributed Computing. The crawler iteratively follows such links downward from the top-level page, and the classification hierarchy is then extracted from them.
The extracted information is formatted as directed edges of the form start -> end, for example Communications -> Email, and stored in local database B. The upper-level and lower-level nodes are determined at extraction time: the upper-level node is taken as the start point and the lower-level node as the end point of the directed edge.
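The conversion of a directory link into start -> end edges might be sketched in Python as follows. The function name and the assumption that the category path begins after a `directory` path segment are inferred from the single example link in the description and may not hold for other sites.

```python
from urllib.parse import urlparse

def edges_from_directory_url(url, prefix="directory"):
    """Turn the category segments of a directory link into parent -> child edges."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    if prefix in segments:
        # Keep only the category segments that follow the prefix.
        segments = segments[segments.index(prefix) + 1:]
    # Each segment is the parent (start point) of the segment after it.
    return list(zip(segments, segments[1:]))

url = ("http://sourceforge.net/directory/"
       "system-administration/distributed-computing/")
print(edges_from_directory_url(url))
```

A link with only one category segment yields no edge, since a top-level node has no parent.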
Step 105.2: check each directed edge in the preliminarily constructed classification hierarchy. If both vertices of the edge appear in database B, and the path that connects the two vertices through several edges in database B has the same direction as in the preliminary classification hierarchy, then check whether the other vertices on that path appear in the preliminary classification hierarchy; if some do, add to the preliminary label hierarchy all the paths formed by those vertices together with the two endpoints. For example, suppose the edge is A -> D, both A and D appear in database B, and database B contains the path A -> B -> C -> D connecting them, where B appears in the preliminary classification hierarchy and C does not. Then A -> B and B -> D are added to the preliminary classification hierarchy.
If the path connecting the two vertices in database B has a different direction from that in the preliminary classification hierarchy, delete the edge from the preliminary label hierarchy. If at least one of the two vertices of the edge does not appear in database B, do nothing and keep the edge in the current preliminary label hierarchy.
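The three cases of step 105.2 might be sketched in Python as follows. This is a simplified reading rather than a definitive implementation: it considers only one shortest path between the two vertices in database B instead of all paths, it keeps an edge whose vertices are both in database B but are not connected in either direction (a case the description does not address), and the function names are hypothetical.

```python
from collections import defaultdict, deque

def find_path(adjacency, start, goal):
    """Breadth-first search; return one shortest start -> goal path, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for successor in adjacency.get(node, ()):
            if successor not in parents:
                parents[successor] = node
                queue.append(successor)
    return None

def optimize_hierarchy(preliminary_edges, reference_edges):
    """Refine the preliminary edges against the edges stored in database B."""
    adjacency = defaultdict(list)
    reference_nodes = set()
    for start, end in reference_edges:
        adjacency[start].append(end)
        reference_nodes.update((start, end))
    known = {node for edge in preliminary_edges for node in edge}
    result = set(preliminary_edges)
    for start, end in preliminary_edges:
        if start not in reference_nodes or end not in reference_nodes:
            continue  # a vertex is missing from database B: keep the edge
        forward = find_path(adjacency, start, end)
        if forward is not None:
            # Same direction: link the path vertices already known to the hierarchy.
            kept = [node for node in forward if node in known]
            result.update(zip(kept, kept[1:]))
        elif find_path(adjacency, end, start) is not None:
            result.discard((start, end))  # opposite direction: delete the edge
    return result

# Database B holds A -> B -> C -> D; B is known to the hierarchy, C is not.
reference = [("A", "B"), ("B", "C"), ("C", "D")]
print(sorted(optimize_hierarchy([("A", "D"), ("B", "E")], reference)))
```

In this run A -> B and B -> D are added alongside the original edges, matching the worked example in step 105.2, while B -> E is kept untouched because E does not appear in database B.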
Step 106: periodically check the project information of the major open source communities for updates. If any project labels have been updated, crawl and extract the affected pages, update local database A, and execute step 102 and the subsequent steps; if there are no updates, sleep and wait for the next check. When the process is stopped manually, step 107 is executed and the construction ends.
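The update loop of step 106 might be sketched in Python as follows. The callables `fetch`, `rebuild`, and `should_stop`, the in-memory dictionary standing in for local database A, and the default one-hour interval are all assumptions of this sketch; the description fixes neither the checking interval nor the way updates are detected.

```python
import time

def apply_updates(stored, fetched):
    """Merge freshly crawled label sets into the stored ones; return changed projects."""
    changed = []
    for project, labels in fetched.items():
        if set(stored.get(project, ())) != set(labels):
            stored[project] = list(labels)
            changed.append(project)
    return changed

def run_update_loop(stored, fetch, rebuild, should_stop, interval_seconds=3600):
    """Re-run step 102 onward after each change; sleep when nothing changed."""
    while not should_stop():  # a manual stop corresponds to step 107
        if apply_updates(stored, fetch()):
            rebuild(stored)
        else:
            time.sleep(interval_seconds)

# Hypothetical stored records and a crawl result that changes one project.
stored = {"project-x": ["java"]}
print(apply_updates(stored, {"project-x": ["java", "crawler"]}))
```

Comparing label sets rather than lists means that a mere reordering of labels on a page does not trigger a rebuild.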
The method automates the acquisition of open source software information from the Internet and simplifies the tedious, repetitive manual acquisition process, so that the classification hierarchy is updated automatically and existing classification hierarchies are fully used to optimize the result.
Fig. 3 shows the flow chart of the optimization of the preliminary classification hierarchy in the present invention, which executes the following steps:
Step 201: crawl an open source community that has a classification hierarchy, obtain the classification hierarchy, format it as <start, end> pairs, and store it in local database B;
Step 202: loop over each edge in the classification hierarchy preliminarily constructed by the present invention;
Step 203: check whether the edge appears in local database B; if it does, execute step 204; otherwise, keep the edge and proceed to the next iteration;
Step 204: check whether the direction of the edge is consistent with the direction in local database B; if it is, add to the preliminary software label hierarchy the paths formed by the labels that appear on the path and also appear in local database A; otherwise, delete the edge from the preliminary software label hierarchy;
Step 205: when the loop finishes, end the optimization process.
Through the scheme of the present invention, a label hierarchy can be constructed automatically from the label attributes of open source software and used to organize the large volume of open source resources in open source communities hierarchically, thereby improving the accuracy and efficiency of locating open source software.
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that the technical solution may be modified or equivalently replaced without departing from its spirit and scope.

Claims (2)

1. A method for automatically constructing a label hierarchy for open source software, comprising the following steps:
Step 101: crawling the project information pages of open source communities using general crawler technology, and extracting the project label information using an existing extraction tool, wherein each project page yields one record with the fields <project name, label set>, the label set may be stored in the format <label 1><label 2>...<label n>, with n greater than or equal to 1, and the resulting series of project label records is stored in local database A;
Step 102: for each record in local database A, pairing the project labels in its label set with one another to form undirected label pairs, then, over the label pairs computed from all records, counting and recording the number of occurrences of each pair to form <label pair, frequency> relations, and counting the number of occurrences of every label as the measure of label generality;
Step 103: sorting the <label pair, frequency> relations in descending order of frequency, taking the label pairs in the first N relations as the edges of the hierarchy, and then adding a direction to each selected edge according to label generality, pointing from the label with greater generality to the label with lesser generality, wherein N is greater than or equal to 1 and determines the size of the hierarchy;
Step 104: the directed edges thus formed being the edge representation of several connected graphs, connecting the edges to one another to form these graphs, and retaining all of the resulting graphs, thereby constructing the preliminary software label classification hierarchy;
Step 105: optimizing the preliminary software label classification hierarchy according to an existing label classification hierarchy;
Step 106: periodically checking the project information of the major open source communities for updates; if any project labels have been updated, crawling and extracting the affected pages, updating local database A, and executing step 102 and the subsequent steps; if there are no updates, sleeping and waiting for the next check.
2. The method of claim 1, wherein step 105 further comprises:
Step 105.1: crawling the pages that contain the classification hierarchy in the same open source community using a general crawler, wherein crawling starts from a page that contains a top-level node and proceeds layer by layer through the pages that contain the lower-level node structure, then extracting the classification hierarchy information using a general extraction tool, formatting it as directed edges of the form start -> end, and storing them in local database B;
Step 105.2: checking each directed edge in the preliminarily constructed classification hierarchy; if both vertices of the edge appear in database B, and the path that connects the two vertices through several edges in database B has the same direction as in the preliminary classification hierarchy, checking whether the other vertices on that path appear in the preliminary classification hierarchy and, if some do, adding to the preliminary label hierarchy all the paths formed by those vertices together with the two endpoints; if the path connecting the two vertices in database B has a different direction from that in the preliminary classification hierarchy, deleting the edge from the preliminary label hierarchy; if at least one of the two vertices of the edge does not appear in database B, doing nothing and keeping the edge in the current preliminary label hierarchy.
CN201510617001.5A 2015-09-21 Method for automatically constructing a label hierarchy for large-scale open source software Active CN105389327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510617001.5A CN105389327B (en) 2015-09-21 Method for automatically constructing a label hierarchy for large-scale open source software

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510617001.5A CN105389327B (en) 2015-09-21 Method for automatically constructing a label hierarchy for large-scale open source software

Publications (2)

Publication Number Publication Date
CN105389327A (en) 2016-03-09
CN105389327B (en) 2019-07-16

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101799814A (en) * 2009-12-31 2010-08-11 茂名学院 Method for gathering free classification label into reticular classification structure
CN102760149A (en) * 2012-04-05 2012-10-31 中国人民解放军国防科学技术大学 Automatic annotating method for subjects of open source software
CN104199857A (en) * 2014-08-14 2014-12-10 西安交通大学 Tax document hierarchical classification method based on multi-tag classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant