CN105389327B - Method for automatically constructing a tag hierarchy for large-scale open source software - Google Patents
Method for automatically constructing a tag hierarchy for large-scale open source software
- Publication number
- CN105389327B (application number CN201510617001.5A)
- Authority
- CN
- China
- Prior art keywords
- label
- open source
- preliminary
- page
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
A method for automatically constructing a tag hierarchy for open source software. Project tag information is extracted with an existing extraction tool, and the extracted tags of each project are combined pairwise to form undirected tag pairs. The number of occurrences of each tag pair is then counted across all projects and recorded, and a direction is added to each pair according to the relative frequency of its two tags, forming <tag pair, frequency> edges. The directed edges connect to one another into several connected graphs, which give the preliminary tag hierarchy. The existing classification hierarchy of a website is then obtained and compared edge by edge with the preliminary hierarchy in order to optimize it. The method organizes the large volume of open source resources in open source communities hierarchically and improves the accuracy and efficiency of locating open source software.
Description
Technical field
The present invention relates to a method for automatically constructing a tag hierarchy, and in particular to a method for automatically constructing a tag hierarchy for open source software, provided to address the retrieval difficulties caused by the large amount of open source software on the Internet.
Background art
In recent years, as the open source movement has flourished, large amounts of open source software have continually flowed into open source communities (Github, Sourceforge, Openhub, etc.). This brings abundant reusable software resources, but it also makes resource retrieval more challenging. To better manage such a vast amount of open source software, some communities have introduced tagging systems that encourage users to annotate open source software. Some of them go further and use free-form tagging, which lets users attach arbitrary tags to software, so that a single project may carry dozens of tags. On the one hand, this can lower tag quality; on the other hand, it greatly enriches the tag data, so that software is described comprehensively from different angles.
These communities classify software by means of tags, which alleviates the problem of large-scale resource retrieval to some extent. However, because of the limitations of tags themselves, the resulting structure tends to be flat and the relationships between tags are not fully taken into account, so tag-based resource location works poorly. If the relationships between tags are taken into account and a tag hierarchy is established, the accuracy and efficiency of resource location can be greatly improved.
At present, methods for building a software tag hierarchy fall into two kinds: manual and automatic. The manual approach (used by Sourceforge) generally engages domain experts, who use their domain knowledge to analyze, study, and summarize the hierarchical structure of software tags, but it consumes a great deal of manpower and material resources. The automatic approach usually builds the hierarchy from the tags of open source software. The general procedure is first to measure the relationship between every two tags and then to build the software classification hierarchy from those relationships. In general, the relationship between tags is measured in one of two ways: generality or similarity. Generality expresses how general a tag is within the classification hierarchy: the larger a tag's generality value, the broader the category it represents, the more content it covers, and the higher its position in the hierarchy. Similarity expresses how alike two tags are. The generality or similarity of tags is usually measured with set theory or topic models. As far as we know from the material currently available, existing automatic construction methods have only been demonstrated under experimental conditions and are difficult to use in real scenarios.
Therefore, how to make efficient use of domain knowledge such as existing tag information and existing classification hierarchies to construct the hierarchical structure of software tags automatically, and thereby improve software retrieval efficiency, is an important problem of great concern to those skilled in the art.
Summary of the invention
In view of the above deficiencies in the prior art, the object of the present invention is to make full use of existing classification hierarchies and, in combination with a tag semantic measure, to propose an automated method for constructing a software tag hierarchy that improves the quality of the open source software classification hierarchy and the efficiency of software retrieval.
The technical solution of the present invention comprises the following steps:
Step 101: Crawl the project information pages of the major open source communities using general crawler technology, and extract the project tag information with an existing extraction tool. Each project page yields a record with the fields <project name, tag set>, where the tag set can be stored in the format <tag 1><tag 2>...<tag n> with n greater than or equal to 1. The resulting series of project tag records is stored in local database A.
Step 102: For each record in local database A, pair the project tags in the tag set with one another in every two-tag combination, forming a number of undirected tag pairs. Then, over the tag pairs computed from all records, count and record the number of occurrences of each tag pair, forming <tag pair, frequency> relations. At the same time, count the number of occurrences of every tag, which serves as the measure of tag generality.
Step 103: Sort the <tag pair, frequency> relations in descending order of frequency and take the tag pairs in the first N relations (N is greater than or equal to 1 and determines the size of the hierarchy) as the edges of the hierarchy. Then add a direction to each selected edge according to tag generality: the edge points from the tag with the larger generality to the tag with the smaller generality.
Step 104: The directed edges formed above are the edge representation of several connected graphs; connecting these edges to one another forms those graphs. All of the resulting graphs are retained, and together they constitute the preliminary software tag classification hierarchy. The subsequent optimization may connect some of these graphs to one another.
Step 105: Optimize the preliminary software tag classification hierarchy according to an existing tag classification hierarchy.
Step 106: Periodically check the project information of the major open source communities for updates. If the tags of a project have been updated, crawl and extract that page, update local database A, and execute step 102 and the subsequent steps. If there is no update, sleep and wait for the next check.
Further, the optimization according to the existing tag classification hierarchy in step 105 comprises:
Step 105.1: Crawl the pages that contain the classification hierarchy in the same open source community using a general crawler. Crawling starts from a page that contains the top-level nodes and proceeds level by level through the pages that contain the lower-level node structure. Then extract the classification hierarchy information with a general extraction tool, format it as directed edges of the form start->end, and store them in local database B.
Step 105.2: Check each directed edge in the preliminarily constructed classification hierarchy. If both vertices of the edge appear in database B, and the direction of the path that connects the two vertices through several edges in database B is the same as the direction of the edge in the preliminary classification hierarchy, then check whether the other nodes on that connecting path appear in the preliminary classification hierarchy. If some do, all paths formed by those nodes together with the two vertices are added to the preliminary tag hierarchy. If the direction of the path connecting the two vertices in database B differs from the direction in the preliminary classification hierarchy, the edge is deleted from the preliminary tag hierarchy. If at least one of the two vertices of the edge does not appear in database B, the edge is left unprocessed and retained in the current preliminary tag hierarchy.
The present invention achieves the following technical effects:
To address the difficulty and high cost of building a hierarchy for open source software, the method mines a software tag hierarchy from the tag information of the software and makes full use of existing classification hierarchies to optimize the result, further improving the quality of the classification hierarchy and hence the efficiency of software retrieval. It also updates the classification hierarchy automatically, saving a large amount of resources, and can meet users' demands for quality and efficiency in software retrieval.
Brief description of the drawings
Fig. 1 is a flowchart of the automated construction of the software tag hierarchy according to the present invention;
Fig. 2 shows a preliminary classification hierarchy as constructed;
Fig. 3 is a flowchart of the optimization of the preliminary classification hierarchy according to the present invention.
Detailed description of the embodiments
Fig. 1 shows the flowchart of the automated construction of the software tag hierarchy according to the present invention, which specifically executes the following steps:
Step 101: Crawl the project information pages of the major open source communities with a crawler (for example the open source crawler Webmagic; any other general crawling technology may also be used), and extract the project tag information with an existing extraction tool (such as XPath). Each project page yields a record with the fields <project name, tag set>, where the tag set can be stored in the format <tag 1><tag 2>...<tag n> with n greater than or equal to 1. The resulting series of project tag records is stored in local database A.
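The record format of step 101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the table name `project_tags`, the helper names, and the use of SQLite for "local database A" are all assumptions, and the crawling and XPath extraction themselves are not shown.

```python
import re
import sqlite3

# Matches one "<tag>" element of the "<tag 1><tag 2>...<tag n>" storage format.
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def serialize_tags(tags):
    """Encode a list of tags as '<tag 1><tag 2>...<tag n>'."""
    return "".join(f"<{tag}>" for tag in tags)


def parse_tags(serialized):
    """Decode '<tag 1><tag 2>...<tag n>' back into a list of tags."""
    return TAG_PATTERN.findall(serialized)


def store_records(conn, records):
    """Store (project name, tag list) records; hypothetical schema for database A."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS project_tags "
        "(project TEXT PRIMARY KEY, tags TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO project_tags VALUES (?, ?)",
        # The tag set must hold at least one tag (n >= 1), so empty sets are skipped.
        [(name, serialize_tags(tags)) for name, tags in records if tags],
    )
    conn.commit()


def load_records(conn):
    """Read all records back as (project name, tag list) tuples."""
    rows = conn.execute("SELECT project, tags FROM project_tags ORDER BY project")
    return [(project, parse_tags(tags)) for project, tags in rows]
```

Using `sqlite3.connect(":memory:")` as the connection is enough to try the round trip without creating a file.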
Step 102: For each record in local database A, pair the tags in the tag set with one another in every two-tag combination. For example, if the tag set of a project is <tag1><tag2><tag3>, the tag pairs formed are <tag1, tag2>, <tag1, tag3>, and <tag2, tag3>. Several tag pairs are formed in this way, and they are undirected. Then, over the tag pairs computed from all records, count and record the number of occurrences of each tag pair, forming <tag pair, frequency> relations. At the same time, count the number of occurrences of every tag, which serves as the measure of tag generality.
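The pairing and counting of step 102 can be sketched in a few lines. The function name is hypothetical, and two details are assumptions the text does not state: each undirected pair is keyed as an alphabetically sorted tuple, and a tag repeated within one project is counted once.

```python
from collections import Counter
from itertools import combinations


def count_pairs_and_tags(records):
    """Return (pair frequencies, tag frequencies) over all project records.

    `records` is an iterable of (project name, tag list) tuples.
    """
    pair_freq = Counter()
    tag_freq = Counter()
    for _project, tags in records:
        # Sorting gives every undirected pair one canonical key, e.g. (tag1, tag2).
        unique_tags = sorted(set(tags))
        tag_freq.update(unique_tags)
        pair_freq.update(combinations(unique_tags, 2))
    return pair_freq, tag_freq
```

For a project tagged <tag1><tag2><tag3> this produces exactly the three pairs named in the example above; the tag frequencies then serve as the generality measure.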
Step 103: Sort the <tag pair, frequency> relations in descending order of frequency and take the tag pairs in the first N relations (N is greater than or equal to 1 and determines the size of the hierarchy) as the edges of the hierarchy. Then add a direction to each selected edge according to tag generality: the edge points from the tag with the larger generality to the tag with the smaller generality.
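Edge selection and orientation in step 103 might look like the sketch below. The function name is hypothetical. The text does not say how to break ties, so two assumptions are made and marked in the comments: pairs with equal frequency are ordered alphabetically, and a pair whose tags have equal generality keeps its stored order.

```python
def select_directed_edges(pair_freq, tag_freq, n):
    """Take the n most frequent tag pairs and orient each one.

    `pair_freq` maps (tag_a, tag_b) tuples to frequencies and `tag_freq`
    maps tags to frequencies (the generality measure).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Descending frequency; alphabetical order of the pair is an assumed tie-break.
    ranked = sorted(pair_freq.items(), key=lambda item: (-item[1], item[0]))
    edges = []
    for (a, b), _freq in ranked[:n]:
        # Point from the more general tag to the less general one.
        # On equal generality the stored order is kept (assumption).
        if tag_freq[b] > tag_freq[a]:
            a, b = b, a
        edges.append((a, b))
    return edges
```

A larger `n` admits weaker co-occurrence pairs and therefore yields a larger, noisier hierarchy, which is why the text describes N as determining the size of the structure.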
Step 104: The directed edges formed above can be regarded as the edge representation of several connected graphs, and connecting these edges to one another forms those graphs. For example, the five edges A->B, A->C, B->C, B->D, and E->F yield the structure shown in Fig. 2. All of the resulting graphs are retained, and together they constitute the preliminary software tag classification hierarchy. The subsequent optimization may connect some of these graphs to one another.
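To make the grouping in step 104 concrete, the sketch below recovers the connected graphs from the five example edges. It treats edges as undirected when grouping (weak connectivity), which is an interpretation of "connected graph" here; the function name is hypothetical.

```python
from collections import defaultdict


def weakly_connected_components(edges):
    """Group the nodes of a directed edge list into connected graphs."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        # Direction is ignored for grouping purposes.
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    seen = set()
    components = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# The five edges of the Fig. 2 example.
fig2_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("E", "F")]
```

On the example edges this yields two graphs, one containing A, B, C, and D and one containing E and F; both are kept in the preliminary hierarchy.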
Step 105: Optimize the preliminary software tag classification hierarchy according to an existing tag classification hierarchy, as shown in Fig. 3.
Step 105.1: Crawl the pages that contain the classification hierarchy in the same open source community (for example Sourceforge) with a crawler, usually starting from a page that contains the top-level nodes and proceeding level by level through the pages that contain the lower-level node structure. Then extract the classification hierarchy information with a general extraction tool such as XPath. In its original form this classification information is embedded in links such as http://sourceforge.net/directory/system-administration/distributed-computing/, in which System Administration is the parent node of Distributed Computing. The crawler follows such links in the pages iteratively, starting from the top level, and the classification hierarchy is then extracted from them. The hierarchy is formatted as directed edges of the form start->end, for example Communications->Email, and stored in local database B. The upper-level and lower-level nodes can be determined at extraction time; the upper-level node is taken as the start and the lower-level node as the end of each directed edge.
Step 105.2: Check each directed edge in the preliminarily constructed classification hierarchy. If both vertices of the edge appear in database B, and the direction of the path that connects the two vertices through several edges in database B is the same as the direction of the edge in the preliminary classification hierarchy, then check whether the other nodes on that connecting path appear in the preliminary classification hierarchy. If some do, all paths formed by those nodes together with the two vertices are added to the preliminary tag hierarchy. For example, suppose the edge is A->D, both A and D appear in database B, and database B contains the path A->B->C->D connecting A and D, where B appears in the preliminary classification hierarchy and C does not. Then A->B and B->D are added to the preliminary classification hierarchy.
If the direction of the path connecting the two vertices in database B differs from the direction in the preliminary classification hierarchy, the edge is deleted from the preliminary tag hierarchy. If at least one of the two vertices of the edge does not appear in database B, the edge is left unprocessed and retained in the current preliminary tag hierarchy.
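One possible reading of step 105.2 is sketched below; it is an interpretation under stated assumptions rather than the patent's definitive procedure. Function names are hypothetical. Where database B offers several paths, a shortest one found by breadth-first search is used. The checked edge itself is kept when the directions agree, and an edge whose vertices are both in database B but are not connected in either direction is kept as well; the text does not address these cases.

```python
from collections import defaultdict, deque


def find_path(adjacency, start, goal):
    """Breadth-first search; returns the node list from start to goal, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(adjacency.get(node, ())):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def optimize_hierarchy(preliminary_edges, reference_edges):
    """Refine the preliminary edges using the existing hierarchy (database B)."""
    adjacency = defaultdict(set)
    reference_nodes = set()
    for src, dst in reference_edges:
        adjacency[src].add(dst)
        reference_nodes.update((src, dst))
    preliminary_nodes = {node for edge in preliminary_edges for node in edge}
    result = set(preliminary_edges)
    for u, v in preliminary_edges:
        if u not in reference_nodes or v not in reference_nodes:
            continue  # a vertex is missing from database B: keep the edge as is
        forward = find_path(adjacency, u, v)
        if forward is not None:
            # Same direction: keep only path nodes known to the preliminary
            # hierarchy and link them in path order, e.g. A->B and B->D.
            kept = [node for node in forward if node in preliminary_nodes]
            result.update(zip(kept, kept[1:]))
        elif find_path(adjacency, v, u) is not None:
            result.discard((u, v))  # opposite direction in database B: delete
    return sorted(result)
```

With the preliminary edges A->D and A->B and the reference path A->B->C->D, the sketch adds B->D and leaves C out, as in the example above.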
Step 106: Periodically check the project information of the major open source communities for updates. If the tags of a project have been updated, crawl and extract that page, update local database A, and execute step 102 and the subsequent steps. If there is no update, sleep and wait for the next check. When the process is stopped manually, step 107 is executed and construction ends.
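The check-and-sleep cycle of step 106 can be sketched as a polling loop. All names here are hypothetical: `fetch_current` stands in for crawling and extraction, `rebuild` for re-running step 102 onward, and `should_stop` for the manual stop signal; a plain dictionary plays the role of local database A.

```python
import time


def refresh_once(stored, fetch_current):
    """Update `stored` (project -> tag list) in place; return the changed projects."""
    changed = []
    for project, tags in fetch_current().items():
        # Tag order is ignored when deciding whether a project was updated.
        if set(stored.get(project, ())) != set(tags):
            stored[project] = list(tags)
            changed.append(project)
    return sorted(changed)


def run_update_loop(stored, fetch_current, rebuild, interval_seconds,
                    should_stop, sleep=time.sleep):
    """Poll for tag updates until `should_stop()` returns True."""
    while not should_stop():
        if refresh_once(stored, fetch_current):
            rebuild(stored)          # re-run step 102 and the subsequent steps
        else:
            sleep(interval_seconds)  # nothing changed: wait for the next check
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.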
The method automates the acquisition of open source software information from the Internet, replacing a tedious and repetitive manual acquisition process, so that the classification hierarchy is updated automatically, and it makes full use of existing classification hierarchies to optimize the result.
Fig. 3 shows the flowchart of the optimization of the preliminary classification hierarchy according to the present invention, which specifically executes the following steps:
Step 201: Crawl an open source community that has a classification hierarchy, obtain the hierarchy, format it as <start, end> pairs, and store it in local database B;
Step 202: Loop over each edge in the classification hierarchy preliminarily constructed by the present invention;
Step 203: Check whether the edge appears in local database B. If it does, execute step 204; otherwise, retain the edge and proceed to the next iteration of the loop;
Step 204: Check whether the direction of the edge is consistent with the direction in local database B. If it is, add to the preliminary software tag hierarchy the paths corresponding to those tags that appear on the path and also appear in local database A; otherwise, delete the edge from the preliminary software tag hierarchy;
Step 205: When the loop finishes, end the optimization process.
With the scheme of the present invention, a tag hierarchy can be constructed automatically from the tag attributes of open source software, and the hierarchy can be used to organize the large volume of open source resources in open source communities hierarchically, thereby improving the accuracy and efficiency of locating open source software.
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that the technical solution of the present invention may be modified or replaced with equivalents without departing from its spirit and scope.
Claims (2)
1. A method for automatically constructing a tag hierarchy for open source software, comprising the following steps:
Step 101: crawling the project information pages of open source communities using general crawler technology, and extracting the project tag information with an existing extraction tool, wherein each project page yields a record with the fields <project name, tag set>, the tag set can be stored in the format <tag 1><tag 2>...<tag n> with n greater than or equal to 1, and the resulting series of project tag records is stored in local database A;
Step 102: for each record in local database A, pairing the project tags in the tag set with one another in every two-tag combination to form a number of undirected tag pairs, then, over the tag pairs computed from all records, counting and recording the number of occurrences of each tag pair to form <tag pair, frequency> relations, and counting the number of occurrences of every tag as the measure of tag generality;
Step 103: sorting the <tag pair, frequency> relations in descending order of frequency, taking the tag pairs in the first N relations as the edges of the hierarchy, and then adding a direction to each selected edge according to tag generality, the edge pointing from the tag with the larger generality to the tag with the smaller generality, wherein N is greater than or equal to 1 and determines the size of the hierarchy;
Step 104: the directed edges so formed being the edge representation of several connected graphs, connecting these edges to one another to form the connected graphs, and retaining all of the resulting graphs, thereby constructing the preliminary software tag classification hierarchy;
Step 105: optimizing the preliminary software tag classification hierarchy according to an existing tag classification hierarchy;
Step 106: periodically checking the project information of the major open source communities for updates; if the tags of a project have been updated, crawling and extracting that page, updating local database A, and executing step 102 and the subsequent steps; if there is no update, sleeping and waiting for the next check.
2. The method of claim 1, wherein step 105 further comprises:
Step 105.1: crawling the pages that contain the classification hierarchy in the same open source community using a general crawler, wherein the crawling starts from a page that contains the top-level nodes and proceeds level by level through the pages that contain the lower-level node structure, then extracting the classification hierarchy information with a general extraction tool, formatting it as directed edges of the form start->end, and storing them in local database B;
Step 105.2: checking each directed edge in the preliminarily constructed classification hierarchy; if both vertices of the edge appear in database B, and the direction of the path that connects the two vertices through several edges in database B is the same as the direction of the edge in the preliminary classification hierarchy, then checking whether the other nodes on that connecting path appear in the preliminary classification hierarchy, and, if some do, adding all paths formed by those nodes together with the two vertices to the preliminary tag hierarchy; if the direction of the path connecting the two vertices in database B differs from the direction in the preliminary classification hierarchy, deleting the edge from the preliminary tag hierarchy; if at least one of the two vertices of the edge does not appear in database B, leaving the edge unprocessed and retaining it in the current preliminary tag hierarchy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510617001.5A CN105389327B (en) | 2015-09-21 | Method for automatically constructing a tag hierarchy for large-scale open source software |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510617001.5A CN105389327B (en) | 2015-09-21 | Method for automatically constructing a tag hierarchy for large-scale open source software |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105389327A CN105389327A (en) | 2016-03-09 |
CN105389327B CN105389327B (en) | 2019-07-16 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101799814A (en) * | 2009-12-31 | 2010-08-11 | 茂名学院 | Method for gathering free classification label into reticular classification structure |
CN102760149A (en) * | 2012-04-05 | 2012-10-31 | 中国人民解放军国防科学技术大学 | Automatic annotating method for subjects of open source software |
CN104199857A (en) * | 2014-08-14 | 2014-12-10 | 西安交通大学 | Tax document hierarchical classification method based on multi-tag classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |