CN110569061A

CN110569061A - Automatic construction system of software engineering knowledge base based on big data

Info

Publication number: CN110569061A
Application number: CN201910904299.6A
Authority: CN
Inventors: 贾凌杉
Original assignee: Hebei Institute Of Environmental Engineering
Current assignee: Hebei Institute Of Environmental Engineering
Priority date: 2019-09-24
Filing date: 2019-09-24
Publication date: 2019-12-13

Abstract

The invention discloses an automatic construction system for a big-data-based software engineering knowledge base. The system collects target data through multiple ports according to a target data collection rule, uses a redundancy function to remove redundant content from the target data, standardizes the data based on MapReduce, builds the data association relations based on the data collection rule, and marks related data with hyperlinks so that a user can view the associated data by clicking a hyperlink mark; each piece of associated data is marked with its association relation to the source file. The invention realizes the automatic construction of the software engineering knowledge base and of the data association relations, greatly improves the quality of the resulting knowledge base, and facilitates later use of the data.

Description

Automatic construction system of software engineering knowledge base based on big data

Technical Field

The invention relates to the field of software engineering, in particular to an automatic construction system of a software engineering knowledge base based on big data.

Background

With the Semantic Web as the main direction of future development, it is important at the present stage to construct Web information that computers can understand and process. A knowledge base is a knowledge set composed of concepts, entities and relations, and it has growing application and industrial value in rapidly developing areas such as information retrieval and knowledge question answering. The software engineering domain knowledge base is an important branch of knowledge bases and plays a role that is difficult to replace. The quality of a knowledge base in the software engineering field therefore largely determines the quality and effect of research, so constructing a high-quality, large-scale knowledge base in this field is of great significance. Existing software engineering knowledge bases suffer from a complex automatic construction process, low working efficiency, sparse relations in the constructed knowledge base, and low construction quality.

Disclosure of Invention

The invention aims to provide an automatic construction system for a big-data-based software engineering knowledge base, which realizes the automatic construction of the software engineering knowledge base and the automatic establishment of the data association relations, greatly improves the quality of the resulting knowledge base, and facilitates later use of the data.

In order to achieve the purpose, the invention adopts the technical scheme that:

The automatic construction system of the software engineering knowledge base based on big data comprises:

The target data acquisition rule generating module is used for generating a corresponding data acquisition rule according to a data acquisition standard input by the man-machine operation module;

The target data acquisition module acquires target data through multiple ports based on the data acquisition rule and sends the acquired data to the data standardization module;

The data standardization module is used for clearing redundant content from the target data, standardizing the data based on MapReduce, completing the construction of the data association relations based on the data acquisition rule, and marking related data with hyperlinks so that a user can view the associated data by clicking the hyperlink mark, wherein each piece of associated data is marked with its association relation to the source file;

The data positioning module is used for finding a suitable position in the database for the standardized data, finding similar data points for the data, and establishing relations between the data and those similar data points.
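
The source states only that standardization is "based on MapReduce" and gives no job definition. As a minimal sketch of the map, shuffle and reduce phases in plain Python (no Hadoop), assume each raw record is a dictionary with the hypothetical fields `term` and `source`; the normalization rule (lower-casing and collapsing whitespace) is likewise an assumption chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]  # hypothetical record shape: {"term": ..., "source": ...}


def map_phase(record: Record) -> Tuple[str, str]:
    """Emit (normalized term, source) for one raw record."""
    # Assumed normalization: lower-case and collapse runs of whitespace.
    key = " ".join(record["term"].lower().split())
    return key, record["source"]


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group mapped values by key, as a framework does between map and reduce."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, sources: List[str]) -> Dict[str, object]:
    """Merge all records sharing a key into one standardized entry."""
    return {"term": key, "sources": sorted(set(sources))}


def standardize(records: Iterable[Record]) -> List[Dict[str, object]]:
    """Run the three phases in sequence on an in-memory collection."""
    groups = shuffle(map_phase(r) for r in records)
    return [reduce_phase(k, v) for k, v in sorted(groups.items())]
```

Records that differ only in letter case or spacing end up under the same key, so the reduce phase can merge them while keeping every source they came from.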

Further, the data acquisition rule at least comprises word stem co-occurrence degree, asymmetric common string similarity, anchor link co-occurrence degree based on the wiki structure, structure information similarity based on the wiki structure, and topic distribution similarity based on KL divergence.

Further, the data acquisition standard is input by checking options in a questionnaire.

Further, the redundant content is cleared by using a redundancy function. Specifically, in the redundancy function, the knowledge elements in k1 and k2 are taken out of e1 and e2 respectively; X, Y and the relation R in e1 and e2 are then taken out and compared with xe1, xe2, ye1 and ye2 respectively; element items with the same content are deleted, the original relation R value is retained, and the relation is merged with the undeleted items.
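
The redundancy function is described only in outline, so the following is one possible reading rather than the patented procedure. It assumes each knowledge element is a triple (X, R, Y), that two items have "the same content" when their X and Y agree, and that the relation R of the first-seen item is the "original" one to retain. The `Element` type and the function name are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Element(NamedTuple):
    x: str  # term X
    r: str  # relation R
    y: str  # term Y


def merge_without_redundancy(e1: List[Element], e2: List[Element]) -> List[Element]:
    """Merge e2 into e1, dropping items whose (X, Y) content already exists.

    The relation R of the first-seen item is kept, which is one reading of
    "the original relation R value is retained".
    """
    merged: Dict[Tuple[str, str], Element] = {}
    for item in list(e1) + list(e2):
        key = (item.x, item.y)
        if key not in merged:  # a later duplicate is discarded
            merged[key] = item
    return list(merged.values())
```

Because Python dictionaries preserve insertion order, the merged list keeps the items of e1 first, followed by the items of e2 that were not duplicates.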

Further, the data positioning module realizes data positioning based on a facet technology, and accurately positions data by calculating a facet distance between different data terms; when the data is positioned, corresponding terms are selected under the constraint of the known facets, so that the description of the required data is completed, and if the selection is successful, the corresponding data is returned; if the selection is unsuccessful, the system will calculate the similarity of terms from the synonym dictionary and the conceptual distance map, forming new positioning information.

Further, the data standardization module is also used for marking a source identifier on each piece of target data, and a user can open the link where the source data is located by clicking the identifier.

Further, the hyperlink mark and the source identifier use different marker symbols.

Further, a blockchain is used for caching and security audit of the data, and all data must pass the security audit before entering the system.

Further, data features are extracted based on a deep convolution model, and the obtained features are then input into a BP neural network model to perform the data security audit.

The invention has the following beneficial effects:

The automatic construction of the software engineering knowledge base and of the data association relations is realized, the quality of the resulting knowledge base is greatly improved, and later use of the data is facilitated.

Drawings

FIG. 1 is a system block diagram of an automated building system for a big data-based software engineering knowledge base according to an embodiment of the present invention.

Detailed Description

In order that the objects and advantages of the invention will be more clearly understood, the invention is further described in detail below with reference to examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in FIG. 1, an automatic construction system for a big-data-based software engineering knowledge base according to an embodiment of the present invention includes:

The target data acquisition module is used for acquiring target data through multiple ports based on the data acquisition rule, and sends the acquired data to the data standardization module after the data has been cached and audited by the blockchain;

The blockchain is used for caching and security audit of the data. All data must pass the security audit before entering the system; during the audit, data features are first extracted based on a deep convolution model, and the obtained features are then input into a BP neural network model to perform the data security audit;

The data standardization module is used for clearing redundant content from the target data, standardizing the data based on MapReduce, and marking a source identifier on each piece of target data so that a user can open the link where the source data is located by clicking the identifier. It also completes the construction of the data association relations based on the data acquisition rule and marks related data with hyperlinks, so that the user can view the associated data by clicking the hyperlink mark; each piece of associated data is marked with its association relation to the source file. The hyperlink mark and the source identifier use different marker symbols.

In this embodiment, the data acquisition rule at least includes word stem co-occurrence degree, asymmetric common string similarity, anchor link co-occurrence degree based on the wiki structure, structure information similarity based on the wiki structure, and topic distribution similarity based on KL divergence.

The word stem co-occurrence degree is obtained by performing a relevance calculation on the word senses of the concepts: the word stem (that is, the head word) of each concept is extracted with the Stanford Parser tool, and the co-occurrence degree of the stems is calculated.

The asymmetric common string similarity is the similarity value of the common word strings between concepts. Because the hypernym-hyponym relation is asymmetric (when concept A is the hypernym of concept B, concept B is not necessarily the hypernym of concept A), this feature effectively avoids interference from concepts that are closely related but not in a hypernym-hyponym relation.

The anchor link co-occurrence degree based on the wiki structure is as follows. Each concept corresponds to a Wikipedia page, so the structures in that page, and the text within them, reflect the information and meaning the concept refers to. The co-occurrence similarity of each structure of the concepts' Wikipedia pages is analyzed with NGD (Normalized Google Distance). Because the anchor link sets in the abstract (Abstract), text (Text) and category (Category) structures of a Wikipedia page reflect the meaning of a concept well, NGD is calculated on each of these three structures to obtain three different feature values. In addition, the Category structure clearly represents the hypernym-hyponym relation: if concept A is included in the anchor link set of the Category of concept B, or concept B is included in the anchor link set of the Category of concept A, the Category-structure NGD value is set equal to the current calculation result plus an additional coefficient V. In this embodiment V is set to 0.05 according to the value range of NGD.

The structure information similarity based on the wiki structure is as follows. Wikipedia provides two wiki structures for each concept, an outline (guideline) and an information box (infobox), both of which express the main information of the concept through keywords. The outline mainly describes the aspects covered by the concept's wiki page, and the infobox mainly describes the characteristics and attributes of the concept. Closely related software engineering concepts often have similar outline and infobox structures, so the similarity of the information described by these structures is calculated with the Jaccard coefficient. In this embodiment the structure information similarity is calculated twice, once for the outline and once for the infobox.

The topic distribution similarity based on KL divergence is as follows. Some software engineering concepts that stand in a hypernym-hyponym relation do not have a complete wiki structure. To mine the hypernym-hyponym relations of such structurally incomplete concepts, this embodiment calculates the degree of association between concepts through the KL divergence. First, the topic distribution of the software engineering concepts is modeled with LDA (Latent Dirichlet Allocation). When the relation between any two concepts is judged, the probability distribution of each concept over the topics is first calculated from the topic model, and the KL divergence is then used to calculate the topic distribution similarity between the two concepts.

The propagation relationship comprises the synonymy relation, the hypernym-hyponym relation and the association relation; a propagated label is obtained when any one of these relations is satisfied. The synonymy relation is determined as follows: when the concept to be determined appears in the Redirect structure of the current concept, or the current concept appears in the Redirect structure of the concept to be determined, the two are judged to be synonymous concepts. The hypernym-hyponym relation is determined as follows: when the concept to be determined appears in the Category structure of the current concept, or the current concept appears in the Category structure of the concept to be determined, the two are judged to be in a hypernym-hyponym relation. The association relation is determined by the Normalized Google Distance (NGD): when the NGD value reaches the set standard, the relation is judged to be an association relation.
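
The NGD, Jaccard and KL-divergence measures named in this embodiment can be sketched with their standard textbook definitions. The source gives no formulas, so the function names, the argument layout (raw occurrence counts for NGD, sets of keywords for Jaccard, topic probability vectors for KL) and the smoothing constant `eps` are assumptions; only the value 0.05 for the additional coefficient V comes from the text.

```python
import math
from typing import Sequence, Set

ADDITIONAL_COEFFICIENT_V = 0.05  # value stated in the embodiment


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google Distance from occurrence counts.

    fx, fy: number of pages containing each concept (both > 0);
    fxy: pages containing both; n: total pages, larger than min(fx, fy).
    Returns infinity when the two concepts never co-occur.
    """
    if fxy == 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


def category_ngd(fx: int, fy: int, fxy: int, n: int,
                 in_category_links: bool) -> float:
    """Category-structure NGD; V is added when one concept appears in the
    Category anchor link set of the other."""
    value = ngd(fx, fy, fxy, n)
    return value + ADDITIONAL_COEFFICIENT_V if in_category_links else value


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient of two keyword sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """D_KL(P || Q) over two topic distributions of equal length.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps
    (an assumed smoothing) to avoid division by zero.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)
```

Note that KL divergence is asymmetric and is a distance-like quantity: a smaller value means the two topic distributions are more alike, so a similarity score would have to be derived from it, for example by negation or by comparing against a threshold.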

In this embodiment, the data acquisition standard is input by checking options in a questionnaire. When a user needs to construct a database, the user clicks a "data construction" button, the system presents the data acquisition standard questionnaire in a pop-up dialog, and the user inputs the data acquisition standard by checking the options. The data acquisition standard at least includes a data type, a data keyword, and a data source.
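
How a checked questionnaire might be turned into an acquisition rule can be sketched as below. The three fields (data type, data keyword, data source) come from the text; the class name, the function name and the keys of the returned rule are hypothetical, and the source does not say what form the generated rule actually takes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AcquisitionStandard:
    """Options the user checks in the questionnaire dialog."""
    data_types: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    sources: List[str] = field(default_factory=list)


def generate_rule(standard: AcquisitionStandard) -> Dict[str, object]:
    """Turn a checked questionnaire into a rule for the acquisition module."""
    if not (standard.data_types and standard.keywords and standard.sources):
        raise ValueError("data type, keyword and source must each have a selection")
    return {
        "accept_types": set(standard.data_types),
        "match_any_keyword": [k.lower() for k in standard.keywords],
        "crawl_sources": list(standard.sources),
    }
```

Rejecting an incomplete questionnaire up front reflects the statement that the standard includes at least these three items.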

In this embodiment, the redundant content is cleared by using a redundancy function. Specifically, in the redundancy function, the knowledge elements in k1 and k2 are taken out of e1 and e2 respectively; X, Y and the relation R in e1 and e2 are then taken out and compared with xe1, xe2, ye1 and ye2 respectively; element items with the same content are deleted, the original relation R value is retained, and the relation is merged with the undeleted items.

The data positioning module realizes data positioning based on a facet technology, and accurately positions data by calculating the facet distance between different data terms. When data is positioned, corresponding terms are selected under the constraint of the known facets to complete the description of the required data. If the selection is successful, the corresponding data is returned; if the selection is unsuccessful, the system calculates the similarity of terms from the synonym dictionary and the conceptual distance map to form new positioning information.
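
The facet-based positioning flow (select under known facets, return on success, fall back on failure) can be sketched as follows. The source does not define the facet distance or the conceptual distance map, so this sketch assumes the distance is the number of facets on which two descriptions disagree, and it uses a nearest-entry lookup as a stand-in for the conceptual distance map. All names are hypothetical.

```python
from typing import Dict, Optional

Facets = Dict[str, str]  # facet name -> term, e.g. {"language": "java"}


def facet_distance(a: Facets, b: Facets) -> int:
    """Count the facets on which two descriptions disagree (assumed metric)."""
    return sum(1 for name in set(a) | set(b) if a.get(name) != b.get(name))


def locate(query: Facets, index: Dict[str, Facets],
           synonyms: Dict[str, str]) -> Optional[str]:
    """Return the id of the entry described by the query, or None if the
    index is empty and nothing matches."""

    def exact(q: Facets) -> Optional[str]:
        # Selection succeeds when an entry satisfies every known facet.
        for entry_id, facets in index.items():
            if all(facets.get(k) == v for k, v in q.items()):
                return entry_id
        return None

    hit = exact(query)
    if hit is not None:
        return hit
    # Selection failed: rewrite terms through the synonym dictionary,
    # then fall back to the nearest entry by facet distance.
    rewritten = {k: synonyms.get(v, v) for k, v in query.items()}
    hit = exact(rewritten)
    if hit is not None or not index:
        return hit
    return min(index, key=lambda entry_id: facet_distance(rewritten, index[entry_id]))
```

With an index such as `{"d1": {"language": "java", "topic": "testing"}, "d2": {"language": "python", "topic": "build"}}`, a query for `{"language": "py"}` fails the first selection and succeeds after the synonym `"py" -> "python"` is applied.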

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims

1. The automatic construction system of the software engineering knowledge base based on big data is characterized in that: the method comprises the following steps:

2. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the data acquisition rule at least comprises word stem co-occurrence degree, asymmetric common string similarity, anchor link co-occurrence degree based on the wiki structure, structure information similarity based on the wiki structure, and topic distribution similarity based on KL divergence.

3. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the data acquisition standard is input by checking options in a questionnaire.

4. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: specifically, in the redundancy function, knowledge elements in k1 and k2 are taken out of e1 and e2 respectively, then X, Y and a relation R in e1 and e2 are taken out and compared with xe1, xe2, ye1 and ye2 respectively, element items with the same content are deleted, the original relation R value is reserved, and the relation and the undeleted items are merged.

5. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the data positioning module realizes data positioning based on a facet technology, and accurately positions data by calculating a facet distance between different data terms; when the data is positioned, corresponding terms are selected under the constraint of the known facets, so that the description of the required data is completed, and if the selection is successful, the corresponding data is returned; if the selection is unsuccessful, the system will calculate the similarity of terms from the synonym dictionary and the conceptual distance map, forming new positioning information.

6. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the data standardization module is also used for marking a source identifier for each target data, and the user can realize the access of the link where the source data is located by clicking the identifier.

7. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the hyperlink mark and the source identifier use different marker symbols.

8. The automated big-data-based software engineering knowledge base building system according to claim 1, further comprising: a blockchain used for caching and security audit of the data, wherein all data must pass the security audit before entering the system.

9. The automated big-data-based software engineering knowledge base building system of claim 8, wherein: firstly, data features are extracted based on a deep convolution model, and then the obtained data features are input into a BP neural network model to realize data security audit.