CN110569061A - Automatic construction system of software engineering knowledge base based on big data - Google Patents

Info

Publication number
CN110569061A
Authority
CN
China
Prior art keywords
data
knowledge base
software engineering
engineering knowledge
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910904299.6A
Other languages
Chinese (zh)
Inventor
贾凌杉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Institute Of Environmental Engineering
Original Assignee
Hebei Institute Of Environmental Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei Institute Of Environmental Engineering
Priority to CN201910904299.6A
Publication of CN110569061A
Legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/70: Software maintenance or management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic construction system for a software engineering knowledge base based on big data. The system collects target data through multiple ports according to a target data collection rule, uses a redundancy function to remove redundant content from the target data, standardizes the data based on MapReduce, builds data association relations based on the data collection rule, and marks related data with hyperlinks so that a user can view the associated data by clicking a hyperlink mark; each piece of associated data is also marked with its association relation to the source file. The invention automates both the construction of the software engineering knowledge base and the construction of data association relations, greatly improves the quality of the resulting knowledge base, and facilitates later use of the data.

Description

Automatic construction system of software engineering knowledge base based on big data
Technical Field
The invention relates to the field of software engineering, in particular to an automatic construction system of a software engineering knowledge base based on big data.
Background
In today's society, with the Semantic Web as the main direction of future development, constructing Web information that computers can understand and process is very important. A knowledge base (Knowledge Base) is a knowledge set composed of concepts, entities and relations, and it has growing application and industrial value in rapidly developing areas such as information retrieval and knowledge question answering. The software engineering domain knowledge base is an important branch of knowledge bases and plays a role that is difficult to replace. The quality of a software engineering knowledge base therefore largely determines the quality and effect of the research that relies on it, so constructing a high-quality, large-scale knowledge base in the software engineering field is of great significance. Existing approaches to automatically constructing software engineering knowledge bases suffer from a complex construction process, low working efficiency, sparse relations in the resulting knowledge base, and low construction quality.
Disclosure of Invention
The invention aims to provide an automatic construction system for a software engineering knowledge base based on big data, which automates the construction of the software engineering knowledge base and the establishment of data association relations, greatly improves the quality of the resulting knowledge base, and facilitates later use of the data.
In order to achieve the purpose, the invention adopts the technical scheme that:
The automatic construction system of the software engineering knowledge base based on big data comprises:
The target data acquisition rule generating module is used for generating a corresponding data acquisition rule according to a data acquisition standard input by the man-machine operation module;
The target data acquisition module acquires target data through multiple ports based on the data acquisition rule and sends the acquired data to the data standardization module;
The data standardization module is used for clearing redundant content from the target data, standardizing the data based on MapReduce, building data association relations based on the data acquisition rule, and marking related data with hyperlinks so that a user can view the associated data by clicking a hyperlink mark, wherein each piece of associated data is marked with its association relation to the source file;
And the data positioning module is used for finding a suitable position in the database for the standardized data, finding similar data points for the data, and establishing relations between the data and the similar data points.
Further, the data acquisition rule comprises at least word stem co-occurrence degree, asymmetric common word string similarity, anchor link co-occurrence degree based on the wiki structure, structure information similarity based on the wiki structure, and topic distribution similarity based on KL divergence.
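As a rough illustration of one of these features: the detailed description computes the structure information similarity with the Jaccard coefficient over the keywords of two wiki structures. The sketch below assumes each structure has already been reduced to a set of keywords; the function name `jaccard_similarity` and the sample keywords are hypothetical and do not come from the patent.

```python
def jaccard_similarity(keywords_a, keywords_b):
    """Jaccard coefficient |A & B| / |A | B| of two keyword collections."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        # Assumption of this sketch: two empty structures give no evidence of similarity.
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical infobox keywords for two software engineering concepts.
infobox_unit_testing = {"type", "purpose", "tools", "level"}
infobox_integration_testing = {"type", "purpose", "tools", "scope"}

# Three shared keywords out of five distinct ones.
print(jaccard_similarity(infobox_unit_testing, infobox_integration_testing))  # 0.6
```

In the embodiment this calculation would be run twice per concept pair, once on the outline keywords and once on the information box keywords.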
Further, the data acquisition standard is input by checking options in a questionnaire.
Further, the redundant content is cleared by a redundancy function. Specifically, in the redundancy function, the knowledge elements in k1 and k2 are taken out of e1 and e2 respectively; X, Y and the relation R in e1 and e2 are then taken out and compared with xe1, xe2, ye1 and ye2 respectively; element items with the same content are deleted, the original relation R value is retained, and the relation is merged with the undeleted items.
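The patent does not spell out the redundancy function in code, so the following is only one possible reading of it, under stated assumptions: each knowledge element is treated as an (X, R, Y) triple, two elements count as "the same content" when their X and Y match, and the first occurrence is kept so that its original relation R survives. The names `KnowledgeElement` and `merge_without_redundancy` are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class KnowledgeElement(NamedTuple):
    x: str         # head concept (X)
    relation: str  # relation R between X and Y
    y: str         # tail concept (Y)


def merge_without_redundancy(
    e1: Iterable[KnowledgeElement], e2: Iterable[KnowledgeElement]
) -> List[KnowledgeElement]:
    """Merge two element lists, dropping items whose (X, Y) content repeats.

    The first occurrence wins, so its original relation R is retained.
    """
    seen = set()
    merged = []
    for element in list(e1) + list(e2):
        key = (element.x, element.y)
        if key in seen:
            continue  # same content already kept: drop the redundant item
        seen.add(key)
        merged.append(element)
    return merged


# Hypothetical knowledge elements extracted from two sources k1 and k2.
k1 = [
    KnowledgeElement("JUnit", "is-a", "testing framework"),
    KnowledgeElement("Git", "is-a", "version control system"),
]
k2 = [
    KnowledgeElement("JUnit", "related-to", "testing framework"),  # redundant X and Y
    KnowledgeElement("Maven", "is-a", "build tool"),
]
merged = merge_without_redundancy(k1, k2)
for element in merged:
    print(element)
```

Keeping the first occurrence is a design choice of the sketch; a real system might instead prefer the relation from the more trusted source.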
Further, the data positioning module realizes data positioning based on a facet technology, and accurately positions data by calculating a facet distance between different data terms; when the data is positioned, corresponding terms are selected under the constraint of the known facets, so that the description of the required data is completed, and if the selection is successful, the corresponding data is returned; if the selection is unsuccessful, the system will calculate the similarity of terms from the synonym dictionary and the conceptual distance map, forming new positioning information.
Further, the data standardization module is also used for marking each piece of target data with a source identifier, which the user can click to access the link where the source data is located.
Further, the hyperlink mark and the source identifier use different symbols.
Further, a blockchain is used for caching and security-auditing the data, and all data must pass the security audit before entering the system.
Further, data features are extracted with a deep convolution model, and the obtained data features are then input into a BP neural network model to carry out the data security audit.
The invention has the following beneficial effects:
The invention automates the construction of the software engineering knowledge base and of the data association relations, greatly improves the quality of the resulting knowledge base, and facilitates later use of the data.
Drawings
FIG. 1 is a system block diagram of an automated building system for a big data-based software engineering knowledge base according to an embodiment of the present invention.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the invention is further described in detail below with reference to examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, an automated building system of a big data-based software engineering knowledge base according to an embodiment of the present invention includes:
The target data acquisition rule generating module is used for generating a corresponding data acquisition rule according to a data acquisition standard input by the man-machine operation module;
The target data acquisition module is used for acquiring target data through multiple ports based on the data acquisition rule, and for sending the acquired data to the data standardization module after it has been cached and audited by the blockchain;
The blockchain is used for caching and security-auditing the data; all data must pass the security audit before entering the system. During the audit, data features are first extracted with a deep convolution model, and the obtained data features are then input into a BP neural network model to carry out the data security audit;
The data standardization module is used for clearing redundant content from the target data, standardizing the data based on MapReduce, and marking each piece of target data with a source identifier, which the user can click to access the link where the source data is located; it also builds data association relations based on the data acquisition rule and marks related data with hyperlinks, so that the user can view the associated data by clicking a hyperlink mark, and each piece of associated data is marked with its association relation to the source file; the hyperlink mark and the source identifier use different symbols;
And the data positioning module is used for finding a proper position in the database for the data subjected to the data standardization processing, finding similar data points for the data, and establishing a relationship between the data points and the similar data points.
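The MapReduce-based standardization performed by the data standardization module can be pictured with a small in-process emulation. This is a sketch under assumptions, not the patent's implementation: it runs in plain Python rather than on a MapReduce cluster, the record fields `title` and `source` are hypothetical, and "standardization" is reduced here to whitespace and case normalization with the source kept as the source identifier.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (normalized title, cleaned record)."""
    title = " ".join(record["title"].split()).lower()
    yield title, {"title": title, "source": record["source"].strip()}


def reduce_records(title, records):
    """Reduce step: collapse all records sharing a title into one entry.

    Every distinct source is kept so each entry can still point back to
    where its data came from.
    """
    sources = sorted({r["source"] for r in records})
    return {"title": title, "sources": sources}


def run_map_reduce(records):
    groups = defaultdict(list)  # shuffle phase: group mapped values by key
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return [reduce_records(key, values) for key, values in sorted(groups.items())]


# Hypothetical raw records collected through different ports.
raw = [
    {"title": "  Unit   Testing ", "source": "https://example.org/a "},
    {"title": "unit testing", "source": "https://example.org/b"},
    {"title": "Refactoring", "source": "https://example.org/c"},
]
for entry in run_map_reduce(raw):
    print(entry)
```

The two differently formatted "unit testing" records end up as a single entry that lists both sources.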
In this embodiment, the data acquisition rule comprises at least word stem co-occurrence degree, asymmetric common word string similarity, anchor link co-occurrence degree based on the wiki structure, structure information similarity based on the wiki structure, and topic distribution similarity based on KL divergence.
Word stem co-occurrence degree: relevance is calculated on the word senses of the concepts; the word stems (that is, the head words) of the concepts are extracted with the Stanford Parser tool, and the co-occurrence degree of the word stems is calculated.
Asymmetric common word string similarity: the similarity value of the common word strings between concepts is calculated. Because the hypernym-hyponym relation is asymmetric (when concept A is a hypernym of concept B, concept B is not necessarily a hypernym of concept A), this feature effectively avoids interference from concepts that are closely related but not in a hypernym-hyponym relation.
Anchor link co-occurrence degree based on the wiki structure: each concept corresponds to a page of Wikipedia, so the structures in the Wikipedia page and the text information in them reflect well the information and meaning the concept refers to. NGD (Normalized Google Distance) is used to analyze the co-occurrence similarity of each structure of the concepts' Wikipedia pages. Because the anchor link sets in the abstract (Abstract), text (Text) and category (Category) structures of a Wikipedia page reflect the meaning of a concept well, NGD is calculated on each of these three structures, giving three different feature values. In addition, because the category (Category) structure clearly represents hypernym-hyponym relations, if concept A is included in the anchor link set of the category (Category) of concept B, or concept B is included in the anchor link set of the category (Category) of concept A, the category-structure NGD value is taken to be the current calculation result plus an additional coefficient V; in this embodiment V is set to 0.05 according to the value range of NGD.
Structure information similarity based on the wiki structure: Wikipedia provides two wiki structures for each concept, an outline (guideline) and an information box (infobox), both of which express the main information of the concept through keywords. The outline (guideline) mainly describes the aspects covered by the current concept's wiki page, and the information box (infobox) mainly describes the characteristics and attributes of the current concept. Closely related software engineering concepts often have relatively similar outline (guideline) and information box (infobox) structures, so the similarity of the information described by these structures is calculated with the Jaccard coefficient. In this embodiment the structure information similarity is calculated twice, once for the outline and once for the information box.
Topic distribution similarity based on KL divergence: some software engineering concepts that are in a hypernym-hyponym relation do not have a complete wiki structure. To mine hypernym-hyponym relations among such structurally incomplete concepts, this embodiment calculates the degree of association between concepts through KL divergence. First, the topic distribution of the software engineering concepts is modeled with LDA (Latent Dirichlet Allocation). To judge the relation between any two concepts, the probability distribution of each concept over the different topics is first calculated from the topic distribution, and KL divergence is then used to calculate the topic distribution similarity between the two concepts.
The propagation relations comprise the synonymy relation, the hypernym-hyponym relation and the association relation; a propagated label can be obtained when any one of these relations is satisfied. The synonymy relation is determined as follows: when the concept to be determined appears in the Redirect structure of the current concept, or the current concept appears in the Redirect structure of the concept to be determined, the two are judged to be synonymous concepts. The hypernym-hyponym relation is determined as follows: when the concept to be determined appears in the Category structure of the current concept, or the current concept appears in the Category structure of the concept to be determined, the two are judged to be in a hypernym-hyponym relation. The association relation is determined by the Normalized Google Distance (NGD): when the NGD value reaches the set threshold, the relation is judged to be an association relation.
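The NGD and KL divergence features of this embodiment can be sketched as follows. The NGD formula is the standard one computed from occurrence counts; how the counts are obtained from the anchor link sets is not specified in the patent, so the count arguments here are assumptions, as are the function names and the small `eps` smoothing term in the KL divergence. Only the value V = 0.05 comes from the text.

```python
import math

V = 0.05  # additional coefficient for the Category structure, as set in the embodiment


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from occurrence counts.

    fx, fy: number of pages containing each concept; fxy: pages containing
    both; n: total number of pages considered.
    """
    if fxy == 0:
        return math.inf  # the concepts never co-occur
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    if denominator == 0:
        return 0.0  # both concepts occur on every page
    return (max(log_fx, log_fy) - log_fxy) / denominator


def category_ngd(fx, fy, fxy, n, linked_by_category):
    """Category-structure NGD, plus V when one concept appears in the
    Category anchor link set of the other."""
    value = ngd(fx, fy, fxy, n)
    return value + V if linked_by_category else value


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two topic distributions of equal length.

    eps guards against log(0); it is an assumption of this sketch.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


# Hypothetical counts and LDA topic distributions for two concepts.
print(category_ngd(fx=120, fy=80, fxy=40, n=10_000, linked_by_category=True))
print(kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```

KL divergence is asymmetric, so a system that needs a symmetric score could average KL(p || q) and KL(q || p); the patent does not say which variant it uses.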
In this embodiment, the data acquisition standard is input by checking options in a questionnaire. When a user needs to construct a database, the user clicks a "data construction" button, the system presents the data acquisition standard questionnaire in a pop-up dialog, and the user inputs the data acquisition standard by checking the options. The data acquisition standard comprises at least a data type, a data keyword, and a data source.
In this embodiment, the redundant content is cleared by a redundancy function. Specifically, in the redundancy function, the knowledge elements in k1 and k2 are taken out of e1 and e2 respectively; X, Y and the relation R in e1 and e2 are then taken out and compared with xe1, xe2, ye1 and ye2 respectively; element items with the same content are deleted, the original relation R value is retained, and the relation is merged with the undeleted items. The data positioning module implements data positioning based on facet technology and positions data precisely by calculating the facet distance between different data terms. When data is positioned, the corresponding terms are selected under the constraint of the known facets to complete the description of the required data; if the selection succeeds, the corresponding data is returned; if the selection fails, the system calculates the similarity of terms from the synonym dictionary and the conceptual distance map to form new positioning information.
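The facet-based positioning with its fallback can be sketched as below. Everything concrete here is an assumption made for illustration: the catalog entries, the facet names `type` and `topic`, the synonym dictionary, the conceptual distance map, and the rule that "new positioning information" means returning the entries with the smallest total facet distance.

```python
import math

# Hypothetical knowledge base entries described by facet terms.
CATALOG = [
    {"id": "kb-1", "facets": {"type": "tool", "topic": "unit testing"}},
    {"id": "kb-2", "facets": {"type": "article", "topic": "refactoring"}},
]
SYNONYMS = {"unit test": "unit testing"}                 # synonym dictionary
CONCEPT_DISTANCE = {("code cleanup", "refactoring"): 1}  # smaller means closer


def term_distance(a, b):
    """Distance between two terms via synonyms, then the concept distance map."""
    a, b = SYNONYMS.get(a, a), SYNONYMS.get(b, b)
    if a == b:
        return 0
    return CONCEPT_DISTANCE.get((a, b), CONCEPT_DISTANCE.get((b, a), math.inf))


def facet_distance(query, item):
    """Sum of term distances over the facets named in the query."""
    return sum(
        term_distance(term, item["facets"].get(facet, ""))
        for facet, term in query.items()
    )


def locate(query):
    """Return ids of exact facet matches, else the nearest entries by facet distance."""
    exact = [
        item["id"]
        for item in CATALOG
        if all(item["facets"].get(f) == t for f, t in query.items())
    ]
    if exact:
        return exact  # selection succeeded
    # Selection failed: fall back to term similarity to form a new result.
    scored = [(facet_distance(query, item), item["id"]) for item in CATALOG]
    best = min(distance for distance, _ in scored)
    if best == math.inf:
        return []
    return [item_id for distance, item_id in scored if distance == best]


print(locate({"type": "tool", "topic": "unit testing"}))  # exact match
print(locate({"topic": "unit test"}))                     # resolved via synonym
print(locate({"topic": "code cleanup"}))                  # resolved via concept distance
```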
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (9)

1. An automatic construction system for a software engineering knowledge base based on big data, characterized in that the system comprises:
The target data acquisition rule generating module is used for generating a corresponding data acquisition rule according to a data acquisition standard input by the man-machine operation module;
The target data acquisition module acquires target data through multiple ports based on the data acquisition rule and sends the acquired data to the data standardization module;
The data standardization module is used for clearing redundant content from the target data, standardizing the data based on MapReduce, building data association relations based on the data acquisition rule, and marking related data with hyperlinks so that a user can view the associated data by clicking a hyperlink mark, wherein each piece of associated data is marked with its association relation to the source file;
And the data positioning module is used for finding a suitable position in the database for the standardized data, finding similar data points for the data, and establishing relations between the data and the similar data points.
2. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the data acquisition rule comprises at least word stem co-occurrence degree, asymmetric common word string similarity, anchor link co-occurrence degree based on the wiki structure, structure information similarity based on the wiki structure, and topic distribution similarity based on KL divergence.
3. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the data acquisition standard is input by checking options in a questionnaire.
4. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: specifically, in the redundancy function, knowledge elements in k1 and k2 are taken out of e1 and e2 respectively, then X, Y and a relation R in e1 and e2 are taken out and compared with xe1, xe2, ye1 and ye2 respectively, element items with the same content are deleted, the original relation R value is reserved, and the relation and the undeleted items are merged.
5. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the data positioning module realizes data positioning based on a facet technology, and accurately positions data by calculating a facet distance between different data terms; when the data is positioned, corresponding terms are selected under the constraint of the known facets, so that the description of the required data is completed, and if the selection is successful, the corresponding data is returned; if the selection is unsuccessful, the system will calculate the similarity of terms from the synonym dictionary and the conceptual distance map, forming new positioning information.
6. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the data standardization module is also used for marking a source identifier for each target data, and the user can realize the access of the link where the source data is located by clicking the identifier.
7. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the hyperlink mark and the source identifier use different symbols.
8. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the system further comprises a blockchain used for caching and security-auditing the data, and all data must pass the security audit before entering the system.
9. The automated big-data-based software engineering knowledge base building system of claim 8, wherein: firstly, data features are extracted based on a deep convolution model, and then the obtained data features are input into a BP neural network model to realize data security audit.
CN201910904299.6A 2019-09-24 2019-09-24 Automatic construction system of software engineering knowledge base based on big data Withdrawn CN110569061A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910904299.6A CN110569061A (en) 2019-09-24 2019-09-24 Automatic construction system of software engineering knowledge base based on big data

Publications (1)

Publication Number Publication Date
CN110569061A true CN110569061A (en) 2019-12-13

Family

ID=68782310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910904299.6A Withdrawn CN110569061A (en) 2019-09-24 2019-09-24 Automatic construction system of software engineering knowledge base based on big data

Country Status (1)

Country Link
CN (1) CN110569061A (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1882939A (en) * 2003-07-02 2006-12-20 维布兰特媒体有限公司 Method and system for augmenting web content
US20080028286A1 (en) * 2006-07-27 2008-01-31 Chick Walter F Generation of hyperlinks to collaborative knowledge bases from terms in text
CN101493820A (en) * 2008-01-25 2009-07-29 北京华深慧正系统工程技术有限公司 Medicine Regulatory industry knowledge base platform and construct method thereof
US20130066903A1 (en) * 2011-09-12 2013-03-14 Siemens Corporatoin System for Linking Medical Terms for a Medical Knowledge Base
CN103699568A (en) * 2013-11-16 2014-04-02 西安交通大学城市学院 Method for extracting hyponymy relation of field terms from wikipedia
CN104915717A (en) * 2015-06-02 2015-09-16 百度在线网络技术(北京)有限公司 Data processing method, knowledge base reasoning method and related device
CN105095969A (en) * 2015-09-25 2015-11-25 沈阳农业大学 Self-learning model facing knowledge sharing
CN106294608A (en) * 2016-08-02 2017-01-04 郑州工业应用技术学院 A kind of framework method of clinical medicine commonsense knowledge base
CN106407208A (en) * 2015-07-29 2017-02-15 清华大学 Establishment method and system for city management ontology knowledge base
CN106875014A (en) * 2017-03-02 2017-06-20 上海交通大学 The automation of the soft project knowledge base based on semi-supervised learning builds implementation method
CN107631754A (en) * 2017-09-26 2018-01-26 中电科新型智慧城市研究院有限公司 Slope monitoring method and system based on big data platform
CN110209723A (en) * 2019-06-06 2019-09-06 广州商学院 A kind of equipment information collection system based on Internet of Things big data
CN110245186A (en) * 2019-05-21 2019-09-17 深圳壹账通智能科技有限公司 A kind of method for processing business and relevant device based on block chain
CN110263085A (en) * 2019-04-23 2019-09-20 阿里巴巴集团控股有限公司 Data processing system, method, calculating equipment and storage medium based on block chain

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
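曾建勋 (Zeng Jianxun): "知识链接的构建方式研究" (Research on construction methods of knowledge linking), 《图书情报工作》 (Library and Information Service) *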

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326310A (en) * 2021-06-18 2021-08-31 立信(重庆)数据科技股份有限公司 NLP-based research data standardization method and system
CN113326310B (en) * 2021-06-18 2023-04-18 立信(重庆)数据科技股份有限公司 NLP-based research data standardization method and system

Similar Documents

Publication Publication Date Title
JP6309644B2 (en) Method, system, and storage medium for realizing smart question answer
JP6901816B2 (en) Entity-related data generation methods, devices, devices, and storage media
CN108874878A (en) A kind of building system and method for knowledge mapping
CN103294781B (en) A kind of method and apparatus for processing page data
CN102722498B (en) Search engine and implementation method thereof
CN102253930B (en) A kind of method of text translation and device
CN113822067A (en) Key information extraction method and device, computer equipment and storage medium
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN112650848A (en) Urban railway public opinion information analysis method based on text semantic related passenger evaluation
CN107885793A (en) A kind of hot microblog topic analyzing and predicting method and system
CN111831794A (en) Knowledge map-based construction method for knowledge question-answering system in comprehensive pipe gallery industry
CN102722499A (en) Search engine and implementation method thereof
CN107102993A (en) A kind of user's demand analysis method and device
CN103246644A (en) Method and device for processing Internet public opinion information
LU503512B1 (en) Operating method for construction of knowledge graph based on naming rule and caching mechanism
CN102737021A (en) Search engine and realization method thereof
CN114090861A (en) Education field search engine construction method based on knowledge graph
CN112650858A (en) Method and device for acquiring emergency assistance information, computer equipment and medium
CN116450834A (en) Archive knowledge graph construction method based on multi-mode semantic features
EP4364044A1 (en) Automated troubleshooter
CN106649557A (en) Semantic association mining method for defect report and mail list
CN114386422A (en) Intelligent aid decision-making method and device based on enterprise pollution public opinion extraction
WO2023057988A1 (en) Generation and use of content briefs for network content authoring
KR101532252B1 (en) The system for collecting and analyzing of information of social network
CN110569061A (en) Automatic construction system of software engineering knowledge base based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20191213