CN110569061A - Automatic construction system of software engineering knowledge base based on big data - Google Patents
- Publication number
- CN110569061A (application number CN201910904299.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- knowledge base
- software engineering
- engineering knowledge
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an automatic construction system for a software engineering knowledge base based on big data. The system collects target data through multiple ports according to a target data collection rule, uses a redundancy function to remove redundant content from the target data, standardizes the data using MapReduce, and builds data association relations based on the data collection rule. Related data are marked with hyperlinks, so that a user can view the associated data by clicking a hyperlink mark, and each piece of associated data is marked with its association to the source file. The invention automates both the construction of the software engineering knowledge base and the construction of data association relations, greatly improves the quality of the resulting knowledge base, and facilitates later use of the data.
Description
Technical Field
The invention relates to the field of software engineering, in particular to an automatic construction system of a software engineering knowledge base based on big data.
Background
With the Semantic Web as a main direction of future development, it is important at the present stage to construct Web information that computers can understand and process. A knowledge base is a collection of knowledge composed of concepts, entities, and relations, and it has growing application and industrial value in rapidly developing areas such as information retrieval and knowledge question answering. The software engineering domain knowledge base is an important branch of knowledge bases and plays a role that is hard to replace. The quality of a knowledge base in the software engineering field therefore largely determines the quality and effect of the research that relies on it, and constructing a high-quality, large-scale knowledge base in this field is of great significance. Existing software engineering knowledge bases suffer from a complex automatic construction process, low working efficiency, sparse relations in the constructed knowledge base, and low construction quality.
Disclosure of Invention
The invention aims to provide an automatic construction system for a software engineering knowledge base based on big data, which automates the construction of the software engineering knowledge base and the establishment of data association relations, greatly improves the quality of the resulting knowledge base, and facilitates later use of the data.
In order to achieve the purpose, the invention adopts the technical scheme that:
The automatic construction system of the software engineering knowledge base based on big data comprises:
The target data acquisition rule generating module is used for generating a corresponding data acquisition rule according to a data acquisition standard input by the man-machine operation module;
The target data acquisition module acquires target data through multiple ports based on the data acquisition rule and sends the acquired data to the data standardization module;
The data standardization module is used for clearing redundant content from the target data, standardizing the data using MapReduce, and building data association relations based on the data acquisition rule; related data are marked with hyperlinks, so that a user can view the associated data by clicking a hyperlink mark, and each piece of associated data is marked with its association to the source file;
The data positioning module is used for finding a suitable position in the database for the standardized data, finding similar data points for the data, and establishing relations between the data and those similar data points.
Further, the data acquisition rule comprises at least word stem co-occurrence, asymmetric common string similarity, anchor link co-occurrence based on the wiki structure, structure information similarity based on the wiki structure, and topic distribution similarity based on KL divergence.
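As a non-limiting illustration of the last rule item, the topic distribution similarity based on KL divergence can be sketched as follows. The text does not say how the divergence is turned into a similarity score, so the symmetrization and the `1 / (1 + d)` mapping below are assumptions, as are the function names; the input distributions stand in for per-concept topic distributions produced by a topic model.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(P || Q) for two discrete distributions over the same topics."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    # Terms with p_i == 0 contribute nothing; eps guards against q_i == 0.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Average of KL(P || Q) and KL(Q || P); an assumption, since KL itself is asymmetric."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def topic_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Map the divergence to (0, 1]; identical distributions score 1.0 (assumed mapping)."""
    return 1.0 / (1.0 + symmetric_kl(p, q))
```

A smaller divergence gives a score closer to 1, so two concepts whose topic distributions nearly coincide are treated as closely associated.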
Further, the data acquisition standard is input by checking options in a questionnaire.
Further, the redundant content is cleared by using a redundancy function. Specifically, in the redundancy function, the knowledge elements e1 and e2 are taken out of k1 and k2 respectively; X, Y, and the relation R in e1 and e2 are then taken out and compared with xe1, xe2, ye1, and ye2 respectively; element items with the same content are deleted, the original relation R value is retained, and the relation is merged with the undeleted items.
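As a non-limiting illustration, one possible reading of the redundancy function is sketched below. It assumes that each knowledge element is a triple (X, R, Y), that two elements have "the same content" when their X and Y items match after whitespace and case normalization, and that the relation R of the first occurrence is the one retained. The type `KnowledgeElement` and the function names are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class KnowledgeElement(NamedTuple):
    x: str  # item X
    r: str  # relation R between X and Y
    y: str  # item Y


def _norm(text: str) -> str:
    """Normalize case and whitespace before comparing item contents."""
    return " ".join(text.lower().split())


def remove_redundancy(k1: Iterable[KnowledgeElement],
                      k2: Iterable[KnowledgeElement]) -> List[KnowledgeElement]:
    """Merge two knowledge sets, dropping elements whose X and Y repeat an earlier one.

    The first occurrence is kept, so its original relation R is retained.
    """
    kept: List[KnowledgeElement] = []
    seen = set()
    for element in list(k1) + list(k2):
        key = (_norm(element.x), _norm(element.y))
        if key in seen:
            continue  # same content as an element already kept
        seen.add(key)
        kept.append(element)
    return kept
```

Under these assumptions, merging a set containing ("Git", "is_a", "version control system") with a set containing ("git", "related_to", "Version Control System") keeps only the first element and its relation "is_a".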
Further, the data positioning module realizes data positioning based on a facet technology, and accurately positions data by calculating a facet distance between different data terms; when the data is positioned, corresponding terms are selected under the constraint of the known facets, so that the description of the required data is completed, and if the selection is successful, the corresponding data is returned; if the selection is unsuccessful, the system will calculate the similarity of terms from the synonym dictionary and the conceptual distance map, forming new positioning information.
Further, the data standardization module is further used for marking a source identifier for each target data, and a user can realize the access of the link where the source data is located by clicking the identifier.
Further, the hyperlink mark and the source identifier use different marker symbols.
Further, a blockchain is used to cache the data and audit its security, and all data must pass the security audit before entering the system.
Further, data features are first extracted with a deep convolutional model, and the extracted features are then input into a BP (back-propagation) neural network model to perform the data security audit.
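As a non-limiting illustration of the caching side of this step, the sketch below keeps audited records in a hash-linked log, so that later tampering with a cached record can be detected. It is only an in-memory stand-in: the text names no blockchain platform or consensus mechanism, and the class `AuditChain`, its field names, and the `passed_audit` flag (assumed to come from the audit model) are hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def _digest(body: Dict) -> str:
    """SHA-256 over a canonical JSON encoding of the block body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditChain:
    """Append-only log in which each block stores the hash of the previous block."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.blocks: List[Dict] = []

    def append(self, record: Dict, passed_audit: bool) -> Dict:
        """Cache a record together with its audit result."""
        prev_hash = self.blocks[-1]["hash"] if self.blocks else self.GENESIS
        body = {
            "index": len(self.blocks),
            "record": record,
            "passed_audit": passed_audit,
            "prev_hash": prev_hash,
        }
        block = dict(body, hash=_digest(body))
        self.blocks.append(block)
        return block

    def verify(self) -> bool:
        """Recompute every hash and check the links between consecutive blocks."""
        prev_hash = self.GENESIS
        for block in self.blocks:
            body = {k: v for k, v in block.items() if k != "hash"}
            if block["prev_hash"] != prev_hash or block["hash"] != _digest(body):
                return False
            prev_hash = block["hash"]
        return True

    def admitted(self) -> List[Dict]:
        """Only records that passed the audit may enter the rest of the system."""
        return [b["record"] for b in self.blocks if b["passed_audit"]]
```

In this sketch a record that fails the audit is still logged, but `admitted()` withholds it from the downstream modules.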
The invention has the following beneficial effects:
The automatic construction of the software engineering knowledge base and of the data association relations is realized, the quality of the resulting knowledge base is greatly improved, and later use of the data is facilitated.
Drawings
FIG. 1 is a system block diagram of an automated building system for a big data-based software engineering knowledge base according to an embodiment of the present invention.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the invention is further described in detail below with reference to examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, an automated building system of a big data-based software engineering knowledge base according to an embodiment of the present invention includes:
The target data acquisition rule generating module is used for generating a corresponding data acquisition rule according to a data acquisition standard input by the man-machine operation module;
The target data acquisition module is used for acquiring target data through multiple ports based on the data acquisition rule and sending the acquired data to the data standardization module after the acquired data is cached and audited by a block chain;
The blockchain is used to cache the data and audit its security; all data must pass the security audit before entering the system. During the audit, data features are first extracted with a deep convolutional model, and the extracted features are then input into a BP (back-propagation) neural network model to perform the data security audit;
The data standardization module is used for clearing redundant content from the target data, standardizing the data using MapReduce, and marking each piece of target data with a source identifier, which a user can click to access the link where the source data is located; it also builds data association relations based on the data acquisition rule and marks related data with hyperlinks, so that a user can view the associated data by clicking a hyperlink mark, with each piece of associated data marked with its association to the source file; the hyperlink mark and the source identifier use different marker symbols;
The data positioning module is used for finding a suitable position in the database for the standardized data, finding similar data points for the data, and establishing relations between the data and those similar data points.
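As a non-limiting illustration of the MapReduce-based standardization performed by the data standardization module, the sketch below imitates the map, shuffle, and reduce phases in a single process. It is not a Hadoop job; the record fields `term` and `source` and the normalization applied (lowercasing and whitespace collapsing) are assumptions, since the text does not specify what standardization consists of.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_record(record: Dict[str, str]) -> Iterator[Tuple[str, str]]:
    """Map phase: emit (standardized term, source) for one raw record."""
    term = " ".join(record["term"].lower().split())
    yield term, record["source"]


def reduce_term(term: str, sources: List[str]) -> Dict[str, object]:
    """Reduce phase: one standardized record per term, with its distinct sources."""
    return {"term": term, "sources": sorted(set(sources))}


def run_mapreduce(records: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Run map, group by key (the shuffle), then reduce each group."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return [reduce_term(key, values) for key, values in sorted(groups.items())]
```

Keeping the sources in the reduced record matches the requirement that each piece of data carries a source identifier after standardization.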
In this embodiment, the data collection rule comprises at least word stem co-occurrence, asymmetric common string similarity, anchor link co-occurrence based on the wiki structure, structure information similarity based on the wiki structure, and topic distribution similarity based on KL divergence.
Word stem co-occurrence: the relatedness of the concepts' word senses is calculated; the word stem (that is, the head word) of each concept is extracted with the Stanford Parser tool, and the co-occurrence of the word stems is calculated.
Asymmetric common string similarity: the hypernym-hyponym relation is asymmetric (when concept A is a hypernym of concept B, concept B is not necessarily a hypernym of concept A), so this feature effectively avoids interference from concepts that are closely related but not in a hypernym-hyponym relation; it is the similarity value of the common word strings between the concepts.
Anchor link co-occurrence based on the wiki structure: each concept corresponds to a Wikipedia page, so the structures in that page and the text information in them reflect well the information and meaning the concept refers to. NGD (Normalized Google Distance) is used to analyze the co-occurrence similarity of each structure of the concepts' Wikipedia pages. Because the anchor link sets in the Abstract, Text, and Category structures of a Wikipedia page reflect the meaning of the concept well, NGD is calculated on each of these three structures to obtain three different feature values. In addition, because the Category structure clearly expresses the hypernym-hyponym relation, if concept A is contained in the anchor link set of the Category of concept B, or concept B is contained in the anchor link set of the Category of concept A, the Category-structure NGD value is set equal to the current calculation result plus an additional coefficient V; in this embodiment V is set to 0.05 according to the value range of NGD.
Structure information similarity based on the wiki structure: Wikipedia provides two wiki structures for each concept, the outline (guideline) and the information box (infobox), both of which express the main information of the concept through keywords. The outline mainly describes the aspects covered by the current concept's wiki page, and the infobox mainly describes the characteristics and attributes of the current concept. Closely related software engineering concepts often have relatively similar outline and infobox structures, so the similarity of the information described by these structures is calculated with the Jaccard coefficient. In this embodiment, the structure information similarity is calculated twice, once for the outline and once for the infobox.
Topic distribution similarity based on KL divergence: some software engineering concepts that have a hypernym-hyponym relation do not have a complete wiki structure. To mine the hypernym-hyponym relations of such structurally incomplete concepts, this embodiment calculates the degree of association between concepts through KL divergence. First, the topic distribution of the software engineering concepts is modeled with LDA (Latent Dirichlet Allocation). When judging the relation between any two concepts, the probability distribution of each concept over the different topics is first calculated from the topic distribution, and KL divergence is then used to calculate the topic distribution similarity between the two concepts.
The propagation relations comprise the synonymy relation, the hypernym-hyponym relation, and the association relation; a concept that satisfies any one of these relations obtains the propagated label. The synonymy relation is determined as follows: when the candidate concept appears in the Redirect structure of the current concept, or the current concept appears in the Redirect structure of the candidate concept, the two are judged to be synonymous concepts. The hypernym-hyponym relation is determined as follows: when the candidate concept appears in the Category structure of the current concept, or the current concept appears in the Category structure of the candidate concept, the two are judged to be in a hypernym-hyponym relation. The association relation is determined by the Normalized Google Distance (NGD): when the NGD value meets the set criterion, the relation is judged to be an association relation.
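As a non-limiting illustration, the NGD feature, the Jaccard structure similarity, and the three relation tests described above can be sketched as follows. The NGD formula is the standard Normalized Google Distance over occurrence counts. The counts passed in, the dictionary layout of a concept (`name`, `redirects`, `categories`), and the `threshold` value are assumptions; in particular, the text only says that the NGD value must meet a set criterion, which is read here as "not greater than a threshold".

```python
import math
from typing import Dict, Iterable, Optional


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google Distance from occurrence counts.

    fx, fy: counts for each concept; fxy: co-occurrence count; n: corpus size.
    Assumes n is larger than both fx and fy.
    """
    if fxy == 0:
        return float("inf")  # never co-occur: maximally distant
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


def category_ngd(fx: int, fy: int, fxy: int, n: int,
                 linked_by_category: bool, v: float = 0.05) -> float:
    """Category-structure NGD: add the coefficient V when either concept
    appears in the Category anchor link set of the other."""
    base = ngd(fx, fy, fxy, n)
    return base + v if linked_by_category else base


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity of two keyword sets (outline or infobox keywords)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def classify_relation(current: Dict, candidate: Dict,
                      ngd_value: float, threshold: float = 0.5) -> Optional[str]:
    """Apply the synonymy, hypernym-hyponym, and association tests in turn."""
    if candidate["name"] in current["redirects"] or current["name"] in candidate["redirects"]:
        return "synonymy"
    if candidate["name"] in current["categories"] or current["name"] in candidate["categories"]:
        return "hypernym-hyponym"
    if ngd_value <= threshold:  # assumed reading of "meets the set criterion"
        return "association"
    return None
```

A candidate for which `classify_relation` returns any of the three relation names would receive the propagated label; a result of `None` means no label is propagated.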
In this embodiment, the data acquisition standard is input by checking options in a questionnaire. When a user needs to construct a database, the user clicks a "data construction" button, the system presents the data acquisition standard questionnaire in a pop-up dialog, and the user inputs the data acquisition standard by checking the options. The data acquisition standard comprises at least a data type, a data keyword, and a data source.
In this embodiment, the redundant content is cleared by using a redundancy function. Specifically, in the redundancy function, the knowledge elements e1 and e2 are taken out of k1 and k2 respectively; X, Y, and the relation R in e1 and e2 are then taken out and compared with xe1, xe2, ye1, and ye2 respectively; element items with the same content are deleted, the original relation R value is retained, and the relation is merged with the undeleted items. The data positioning module realizes data positioning based on a facet technology, and accurately positions data by calculating the facet distance between different data terms. When data is positioned, corresponding terms are selected under the constraint of the known facets to complete the description of the required data; if the selection succeeds, the corresponding data is returned; if the selection fails, the system calculates the similarity of terms from the synonym dictionary and the conceptual distance map to form new positioning information.
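As a non-limiting illustration of the facet-based positioning just described, the sketch below first tries an exact match of the selected terms under each facet and, if that fails, falls back to the nearest item according to a synonym dictionary and a concept distance table. The contents of `SYNONYMS` and `CONCEPT_DISTANCE`, the constant `MAX_DISTANCE`, the item layout, and the choice to sum per-facet distances are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical synonym dictionary: maps a term to its canonical form.
SYNONYMS = {"bug": "defect", "fault": "defect"}

# Hypothetical concept distance table between canonical terms.
CONCEPT_DISTANCE = {("defect", "failure"): 1, ("testing", "verification"): 2}

MAX_DISTANCE = 10  # distance used when two terms are unrelated in the table


def canonical(term: str) -> str:
    """Resolve a term through the synonym dictionary."""
    term = term.lower()
    return SYNONYMS.get(term, term)


def term_distance(a: str, b: str) -> int:
    """Distance between two terms after synonym resolution."""
    a, b = canonical(a), canonical(b)
    if a == b:
        return 0
    return CONCEPT_DISTANCE.get((a, b), CONCEPT_DISTANCE.get((b, a), MAX_DISTANCE))


def facet_distance(query: Dict[str, str], facets: Dict[str, str]) -> int:
    """Sum of term distances over the facets named in the query."""
    return sum(term_distance(term, facets.get(facet, "")) for facet, term in query.items())


def locate(query: Dict[str, str], items: List[Dict]) -> List[Dict]:
    """Return exact facet matches, or else the single nearest item."""
    exact = [i for i in items if all(i["facets"].get(f) == t for f, t in query.items())]
    if exact:
        return exact  # selection succeeded
    # Selection failed: form new positioning information from term similarity.
    return [min(items, key=lambda i: facet_distance(query, i["facets"]))]
```

With one item described as topic "defect" in phase "testing", a query for topic "bug" in phase "testing" has no exact match but resolves to that item through the synonym dictionary.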
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (9)
1. An automatic construction system of a software engineering knowledge base based on big data, characterized in that the system comprises:
The target data acquisition rule generating module is used for generating a corresponding data acquisition rule according to a data acquisition standard input by the man-machine operation module;
The target data acquisition module acquires target data through multiple ports based on the data acquisition rule and sends the acquired data to the data standardization module;
The data standardization module is used for clearing redundant content from the target data, standardizing the data using MapReduce, and building data association relations based on the data acquisition rule; related data are marked with hyperlinks, so that a user can view the associated data by clicking a hyperlink mark, and each piece of associated data is marked with its association to the source file;
The data positioning module is used for finding a suitable position in the database for the standardized data, finding similar data points for the data, and establishing relations between the data and those similar data points.
2. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the data acquisition rule comprises at least word stem co-occurrence, asymmetric common string similarity, anchor link co-occurrence based on the wiki structure, structure information similarity based on the wiki structure, and topic distribution similarity based on KL divergence.
3. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the data acquisition standard is input by checking options in a questionnaire.
4. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the redundant content is cleared by using a redundancy function; specifically, in the redundancy function, the knowledge elements e1 and e2 are taken out of k1 and k2 respectively; X, Y, and the relation R in e1 and e2 are then taken out and compared with xe1, xe2, ye1, and ye2 respectively; element items with the same content are deleted, the original relation R value is retained, and the relation is merged with the undeleted items.
5. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the data positioning module realizes data positioning based on a facet technology, and accurately positions data by calculating a facet distance between different data terms; when the data is positioned, corresponding terms are selected under the constraint of the known facets, so that the description of the required data is completed, and if the selection is successful, the corresponding data is returned; if the selection is unsuccessful, the system will calculate the similarity of terms from the synonym dictionary and the conceptual distance map, forming new positioning information.
6. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the data standardization module is also used for marking a source identifier for each target data, and the user can realize the access of the link where the source data is located by clicking the identifier.
7. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the hyperlink mark and the source identifier use different marker symbols.
8. The automated big-data-based software engineering knowledge base building system according to claim 1, wherein: the system further comprises a blockchain, which is used to cache the data and audit its security, and all data must pass the security audit before entering the system.
9. The automated big-data-based software engineering knowledge base building system of claim 8, wherein: firstly, data features are extracted based on a deep convolution model, and then the obtained data features are input into a BP neural network model to realize data security audit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910904299.6A CN110569061A (en) | 2019-09-24 | 2019-09-24 | Automatic construction system of software engineering knowledge base based on big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110569061A (en) | 2019-12-13 |
Family
ID=68782310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910904299.6A Withdrawn CN110569061A (en) | 2019-09-24 | 2019-09-24 | Automatic construction system of software engineering knowledge base based on big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110569061A (en) |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1882939A (en) * | 2003-07-02 | 2006-12-20 | 维布兰特媒体有限公司 | Method and system for augmenting web content |
US20080028286A1 (en) * | 2006-07-27 | 2008-01-31 | Chick Walter F | Generation of hyperlinks to collaborative knowledge bases from terms in text |
CN101493820A (en) * | 2008-01-25 | 2009-07-29 | 北京华深慧正系统工程技术有限公司 | Medicine Regulatory industry knowledge base platform and construct method thereof |
US20130066903A1 (en) * | 2011-09-12 | 2013-03-14 | Siemens Corporatoin | System for Linking Medical Terms for a Medical Knowledge Base |
CN103699568A (en) * | 2013-11-16 | 2014-04-02 | 西安交通大学城市学院 | Method for extracting hyponymy relation of field terms from wikipedia |
CN104915717A (en) * | 2015-06-02 | 2015-09-16 | 百度在线网络技术(北京)有限公司 | Data processing method, knowledge base reasoning method and related device |
CN106407208A (en) * | 2015-07-29 | 2017-02-15 | 清华大学 | Establishment method and system for city management ontology knowledge base |
CN105095969A (en) * | 2015-09-25 | 2015-11-25 | 沈阳农业大学 | Self-learning model facing knowledge sharing |
CN106294608A (en) * | 2016-08-02 | 2017-01-04 | 郑州工业应用技术学院 | A kind of framework method of clinical medicine commonsense knowledge base |
CN106875014A (en) * | 2017-03-02 | 2017-06-20 | 上海交通大学 | The automation of the soft project knowledge base based on semi-supervised learning builds implementation method |
CN107631754A (en) * | 2017-09-26 | 2018-01-26 | 中电科新型智慧城市研究院有限公司 | Slope monitoring method and system based on big data platform |
CN110263085A (en) * | 2019-04-23 | 2019-09-20 | 阿里巴巴集团控股有限公司 | Data processing system, method, calculating equipment and storage medium based on block chain |
CN110245186A (en) * | 2019-05-21 | 2019-09-17 | 深圳壹账通智能科技有限公司 | A kind of method for processing business and relevant device based on block chain |
CN110209723A (en) * | 2019-06-06 | 2019-09-06 | 广州商学院 | A kind of equipment information collection system based on Internet of Things big data |
Non-Patent Citations (1)
Title |
---|
曾建勋: "知识链接的构建方式研究", 《图书情报工作》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113326310A (en) * | 2021-06-18 | 2021-08-31 | 立信(重庆)数据科技股份有限公司 | NLP-based research data standardization method and system |
CN113326310B (en) * | 2021-06-18 | 2023-04-18 | 立信(重庆)数据科技股份有限公司 | NLP-based research data standardization method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6309644B2 (en) | Method, system, and storage medium for realizing smart question answer | |
JP6901816B2 (en) | Entity-related data generation methods, devices, devices, and storage media | |
CN108874878A (en) | A kind of building system and method for knowledge mapping | |
CN103294781B (en) | A kind of method and apparatus for processing page data | |
CN102722498B (en) | Search engine and implementation method thereof | |
CN102253930B (en) | A kind of method of text translation and device | |
CN113822067A (en) | Key information extraction method and device, computer equipment and storage medium | |
CN103544255A (en) | Text semantic relativity based network public opinion information analysis method | |
CN112650848A (en) | Urban railway public opinion information analysis method based on text semantic related passenger evaluation | |
CN107885793A (en) | A kind of hot microblog topic analyzing and predicting method and system | |
CN111831794A (en) | Knowledge map-based construction method for knowledge question-answering system in comprehensive pipe gallery industry | |
CN102722499A (en) | Search engine and implementation method thereof | |
CN107102993A (en) | A kind of user's demand analysis method and device | |
CN103246644A (en) | Method and device for processing Internet public opinion information | |
LU503512B1 (en) | Operating method for construction of knowledge graph based on naming rule and caching mechanism | |
CN102737021A (en) | Search engine and realization method thereof | |
CN114090861A (en) | Education field search engine construction method based on knowledge graph | |
CN112650858A (en) | Method and device for acquiring emergency assistance information, computer equipment and medium | |
CN116450834A (en) | Archive knowledge graph construction method based on multi-mode semantic features | |
EP4364044A1 (en) | Automated troubleshooter | |
CN106649557A (en) | Semantic association mining method for defect report and mail list | |
CN114386422A (en) | Intelligent aid decision-making method and device based on enterprise pollution public opinion extraction | |
WO2023057988A1 (en) | Generation and use of content briefs for network content authoring | |
KR101532252B1 (en) | The system for collecting and analyzing of information of social network | |
CN110569061A (en) | Automatic construction system of software engineering knowledge base based on big data |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20191213 |