WO2023130774A1 - Data acquisition system for scientific research capability assessment based on subject development - Google Patents

Data acquisition system for scientific research capability assessment based on subject development

Info

Publication number
WO2023130774A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
module
scientific research
input
electrically connected
Application number
PCT/CN2022/121792
Other languages
French (fr)
Chinese (zh)
Inventor
张颖聪
武青松
马鸣
向璨
陈实
吴建才
金阳
王征
罗飞
王智慧
Original Assignee
华中科技大学同济医学院附属协和医院
Application filed by 华中科技大学同济医学院附属协和医院
Publication of WO2023130774A1

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
                    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
                        • G06F16/21 Design, administration or maintenance of databases
                            • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
                        • G06F16/24 Querying
                            • G06F16/245 Query processing
                                • G06F16/2455 Query execution
                                • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
                            • G06F16/248 Presentation of query results
                        • G06F16/25 Integrating or interfacing systems involving database management systems
                        • G06F16/28 Databases characterised by their database models, e.g. relational or object models
                    • G06F16/90 Details of database functions independent of the retrieved data types
                        • G06F16/95 Retrieval from the web
                            • G06F16/951 Indexing; Web crawling techniques
            • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
                • G06Q10/00 Administration; Management
                    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling

Definitions

  • The disclosure relates to the technical field of scientific research capability assessment, and in particular to a data acquisition system for scientific research capability assessment based on subject development.
  • Scientific research capability assessment is a science and technology consulting activity aimed at improving science and technology management and decision-making. For a specific purpose, and following defined principles, procedures and indicators, it applies scientific, fair and feasible methods to comprehensively analyze and judge scientific and technological activities and their environment, and produces qualitative and quantitative evaluations.
  • The assessment should not be limited to the scientific and technological activities themselves; it should also consider the environmental conditions in which they take place and the effects they produce. To describe research strength comprehensively and grasp it accurately, several indicators should describe its internal structure and a complete indicator system should describe it as a whole, so that the indicator system can fully serve both its evaluation function and its guiding role in the development of scientific research.
  • The research level of a hospital depends on the research capability of its doctors. Research capability assessment is one effective way to examine that level, and it also provides a reference for strengthening hospital research management and for formulating research development plans.
  • In the process of assessment, it is often necessary to collect assessment data first.
  • In the prior art, given the characteristics of quantitative evaluation, data are often collected on the research results of the personnel being assessed in order to better describe the qualitative characteristics of those results.
  • However, collection generally considers only the number of research results and ignores each researcher's ranking within a result, which lowers the accuracy of the assessment; and because information sources are diverse and data structures are complex, accurate and useful information is hard to extract.
  • The present disclosure provides a data acquisition system for scientific research capability assessment based on subject development. The technical problem to be solved by one or more embodiments of the present disclosure is the following.
  • In the prior art, given the characteristics of quantitative evaluation, data are often collected on the research results of the personnel being assessed in order to better describe the qualitative characteristics of those results.
  • Collection generally considers only the number of research results and not each researcher's ranking within a result, which lowers the accuracy of the assessment; and because information sources are diverse and data structures are complex, accurate and useful information is hard to extract.
  • No correlation analysis is performed on the collected data, so the data remain scattered and are inconvenient for subsequent assessment.
  • A data acquisition system for scientific research capability assessment based on subject development, comprising a data mining module, a data reporting module and the Internet, wherein: the output of the Internet is electrically connected to the input of the data mining module; the output of the data mining module is electrically connected to the input of the data preprocessing module; the output of the data reporting module is electrically connected to the input of the data preprocessing module; the output of the data preprocessing module is electrically connected to the input of the feature extraction module; the output of the feature extraction module is electrically connected to the input of the investigation and correction module; the output of the investigation and correction module is electrically connected to the input of the cluster analysis module; the output of the cluster analysis module is electrically connected to the input of the association module; the output of the association module is electrically connected to the input of the quantitative calculation module; the output of the quantitative calculation module is electrically connected to the input of the data dimensionality reduction module; the output of the data dimensionality reduction module is electrically connected to the input of the database; and the output of the database is electrically connected to the input of the feature extraction module.
  • The data reporting module comprises a text entry module, a voice entry module and an image entry module, whose outputs are each electrically connected to the input of the data preprocessing module.
  • The data mining module is used to mine data related to scientific research capability, which comprises technological innovation capability, technology transformation capability, technological competitiveness and technological support capability.
  • Technological innovation capability comprises three elements: theoretical innovation, technological innovation and collaborative innovation. Technology transformation capability comprises two elements: military benefit and economic benefit. Technological competitiveness comprises three elements: academic competition, talent competition and development potential. Technological support capability comprises two elements: platform support and management support.
  • The data mining module is used to crawl data with a web crawler tool: it uses the Python-based Scrapy framework to screen-scrape data from websites and extract structured data from their pages; then, according to the needs of the assessment data, it analyzes the captured data with an association algorithm through data mining.
  • The data preprocessing module is configured to convert the data captured by the data mining module into a data set that a computer can recognize and compute on, and to perform the following on that data set: remove abnormal data, check for spelling errors, remove duplicate records, derive missing values and fill in incomplete records, and remove interference and noise through filtering and data cleaning so that useful information is strengthened.
  • The mining problem of association rules is to find the association rules in the transaction database $D$ that satisfy the minimum support $S_{\min}$ and the minimum confidence $C_{\min}$ given by the user.
  • The calculation formula of the quantitative calculation module is:
  • $A$ is the quantitative score of the performance evaluation of a scientific researcher.
  • $t_h$ and $S_h$ are the number and ranking of scientific research personnel in scientific research activities $h$ that meet the quantitative indicators $K_{ij\ldots}$
  • $K_{ij\ldots x}$ is the quantitative index value of a scientific researcher satisfying the quantitative index $K_{ij\ldots x}$.
  • The data dimensionality reduction module is used to reduce the dimensionality of multi-dimensional data through OLAP, convert the data into report form and store them in the database, where they can be queried; the data are finally used as the basis for assessing scientific research capability.
  • The text entry module is used to enter data by text input.
  • The voice entry module is used to enter data by voice input.
  • The image entry module is used to enter data by image input, and can also enter text data.
  • The Internet is used to search for, obtain and share data.
  • The database is used to manage, classify and sort the various data in the system, and to store them.
  • The feature extraction module is used to further screen the mined data and extract their important features.
  • A clustering threshold is selected at random, and the basic important features of the cluster-like data are determined by a random algorithm. The features are then clustered by a clustering algorithm to obtain classified feature data, an in-depth calculation of the average distinguishing values of the category features is performed, and a dimensionality-reduction arrangement yields data features that can be expressed clearly in a line chart, so that accurate assessment data are obtained.
  • Confidence and support are established to find the rules that satisfy the given minimum support and minimum confidence, association rules are mined and analyzed, and a quantitative calculation then yields the quantitative result index value. The clustering, dimensionality-reduction arrangement and association analysis together ensure that accurate information is extracted, which lays a solid foundation for subsequent assessment and improves the credibility and authenticity of the assessment of scientific research results.
  • The disclosure draws on the wide range of data available through the Internet, which facilitates searching for and acquiring big data. The data mining module uses a web crawler tool, built on the Python-based Scrapy framework, to crawl websites and extract structured data from their pages, so that data can be obtained over the network in real time and associated data can be updated more promptly.
  • FIG. 1 shows a data acquisition system for evaluating scientific research capabilities based on subject development according to one or more embodiments of the present disclosure
  • Fig. 2 shows a schematic structural diagram of the data reporting module of the present disclosure.
  • This disclosure provides a data acquisition system for scientific research capability assessment based on subject development, comprising a data mining module, a data reporting module and the Internet; the output of the Internet is electrically connected to the input of the data mining module.
  • The output of the data mining module is electrically connected to the input of the data preprocessing module.
  • The output of the data reporting module is electrically connected to the input of the data preprocessing module.
  • The output of the data preprocessing module is electrically connected to the input of the feature extraction module.
  • The output of the feature extraction module is electrically connected to the input of the investigation and correction module.
  • The output of the investigation and correction module is electrically connected to the input of the cluster analysis module, the output of the cluster analysis module is electrically connected to the input of the association module, and the output of the association module is electrically connected to the input of the quantitative calculation module.
  • The output of the quantitative calculation module is electrically connected to the input of the data dimensionality reduction module, the output of the data dimensionality reduction module is electrically connected to the input of the database, and the output of the database is electrically connected to the input of the feature extraction module.
  • The data reporting module comprises a text entry module, a voice entry module and an image entry module, whose outputs are each electrically connected to the input of the data preprocessing module.
  • The text entry module is used to enter data by text input.
  • The voice entry module is used to enter data by voice input.
  • The image entry module is used to enter data by image input, and can also enter text data.
  • the Internet is used to search, acquire and share data information.
  • the database is used to manage, classify and sort various data information in the system, and realize the storage of data information at the same time.
  • the feature extraction module is used to further screen and extract important features and characteristics of the mined data information.
  • The data mining module mines data related to scientific research capability, which comprises technological innovation capability, technology transformation capability, technological competitiveness and technological support capability.
  • Technological innovation capability comprises three elements: theoretical innovation, technological innovation and collaborative innovation.
  • Technology transformation capability comprises two elements: military benefit and economic benefit.
  • Technological competitiveness comprises three elements: academic competition, talent competition and development potential.
  • Technological support capability comprises two elements: platform support and management support.
  • The data mining module is used to crawl data with a web crawler tool: it uses the Python-based Scrapy framework, a fast, high-level screen-scraping and web-crawling framework, to crawl websites and extract structured data from their pages; then, according to the needs of the assessment data, it analyzes the captured data with an association algorithm through data mining.
  • Data preprocessing is used to convert the crawled data into a data set that a computer can recognize and operate on, and to perform the following on that data set: remove abnormal data, check for spelling errors, remove duplicate records, derive missing values and fill in incomplete records, and remove interference and noise through filtering and data cleaning so that useful information is strengthened.
  • An association rule $X \Rightarrow Y$ holds subject to a support $S$ and a confidence $C$. Support $S$ means that at least $S\%$ of the transactions in $D$ contain $X \cup Y$, that is, $\mathrm{support}(X \Rightarrow Y) = P(X \cup Y)$. Confidence $C$ means that, among the transactions in $D$ that contain $X$, at least $C\%$ also contain $Y$, that is, $\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X)$.
  • The mining problem of association rules is to find the association rules in the transaction database $D$ that satisfy the minimum support $S_{\min}$ and the minimum confidence $C_{\min}$ given by the user.
  • The calculation formula of the quantitative calculation module is:
  • $A$ is the quantitative score of the performance evaluation of a scientific researcher.
  • $t_h$ and $S_h$ are the number and ranking of scientific research personnel in scientific research activities $h$ that meet the quantitative indicators $K_{ij\ldots}$
  • $K_{ij\ldots x}$ is the quantitative index value of a scientific researcher satisfying the quantitative index $K_{ij\ldots x}$.
  • the data dimension reduction module is used to reduce the dimension of multi-dimensional data through OLAP, convert it into a report form and store it in the database, which can be queried in the database, and finally use the data as the evaluation basis to evaluate scientific research capabilities.
  • The cluster analysis module randomly selects a clustering threshold, obtains the data categories through a clustering algorithm, and distinguishes them to obtain a dimensionality-reduction ranking; the association module then establishes support and confidence and analyzes the resulting association rules.
  • the evaluation index value is calculated by the quantitative calculation module, and the accurate result evaluation value is obtained.
  • the multidimensional data is reduced by the OLAP of the data dimension reduction module, and it is converted into a report form and stored in the database.
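
The support and confidence conditions in the definitions above can be illustrated with a minimal sketch. It assumes each transaction is a set of item labels and enumerates only single-item rules by brute force; the sample database `D`, the item names and the thresholds are hypothetical and do not come from the disclosure, which does not name a specific mining algorithm.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Among transactions containing X, the fraction that also contain Y."""
    sx = support(transactions, x)
    if sx == 0:
        return 0.0
    return support(transactions, frozenset(x) | frozenset(y)) / sx


def mine_rules(transactions, s_min, c_min):
    """Enumerate single-item rules X => Y that meet both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for x, y in ((a, b), (b, a)):
            s = support(transactions, {x, y})
            c = confidence(transactions, {x}, {y})
            if s >= s_min and c >= c_min:
                rules.append((x, y, s, c))
    return rules


# Hypothetical transaction database D: one set of achievement types per researcher.
D = [
    frozenset({"paper", "grant"}),
    frozenset({"paper", "grant", "patent"}),
    frozenset({"paper"}),
    frozenset({"grant", "patent"}),
]
RULES = mine_rules(D, s_min=0.5, c_min=0.6)
```

Brute-force enumeration is only practical for small item sets; a production system would more likely use an established frequent-itemset algorithm.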

Abstract

The present invention relates to the technical field of scientific research capability assessment, and discloses a data acquisition system for scientific research capability assessment based on subject development. The method comprises: determining the basic important features of cluster-like data by means of a random algorithm; clustering the features by means of a clustering algorithm to obtain classified feature data; performing an in-depth calculation of the average distinguishing values of the category features; performing a dimensionality-reduction arrangement to obtain data features that can be expressed clearly in a line chart, so that accurate assessment data can be obtained; establishing confidence and support to find the rules that satisfy a given minimum support and minimum confidence, and mining and analyzing association rules; and performing a quantitative calculation to obtain a quantitative result index value. The clustering, dimensionality-reduction arrangement and association analysis ensure that accurate information is extracted, which provides a practical foundation for subsequent assessment and improves the credibility and authenticity of the assessment of scientific research results.

Description

A Data Acquisition System for Scientific Research Capability Assessment Based on Subject Development
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202210015385.3, titled "Data Acquisition for Scientific Research Capability Assessment Based on Subject Development" and filed on January 7, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of scientific research capability assessment, and in particular to a data acquisition system for scientific research capability assessment based on subject development.
Background
Science and technology are the primary productive force, and scientific research is one of the basic functions of medical colleges. Affiliated hospitals are an important part of medical colleges; besides treating patients, they carry out research as a core part of their work, and research capability is one of the important indicators of a medical college's overall strength. Medicine is a highly differentiated yet highly integrated discipline. Strong discipline construction is one of the keys to a hospital's sustainable development and, as a foundational project of hospital development, it carries a long-term strategic task.
Scientific research capability assessment is a science and technology consulting activity aimed at improving science and technology management and decision-making. For a specific purpose, and following defined principles, procedures and indicators, it applies scientific, fair and feasible methods to comprehensively analyze and judge scientific and technological activities and their environment, and produces qualitative and quantitative evaluations. The assessment should not be limited to the activities themselves; it should also consider the environmental conditions in which they take place and the effects they produce. To describe research strength comprehensively and grasp it accurately, several indicators should describe its internal structure and a complete indicator system should describe it as a whole, so that the indicator system can fully serve both its evaluation function and its guiding role in the development of scientific research.
The research level of a hospital depends on the research capability of its doctors. Research capability assessment is one effective way to examine that level, and it also provides a reference for strengthening hospital research management and for formulating research development plans. Assessment usually begins with collecting assessment data. In the prior art, given the characteristics of quantitative evaluation, data are often collected on the research results of the personnel being assessed in order to better describe the qualitative characteristics of those results. However, collection generally considers only the number of research results and ignores each researcher's ranking within a result, which lowers the accuracy of the assessment. In addition, because information sources are diverse and data structures are complex, accurate and useful information is hard to extract, and no correlation analysis is performed on the collected data, so the data remain scattered and are inconvenient for subsequent assessment. A data acquisition system for scientific research capability assessment based on subject development that solves these problems is therefore of considerable significance.
Summary
To overcome the above defects of the prior art, the present disclosure provides a data acquisition system for scientific research capability assessment based on subject development. The technical problem to be solved by one or more embodiments of the present disclosure is the following. In the prior art, given the characteristics of quantitative evaluation, data are often collected on the research results of the personnel being assessed in order to better describe the qualitative characteristics of those results. However, collection generally considers only the number of research results and ignores each researcher's ranking within a result, which lowers the accuracy of the assessment. Moreover, because information sources are diverse and data structures are complex, accurate and useful information is hard to extract, and no correlation analysis is performed on the collected data, so the data remain scattered and are inconvenient for subsequent assessment.
To achieve the above purpose, the present disclosure provides the following technical solution: a data acquisition system for scientific research capability assessment based on subject development, comprising a data mining module, a data reporting module and the Internet. The output of the Internet is electrically connected to the input of the data mining module. The outputs of the data mining module and of the data reporting module are each electrically connected to the input of the data preprocessing module. The output of the data preprocessing module is electrically connected to the input of the feature extraction module, whose output is electrically connected to the input of the investigation and correction module. The output of the investigation and correction module is electrically connected to the input of the cluster analysis module; the output of the cluster analysis module, to the input of the association module; the output of the association module, to the input of the quantitative calculation module; the output of the quantitative calculation module, to the input of the data dimensionality reduction module; and the output of the data dimensionality reduction module, to the input of the database. The output of the database is electrically connected back to the input of the feature extraction module.
As a further solution of the present disclosure, the data reporting module comprises a text entry module, a voice entry module and an image entry module, and the output ends of the text entry module, the voice entry module and the image entry module are electrically connected to the input end of the data preprocessing module.
As a further solution of the present disclosure, the data mining module is used to mine data related to scientific research capability. Scientific research capability comprises science and technology innovation capability, science and technology transformation capability, science and technology competitiveness, and science and technology support capability. The innovation capability comprises three elements: theoretical innovation, technological innovation and collaborative innovation. The transformation capability comprises two elements: military benefit and economic benefit. The competitiveness comprises three elements: academic competition, talent competition and development potential. The support capability comprises two elements: platform support and management support.
As a further solution of the present disclosure, the data mining module is used to: crawl data with a web crawler tool, using the Scrapy framework for screen scraping of data on the web and for extracting structured data from pages, so that data scraping from websites is implemented on the Python-based Scrapy framework; and then, according to the needs of the assessment data, analyze the scraped data with an association algorithm through data mining.
As a further solution of the present disclosure, the data preprocessing module is used to: convert the data scraped by the data mining module into a data set that a computer can recognize and operate on; and, on that data set, remove abnormal data, check for spelling errors, remove duplicate records, derive missing data and complete incomplete records, remove interference and noise through filtering and data cleaning, and enhance the useful information.
As a further solution of the present disclosure, the cluster analysis module is used to implement the following steps: S1, randomly select a clustering threshold, determine the category of each cluster by a random algorithm, and cluster the investigation-corrected data with a clustering algorithm to obtain labeled clusters $C=\{C_1,C_2,\ldots,C_k\}$, then compute, for each feature, the discrimination between each cluster and the other clusters; S2, compute the average $\mathrm{Mean}_i$ of the discrimination between categories, compute for each feature the maximum $\mathrm{Max}_i$ and the minimum $\mathrm{Min}_i$ of the average discrimination between categories, compute the discrimination of each feature across categories as $f_i=(\mathrm{Max}_i-\mathrm{Min}_i)/\mathrm{Mean}_i$, and sort the features in descending order of $f_i$ to obtain $f_i^*$ $(i=1,2,\ldots,m)$; S3, plot the result of S2 as a line chart and find in it the point of sharp change, or inflection point, $i_0$;
Figure PCTCN2022121792-appb-000001
is then the selected feature subset, and the features in this subset together constitute the feature cluster analysis document.
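Steps S2 and S3 can be sketched in a few lines of Python. The disclosure does not define how the discrimination between one cluster and the others is measured, so this sketch assumes the absolute difference between a feature's mean inside the cluster and its mean outside it. It also assumes that the inflection point $i_0$ is the position of the largest drop in the sorted scores. Cluster labels are taken as given; the function names are hypothetical.

```python
from statistics import mean

def cluster_discrimination(rows, labels, feature):
    """Per-cluster discrimination on one feature.

    Assumed measure: |mean inside the cluster - mean outside the cluster|.
    """
    scores = []
    for cluster in sorted(set(labels)):
        inside = [row[feature] for row, lab in zip(rows, labels) if lab == cluster]
        outside = [row[feature] for row, lab in zip(rows, labels) if lab != cluster]
        scores.append(abs(mean(inside) - mean(outside)))
    return scores

def feature_scores(rows, labels):
    """f_i = (Max_i - Min_i) / Mean_i for every feature index i."""
    scores = {}
    for i in range(len(rows[0])):
        d = cluster_discrimination(rows, labels, i)
        avg = mean(d)
        scores[i] = (max(d) - min(d)) / avg if avg else 0.0
    return scores

def select_features(scores):
    """Sort features by f_i (descending) and cut at the largest drop (assumed i_0)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    values = [scores[i] for i in ranked]
    if len(values) < 2:
        return ranked
    drops = [values[j] - values[j + 1] for j in range(len(values) - 1)]
    i0 = drops.index(max(drops)) + 1
    return ranked[:i0]
```

With this assumed measure at least three clusters are needed, because with two clusters every feature has identical inside/outside differences and therefore $f_i=0$.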
As a further solution of the present disclosure, the association steps and rules executed by the association module are as follows. Let $I=\{i_1,i_2,\ldots,i_m\}$ be a set of $m$ distinct items. Given a transaction database $D$, each transaction $T$ in it is a set of items from $I$, that is,
Figure PCTCN2022121792-appb-000002
Each $T$ has a unique identifier TD. An association rule is an implication of the form
Figure PCTCN2022121792-appb-000003
where
Figure PCTCN2022121792-appb-000004
and $X\cap Y=\Phi$. A rule holds subject to a support $S$ and a confidence $C$. For support $S$, at least $S\%$ of the transactions in $D$ contain $X\cup Y$, that is,
Figure PCTCN2022121792-appb-000005
For confidence $C$, among the transactions in $D$ that contain $X$, at least $C\%$ also contain $Y$, that is,
Figure PCTCN2022121792-appb-000006
The association rule mining problem is to find, in the transaction database $D$, the association rules that satisfy the user-specified minimum support $S_{min}$ and minimum confidence $C_{min}$.
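The formula images referenced above are not reproduced in this text. As a reading aid, the definitions that the prose describes can be written in the usual notation for association rules. This is an assumed reconstruction from the wording, not a transcription of the figures; $\Phi$ denotes the empty set, as in the text.

```latex
\begin{aligned}
& T \subseteq I, \qquad X \Rightarrow Y, \qquad X \subset I,\; Y \subset I,\; X \cap Y = \Phi \\[4pt]
& \operatorname{support}(X \Rightarrow Y)
    = \frac{\lvert \{\, T \in D : X \cup Y \subseteq T \,\} \rvert}{\lvert D \rvert} \;\ge\; S\% \\[4pt]
& \operatorname{confidence}(X \Rightarrow Y)
    = \frac{\lvert \{\, T \in D : X \cup Y \subseteq T \,\} \rvert}{\lvert \{\, T \in D : X \subseteq T \,\} \rvert} \;\ge\; C\%
\end{aligned}
```

Under this reading, a rule is reported when its support is at least $S_{min}$ and its confidence is at least $C_{min}$.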
As a further solution of the present disclosure, the calculation formula of the quantitative calculation module is:
Figure PCTCN2022121792-appb-000007
where $A$ is the quantitative score of the researcher's performance assessment; $t_h$ and $S_h$ are, respectively, the number of participants in, and the researcher's rank within, a research activity $h$ that meets the quantitative indicator $K_{ij\ldots x}$; $K_{ij\ldots x}$ is the actual number of the researcher's research outputs that meet the quantitative indicator $K_{ij\ldots x}$; and $K_{ij\ldots x}$ is the value of the quantitative indicator $K_{ij\ldots x}$ that the researcher meets.
As a further solution of the present disclosure, the data dimensionality reduction module is used to: reduce the dimensionality of multidimensional data through OLAP, convert the result into report form and store it in the database, where it can be queried; the data are finally used as the basis for assessing scientific research capability.
As a further solution of the present disclosure, the text entry module is used to enter data by text input; the voice entry module is used to enter data by voice input; the image entry module is used to enter data by image input and can also be used to enter text data; the Internet is used to search for, acquire and share data; the database is used to manage, classify and sort the data in the system and to store them; and the feature extraction module is used to further screen the mined data and extract its important features and characteristics.
The beneficial effects of the present disclosure are as follows:
The present disclosure randomly selects a clustering threshold, determines the basic important features of the data to be clustered by a random algorithm, and clusters the features with a clustering algorithm to obtain classified feature data. It then computes the average discrimination of the category features, ranks the resulting features and presents them in a line chart, from which accurate assessment data can be obtained. By establishing support and confidence, it finds the rules that satisfy the given minimum support and minimum confidence and mines and analyzes the association rules, after which quantitative calculation yields the quantitative indicator value. The clustering, ranking and association analysis together ensure that accurate information is extracted, which lays a solid foundation for subsequent assessment and improves the credibility and authenticity of the assessment of research outputs;
The present disclosure draws on the wide range of data available on the Internet, which facilitates searching for and acquiring big data. The data mining module obtains information with a web crawler tool, scraping web data and extracting structured data from pages, with scraping from websites implemented on the Python-based Scrapy framework. Data can thus be acquired over the network in real time, which improves the timeliness of data association updates.
Description of drawings
FIG. 1 shows a data acquisition system for scientific research capability assessment based on subject development according to one or more embodiments of the present disclosure;
FIG. 2 shows a schematic structural diagram of the data reporting module of the present disclosure.
Detailed description
The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present disclosure.
As shown in FIGS. 1 and 2, the present disclosure provides a data acquisition system for scientific research capability assessment based on subject development, comprising a data mining module, a data reporting module and the Internet. The output end of the Internet is electrically connected to the input end of the data mining module, the output end of the data mining module is electrically connected to the input end of the data preprocessing module, the output end of the data reporting module is electrically connected to the input end of the data preprocessing module, the output end of the data preprocessing module is electrically connected to the input end of the feature extraction module, and the output end of the feature extraction module is electrically connected to the input end of the investigation and correction module.
The output end of the investigation and correction module is electrically connected to the input end of the cluster analysis module, the output end of the cluster analysis module is electrically connected to the input end of the association module, the output end of the association module is electrically connected to the input end of the quantitative calculation module, the output end of the quantitative calculation module is electrically connected to the input end of the data dimensionality reduction module, the output end of the data dimensionality reduction module is electrically connected to the input end of the database, and the output end of the database is electrically connected to the input end of the feature extraction module.
The data reporting module comprises a text entry module, a voice entry module and an image entry module, and the output ends of the text entry module, the voice entry module and the image entry module are electrically connected to the input end of the data preprocessing module.
The text entry module is used to enter data by text input. The voice entry module is used to enter data by voice input. The image entry module is used to enter data by image input and can also be used to enter text data. The Internet is used to search for, acquire and share data. The database is used to manage, classify and sort the data in the system and to store them. The feature extraction module is used to further screen the mined data and extract its important features and characteristics.
Data mining covers data related to scientific research capability. Scientific research capability comprises science and technology innovation capability, science and technology transformation capability, science and technology competitiveness, and science and technology support capability. The innovation capability comprises three elements: theoretical innovation, technological innovation and collaborative innovation. The transformation capability comprises two elements: military benefit and economic benefit. The competitiveness comprises three elements: academic competition, talent competition and development potential. The support capability comprises two elements: platform support and management support.
The data mining module is used to crawl data with a web crawler tool, using the Scrapy framework for fast, high-level screen scraping of data on the web and for extracting structured data from pages, so that data scraping from websites is implemented on the Python-based Scrapy framework; it then analyzes the scraped data with an association algorithm through data mining, according to the needs of the assessment data. The present disclosure draws on the wide range of data available on the Internet, which facilitates searching for and acquiring big data. Data can thus be acquired over the network in real time, which improves the timeliness of data association updates.
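To illustrate what extracting structured data from a page can look like, here is a small sketch. The disclosure names Scrapy; to keep the example self-contained it uses Python's standard `html.parser` instead, so it shows only the extraction step, not crawling. The page markup (`li.paper`, `span.title`, `span.authors`) and the semicolon-separated author list are assumptions made for this example.

```python
from html.parser import HTMLParser

class PaperListParser(HTMLParser):
    """Collects {"title", "authors"} records from <li class="paper"> items.

    The markup is hypothetical; a real source would need its own selectors.
    """

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record currently being filled
        self._field = None    # field whose text is being read

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "li" and cls == "paper":
            self._current = {"title": "", "authors": []}
        elif self._current is not None and tag == "span" and cls in ("title", "authors"):
            self._field = cls

    def handle_data(self, data):
        if self._current is None or self._field is None:
            return
        if self._field == "title":
            self._current["title"] += data.strip()
        else:
            # Author order is kept, since the researcher's rank matters for scoring.
            self._current["authors"] = [a.strip() for a in data.split(";") if a.strip()]

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current is not None:
            self.records.append(self._current)
            self._current = None

def extract_papers(html):
    """Return one record per paper found in the HTML string."""
    parser = PaperListParser()
    parser.feed(html)
    parser.close()
    return parser.records
```

Each record keeps the authors in page order, which is the information needed later to account for a researcher's rank within an output.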
The data preprocessing module is used to convert the crawled data into a data set that a computer can recognize and operate on and, on that data set, to remove abnormal data, check for spelling errors, remove duplicate records, derive missing data and complete incomplete records, remove interference and noise through filtering and data cleaning, and enhance the useful information.
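Three of the preprocessing steps above (duplicate removal, removal of abnormal data, and completion of missing values) can be sketched as follows. The disclosure does not say how abnormal values are detected or how missing values are derived, so this sketch assumes a plausible-range check and a median fill. The record layout and the `papers` field are hypothetical; filtering of noise and enhancement of useful information are not covered.

```python
from statistics import median

def preprocess(records, field="papers", lower=0, upper=1000):
    """Deduplicate, drop out-of-range values, and fill missing values.

    `records` is a list of flat dicts; `None` marks a missing value.
    """
    # 1. Remove exact duplicate records, keeping the first occurrence.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))  # copy so the input is not modified

    # 2. Drop records whose value lies outside [lower, upper] (assumed rule).
    kept = [r for r in unique if r[field] is None or lower <= r[field] <= upper]

    # 3. Fill missing values with the median of the known values (assumed rule).
    known = [r[field] for r in kept if r[field] is not None]
    fill = median(known) if known else None
    for r in kept:
        if r[field] is None:
            r[field] = fill
    return kept
```

The order of the steps matters in this sketch: abnormal values are removed before the median is computed, so they do not distort the fill value.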
The cluster analysis module is used to implement the following steps S1 to S3. S1. Randomly select a clustering threshold, determine the category of each cluster by a random algorithm, and cluster the investigation-corrected data with a clustering algorithm to obtain labeled clusters $C=\{C_1,C_2,\ldots,C_k\}$; then compute, for each feature, the discrimination between each cluster and the other clusters. S2. Compute the average $\mathrm{Mean}_i$ of the discrimination between categories, then compute for each feature the maximum $\mathrm{Max}_i$ and the minimum $\mathrm{Min}_i$ of the average discrimination between categories, then compute the discrimination of each feature across categories, $f_i=(\mathrm{Max}_i-\mathrm{Min}_i)/\mathrm{Mean}_i$, and sort the features in descending order of $f_i$ to obtain $f_i^*$ $(i=1,2,\ldots,m)$. S3. Plot the above result as a line chart and find in it the point of sharp change, or inflection point, $i_0$;
Figure PCTCN2022121792-appb-000008
is then the selected feature subset, and the features in this subset together constitute the feature cluster analysis document.
The association steps and rules executed by the association module are as follows. Let $I=\{i_1,i_2,\ldots,i_m\}$ be a set of $m$ distinct items. Given a transaction database $D$, each transaction $T$ in it is a set of items from $I$, that is,
Figure PCTCN2022121792-appb-000009
Each $T$ has a unique identifier TD. An association rule is an implication of the form
Figure PCTCN2022121792-appb-000010
where
Figure PCTCN2022121792-appb-000011
and $X\cap Y=\Phi$. A rule holds subject to a support $S$ and a confidence $C$. For support $S$, at least $S\%$ of the transactions in $D$ contain $X\cup Y$, that is,
Figure PCTCN2022121792-appb-000012
For confidence $C$, among the transactions in $D$ that contain $X$, at least $C\%$ also contain $Y$, that is,
Figure PCTCN2022121792-appb-000013
The association rule mining problem is to find, in the transaction database $D$, the association rules that satisfy the user-specified minimum support $S_{min}$ and minimum confidence $C_{min}$.
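The mining problem stated above can be illustrated with a brute-force sketch that enumerates every candidate rule and keeps those meeting both thresholds. The disclosure does not name a mining algorithm, so exhaustive enumeration is an assumption made for clarity and is practical only for small item sets. Support and confidence are expressed as fractions rather than percentages, and the item names in any example data are made up.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def mine_rules(transactions, min_support, min_confidence):
    """Return (X, Y, support, confidence) for every rule X => Y meeting both thresholds."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            s = support(combo, transactions)  # support of X ∪ Y
            if s == 0 or s < min_support:
                continue
            # Split the item set into every non-empty X and its complement Y.
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    rhs = tuple(i for i in combo if i not in lhs)
                    conf = s / support(lhs, transactions)
                    if conf >= min_confidence:
                        rules.append((lhs, rhs, s, conf))
    return rules
```

For example, with the four transactions `{"paper", "grant"}`, `{"paper", "grant", "patent"}`, `{"paper"}` and `{"grant", "patent"}`, a minimum support of 0.5 and a minimum confidence of 0.9 leave a single rule, `patent => grant`.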
The calculation formula of the quantitative calculation module is:
Figure PCTCN2022121792-appb-000014
where $A$ is the quantitative score of the researcher's performance assessment; $t_h$ and $S_h$ are, respectively, the number of participants in, and the researcher's rank within, a research activity $h$ that meets the quantitative indicator $K_{ij\ldots x}$; $K_{ij\ldots x}$ is the actual number of the researcher's research outputs that meet the quantitative indicator $K_{ij\ldots x}$; and $K_{ij\ldots x}$ is the value of the quantitative indicator $K_{ij\ldots x}$ that the researcher meets.
The data dimensionality reduction module is used to reduce the dimensionality of multidimensional data through OLAP, convert the result into report form and store it in the database, where it can be queried; the data are finally used as the basis for assessing scientific research capability.
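One common way to reduce the dimensionality of multidimensional data in OLAP is a roll-up, which aggregates a measure over the dimensions that are dropped. The sketch below assumes that this is what the passage means and that the measure is summed; the dimension names (`department`, `year`, `indicator`) and the `score` measure are hypothetical, and storage in the database is left out.

```python
from collections import defaultdict

def roll_up(facts, keep, measure="score"):
    """Sum `measure` over every dimension not listed in `keep`.

    Returns one report row per combination of the kept dimensions, sorted by key.
    """
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[d] for d in keep)
        totals[key] += fact[measure]
    return [
        dict(zip(keep, key), **{measure: total})
        for key, total in sorted(totals.items())
    ]
```

Rolling a three-dimensional fact table up to `["department"]` yields a flat, report-style table with one row per department, which is the kind of form that can be stored and queried directly.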
The working principle of the present disclosure is as follows:
S1. Data are first entered through the data reporting module; at the same time, the data mining module can crawl the Internet and scrape structured data. The mined data are passed to the data preprocessing module, which screens the data for errors, deletes abnormal data and derives missing data, so that the data are enhanced;
S2. The feature extraction module then extracts the important basic features, or similar important basic features are retrieved from the database, and the investigation and correction module checks and corrects the credibility of the data features;
S3. Finally, the cluster analysis module randomly selects a clustering threshold, obtains the data categories with a clustering algorithm and ranks the features by discrimination; the association module establishes support and confidence and derives the association rules; the quantitative calculation module computes the assessment indicator value to obtain an accurate assessment result; and the OLAP function of the data dimensionality reduction module reduces the dimensionality of the multidimensional data, which are converted into report form and stored in the database.
The beneficial effects of the present disclosure are as follows:
The present disclosure randomly selects a clustering threshold, determines the basic important features of the data to be clustered by a random algorithm, and clusters the features with a clustering algorithm to obtain classified feature data. It then computes the average discrimination of the category features, ranks the resulting features and presents them in a line chart, from which accurate assessment data can be obtained. By establishing support and confidence, it finds the rules that satisfy the given minimum support and minimum confidence and mines and analyzes the association rules, after which quantitative calculation yields the quantitative indicator value. The clustering, ranking and association analysis together ensure that accurate information is extracted, which lays a solid foundation for subsequent assessment and improves the credibility and authenticity of the assessment of research outputs;
The present disclosure draws on the wide range of data available on the Internet, which facilitates searching for and acquiring big data. The data mining module obtains information with a web crawler tool, scraping web data and extracting structured data from pages, with scraping from websites implemented on the Python-based Scrapy framework. Data can thus be acquired over the network in real time, which improves the timeliness of data association updates.
Finally, it should be noted that, although the present disclosure has been described in detail above with a general description and specific embodiments, the above embodiments serve only to illustrate the technical solutions of the present disclosure and do not limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present disclosure.

Claims (10)

  1. A data acquisition system for scientific research capability assessment based on subject development, comprising:
    a data mining module, a data reporting module and the Internet, characterized in that: the output end of the Internet is electrically connected to the input end of the data mining module, the output end of the data mining module is electrically connected to the input end of a data preprocessing module, the output end of the data reporting module is electrically connected to the input end of the data preprocessing module, the output end of the data preprocessing module is electrically connected to the input end of a feature extraction module, and the output end of the feature extraction module is electrically connected to the input end of an investigation and correction module;
    the output end of the investigation and correction module is electrically connected to the input end of a cluster analysis module, the output end of the cluster analysis module is electrically connected to the input end of an association module, the output end of the association module is electrically connected to the input end of a quantitative calculation module, the output end of the quantitative calculation module is electrically connected to the input end of a data dimensionality reduction module, the output end of the data dimensionality reduction module is electrically connected to the input end of a database, and the output end of the database is electrically connected to the input end of the feature extraction module.
  2. The data acquisition system for scientific research capability assessment based on subject development according to claim 1, wherein the data reporting module comprises:
    a text entry module, a voice entry module and an image entry module, the output end of the text entry module, the output end of the voice entry module and the output end of the image entry module being electrically connected to the input end of the data preprocessing module.
  3. The data acquisition system for scientific research capability assessment based on subject development according to claim 1, wherein the data mining module is used to mine the following related data:
    data related to scientific research capability, the scientific research capability comprising science and technology innovation capability, science and technology transformation capability, science and technology competitiveness, and science and technology support capability; the innovation capability comprising three elements: theoretical innovation, technological innovation and collaborative innovation; the transformation capability comprising two elements: military benefit and economic benefit; the competitiveness comprising three elements: academic competition, talent competition and development potential; and the support capability comprising two elements: platform support and management support.
  4. The data acquisition system for scientific research capability assessment based on subject development according to claim 3, wherein the data mining module is used to:
    crawl data with a web crawler tool, using the Scrapy framework for screen scraping of data on the web and for extracting structured data from pages, so that data scraping from websites is implemented on the Python-based Scrapy framework;
    analyze the scraped data with an association algorithm through data mining, according to the needs of the assessment data.
  5. The data acquisition system for scientific research capability assessment based on subject development according to claim 4, wherein the data preprocessing module is used to:
    convert the data scraped by the data mining module into a data set that a computer can recognize and operate on;
    on the data set: remove abnormal data, check for spelling errors and remove duplicate records, derive missing data and complete incomplete records, remove interference and noise through filtering and data cleaning, and enhance the useful information.
  6. The data acquisition system for scientific research capability assessment based on subject development according to claim 1, wherein the cluster analysis module is configured to implement the following steps:
    S1. randomly select a clustering threshold, determine the category of each cluster by a random algorithm, and cluster the survey-corrected data with a clustering algorithm to obtain labelled clusters C = {C_1, C_2, …, C_k}; for each feature, calculate the degree of discrimination between each cluster and the other clusters;
    S2. for each feature i, calculate the mean Mean_i of its discrimination across the categories, and the maximum Max_i and minimum Min_i of the average discrimination between categories; calculate the discrimination of the feature across categories as f_i = (Max_i - Min_i)/Mean_i; and sort the features in descending order of f_i to obtain f_i* (i = 1, 2, …, m);
    S3. plot the result of S2 as a line chart and find in it the point of sharp change, or inflection point, i_0;
    Figure PCTCN2022121792-appb-100001
    is the selected feature subset, and the features in this subset together constitute the feature cluster analysis document.
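Steps S1 to S3 can be sketched in Python under several assumptions, since the claim leaves the discrimination measure and the inflection-point rule unspecified. Here the clustering is taken as already done (labels are supplied), the discrimination of a cluster on a feature is assumed to be the mean absolute gap between its centroid and the other centroids, and i_0 is assumed to sit at the largest drop between consecutive sorted scores. All function names and the sample data are hypothetical.

```python
from statistics import mean


def centroids(rows, labels):
    """Per-cluster mean of every feature."""
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(lab, []).append(row)
    return {lab: [mean(col) for col in zip(*members)] for lab, members in groups.items()}


def feature_scores(rows, labels):
    """f_i = (Max_i - Min_i) / Mean_i for every feature i (needs >= 2 clusters)."""
    cents = centroids(rows, labels)
    labs = sorted(cents)
    scores = []
    for i in range(len(rows[0])):
        # ASSUMPTION: discrimination of cluster c on feature i is the mean
        # absolute centroid gap between c and every other cluster.
        per_cluster = [
            mean(abs(cents[c][i] - cents[o][i]) for o in labs if o != c)
            for c in labs
        ]
        mean_i = mean(per_cluster)
        scores.append((max(per_cluster) - min(per_cluster)) / mean_i if mean_i else 0.0)
    return scores


def select_features(scores):
    """Sort features by descending f_i and cut at the sharpest drop.
    ASSUMPTION: the inflection point i_0 is the position of the largest
    difference between consecutive sorted scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [scores[i] for i in order]
    drops = [ranked[j] - ranked[j + 1] for j in range(len(ranked) - 1)]
    i0 = drops.index(max(drops)) + 1 if drops else len(ranked)
    return order[:i0]


# Hypothetical data: 6 samples, 3 features, 3 clusters.
rows = [[0, 5, 1], [0, 5, 1], [1, 5, 1], [1, 5, 1], [10, 5, 2], [10, 5, 2]]
labels = [0, 0, 1, 1, 2, 2]

scores = feature_scores(rows, labels)
selected = select_features(scores)
```

Because f_i is a relative spread, a feature that is constant across clusters (the second one here) scores zero and falls below the cut.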
  7. The data acquisition system for scientific research capability assessment based on subject development according to claim 1, wherein the association steps and rules executed by the association module are as follows:
    Let I = {i_1, i_2, …, i_m} be a set of m distinct items. Given a transaction database D, each transaction T in D is a set of items from I, that is,
    Figure PCTCN2022121792-appb-100002
    Each T has a unique identifier TD. An association rule is an implication of the form
    Figure PCTCN2022121792-appb-100003
    where
    Figure PCTCN2022121792-appb-100004
    and X∩Y = Φ. An association rule holds subject to a support S and a confidence C. Support S means that at least S% of the transactions in D contain X∪Y, that is,
    Figure PCTCN2022121792-appb-100005
    Confidence C means that, among the transactions in D that contain X, at least C% also contain Y, that is,
    Figure PCTCN2022121792-appb-100006
    The association rule mining problem is to find, in the transaction database D, the association rules that satisfy the user-specified minimum support S_min and minimum confidence C_min.
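The support and confidence conditions can be made concrete with a short sketch. It follows the textbook definitions, which match the prose of the claim: support is the share of transactions in D containing X∪Y, and confidence is the share of transactions containing X that also contain Y. Values are expressed as fractions rather than percentages, only single-item rules are enumerated for brevity, and the function names and sample transactions are hypothetical.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Among transactions containing X, the fraction that also contain Y."""
    sx = support(transactions, x)
    return support(transactions, set(x) | set(y)) / sx if sx else 0.0


def mine_rules(transactions, s_min, c_min):
    """Brute-force search for rules {x} => {y} that meet both thresholds.
    A full miner (e.g. Apriori) would also consider larger itemsets."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for x, y in ((a, b), (b, a)):
            s = support(transactions, {x, y})
            c = confidence(transactions, {x}, {y})
            if s >= s_min and c >= c_min:
                rules.append((x, y, s, c))
    return rules


# Hypothetical transaction database D: research outputs per researcher.
D = [
    frozenset({"paper", "grant"}),
    frozenset({"paper", "grant", "patent"}),
    frozenset({"paper"}),
    frozenset({"grant", "patent"}),
]

rules = mine_rules(D, s_min=0.5, c_min=0.8)
```

With these thresholds only the rule patent => grant survives: both outputs occur together in half of the transactions, and every transaction containing a patent also contains a grant.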
  8. The data acquisition system for scientific research capability assessment based on subject development according to claim 1, wherein the calculation formula of the quantitative calculation module is:
    Figure PCTCN2022121792-appb-100007
    where A is the quantitative score of a researcher's performance evaluation; t_h and S_h are, respectively, the number of participants in, and the researcher's rank within, scientific research activity h that meets quantitative indicator K_ij…x; K_ij…x is the actual number of the researcher's scientific research results that meet quantitative indicator K_ij…x; and K_ij…x is the quantitative indicator value of the quantitative indicator K_ij…x met by the researcher.
  9. The data acquisition system for scientific research capability assessment based on subject development according to claim 1, wherein the data dimensionality reduction module is configured to:
    reduce the dimensionality of multidimensional data through OLAP, convert the result into report form and store it in the database;
    query the database, and finally use the data as the basis for assessing scientific research capability.
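One way to picture this step is an OLAP-style roll-up: a fact table with several dimensions is aggregated over one of them, the result is stored as a report table, and the report is then queried. The sketch below uses Python's standard-library `sqlite3` with an in-memory database. The schema, the department names and the helper `build_report` are hypothetical, and a production system would more likely use a dedicated OLAP engine.

```python
import sqlite3

# Hypothetical fact rows: (department, year, output type, count).
FACTS = [
    ("Cardiology", 2020, "paper", 4),
    ("Cardiology", 2021, "paper", 6),
    ("Cardiology", 2021, "patent", 1),
    ("Oncology", 2020, "paper", 3),
    ("Oncology", 2021, "patent", 2),
]


def build_report(facts):
    """Load 3-dimensional facts, roll up over `year`, and store a 2-dimensional report."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE facts (dept TEXT, year INTEGER, kind TEXT, n INTEGER)")
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", facts)
    # Roll-up: dropping the year dimension reduces the cube from 3 to 2 dimensions.
    conn.execute(
        "CREATE TABLE report AS "
        "SELECT dept, kind, SUM(n) AS total FROM facts GROUP BY dept, kind"
    )
    return conn


conn = build_report(FACTS)

# Query the stored report as the basis for assessment.
report_rows = conn.execute(
    "SELECT dept, kind, total FROM report ORDER BY dept, kind"
).fetchall()
```

Each row of `report_rows` is one line of the report, for example the total number of papers per department across all years.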
  10. The data acquisition system for scientific research capability assessment based on subject development according to claim 2, wherein:
    the text input module is configured to enter data information by text input;
    the voice input module is configured to enter data information by voice input;
    the image input module is configured to enter data by image input, and may also handle textual data;
    the Internet is used for searching, obtaining and sharing data information;
    the database is used for managing, classifying and sorting the data information in the system, and for storing the data information;
    the feature extraction module is configured to screen and extract the important features and characteristics of the mined data information.
PCT/CN2022/121792 2022-01-07 2022-09-27 Data acquisition system for scientific research capability assessment based on subject development WO2023130774A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210015385.3A CN114358611A (en) 2022-01-07 2022-01-07 Subject development-based data acquisition system for scientific research capability assessment
CN202210015385.3 2022-01-07

Publications (1)

Publication Number Publication Date
WO2023130774A1 (en) 2023-07-13

Family

ID=81107471

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/121792 WO2023130774A1 (en) 2022-01-07 2022-09-27 Data acquisition system for scientific research capability assessment based on subject development

Country Status (2)

Country Link
CN (1) CN114358611A (en)
WO (1) WO2023130774A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117891812A (en) * 2024-03-18 2024-04-16 北京数字一百信息技术有限公司 Big data cleaning method and system based on artificial intelligence

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114358611A (en) * 2022-01-07 2022-04-15 华中科技大学同济医学院附属协和医院 Subject development-based data acquisition system for scientific research capability assessment
CN116384820A (en) * 2023-03-31 2023-07-04 四川省自然资源科学研究院(四川省生产力促进中心) Scientific and technological innovation capability assessment method, system, equipment and medium for enterprises

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020152201A1 (en) * 2001-04-17 2002-10-17 International Business Machines Corporation Mining of generalised disjunctive association rules
CN110751355A (en) * 2018-12-06 2020-02-04 国网河北省电力有限公司经济技术研究院 Scientific and technological achievement assessment method and device
CN111078852A (en) * 2019-12-09 2020-04-28 武汉大学 College leading-edge scientific research team detection system based on machine learning
CN111639237A (en) * 2020-04-07 2020-09-08 安徽理工大学 Electric power communication network risk assessment system based on clustering and association rule mining
CN112149955A (en) * 2020-08-18 2020-12-29 国网河北省电力有限公司沧州供电分公司 Scientific and technological achievement evaluation platform system
CN114358611A (en) * 2022-01-07 2022-04-15 华中科技大学同济医学院附属协和医院 Subject development-based data acquisition system for scientific research capability assessment

Also Published As

Publication number Publication date
CN114358611A (en) 2022-04-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22918228

Country of ref document: EP

Kind code of ref document: A1