CN115563376A - Database construction method of mineral resource data based on multivariate data crawling and integration - Google Patents

Database construction method of mineral resource data based on multivariate data crawling and integration

Info

Publication number
CN115563376A
Authority
CN
China
Prior art keywords
data
mineral
mineral resource
information
crawling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211320811.0A
Other languages
Chinese (zh)
Inventor
鞠楠
张国宾
施璐
伍月
刘欣
张承程
吴桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Geological Survey Center China Geological Survey
Liaoning Technical University
Original Assignee
Shenyang Geological Survey Center China Geological Survey
Liaoning Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Geological Survey Center China Geological Survey, Liaoning Technical University filed Critical Shenyang Geological Survey Center China Geological Survey
Priority to CN202211320811.0A priority Critical patent/CN115563376A/en
Publication of CN115563376A publication Critical patent/CN115563376A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for building a mineral resource database based on crawling and integrating multi-source data. It relates to the technical field of mineral resource data management and addresses the large volume of mineral resource data and the difficulty of extracting, fusing, and analyzing it. A preprocessing module and a processing module screen the data gathered by a web crawler according to the requirements and remove irrelevant information, leaving only the required data. A classification module and a storage module then classify and store the data, which makes the data easier to receive, store, and retrieve later.

Description

Database construction method of mineral resource data based on multivariate data crawling and integration

Technical Field

The invention relates to the technical field of mineral resource data management, and in particular to a method for building a mineral resource database based on crawling and integrating multi-source data.

Background Art

Mineral resources are aggregates of minerals or useful elements that are formed by geological mineralization, occur naturally within the crust, either buried underground or exposed at the surface, in solid, liquid, or gaseous form, and have value for development and utilization.

Mineral resources are important natural resources formed over millions or even hundreds of millions of years of geological change. They are an important material basis for the development of social production, and production and daily life in modern society depend on them.

A mineral resources database is a computer-managed collection of data on the geographical location, geology, quantity, quality, economic value, and other attributes of mineral resources in a country or worldwide. It can serve as a basis for analyzing mineral supply and demand and for formulating land use and mineral development policy, and it supports research on regional metallogenic patterns and metallogenic prediction. The widespread use of computers has improved the collection, management, and analysis of mineral resource data and has accelerated this work.

As early as 1973, the U.S. Geological Survey had established a computer-managed resource database containing geological, reserve, and resource data for about 4,000 domestic and 6,000 foreign deposits and mineral occurrences. Most countries have since established their own mineral resource databases. For a worldwide mineral resources database, the comparability of classification standards and estimation methods across countries is essential, and applying the United Nations classification framework for solid mineral reserves and resources is a feasible way to achieve it.

Diversification refers to things developing to a rich state with many categories and many industries. Historically, the diversification of modern civilization is reflected in the eastward spread of Western learning. It includes differences in belief, culture, custom, and thinking, also called differences of transition, which arise because people perceive changes in their surroundings differently and differ in comprehension, habits of thought, and level of understanding.

Data crawling means using a program to obtain content from the required websites, such as text, video, and images.

A web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to defined rules. Less common names include ant, automatic indexer, emulator, and worm.

A web crawler is a program that automatically retrieves web pages. It downloads pages from the World Wide Web for a search engine and is an important component of one. A traditional crawler starts from the URLs of one or more seed pages, obtains the URLs found on those pages, and, as it crawls, keeps extracting new URLs from the current page and adding them to a queue until a stop condition of the system is met.

The workflow of a focused crawler is more complex. It uses a web page analysis algorithm to filter out links unrelated to the topic, keeps the useful links, and places them in the queue of URLs waiting to be fetched. It then selects the next URL from the queue according to a search strategy and repeats the process until a stop condition of the system is reached. All pages fetched by the crawler are stored by the system, analyzed, filtered, and indexed for later query and retrieval. For a focused crawler, the results of this analysis may also provide feedback and guidance for subsequent crawling.
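The queue-driven loop of a focused crawler can be sketched without any network access. The sketch below is illustrative only: the in-memory `PAGES` dictionary stands in for the web, the relevance test is a plain keyword match rather than a real page-analysis algorithm, and the names `is_relevant`, `focused_crawl`, and `max_pages` are hypothetical and do not come from the patent. As a simplification, links are followed only when the page that contains them is on topic.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a": ("copper mine reserves report", ["b", "c"]),
    "b": ("football results", ["d"]),
    "c": ("iron ore mine output", ["d", "a"]),
    "d": ("mine safety and copper prices", []),
}

def is_relevant(text, topic_words):
    """Stand-in for the page-analysis algorithm: any topic word present."""
    return any(word in text.split() for word in topic_words)

def focused_crawl(seeds, topic_words, max_pages=10):
    """Breadth-first crawl that only follows links found on relevant pages."""
    queue = deque(seeds)          # URLs waiting to be fetched
    seen = set(seeds)             # avoid re-queuing the same URL
    stored = []                   # relevant pages kept for later indexing
    while queue and len(stored) < max_pages:   # stop condition
        url = queue.popleft()
        text, links = PAGES.get(url, ("", []))
        if not is_relevant(text, topic_words):
            continue              # drop the off-topic page and its links
        stored.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return stored
```

A first-in, first-out queue gives a breadth-first search strategy; a priority queue ordered by a relevance score would be one way to implement the feedback that the text mentions.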

With the rapid growth of online information, mineral resource data has become large in volume and highly specialized. Data storage differs between mineral types, and the data is varied, intricate, and poorly interoperable. Extracting and integrating the required multi-source mineral data quickly, and fusing and analyzing it, is therefore of considerable value for raising the level of mineral data informatization.

Summary of the Invention

In view of the problems in the prior art, the invention discloses a system for building a mineral resource database based on crawling and integrating multi-source data. The technical solution comprises:

a data server, used to establish the mineral resource database;

a central server, used to obtain mineral resource information from diverse mineral resource information sources;

a data crawling module, used to crawl the mineral resource information sources with a web crawler;

a preprocessing module, used for preliminary screening of the data obtained by the web crawler;

a processing module, used to screen the preprocessed data further, so that the data obtained by the web crawler is the data the user requires;

a classification module, used to classify the processed data for storage;

a storage module, which stores the processed and classified mineral resource data in the mineral resource database system.

In a preferred technical solution of the invention, the preprocessing module processes the transmitted data in the following steps:

comparing the crawled web page information with the search keywords that were entered;

discarding web page information in which the number of keywords is below a preset threshold, and transmitting web page information in which the number of keywords exceeds the preset threshold to the processing module. Together, the preprocessing module and the processing module screen the data gathered by the web crawler according to the requirements and remove irrelevant information, leaving the required data.
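The count-based screening step can be illustrated as follows. This is a minimal sketch under stated assumptions: keyword occurrences are counted by case-insensitive substring matching, and a page whose count exactly equals the threshold is kept, a case the text here leaves open. The function names `keyword_hits` and `prefilter` are hypothetical.

```python
def keyword_hits(page_text, keywords):
    """Count how many times any of the keywords occurs in the page text."""
    text = page_text.lower()
    return sum(text.count(kw.lower()) for kw in keywords)

def prefilter(pages, keywords, threshold):
    """Keep pages whose keyword count reaches the threshold; drop the rest.

    Assumption: a count equal to the threshold passes.
    """
    return [p for p in pages if keyword_hits(p, keywords) >= threshold]
```

Substring counting is crude (for example, "mine" also matches inside "mineral"), so a production filter would likely tokenize the text first.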

In a preferred technical solution of the invention, the processing module processes the preprocessed data in the following steps:

obtaining the web page information that matches the keywords;

rechecking the obtained web page information;

sending the correct web page information to the classification module and deleting the web page information that does not match.

In a preferred technical solution of the invention, the classification module classifies the processed data, which can be divided into text, pictures, parameters, and video.

The method for building a mineral resource database based on crawling and integrating multi-source data adopts a technical solution comprising the following steps:

establishing the mineral resource database;

obtaining diverse information about mineral resources through the central server;

crawling the information sources with a web crawler;

preprocessing the crawled data;

performing secondary processing on the preprocessed data;

storing the processed data in the mineral resource database.
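The order of the stages listed above can be sketched as a small driver function. Everything here is an assumption made for illustration: the patent does not define an interface, so the four stages are passed in as plain callables, and `build_database` is a hypothetical name.

```python
def build_database(sources, crawl, preprocess, process, store):
    """Run the stages in the listed order and return the number of stored records.

    crawl(source)       -> raw items fetched from one information source
    preprocess(raw)     -> items that survive the preliminary screening
    process(candidates) -> items that survive the secondary processing
    store(records)      -> number of records written to the database
    """
    stored = 0
    for source in sources:
        raw = crawl(source)
        candidates = preprocess(raw)
        records = process(candidates)
        stored += store(records)
    return stored
```

Passing the stages as arguments keeps each one replaceable, which suits a design in which screening rules differ by mineral type or project.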

In a preferred technical solution of the invention, the structure of the mineral resource database is established as follows: according to the types of mineral resource data, general fields and extended fields are distinguished by mineral type or by project, and the database structure is built on that basis. With the database established, a web crawler crawls relevant content from mineral resource data sources around the world and integrates it into the required mineral resource database, which covers a large amount of information to meet users' needs.

In a preferred technical solution of the invention, crawling the information sources with a web crawler includes entering keywords: the keywords for the information to be crawled are entered first; there may be one or more keywords, separated by spaces, commas, or semicolons.
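Splitting such a keyword string is straightforward with a regular expression. The sketch below also accepts full-width commas and semicolons, on the assumption that keywords may be typed with a Chinese input method; `parse_keywords` is a hypothetical helper name.

```python
import re

def parse_keywords(raw):
    """Split a keyword string on spaces, commas, or semicolons."""
    # ASCII and full-width separators are both accepted (assumption).
    parts = re.split(r"[\s,;,;]+", raw.strip())
    return [p for p in parts if p]
```

Runs of mixed separators collapse into a single split point, so `"copper, iron"` yields two keywords rather than an empty string between them.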

In a preferred technical solution of the invention, the mineral resource database further includes a query terminal, through which users at all levels obtain query, statistics, analysis, and early-warning results from the central server; the results are displayed as tables or charts. Through the cooperation of the classification module and the storage module, data is stored by category, which improves the receiving and storage of data and makes later retrieval from the storage module easier. This raises the efficiency of extracting mineral resource data from files of various kinds and loading it into the database, which matters for improving the informatization of geological and mineral work.

Beneficial effects of the invention: by establishing the mineral resource database and using a web crawler to crawl relevant content from mineral resource data sources around the world, the invention integrates that content into the required mineral resource database, which covers a large amount of information to meet users' needs. The preprocessing module and the processing module screen the crawled data according to the requirements and remove irrelevant information, leaving the required data. Through the cooperation of the classification module and the storage module, data is stored by category, which improves the receiving and storage of data and makes later retrieval from the storage module easier. This raises the efficiency of extracting mineral resource data from files of various kinds and loading it into the database, which matters for improving the informatization of geological and mineral work.

Brief Description of the Drawings

Fig. 1 is a flow chart of the method of the invention;

Fig. 2 is a schematic structural diagram of the system of the invention;

Fig. 3 is a schematic diagram of the preprocessing module of the invention;

Fig. 4 is a schematic diagram of the processing module of the invention;

Fig. 5 is a schematic diagram of the classification module of the invention.

Detailed Description

Example 1

As shown in Figs. 2 to 5, the invention discloses a system for building a mineral resource database based on crawling and integrating multi-source data. The technical solution comprises:

a data server, used to establish the mineral resource database;

a central server, used to obtain mineral resource information from diverse mineral resource information sources;

a data crawling module, used to crawl the mineral resource information sources with a web crawler;

a preprocessing module, used for preliminary screening of the data obtained by the web crawler;

a processing module, used to screen the preprocessed data further, so that the data obtained by the web crawler is the data the user requires;

a classification module, used to classify the processed data for storage;

a storage module, which stores the processed and classified mineral resource data in the mineral resource database system.

Data server: the types of mineral resource data stored on the data server include spatial data, temporal data, attribute data, and other data. Spatial data includes shp and MapGIS data; temporal data includes dates and years; attribute data includes tabular data; other data includes PDF files, text documents, and video.

Central server: the central server is connected to external multi-source mineral resource data sources and can exchange data with them. The main sources are the websites of national geological surveys (such as the U.S. Geological Survey, USGS; the British Geological Survey, BGS; and Japan's oil, gas, and metals agency, JOGMEC), national ministries of commerce, national ministries of foreign affairs, national mining departments (such as Zambia's Ministry of Mines and Minerals Development), mineral industry associations (nonferrous metals associations, the World Steel Association, and others), large mining companies (such as Minmetals, Chinalco, and Glencore), and other sites (such as the risk consultancy Control Risks and the global geological data platform OneGeology).

In a preferred technical solution of the invention, the preprocessing module processes the transmitted data in the following steps:

comparing the crawled web page information with the search keywords that were entered;

discarding web page information in which the number of keywords is below a preset threshold, and transmitting web page information in which the number of keywords exceeds the preset threshold to the processing module. Together, the preprocessing module and the processing module screen the data gathered by the web crawler according to the requirements and remove irrelevant information, leaving the required data.

In a preferred technical solution of the invention, the processing module processes the preprocessed data in the following steps:

obtaining the web page information that matches the keywords;

rechecking the obtained web page information;

sending the correct web page information to the classification module and deleting the web page information that does not match.

In a preferred technical solution of the invention, the classification module classifies the processed data, which can be divided into text, pictures, parameters, and video. Through the cooperation of the classification module and the storage module, data is stored by category, which improves the receiving and storage of data and makes later retrieval from the storage module easier. This raises the efficiency of extracting mineral resource data from files of various kinds and loading it into the database, which matters for improving the informatization of geological and mineral work.
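One simple way to sort crawled files into the four named categories is by file suffix. The patent names the categories but not the rule for assigning them, so the suffix map below is purely an assumption for illustration, as are the names `CATEGORY_BY_SUFFIX` and `classify` and the extra "other" bucket.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical suffix map; not specified by the patent.
CATEGORY_BY_SUFFIX = {
    ".txt": "text", ".html": "text", ".pdf": "text",
    ".png": "picture", ".jpg": "picture",
    ".csv": "parameter", ".json": "parameter",
    ".mp4": "video", ".avi": "video",
}

def classify(filenames):
    """Group file names by category; unknown suffixes go to 'other'."""
    groups = defaultdict(list)
    for name in filenames:
        suffix = PurePosixPath(name).suffix.lower()
        groups[CATEGORY_BY_SUFFIX.get(suffix, "other")].append(name)
    return dict(groups)
```

A suffix lookup is cheap but trusts the file name; inspecting the content type reported by the server would be a stricter alternative.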

As shown in Fig. 1, the invention discloses a method for building a mineral resource database based on crawling and integrating multi-source data, comprising the following steps:

establishing the mineral resource database;

obtaining diverse information about mineral resources through the central server;

crawling the information sources with a web crawler;

preprocessing the crawled data;

performing secondary processing on the preprocessed data;

storing the processed data in the mineral resource database.

The crawled data is preprocessed in order to select the relevant data according to the keywords entered and to screen out, at a preliminary level, data that does not meet the requirements, so that the data that does meet them can be extracted.

Whether data fails the requirements is determined by comparing it against the search: the share of the retrieved content made up of the keywords is computed. If this share is below the preset threshold, the content does not meet the requirements and is deleted; if the share is greater than or equal to the preset threshold, the content meets the requirements and passes to the next module.
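The percentage rule can be sketched as below. The text does not say how the share is measured, so this sketch assumes whitespace-separated words and exact, case-insensitive matches; text without word spacing, such as Chinese, would need a word segmenter first. The names `keyword_ratio` and `passes_prefilter` are hypothetical.

```python
def keyword_ratio(text, keywords):
    """Fraction of words in the text that are keywords (0.0 for empty text)."""
    words = text.lower().split()
    if not words:
        return 0.0
    wanted = {kw.lower() for kw in keywords}
    return sum(1 for w in words if w in wanted) / len(words)

def passes_prefilter(text, keywords, threshold):
    """True when the keyword share is at least the preset threshold."""
    return keyword_ratio(text, keywords) >= threshold
```

Using `>=` matches the stated rule that content whose share equals the threshold moves on to the next module.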

The preprocessed data then undergoes secondary processing. For erroneous data, statistical analysis is used to identify possible errors or outliers, for example deviation analysis or identifying values that do not follow a distribution or regression equation. Data values can also be checked against a simple rule base (common-sense rules, business-specific rules, and so on), or detected and cleaned using constraints between attributes and external data.
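Both kinds of check can be illustrated briefly. The patent does not name a specific statistic, so the first function substitutes one common choice, the modified z-score based on the median absolute deviation (MAD), with the conventional constant 0.6745 and cutoff 3.5. The second function is a toy rule base. The names `mad_outliers` and `rule_violations`, and the example rules in the usage note, are hypothetical.

```python
import statistics

def mad_outliers(values, cutoff=3.5):
    """Indices of values whose modified z-score exceeds the cutoff.

    Expects a non-empty list of numbers.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to scale by; this sketch flags nothing
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > cutoff]

def rule_violations(record, rules):
    """Fields whose rule (field -> predicate) the record fails."""
    return [field for field, ok in rules.items()
            if field in record and not ok(record[field])]
```

A rule base here is just a mapping such as `{"grade_pct": lambda g: 0 <= g <= 100}`; the median-based statistic is used because a single extreme value distorts it far less than it distorts a mean and standard deviation.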

For duplicate data, records are compared by checking whether their attribute values are equal, and equal records are merged into one (merge/purge). Merge/purge is the basic method of deduplication.
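A minimal merge/purge pass might look like the following. It assumes records are dictionaries and that equality is judged on a caller-chosen set of key fields; the merge policy (keep the first record and fill its missing fields from later duplicates) is also an assumption, since the text only says that equal records are merged. `merge_duplicates` is a hypothetical name.

```python
def merge_duplicates(records, key_fields):
    """Collapse records with equal key-field values into one merged record."""
    merged = {}
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        if key in merged:
            # Fill fields that the kept record lacks from the duplicate.
            for field, value in rec.items():
                if merged[key].get(field) is None:
                    merged[key][field] = value
        else:
            merged[key] = dict(rec)  # copy so the input is not modified
    return list(merged.values())
```

Because Python dictionaries preserve insertion order, the output keeps the order in which each distinct record was first seen.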

In a preferred technical solution of the invention, the structure of the mineral resource database is established as follows:

according to the types of mineral resource data, general fields and extended fields are distinguished by mineral type or by project, and the mineral resource database structure is established.
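One way to realize a split between general and extended fields is to keep the general fields as ordinary columns and store the type-specific fields as JSON text. The sketch below uses SQLite from the Python standard library. The table name `deposit`, its columns, and the helper functions are all hypothetical; the patent does not specify a schema or a database engine.

```python
import json
import sqlite3

def create_schema(conn):
    """General fields as columns; type-specific fields in a JSON text column."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS deposit (
               id INTEGER PRIMARY KEY,
               name TEXT NOT NULL,
               mineral_type TEXT NOT NULL,
               country TEXT,
               extended TEXT NOT NULL DEFAULT '{}'
           )"""
    )

def insert_deposit(conn, name, mineral_type, country=None, **extended):
    """Store one record; extra keyword arguments become extended fields."""
    cur = conn.execute(
        "INSERT INTO deposit (name, mineral_type, country, extended) "
        "VALUES (?, ?, ?, ?)",
        (name, mineral_type, country, json.dumps(extended)),
    )
    return cur.lastrowid

def load_deposit(conn, deposit_id):
    """Return one record as a flat dict of general plus extended fields."""
    row = conn.execute(
        "SELECT name, mineral_type, country, extended FROM deposit WHERE id = ?",
        (deposit_id,),
    ).fetchone()
    name, mineral_type, country, extended = row
    return {"name": name, "mineral_type": mineral_type,
            "country": country, **json.loads(extended)}
```

With this layout a new mineral type or project can introduce its own fields without a schema change, at the cost of weaker type checking on the extended part.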

In a preferred technical solution of the invention, crawling the information sources with a web crawler includes entering keywords: the keywords for the information to be crawled are entered first; there may be one or more keywords, separated by spaces, commas, or semicolons.

After the classification module has classified the data, that is, when the data is integrated into the database, a memory buffer for temporarily holding incoming data is first allocated in free memory. The batches of data that arrive at irregular intervals are placed in this buffer and then sent together to the storage module of the database.
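The buffering step can be sketched with an in-memory list that is flushed to the database in a single transaction. This is an illustration under assumptions: SQLite stands in for the storage module, the `record` table and its two columns are invented for the example, and flushing once a row count is reached is only one possible trigger. `IngestBuffer` is a hypothetical name.

```python
import sqlite3

class IngestBuffer:
    """Collects incoming batches in memory and writes them in one transaction."""

    def __init__(self, conn, flush_size=100):
        self.conn = conn
        self.flush_size = flush_size
        self.rows = []

    def add_batch(self, batch):
        """Queue a batch of (category, payload) rows; flush when full."""
        self.rows.extend(batch)
        if len(self.rows) >= self.flush_size:
            self.flush()

    def flush(self):
        """Send everything buffered so far to the database at once."""
        if not self.rows:
            return 0
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT INTO record (category, payload) VALUES (?, ?)",
                self.rows,
            )
        written = len(self.rows)
        self.rows = []
        return written
```

Writing many rows in one transaction is usually much cheaper than committing each batch separately, which is the practical benefit of the buffer.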

In a preferred technical solution of the invention, the mineral resource database further includes a query terminal, through which users at all levels obtain query, statistics, analysis, and early-warning results from the central server; the results are displayed as tables or charts.

Although specific embodiments of the invention have been described in detail above, the invention is not limited to them. Within the knowledge of a person of ordinary skill in the art, various changes may be made without departing from the spirit of the invention, and modifications or variations that involve no inventive effort remain within its scope of protection.

Claims (8)

1. A system for building a mineral resource database based on multi-source data crawling and integration, characterized by comprising:
a data server, used to establish a mineral resource database;
a central server, used to obtain mineral resource information from diverse mineral resource information sources;
a data crawling module, used to crawl the mineral resource information sources with a web crawler;
a preprocessing module, used for preliminary screening of the data obtained by the web crawler;
a processing module, used to screen the preprocessed data further, so that the data obtained by the web crawler is the data required by the user;
a classification module, used to classify the processed data for storage;
and a storage module, used to store the processed and classified mineral resource data in the mineral resource database system.
2. The system for building a mineral resource database based on multi-source data crawling and integration according to claim 1, wherein the preprocessing module processes the transmitted data in the following steps:
comparing the crawled web page information with the search keywords that were entered;
discarding web page information in which the number of keywords is below a preset threshold, and transmitting web page information in which the number of keywords exceeds the preset threshold to the processing module.
3. The system for building a mineral resource database based on multi-source data crawling and integration according to claim 1, wherein the processing module processes the preprocessed data in the following steps:
obtaining the web page information that matches the keywords;
rechecking the obtained web page information;
and sending the correct web page information to the classification module and deleting the web page information that does not match.
4. The system for building a mineral resource database based on multi-source data crawling and integration according to claim 1, wherein the classification module classifies the processed data, which can be divided into text, pictures, parameters, and video.
5. A method for building a mineral resource database based on multi-source data crawling and integration, characterized by comprising the following steps:
establishing a mineral resource database;
obtaining diverse information about mineral resources through a central server;
crawling the information sources with a web crawler;
preprocessing the crawled data;
performing secondary processing on the preprocessed data;
and storing the processed data in the mineral resource database.
6. The method for building a mineral resource database based on multi-source data crawling and integration according to claim 5, wherein the structure of the mineral resource database is established as follows: according to the types of mineral resource data, general fields and extended fields are distinguished by mineral type or by project, and the mineral resource database structure is established.
7. The method for building a mineral resource database based on multi-source data crawling and integration according to claim 5, wherein crawling the information sources with the web crawler includes entering keywords: the keywords for the information to be crawled are entered first; there may be one or more keywords, separated by spaces, commas, or semicolons.
8. The method for building a mineral resource database based on multi-source data crawling and integration according to claim 5, wherein the mineral resource database further includes a query terminal, through which users at all levels obtain query, statistics, analysis, and early-warning results from the central server, the results being displayed as tables or charts.
CN202211320811.0A 2022-10-26 2022-10-26 Database construction method of mineral resource data based on multivariate data crawling and integration Pending CN115563376A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211320811.0A CN115563376A (en) 2022-10-26 2022-10-26 Database construction method of mineral resource data based on multivariate data crawling and integration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211320811.0A CN115563376A (en) 2022-10-26 2022-10-26 Database construction method of mineral resource data based on multivariate data crawling and integration

Publications (1)

Publication Number Publication Date
CN115563376A (en) 2023-01-03

Family

ID=84769533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211320811.0A Pending CN115563376A (en) 2022-10-26 2022-10-26 Database construction method of mineral resource data based on multivariate data crawling and integration

Country Status (1)

Country Link
CN (1) CN115563376A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422581A (en) * 2023-11-01 2024-01-19 中国地质科学院矿产资源研究所 Mineral resource safety monitoring and early warning method, system, equipment and medium

Similar Documents

Publication Publication Date Title
CN106844640B (en) Webpage data analysis processing method
CN105468744B (en) Big data platform for realizing tax public opinion analysis and full text retrieval
CN103049542A (en) Domain-oriented network information search method
CN108229810A (en) Industry analysis system and method based on network information resource
CN107885793A (en) A kind of hot microblog topic analyzing and predicting method and system
CN101894351A (en) Tourism multimedia information personalized service system based on multi-intelligent Agent
CN104516954A (en) Visualized evidence obtaining and analyzing system
CN107086925B (en) Deep learning-based internet traffic big data analysis method
Tran et al. Radflow: A recurrent, aggregated, and decomposable model for networks of time series
CN112508743A (en) Technology transfer office general information interaction method, terminal and medium
CN115563376A (en) Database construction method of mineral resource data based on multivariate data crawling and integration
CN116307566B (en) Dynamic design system for large-scale building construction project construction organization scheme
CN112149422A (en) A dynamic monitoring method of enterprise news based on natural language
CN103365961A (en) Accurate search-oriented website structurization labeling method and system
CN112100395B (en) A feasibility analysis method for expert cooperation
CN114969477B (en) Mineral resource database building method and system based on multi-data crawling and integration
Alnoukari et al. Business Intelligence: Body of Knowledge
CN110880151A (en) Chain correlation analysis system is traceed back to quality safety of reassurance agricultural product
US20240304016A1 (en) Exploration and production document content and metadata scanner
CN116049243A (en) Enterprise intellectual property big data information analysis system, method and storage medium
CN114880588B (en) News heat prediction method based on knowledge graph
CN115080636A (en) Big data analysis system based on network service
CN114691835A (en) Audit plan data generation method, device and equipment based on text mining
Ersoy et al. Development of mining management information system for Soma Open Pit Mines
CN102968466B (en) Index network establishing method based on Web page classifying and Web-indexing thereof build device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination