CN115563376A - Database construction method of mineral resource data based on multivariate data crawling and integration - Google Patents
- Publication number
- CN115563376A (application number CN202211320811.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- mineral
- mineral resource
- information
- crawling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical field
The invention relates to the technical field of mining resource data management, and in particular to a method for building a mineral resource database based on multi-source data crawling and integration.
Background
Mineral resources are aggregates of minerals or useful elements that were formed by geological mineralization, occur naturally within the crust, buried underground or exposed at the surface, in solid, liquid or gaseous form, and have value for development and utilization.
Mineral resources are an important natural resource formed over millions or even hundreds of millions of years of geological change. They are an important material basis for social production and development, and production and daily life in modern society depend on them.
A mineral resources database is a computer-managed collection of data covering the geographic location, geological conditions, quantity, quality, economic value and other attributes of mineral resources in a country or worldwide. It can serve as the basis for analyzing mineral supply and demand and for formulating land-use and mineral development policy, and it can also support research on regional metallogenic patterns and metallogenic prediction. The widespread use of computers has improved the collection, management and analysis of mineral resource data and has driven rapid progress in this work.
As early as 1973 the U.S. Geological Survey had established a computer-managed resource database containing geology, reserve and resource data for about 4,000 domestic and 6,000 foreign deposits and occurrences. Most countries have since built their own mineral resource databases. For a worldwide mineral resources database, comparability of classification standards and estimation methods across countries is essential, and applying the United Nations classification framework for solid mineral reserves/resources is a feasible way to address this.
Diversification refers to the state in which something has developed to a rich level, with many categories and many sectors. Historically, the diversification of modern civilization is reflected in the eastward spread of Western learning. It covers differences in belief, culture, custom and ways of thinking, also called differences of transformation; these arise because people perceive changes around them differently and differ in comprehension, habits of thought and level of understanding.
Data crawling means using a program to obtain content from target websites, such as text, video and images.
A web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to defined rules. Less common names include ant, automatic indexer, emulator and worm.
A web crawler is a program that automatically retrieves web pages; it downloads pages from the World Wide Web for a search engine and is an important component of one. A traditional crawler starts from the URLs of one or more seed pages, obtains the URLs on those pages and, while crawling, keeps extracting new URLs from the current page and adding them to a queue until a stop condition of the system is met.
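The queue-driven loop of a traditional crawler can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `crawl`, `LinkExtractor`, the caller-supplied `fetch` function and the `max_pages` stop condition are all assumptions introduced here. Passing `fetch` in keeps the sketch independent of any particular HTTP client.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_urls.

    fetch(url) is a hypothetical caller-supplied function that returns
    the page HTML as a string, or None if the page cannot be retrieved.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:  # assumed stop condition
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

The `seen` set prevents a page from being queued twice, which is what keeps the loop from cycling between pages that link to each other.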
The workflow of a focused crawler is more complex. It uses a page-analysis algorithm to filter out links unrelated to the topic, keeps the useful links and places them in the queue of URLs waiting to be crawled. It then selects the next URL from the queue according to a search strategy and repeats the process until a stop condition of the system is reached. All crawled pages are stored by the system, analyzed, filtered and indexed for later query and retrieval; for a focused crawler, the results of this analysis may also provide feedback and guidance for subsequent crawling.
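A focused crawler differs from the plain queue mainly in two places: off-topic pages are dropped, and the frontier is ordered by a search strategy. The sketch below assumes a very simple relevance measure (keyword occurrence count) and a best-first strategy in which links inherit the score of the page they were found on. Both choices, and the names `relevance`, `focused_crawl` and `fetch`, are illustrative assumptions rather than anything the source specifies.

```python
import heapq


def relevance(text, keywords):
    """Number of topic keyword occurrences in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(lowered.count(k.lower()) for k in keywords)


def focused_crawl(seeds, fetch, keywords, max_pages=50):
    """Crawl best-first, following links only from on-topic pages.

    fetch(url) is a hypothetical callable returning (text, links) or None.
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    kept = {}
    while frontier and len(kept) < max_pages:
        _, url = heapq.heappop(frontier)
        result = fetch(url)
        if result is None:
            continue
        text, links = result
        score = relevance(text, keywords)
        if score == 0:
            continue  # off-topic page: drop it and do not follow its links
        kept[url] = score
        for link in links:
            if link not in seen:
                seen.add(link)
                # Feedback step: links inherit the score of their source page.
                heapq.heappush(frontier, (-score, link))
    return kept
```

A production crawler would typically score anchor text or use a trained classifier; the count-based score is only meant to make the control flow visible.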
With the rapid growth of online information, mineral resource data has become large in volume and highly specialized. Data storage differs between mineral types; the data is varied, intricate and poorly standardized. Quickly extracting and integrating the required multi-source mineral data, and fusing and analyzing it, is therefore of considerable value for raising the level of mineral data informatization.
Summary of the invention
In view of the problems in the prior art, the invention discloses a mineral resource database construction system based on multi-source data crawling and integration. The technical solution comprises:
a data server, used to establish the mineral resource database;
a central server, used to obtain mineral resource information from diverse mineral resource information repositories;
a data crawling module, used to crawl the mineral resource information repositories with a web crawler;
a preprocessing module, used for preliminary screening of the data obtained by the web crawler;
a processing module, used to further screen the preprocessed data so that the data obtained by the web crawler is the data the user requires;
a classification module, used to classify the processed data for storage;
a storage module, which stores the processed and classified mineral resource data in the mineral resource database construction system.
In a preferred technical solution of the invention, the preprocessing module processes the incoming data through the following steps:
compare the crawled web page information with the keywords entered for retrieval;
discard web page information in which the number of keywords is below a preset threshold, and pass web page information in which it exceeds the preset threshold to the processing module. Through the preprocessing and processing modules, the crawled data can be screened as required and irrelevant information removed, yielding the required data.
In a preferred technical solution of the invention, the processing module processes the preprocessed data through the following steps:
obtain the web page information that matches the keywords;
review the obtained web page information;
send web page information that passes the review to the classification module and delete web page information that does not.
In a preferred technical solution of the invention, the classification module classifies the processed data, which can be divided into text, images, parameters and video.
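One simple way to realize the four-way split is to classify items by file extension. The mapping below is an assumption made for illustration: the source does not say how categories are decided, and "parameters" is read here as structured or tabular data such as CSV or JSON. The names `CATEGORY_BY_EXTENSION`, `classify` and `group_by_category` are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical mapping; extend it to match the file types actually crawled.
CATEGORY_BY_EXTENSION = {
    ".txt": "text", ".html": "text", ".pdf": "text",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".csv": "parameter", ".json": "parameter",
    ".mp4": "video", ".avi": "video",
}


def classify(filename):
    """Return 'text', 'image', 'parameter', 'video' or 'other'."""
    suffix = PurePosixPath(filename).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "other")


def group_by_category(filenames):
    """Group file names by category so each group can be stored together."""
    groups = {}
    for name in filenames:
        groups.setdefault(classify(name), []).append(name)
    return groups
```

An explicit table is used instead of MIME sniffing so that the result does not depend on the host system's type registry.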
The mineral resource database construction method based on multi-source data crawling and integration adopts a technical solution comprising the following steps:
establish the mineral resource database;
obtain diverse mineral resource information through the central server;
crawl the information repositories with a web crawler;
preprocess the crawled data;
perform secondary processing on the preprocessed data;
store the processed data in the mineral resource database.
In a preferred technical solution of the invention, establishing the mineral resource database specifically comprises: according to the mineral resource data types, distinguishing common fields from extended fields by mineral type or project, and building the mineral resource database structure. With the database established, a web crawler crawls relevant content from mineral resource data sources worldwide and integrates it into the required mineral resource database, which covers a large amount of information to meet user needs.
In a preferred technical solution of the invention, crawling the information repositories with a web crawler includes keyword input: the keywords for the information to be crawled are entered first. One or more keywords may be given, separated by spaces, commas or semicolons.
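Splitting the keyword input on those three separators takes one regular expression. The sketch below also accepts the full-width comma and semicolon common in Chinese input and removes repeated keywords; both behaviors, and the function name `parse_keywords`, are assumptions that go beyond what the source states.

```python
import re


def parse_keywords(raw):
    """Split user input on whitespace, commas or semicolons.

    Full-width "," and ";" are accepted as well (an assumption), and
    duplicates are dropped while the original order is preserved.
    """
    parts = re.split(r"[\s,;,;]+", raw.strip())
    keywords = []
    for part in parts:
        if part and part not in keywords:
            keywords.append(part)
    return keywords
```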
In a preferred technical solution of the invention, the mineral resource database further includes a query terminal through which users at all levels obtain query, statistics, analysis and early-warning results from the central server; results are displayed as tables or charts. Through the cooperation of the classification and storage modules, data is stored by category, which improves data reception and storage, simplifies later retrieval from the storage module, and improves the efficiency of extracting mineral resource data from various files and loading it into the database. This is significant for raising the level of geological and mineral informatization.
Beneficial effects of the invention: by establishing a mineral resource database and using a web crawler to crawl relevant content from mineral resource data sources worldwide, the invention integrates that content into the required mineral resource database, which covers a large amount of information to meet user needs. Through the preprocessing and processing modules, the crawled data can be screened as required and irrelevant information removed, yielding the required data. Through the cooperation of the classification and storage modules, data is stored by category, which improves data reception and storage, simplifies later retrieval from the storage module, and improves the efficiency of extracting mineral resource data from various files and loading it into the database. This is significant for raising the level of geological and mineral informatization.
Description of the drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 is a schematic diagram of the system structure of the invention;
Fig. 3 is a schematic diagram of the preprocessing module of the invention;
Fig. 4 is a schematic diagram of the processing module of the invention;
Fig. 5 is a schematic diagram of the classification module of the invention.
Detailed description
Embodiment 1
As shown in Figs. 2 to 5, the invention discloses a mineral resource database construction system based on multi-source data crawling and integration. The technical solution comprises:
a data server, used to establish the mineral resource database;
a central server, used to obtain mineral resource information from diverse mineral resource information repositories;
a data crawling module, used to crawl the mineral resource information repositories with a web crawler;
a preprocessing module, used for preliminary screening of the data obtained by the web crawler;
a processing module, used to further screen the preprocessed data so that the data obtained by the web crawler is the data the user requires;
a classification module, used to classify the processed data for storage;
a storage module, which stores the processed and classified mineral resource data in the mineral resource database construction system.
Data server: the mineral resource data types stored on the data server include spatial data, temporal data, attribute data and other data. Spatial data includes shp data and MapGIS data; temporal data includes dates and years; attribute data includes tabular data; other data includes PDF files, text documents and video.
Central server: the central server connects to external, diverse mineral resource data sources and can exchange data with them. The main sources are the websites of national geological surveys (such as the U.S. Geological Survey, USGS; the British Geological Survey, BGS; and Japan's oil, gas and metal mineral resources agency, JOGMEC), national ministries of commerce, national ministries of foreign affairs, national mining authorities (such as Zambia's Ministry of Mines and Mineral Development), mineral industry associations (nonferrous metals associations, the International Iron and Steel Association and others), large mining companies (such as Minmetals, Chinalco and Glencore), and other websites (such as those of the risk consultancy Control Risk and the global geological data platform OneGology).
In a preferred technical solution of the invention, the preprocessing module processes the incoming data through the following steps:
compare the crawled web page information with the keywords entered for retrieval;
discard web page information in which the number of keywords is below a preset threshold, and pass web page information in which it exceeds the preset threshold to the processing module. Through the preprocessing and processing modules, the crawled data can be screened as required and irrelevant information removed, yielding the required data.
In a preferred technical solution of the invention, the processing module processes the preprocessed data through the following steps:
obtain the web page information that matches the keywords;
review the obtained web page information;
send web page information that passes the review to the classification module and delete web page information that does not.
In a preferred technical solution of the invention, the classification module classifies the processed data, which can be divided into text, images, parameters and video. Through the cooperation of the classification and storage modules, data is stored by category, which improves data reception and storage, simplifies later retrieval from the storage module, and improves the efficiency of extracting mineral resource data from various files and loading it into the database. This is significant for raising the level of geological and mineral informatization.
As shown in Fig. 1, the invention discloses a mineral resource database construction method based on multi-source data crawling and integration, comprising the following steps:
establish the mineral resource database;
obtain diverse mineral resource information through the central server;
crawl the information repositories with a web crawler;
preprocess the crawled data;
perform secondary processing on the preprocessed data;
store the processed data in the mineral resource database.
The crawled data is preprocessed in order to select the relevant data according to the entered keywords and to perform a preliminary screening of data that does not meet the requirements, so that data meeting the requirements can be extracted.
Whether data fails the requirements is determined by comparing the retrieved content with the keywords, using the percentage of the content accounted for by the keywords it contains. If that percentage is below the preset threshold, the retrieved item does not qualify and is deleted; if it is greater than or equal to the preset threshold, the content meets the requirements and proceeds to the next module.
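The percentage test can be sketched as follows. The source does not define how the percentage is measured, so this sketch assumes a character-based ratio (characters covered by keyword occurrences divided by total characters), which avoids word segmentation and therefore also works for Chinese text. The names `keyword_ratio` and `prefilter` and the default threshold are hypothetical.

```python
def keyword_ratio(content, keywords):
    """Fraction of the content's characters covered by keyword occurrences.

    Assumption: overlapping keywords are counted independently, so the
    ratio is an approximation and can exceed 1.0 in contrived cases.
    """
    if not content:
        return 0.0
    lowered = content.lower()
    covered = sum(lowered.count(k.lower()) * len(k) for k in keywords if k)
    return covered / len(content)


def prefilter(pages, keywords, threshold=0.01):
    """Keep pages whose keyword ratio is >= threshold and drop the rest."""
    return [page for page in pages if keyword_ratio(page, keywords) >= threshold]
```

The comparison uses `>=` to match the rule that content at or above the preset threshold proceeds to the next module.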
The preprocessed data then undergoes secondary processing. For erroneous data, statistical analysis is used to identify possible errors or outliers, for example deviation analysis or identifying values that do not follow a distribution or regression equation. Data values can also be checked against a simple rule base (common-sense rules, business-specific rules and so on), or constraints between attributes and external data can be used to detect and clean the data.
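Both checks can be illustrated briefly. The sketch below uses a z-score test as one possible form of deviation analysis and a small dictionary of predicates as the rule base. The field names `grade_percent` and `year`, the rule limits and the function names are hypothetical examples, not values taken from the source.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Indexes of values more than `threshold` sample standard deviations from the mean."""
    if len(values) < 2:
        return []
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]


# Hypothetical rule base: each rule returns True when the record is acceptable.
RULES = {
    "grade_percent in [0, 100]": lambda r: 0 <= r["grade_percent"] <= 100,
    "year is plausible": lambda r: 1800 <= r["year"] <= 2100,
}


def failed_rules(record):
    """Names of the rules that the record violates."""
    return [name for name, ok in RULES.items() if not ok(record)]
```

With small samples the z-score of a single extreme value is bounded, so a lower threshold or a robust statistic such as the median absolute deviation may be more suitable in practice.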
For duplicate data, records are compared by checking whether their attribute values are equal, and equal records are merged into one (merge/purge). Merge/purge is the basic method of deduplication.
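A minimal merge/purge pass might look like the sketch below. Records are treated as equal when a chosen set of key attributes match; the choice of key fields and the gap-filling behavior (a duplicate supplies values the kept record lacks) are assumptions added for illustration, and `merge_purge` is a hypothetical name.

```python
def merge_purge(records, key_fields):
    """Merge records whose key_fields are all equal into a single record."""
    merged = {}
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key not in merged:
            merged[key] = dict(record)  # first occurrence is kept
        else:
            # Assumption: fill empty fields of the kept record from the duplicate.
            for field, value in record.items():
                if merged[key].get(field) in (None, ""):
                    merged[key][field] = value
    return list(merged.values())
```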
In a preferred technical solution of the invention, establishing the mineral resource database specifically comprises:
according to the mineral resource data types, distinguishing common fields from extended fields by mineral type or project, and building the mineral resource database structure.
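One way to separate common fields from extended fields is a table with fixed columns for attributes shared by every record plus a single JSON column for attributes that vary by mineral type or project. The sketch below uses SQLite for convenience; the table name, column names and sample values are hypothetical, and the source does not prescribe a particular database or layout.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS mineral_record (
    id           INTEGER PRIMARY KEY,
    mineral_type TEXT NOT NULL,               -- common field
    project      TEXT,                        -- common field
    country      TEXT,                        -- common field
    source_url   TEXT,                        -- common field
    extended     TEXT NOT NULL DEFAULT '{}'   -- JSON object of type-specific fields
);
"""


def insert_record(conn, mineral_type, project, country, source_url, **extended):
    """Store common fields in columns and everything else as JSON."""
    conn.execute(
        "INSERT INTO mineral_record "
        "(mineral_type, project, country, source_url, extended) "
        "VALUES (?, ?, ?, ?, ?)",
        (mineral_type, project, country, source_url, json.dumps(extended)),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
insert_record(conn, "copper", "Project X", "ZM", "https://example.test/x",
              grade_percent=1.2)
insert_record(conn, "natural gas", "Field Y", "US", "https://example.test/y",
              reservoir_depth_m=2500)
```

Keeping extended fields in one column lets a new mineral type be added without a schema change, at the cost of weaker validation on those fields.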
In a preferred technical solution of the invention, crawling the information repositories with a web crawler includes keyword input: the keywords for the information to be crawled are entered first. One or more keywords may be given, separated by spaces, commas or semicolons.
After the classification module has finished classifying the data, that is, when the data is integrated into the database, a memory buffer for temporarily holding incoming data is first allocated in free memory. The multiple batches of data that arrive at irregular intervals are stored in this buffer and then sent together to the storage module of the database.
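The buffering step can be sketched as a small writer that collects incoming batches in a list and writes them to the database in a single transaction. The class name `BufferedWriter`, the `staged` table, the `flush_size` trigger and the use of SQLite are all assumptions; the source only states that batches are buffered in memory and then sent together.

```python
import sqlite3


class BufferedWriter:
    """Accumulates incoming batches in memory and writes them in one transaction."""

    def __init__(self, conn, flush_size=1000):
        self.conn = conn
        self.flush_size = flush_size  # assumed trigger for sending the buffer
        self.buffer = []

    def add_batch(self, rows):
        """Store a batch of (category, payload) rows; flush when the buffer is full."""
        self.buffer.extend(rows)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        """Send everything buffered so far to the database at once."""
        if not self.buffer:
            return
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT INTO staged (category, payload) VALUES (?, ?)",
                self.buffer,
            )
        self.buffer.clear()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staged (category TEXT, payload TEXT)")
writer = BufferedWriter(conn, flush_size=3)
writer.add_batch([("text", "page 1"), ("image", "map.png")])  # stays in the buffer
writer.add_batch([("video", "tour.mp4")])                     # reaches flush_size
writer.flush()                                                 # write any remainder
```

Writing many rows per transaction is usually much cheaper than committing each small batch as it arrives, which is the practical motivation for the buffer.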
In a preferred technical solution of the invention, the mineral resource database further includes a query terminal through which users at all levels obtain query, statistics, analysis and early-warning results from the central server; results are displayed as tables or charts.
Although specific embodiments of the invention have been described in detail above, the invention is not limited to them. Within the knowledge of a person of ordinary skill in the art, various changes may be made without departing from the spirit of the invention, and modifications or variations that involve no creative effort remain within the scope of protection of the invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211320811.0A CN115563376A (en) | 2022-10-26 | 2022-10-26 | Database construction method of mineral resource data based on multivariate data crawling and integration |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211320811.0A CN115563376A (en) | 2022-10-26 | 2022-10-26 | Database construction method of mineral resource data based on multivariate data crawling and integration |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115563376A (en) | 2023-01-03 |
Family
ID=84769533
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211320811.0A Pending CN115563376A (en) | 2022-10-26 | 2022-10-26 | Database construction method of mineral resource data based on multivariate data crawling and integration |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115563376A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117422581A (en) * | 2023-11-01 | 2024-01-19 | 中国地质科学院矿产资源研究所 | Mineral resource safety monitoring and early warning method, system, equipment and medium |
- 2022-10-26: CN application CN202211320811.0A filed; published as CN115563376A; status Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106844640B (en) | Webpage data analysis processing method | |
CN105468744B (en) | Big data platform for realizing tax public opinion analysis and full text retrieval | |
CN103049542A (en) | Domain-oriented network information search method | |
CN108229810A (en) | Industry analysis system and method based on network information resource | |
CN107885793A (en) | A kind of hot microblog topic analyzing and predicting method and system | |
CN101894351A (en) | Tourism multimedia information personalized service system based on multi-intelligent Agent | |
CN104516954A (en) | Visualized evidence obtaining and analyzing system | |
CN107086925B (en) | Deep learning-based internet traffic big data analysis method | |
Tran et al. | Radflow: A recurrent, aggregated, and decomposable model for networks of time series | |
CN112508743A (en) | Technology transfer office general information interaction method, terminal and medium | |
CN115563376A (en) | Database construction method of mineral resource data based on multivariate data crawling and integration | |
CN116307566B (en) | Dynamic design system for large-scale building construction project construction organization scheme | |
CN112149422A (en) | A dynamic monitoring method of enterprise news based on natural language | |
CN103365961A (en) | Accurate search-oriented website structurization labeling method and system | |
CN112100395B (en) | A feasibility analysis method for expert cooperation | |
CN114969477B (en) | Mineral resource database building method and system based on multi-data crawling and integration | |
Alnoukari et al. | Business Intelligence: Body of Knowledge | |
CN110880151A (en) | Chain correlation analysis system is traceed back to quality safety of reassurance agricultural product | |
US20240304016A1 (en) | Exploration and production document content and metadata scanner | |
CN116049243A (en) | Enterprise intellectual property big data information analysis system, method and storage medium | |
CN114880588B (en) | News heat prediction method based on knowledge graph | |
CN115080636A (en) | Big data analysis system based on network service | |
CN114691835A (en) | Audit plan data generation method, device and equipment based on text mining | |
Ersoy et al. | Development of mining management information system for Soma Open Pit Mines | |
CN102968466B (en) | Index network establishing method based on Web page classifying and Web-indexing thereof build device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||