CN115563376A - Mineral resource database building method based on multi-data crawling and integration - Google Patents

Publication number
CN115563376A
Authority
CN
China
Prior art keywords
data
mineral resource
mineral
information
crawling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211320811.0A
Other languages
Chinese (zh)
Inventor
鞠楠
张国宾
施璐
伍月
刘欣
张承程
吴桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Geological Survey Center China Geological Survey
Liaoning Technical University
Original Assignee
Shenyang Geological Survey Center China Geological Survey
Liaoning Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Geological Survey Center China Geological Survey, Liaoning Technical University
Priority to CN202211320811.0A
Publication of CN115563376A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a mineral resource database building method based on multi-data crawling and integration. It relates to the technical field of mineral resource data management and aims to solve the problems that mineral resource data is large in volume and that information extraction, fusion and analysis are inconvenient. Through the preprocessing module and the processing module, the data crawled by the web crawler is screened as required, and irrelevant information is filtered out and cleaned to obtain the required data. Through the cooperation of the classification module and the storage module, the data is classified and then stored, so that the data can be received and stored and subsequent retrieval of the data from the storage module is facilitated.

Description

Mineral resource database building method based on multi-data crawling and integration
Technical Field
The invention relates to the technical field of mineral resource data management, and in particular to a mineral resource database building method based on multi-data crawling and integration.
Background
Mineral resources are aggregates of minerals or useful elements that are formed by geological mineralization, occur naturally in the earth's crust, are buried underground or exposed at the surface in solid, liquid or gaseous form, and have value for exploitation and utilization.
Mineral resources are important natural resources. They are formed over millions or even hundreds of millions of years of geological change, are an important material basis for social production and development, and are indispensable to production and daily life in modern society.
A mineral resource database is a computer-managed collection of data describing national or worldwide mineral resources, such as geographic location, geological conditions, quantity, quality and economic value. It can serve as a basis for analyzing mineral resource supply and demand and for formulating land-use and mineral development policies, and can also be used for regional mining law research and mining prediction. The widespread use of electronic computers has improved the acquisition, management and analysis of mineral resource data, and this work is advancing rapidly.
By 1973, the U.S. Geological Survey had established a computer-managed resource database containing geology, reserve and resource data for about 4000 domestic and 6000 foreign deposits and sites. Most countries around the world have built their own mineral resource databases. For a world mineral resource database, it is very important that the classification standards and estimation methods for mineral resources be comparable among countries, and applying the United Nations classification framework for solid mineral reserves/resources is a feasible way to achieve this.
Diversification: things develop into a rich range of forms, with many classifications and fields. Viewed historically, the diversification of modern civilization is a process of exchange between Eastern and Western learning. It arises because people perceive the changes around them differently and differ in their comprehension, habits of thought and level of cognition.
Crawling data means using a program to acquire content from a target website, such as text, videos and pictures.
A web crawler (also known as a web spider or web robot, and within the FOAF community more often called a web chaser) is a program or script that automatically captures web information according to certain rules. Other, less commonly used names are ant, automatic indexer, emulator, or worm.
A web crawler is a program that automatically extracts web pages; it downloads pages from the World Wide Web for a search engine and is an important component of the search engine. A traditional crawler starts from the URLs of one or more initial web pages, obtains the URLs on those pages, and, while capturing pages, continuously extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met.
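The queue-driven loop of a traditional crawler can be illustrated with a minimal sketch. This is not the implementation used by the invention: the `crawl` function, the `fetch` callback, the `max_pages` stop condition and the example.org URLs are all assumptions introduced for illustration, and an in-memory dictionary stands in for the web so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    queue = deque(seed_urls)          # URLs waiting to be captured
    seen = set(seed_urls)             # URLs already queued, to avoid repeats
    pages = {}
    while queue and len(pages) < max_pages:   # assumed stop condition
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)     # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Hypothetical three-page site used instead of real HTTP requests.
SITE = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a>',
    "http://example.org/b": '<a href="/">home</a>',
}
pages = crawl(["http://example.org/"], SITE.get)
print(list(pages))
```

In a real deployment `fetch` would wrap an HTTP client and respect robots.txt and rate limits; those concerns are left out here.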
The workflow of a focused crawler is more complex. It filters out links irrelevant to the topic according to a web page analysis algorithm, and keeps the useful links by putting them into the queue of URLs to be captured. It then selects the next URL from the queue according to a search strategy and repeats the process until a stop condition of the system is reached. In addition, all pages captured by the crawler are stored by the system, analyzed and filtered, and indexed for later query and retrieval; for a focused crawler, the results of this analysis may also provide feedback and guidance for the subsequent capturing process.
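A focused crawler differs from the traditional one mainly in how links enter and leave the queue. The sketch below is one possible reading, under stated assumptions: the relevance score (fraction of topic terms present), the best-first search strategy, the `min_score` cut-off and the `fetch` callback returning page text plus `(url, anchor_text)` pairs are all hypothetical choices, not details given in the source.

```python
import heapq


def relevance(text, topic_terms):
    """Toy page-analysis score: fraction of topic terms that occur in the text."""
    if not topic_terms:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for term in topic_terms if term.lower() in lowered)
    return hits / len(topic_terms)


def focused_crawl(seeds, fetch, topic_terms, min_score=0.5, max_pages=50):
    """Best-first crawl. `fetch(url)` returns (page_text, [(link_url, anchor_text), ...])."""
    frontier = [(-1.0, url) for url in seeds]   # negated score turns heapq into a max-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    kept = []
    while frontier and len(kept) < max_pages:
        _, url = heapq.heappop(frontier)        # search strategy: most promising URL first
        result = fetch(url)
        if result is None:
            continue
        text, links = result
        page_score = relevance(text, topic_terms)
        if page_score >= min_score:
            kept.append(url)
        for link_url, anchor in links:
            if link_url in seen:
                continue
            seen.add(link_url)
            link_score = relevance(anchor, topic_terms)
            if link_score > 0:                  # links irrelevant to the topic are dropped
                # Feedback: links found on relevant pages are ranked higher.
                heapq.heappush(frontier, (-(link_score + page_score), link_url))
    return kept


# Hypothetical in-memory web: url -> (text, outgoing links with anchor text).
WEB = {
    "seed": ("mineral deposit portal",
             [("cu", "copper mineral deposit"), ("news", "sports news"), ("au", "gold deposit")]),
    "cu": ("copper mineral deposit report", []),
    "au": ("gold deposit summary", []),
    "news": ("football", []),
}
print(focused_crawl(["seed"], WEB.get, ["mineral", "deposit"]))
```

The off-topic "news" link is never queued, and the link with the stronger anchor text is visited before the weaker one.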
With the rapid development of network information, mineral resource data is large in volume and highly specialized; storage differs between mineral types, and the data is varied, complex and poorly generalizable. Quickly extracting and integrating the required multi-mineral data so that it can be fused and analyzed is therefore of great significance for raising the informatization level of mineral data.
Disclosure of Invention
In view of the problems in the prior art, the invention discloses a mineral resource database building system based on multivariate data crawling and integration, comprising:
a data server, used for establishing the mineral resource database;
a central server, used for acquiring mineral resource information from diversified mineral resource information bases;
a data crawling module, used for crawling data from the mineral resource information bases with a web crawler;
a preprocessing module, used for preliminarily screening the data acquired by the web crawler;
a processing module, used for further screening the preprocessed data to ensure that the data acquired by the web crawler is the data required by the user;
a classification module, used for classifying the processed data so that it can be stored;
and a storage module, used for storing the processed and classified mineral resource data into the mineral resource database.
As a preferred technical solution of the present invention, the preprocessing module processes the transmitted data, including the following steps:
comparing the crawled webpage information with the input search keywords;
discarding webpage information in which the number of keywords is less than the preset threshold, and transmitting webpage information in which the number of keywords is greater than the preset threshold to the processing module. Through the preprocessing module and the processing module, the data crawled by the web crawler is screened as required, and irrelevant information is filtered out and cleaned to obtain the required data.
As a preferred technical solution of the present invention, the processing module processes the preprocessed data information, including the following steps:
acquiring webpage information conforming to the keywords;
rechecking the acquired webpage information;
and sending the correct webpage information to the classification module, and deleting the inconsistent webpage information.
As a preferred technical solution of the present invention, the classification module classifies the processed data and can distinguish it by text, pictures, parameters, and videos.
The invention also discloses a mineral resource database building method based on multivariate data crawling and integration, comprising the following steps:
establishing a mineral resource database;
acquiring diversified information about mineral resources through a central server;
using a web crawler to perform data crawling on information of the information base;
preprocessing data obtained by data crawling;
carrying out secondary processing on the preprocessed data;
and storing the processed data into a mineral resource database.
As a preferred technical solution of the present invention, establishing the mineral resource database may specifically be: according to the type of mineral resource data, general fields and extension fields are distinguished by mineral type or project, the structure of the mineral resource database is established, and the database is built. A web crawler then crawls the relevant content in global mineral resource data and integrates it into the required mineral resource database, so that the database covers a large amount of information and meets the requirements.
As a preferred technical solution of the present invention, crawling the information base with the web crawler includes inputting keywords. Specifically, the keywords of the information to be crawled are input first; there may be one or more keywords, separated by spaces, commas or semicolons.
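The keyword input rule can be sketched as a small parser. The function name `parse_keywords` is hypothetical, and accepting the full-width comma and semicolon used in Chinese text, as well as dropping repeated keywords, are assumptions beyond what the text states.

```python
import re


def parse_keywords(raw):
    """Split user input into keywords on spaces, commas or semicolons."""
    # Assumption: full-width "，" and "；" are accepted alongside the ASCII forms.
    parts = re.split(r"[\s,;，；]+", raw.strip())
    keywords = []
    for part in parts:
        if part and part not in keywords:   # skip empties, keep first occurrence only
            keywords.append(part)
    return keywords


print(parse_keywords("copper, gold;iron  nickel"))
```

Because a space is itself a separator, a multi-word phrase such as "rare earth" would be split into two keywords under this rule.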
As a preferred technical solution of the present invention, the mineral resource database further comprises a query end, through which users at all levels obtain query, statistics, analysis and early-warning results from the central server; the results are displayed as tables or graphs. Through the cooperation of the classification module and the storage module, the data is classified and then stored, which improves the receiving and storage of the data, facilitates subsequent retrieval of the data from the storage module, and improves the efficiency of extracting mineral resource data from various files and storing it in the database. This is of great significance for raising the informatization level of geology and mineral resources.
The invention has the following beneficial effects: a mineral resource database is built, a web crawler crawls the relevant content in global mineral resource data and integrates it into the required mineral resource database, and the resulting database covers a large amount of information so as to meet the requirements; through the preprocessing module and the processing module, the data crawled by the web crawler is screened as required, and irrelevant information is filtered out and cleaned to obtain the required data; through the cooperation of the classification module and the storage module, the data is classified and then stored, so that the data can be received and stored, subsequent retrieval of the data from the storage module is facilitated, and the efficiency of extracting mineral resource data from various files and storing it in the database is improved, which is of great significance for raising the informatization level of geology and mineral resources.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the system of the present invention;
FIG. 3 is a schematic diagram of a pre-processing module according to the present invention;
FIG. 4 is a schematic view of a processing module according to the present invention;
FIG. 5 is a diagram of a classification module according to the present invention.
Detailed Description
Example 1
As shown in FIGS. 2 to 5, the invention discloses a mineral resource database building system based on multivariate data crawling and integration, comprising:
a data server, used for establishing the mineral resource database;
a central server, used for acquiring mineral resource information from diversified mineral resource information bases;
a data crawling module, used for crawling data from the mineral resource information bases with a web crawler;
a preprocessing module, used for preliminarily screening the data acquired by the web crawler;
a processing module, used for further screening the preprocessed data to ensure that the data acquired by the web crawler is the data required by the user;
a classification module, used for classifying the processed data so that it can be stored;
and a storage module, used for storing the processed and classified mineral resource data into the mineral resource database.
The types of mineral resource data stored in the data server include spatial data, temporal data, attribute data, and other data: spatial data such as shp data and MapGIS data; temporal data such as dates and years; attribute data such as tabular data; and other data including PDF files, text documents, videos and the like.
The central server is connected to external sources of diversified mineral resource data and can exchange data with them. These sources mainly include the websites of national geological surveys (such as the United States Geological Survey, USGS; the British Geological Survey, BGS; and the Japan Oil, Gas and Metals National Corporation, JOGMEC), national commerce department websites, national foreign affairs department websites, national mineral department websites (such as the Pradaya mining and mineral development department), mineral resource industry association websites (such as the Nonferrous Metal Association and the International Iron and Steel Association), websites of large mining companies (such as China Minmetals, Chinalco and Jia), and other websites (such as that of the risk consulting company Control Risks and the global geological data platform OneGeology).
As a preferred technical solution of the present invention, the preprocessing module processes the transmitted data, including the following steps:
comparing the crawled webpage information with the input search keywords;
discarding webpage information in which the number of keywords is less than the preset threshold, and transmitting webpage information in which the number of keywords is greater than the preset threshold to the processing module. Through the preprocessing module and the processing module, the data crawled by the web crawler is screened as required, and irrelevant information is filtered out and cleaned to obtain the required data.
As a preferred technical solution of the present invention, the processing module processes the preprocessed data information, including the following steps:
acquiring webpage information conforming to the keywords;
rechecking the acquired webpage information;
and sending the correct webpage information to the classification module, and deleting the inconsistent webpage information.
As a preferred technical solution of the present invention, the classification module classifies the processed data and can distinguish it by text, pictures, parameters and videos. Through the cooperation of the classification module and the storage module, the data is classified and then stored, which improves the receiving and storage of the data, facilitates subsequent retrieval of the data from the storage module, and improves the efficiency of extracting mineral resource data from various files and storing it in the database. This is of great significance for raising the informatization level of geology and mineral resources.
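The four categories named above could be assigned in many ways; the source does not say how. As a minimal sketch, assuming that crawled items arrive as named files and that the file suffix is a good enough signal, classification might look like the following. The suffix table, the reading of "parameters" as tabular or structured numeric data, and the fallback category "other" are all assumptions.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical mapping from file suffix to the four categories in the text.
CATEGORY_BY_SUFFIX = {
    ".txt": "text", ".pdf": "text", ".doc": "text", ".docx": "text", ".html": "text",
    ".jpg": "picture", ".jpeg": "picture", ".png": "picture", ".tif": "picture",
    ".csv": "parameter", ".xls": "parameter", ".xlsx": "parameter", ".json": "parameter",
    ".mp4": "video", ".avi": "video", ".mov": "video",
}


def classify(item_name):
    """Return the category of a crawled item, based on its file suffix."""
    suffix = PurePosixPath(item_name).suffix.lower()
    return CATEGORY_BY_SUFFIX.get(suffix, "other")


def group_by_category(item_names):
    """Group item names by category so each group can be stored together."""
    groups = defaultdict(list)
    for name in item_names:
        groups[classify(name)].append(name)
    return dict(groups)


print(group_by_category(["report.PDF", "map.png", "assay.csv", "tour.mp4", "layer.shp"]))
```

Spatial files such as `.shp` fall into "other" here only because the text lists four categories; a production system would likely need a dedicated spatial category.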
As shown in FIG. 1, the invention discloses a mineral resource database building method based on multivariate data crawling and integration, which comprises the following steps:
establishing a mineral resource database;
acquiring diversified information about mineral resources through a central server;
using a web crawler to perform data crawling on information of the information base;
preprocessing data obtained by data crawling;
carrying out secondary processing on the preprocessed data;
and storing the processed data into a mineral resource database.
The data obtained by crawling is preprocessed in order to screen the corresponding data information according to the input keywords and to preliminarily screen out data that does not meet the requirements, so as to extract the data that does.
Data that does not meet the requirements is identified mainly by comparison with the retrieved content: the percentage of the content made up of the keywords is determined. If the percentage is less than the preset threshold, the content is considered non-conforming and the retrieved content is deleted; if the percentage is greater than or equal to the preset threshold, the content is considered to meet the requirements and is passed to the next module.
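The percentage test can be sketched as follows. The source does not define how the percentage is computed; this sketch assumes it is the share of word tokens that exactly match a keyword, which suits space-delimited languages only (Chinese text would need a word segmenter). The function names and the `{"url", "text"}` page layout are hypothetical.

```python
import re


def keyword_percentage(text, keywords):
    """Percentage of word tokens in `text` that match one of the keywords."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    wanted = {k.lower() for k in keywords}
    hits = sum(1 for token in tokens if token in wanted)
    return 100.0 * hits / len(tokens)


def prescreen(pages, keywords, threshold_pct):
    """Keep pages at or above the threshold; the rest are discarded."""
    return [p for p in pages if keyword_percentage(p["text"], keywords) >= threshold_pct]


pages = [
    {"url": "p1", "text": "Copper ore grades at the copper mine"},
    {"url": "p2", "text": "Weather forecast for the weekend"},
]
kept = prescreen(pages, ["copper"], threshold_pct=10.0)
print([p["url"] for p in kept])
```

In the first page two of seven tokens match, about 28.6 percent, so it passes a 10 percent threshold while the second page is dropped.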
The preprocessed data then undergoes secondary processing. For erroneous data, statistical analysis methods are used to identify possible errors or outliers, for example deviation analysis and the identification of values that do not follow the expected distribution or regression equation. A simple rule base (common-sense rules, business-specific rules and the like) can also be used to check data values, and constraints between different attributes or external data can be used to detect and clean the data.
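Both checks can be illustrated with a short sketch. It assumes that deviation analysis means flagging values more than `k` population standard deviations from the mean, and it invents three example rules; the field names `grade_pct`, `reserves_t` and `proved_t` and the cut-off `k=2.0` are hypothetical and would come from the actual business rules in practice.

```python
import statistics


def deviation_outliers(values, k=2.0):
    """Indices of values lying more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > k]


# Hypothetical rule base: (description, predicate that must hold for a valid record).
RULES = [
    ("grade between 0 and 100 percent", lambda r: 0 <= r["grade_pct"] <= 100),
    ("reserves are non-negative", lambda r: r["reserves_t"] >= 0),
    # Constraint between two attributes of the same record.
    ("proved reserves do not exceed total", lambda r: r["proved_t"] <= r["reserves_t"]),
]


def rule_violations(record):
    """Descriptions of every rule the record breaks."""
    return [name for name, holds in RULES if not holds(record)]


grades = [1.2, 1.1, 1.3, 1.0, 1.2, 9.5]
print(deviation_outliers(grades))
print(rule_violations({"grade_pct": 120, "reserves_t": 500, "proved_t": 600}))
```

A mean-based test is sensitive to the outliers it is trying to find, so a median-based variant may be preferable for small samples.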
For duplicate data, records are judged to be equal when their attribute values are equal, and equal records are merged into a single record (merge/purge). Merge/purge is the basic method of deduplication.
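A minimal merge/purge sketch is shown below. Which attributes are compared is left to the caller through `key_fields`; normalizing strings by trimming and lower-casing before comparison, and filling missing attributes of the surviving record from its duplicates, are assumptions added for illustration.

```python
def _normalize(value):
    """Assumed normalization: compare strings ignoring case and outer whitespace."""
    return value.strip().lower() if isinstance(value, str) else value


def merge_purge(records, key_fields):
    """Merge records whose key attributes are equal into a single record."""
    merged = {}
    for record in records:
        key = tuple(_normalize(record.get(field)) for field in key_fields)
        if key not in merged:
            merged[key] = dict(record)          # first occurrence is kept
            continue
        survivor = merged[key]
        for name, value in record.items():      # fill gaps from the duplicate
            if survivor.get(name) is None and value is not None:
                survivor[name] = value
    return list(merged.values())


records = [
    {"name": "Deposit A", "mineral": "copper", "source": None},
    {"name": " deposit a ", "mineral": "Copper", "source": "site-b"},
    {"name": "Deposit B", "mineral": "iron", "source": "site-a"},
]
deduplicated = merge_purge(records, ["name", "mineral"])
print(deduplicated)
```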
As a preferred technical solution of the present invention, the establishing of the mineral resource database may specifically be:
according to the type of mineral resource data, general fields and extension fields are distinguished by mineral type or project, and the structure of the mineral resource database is established.
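One way to realize the split between general and extension fields is a table whose fixed columns hold the general fields and whose single JSON text column holds whatever is specific to a mineral type or project. The source does not prescribe a storage engine or column names; SQLite, the `deposit` table and every field name below are assumptions chosen so the sketch is runnable.

```python
import json
import sqlite3

# Hypothetical general fields shared by every mineral type.
GENERAL_FIELDS = ("name", "mineral_type", "country", "latitude", "longitude")


def create_schema(conn):
    """Create a table with fixed general columns and one JSON extension column."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS deposit (
               id INTEGER PRIMARY KEY,
               name TEXT NOT NULL,
               mineral_type TEXT NOT NULL,
               country TEXT,
               latitude REAL,
               longitude REAL,
               extension TEXT NOT NULL DEFAULT '{}'
           )"""
    )


def insert_record(conn, record):
    """Store general fields in columns and everything else as JSON."""
    general = [record.get(field) for field in GENERAL_FIELDS]
    extension = {k: v for k, v in record.items() if k not in GENERAL_FIELDS}
    conn.execute(
        "INSERT INTO deposit (name, mineral_type, country, latitude, longitude, extension) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (*general, json.dumps(extension, sort_keys=True)),
    )


conn = sqlite3.connect(":memory:")
create_schema(conn)
insert_record(conn, {"name": "Deposit A", "mineral_type": "copper",
                     "country": "CL", "cu_grade_pct": 0.8})
row = conn.execute("SELECT name, mineral_type, extension FROM deposit").fetchone()
print(row[0], row[1], json.loads(row[2]))
```

The alternative of one table per mineral type gives stronger typing for extension fields at the cost of a schema change whenever a new mineral type is added.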
As a preferred technical solution of the present invention, crawling the information base with the web crawler includes inputting keywords. Specifically, the keywords of the information to be crawled are input first; there may be one or more keywords, separated by spaces, commas or semicolons.
After the data has been classified by the classification module, that is, when the data is integrated and written to the database, a memory buffer for temporarily holding the incoming data is first allocated in free memory. The batches of incoming data, which arrive at arbitrary times, are collected in this buffer and then sent together to the storage module of the database.
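The buffering step can be sketched as a small class that collects incoming batches and hands them to the storage module in one call. The class name `BufferedWriter`, the `capacity` trigger and the `sink` callback are hypothetical; the source only states that batches are held in memory and then sent together.

```python
class BufferedWriter:
    """Collects incoming batches in memory and writes them out together."""

    def __init__(self, sink, capacity=1000):
        self._sink = sink            # callable that stores a list of records
        self._capacity = capacity    # assumed trigger: flush once this many are held
        self._buffer = []

    def add_batch(self, records):
        """Accept one batch, arriving at any time, into the buffer."""
        self._buffer.extend(records)
        if len(self._buffer) >= self._capacity:
            self.flush()

    def flush(self):
        """Send everything buffered so far to the storage module in one call."""
        if self._buffer:
            self._sink(list(self._buffer))
            self._buffer.clear()


writes = []                          # stand-in for the database storage module
writer = BufferedWriter(writes.append, capacity=3)
writer.add_batch(["rec1", "rec2"])   # held in memory, nothing written yet
writer.add_batch(["rec3"])           # capacity reached, one combined write
writer.add_batch(["rec4"])
writer.flush()                       # final flush for the remainder
print(writes)
```

Grouping writes this way trades a small risk of losing buffered records on a crash for fewer, larger database transactions.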
As a preferred technical solution of the present invention, the mineral resource database further comprises a query end, through which users at all levels obtain query, statistics, analysis and early-warning results from the central server; the results are displayed as tables or graphs.
Although the present invention has been described in detail with reference to specific embodiments, it is not limited to those embodiments, and those skilled in the art can make various changes within the scope of their knowledge without departing from the gist of the present invention.

Claims (8)

1. A mineral resource database building system based on multivariate data crawling and integration, characterized in that it comprises:
a data server, used for establishing the mineral resource database;
a central server, used for acquiring mineral resource information from diversified mineral resource information bases;
a data crawling module, used for crawling data from the mineral resource information bases with a web crawler;
a preprocessing module, used for preliminarily screening the data acquired by the web crawler;
a processing module, used for further screening the preprocessed data to ensure that the data acquired by the web crawler is the data required by the user;
a classification module, used for classifying the processed data so that it can be stored;
and a storage module, used for storing the processed and classified mineral resource data into the mineral resource database.
2. The mineral resource database building system based on multivariate data crawling and integration according to claim 1, wherein: the preprocessing module processes the transmitted data, including the following steps:
comparing the crawled webpage information with the input search keywords;
discarding webpage information in which the number of keywords is less than a preset threshold, and transmitting webpage information in which the number of keywords is greater than the preset threshold to the processing module.
3. The mineral resource database building system based on multivariate data crawling and integration according to claim 1, wherein: the processing module processes the preprocessed data information, including the following steps:
acquiring webpage information conforming to the keywords;
rechecking the acquired webpage information;
and sending the correct webpage information to the classification module, and deleting the inconsistent webpage information.
4. The mineral resource database building system based on multivariate data crawling and integration according to claim 1, wherein: the classification module classifies the processed data and can distinguish it by text, pictures, parameters and videos.
5. A mineral resource database building method based on multivariate data crawling and integration, characterized by comprising the following steps:
establishing a mineral resource database;
acquiring diversified information about mineral resources through a central server;
using a web crawler to perform data crawling on information of the information base;
preprocessing data obtained by data crawling;
carrying out secondary processing on the preprocessed data;
and storing the processed data into a mineral resource database.
6. The mineral resource database building method based on multivariate data crawling and integration according to claim 5, wherein establishing the mineral resource database is specifically: according to the type of mineral resource data, general fields and extension fields are distinguished by mineral type or project, and the structure of the mineral resource database is established.
7. The mineral resource database building method based on multivariate data crawling and integration according to claim 5, characterized in that: crawling the information base with the web crawler includes inputting keywords; specifically, the keywords of the information to be crawled are input first, there may be one or more keywords, and the keywords are separated by spaces, commas or semicolons.
8. The mineral resource database building method based on multivariate data crawling and integration according to claim 5, characterized in that: the mineral resource database further comprises a query end, through which users at all levels obtain query, statistics, analysis and early-warning results from the central server, and the results are displayed as tables or graphs.
CN202211320811.0A 2022-10-26 2022-10-26 Mineral resource database building method based on multi-data crawling and integration Pending CN115563376A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211320811.0A CN115563376A (en) 2022-10-26 2022-10-26 Mineral resource database building method based on multi-data crawling and integration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211320811.0A CN115563376A (en) 2022-10-26 2022-10-26 Mineral resource database building method based on multi-data crawling and integration

Publications (1)

Publication Number Publication Date
CN115563376A true CN115563376A (en) 2023-01-03

Family

ID=84769533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211320811.0A Pending CN115563376A (en) 2022-10-26 2022-10-26 Mineral resource database building method based on multi-data crawling and integration

Country Status (1)

Country Link
CN (1) CN115563376A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422581A (en) * 2023-11-01 2024-01-19 中国地质科学院矿产资源研究所 Mineral resource safety monitoring and early warning method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination