CN110716898A

CN110716898A - Method and system for gathering field geological survey data in real time

Info

Publication number: CN110716898A
Application number: CN201910976866.9A
Authority: CN
Inventors: 李丰丹; 吕霞; 吴亮; 李超岭; 刘畅; 刘园园; 龚爱华
Original assignee: DEVELOPMENT AND Research CENTER GEOLOGIC SURVEY BUREAU OF CHINA
Current assignee: DEVELOPMENT AND Research CENTER GEOLOGIC SURVEY BUREAU OF CHINA
Priority date: 2019-10-15
Filing date: 2019-10-15
Publication date: 2020-01-21
Also published as: CN116126794A

Abstract

The invention discloses a method and a system for gathering field geological survey data in real time. The method comprises the following steps: acquiring project files to be aggregated, wherein the project files comprise a plurality of sub-project files; laying a Hadoop cluster; respectively and correspondingly uploading the field geological survey data in each sub-project file to catalogs in different data organization forms to form project result data, wherein the catalogs comprise a plurality of sub-project data catalogs; recording resource description information of the field geological survey data in each sub-project file, and storing the resource description information in a database; copying the project result data to a secondary project; updating the resource description information, wherein the resource description information comprises a resource type, a resource name and a resource size; and extracting the field geological survey data content in each sub-project file and storing the extracted field geological survey data content in the Hadoop cluster. The invention can quickly assemble the data and is convenient for the superior project to efficiently manage the assembled data.

Description

Method and system for gathering field geological survey data in real time

Technical Field

The invention relates to the field of geological survey, in particular to a method and a system for gathering field geological survey data in real time.

Background

Hadoop is a distributed system infrastructure developed by the Apache Foundation. A user can develop a distributed program without knowing the distributed underlying details. The power of the cluster is fully utilized to carry out high-speed operation and storage.

Geological projects are based on geographic data collected in the field. Traditionally, project data is gathered by manually delivering the project's paper materials to a local dispatching office, which then organizes the received project data itself and stores it on a computer. This approach works when the data volume is small, but it is clearly not feasible when the data volume is large and the data types are numerous. In addition, geological projects at the present stage do not make efficient use of the project organization architecture, so upper-level projects cannot manage such data systematically.

Disclosure of Invention

The invention aims to provide a real-time gathering method for field geological survey data, which can quickly gather the data and is convenient for a superior project to efficiently manage the gathered data.

In order to achieve the purpose, the invention provides the following scheme:

a real-time gathering method for field geological survey data is characterized by comprising the following steps:

acquiring project files to be aggregated, wherein the project files comprise a plurality of sub-project files;

laying a Hadoop cluster, wherein the Hadoop cluster is used for storing field geological survey data in each sub-project file on line;

respectively and correspondingly uploading the field geological survey data in each sub-project file to catalogs in different data organization forms to form project result data, wherein the catalogs comprise a plurality of sub-project data catalogs;

recording resource description information of the field geological survey data in each sub-project file, and storing the resource description information in a database;

copying the project result data to a secondary project;

updating the resource description information, wherein the resource description information comprises a resource type, a resource name and a resource size;

and extracting the field geological survey data content in each sub-project file and storing the extracted field geological survey data content in the Hadoop cluster.

Optionally, the method further includes:

acquiring the size of the field geological survey data content in each sub-project file;

and transmitting the files according to the size of the field geological survey data content in each sub-project file.

Optionally, the file transmission according to the size of the field geological survey data content in each sub-project file specifically includes:

acquiring a set sub-project file size value;

if the size of the field geological survey data content in the sub-project file is larger than the set sub-project file size value, carrying out fragment transmission on the sub-project file, and combining the transmitted fragment files into a complete project file;

and if the size of the field geological survey data content in the sub-project file is less than or equal to the set sub-project file size value, directly transmitting the sub-project file.

Optionally, the sub-project data catalog includes: regional geological maps, field route data, geological document data, satellite remote sensing images and other user-defined folders.

Optionally, uploading the field geological survey data in each sub-project file to the corresponding catalogs in different data organization forms to form project result data, wherein the catalogs comprise a plurality of sub-project data catalogs, specifically includes:

organizing and storing the field geological survey data according to different data types to form a sub-project data catalog;

and uploading the geological survey data to the corresponding sub-project data catalog according to the corresponding data types to obtain project result data.

Optionally, the recording resource description information of the field geological survey data in each sub-project file, and storing the resource description information in a database specifically includes:

recording resource description information of the field geological survey data in each sub-project file using a uniform resource identifier (URI), and storing the resource description information in the Hive database.

A system for real-time aggregation of field geological survey data, comprising:

the project file acquisition module is used for acquiring project files to be aggregated, and the project files comprise a plurality of sub-project files;

the Hadoop cluster laying module is used for laying a Hadoop cluster, and the Hadoop cluster is used for storing field geological survey data in each sub-project file on line;

the project result data forming module is used for correspondingly uploading the field geological survey data in each sub-project file to catalogs in different data organization forms respectively to form project result data, and each catalog comprises a plurality of sub-project data catalogs;

the resource description information recording module is used for recording the resource description information of the field geological survey data in each sub-project file and storing the resource description information into a database;

the data aggregation module is used for copying the project result data to a secondary project;

the updating module is used for updating the resource description information, and the resource description information comprises a resource type, a resource name and a resource size;

and the extraction module is used for extracting the field geological survey data content in each sub-project file and storing the extracted field geological survey data content in the Hadoop cluster.

Optionally, the system further includes:

the data content size acquisition module is used for acquiring the size of the field geological survey data content in each sub-project file;

and the transmission module is used for transmitting the files according to the size of the field geological survey data content in each sub-project file.

Optionally, the transmission module specifically includes:

the acquisition unit is used for acquiring the size value of the set sub-project file;

the fragment transmission unit is used for transmitting the sub-project files in fragments and combining the transmitted fragment files into a complete project file when the size of the field geological survey data content in the sub-project files is larger than the set size value of the sub-project files;

and the direct transmission unit is used for directly transmitting the sub-project file when the size of the field geological survey data content in the sub-project file is smaller than or equal to the set sub-project file size value.

Optionally, the module for forming the project result data specifically includes:

the sub-project data catalog determining unit is used for organizing and storing the field geological survey data according to different data types to form a sub-project data catalog;

and the project result data forming unit is used for uploading the geological survey data to the corresponding sub-project data catalogue according to the corresponding data type to obtain the project result data.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

the method for gathering the field geological survey data in real time provided by the invention adopts Hadoop, and is organized according to project levels, the field geological survey data in the project are collected and arranged by a lower level organization to form project data, and the project data are gathered by an upper level organization, so that the project file presents a data flow process from bottom to top according to a project organization framework.

The real-time gathering method of the field geological survey data can efficiently and orderly finish the real-time gathering of the result data from the sub-projects to the upper project while ensuring the data integrity in the transmission process of the field geological survey data, and after the gathering is finished, the system provides the authority for managing the lower project files for the upper project, and the whole project organization structure shows the orderly dynamic management from top to bottom. The method for gathering the field geological survey data in real time lays a good foundation for a service publishing function, and facilitates the query and retrieval, real-time sharing and big data analysis and mining of the whole project file by a user.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.

FIG. 1 is a flow chart of a method for real-time gathering of field geological survey data according to the present invention;

FIG. 2 is a diagram of a real-time gathering system for field geological survey data according to the present invention;

FIG. 3 is an organization chart of the project according to embodiment 1 of the present invention;

FIG. 4 is a diagram of the project file storage directory according to embodiment 1 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Hadoop is a software framework for the distributed processing of large amounts of data. It processes data in a reliable, efficient and scalable manner. Hadoop is reliable because it assumes that compute elements and storage will fail, so it maintains multiple copies of the working data and can redistribute processing away from failed nodes. Hadoop is efficient because it works in parallel, which speeds up processing. Hadoop is also scalable and can process PB-level data. Furthermore, Hadoop relies on community services, so it is low cost and can be used by anyone.

Hadoop is a distributed computing platform that users can easily set up and use. Users can easily develop and run applications that process massive data on Hadoop. Its main advantages are as follows. High reliability: Hadoop's ability to store and process data bit by bit can be trusted. High scalability: Hadoop distributes data and computing tasks across clusters of available computers, and these clusters can easily be expanded to thousands of nodes. High efficiency: Hadoop can move data dynamically between nodes and keeps the nodes dynamically balanced, so processing is fast. High fault tolerance: Hadoop automatically saves multiple copies of data and automatically reassigns failed tasks. Low cost: compared with all-in-one appliances, commercial data warehouses, and data marts such as QlikView and Yonghong Z-Suite, Hadoop is open source, which greatly reduces the software cost of a project.

Thanks to its natural advantages in data extraction, transformation and loading (ETL), Hadoop is widely used in big data processing applications. Its distributed architecture places the big data processing engine as close to the storage as possible, which suits batch operations such as ETL, because the batch results of such operations can go directly to storage. Hadoop's MapReduce function breaks a single task into fragments, sends the fragments (Map) to multiple nodes, and then loads (Reduce) the results into the data warehouse as a single data set.
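The Map and Reduce phases described above can be illustrated with a minimal single-process Python sketch. It does not use Hadoop itself; the record format and the function names `map_phase` and `reduce_phase` are assumptions made purely for illustration.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit one (data_type, size) pair for each record in a chunk."""
    return [(record["type"], record["size"]) for record in chunk]

def reduce_phase(pairs):
    """Reduce: combine all emitted pairs into a single result set."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical records, split into chunks as if they were sent to separate nodes.
chunks = [
    [{"type": "field route data", "size": 50}],
    [{"type": "geological document data", "size": 3},
     {"type": "field route data", "size": 7}],
]

emitted = [pair for chunk in chunks for pair in map_phase(chunk)]
totals = reduce_phase(emitted)
```

In a real cluster each chunk would be mapped on a different node; here the chunks are processed in one loop only to show how the partial results are combined.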

A network disk, also called a network USB drive or network hard disk, is an online storage service offered by internet companies. The server room allocates a certain amount of disk space to each user and provides file management functions such as storage, access, backup and sharing, either free of charge or for a fee, with high-level disaster recovery backups around the world. Users can regard the network disk as a hard disk or USB drive placed on the network: whether at home, at work or anywhere else, they can manage and edit the files in the network disk as long as they are connected to the Internet. It does not need to be carried around, and there is no risk of losing it.

A Uniform Resource Identifier (URI) allows users to interact with resources in a network (generally the World Wide Web) via specific protocols. The most common form of URI is the Uniform Resource Locator (URL), informally called a web address. A rarer form is the Uniform Resource Name (URN), which identifies a resource within a particular namespace.

A hash function (the term is sometimes translated by meaning and sometimes transliterated directly) transforms an input of arbitrary length, also called a pre-image, into a fixed-length output, the hash value. This transformation is a compression mapping: the space of hash values is usually much smaller than the space of inputs, so different inputs may hash to the same output, and the input cannot be uniquely determined from the hash value. In short, it is a function that compresses a message of arbitrary length into a message digest of fixed length.
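The fixed-length property can be seen in a short Python sketch using the standard `hashlib` module. MD5 is chosen here only because the description mentions it; the input values are arbitrary examples.

```python
import hashlib

# Two inputs of very different lengths.
short_digest = hashlib.md5(b"a").hexdigest()
long_digest = hashlib.md5(b"a" * 1_000_000).hexdigest()

# Both digests are 128 bits, written as 32 hexadecimal characters.
digest_lengths = (len(short_digest), len(long_digest))
```

Both digests have the same length even though one input is a single byte and the other is one million bytes, which is why the input cannot be recovered from the hash value.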

Hashing is mainly used in the information security field, where it converts information of varying length into a scrambled fixed-length code called a hash value (128 bits in the case of MD5). Hashing can also be viewed as finding a mapping between data content and a data storage address. File verification is an important application of hash algorithms in information security: the "digital fingerprint" property of the MD5 hash algorithm has made it the most widely used file integrity checksum algorithm at present, and Unix systems provide commands for calculating MD5 checksums.
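A file integrity check of this kind might look like the following Python sketch. The helper name `file_md5`, the block size and the sample payload are assumptions for illustration; a temporary local file stands in for a transmitted project file.

```python
import hashlib
import os
import tempfile

def file_md5(path, block_size=64 * 1024):
    """Return the MD5 hex digest of a file, reading it block by block."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Hypothetical payload written to a temporary file that plays the received copy.
payload = b"field route data" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    received_path = tmp.name

checksum_before = hashlib.md5(payload).hexdigest()  # computed by the sender
checksum_after = file_md5(received_path)            # computed by the receiver
os.remove(received_path)

intact = checksum_before == checksum_after
```

Reading the file in blocks keeps memory use constant, which matters for large field data files.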

FIG. 1 is a flow chart of the method for gathering field geological survey data in real time. As shown in fig. 1, a method for real-time gathering of field geological survey data includes:

Step 101: acquiring the project file to be aggregated, wherein the project file comprises a plurality of sub-project files.

Step 102: laying a Hadoop cluster, wherein the Hadoop cluster is used for storing the field geological survey data in each sub-project file online; the Hadoop cluster provides storage space for the sub-project files as required.

Step 103: uploading the field geological survey data in each sub-project file to the corresponding catalogs in different data organization forms to form project result data; the catalogs comprise a plurality of sub-project data catalogs.

The sub-project data catalogs include: regional geological maps, field route data, geological document data, satellite remote sensing images and other user-defined folders.
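Organizing uploads by data type could be sketched as follows. The function name `catalog_path`, the `user-defined` parent folder and the example paths are hypothetical; only the four catalog names come from the description.

```python
import posixpath

# Catalog names taken from the description; everything else is illustrative.
STANDARD_CATALOGS = {
    "regional geological maps",
    "field route data",
    "geological document data",
    "satellite remote sensing images",
}

def catalog_path(sub_project_root, data_type, file_name):
    """Return the target path of a file inside the sub-project data catalog."""
    if data_type in STANDARD_CATALOGS:
        folder = data_type
    else:
        # Assumption: any other data type goes into a user-defined folder.
        folder = posixpath.join("user-defined", data_type)
    return posixpath.join(sub_project_root, folder, file_name)

route_target = catalog_path("/sub-project-1", "field route data", "route.dat")
custom_target = catalog_path("/sub-project-1", "sample photos", "outcrop.jpg")
```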

Step 104: recording resource description information of the field geological survey data in each sub-project file, and storing the resource description information in a database.

Step 104 makes it easier for users to retrieve and query project result data. The resource description is expressed as a uniform resource identifier (URI), such as "file/2130", where "2130" is the unique value of the resource. The resource description includes the resource type, resource name, resource size and so on: the resource type is expressed as "file/2130/type/unstructured data", the resource name as "file/2130/name/mica schist.doc", and the resource size as "file/2130/size/60255".
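The URI pattern in this example can be reproduced with a small Python sketch. The function name `describe_resource` and the dictionary layout are assumptions; the URI strings follow the pattern given in the text.

```python
def describe_resource(resource_id, resource_type, resource_name, resource_size):
    """Build URI-style resource descriptions following the pattern in the text."""
    base = f"file/{resource_id}"
    return {
        "uri": base,
        "type": f"{base}/type/{resource_type}",
        "name": f"{base}/name/{resource_name}",
        "size": f"{base}/size/{resource_size}",
    }

# Values from the example in the description.
description = describe_resource(2130, "unstructured data", "mica schist.doc", 60255)
```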

Step 105: copying the project result data to a secondary project; specifically, the project result data is copied to the data type folder corresponding to the secondary project, which completes the aggregation of the project result data from the sub-project file to the secondary project.

Step 106: updating the resource description information, wherein the resource description information comprises a resource type, a resource name and a resource size.

Step 107: extracting the field geological survey data content in each sub-project file and storing the extracted field geological survey data content in the Hadoop cluster.

Step 106 and step 107 are performed simultaneously with the copying of the project result data to the secondary project, i.e., step 106, step 107 and step 105 are performed simultaneously.
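Copying result data to the matching data type folder while refreshing the resource descriptions could be sketched as below. A local temporary directory stands in for the Hadoop storage, and the function name `aggregate`, the folder names and the in-memory `descriptions` dictionary are all assumptions for illustration.

```python
import shutil
import tempfile
from pathlib import Path

def aggregate(sub_project_dir, secondary_project_dir, descriptions):
    """Copy each data type folder upward and refresh the resource descriptions."""
    for type_dir in Path(sub_project_dir).iterdir():
        if not type_dir.is_dir():
            continue
        target = Path(secondary_project_dir) / type_dir.name
        shutil.copytree(type_dir, target, dirs_exist_ok=True)  # Python 3.8+
        for copied_file in target.iterdir():
            if copied_file.is_file():
                descriptions[copied_file.name] = {
                    "type": type_dir.name,
                    "name": copied_file.name,
                    "size": copied_file.stat().st_size,
                }
    return descriptions

with tempfile.TemporaryDirectory() as root:
    sub = Path(root) / "sub-project 1"
    secondary = Path(root) / "secondary project"
    (sub / "field route data").mkdir(parents=True)
    (sub / "field route data" / "route.dat").write_bytes(b"x" * 10)
    secondary.mkdir()
    updated = aggregate(sub, secondary, {})
    copied = (secondary / "field route data" / "route.dat").exists()
```

The sketch runs the copy and the description update in one pass, mirroring the statement that the steps are performed together; it does not model true concurrency.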

The method further comprises the following steps:

transmitting files according to the size of the field geological survey data content in each sub-project file, and specifically comprising the following steps:

acquiring a set sub-project file size value;

In practice, if the file is less than 5M, the original project file is directly transmitted; and if the file is larger than 5M, carrying out fragment transmission on the project file, and combining the transmitted fragment files into a complete project file.
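The size-based choice between direct and fragment transmission can be sketched in Python. The function name `plan_transfer` is hypothetical. Treating a file of exactly the threshold size as a direct transfer follows the optional method described earlier, and taking 1M to mean 1024 × 1024 bytes is an assumption.

```python
MB = 1024 * 1024  # assumption: "M" in the text means 1024 * 1024 bytes

def plan_transfer(file_size, threshold=5 * MB):
    """Return the (offset, length) ranges that would be transmitted."""
    if file_size <= threshold:
        # Small file: transmit the original file directly as one range.
        return [(0, file_size)]
    # Large file: split into fragments of at most `threshold` bytes.
    return [
        (offset, min(threshold, file_size - offset))
        for offset in range(0, file_size, threshold)
    ]

small_plan = plan_transfer(3 * MB)
large_plan = plan_transfer(50 * MB)
```

With these assumptions a 3M file yields a single range and a 50M file yields ten fragments.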

FIG. 2 is a structural diagram of the real-time gathering system of the field geological survey data. As shown in fig. 2, a system for real-time gathering of field geological survey data includes:

a project file obtaining module 201, configured to obtain a project file to be aggregated, where the project file includes multiple sub-project files;

the Hadoop cluster laying module 202 is used for laying a Hadoop cluster, and the Hadoop cluster is used for storing field geological survey data in each sub-project file on line;

a project result data forming module 203, configured to upload field geological survey data in each sub-project file to directories in different data organization forms, respectively, so as to form project result data, where the directories include multiple sub-project data directories;

the resource description information recording module 204 is used for recording the resource description information of the field geological survey data in each sub-project file and storing the resource description information into a database;

the data aggregation module 205 is configured to copy the project result data to a secondary project;

an updating module 206, configured to update the resource description information, where the resource description information includes a resource type, a resource name, and a resource size;

and the extraction module 207 is used for extracting the field geological survey data content in each sub-project file and storing the extracted field geological survey data content in the Hadoop cluster.

The system further comprises:

The transmission module specifically includes:

The project achievement data forming module 203 specifically includes:

Traditional geological project data aggregation is done entirely by hand. This approach is suitable when the geographic data volume is small, but it clearly cannot work when the data volume is large and the data types are numerous. Furthermore, most geological project data gathering technologies at the present stage lack a well-defined organization structure, so a superior project cannot manage project data systematically, scientifically and in a standardized way. To overcome these defects, the invention provides a method and a system for gathering field geological survey data in real time.

According to the method and the system for gathering the field geological survey data in real time, provided by the invention, the field geological survey data in the project are collected and arranged by a lower-level organization according to project hierarchy to form project data, the project data are gathered by an upper-level organization, and then a project file presents a data flow process from bottom to top according to a project organization framework. The data real-time aggregation method and the data real-time aggregation system can efficiently and orderly complete the real-time aggregation of the achievement data from the sub-projects to the upper project while ensuring the data integrity in the field geological survey data transmission process, after the aggregation is completed, the system provides the authority for managing the lower project files for the upper project, and the whole project organization structure shows the orderly dynamic management from top to bottom. The method and the system for real-time data aggregation also lay a good foundation for the service release function, and are convenient for users to query and search the whole project file, share the whole project file in real time and analyze and mine big data.

Example 1:

Current geological projects are generally organized into four levels: plan, project, secondary project and sub-project.

FIG. 3 is an organization chart of the project in embodiment 1 of the present invention. FIG. 3 shows: the land-area energy and mineral geological survey plan (plan); the North China block and peripheral geological and mineral survey project (project); the geological and mineral survey of the Xiwuqi and Bainaimiao areas of the Erlian-Dongwuqi metallogenic belt (secondary project); and the sub-projects (sub-project 1, sub-project 2 and sub-project 3, three in total).

The sub-project collects field geological survey data and organizes it into two project files: a 3M geological profile (hereinafter referred to as project file A) and 50M of field route data (hereinafter referred to as project file B). These two project files now need to be aggregated from the sub-project to the secondary project.

Step 1) deploying a Hadoop cluster with the size of 10T.

Step 2) the Hadoop cluster provides 50G of online storage space for the sub-project as needed, and the field geological survey data in project file A and project file B is stored online.

Step 3) under the project organization architecture, the sub-project uploads the two kinds of field geological survey data, project file A and project file B, to the catalogs "geological document data" and "field route data" respectively to form project result data. Data such as geological document data, field route data, satellite remote sensing images and regional geological maps is stored in catalogs of different data organization forms, as shown in FIG. 4.

Step 4) while the field geological survey data is uploaded, the existing data warehouse tool Hive is used to record the resource descriptions of the field geological survey data in project file A and project file B and to store them in a Hive database, so that users can conveniently retrieve and query the project result data.

Step 5) after the project result data is formed, the result data of the sub-project is copied to the secondary project. Project file A and project file B thus flow from bottom to top along the project organization architecture, and the secondary project has top-down management authority over project file A and project file B in the sub-project.

Step 6) when the result data is aggregated from the sub-project to the secondary project, the resource descriptions of the field geological survey data in project file A and project file B recorded by the data warehouse tool Hive are updated.
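Recording a resource description and updating it after aggregation could be sketched as follows. This is not Hive: Python's built-in `sqlite3` module is used purely as a local stand-in so the example is runnable, and the table name, the column names and especially the `owner_level` column are hypothetical.

```python
import sqlite3

# Stand-in for the Hive database; the schema below is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource_description ("
    "uri TEXT PRIMARY KEY, type TEXT, name TEXT, size INTEGER, owner_level TEXT)"
)

# Record the description when the sub-project uploads the file.
conn.execute(
    "INSERT INTO resource_description VALUES (?, ?, ?, ?, ?)",
    ("file/2130", "unstructured data", "mica schist.doc", 60255, "sub-project"),
)

# Update the description when the data is aggregated to the secondary project.
conn.execute(
    "UPDATE resource_description SET owner_level = ? WHERE uri = ?",
    ("secondary project", "file/2130"),
)

row = conn.execute(
    "SELECT name, size, owner_level FROM resource_description WHERE uri = ?",
    ("file/2130",),
).fetchone()
conn.close()
```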

Step 7) while the field geological survey data is being aggregated, an existing data extraction tool extracts the content of the field geological survey data in project file A and project file B and stores it in the Hadoop cluster. This lays a foundation for publishing the field geological survey data in project file A and project file B as a service, helps users retrieve and query that data, and ultimately enables real-time sharing of field geological survey data and later big data analysis and mining.

Step 8) data transmission depends on the network, so a network interruption directly interrupts a file transfer. If the transfer has to start over after the network is restored, a great deal of time and effort is wasted. To address this concern, the following transmission strategy is provided:

Step S1) acquire the size of project file A, which is 3M. Because the file is smaller than 5M, project file A is transmitted directly as the original file, without fragmentation.

Step S2) acquire the size of project file B, which is 50M. Because the file is larger than 5M, project file B is divided into 10 slices of 5M each, and each slice is defined as a "slice file". After division, each of the 10 slice files has its own hash value, and these hash values form a hash list. During transmission, the hash values of the successfully transmitted slice files form a new hash list. If the network is interrupted, the hash values in the original hash list are compared one by one with those in the new hash list.

If a hash value in the original hash list exists in the new hash list, the slice file corresponding to that hash value has been transmitted;

conversely, if a hash value in the original hash list does not exist in the new hash list, the slice file corresponding to that hash value has not been transmitted successfully, and that slice file is defined as a "breakpoint file".
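The hash list comparison can be sketched in Python. The function names `slice_hashes` and `breakpoint_indices` are hypothetical, MD5 is assumed as the hash algorithm, and 50 bytes in slices of 5 bytes stand in for the 50M file in 5M slices. The sketch also assumes that no two slices have identical content, since identical slices would share a hash value.

```python
import hashlib

def slice_hashes(data, slice_size):
    """Split data into slices and return (slices, list of MD5 hex digests)."""
    slices = [data[i:i + slice_size] for i in range(0, len(data), slice_size)]
    return slices, [hashlib.md5(s).hexdigest() for s in slices]

def breakpoint_indices(original_hashes, transmitted_hashes):
    """Return indices of slices whose hash is missing from the new hash list."""
    done = set(transmitted_hashes)
    return [i for i, h in enumerate(original_hashes) if h not in done]

# Scaled-down stand-in for project file B: 50 bytes, 10 slices of 5 bytes.
data = bytes(range(50))
slices, original_hashes = slice_hashes(data, 5)

# Suppose the network dropped after the first four slices were transmitted.
new_hashes = original_hashes[:4]
pending = breakpoint_indices(original_hashes, new_hashes)
```

Transmission would then resume with the slices listed in `pending` rather than from the start of the file.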

Step S2.1) after the "breakpoint file" is determined in step S2), transmission continues from the "breakpoint file" until all slice files have been transmitted. This avoids the time and effort wasted by restarting the file transfer from the beginning.

Step S2.2) after all slice files have been transmitted, the transmitted slice files are merged and restored into a single complete project file, thereby ensuring the data integrity of project file B during transmission.
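Merging the slice files and checking the result might look like the following Python sketch. The function name `merge_slices` and the index-to-bytes dictionary are assumptions, as is the use of a whole-file MD5 digest to confirm integrity; the text itself only states that the slices are merged into a complete file.

```python
import hashlib

def merge_slices(received):
    """Join slices in index order; `received` maps slice index to bytes."""
    return b"".join(received[index] for index in sorted(received))

# Scaled-down stand-in for project file B and its digest before sending.
original = bytes(range(50))
expected_md5 = hashlib.md5(original).hexdigest()

# Slices may arrive in any order, for example after resuming from a breakpoint.
arrival_order = (7, 2, 9, 0, 4, 1, 8, 3, 6, 5)
received = {i: original[i * 5:(i + 1) * 5] for i in arrival_order}

restored = merge_slices(received)
is_complete = hashlib.md5(restored).hexdigest() == expected_md5
```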

By the method and the system, dynamic collection of data from sub projects to secondary projects, from secondary projects to projects and from projects to plans can be realized, and sharing can be realized. The process of gathering the data from the secondary project to the project and from the project to the plan is substantially the same as that in embodiment 1, and will not be described herein.
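The level-by-level aggregation path can be written down as a short Python sketch. The list `LEVELS` uses the four levels named in embodiment 1; the function name `aggregation_hops` is hypothetical.

```python
# The four organizational levels, ordered from lowest to highest.
LEVELS = ["sub-project", "secondary project", "project", "plan"]

def aggregation_hops(start_level):
    """Return the (source, target) pairs data passes through on its way up."""
    start = LEVELS.index(start_level)
    return list(zip(LEVELS[start:], LEVELS[start + 1:]))

hops = aggregation_hops("sub-project")
```

Each pair corresponds to one run of the aggregation procedure, with the lower level playing the role of the sub-project and the upper level playing the role of the secondary project.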

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A real-time gathering method for field geological survey data is characterized by comprising the following steps:

copying the project result data to a secondary project;

2. The method for real-time aggregation of field geological survey data according to claim 1, wherein the method further comprises:

3. The method for real-time aggregation of field geological survey data according to claim 2, wherein the file transmission according to the size of the field geological survey data content in each sub-project file specifically comprises:

acquiring a set sub-project file size value;

4. The method for real-time aggregation of field geological survey data according to claim 1, wherein the sub-project data catalog comprises: regional geological maps, field route data, geological document data, satellite remote sensing images and other user-defined folders.

5. The method for gathering field geological survey data in real time according to claim 1, wherein the field geological survey data in each sub-project file is uploaded to catalogs in different data organization forms correspondingly to form project result data, the catalogs include a plurality of sub-project data catalogs, and specifically include:

6. The method for real-time convergence of field geological survey data according to claim 1, wherein the recording resource description information of the field geological survey data in each sub-project file and storing the resource description information into a database specifically comprises:

7. A real-time gathering system for field geological survey data is characterized by comprising:

8. The field geological survey data real-time gathering system as recited in claim 7, wherein the system further comprises:

9. The system for real-time convergence of field geological survey data according to claim 7, wherein the transmission module specifically comprises:

10. The system for real-time convergence of field geological survey data according to claim 7, wherein the project result data forming module specifically comprises: