CN116127086B - Geographical science data demand analysis method and device based on scientific and technological literature resources

Info

Publication number: CN116127086B
Application number: CN202211476732.9A
Authority: CN
Inventors: 周昆; 邱琳; 李伊黎; 冯功学; 康昕怡; 孙端; 常中兵; 傅海鑫; 罗小梅; 王祯
Original assignee: SURVEYING AND MAPPING INSTITUTE LANDS AND RESOURCE DEPARTMENT OF GUANGDONG PROVINCE
Current assignee: SURVEYING AND MAPPING INSTITUTE LANDS AND RESOURCE DEPARTMENT OF GUANGDONG PROVINCE
Priority date: 2022-11-23
Filing date: 2022-11-23
Publication date: 2023-09-19
Anticipated expiration: 2042-11-23
Also published as: CN116127086A

Abstract

The invention discloses a geographical scientific data demand analysis method and device based on scientific and technological literature resources. Exploiting the close relationship between scientific data and scientific literature, the invention collates, statistically analyzes and visualizes a large body of scientific and technological literature resources to clarify which scientific data researchers actually use in their research, and summarizes the demands of geographic scientific research on geographical scientific data systematically, scientifically and objectively, thereby providing more scientific and targeted guidance and support for data acquisition, data exchange and data mining.

Description

Geographical science data demand analysis method and device based on scientific and technological literature resources

Technical Field

The invention relates to a data analysis technology, in particular to a geographical scientific data demand analysis method and device based on scientific and technological literature resources.

Background

Geographic science data is an important basic strategic resource for geographic science research and innovative discovery, and its comprehensive utilization provides key support for frontier basic research in the region, major government decisions, high-quality industrial development and the like. Within the national and regional scientific data center system, several data centers related to geographic science have been built so far, including the National Earth System Science Data Center and the Northeast Asia geographic science data center. However, a geographic science data center for South China, and for Guangdong Province in particular, is lacking, and the management and use of existing geographic science data suffer from dispersed data, inconsistent standards, repeated investment and similar problems. The Guangdong Provincial Department of Science and Technology therefore proposed in 2022 to build a Guangdong Province geographic science data center to achieve efficient exchange, management and sharing of geographic science data. How to provide sufficient, high-quality and valuable data resources has thus become an important foundation for improving the service level and capacity of the data center.

At present, the geographical science data resource expansion and data demand analysis mainly adopt the following modes:

(1) Existing geographic science data centers expand their data resources mainly through data exchange. The data mainly derive from basic research, applied research, experimental development and the like, and from raw data and derived data obtained by observation and monitoring, investigation, and inspection and testing and used for scientific research activities. Data demand analysis is lacking.

(2) Data demand surveys are conducted among researchers in related fields by means of questionnaires and the like, in order to learn the key research directions in the field and the associated data resource demands.

The above prior art has mainly the following disadvantages:

(1) When data resources are expanded through data exchange, data are acquired passively and the needs of other researchers for scientific data resources are not fully considered. This reduces the value of the data, increases wasted manpower and material costs, and impairs the service capacity and influence of the data center.

(2) Data demand surveys conducted through questionnaires and similar instruments depend excessively on subjective judgment, are limited in form and in the range of respondents, cannot comprehensively reflect the data demands of geographic research, and can hardly meet the needs of domestic and international scientific research.

Patent document CN104899258A discloses a visual analysis system architecture for information interaction of mass documents, and the scheme utilizes the ideas of classification and clustering to calculate and process original mass data, so that the problem that the analysis result of a scientific and technological document network lacks graphical display is solved, but the scheme cannot realize data demand analysis.

Patent document CN109255026A discloses a learning demand analysis method based on co-word analysis and cluster analysis. The method first exports data from a thematic online learning forum, cleans the data by means of text cloud, and converts the cleaned data into EndNote format. On this basis, co-word analysis is applied to obtain a co-word matrix and a dissimilarity matrix; a social network graph is constructed by social network analysis; and a dendrogram of the co-word clustering of high-frequency keywords is obtained with SPSS software and a clustering method. Finally, a hierarchical tower of learning demands for the thematic online learning forum is obtained from the social network graph and the co-word clustering dendrogram, laying a foundation for providing targeted learning support services, answering questions and resolving doubts, and organizing and building resources for online learning communities. The method is mainly applied to learning demand analysis and cannot be applied to data demand analysis.

Disclosure of Invention

To address the problems described in the background, the invention provides a geographical scientific data demand analysis method and device based on scientific and technological literature resources. Exploiting the close relationship between scientific data and scientific literature, the invention collates, statistically analyzes and visualizes a large body of scientific and technological literature resources to clarify which scientific data researchers actually use in their research, thereby providing more scientific and targeted guidance and support for data acquisition, data exchange and data mining.

In order to achieve the above purpose, the technical scheme of the invention is as follows:

in a first aspect, the present invention provides a method for analyzing geoscience data demand based on scientific and technological literature resources, the method comprising:

constructing a literature resource database;

word segmentation processing and data name recognition are carried out on the documents in the document resource database, duplicate removal processing is carried out on repeated data names in the same document, and the data names are stored in the document resource database;

and constructing a standard data name list, and unifying various and non-uniform data name vocabularies to standard and unique data names through data name matching so as to scientifically and reasonably count the frequency of each data resource in the scientific literature.

According to the data name list, performing word frequency statistics of the data names, and constructing a demand index by using the total number of scientific and technological documents and the data name frequency;

drawing a data name word cloud by using the demand index, in which the size of each word represents the magnitude of the data demand; drawing a data demand trend graph by combining time information with the demand index of each data resource; and comparing and analyzing the data demand characteristics of different stages, so as to reflect the changing pattern and development trend of the data demand and help the data center provide data resources more proactively.

Clustering different keywords as objects according to the similarity of the keywords, enabling the keywords with similar semantic relations to be clustered together to form a group, introducing adhesive force to measure the contribution degree of each keyword in the group to the clustering process, selecting the word with the largest adhesive force in the group as a central word, summarizing and naming each group by referring to the central word, and obtaining a main research direction;

based on a literature resource database, data fusion and relation extraction are carried out on main research directions, keywords and data resource names formed by clustering results, the demand indexes of various data in each research direction are calculated, and a scientific knowledge graph of the research directions and the data resources is constructed by taking the data demand indexes as edges.

Further, the calculation formula of the data demand index is:

In the above formula, X_i is the demand index of the i-th data name, N_i is the frequency of the i-th data name, n is the total number of scientific and technological literature resources, and N_i/n is the ratio of the number of scientific and technological literature resources containing the i-th data name to the total number of scientific and technological literature resources.

Further, the calculation formula of the adhesive force is as follows:

In the above formula, N(A_i) is the adhesive force of keyword A_i, and E(A_i → A_j) represents the co-occurrence frequency of keyword A_i with each of the other keywords A_j in the cluster.
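As an illustration of how the central word of a cluster might be selected, the following minimal Python sketch assumes that the adhesive force of a keyword is its average co-occurrence frequency with the other keywords of the same cluster. The averaging form, the symmetric co-occurrence table, and the names adhesive_force, central_word and cooc are assumptions made for illustration only; they are not taken from the formula of the invention.

```python
def adhesive_force(keyword, cluster, cooccurrence):
    """Assumed form: mean co-occurrence frequency of `keyword` with the
    other keywords in `cluster`.

    cooccurrence maps frozenset({a, b}) -> co-occurrence frequency
    (treated as symmetric in this sketch).
    """
    others = [k for k in cluster if k != keyword]
    if not others:
        return 0.0
    total = sum(cooccurrence.get(frozenset((keyword, k)), 0) for k in others)
    return total / len(others)


def central_word(cluster, cooccurrence):
    """Pick the keyword with the largest adhesive force as the central word."""
    return max(cluster, key=lambda k: adhesive_force(k, cluster, cooccurrence))


# Hypothetical cluster and co-occurrence counts.
cluster = ["urbanization", "land use", "remote sensing"]
cooc = {
    frozenset(("urbanization", "land use")): 8,
    frozenset(("land use", "remote sensing")): 6,
    frozenset(("urbanization", "remote sensing")): 2,
}
center = central_word(cluster, cooc)
print(center)
```

In this hypothetical cluster, "land use" co-occurs most strongly with the other two keywords, so it would be used as the reference when naming the group.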

Further, the building a literature resource database includes:

constructing a literature resource base library: establishing a topic word library according to the related field of the geographic discipline, using scientific and technological literature resources of a knowledge database as a data source, and searching publications according to topic words by utilizing a crawler technology, so as to obtain a corresponding scientific and technological literature data set and form a literature resource base library;

constructing a keyword word set: performing word segmentation on the titles, keywords and abstracts of all the crawled technical documents, deleting structural words, removing verbs and adjective part-of-speech words, only preserving nouns, removing daily words through a daily word corpus matching elimination method of professional documents, and constructing a keyword word set;

updating a document resource library: according to the keyword set, using knowledge database scientific and technological literature resources as data sources, acquiring more related scientific and technological literature information again by adopting a crawler technology, and storing the acquired title, abstract, keywords and publishing time information into a database.

Further, based on a word segmentation module and an entity word recognition module of the natural language processing algorithm package, word segmentation processing and data name recognition are carried out on the crawled title, keywords and abstract data.

Further, the demand index is normalized so as to perform demand comparison analysis of the data resources, and the purpose of quantifying the demand of the data resources is achieved.

Further, the standard data name list is constructed by means of browsing data official websites, consulting related monographs of the field, consulting specialists and the like.

Further, the data demand trend graph is a data demand year trend graph.

In a second aspect, the present invention provides a geographical scientific data demand analysis device based on scientific literature resources, comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the steps of any of the methods described above when said computer program is executed.

In a third aspect, the present invention provides a computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of any of the methods described above.

Compared with the prior art, the invention has the beneficial effects that:

The geographical scientific data demand analysis method based on scientific and technological literature resources provided by the invention breaks through the limitations of questionnaire-based demand surveys, can comprehensively and objectively analyze the demands of scientific researchers on geographical scientific data in actual research, and provides more scientific and targeted guidance and support for scientific research.

The geographical scientific data demand analysis method based on scientific and technological literature resources provided by the invention makes data acquisition targeted: data are collected and produced according to demand, which optimizes the allocation of human, material and financial resources, saves funds, and maximizes benefit and data utilization at minimum cost.

Drawings

FIG. 1 is a flowchart of a method for analyzing demands of geoscience data based on scientific and technological literature resources according to embodiment 1 of the present invention;

fig. 2 is a schematic diagram of the geographical scientific data demand analysis device based on scientific literature resources according to embodiment 2 of the present invention.

Detailed Description

Example 1:

the technical scheme of the invention is further described below with reference to the accompanying drawings and examples.

Referring to fig. 1, the geographical scientific data demand analysis method based on scientific literature resources provided in this embodiment mainly includes the following steps:

the construction step of the literature resource database comprises the following steps:

(1) Constructing a literature resource base library: a subject term library is established according to "Classification and Code of Disciplines" (GB/T 13745-2009); with the scientific and technological literature resources of the knowledge network database as the data source, crawler technology is used to search for publications by subject term, so that a number of publications are identified, such as geography journals (for example, Economic Geography) and the proceedings of economic geography academic conferences and seminars (for example, a seminar on globalization and national regional development); the corresponding scientific and technological literature data sets are then obtained to form the literature resource base library;

(2) Constructing a keyword word set: performing word segmentation on the titles, keywords and abstracts of all the crawled technical documents, deleting structural words such as conjunctions, prepositions and pronouns, removing verbs and adjectives, only preserving nouns, removing daily words through a daily word corpus matching elimination method of professional documents, and constructing a keyword word set;

(3) Updating a document resource library: according to the keyword set, the scientific and technological literature resources of the knowledge network database are used as data sources, the crawler technology is adopted again to acquire more related scientific and technological literature information, and the acquired information such as titles, abstracts, keywords, publishing time and the like is stored in the database.

Through the above sub-steps, scientific and technological literature resources related to the geographic field can be acquired, which expands the search range of the literature resources and improves the efficiency of crawling scientific and technological literature.

The data name recognition and normalization step comprises:
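The keyword word set construction in sub-step (2) can be sketched as follows. This minimal Python example assumes that the text has already been segmented and part-of-speech tagged by some external tool; the tag set NOUN_TAGS, the function build_keyword_set and the sample data are hypothetical and only illustrate the filtering logic (keep nouns, discard everyday words).

```python
# Hypothetical part-of-speech tags marking nouns; a real tagger's tag set
# will differ and must be substituted here.
NOUN_TAGS = {"n", "nz", "ns"}


def build_keyword_set(tagged_docs, everyday_words):
    """Keep nouns only and drop everyday (non-specialist) words.

    tagged_docs: iterable of documents, each a list of (word, pos_tag) pairs
                 taken from the title, keywords and abstract.
    everyday_words: set of common words eliminated by corpus matching.
    """
    keywords = set()
    for doc in tagged_docs:
        for word, tag in doc:
            if tag in NOUN_TAGS and word not in everyday_words:
                keywords.add(word)
    return keywords


# Hypothetical, already segmented and tagged documents.
docs = [
    [("remote sensing image", "n"), ("and", "c"), ("analyze", "v"), ("method", "n")],
    [("land use", "n"), ("important", "a"), ("result", "n")],
]
everyday = {"method", "result"}
keyword_set = build_keyword_set(docs, everyday)
print(sorted(keyword_set))
```

The resulting set would then serve as the query terms for the second crawling pass in sub-step (3).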

(1) Data name identification

The word segmentation module and the entity word recognition module based on the natural language processing algorithm package perform word segmentation processing and data name recognition on the crawled text title, keywords and abstract data, perform duplication removal processing on repeated data names in the same document, and store the data names in a document resource database.

(2) Data name normalization

The standard data name list is constructed by means of browsing data official websites, consulting related monographs in the field, consulting specialists and the like, and various and non-uniform data name vocabularies are unified on the standard and unique data names through data name matching so as to scientifically and reasonably count the frequency of occurrence of the data names in the scientific and technological literature for each data resource.

Standard data name representation example

Thus, the data names are extracted and normalized through the substeps, and a foundation is laid for data resource statistics and demand analysis in scientific and technical literature.
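A minimal Python sketch of the per-document deduplication and name matching described above is given below. The alias table STANDARD_NAMES, its entries and the function normalize_names are hypothetical; in practice the table would come from the standard data name list compiled from official data websites, monographs and expert consultation. The sketch assumes simple case-insensitive exact matching and drops names that have no entry in the table.

```python
# Hypothetical alias table: variant spelling -> standard, unique data name.
STANDARD_NAMES = {
    "landsat tm": "Landsat TM imagery",
    "tm image": "Landsat TM imagery",
    "dem": "Digital elevation model",
    "digital elevation model": "Digital elevation model",
}


def normalize_names(recognized_names):
    """Map the data names recognized in ONE document to standard names.

    Each standard name is kept at most once per document, so that later
    frequency counts are document counts.
    """
    standard = []
    for name in recognized_names:
        mapped = STANDARD_NAMES.get(name.strip().lower())
        if mapped is not None and mapped not in standard:
            standard.append(mapped)
    return standard


doc_names = ["Landsat TM", "TM image", "DEM", "dem", "unknown dataset"]
normalized = normalize_names(doc_names)
print(normalized)
```

Because each standard name is stored once per document, the frequency of a data name equals the number of documents that use it.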

A data resource demand calculation and analysis step comprising:

(1) Data demand computation

According to a data name list in a literature resource library, performing word frequency statistics of data names, constructing a demand index by using the total number of scientific and technical literatures and the frequency of the data names, and performing normalization processing so as to perform demand comparison analysis of data resources, thereby achieving the purpose of quantifying the demands of the data resources, wherein a demand index calculation formula is as follows:

The demand index can be accurately determined by the above.
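The calculation can be illustrated with a short Python sketch. It assumes that the demand index of a data name is the ratio of its frequency N_i to the total number of documents n, and it uses min-max scaling as one possible normalization; both choices, together with the names demand_index and min_max_normalize, are assumptions for illustration and may differ from the formula of the invention.

```python
from collections import Counter


def demand_index(doc_data_names, total_docs):
    """Return {data name: N_i / n}, counting each name once per document."""
    freq = Counter()
    for names in doc_data_names:
        freq.update(set(names))
    return {name: count / total_docs for name, count in freq.items()}


def min_max_normalize(index):
    """Scale index values to [0, 1]; one assumed normalization choice."""
    lo, hi = min(index.values()), max(index.values())
    if hi == lo:
        return {name: 1.0 for name in index}
    return {name: (value - lo) / (hi - lo) for name, value in index.items()}


# Hypothetical documents, each listed with its standard data names.
docs = [
    ["DEM", "Landsat TM imagery"],
    ["DEM"],
    ["DEM", "Population census"],
    [],
]
raw = demand_index(docs, total_docs=len(docs))
scaled = min_max_normalize(raw)
print(raw)
print(scaled)
```

Documents that use no recognized data name still count toward n, so the index reflects demand relative to the whole literature set.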

(2) Data demand analysis

A data name word cloud is drawn using the data demand indexes, with the size of each word representing the magnitude of the data demand. On the basis of the demand index of each data resource, a yearly data demand trend graph is drawn by combining time information, and the data demand characteristics of different stages are compared and analyzed, so as to reflect the changing pattern and development trend of the data demand and help the data center provide data resources more proactively.

Therefore, the data resource requirements can be quantified through the substeps, and the data requirement characteristics can be analyzed to reflect the change rule and the development trend of the data requirements.
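The data behind a yearly trend graph could be prepared as in the following Python sketch. It assumes that the per-year demand index of a data name is its share of the documents published in that year; this per-year definition, the function yearly_demand and the sample records are hypothetical. Plotting is left out so that the sketch relies only on the standard library.

```python
from collections import Counter, defaultdict


def yearly_demand(records):
    """records: iterable of (year, [standard data names]), one per document.

    Returns {year: {data name: share of that year's documents using it}}.
    """
    docs_per_year = Counter()
    name_counts = defaultdict(Counter)
    for year, names in records:
        docs_per_year[year] += 1
        name_counts[year].update(set(names))
    return {
        year: {
            name: count / docs_per_year[year]
            for name, count in name_counts[year].items()
        }
        for year in sorted(docs_per_year)
    }


# Hypothetical publication years and data names.
records = [
    (2020, ["DEM"]),
    (2020, ["Landsat TM imagery"]),
    (2021, ["DEM"]),
    (2021, ["DEM", "Landsat TM imagery"]),
]
trend = yearly_demand(records)
for year, index in trend.items():
    print(year, index)
```

Each data name then yields one series over the years, which can be passed to any plotting tool to draw the trend graph.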

The scientific knowledge graph construction step comprises the following steps:

(1) Keyword cluster analysis

The keywords are taken as objects and clustered according to their similarity, so that keywords with similar semantic relations are grouped together. The concept of adhesive force is introduced to measure the contribution of each keyword in a group to the clustering process; the word with the largest adhesive force in the group is selected as the central word, and each group is summarized and named with reference to its central word to obtain the main research directions. The adhesive force calculation formula is:

(2) Map construction

Based on a literature resource database, data fusion and relation extraction are carried out on main research directions, keywords and data resource names formed by clustering results, the demand indexes of various data in each research direction are calculated, scientific knowledge maps of the research directions and the data resources are constructed by taking the data demand indexes as edges, the structure evolution characteristics of the knowledge maps in different time windows are analyzed, and the development condition and the data demand change condition of each research direction are researched.

Thus, through the relation between research directions and data resources formed by the science and technology literature resource clusters, the data resource requirements of different research directions are statistically analyzed.
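The edge weights of such a graph can be sketched in Python as follows. The sketch assumes that every document has already been assigned to exactly one research direction and that the demand index of a data name within a direction is its share of that direction's documents; the function build_graph and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def build_graph(doc_records):
    """doc_records: iterable of (research direction, [standard data names]).

    Returns {(direction, data name): demand index within that direction},
    i.e. a weighted edge list linking research directions to data resources.
    """
    docs_per_direction = Counter()
    counts = defaultdict(Counter)
    for direction, names in doc_records:
        docs_per_direction[direction] += 1
        counts[direction].update(set(names))
    edges = {}
    for direction, counter in counts.items():
        for name, count in counter.items():
            edges[(direction, name)] = count / docs_per_direction[direction]
    return edges


# Hypothetical documents labeled with a research direction.
records = [
    ("urban expansion", ["Landsat TM imagery", "DEM"]),
    ("urban expansion", ["Landsat TM imagery"]),
    ("soil erosion", ["DEM"]),
]
graph = build_graph(records)
for edge, weight in sorted(graph.items()):
    print(edge, weight)
```

Running the same function on documents from different time windows would give one edge list per window, which is one way to compare how the graph structure evolves.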

In summary, compared with the prior art, the invention has the following technical advantages:

Example 2:

referring to fig. 2, the device for analyzing the geoscience data demand based on the scientific and technological literature resource provided in this embodiment includes a processor 21, a memory 22, and a computer program 23 stored in the memory 22 and executable on the processor 21, for example, the geoscience data demand analysis program based on the scientific and technological literature resource. The processor 21, when executing the computer program 23, implements the steps of embodiment 1 described above, such as the steps shown in fig. 1.

Illustratively, the computer program 23 may be partitioned into one or more modules/units that are stored in the memory 22 and executed by the processor 21 to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions for describing the execution of the computer program 23 in the scientific and technological resource-based geoscience data demand analysis means.

The geographical scientific data demand analysis device based on the scientific and technological literature resources can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing devices. The geoscience data demand analysis device based on scientific and technological literature resources may include, but is not limited to, a processor 21 and a memory 22. It will be appreciated by those skilled in the art that fig. 2 is merely an example of a scientific and technological resource-based geoscience data demand analysis apparatus, and does not constitute a limitation of a scientific and technological resource-based geoscience data demand analysis apparatus, and may include more or less components than those illustrated, or may combine certain components, or different components, e.g., the scientific and technological resource-based geoscience data demand analysis apparatus may further include an input-output device, a network access device, a bus, etc.

The processor 21 may be a central processing unit (Central Processing Unit, CPU), but may also be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or any conventional processor or the like.

The memory 22 may be an internal storage unit of the geoscience data demand analysis device based on scientific and technological literature resources, such as a hard disk or an internal memory of the device. The memory 22 may also be an external storage device of the device, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card or the like provided on the device. Further, the memory 22 may include both an internal storage unit and an external storage device of the device. The memory 22 is used to store the computer program and other programs and data required by the device. The memory 22 may also be used to temporarily store data that has been output or is to be output.

Example 3:

the present embodiment provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described in embodiment 1.

The computer readable medium can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer readable medium may even be paper or another suitable medium upon which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and then stored in a computer memory.

The above embodiments are only for illustrating the technical concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement the same, and are not intended to limit the scope of the present invention. All equivalent changes or modifications made in accordance with the essence of the present invention are intended to be included within the scope of the present invention.

Claims

1. A method for analyzing geoscience data demand based on scientific and technological literature resources, the method comprising:

constructing a literature resource database;

constructing a standard data name list, and unifying various and non-uniform data name vocabularies to standard and unique data names through data name matching;

drawing a data name cloud picture by using the demand index; the size of the data demand is represented by the size of the vocabulary, and a data demand trend graph is drawn by combining time information on the basis of the demand index of each data resource;

based on a literature resource database, carrying out data fusion and relation extraction on main research directions, keywords and data resource names formed by clustering results, calculating the demand indexes of various data in each research direction, and constructing a scientific knowledge graph of the research directions and the data resources by taking the data demand indexes as edges;

the calculation formula of the data demand index is as follows:

in the above formula, X_k is the demand index of the k-th data name; N_k is the frequency of the k-th data name; n is the total number of scientific and technological literature resources; and N_k/n is the ratio of the number of scientific and technological literature resources containing the k-th data name to the total number of scientific and technological literature resources;

the construction of the literature resource database comprises the following steps:

constructing a literature resource base library: establishing a topic word library according to the related field of the geographic discipline, using scientific and technological literature resources of a knowledge database as a data source, and searching publications according to topic words by utilizing a crawler technology to obtain a corresponding scientific and technological literature data set so as to form a literature resource base library;

2. The method for analyzing the demands of the geoscience data based on the scientific literature resources according to claim 1, wherein the calculation formula of the adhesive force is:

3. The method for analyzing the demands of the geographical scientific data based on the scientific and literature resources according to claim 1, wherein the word segmentation module and the entity word recognition module based on the natural language processing algorithm package perform word segmentation processing and data name recognition on the crawled title, key words and abstract data.

4. The method for demand analysis of geoscience data based on scientific literature resources according to claim 1, wherein the demand index is normalized.

5. The method for analyzing the demands of geoscience data based on scientific and literature resources according to claim 1, wherein the standard data name list is constructed by browsing a data official website, consulting a field related monograph and consulting an expert.

6. The scientific and technological resource-based geoscience data demand analysis method of claim 1, wherein the data demand trend graph is a data demand year trend graph.

7. A geographical scientific data demand analysis device based on scientific literature resources, comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that said processor, when executing said computer program, carries out the steps of the method according to any one of claims 1 to 6.

8. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 6.