CN116127086B - Geographical science data demand analysis method and device based on scientific and technological literature resources

Info

Publication number
CN116127086B
Authority
CN
China
Prior art keywords
data
scientific
literature
demand
resources
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211476732.9A
Other languages
Chinese (zh)
Other versions
CN116127086A (en
Inventor
周昆
邱琳
李伊黎
冯功学
康昕怡
孙端
常中兵
傅海鑫
罗小梅
王祯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SURVEYING AND MAPPING INSTITUTE LANDS AND RESOURCE DEPARTMENT OF GUANGDONG PROVINCE
Original Assignee
SURVEYING AND MAPPING INSTITUTE LANDS AND RESOURCE DEPARTMENT OF GUANGDONG PROVINCE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SURVEYING AND MAPPING INSTITUTE LANDS AND RESOURCE DEPARTMENT OF GUANGDONG PROVINCE filed Critical SURVEYING AND MAPPING INSTITUTE LANDS AND RESOURCE DEPARTMENT OF GUANGDONG PROVINCE
Priority to CN202211476732.9A priority Critical patent/CN116127086B/en
Publication of CN116127086A publication Critical patent/CN116127086A/en
Application granted granted Critical
Publication of CN116127086B publication Critical patent/CN116127086B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and device for analyzing geographical scientific data demand based on scientific and technological literature resources. Exploiting the close relationship between scientific data and scientific literature, the invention collates, statistically analyzes and visualizes a large body of scientific literature resources to clarify which scientific data researchers actually use in their studies, and summarizes systematically, scientifically and objectively the demand of geographic research for geographical scientific data, thereby providing more scientific and targeted guidance and support for data acquisition, data exchange and data mining.

Description

Geographical science data demand analysis method and device based on scientific and technological literature resources
Technical Field
The invention relates to a data analysis technology, in particular to a geographical scientific data demand analysis method and device based on scientific and technological literature resources.
Background
Geographical scientific data is an important basic strategic resource for geographical scientific research and innovative discovery, and its comprehensive utilization plays a key supporting role in regional frontier basic research, major government decisions, high-quality industrial development and the like. Within the national and regional scientific data center system, several data centers related to geographic science have already been built, including the National Earth System Science Data Center and the Northeast Asia geographic science data center. However, a geographic science data center for South China, and for Guangdong Province in particular, is lacking, and the management and use of current geographic science data suffer from dispersed data, inconsistent standards, repeated investment and similar problems. For this reason, the Department of Science and Technology of Guangdong Province proposed in 2022 to build a Guangdong Province geoscience data center to achieve efficient exchange, management and sharing of geoscience data. How to provide sufficient, high-quality and valuable data resources has therefore become an important foundation for improving the service level and capacity of the data center.
At present, the geographical science data resource expansion and data demand analysis mainly adopt the following modes:
(1) Existing geographic science data centers expand their data resources mainly through data exchange. The data mainly comes from basic research, applied research, experimental development and the like, together with raw data and derived data that are obtained through observation and monitoring, investigation, and inspection and testing and are used for scientific research activities. Data demand analysis is lacking.
(2) Data demand surveys of researchers in the relevant fields are conducted by means of questionnaires and the like, in order to learn the key research directions in the field and the related data resource demands.
The above prior art mainly has the following disadvantages:
(1) Expanding data resources through data exchange acquires data passively and does not fully consider the demands of other researchers for scientific data resources. This lowers the value of the data, wastes manpower and material resources, and impairs the service capacity and influence of the data center.
(2) Data demand surveys based on questionnaires and the like rely excessively on subjective judgment, take a single form and reach a limited set of respondents. They cannot comprehensively reflect the data demand of geographic research and can hardly meet the needs of domestic and international scientific research.
Patent document CN104899258A discloses a visual analysis system architecture for information interaction over massive document collections. The scheme uses classification and clustering to process the original mass data and solves the problem that analysis results of scientific and technological literature networks lack graphical display, but it cannot perform data demand analysis.
Patent document CN109255026A discloses a learning demand analysis method based on co-word analysis and cluster analysis. Data is first exported from a thematic online learning forum, cleaned by means of a text cloud, and converted into EndNote format. On this basis, co-word analysis is applied to obtain a co-word matrix and a dissimilarity matrix; a social network graph is constructed by social network analysis; and a dendrogram of the co-word clustering of high-frequency keywords is obtained with SPSS software and a clustering method. Finally, a hierarchical tower of learning demands for the thematic online learning forum is derived from the social network graph and the co-word clustering dendrogram, laying a foundation for providing targeted learning support services, answering questions, and organizing and building resources for online learning communities. That method is mainly applied to learning demand analysis and cannot be applied to data demand analysis.
Disclosure of Invention
In view of the problems described in the background, the invention provides a method and device for analyzing geographical scientific data demand based on scientific and technological literature resources. Exploiting the close relationship between scientific data and scientific literature, the invention collates, statistically analyzes and visualizes a large body of scientific literature resources to clarify which scientific data researchers actually use in their studies, thereby providing more scientific and targeted guidance and support for data acquisition, data exchange and data mining.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
In a first aspect, the present invention provides a method for analyzing geoscience data demand based on scientific and technological literature resources, the method comprising:
constructing a literature resource database;
performing word segmentation and data name recognition on the documents in the literature resource database, removing duplicate data names within the same document, and storing the data names in the literature resource database;
constructing a standard data name list, and unifying the varied and inconsistent data name vocabulary into standard, unique data names through data name matching, so that the frequency with which each data resource appears in the scientific literature can be counted scientifically and reasonably;
performing word frequency statistics on the data names according to the data name list, and constructing a demand index from the total number of scientific and technological documents and the data name frequency;
drawing a data name cloud chart using the demand index, in which the size of each term represents the size of the data demand; drawing a data demand trend graph by combining time information with the demand index of each data resource; and comparing the data demand characteristics of different stages, so as to reflect the changing pattern and development trend of data demand and help the data center provide data resources more proactively;
clustering the keywords as objects according to their similarity, so that keywords with similar semantic relations are grouped into clusters; introducing adhesive force to measure the contribution of each keyword in a cluster to the clustering process; selecting the word with the greatest adhesive force in each cluster as its central word; and summarizing and naming each cluster with reference to its central word to obtain the main research directions;
based on the literature resource database, performing data fusion and relation extraction on the main research directions formed by the clustering results, the keywords and the data resource names; calculating the demand index of each kind of data in each research direction; and constructing a scientific knowledge graph of research directions and data resources with the data demand indexes as edges.
Further, the calculation formula of the data demand index is:
In the above formula, $X_i$ is the demand index of data $i$, $N_i$ is the frequency of data name $i$, $n$ is the total number of scientific and technological literature resources, and $N_i / n$ is the ratio of the number of literature resources containing data name $i$ to the total number of literature resources.
Further, the calculation formula of the adhesive force is as follows:
In the above formula, $N(A_i)$ is the adhesive force of keyword $A_i$, and $E(A_i \to A_j)$ is the co-occurrence frequency of keyword $A_i$ with each of the other keywords $A_j$ in the cluster.
Further, the building a literature resource database includes:
constructing a literature resource base library: establishing a topic word library for the fields related to the geography discipline, using the scientific and technological literature resources of the CNKI database as the data source, and searching publications by topic word using crawler technology, so as to obtain the corresponding scientific and technological literature data sets and form the literature resource base library;
constructing a keyword word set: performing word segmentation on the titles, keywords and abstracts of all crawled scientific and technological documents, deleting structural words, removing verbs and adjectives so that only nouns are kept, removing everyday words by matching against an everyday-word corpus for professional literature, and constructing the keyword word set;
updating the literature resource database: according to the keyword word set, using the scientific and technological literature resources of the CNKI database as the data source, applying crawler technology again to acquire more related scientific and technological literature information, and storing the acquired titles, abstracts, keywords and publication times in the database.
Further, word segmentation and data name recognition are performed on the crawled titles, keywords and abstracts using the word segmentation module and the entity word recognition module of a natural language processing algorithm package.
Further, the demand index is normalized so that the demand for different data resources can be compared and analyzed, thereby quantifying the demand for data resources.
Further, the standard data name list is constructed by browsing official data websites, consulting relevant monographs in the field, consulting experts and the like.
Further, the data demand trend graph is a data demand year trend graph.
In a second aspect, the present invention provides a geographical scientific data demand analysis device based on scientific literature resources, comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the steps of any of the methods described above when said computer program is executed.
In a third aspect, the present invention provides a computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of any of the methods described above.
Compared with the prior art, the invention has the beneficial effects that:
The geographical scientific data demand analysis method based on scientific and technological literature resources breaks through the limitations of questionnaire-based demand surveys, can comprehensively and objectively analyze researchers' demand for geographical scientific data in actual research, and provides more scientific and targeted guidance and support for scientific research.
The method also makes data acquisition targeted: data is collected and produced according to demand, the allocation of human, material and financial resources is optimized, funds are saved, and the greatest benefit and data utilization are obtained with the least funding.
Drawings
FIG. 1 is a flowchart of a method for analyzing demands of geoscience data based on scientific and technological literature resources according to embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of the geographical scientific data demand analysis device based on scientific and technological literature resources according to embodiment 2 of the present invention.
Detailed Description
Example 1:
the technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Referring to fig. 1, the geographical scientific data demand analysis method based on scientific literature resources provided in this embodiment mainly includes the following steps:
The literature resource database construction step comprises:
(1) Constructing the literature resource base library: a topic word library is established for the geography-related fields in the Classification and Code of Disciplines (GB/T 13745-2009); using the scientific and technological literature resources of the CNKI database as the data source, crawler technology is used to search publications by topic word, so that a number of publications are retrieved, for example the journal Economic Geography and the proceedings of an academic conference and seminar on national economic geography research, globalization and regional development; the corresponding scientific and technological literature data sets are thus obtained and form the literature resource base library;
(2) Constructing the keyword word set: word segmentation is performed on the titles, keywords and abstracts of all crawled scientific and technological documents; structural words such as conjunctions, prepositions and pronouns are deleted; verbs and adjectives are removed so that only nouns are kept; everyday words are removed by matching against an everyday-word corpus for professional literature; and the keyword word set is constructed;
(3) Updating the literature resource database: according to the keyword word set, with the scientific and technological literature resources of the CNKI database as the data source, crawler technology is applied again to acquire more related scientific and technological literature information, and the acquired titles, abstracts, keywords, publication times and other information are stored in the database.
In this way, starting from the geography discipline, the above sub-steps acquire scientific and technological literature resources related to the geographic field, which expands the search range and improves the efficiency of literature crawling.
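Sub-step (2) above can be sketched as follows. This is only a minimal illustration, assuming the word segmenter has already produced (word, part-of-speech) pairs; the tag `"n"` for nouns, the `EVERYDAY_WORDS` set and the function name `build_keyword_set` are hypothetical and do not come from the patent.

```python
from typing import Iterable, Set, Tuple

# Hypothetical everyday-word corpus; a real one would be far larger.
EVERYDAY_WORDS = {"method", "result", "paper"}


def build_keyword_set(tagged_tokens: Iterable[Tuple[str, str]]) -> Set[str]:
    """Keep nouns only, then drop words found in the everyday-word corpus."""
    keywords = set()
    for word, pos in tagged_tokens:
        if pos != "n":  # drops conjunctions, verbs, adjectives, etc.
            continue
        if word in EVERYDAY_WORDS:  # drops everyday vocabulary
            continue
        keywords.add(word)
    return keywords


# Assumed segmenter output for one title (tags are illustrative).
tokens = [("land", "n"), ("use", "n"), ("and", "c"),
          ("analyze", "v"), ("paper", "n"), ("urban", "a")]
keyword_set = build_keyword_set(tokens)
```

The resulting set would then serve as the query vocabulary for the second crawl in sub-step (3).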
The data name identification and normalization step comprises:
(1) Data name identification
Using the word segmentation module and the entity word recognition module of a natural language processing algorithm package, word segmentation and data name recognition are performed on the crawled titles, keywords and abstracts; duplicate data names within the same document are removed; and the data names are stored in the literature resource database.
(2) Data name normalization
A standard data name list is constructed by browsing official data websites, consulting relevant monographs in the field, consulting experts and the like. Through data name matching, the varied and inconsistent data name vocabulary is unified into standard, unique data names, so that the frequency with which each data resource appears in the scientific and technological literature can be counted scientifically and reasonably.
Standard data name representation example
Thus, the data names are extracted and normalized through the substeps, and a foundation is laid for data resource statistics and demand analysis in scientific and technical literature.
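The normalization and per-document de-duplication described above might look like the sketch below. The alias table, the example data names and the function name `normalize_names` are assumptions made for illustration; the patent builds its standard list from official websites, monographs and expert advice.

```python
# Hypothetical alias table: each variant spelling maps to one standard data name.
ALIASES = {
    "dem": "DEM",
    "digital elevation model": "DEM",
    "landsat image": "Landsat",
    "landsat": "Landsat",
}


def normalize_names(raw_names):
    """Map recognized names to standard names, keeping each name once per document."""
    standard_names = []
    for name in raw_names:
        standard = ALIASES.get(name.strip().lower())
        # Unmatched names are skipped; duplicates within the document are dropped.
        if standard is not None and standard not in standard_names:
            standard_names.append(standard)
    return standard_names


doc_names = normalize_names(
    ["DEM", "Digital Elevation Model", "Landsat image", "unknown"]
)
```

Exact-match lookup is the simplest form of data name matching; fuzzy matching could be substituted without changing the surrounding steps.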
A data resource demand calculation and analysis step comprising:
(1) Data demand computation
Word frequency statistics are performed on the data names according to the data name list in the literature resource database. A demand index is constructed from the total number of scientific and technological documents and the data name frequency and is then normalized, so that the demand for different data resources can be compared and the demand quantified. The demand index is calculated as follows:
In the above formula, $X_i$ is the demand index of data $i$, $N_i$ is the frequency of data name $i$, $n$ is the total number of scientific and technological literature resources, and $N_i / n$ is the ratio of the number of literature resources containing data name $i$ to the total number of literature resources.
The demand index can be accurately determined by the above.
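As a rough sketch of this calculation: the code below takes the ratio of documents containing each data name to the total number of documents and then applies min-max normalization. The min-max choice and the function name `demand_index` are assumptions; the text only states that the index is built from these two quantities and normalized.

```python
from collections import Counter


def demand_index(doc_data_names, total_docs):
    """Return a normalized demand index per data name (assumed min-max scaling)."""
    freq = Counter()
    for names in doc_data_names:
        freq.update(set(names))  # each name counts once per document
    if not freq:
        return {}
    ratios = {name: count / total_docs for name, count in freq.items()}
    lo, hi = min(ratios.values()), max(ratios.values())
    if hi == lo:  # all names equally frequent
        return {name: 1.0 for name in ratios}
    return {name: (r - lo) / (hi - lo) for name, r in ratios.items()}


# Four documents; the last one mentions no data name.
docs = [["DEM", "Landsat"], ["DEM"], ["DEM", "POI"], []]
index = demand_index(docs, total_docs=len(docs))
```

With this scaling the most requested data name receives 1.0 and the least requested 0.0, which makes indexes comparable across data resources.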
(2) Data demand analysis
A data name cloud chart is drawn from the data demand indexes, with the size of each term representing the size of the data demand. A yearly data demand trend chart is drawn by combining time information with the demand index of each data resource, and the data demand characteristics of different stages are compared and analyzed, so as to reflect the changing pattern and development trend of data demand and help the data center provide data resources more proactively.
Therefore, the data resource requirements can be quantified through the substeps, and the data requirement characteristics can be analyzed to reflect the change rule and the development trend of the data requirements.
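The yearly trend can be prepared with a small aggregation like the one below, whose output could be handed to any plotting library. It is a sketch under the assumption that each document record carries a publication year and a set of standardized data names; the per-year ratio is used without further normalization, and `yearly_ratio` is a hypothetical name.

```python
from collections import defaultdict


def yearly_ratio(records, data_name):
    """records: iterable of (publication_year, set_of_standard_data_names)."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for year, names in records:
        total[year] += 1
        if data_name in names:
            hits[year] += 1
    # Share of each year's documents that use the given data name.
    return {year: hits[year] / total[year] for year in sorted(total)}


year_records = [
    (2020, {"DEM"}), (2020, {"POI"}),
    (2021, {"DEM"}), (2021, {"DEM", "POI"}),
]
trend = yearly_ratio(year_records, "DEM")
```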
The scientific knowledge graph construction step comprises the following steps:
(1) Keyword cluster analysis
Keywords are clustered as objects according to their similarity, so that keywords with similar semantic relations are grouped into clusters. The concept of adhesive force is introduced to measure the contribution of each keyword in a cluster to the clustering process; the word with the greatest adhesive force in each cluster is selected as its central word, and each cluster is summarized and named with reference to its central word to obtain the main research directions. The adhesive force is calculated as follows:
In the above formula, $N(A_i)$ is the adhesive force of keyword $A_i$, and $E(A_i \to A_j)$ is the co-occurrence frequency of keyword $A_i$ with each of the other keywords $A_j$ in the cluster.
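The selection of a central word can be sketched as follows. The sketch assumes that adhesive force is the mean co-occurrence frequency of a keyword with the other keywords in its cluster, a definition commonly used in co-word analysis; whether the patent's formula is exactly this average is an assumption, as are the function names and the sample counts.

```python
def adhesion(keyword, cluster, cooccurrence):
    """Mean co-occurrence of `keyword` with the other cluster members (assumed form)."""
    others = [k for k in cluster if k != keyword]
    if not others:
        return 0.0
    total = sum(cooccurrence.get(frozenset((keyword, k)), 0) for k in others)
    return total / len(others)


def central_word(cluster, cooccurrence):
    """Pick the keyword with the greatest adhesive force in the cluster."""
    return max(cluster, key=lambda k: adhesion(k, cluster, cooccurrence))


# Illustrative symmetric co-occurrence counts, keyed by unordered keyword pairs.
cooc = {
    frozenset(("urbanization", "land use")): 8,
    frozenset(("urbanization", "population")): 5,
    frozenset(("land use", "population")): 2,
}
cluster = ["urbanization", "land use", "population"]
center = central_word(cluster, cooc)
```

Keying the counts by `frozenset` keeps the co-occurrence table symmetric without storing each pair twice.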
(2) Map construction
Based on the literature resource database, data fusion and relation extraction are performed on the main research directions formed by the clustering results, the keywords and the data resource names, and the demand index of each kind of data in each research direction is calculated. A scientific knowledge graph of research directions and data resources is constructed with the data demand indexes as edges. The structural evolution of the knowledge graph across different time windows is then analyzed to study how each research direction develops and how its data demand changes.
Thus, the relations between the research directions formed by clustering the scientific and technological literature resources and the data resources make it possible to statistically analyze the data resource demands of different research directions.
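One way to picture the resulting graph is as a weighted edge list between research directions and data resources. The sketch below assumes each document has already been assigned one research direction and uses the share of that direction's documents mentioning a data name as the edge weight; this weighting and the name `build_graph` are illustrative assumptions, not the patent's stated formula.

```python
from collections import defaultdict


def build_graph(direction_records):
    """direction_records: iterable of (research_direction, set_of_data_names)."""
    docs_per_direction = defaultdict(int)
    counts = defaultdict(int)
    for direction, names in direction_records:
        docs_per_direction[direction] += 1
        for name in names:
            counts[(direction, name)] += 1
    # Edge weight: fraction of the direction's documents that use the data name.
    return {
        edge: count / docs_per_direction[edge[0]]
        for edge, count in counts.items()
    }


direction_records = [
    ("urban geography", {"POI", "DEM"}),
    ("urban geography", {"POI"}),
    ("hydrology", {"DEM"}),
]
edges = build_graph(direction_records)
```

Building one such edge list per time window would allow the evolution of the graph structure to be compared across periods.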
In summary, compared with the prior art, the invention has the following technical advantages:
The geographical scientific data demand analysis method based on scientific and technological literature resources breaks through the limitations of questionnaire-based demand surveys, can comprehensively and objectively analyze researchers' demand for geographical scientific data in actual research, and provides more scientific and targeted guidance and support for scientific research.
The method also makes data acquisition targeted: data is collected and produced according to demand, the allocation of human, material and financial resources is optimized, funds are saved, and the greatest benefit and data utilization are obtained with the least funding.
Example 2:
referring to fig. 2, the device for analyzing the geoscience data demand based on the scientific and technological literature resource provided in this embodiment includes a processor 21, a memory 22, and a computer program 23 stored in the memory 22 and executable on the processor 21, for example, the geoscience data demand analysis program based on the scientific and technological literature resource. The processor 21, when executing the computer program 23, implements the steps of embodiment 1 described above, such as the steps shown in fig. 1.
Illustratively, the computer program 23 may be partitioned into one or more modules/units that are stored in the memory 22 and executed by the processor 21 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program 23 in the geoscience data demand analysis device based on scientific and technological literature resources.
The geoscience data demand analysis device based on scientific and technological literature resources may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. The device may include, but is not limited to, the processor 21 and the memory 22. Those skilled in the art will appreciate that FIG. 2 is merely an example of the device and does not limit it; the device may include more or fewer components than illustrated, may combine certain components, or may use different components. For example, the device may further include input-output devices, network access devices, a bus and the like.
The processor 21 may be a central processing unit (Central Processing Unit, CPU), or may be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 22 may be an internal storage unit of the geoscience data demand analysis device based on scientific and technological literature resources, such as a hard disk or internal memory of the device. The memory 22 may also be an external storage device of the device, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card or the like fitted to the device. Further, the memory 22 may include both an internal storage unit and an external storage device of the device. The memory 22 is used to store the computer program and the other programs and data required by the device. The memory 22 may also be used to temporarily store data that has been output or is to be output.
Example 3:
the present embodiment provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described in embodiment 1.
The computer readable medium can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer readable medium may even be paper or another suitable medium upon which the program is printed, such as by optically scanning the paper or other medium, then editing, interpreting, or otherwise processing as necessary, and electronically obtaining the program, which is then stored in a computer memory.
The above embodiments are only for illustrating the technical concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement the same, and are not intended to limit the scope of the present invention. All equivalent changes or modifications made in accordance with the essence of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A method for analyzing geoscience data demand based on scientific and technological literature resources, the method comprising:
constructing a literature resource database;
performing word segmentation and data name recognition on the documents in the literature resource database, removing duplicate data names within the same document, and storing the data names in the literature resource database;
constructing a standard data name list, and unifying varied and inconsistent data name vocabulary into standard, unique data names through data name matching;
performing word frequency statistics on the data names according to the data name list, and constructing a demand index from the total number of scientific and technological documents and the data name frequency;
drawing a data name word cloud using the demand index, in which the size of each word represents the magnitude of the data demand, and drawing a data demand trend graph by combining time information with the demand index of each data resource;
clustering the keywords as objects according to their similarity, so that keywords with similar semantic relations are gathered into groups; introducing adhesive force to measure the contribution of each keyword in a group to the clustering process; selecting the word with the largest adhesive force in each group as the central word; and summarizing and naming each group with reference to its central word to obtain the main research directions;
based on the literature resource database, performing data fusion and relation extraction on the main research directions formed from the clustering results, the keywords, and the data resource names; calculating the demand index of each type of data in each research direction; and constructing a scientific knowledge graph of the research directions and the data resources, with the data demand indexes as edges;
the calculation formula of the data demand index is as follows:
where $X_k$ is the demand index of the $k$-th data name; $N_k$ is the frequency of the $k$-th data name; $n$ is the total number of scientific and technological resources; and $N_k/n$ is the ratio of the number of scientific and technological resources containing the $k$-th data name to the total number of scientific and technological resources;
wherein constructing the literature resource database comprises the following steps:
constructing a literature resource base library: establishing a topic word library according to the fields related to the geography discipline, using the scientific and technological literature resources of a knowledge database as the data source, and searching publications by topic word with crawler technology to obtain a corresponding scientific and technological literature data set, thereby forming the literature resource base library;
constructing a keyword set: performing word segmentation on the titles, keywords, and abstracts of all crawled scientific and technological documents; deleting structural words; removing verbs and adjectives so that only nouns are retained; removing everyday words from the professional documents by matching against an everyday-word corpus; and constructing the keyword set;
updating the literature resource library: according to the keyword set, using the scientific and technological literature resources of the knowledge database as the data source, acquiring further related scientific and technological literature information with crawler technology, and storing the acquired title, abstract, keyword, and publication time information in the database.
2. The method for analyzing geoscience data demand based on scientific and technological literature resources according to claim 1, wherein the calculation formula of the adhesive force is:
where $N(A_i)$ is the adhesive force of keyword $A_i$, and $E(A_i \to A_j)$ is the co-occurrence frequency of keyword $A_i$ with each of the other keywords $A_j$ in the cluster.
3. The method for analyzing geoscience data demand based on scientific and technological literature resources according to claim 1, wherein word segmentation and data name recognition are performed on the crawled title, keyword, and abstract data by a word segmentation module and an entity word recognition module of a natural language processing algorithm package.
4. The method for analyzing geoscience data demand based on scientific and technological literature resources according to claim 1, wherein the demand index is normalized.
5. The method for analyzing geoscience data demand based on scientific and technological literature resources according to claim 1, wherein the standard data name list is constructed by browsing official data websites, consulting monographs in the relevant fields, and consulting experts.
6. The method for analyzing geoscience data demand based on scientific and technological literature resources according to claim 1, wherein the data demand trend graph is an annual data demand trend graph.
7. A geoscience data demand analysis device based on scientific and technological literature resources, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, carries out the steps of the method according to any one of claims 1 to 6.
8. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 6.
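As a non-limiting illustration, the demand index of claim 1 and the central-word selection of claims 1 and 2 may be sketched in Python as follows. The sketch assumes that the demand index is the plain ratio X_k = N_k / n and that the adhesive force of a keyword is its mean co-occurrence frequency with the other keywords in its group. Both forms are assumptions made for illustration, as are all function names, the alias mapping, and the sample data.

```python
from collections import Counter
from itertools import combinations


def demand_index(doc_data_names, alias_to_standard):
    """Return {standard data name: X_k}, ASSUMING X_k = N_k / n.

    doc_data_names: one list of recognized data names per document.
    alias_to_standard: hypothetical mapping from name variants to standard names.
    """
    n = len(doc_data_names)  # total number of documents
    if n == 0:
        return {}
    counts = Counter()
    for names in doc_data_names:
        # Unify variants, then count each data name at most once per document.
        counts.update({alias_to_standard.get(name, name) for name in names})
    return {name: freq / n for name, freq in counts.items()}


def cooccurrence(doc_keywords):
    """Count, for each unordered keyword pair, the documents containing both."""
    co = Counter()
    for keywords in doc_keywords:
        for pair in combinations(sorted(set(keywords)), 2):
            co[pair] += 1
    return co


def adhesive_force(group, co):
    """ASSUMED form: mean co-occurrence of each keyword with the rest of its group."""
    members = sorted(set(group))
    if len(members) < 2:
        return {word: 0.0 for word in members}
    return {
        a: sum(co[tuple(sorted((a, b)))] for b in members if b != a)
        / (len(members) - 1)
        for a in members
    }


def central_word(group, co):
    """Pick the keyword with the largest adhesive force (ties: alphabetical)."""
    scores = adhesive_force(group, co)
    return max(sorted(scores), key=scores.get)


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    aliases = {"digital elevation model": "DEM", "land use data": "land use"}
    docs = [["DEM", "digital elevation model", "land use"], ["DEM", "rainfall"]]
    print(demand_index(docs, aliases))

    keyword_docs = [["flood", "rainfall", "runoff"], ["flood", "rainfall"]]
    co = cooccurrence(keyword_docs)
    print(central_word(["flood", "rainfall", "runoff"], co))
```

Counting each standardized data name at most once per document mirrors the duplicate removal step of claim 1, so that N_k is the number of documents mentioning the k-th data name rather than the raw number of mentions. Computing the same index per publication year, or per research direction, would give the inputs for the trend graph and the knowledge graph edges, respectively.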
CN202211476732.9A 2022-11-23 2022-11-23 Geographical science data demand analysis method and device based on scientific and technological literature resources Active CN116127086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211476732.9A CN116127086B (en) 2022-11-23 2022-11-23 Geographical science data demand analysis method and device based on scientific and technological literature resources

Publications (2)

Publication Number Publication Date
CN116127086A CN116127086A (en) 2023-05-16
CN116127086B CN116127086B (en) 2023-09-19

Family

ID=86298118

Country Status (1)

Country Link
CN (1) CN116127086B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106909606A (en) * 2017-01-05 2017-06-30 南昌大学 A kind of method that discipline information dynamic framework is made based on atlas analysis
CN109726298A (en) * 2019-01-08 2019-05-07 上海市研发公共服务平台管理中心 Knowledge mapping construction method, system, terminal and medium suitable for scientific and technical literature
CN115309885A (en) * 2022-08-26 2022-11-08 上海大学 Knowledge graph construction, retrieval and visualization method and system for scientific and technological service

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11977925B2 (en) * 2020-08-04 2024-05-07 Smart Software, Inc. Clustering and visualizing demand profiles of resources

Also Published As

Publication number Publication date
CN116127086A (en) 2023-05-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant