CN111709238B - Web page geoscience correlation calculation method based on geoscience expert knowledge - Google Patents

Web page geoscience correlation calculation method based on geoscience expert knowledge

Info

Publication number
CN111709238B
Authority
CN
China
Prior art keywords
geological
data
webpage
subject
term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010497002.1A
Other languages
Chinese (zh)
Other versions
CN111709238A (en)
Inventor
李诗
陈建平
李志斌
刘苏庆
张亚光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences Beijing
Original Assignee
China University of Geosciences Beijing
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-06-04
Filing date
2020-06-04
Publication date
2023-04-07
Application filed by China University of Geosciences Beijing
Priority to CN202010497002.1A
Publication of CN111709238A
Application granted
Publication of CN111709238B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for calculating webpage geoscience correlation based on geoscience expert knowledge, comprising the following steps: 1. acquiring webpage data using a discovery algorithm; 2. preprocessing the data; 3. calculating the correlation degree between the webpage data and the keyword set; 4. introducing a keyword set frequency vector; 5. forming a webpage data-keyword weight matrix. The invention has the following advantages: the method selects related terms on the user's behalf and quantifies relevance according to an objective knowledge tree built from the expert thesaurus (narrative table), which overcomes the limitations of traditional relevance calculation methods; and because the knowledge structure tree is compiled by an expert team, it effectively avoids the omission of key terms that can occur when ordinary researchers pick keywords manually for relevance calculation.

Description

Web page geoscience correlation calculation method based on geoscience expert knowledge
Technical Field
The invention relates to the technical field of geoscience data calculation, in particular to a webpage geoscience correlation calculation method based on geoscience expert knowledge.
Background
In recent years, Internet technologies, including big data, have become central content and technical means in the information field, and many industries are actively researching how to apply big data in their own domains. The geological industry likewise needs the support and application of big data and related information technologies. The big data era is changing the traditional, experience-driven way of thinking: letting the data speak, finding answers in the data, and using data to decide and innovate will be the main theme of the geological industry's future development (Chen Jianping, Li Jing, Zhening, et al., Construction and application of geological cloud under the background of big data).
An important task for current geological big data applications is to make full use of existing geological text data: to obtain the required geological information from massive data promptly and comprehensively, and to analyze and mine the latent knowledge and value in the data. In terms of data management, storage, property rights, and similar aspects, geological data can be divided into three categories: core data, neighborhood data, and public data (Li Jing, Chen Jianping, Wang Xiang. Geological big data storage technology. Geological Bulletin of China, 2015, 34(8): 1589-1594). Geological public data are the various geology-related data resources on the Internet, such as geological news, prospecting results, and geological survey information published by portal websites in the various geological fields, as well as geological literature on academic websites.
With the rapid development of information technologies such as cloud computing, artificial intelligence, and deep learning, the research and application of big data have achieved breakthroughs in many fields. Application research on geological big data is an important component of China's big data strategy. Under the big data paradigm, raising the utilization of data resources and eliminating data islands has yielded many results and brings unprecedented opportunities to the geological industry.
With the development of the Internet and mobile network technology, data published every day in formats such as news, microblogs, and pictures are growing explosively; data are generated, stored, and updated ever faster; and users' personalized, topic-specific needs are increasingly prominent. In particular, the geological field has no crawler system with geological-topic functionality, so the geological data needed cannot be collected and selected from the Internet. Faced with massive existing data, valuable geological text data must be selected from a vast sea of data before accurate information extraction and knowledge mining can be carried out. Geological data exist widely on both wide area and local area networks, so rapidly discovering, locating, and selecting geological big data requires overcoming the multi-source, massive, complex, and unstructured nature of geological text resources. For geological data on wide area networks, the traditional search-engine approach struggles to query and acquire the geological data of interest efficiently and comprehensively.
The prior art (for example, Zhao Bing Man, Wang Wei Ya. Webpage academic algorithm research based on correlation analysis [J]. Electronic Test, 2018, (22): 70-71) judges correlation by word frequency: web pages are ranked against the search target by the number of times the input keywords appear in each page. In research practice, however, researchers who are unfamiliar with a subject area and lack systematic training often cannot readily grasp the relationships among its technical terms, so judging correlation by word frequency alone is limited and cannot objectively reflect the actual situation.
Abbreviations and Key term definitions as used in the present invention
Mining big data: counting, analyzing, and extracting latent information and knowledge from big data, building that knowledge into an intelligent, interlinked knowledge base, and enabling knowledge retrieval and computation.
Expert knowledge structure tree: a tree diagram, provided by the expert team, containing professional terms and the relationships among them.
Geological thesaurus (narrative table): a thesaurus, also called a subject-term list or retrieval dictionary, is a dictionary used for indexing, storing, and retrieving documents, and is the concrete embodiment of the descriptor method. It is a terminology control tool that converts the natural language used by indexers and searchers into a normalized, descriptor-based subject retrieval language.
A geological thesaurus constructed according to the geoscience expert knowledge nodes: using the knowledge structure tree provided by the expert team, a geological thesaurus containing hypernyms, hyponyms, related terms, top terms, and synonyms is constructed according to the lexical relations of broader terms (BT), preferred terms (PT), variant terms (VT), related terms (RT), and narrower terms (NT).
Logical structure tree computation: through computer programming, each descriptor in the geological thesaurus is stored in a tree structure according to the specified lexical relations. The keyword to be searched is compared against the generated logical structure tree, and the related terms in the tree are used as expanded search terms in a conventional search, so that the user also obtains geoscience data that do not contain the keyword itself but are closely related to it.
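The tree storage and query expansion just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the class `ThesaurusNode`, the helpers `add_narrower`, `find`, and `expand`, and the sample geological terms are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class ThesaurusNode:
    """One descriptor in the tree, with its lexical relations (hypothetical layout)."""
    term: str                                            # preferred term (PT)
    variants: List[str] = field(default_factory=list)    # variant terms (VT)
    related: List[str] = field(default_factory=list)     # related terms (RT)
    narrower: List["ThesaurusNode"] = field(default_factory=list)  # narrower terms (NT)
    broader: Optional["ThesaurusNode"] = None            # broader term (BT)


def add_narrower(parent: ThesaurusNode, child: ThesaurusNode) -> ThesaurusNode:
    """Attach child under parent and record the BT link."""
    child.broader = parent
    parent.narrower.append(child)
    return child


def find(root: ThesaurusNode, term: str) -> Optional[ThesaurusNode]:
    """Depth-first search for a descriptor by preferred or variant term."""
    stack = [root]
    while stack:
        node = stack.pop()
        if term == node.term or term in node.variants:
            return node
        stack.extend(node.narrower)
    return None


def expand(root: ThesaurusNode, term: str) -> List[Tuple[str, str]]:
    """Return (search term, relation) pairs to use as expanded search terms."""
    node = find(root, term)
    if node is None:
        return [(term, "PT")]  # unknown term: search it unchanged
    expanded = [(node.term, "PT")]
    expanded += [(v, "VT") for v in node.variants]
    if node.broader is not None:
        expanded.append((node.broader.term, "BT"))
    expanded += [(n.term, "NT") for n in node.narrower]
    expanded += [(r, "RT") for r in node.related]
    return expanded


# Illustrative terms only; a real tree would come from the expert team.
root = ThesaurusNode("ore deposit")
add_narrower(root, ThesaurusNode("porphyry copper deposit",
                                 variants=["porphyry Cu deposit"],
                                 related=["hydrothermal alteration"]))
print(expand(root, "porphyry Cu deposit"))
```

Keeping the relation label next to each expanded term lets a later scoring step assign a different weight to each relation type.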
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a webpage geoscience correlation calculation method based on geoscience expert knowledge that overcomes those shortcomings.
To achieve this purpose, the invention adopts the following technical scheme:
1. A webpage geoscience correlation calculation method based on geoscience expert knowledge, characterized by comprising the following steps:
S1: extracting webpage data;
S11: confirming the associated keywords related to the retrieval subject that are obtained from the geological thesaurus;
S12: performing web page retrieval using an API provided by a search engine;
S13: acquiring the URLs of the web page links;
S14: judging relevance to the geological subject according to the following steps;
calculating the correlation degree between the webpage data and the geological subject terms:
the keyword set with a knowledge hierarchy relationship is regarded as a document $D_i$; when factors such as weight are not considered, the correlation degree of webpage data $D_j$ with the geological subject term set is calculated as
Figure GDA0004037693300000041
In the formula:
$k$: the serial number, within document $D_i$, of a retrieval-subject keyword obtained from the geological thesaurus organized by the expert knowledge structure tree;
$m$: the number of terms in the geological thesaurus related to the retrieval subject;
$d_{kj}$: the number of times the keyword with serial number $k$ appears in webpage data $D_j$;
introducing a keyword set frequency vector:
a weighted set of geological subject terms $k_i$ ($i = 1, 2, \ldots, m$) is obtained from the thesaurus, and the correlation of web document $C_j$ is calculated as:
$REL_D = A_{title} \cdot W^{T} + B_{content} \cdot W^{T}$
$A_{title} = (A_{j1}, A_{j2}, \ldots, A_{jm})$: the vector of the numbers of times each $k_i$ appears in the title of webpage data $D_j$;
$B_{content} = (B_{j1}, B_{j2}, \ldots, B_{jm})$: the vector of the numbers of times each geological subject term $k_i$ appears in web document $C_j$;
$W = (W_1, W_2, \ldots, W_m)$: the weight vector of the keywords;
forming a webpage data-geological subject term weight matrix;
the total keyword weight of each piece of webpage data is calculated and a weight threshold is set according to actual application requirements, which decides whether the webpage is relevant to the subject and forms the webpage data-geological subject term weight matrix:
Figure GDA0004037693300000042
$A_{title} = (A_{j1}, A_{j2}, \ldots, A_{jm})$: the vector of the numbers of times each $k_i$ appears in the title of webpage data $D_j$;
$B_{content} = (B_{j1}, B_{j2}, \ldots, B_{jm})$: the vector of the numbers of times each geological subject term $k_i$ appears in web document $C_j$;
$Q$: a position adjustment parameter, reflecting that a keyword appearing in the title is more relevant than one appearing in the abstract;
$W = (W_1, W_2, \ldots, W_m)$: the weight vector of the geological subject terms; preferred terms and variant terms take the value 1; broader terms take the value 0.5; narrower terms take the value 0.8; related terms take the value 0.5;
the values of $Q$ and $W$ are adjusted according to the amount of data actually required;
S15: determining the weight threshold: the threshold is set according to actual application requirements, raised when the screening returns too many geoscience web pages and lowered when it returns too few;
S16: crawling webpage data using the Beautiful Soup library in Python;
S2: preprocessing the data by cleaning the webpage data acquired by the discovery algorithm;
S21: duplicate checking, which inspects file name and size information and removes identical files;
S22: content and quality checking, performed by manual confirmation to ensure that the finally uploaded data meet the requirements; the content finally obtained for calculating the correlation comprises: title, summary, and link address.
Compared with the prior art, the invention has the advantages that:
The method selects related terms on the user's behalf and quantifies relevance according to an objective knowledge tree built from the expert thesaurus, which overcomes the limitations of traditional relevance calculation methods. In addition, because the knowledge structure tree is compiled by an expert team, it effectively avoids the omission of key terms that can occur when ordinary researchers pick keywords manually for relevance calculation.
Drawings
FIG. 1 is a flowchart illustrating a web page data extraction process according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a web page data cleansing process according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below by referring to the accompanying drawings and embodiments.
1. Iterate through the flow shown in FIG. 1;
Web page retrieval is performed on the retrieval-subject keywords obtained from the geological thesaurus organized by the expert knowledge structure tree, using an API (application programming interface) provided by any mainstream search engine (such as Google, Baidu, Congo and the like), and the webpage data are crawled using the Beautiful Soup library in Python.
Judging relevance to the geological subject: a frequency vector of the geological subject term set is introduced to calculate the correlation degree between the webpage data and the geological subject term set, a webpage data-geological subject term weight matrix is formed, and relevance is judged.
Weight threshold: the threshold is set according to actual application requirements; when the screening returns too many geoscience web pages the threshold is raised appropriately, and when it returns too few the threshold is lowered appropriately.
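The fields later used for the correlation calculation (title, summary, link address) can be pulled from a fetched page roughly as follows. This is only a sketch: to stay dependency-free it swaps in Python's standard-library `html.parser` in place of the Beautiful Soup library named in the text, it assumes the summary is carried in a `<meta name="description">` tag, and `PageFields`, `extract_fields`, and the example URL are hypothetical.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collects the <title> text and the meta description of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.summary = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            # Assumption: the page summary lives in the description meta tag.
            if (attributes.get("name") or "").lower() == "description":
                self.summary = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data  # title text may arrive in several chunks


def extract_fields(html: str, url: str) -> dict:
    """Return the title, summary, and link address for one fetched page."""
    parser = PageFields()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(),
            "summary": parser.summary.strip(),
            "link": url}


page = ('<html><head><title> Copper prospecting results </title>'
        '<meta name="description" content="Survey finds new ore body.">'
        '</head><body><p>text</p></body></html>')
print(extract_fields(page, "https://example.org/news/1"))
```

Fetching the page and calling the search engine API are left out on purpose, since the text does not specify which API or request format is used.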
2. Preprocessing data;
As shown in FIG. 2, the webpage data collected by the discovery algorithm are cleaned:
The duplicate check mainly inspects information such as name and size, and removes identical files (for example, the same file with the same name in different storage locations, or the same file with different names and different stage states, etc.). The content and quality checks are determined according to the needs of the task. They are carried out by manual confirmation to ensure that the finally uploaded data meet the requirements. The content finally obtained for calculating the correlation comprises: title, summary, and link address.
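A duplicate check on name and size might look like the sketch below. The record layout (`name`, `size`, `path` keys) and the case-insensitive name comparison are assumptions made for the example; the text only says that name and size information is inspected.

```python
def remove_duplicates(records):
    """Keep the first record for each (name, size) pair, in input order.

    records: iterable of dicts with 'name', 'size', and 'path' keys
    (a hypothetical layout chosen for this sketch).
    """
    seen = set()
    unique = []
    for record in records:
        # Assumption: names are compared case-insensitively after trimming.
        key = (record["name"].strip().lower(), record["size"])
        if key in seen:
            continue  # same name and size already kept: treat as the same file
        seen.add(key)
        unique.append(record)
    return unique


files = [
    {"name": "report.pdf", "size": 1024, "path": "/a/report.pdf"},
    {"name": "Report.pdf", "size": 1024, "path": "/b/report.pdf"},
    {"name": "report.pdf", "size": 2048, "path": "/c/report.pdf"},
]
print([f["path"] for f in remove_duplicates(files)])
```

A name-and-size key cannot catch the same file stored under a different name; detecting that case would need something extra, such as a content hash, which the text does not describe.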
3. Calculating the correlation degree of the webpage data and the keyword set;
The keyword set with a knowledge hierarchy relationship is regarded as a document $D_i$; when factors such as weight are not considered, the correlation degree of webpage data $D_j$ with the keyword set is calculated as
Figure GDA0004037693300000061
$k$: the serial number, within document $D_i$, of a retrieval-subject keyword obtained from the geological thesaurus organized by the expert knowledge structure tree
$m$: the number of terms in the geological thesaurus related to the retrieval subject
$d_{kj}$: the number of times the keyword with serial number $k$ appears in webpage data $D_j$
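The formula itself survives here only as an image reference, so the sketch below assumes the simplest reading consistent with the definitions of k, m, and d_kj: the unweighted correlation is the sum of the occurrence counts d_kj over the m keywords. Both that reading and the case-insensitive substring counting are assumptions, and the function names are hypothetical.

```python
def count_occurrences(text: str, keyword: str) -> int:
    """d_kj: non-overlapping, case-insensitive occurrences of keyword in text."""
    return text.lower().count(keyword.lower())


def unweighted_relevance(page_text: str, keywords) -> int:
    """Assumed form of the unweighted correlation: sum of d_kj for k = 1..m."""
    return sum(count_occurrences(page_text, keyword) for keyword in keywords)


text = ("Porphyry copper deposits occur with copper sulfides "
        "and hydrothermal alteration.")
print(unweighted_relevance(text, ["copper", "alteration", "skarn"]))
```

Plain substring counting also matches keywords inside longer words, so a tokenizer or word-boundary match may be preferable for real pages.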
4. Introducing a keyword set frequency vector;
A weighted keyword set $k_i$ ($i = 1, 2, \ldots, m$) is obtained from the thesaurus, and the correlation of web document $C_j$ is calculated as:
$REL_D = A_{title} \cdot W^{T} + B_{content} \cdot W^{T}$
$A_{title} = (A_{j1}, A_{j2}, \ldots, A_{jm})$: the vector of the numbers of times each $k_i$ appears in the title of webpage data $D_j$
$B_{content} = (B_{j1}, B_{j2}, \ldots, B_{jm})$: the vector of the numbers of times each keyword $k_i$ appears in document $C_j$
$W = (W_1, W_2, \ldots, W_m)$: the weight vector of the keywords
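As a worked instance of the weighted formula, the sketch below builds the title and content count vectors and takes their dot products with the weight vector, reading "W t" as the transpose of W. The helper names and the sample keywords and weights are illustrative assumptions, not values from the patent.

```python
def count_vector(text: str, keywords) -> list:
    """Occurrence counts of each keyword in text (case-insensitive substrings)."""
    lowered = text.lower()
    return [lowered.count(keyword.lower()) for keyword in keywords]


def dot(a, b) -> float:
    """Dot product of two equal-length vectors, i.e. a row vector times W transposed."""
    return sum(x * y for x, y in zip(a, b))


def weighted_relevance(title: str, content: str, keywords, weights) -> float:
    """REL_D = A_title . W^T + B_content . W^T for one page."""
    a_title = count_vector(title, keywords)
    b_content = count_vector(content, keywords)
    return dot(a_title, weights) + dot(b_content, weights)


keywords = ["copper deposit", "ore deposit", "porphyry"]
weights = [1.0, 0.5, 0.8]  # illustrative weights only
score = weighted_relevance(
    "Porphyry copper deposit models",
    "A porphyry system hosts a copper deposit; each ore deposit differs.",
    keywords,
    weights,
)
print(score)
```

In this example the title vector is (1, 0, 1) and the content vector is (1, 1, 1), so the score is 1.8 + 2.3, about 4.1.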
5. Forming a web page data-keyword weight matrix
The total keyword weight of each piece of webpage data is calculated and a weight threshold is set according to actual application requirements, which decides whether the webpage is relevant to the subject and forms the webpage data-keyword weight matrix:
Figure GDA0004037693300000071
$A_{title} = (A_{j1}, A_{j2}, \ldots, A_{jm})$: the vector of the numbers of times each $k_i$ appears in the title of webpage data $D_j$
$B_{content} = (B_{j1}, B_{j2}, \ldots, B_{jm})$: the vector of the numbers of times each keyword $k_i$ appears in document $C_j$
$Q$: a position adjustment parameter, reflecting that a keyword appearing in the title is more relevant than one appearing in the abstract.
$W = (W_1, W_2, \ldots, W_m)$: the weight vector of the keywords; preferred terms and variant terms take the value 1; broader terms take the value 0.5; narrower terms take the value 0.8; related terms take the value 0.5.
The values of $Q$ and $W$ can be adjusted according to the amount of data actually required.
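The matrix formula is also only an image reference in this text, so the following sketch rests on an assumption about its form: each entry is taken to be (Q × title count + content count) × term weight, and a page counts as relevant when its row total reaches the threshold. The relation weights are the ones stated above; the default `q=2.0`, the step size, and all function names are hypothetical.

```python
# Term weights by lexical relation, as stated in the text.
RELATION_WEIGHT = {"PT": 1.0, "VT": 1.0, "BT": 0.5, "NT": 0.8, "RT": 0.5}


def weight_matrix(pages, terms, q=2.0):
    """Rows are pages, columns are terms.

    pages: list of (title, content) pairs
    terms: list of (keyword, relation) pairs
    q:     position adjustment parameter Q (value assumed for illustration)
    Assumed entry: (q * title_count + content_count) * relation weight.
    """
    matrix = []
    for title, content in pages:
        row = []
        for keyword, relation in terms:
            a = title.lower().count(keyword.lower())
            b = content.lower().count(keyword.lower())
            row.append((q * a + b) * RELATION_WEIGHT[relation])
        matrix.append(row)
    return matrix


def select_relevant(matrix, threshold):
    """Indices of pages whose total keyword weight reaches the threshold."""
    return [j for j, row in enumerate(matrix) if sum(row) >= threshold]


def adjust_threshold(threshold, n_selected, min_pages, max_pages, step=0.5):
    """Raise the threshold when too many pages pass, lower it when too few."""
    if n_selected > max_pages:
        return threshold + step
    if n_selected < min_pages:
        return max(0.0, threshold - step)
    return threshold


pages = [("Gold deposit geology", "The gold deposit lies in a shear zone."),
         ("Weather report", "Rain expected.")]
terms = [("gold deposit", "PT"), ("shear zone", "RT")]
m = weight_matrix(pages, terms)
print(m, select_relevant(m, threshold=1.0))
```

Repeating `select_relevant` and `adjust_threshold` until the number of selected pages falls inside the wanted range is one way to realize the threshold tuning described in step 1.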
It will be appreciated by those of ordinary skill in the art that the examples described herein are intended to assist the reader in understanding the manner in which the invention is practiced, and it is to be understood that the scope of the invention is not limited to such specifically recited statements and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims (1)

1. A webpage geoscience correlation calculation method based on geoscience expert knowledge, characterized by comprising the following steps:
S1: extracting webpage data;
S11: confirming the associated keywords related to the retrieval subject that are obtained from the geological thesaurus;
S12: performing web page retrieval using an API provided by a search engine;
S13: acquiring the URLs of the web page links;
S14: judging relevance to the geological subject according to the following steps;
calculating the correlation degree between the webpage data and the geological subject terms:
the keyword set with a knowledge hierarchy relationship is regarded as a document $D_i$; when the weight factor is not considered, the correlation degree of webpage data $D_j$ with the geological subject term set is calculated as
Figure FDA0004072167660000011
In the formula:
$k$: the serial number, within document $D_i$, of a retrieval-subject keyword obtained from the geological thesaurus organized by the expert knowledge structure tree;
$m$: the number of terms in the geological thesaurus related to the retrieval subject;
$d_{kj}$: the number of times the keyword with serial number $k$ appears in webpage data $D_j$;
introducing a keyword set frequency vector:
a weighted set of geological subject terms $k_i$ ($i = 1, 2, \ldots, m$) is obtained from the thesaurus, and the correlation of web document $C_j$ is calculated as:
$REL_D = A_{title} \cdot W^{T} + B_{content} \cdot W^{T}$
$A_{title} = (A_{j1}, A_{j2}, \ldots, A_{jm})$: the vector of the numbers of times each $k_i$ appears in the title of webpage data $D_j$;
$B_{content} = (B_{j1}, B_{j2}, \ldots, B_{jm})$: the vector of the numbers of times each geological subject term $k_i$ appears in web document $C_j$;
$W = (W_1, W_2, \ldots, W_m)$: the weight vector of the keywords;
forming a webpage data-geological subject term weight matrix;
the total keyword weight of each piece of webpage data is calculated and a weight threshold is set according to actual application requirements, which decides whether the webpage is relevant to the subject and forms the webpage data-geological subject term weight matrix:
Figure FDA0004072167660000021
$A_{title} = (A_{j1}, A_{j2}, \ldots, A_{jm})$: the vector of the numbers of times each $k_i$ appears in the title of webpage data $D_j$;
$B_{content} = (B_{j1}, B_{j2}, \ldots, B_{jm})$: the vector of the numbers of times each geological subject term $k_i$ appears in web document $C_j$;
$Q$: a position adjustment parameter, reflecting that a keyword appearing in the title is more relevant than one appearing in the abstract;
$W = (W_1, W_2, \ldots, W_m)$: the weight vector of the geological subject terms; preferred terms and variant terms take the value 1; broader terms take the value 0.5; narrower terms take the value 0.8; related terms take the value 0.5;
the values of $Q$ and $W$ are adjusted according to the amount of data actually required;
S15: determining the weight threshold: the threshold is set according to actual application requirements, raised when the screening returns too many geoscience web pages and lowered when it returns too few;
S16: crawling webpage data using the Beautiful Soup library in Python;
S2: preprocessing the data by cleaning the webpage data acquired by the discovery algorithm;
S21: duplicate checking, which inspects file name and size information and removes identical files;
S22: content and quality checking, performed by manual confirmation to ensure that the finally uploaded data meet the requirements; the content finally obtained for calculating the correlation comprises: title, summary, and link address.
CN202010497002.1A 2020-06-04 2020-06-04 Web page geoscience correlation calculation method based on geoscience expert knowledge Active CN111709238B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010497002.1A CN111709238B (en) 2020-06-04 2020-06-04 Web page geoscience correlation calculation method based on geoscience expert knowledge

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010497002.1A CN111709238B (en) 2020-06-04 2020-06-04 Web page geoscience correlation calculation method based on geoscience expert knowledge

Publications (2)

Publication Number Publication Date
CN111709238A (en) 2020-09-25
CN111709238B (en) 2023-04-07

Family

ID=72539334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010497002.1A Active CN111709238B (en) 2020-06-04 2020-06-04 Web page geoscience correlation calculation method based on geoscience expert knowledge

Country Status (1)

Country Link
CN (1) CN111709238B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2021104429A4 (en) * 2021-07-22 2021-09-16 Chinese Academy Of Surveying And Mapping Machine Translation Method for French Geographical Names

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679825A (en) * 2015-01-06 2015-06-03 中国农业大学 Web text-based acquiring and screening method of seismic macroscopic anomaly information
CN106156272A (en) * 2016-06-21 2016-11-23 北京工业大学 A kind of information retrieval method based on multi-source semantic analysis
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN110309246A (en) * 2019-05-24 2019-10-08 中国地质调查局发展研究中心 A kind of method and device thereof internet geologic data retrieval and obtained

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10122720B2 (en) * 2017-02-07 2018-11-06 Plesk International Gmbh System and method for automated web site content analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
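Research on geoscience text information extraction technology (地学文本信息提取技术研究); Wu Wen (吴文); China Masters' Theses Full-text Database; 2007-02-15; full text *
Research on text data mining in quantitative assessment of mineral resources (矿产资源定量评价中文本数据挖掘研究); Chen Jianping (陈建平) et al.; Computing Techniques for Geophysical and Geochemical Exploration; 2015-08-31; full text *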

Also Published As

Publication number Publication date
CN111709238A (en) 2020-09-25

Similar Documents

Publication Publication Date Title
CN101630314B (en) Semantic query expansion method based on domain knowledge
CN104239513B (en) A kind of semantic retrieving method of domain-oriented data
CN101452463A (en) Method and apparatus for directionally grabbing page resource
US20100131485A1 (en) Method and system for automatic construction of information organization structure for related information browsing
US10558707B2 (en) Method for discovering relevant concepts in a semantic graph of concepts
CN106227788A (en) Database query method based on Lucene
CN109643315B (en) Method, system, computer device and computer readable medium for automatically generating Chinese ontology based on structured network knowledge
CN110555154B (en) Theme-oriented information retrieval method
CN108710672B (en) Theme crawler method based on incremental Bayesian algorithm
Grover et al. Comparative analysis of pagerank and hits algorithms
CN105389328B (en) A kind of extensive open source software searching order optimization method
CN111709238B (en) Web page geoscience correlation calculation method based on geoscience expert knowledge
Plangprasopchok et al. Exploiting social annotation for automatic resource discovery
CN117591738A (en) Information retrieval system and method based on cloud service
Yang An ontological website models-supported search agent for web services
CN112100500A (en) Example learning-driven content-associated website discovery method
Abass et al. Information retrieval models, techniques and applications
Priya et al. Design and development of an ontology based personal web search engine
CN110309246A (en) A kind of method and device thereof internet geologic data retrieval and obtained
Gupta et al. A system's approach towards domain identification of web pages
Almadhoun et al. Effects of using arabic web pages in building rank estimation algorithm for google search engine results page.
Bute et al. Evaluating search effectiveness of some selected search engines
Hoeber et al. Automatic topic learning for personalized re-ordering of web search results
Wardekar et al. SmartCrawler: A Personalized Web Search for Relevant Web Pages
Yang An ontology-supported website model for web search agents

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant