CN111709238B

CN111709238B - Web page geoscience correlation calculation method based on geoscience expert knowledge

Info

Publication number: CN111709238B
Application number: CN202010497002.1A
Authority: CN
Inventors: 李诗; 陈建平; 李志斌; 刘苏庆; 张亚光
Original assignee: China University of Geosciences Beijing
Current assignee: China University of Geosciences Beijing
Priority date: 2020-06-04
Filing date: 2020-06-04
Publication date: 2023-04-07
Anticipated expiration: 2040-06-04
Also published as: CN111709238A

Abstract

The invention discloses a method for calculating the geoscience correlation of web pages based on geoscience expert knowledge, comprising the following steps: 1. acquiring web page data with a discovery algorithm; 2. preprocessing the data; 3. calculating the correlation degree between the web page data and the keyword set; 4. introducing a keyword-set frequency vector; 5. forming a web page data-keyword weight matrix. The advantages of the invention are as follows: the method selects related terms on the user's behalf and quantifies the correlation according to an objective knowledge tree built from an expert thesaurus, which overcomes the limitations of traditional correlation calculation methods; and because the knowledge structure tree is compiled by an expert team, it effectively avoids the omission of key terms that can occur when ordinary researchers choose keywords manually for correlation calculation.

Description

Web page geoscience correlation calculation method based on geoscience expert knowledge

Technical Field

The invention relates to the technical field of geoscience data calculation, in particular to a webpage geoscience correlation calculation method based on geoscience expert knowledge.

Background

In recent years, internet technologies, including big data, have become central topics and technical tools in the information field, and every industry is actively studying how to apply big data in its own domain. The development of the geological industry likewise needs the support of big data and related information technologies. The arrival of the big data era is changing the traditional, experience-based way of thinking; the main theme of the geological industry's future development is to let the data speak, find answers in the data, and use data for decision-making and innovation (Chen Jianping, Li Jing, zhening, et al. Construction and application of geological cloud under the background of big data).

How to make full use of existing geological text big data, obtain the required geological information from massive data promptly and comprehensively, and analyze and mine the latent knowledge and value in that data is an important task in the current application of geological big data. In terms of management, storage, and property rights, geological data can be divided into three categories: core data, neighborhood data, and public data (Li Jing, Chen Jianping, Wang Xiang. Geological big data storage technology. Geological report, 2015, 34(8): 1589-1594). Geological public data refers to the various geology-related data resources on the internet, such as the geological news, prospecting results, and geological survey information published by portal websites in the various geological fields, as well as the geological literature resources of academic websites.

With the rapid development of information technologies such as cloud computing, artificial intelligence, and deep learning, the research and application of big data have made breakthroughs in many fields. Applied research on geological big data is an important component of China's big data strategy. Under the big data concept, the utilization of data resources has improved, data islands are being eliminated, and a large number of results have been obtained, bringing unprecedented opportunities to the geological industry.

With the development of internet and mobile network technology, the volume of data published every day in formats such as news, microblogs, and pictures is growing explosively, and data is generated, stored, and updated ever faster. Users' personalized, topic-specific needs are increasingly prominent. In the geological field in particular, no crawler system with geological-theme capability exists, so the geological data needed cannot be acquired and selected from the internet. Faced with this mass of data, valuable geological text data must first be selected from the vast sea of data before accurate information extraction and knowledge mining can be carried out. Geological data is widely distributed across wide area networks and local area networks, so rapidly discovering, locating, and selecting geological big data requires overcoming the multi-source, massive, complex, and unstructured nature of geological text resources. For geological data on wide area networks, the traditional search engine approach struggles to query and acquire the geological data of interest efficiently and comprehensively.

The prior art (for example, Zhao Bing Man, Wang Wei ya. Webpage academic algorithm research based on correlation analysis [J]. Electronic Test, 2018, (22): 70-71) judges correlation by word frequency: web pages are ranked by relevance to the search target according to the number of times the input keywords appear in each page. In research practice, however, researchers who are unfamiliar with a subject area and lack systematic training often cannot readily grasp the relationships among its technical terms, so judging correlation by word frequency alone has limitations and cannot objectively reflect the actual situation.

Abbreviations and Key term definitions as used in the present invention

Big data mining: counting, analyzing, and extracting latent information and knowledge from big data, building that knowledge into an intelligent, interlinked knowledge base, and enabling knowledge retrieval and computation.

Expert knowledge structure tree: a tree diagram, provided by the expert team, containing professional vocabulary and the relationships among the terms.

Geological thesaurus (also rendered "geological narrative table"): a thesaurus, also called a subject vocabulary or retrieval dictionary, is a dictionary used for indexing, storing, and retrieving documents, and is the concrete embodiment of the descriptor method. It is a term control tool that converts the natural language used by indexers and searchers into a normalized, descriptor-based subject retrieval language.

A geological thesaurus constructed from the geoscience expert knowledge nodes: using the knowledge structure tree provided by the expert team, a geological thesaurus containing hypernyms, hyponyms, related words, top terms, and synonyms is constructed according to the lexical relations of broader terms (BT), preferred terms (PT), variant terms (VT), related terms (RT), and narrower terms (NT).

Logical structure tree computation: each descriptor in the geological thesaurus is stored in a tree structure, by computer program, according to the specified lexical relations. The keyword to be searched is compared against the generated logical structure tree, and the related terms in the tree are used as expanded search terms in a conventional search, so that the user can also obtain geoscience data that does not contain the keyword itself but is closely related to it.
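The query expansion described under "Logical structure tree computation" can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the `ThesaurusEntry` class, the `expand_query` function, and the sample terms are all hypothetical, and the lexical relations are stored as flat lists on each entry rather than as a full tree.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One preferred term (PT) with its lexical relations (hypothetical layout)."""
    term: str
    broader: list = field(default_factory=list)   # BT: broader terms
    narrower: list = field(default_factory=list)  # NT: narrower terms
    related: list = field(default_factory=list)   # RT: related terms
    variants: list = field(default_factory=list)  # VT: variant terms


def expand_query(keyword, thesaurus):
    """Return the keyword followed by every term linked to it in the thesaurus.

    Keywords that are not in the thesaurus are returned unexpanded.
    """
    entry = thesaurus.get(keyword)
    if entry is None:
        return [keyword]
    expanded = [keyword]
    for group in (entry.variants, entry.broader, entry.narrower, entry.related):
        for term in group:
            if term not in expanded:  # skip duplicates, keep first-seen order
                expanded.append(term)
    return expanded


# Hypothetical sample data, for illustration only.
sample_thesaurus = {
    "granite": ThesaurusEntry(
        term="granite",
        broader=["igneous rock"],
        narrower=["biotite granite"],
        related=["pegmatite"],
        variants=["granitic rock"],
    )
}

print(expand_query("granite", sample_thesaurus))
```

Each expanded term would then be submitted to the conventional search, so that pages mentioning only "pegmatite", for instance, are still retrieved for a "granite" query.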

Disclosure of Invention

To address the defects of the prior art, the invention provides a method for calculating the geoscience correlation of web pages based on geoscience expert knowledge.

In order to realize the purpose, the technical scheme adopted by the invention is as follows:

1. A web page geoscience correlation calculation method based on geoscience expert knowledge, characterized by comprising the following steps:

S1: extracting web page data;

S11: confirming the keywords associated with the retrieval subject, obtained from the geological thesaurus;

S12: performing web page retrieval using an API provided by a search engine;

S13: acquiring the URLs of the web page links;

S14: judging the relevance to the geological theme according to the following steps;

calculating the correlation degree between the web page data and the geological subject terms:

treating the keyword set with a knowledge hierarchy relationship as a document $D_i$; when factors such as weight are not considered, the formula for the correlation degree between web page data $D_j$ and the geological subject term set is

In the formula:

$k$: the serial number, in document $D_i$, of a keyword associated with the retrieval subject and obtained from the geological thesaurus organized by the expert knowledge structure tree;

$m$: the number of terms in the geological thesaurus related to the retrieval subject;

$d_{kj}$: the number of times the keyword with serial number $k$ appears in web page data $D_j$;

introducing a keyword-set frequency vector:

for the set of weighted geological subject terms $k_i$ ($i = 1, 2, \dots, m$) obtained from the thesaurus and web document $C_j$, the correlation calculation formula is as follows:

$$REL_D = A_{title} W^{T} + B_{content} W^{T}$$

$A_{title} = (A_{j1}, A_{j2}, \dots, A_{jm})$: the vector of the numbers of times each $k_i$ appears in the title of web page data $D_j$;

$B_{content} = (B_{j1}, B_{j2}, \dots, B_{jm})$: the vector of the numbers of times each geological subject term $k_i$ appears in web document $C_j$;

$W = (W_1, W_2, \dots, W_m)$: the weight vector of the keywords;

forming a web page data-geological subject term weight matrix;

the total keyword weight of each item of web page data is calculated, and a weight threshold is determined according to actual application requirements, so that the relevance of the web page theme is judged and a web page data-geological subject term weight matrix is formed:

$B_{content} = (B_{j1}, B_{j2}, \dots, B_{jm})$: the vector of the numbers of times each geological subject term $k_i$ appears in web document $C_j$;

$Q$: a position adjustment parameter, reflecting that keywords appearing in the title are more relevant than those appearing in the abstract;

$W = (W_1, W_2, \dots, W_m)$: the weight vector of the geological subject terms; preferred terms and variant terms take the value 1; broader terms take the value 0.5; narrower terms take the value 0.8; related terms take the value 0.5;

the values of $Q$ and $W$ are adjusted according to the volume of data that actually needs to be acquired;

S15: determining a weight threshold: the threshold is determined according to actual application requirements; it is raised when too many web pages are screened as geoscience-related and lowered when too few are;

S16: crawling web page data using the BeautifulSoup library in Python;

S2: preprocessing the data, that is, cleaning the web page data acquired by the discovery algorithm;

S21: duplicate checking, in which name and size information is examined and identical files are removed;

S22: content and quality checking, which is carried out by manual confirmation to ensure that the data finally uploaded meets the requirements; the content finally obtained for calculating the correlation comprises: title, summary, and link address.

Compared with the prior art, the invention has the advantages that:

the method selects related terms on the user's behalf and quantifies the correlation according to an objective knowledge tree built from an expert thesaurus, which overcomes the limitations of traditional correlation calculation methods; at the same time, because the knowledge structure tree is compiled by an expert team, it effectively avoids the omission of key terms that can occur when ordinary researchers choose keywords manually for correlation calculation.

Drawings

FIG. 1 is a flowchart illustrating a web page data extraction process according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a web page data cleansing process according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below by referring to the accompanying drawings and embodiments.

1. Iteration is performed through the flow shown in fig. 1;

Using the API provided by any mainstream search engine (such as Google, Baidu, congo and the like), web page retrieval is performed on the keywords associated with the retrieval subject, which are obtained from the geological thesaurus organized by the expert knowledge structure tree, and the web page data is crawled using the BeautifulSoup library in Python:

Judging the relevance to the geological subject: a frequency vector of the geological subject term set is introduced to calculate the correlation degree between the web page data and the geological subject term set, a web page data-geological subject term weight matrix is formed, and the correlation is judged.

Weight threshold: determined according to actual application requirements. When too many web pages are screened as geoscience-related, the threshold is raised appropriately; when too few are, the threshold is lowered appropriately.
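The threshold adjustment rule can be illustrated with a small sketch. The text does not say how many pages count as too many or too few, nor by how much the threshold moves, so `min_pages`, `max_pages`, and `step` below are hypothetical parameters introduced only for illustration.

```python
def adjust_threshold(threshold, n_pages, min_pages, max_pages, step=0.5):
    """Raise the threshold when too many pages pass, lower it when too few do.

    min_pages, max_pages and step are assumed tuning parameters; the source
    only says the threshold follows actual application requirements.
    """
    if n_pages > max_pages:
        return threshold + step
    if n_pages < min_pages:
        return max(0.0, threshold - step)  # never drop below zero
    return threshold


# 500 pages passed but at most 200 are wanted, so the threshold goes up.
print(adjust_threshold(3.0, n_pages=500, min_pages=50, max_pages=200))
```

In practice the adjustment would be repeated over successive iterations of the flow until the number of retained pages falls in the desired range.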

2. Preprocessing data;

As shown in fig. 2, the web page data collected by the discovery algorithm is cleaned:

The duplicate check mainly examines information such as name and size and removes identical files (for example, files with the same name stored in different locations, or the same file under different names or at different stages). The content and quality checks depend on the needs of the task. They are carried out by manual confirmation to ensure that the data finally uploaded meets the requirements. The content finally obtained for calculating the correlation comprises: title, summary, and link address.
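A minimal sketch of the duplicate check, assuming each collected item is represented as a dictionary with `name`, `size`, and `path` keys (a hypothetical record layout not given in the text). Only the same-name, same-size case is handled; detecting the same file under different names would need a content comparison, which is not shown here.

```python
def remove_duplicates(files):
    """Keep the first file seen for each (name, size) pair.

    files is a list of dicts with 'name', 'size' and 'path' keys
    (an assumed record layout).
    """
    seen = set()
    unique = []
    for item in files:
        key = (item["name"], item["size"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Hypothetical sample records.
collected = [
    {"name": "report.html", "size": 2048, "path": "/a/report.html"},
    {"name": "report.html", "size": 2048, "path": "/b/report.html"},
    {"name": "survey.html", "size": 1024, "path": "/a/survey.html"},
]
print([f["path"] for f in remove_duplicates(collected)])
```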

3. Calculating the correlation degree of the webpage data and the keyword set;

Treating the keyword set with a knowledge hierarchy relationship as a document $D_i$, and leaving factors such as weight out of consideration, the formula for the correlation degree between web page data $D_j$ and the keyword set is

$k$: the serial number, in document $D_i$, of a keyword associated with the retrieval subject and obtained from the geological thesaurus organized by the expert knowledge structure tree

$m$: the number of terms in the geological thesaurus related to the retrieval subject

$d_{kj}$: the number of times the keyword with serial number $k$ appears in web page data $D_j$
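The counts $d_{kj}$ defined above can be computed with a short sketch. The unweighted formula itself is not reproduced in this text, so `unweighted_relevance` below simply assumes that the correlation degree is the sum of the $d_{kj}$ over $k = 1, \dots, m$; that is an assumption, not the patent's stated formula. Matching is by case-insensitive substring, another simplification.

```python
def keyword_counts(text, keywords):
    """Return [d_1j, ..., d_mj]: occurrences of each keyword in one page's text."""
    lowered = text.lower()
    return [lowered.count(keyword.lower()) for keyword in keywords]


def unweighted_relevance(text, keywords):
    """Assumed form of the unweighted correlation degree: the sum of all d_kj."""
    return sum(keyword_counts(text, keywords))


# Hypothetical page text and keyword set.
page_text = "Granite and granite pegmatite in an igneous rock province"
terms = ["granite", "pegmatite", "basalt"]
print(keyword_counts(page_text, terms))
print(unweighted_relevance(page_text, terms))
```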

4. Introducing a keyword set frequency vector;

For the set of weighted keywords $k_i$ ($i = 1, 2, \dots, m$) obtained from the thesaurus and web document $C_j$, the correlation calculation formula is as follows:

$$REL_D = A_{title} W^{T} + B_{content} W^{T}$$

$A_{title} = (A_{j1}, A_{j2}, \dots, A_{jm})$: the vector of the numbers of times each $k_i$ appears in the title of web page data $D_j$

$B_{content} = (B_{j1}, B_{j2}, \dots, B_{jm})$: the vector of the numbers of times each keyword $k_i$ appears in document $C_j$

$W = (W_1, W_2, \dots, W_m)$: the weight vector of the keywords
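As a worked illustration of $REL_D = A_{title} W^{T} + B_{content} W^{T}$, the sketch below evaluates the two dot products with plain Python lists. The function names and the sample numbers are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def rel_d(a_title, b_content, w):
    """REL_D = A_title . W^T + B_content . W^T for a single web page."""
    return dot(a_title, w) + dot(b_content, w)


# Hypothetical counts for m = 3 keywords on one page.
a_title = [1, 0, 0]      # occurrences of each keyword in the title
b_content = [3, 1, 2]    # occurrences of each keyword in the content
w = [1.0, 0.5, 0.8]      # keyword weights
print(rel_d(a_title, b_content, w))
```

With these numbers the title contributes 1.0 and the content contributes 3.0 + 0.5 + 1.6 = 5.1, for a total of about 6.1.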

5. Forming a web page data-keyword weight matrix

The total keyword weight of each item of web page data is calculated, and a weight threshold is determined according to actual application requirements, so that the relevance of the web page theme is judged and a web page data-keyword weight matrix is formed:

$Q$: a position adjustment parameter, reflecting that keywords appearing in the title are more relevant than those appearing in the abstract.

$W = (W_1, W_2, \dots, W_m)$: the weight vector of the keywords; preferred terms and variant terms take the value 1; broader terms take the value 0.5; narrower terms take the value 0.8; related terms take the value 0.5.

The values of $Q$ and $W$ can be adjusted according to the volume of data actually required.
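The weight values listed for $W$ and the thresholding step can be sketched as below. The matrix formula containing $Q$ is not reproduced in this text, so the sketch assumes that $Q$ multiplies the title counts, giving each matrix entry the form $(Q \cdot A_{jk} + B_{jk}) \cdot W_k$. That form, the function names, and the sample data are assumptions for illustration only.

```python
# Weights by lexical relation, as listed in the description.
TERM_WEIGHTS = {"PT": 1.0, "VT": 1.0, "BT": 0.5, "NT": 0.8, "RT": 0.5}


def weight_vector(term_types):
    """Map each keyword's relation type (PT, VT, BT, NT, RT) to its weight."""
    return [TERM_WEIGHTS[t] for t in term_types]


def weight_matrix(title_counts, content_counts, w, q):
    """One row per page, one column per keyword.

    Assumed entry form: (q * A_jk + B_jk) * W_k, where q boosts title hits.
    """
    return [
        [(q * a + b) * wk for a, b, wk in zip(a_row, b_row, w)]
        for a_row, b_row in zip(title_counts, content_counts)
    ]


def relevant_pages(matrix, threshold):
    """Indices of the pages whose total keyword weight reaches the threshold."""
    return [j for j, row in enumerate(matrix) if sum(row) >= threshold]


# Hypothetical data: two pages, three keywords (a PT, a BT and an NT).
w = weight_vector(["PT", "BT", "NT"])
title_counts = [[1, 0, 0], [0, 0, 0]]
content_counts = [[3, 1, 2], [0, 1, 0]]
matrix = weight_matrix(title_counts, content_counts, w, q=2.0)
print(relevant_pages(matrix, threshold=1.0))
```

Here the first page scores about 7.1 and the second scores 0.5, so only the first page passes a threshold of 1.0.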

It will be appreciated by those of ordinary skill in the art that the examples described herein are intended to assist the reader in understanding the manner in which the invention is practiced, and it is to be understood that the scope of the invention is not limited to such specifically recited statements and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims

S1: extracting web page data;

S11: confirming the keywords associated with the retrieval subject, obtained from the geological thesaurus;

S12: performing web page retrieval using an API provided by a search engine;

S13: acquiring the URLs of the web page links;

treating the keyword set with a knowledge hierarchy relationship as a document $D_i$; when weight factors are not considered, the formula for the correlation degree between web page data $D_j$ and the geological subject term set is

In the formula:

$k$: the serial number, in document $D_i$, of a keyword associated with the retrieval subject and obtained from the geological thesaurus organized by the expert knowledge structure tree;

$m$: the number of terms in the geological thesaurus related to the retrieval subject;

introducing a keyword-set frequency vector:

for the set of weighted geological subject terms $k_i$ ($i = 1, 2, \dots, m$) obtained from the thesaurus and web document $C_j$, the correlation calculation formula is as follows:

$$REL_D = A_{title} W^{T} + B_{content} W^{T}$$

$B_{content} = (B_{j1}, B_{j2}, \dots, B_{jm})$: the vector of the numbers of times each geological subject term $k_i$ appears in web document $C_j$;

$W = (W_1, W_2, \dots, W_m)$: the weight vector of the keywords;

forming a web page data-geological subject term weight matrix;

the total keyword weight of each item of web page data is calculated, and a weight threshold is determined according to actual application requirements, so that the relevance of the web page theme is judged and a web page data-geological subject term weight matrix is formed:

$W = (W_1, W_2, \dots, W_m)$: the weight vector of the geological subject terms; preferred terms and variant terms take the value 1; broader terms take the value 0.5; narrower terms take the value 0.8; related terms take the value 0.5;

S16: crawling web page data using the BeautifulSoup library in Python;

S21: duplicate checking, in which name and size information is examined and identical files are removed;