CN111709238B - Web page geoscience correlation calculation method based on geoscience expert knowledge - Google Patents
Web page geoscience correlation calculation method based on geoscience expert knowledge
- Publication number
- CN111709238B
- Authority
- CN
- China
- Prior art keywords
- geological
- data
- webpage
- subject
- term
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for calculating webpage geoscience correlation based on geoscience expert knowledge, comprising the following steps: 1. acquiring webpage data using a discovery algorithm; 2. preprocessing the data; 3. calculating the correlation degree between the webpage data and the keyword set; 4. introducing a keyword set frequency vector; 5. forming a webpage data-keyword weight matrix. The advantages of the invention are as follows: the method selects related terms on the user's behalf and quantifies the correlation according to an objective knowledge tree built from the expert narrative table, which overcomes the limitations of traditional correlation calculation methods; and because the knowledge structure tree is compiled by an expert team, it effectively avoids the omission of key terms that can occur when ordinary researchers choose keywords manually for correlation calculation.
Description
Technical Field
The invention relates to the technical field of geoscience data calculation, in particular to a webpage geoscience correlation calculation method based on geoscience expert knowledge.
Background
In recent years, internet technologies, including big data, have become important subjects and technical means in the information field, and many industries are actively studying how to apply big data in their own domains. The geological industry likewise needs the support and application of big data and related information technologies. The big data era is changing the traditional, experience-driven way of thinking: letting the data speak, finding answers in the data, and using data for decision-making and innovation will be the main theme of the geological industry's future development (Chenjian Ping, lijing, zhening, et al. Construction and application of geological cloud under the background of big data).
An important task in applying geological big data today is to make full use of existing geological text data, obtain the required geological information from massive data in a timely and comprehensive way, and analyze and mine the potential knowledge and value in that data. With respect to data management, storage and property rights in the geological domain, geological data can be divided into three categories: core data, neighborhood data and public data (Lijing, chenjiangping, wang Xiang. Geological big data storage technology. Geological report 2015, 34 (8): 1589-1594). Geological public data refers to the geology-related data resources on the internet, such as geological news, prospecting results and geological survey information published by portal websites in the various geological fields, and the geological literature held by academic websites.
With the rapid development of information technologies such as cloud computing, artificial intelligence and deep learning, the research and application of big data have achieved breakthroughs in many fields. Applied research on geological big data is an important component of China's big data strategy. Under the big data concept, raising the utilization of data resources and eliminating data islands have already produced many results and bring unprecedented opportunities to the geological industry.
With the development of the internet and mobile network technology, the volume of data published every day in formats such as news, microblogs and pictures is growing explosively, data is generated, stored and updated ever faster, and users' personalized, topic-specific requirements are increasingly prominent. In particular, the geological field has no crawler system with geological theme functions, so the geological data needed cannot be acquired and selected from the internet. Faced with massive existing data, valuable geological text data must be selected from a vast sea of data before accurate information extraction and knowledge mining can be carried out. Geological data exists widely in both wide area networks and local area networks, so rapidly discovering, locating and selecting geological big data requires overcoming the multi-source, massive, complex and unstructured character of geological text data resources. For geological data on the wide area network, traditional search engines can hardly query and acquire the geological data of interest efficiently and comprehensively.
The prior art (for example, Zhao Bing Man, wang Wei ya. Webpage academic algorithm research based on correlation analysis [J]. Electronic test, 2018, (22): 70-71) judges correlation by word frequency: web pages are ranked by relevance to the search target according to the number of times the input keywords appear in each page. In research practice, however, researchers working in an unfamiliar subject field, without systematic training, often cannot easily grasp the relationships between professional terms, so judging correlation by word frequency alone has limitations and cannot objectively reflect the actual situation.
Abbreviations and key term definitions used in the present invention
Mining big data: counting, analyzing and extracting potential information and knowledge from big data, organizing that knowledge into an intelligent, interlinked knowledge base, and enabling knowledge retrieval and calculation.
Expert knowledge structure tree: a tree diagram, provided by the expert team, containing professional vocabulary and the relationships among the terms.
Geological narrative table: a narrative table (thesaurus), also called a subject vocabulary or retrieval dictionary, is a dictionary used for indexing, storing and retrieving documents, and is the concrete embodiment of the descriptor method. It is a term control tool that converts the natural language used by indexers and searchers into a normalized, descriptor-based subject retrieval language.
A geological narrative table constructed according to the geoscience expert knowledge nodes: using the knowledge structure tree provided by the expert team, a geological narrative table containing hypernyms, hyponyms, related terms, top terms and synonyms is constructed according to the lexical relations broader term (BT), preferred term (PT), variant term (VT), related term (RT) and narrower term (NT).
Logical structure tree computation: each descriptor in the geological narrative table is stored by a computer program in a tree structure according to the specified lexical relations. The keyword to be searched is compared with the generated logical structure tree, and the related terms in the tree are used as extended search terms in a traditional search, so that the user can also obtain geoscience data that does not contain the keyword but is closely related to it.
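The logical structure tree computation described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class `Term`, the functions `build_thesaurus` and `expand_query`, and the sample entries are hypothetical and are not taken from the patent; only the relation codes BT, PT, VT, RT and NT come from the definitions above.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One descriptor in the narrative table, with links keyed by relation code."""
    name: str
    relations: dict = field(default_factory=dict)  # e.g. {"BT": ["igneous rock"]}


def build_thesaurus(records):
    """Build a name -> Term index from (term, relation, other_term) triples."""
    index = {}
    for term, relation, other in records:
        entry = index.setdefault(term, Term(term))
        entry.relations.setdefault(relation, []).append(other)
        index.setdefault(other, Term(other))  # make sure the linked term exists
    return index


def expand_query(index, keyword):
    """Return the keyword (tagged PT) plus every term directly linked to it."""
    expanded = [(keyword, "PT")]
    term = index.get(keyword)
    if term is not None:
        for relation, others in term.relations.items():
            expanded.extend((other, relation) for other in others)
    return expanded


# Hypothetical sample entries, used only to exercise the functions.
records = [
    ("granite", "BT", "igneous rock"),
    ("granite", "NT", "biotite granite"),
    ("granite", "RT", "pegmatite"),
    ("granite", "VT", "granitic rock"),
]
index = build_thesaurus(records)
expanded = expand_query(index, "granite")
```

Each expanded term keeps its relation code, so a later weighting step can assign a different weight to broader, narrower and related terms.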
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a webpage geoscience correlation calculation method based on geoscience expert knowledge.
To achieve this purpose, the invention adopts the following technical scheme:
1. A webpage geoscience correlation calculation method based on geoscience expert knowledge, characterized by comprising the following steps:
S1: extracting webpage data;
S11: confirming the associated keywords, obtained from the geological narrative table, that relate to the retrieval subject;
S12: performing web page retrieval using an API provided by a search engine;
S13: acquiring the URLs of the web page links;
S14: judging the relevance to the geological theme according to the following steps:
calculating the correlation degree between the webpage data and the geological subject terms:
the keyword set with a knowledge hierarchy structure relationship is regarded as a document $D_i$; when factors such as weight are not considered, the correlation degree calculation formula of the webpage data $D_j$ for the geological subject term set is
In the formula:
k: the serial number, in document $D_i$, of a retrieval-subject-associated keyword obtained from the geological narrative table organized by the expert knowledge structure tree;
m: the number of words in the geological narrative table that relate to the retrieval subject;
$d_{kj}$: the number of times the keyword with serial number k appears in the webpage data $D_j$;
introducing a keyword set frequency vector:
a weighted set of geological subject terms $k_i$ ($i = 1, 2, \ldots, m$) is obtained from the narrative table, and the correlation of web document $C_j$ is calculated as follows:
$REL_D = A_{title} \cdot W^{T} + B_{content} \cdot W^{T}$
$A_{title} = (A_{j1}, A_{j2}, \ldots, A_{jm})$: the vector of the number of times each $k_i$ appears in the title of webpage data $D_j$;
$B_{content} = (B_{j1}, B_{j2}, \ldots, B_{jm})$: the vector of the number of times each geological subject term $k_i$ appears in web document $C_j$;
$W = (W_1, W_2, \ldots, W_m)$: the weight vector formed by the keywords;
forming a webpage data-geological subject term weight matrix:
the relevance of the webpage theme is judged by calculating the total keyword weight of each item of webpage data and determining a weight threshold according to the actual application requirements, thereby forming a webpage data-geological subject term weight matrix:
$A_{title} = (A_{j1}, A_{j2}, \ldots, A_{jm})$: the vector of the number of times each $k_i$ appears in the title of webpage data $D_j$;
$B_{content} = (B_{j1}, B_{j2}, \ldots, B_{jm})$: the vector of the number of times each geological subject term $k_i$ appears in web document $C_j$;
Q: a position adjustment parameter, reflecting that keywords appearing in the title are more relevant than those appearing in the abstract;
$W = (W_1, W_2, \ldots, W_m)$: the weight vector formed by the geological subject terms; preferred terms and variant terms take the value 1; broader terms take the value 0.5; narrower terms take the value 0.8; related terms take the value 0.5;
the values of Q and W are adjusted according to the volume of data that actually needs to be acquired;
S15: determining a weight threshold: the threshold is determined according to actual application requirements, raised when too many geoscience web pages are screened out and lowered when too few are screened out;
S16: crawling the webpage data using the BeautifulSoup library in Python;
S2: preprocessing the data by cleaning the webpage data acquired by the discovery algorithm;
S21: duplicate checking, which examines name and size information and removes identical files;
S22: content and quality checking, which is carried out by manual confirmation to ensure that the data finally uploaded meets the requirements; the content finally obtained for correlation calculation comprises: title, summary, and link address.
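As an illustration of how the three fields named in step S22 (title, summary, link address) might be pulled from a crawled page, the sketch below parses an HTML string. Two assumptions should be noted: it uses Python's standard-library `html.parser` as a stand-in for the BeautifulSoup library named in step S16, so that it runs without third-party packages, and it treats the `<meta name="description">` content as the summary, which the patent does not specify. The names `PageFields` and `extract_record` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collect the <title> text and the content of <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.summary = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            # Assumption: the meta description stands in for the page summary.
            self.summary = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_record(url, html):
    """Return the title, summary and link address for one page."""
    parser = PageFields()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "summary": parser.summary.strip(),
        "link": url,
    }


# Hypothetical page used only to exercise the parser.
sample_html = """<html><head><title> Gold deposit survey </title>
<meta name="description" content="Results of a regional gold survey."></head>
<body><p>Body text.</p></body></html>"""
record = extract_record("https://example.org/page", sample_html)
```

In a real crawler the HTML would come from the URLs acquired in step S13; fetching is left out here so the sketch stays self-contained.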
Compared with the prior art, the invention has the advantages that:
the method can replace users to select related words and quantify the relevance according to the objective expert narrative word list knowledge tree, solves the limitation problem of the traditional relevance calculation method, and meanwhile, the knowledge structure tree summarized by the expert team can effectively avoid the omission of partial key words possibly occurring when common researchers manually search the key words for relevance calculation.
Drawings
FIG. 1 is a flowchart illustrating a web page data extraction process according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a web page data cleansing process according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below by referring to the accompanying drawings and embodiments.
1. Iterating through the flow shown in FIG. 1;
The associated keywords for the retrieval subject, obtained from the geological narrative table organized by the expert knowledge structure tree, are used to retrieve web pages through the API (application programming interface) provided by any mainstream search engine (such as Google, Baidu, Congo and the like), and the webpage data is crawled using the BeautifulSoup library in Python.
Judging the relevance to the geological subject: a frequency vector of the geological subject term set is introduced to calculate the correlation degree between the webpage data and the geological subject term set, a webpage data-geological subject term weight matrix is formed, and the relevance is judged.
Weight threshold: determined according to actual application requirements; when too many geoscience web pages are screened out, the threshold is raised appropriately, and when too few are screened out, the threshold is lowered appropriately.
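The threshold adjustment described above can be sketched as a simple loop. This is only one possible reading: the patent does not give an adjustment rule, so the function name `adjust_threshold`, the fixed `step`, the `min_pages` and `max_pages` bounds, the round limit and the sample scores are all assumptions made for illustration.

```python
def adjust_threshold(scores, threshold, min_pages, max_pages, step=0.5, max_rounds=100):
    """Raise the threshold while too many pages pass; lower it while too few pass."""
    for _ in range(max_rounds):  # bounded, so a coarse step cannot loop forever
        kept = sum(1 for score in scores if score >= threshold)
        if kept > max_pages:
            threshold += step
        elif kept < min_pages and threshold > 0:
            threshold = max(0.0, threshold - step)
        else:
            break
    return threshold


# Hypothetical relevance scores for six crawled pages.
scores = [0.5, 1.0, 2.0, 3.5, 6.0, 8.0]
raised = adjust_threshold(scores, threshold=1.0, min_pages=2, max_pages=3)
lowered = adjust_threshold(scores, threshold=10.0, min_pages=2, max_pages=3)
```

With these sample scores, a starting threshold of 1.0 lets five pages through and is raised until three remain, while a starting threshold of 10.0 lets none through and is lowered until two pass.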
2. Preprocessing data;
As shown in FIG. 2, the webpage data collected by the discovery algorithm is cleaned:
The duplicate check mainly examines information such as name and size and removes identical files (for example, files with the same name stored in different locations, or the same file under different names or in different stage states). The content and quality checks are determined by the needs of the task. They are carried out by manual confirmation to ensure that the data finally uploaded meets the requirements, and the content finally obtained for correlation calculation comprises: title, summary, and link address.
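A minimal sketch of the duplicate check follows. It covers only the simplest case described above, in which two entries share the same name and size but sit in different locations; detecting the same file under different names would need an additional signal such as a content hash, which is not attempted here. The function name `remove_duplicates`, the record fields and the sample data are hypothetical.

```python
def remove_duplicates(files):
    """Keep the first file for each (name, size) pair and drop later copies."""
    seen = set()
    unique = []
    for entry in files:
        key = (entry["name"].strip().lower(), entry["size"])
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Hypothetical crawl results: the first two differ only in storage location.
files = [
    {"name": "survey.pdf", "size": 2048, "path": "/a/survey.pdf"},
    {"name": "Survey.pdf", "size": 2048, "path": "/b/Survey.pdf"},
    {"name": "survey.pdf", "size": 4096, "path": "/c/survey.pdf"},
]
cleaned = remove_duplicates(files)
```

Files that share a name but differ in size are kept, since they may be different versions that still need the manual content and quality check.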
3. Calculating the correlation degree between the webpage data and the keyword set;
The keyword set with a knowledge hierarchy structure relationship is regarded as a document $D_i$; when factors such as weight are not considered, the correlation degree calculation formula of the webpage data $D_j$ for the keyword set is
k: the serial number, in document $D_i$, of a retrieval-subject-associated keyword obtained from the geological narrative table organized by the expert knowledge structure tree
m: the number of words in the geological narrative table that relate to the retrieval subject
$d_{kj}$: the number of times the keyword with serial number k appears in the webpage data $D_j$
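The unweighted calculation can be illustrated as follows. Because the formula itself is not reproduced in the text, the sketch assumes the simplest reading consistent with the symbol definitions, namely that the correlation degree is the sum of the counts $d_{kj}$ over k = 1 to m; this is an assumption, not the patent's stated formula. The function names and sample text are hypothetical, and counting is done by case-insensitive substring matching.

```python
def count_occurrences(text, keyword):
    """Count case-insensitive, non-overlapping occurrences of keyword in text."""
    return text.lower().count(keyword.lower())


def unweighted_relevance(page_text, keywords):
    """Return the per-keyword counts d_kj and their sum (assumed relevance)."""
    counts = [count_occurrences(page_text, keyword) for keyword in keywords]
    return counts, sum(counts)


# Hypothetical page text and keyword set.
page_text = "Granite and pegmatite outcrops; the granite is coarse."
keywords = ["granite", "pegmatite", "basalt"]
counts, relevance = unweighted_relevance(page_text, keywords)
```

Substring matching also counts a keyword inside a longer word, so a production version would probably tokenize the text first, especially for Chinese pages where word segmentation is needed.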
4. Introducing a keyword set frequency vector;
A weighted keyword set $k_i$ ($i = 1, 2, \ldots, m$) is obtained from the narrative table, and the correlation of web document $C_j$ is calculated as follows:
$REL_D = A_{title} \cdot W^{T} + B_{content} \cdot W^{T}$
$A_{title} = (A_{j1}, A_{j2}, \ldots, A_{jm})$: the vector of the number of times each $k_i$ appears in the title of webpage data $D_j$
$B_{content} = (B_{j1}, B_{j2}, \ldots, B_{jm})$: the vector of the number of times each keyword $k_i$ appears in document $C_j$
$W = (W_1, W_2, \ldots, W_m)$: the weight vector formed by the keywords
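A small worked sketch of the weighted formula follows, reading the product with the weight vector as a dot product between each count vector and W. The weights per relation type (1 for preferred and variant terms, 0.5 for broader terms, 0.8 for narrower terms, 0.5 for related terms) are the values given in this description; the function names, the relation-code dictionary and the sample counts are hypothetical.

```python
# Weights per lexical relation, as listed in the description.
RELATION_WEIGHTS = {"PT": 1.0, "VT": 1.0, "BT": 0.5, "NT": 0.8, "RT": 0.5}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def rel_d(title_counts, content_counts, weights):
    """REL_D = A_title . W + B_content . W for one web document."""
    return dot(title_counts, weights) + dot(content_counts, weights)


# Hypothetical keyword set of four terms with relations PT, BT, NT, RT.
relations = ["PT", "BT", "NT", "RT"]
weights = [RELATION_WEIGHTS[r] for r in relations]
title_counts = [1, 0, 1, 0]      # A_title: occurrences in the title
content_counts = [3, 2, 0, 4]    # B_content: occurrences in the document
score = rel_d(title_counts, content_counts, weights)
```

For these sample counts the title contributes 1 + 0.8 and the content contributes 3 + 1 + 2, so the narrower term found only in the title still raises the score even though it never appears in the body.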
5. Forming a webpage data-keyword weight matrix
The relevance of the webpage theme is judged by calculating the total keyword weight of each item of webpage data and determining a weight threshold according to the actual application requirements, thereby forming a webpage data-keyword weight matrix:
$A_{title} = (A_{j1}, A_{j2}, \ldots, A_{jm})$: the vector of the number of times each $k_i$ appears in the title of webpage data $D_j$
$B_{content} = (B_{j1}, B_{j2}, \ldots, B_{jm})$: the vector of the number of times each keyword $k_i$ appears in document $C_j$
Q: a position adjustment parameter, reflecting that keywords appearing in the title are more relevant than those appearing in the abstract.
$W = (W_1, W_2, \ldots, W_m)$: the weight vector formed by the keywords; preferred terms and variant terms take the value 1; broader terms take the value 0.5; narrower terms take the value 0.8; related terms take the value 0.5.
The values of Q and W can be adjusted according to the volume of data that actually needs to be acquired.
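The matrix and threshold step can be sketched as below. The matrix formula is not reproduced in the text, so the sketch assumes that Q multiplies the title counts, giving each entry the form (Q * A_ji + B_ji) * W_i, and that a page is judged relevant when its row total reaches the threshold. Both points are assumptions, as are the function names, the value Q = 2 and the sample counts.

```python
def weight_matrix(title_counts, content_counts, weights, q=2.0):
    """Build a page-by-keyword matrix with entries (q * A_ji + B_ji) * W_i."""
    matrix = []
    for a_row, b_row in zip(title_counts, content_counts):
        matrix.append([(q * a + b) * w for a, b, w in zip(a_row, b_row, weights)])
    return matrix


def relevant_pages(matrix, threshold):
    """Return the indices of pages whose total keyword weight meets the threshold."""
    return [j for j, row in enumerate(matrix) if sum(row) >= threshold]


# Hypothetical counts for two pages and four keywords (PT, BT, NT, RT).
weights = [1.0, 0.5, 0.8, 0.5]
title_counts = [[1, 0, 1, 0], [0, 0, 0, 0]]
content_counts = [[3, 2, 0, 4], [0, 1, 0, 0]]
matrix = weight_matrix(title_counts, content_counts, weights, q=2.0)
selected = relevant_pages(matrix, threshold=3.0)
```

Keeping the per-keyword entries, rather than only the row totals, makes it possible to see which terms drove a page over the threshold when Q and W are tuned.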
It will be appreciated by those of ordinary skill in the art that the examples described herein are intended to assist the reader in understanding the manner in which the invention is practiced, and it is to be understood that the scope of the invention is not limited to such specifically recited statements and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.
Claims (1)
1. A webpage geoscience correlation calculation method based on geoscience expert knowledge, characterized by comprising the following steps:
S1: extracting webpage data;
S11: confirming the associated keywords, obtained from the geological narrative table, that relate to the retrieval subject;
S12: performing web page retrieval using an API provided by a search engine;
S13: acquiring the URLs of the web page links;
S14: judging the relevance to the geological theme according to the following steps:
calculating the correlation degree between the webpage data and the geological subject terms:
the keyword set with a knowledge hierarchy structure relationship is regarded as a document $D_i$; when the weight factor is not considered, the correlation degree calculation formula of the webpage data $D_j$ for the geological subject term set is
In the formula:
k: the serial number, in document $D_i$, of a retrieval-subject-associated keyword obtained from the geological narrative table organized by the expert knowledge structure tree;
m: the number of words in the geological narrative table that relate to the retrieval subject;
$d_{kj}$: the number of times the keyword with serial number k appears in the webpage data $D_j$;
introducing a keyword set frequency vector:
a weighted set of geological subject terms $k_i$ ($i = 1, 2, \ldots, m$) is obtained from the narrative table, and the correlation of web document $C_j$ is calculated as follows:
$REL_D = A_{title} \cdot W^{T} + B_{content} \cdot W^{T}$
$A_{title} = (A_{j1}, A_{j2}, \ldots, A_{jm})$: the vector of the number of times each $k_i$ appears in the title of webpage data $D_j$;
$B_{content} = (B_{j1}, B_{j2}, \ldots, B_{jm})$: the vector of the number of times each geological subject term $k_i$ appears in web document $C_j$;
$W = (W_1, W_2, \ldots, W_m)$: the weight vector formed by the keywords;
forming a webpage data-geological subject term weight matrix:
the relevance of the webpage theme is judged by calculating the total keyword weight of each item of webpage data and determining a weight threshold according to the actual application requirements, thereby forming a webpage data-geological subject term weight matrix:
$A_{title} = (A_{j1}, A_{j2}, \ldots, A_{jm})$: the vector of the number of times each $k_i$ appears in the title of webpage data $D_j$;
$B_{content} = (B_{j1}, B_{j2}, \ldots, B_{jm})$: the vector of the number of times each geological subject term $k_i$ appears in web document $C_j$;
Q: a position adjustment parameter, reflecting that keywords appearing in the title are more relevant than those appearing in the abstract;
$W = (W_1, W_2, \ldots, W_m)$: the weight vector formed by the geological subject terms; preferred terms and variant terms take the value 1; broader terms take the value 0.5; narrower terms take the value 0.8; related terms take the value 0.5;
the values of Q and W are adjusted according to the volume of data that actually needs to be acquired;
S15: determining a weight threshold: the threshold is determined according to actual application requirements, raised when too many geoscience web pages are screened out and lowered when too few are screened out;
S16: crawling the webpage data using the BeautifulSoup library in Python;
S2: preprocessing the data by cleaning the webpage data acquired by the discovery algorithm;
S21: duplicate checking, which examines name and size information and removes identical files;
S22: content and quality checking, which is carried out by manual confirmation to ensure that the data finally uploaded meets the requirements; the content finally obtained for correlation calculation comprises: title, summary, and link address.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010497002.1A CN111709238B (en) | 2020-06-04 | 2020-06-04 | Web page geoscience correlation calculation method based on geoscience expert knowledge |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010497002.1A CN111709238B (en) | 2020-06-04 | 2020-06-04 | Web page geoscience correlation calculation method based on geoscience expert knowledge |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111709238A (en) | 2020-09-25 |
CN111709238B (en) | 2023-04-07 |
Family
ID=72539334
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010497002.1A Active CN111709238B (en) | 2020-06-04 | 2020-06-04 | Web page geoscience correlation calculation method based on geoscience expert knowledge |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111709238B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2021104429A4 (en) * | 2021-07-22 | 2021-09-16 | Chinese Academy Of Surveying And Mapping | Machine Translation Method for French Geographical Names |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104679825A (en) * | 2015-01-06 | 2015-06-03 | 中国农业大学 | Web text-based acquiring and screening method of seismic macroscopic anomaly information |
CN106156272A (en) * | 2016-06-21 | 2016-11-23 | 北京工业大学 | A kind of information retrieval method based on multi-source semantic analysis |
CN107247780A (en) * | 2017-06-12 | 2017-10-13 | 北京理工大学 | A kind of patent document method for measuring similarity of knowledge based body |
CN110309246A (en) * | 2019-05-24 | 2019-10-08 | 中国地质调查局发展研究中心 | A kind of method and device thereof internet geologic data retrieval and obtained |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10122720B2 (en) * | 2017-02-07 | 2018-11-06 | Plesk International Gmbh | System and method for automated web site content analysis |
Non-Patent Citations (2)
Title |
---|
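Research on geoscience text information extraction technology; Wu Wen; China Master's Theses Full-text Database; 20070215; full text * |
Research on text data mining in quantitative evaluation of mineral resources; Chen Jianping et al.; Computing Techniques for Geophysical and Geochemical Exploration; 20150831; full text * |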
Also Published As
Publication number | Publication date |
---|---|
CN111709238A (en) | 2020-09-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101630314B (en) | Semantic query expansion method based on domain knowledge | |
CN104239513B (en) | A kind of semantic retrieving method of domain-oriented data | |
CN101452463A (en) | Method and apparatus for directionally grabbing page resource | |
US20100131485A1 (en) | Method and system for automatic construction of information organization structure for related information browsing | |
US10558707B2 (en) | Method for discovering relevant concepts in a semantic graph of concepts | |
CN106227788A (en) | Database query method based on Lucene | |
CN109643315B (en) | Method, system, computer device and computer readable medium for automatically generating Chinese ontology based on structured network knowledge | |
CN110555154B (en) | Theme-oriented information retrieval method | |
CN108710672B (en) | Theme crawler method based on incremental Bayesian algorithm | |
Grover et al. | Comparative analysis of pagerank and hits algorithms | |
CN105389328B (en) | A kind of extensive open source software searching order optimization method | |
CN111709238B (en) | Web page geoscience correlation calculation method based on geoscience expert knowledge | |
Plangprasopchok et al. | Exploiting social annotation for automatic resource discovery | |
CN117591738A (en) | Information retrieval system and method based on cloud service | |
Yang | An ontological website models-supported search agent for web services | |
CN112100500A (en) | Example learning-driven content-associated website discovery method | |
Abass et al. | Information retrieval models, techniques and applications | |
Priya et al. | Design and development of an ontology based personal web search engine | |
CN110309246A (en) | A kind of method and device thereof internet geologic data retrieval and obtained | |
Gupta et al. | A system's approach towards domain identification of web pages | |
Almadhoun et al. | Effects of using arabic web pages in building rank estimation algorithm for google search engine results page. | |
Bute et al. | Evaluating search effectiveness of some selected search engines | |
Hoeber et al. | Automatic topic learning for personalized re-ordering of web search results | |
Wardekar et al. | SmartCrawler: A Personalized Web Search for Relevant Web Pages | |
Yang | An ontology-supported website model for web search agents |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||