CN117076773A - Data source screening and optimizing method based on internet information - Google Patents

Data source screening and optimizing method based on internet information

Publication number
CN117076773A
CN117076773A CN202311063341.9A CN202311063341A CN117076773A CN 117076773 A CN117076773 A CN 117076773A CN 202311063341 A CN202311063341 A CN 202311063341A CN 117076773 A CN117076773 A CN 117076773A
Authority
CN
China
Prior art keywords
content resource
information
website
data source
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311063341.9A
Other languages
Chinese (zh)
Other versions
CN117076773B (en)
Inventor
闫磊
潘俊峰
梁雷
聂磊
董曙光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Languiqi Technology Development Co ltd
Original Assignee
Shanghai Languiqi Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Languiqi Technology Development Co ltd
Priority to CN202311063341.9A
Publication of CN117076773A
Application granted
Publication of CN117076773B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data source screening and optimizing method based on internet information, comprising the following steps. S1: the first n search results returned by each search engine are selected and placed in a content resource list, which, after de-duplication, serves as the input for screening and optimization. S2: the weight of each search result in the content resource list is initialized. S3: each content resource website in the content resource list is given a weighted score according to the scoring rule, yielding a dictionary with the content resource website as the key and the weighted score as the Value. S4: the dictionary is sorted by Value from high to low, and the top m content resource websites are output to a result list as the screened and optimized data sources. By screening and optimizing data sources during internet information crawling, the method obtains data of high value, high matching degree and high reliability, which addresses the problems of internet information overload and low value density and provides data support and data sources for agricultural production.

Description

Data source screening and optimizing method based on internet information
Technical Field
The invention belongs to the field of big data, and particularly relates to a data source screening and optimizing method based on internet information.
Background
To better promote the development of smart and intelligent agriculture, obtaining data of high value, high matching degree and high reliability is particularly important. The internet is currently one of the most important means of acquiring information, but it suffers from a huge volume and wide variety of information, information overload and low value density. As a result, obtaining the most useful data usually requires a large amount of manual screening. Moreover, because each search engine uses a different built-in search algorithm, the results of any single search engine are limited, so relevant results can be missed and important data lost.
Object of the Invention
To solve the above technical problems, the invention discloses a data source screening and optimizing method based on internet information, which screens and optimizes data sources while crawling internet information to obtain data of high value, high matching degree and high reliability, thereby addressing the problems of internet information overload and low value density and providing data support and data sources for agricultural production.
The specific technical scheme of the invention is as follows:
A data source screening and optimizing method based on internet information comprises the following steps:
S1: search for the keywords in several different internet search engines, select the first n search results returned by each search engine, place them in a content resource list, and de-duplicate the list to obtain the input for screening and optimization;
S2: initialize the weight of each search result in the content resource list;
S3: give each content resource website in the content resource list a weighted score according to the scoring rule, obtaining a dictionary with the content resource website as the key and the weighted score as the Value;
S4: sort the dictionary by Value from high to low, and output the top m content resource websites to a result list as the screened and optimized data sources.
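As an illustration of step S1 only, the following Python sketch merges the first n results of each engine into one content resource list and removes duplicates. It assumes the search results have already been fetched; the function name `build_content_resource_list` and the engine labels are hypothetical, and de-duplication here is by exact URL string, which is an assumption because the source does not say how duplicates are detected.

```python
def build_content_resource_list(results_per_engine: dict[str, list[str]], n: int) -> list[str]:
    """Merge the first n results of every engine, keeping first-seen order."""
    merged: list[str] = []
    seen: set[str] = set()
    for urls in results_per_engine.values():
        for url in urls[:n]:
            if url not in seen:  # exact-string de-duplication (assumption)
                seen.add(url)
                merged.append(url)
    return merged


if __name__ == "__main__":
    # Hypothetical engine labels; the URLs are taken from the example list.
    sample = {
        "engine_a": ["https://ricedata.cn/", "https://www.cgris.net/"],
        "engine_b": ["https://www.cgris.net/", "https://www.ricedata.cn/variety/"],
    }
    print(build_content_resource_list(sample, n=20))
```

Exact-string matching treats `https://ricedata.cn/` and `https://www.ricedata.cn/` as different entries; a real pipeline might normalize URLs first, but that goes beyond what the source describes.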
Preferably, the result list obtained in step S4 is further verified and evaluated as follows:
the content of the content resource websites in the result list is crawled according to the expected data information items, and the ratio of crawled data information items to expected data information items is calculated; this ratio measures the value of the content resource website:
content resource website value = number of data information items crawled from the website / number of expected data information items.
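The ratio above can be sketched in Python as follows. The function name `website_value` and the item names are hypothetical, and counting only crawled items that also appear among the expected items is an assumption made here so that the ratio cannot exceed 1.

```python
def website_value(crawled_items: set[str], expected_items: set[str]) -> float:
    """Share of the expected data information items that the website supplied."""
    if not expected_items:
        raise ValueError("expected_items must not be empty")
    # Count only items that were actually expected (assumption).
    matched = crawled_items & expected_items
    return len(matched) / len(expected_items)


if __name__ == "__main__":
    expected = {"variety_name", "breeder", "growth_period", "yield"}  # hypothetical items
    crawled = {"variety_name", "yield", "page_title"}
    print(website_value(crawled, expected))
```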
Preferably, in the step S3, the weight scoring calculation is performed on the content resource website from three dimensions of credibility, matching degree and popularity.
Preferably, in the step S3, the content resource websites are weighted according to the formula (1):
$$\mathrm{Value} = V_1 a_1 + V_2 a_2 + \dots + V_n a_n \tag{1}$$
where $V_n$ is the score of the content resource website in the n-th dimension, $a_n$ is the weight ratio of the n-th dimension, and $a_1 + a_2 + \dots + a_n = 1$.
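A minimal Python sketch of formula (1) follows. The function name `weighted_value` and the numbers in the usage lines are illustrative assumptions; the check that the weight ratios sum to 1 mirrors the constraint stated with the formula.

```python
import math


def weighted_value(scores: list[float], weights: list[float]) -> float:
    """Compute Value = V_1*a_1 + ... + V_n*a_n for one content resource website."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weight ratios must sum to 1")
    return sum(v * a for v, a in zip(scores, weights))


if __name__ == "__main__":
    # Two hypothetical dimensions with scores 8 and 6 and weight ratios 0.75 and 0.25.
    print(weighted_value([8, 6], [0.75, 0.25]))
```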
Preferably, in step S3, the credibility weight is assigned according to the information publishing website type, the matching degree weight according to the information matching type, and the popularity weight according to the information applicable standard type.
Preferably, the information publishing website types include ministry-level official releases, releases by units subordinate to ministries, provincial and local official data releases, releases by local units, official websites of industry-leading enterprises, official websites of ordinary enterprises in the industry, third-party statistics websites, and e-commerce websites.
Preferably, the information matching type includes keyword matching, category matching, domain matching and industry matching.
Preferably, the information applicable standard types include national standards, industry standards, local standards, and enterprise standards.
Beneficial effects: the data source screening and optimizing method based on internet information disclosed by the invention has the following advantages:
(1) The invention uses the results produced by the built-in search algorithms and ranking rules of different search engines as the input for preliminary screening. Drawing on several search engines in this way makes the input data more comprehensive while reducing the amount of data that later screening and optimization must process, which improves screening and optimization efficiency.
(2) The invention scores and selects content resource websites along three dimensions, credibility, matching degree and popularity, and outputs the high-scoring websites, thereby screening and optimizing the data sources and improving the value, reliability and matching degree of the search results.
(3) The invention compares the content crawled from the screened content resource websites with the preset expected data information items, so that the optimization result is further verified and evaluated and the value, reliability and matching degree of the search results are further ensured.
Drawings
FIG. 1 is a schematic diagram of a data source screening optimization method according to the present invention.
Detailed Description
The invention is further described below with reference to the drawings. Improvements and modifications made on this basis are also regarded as falling within the scope of protection.
Example 1
Taking the crawling of rice seedling data in agricultural production as an example, as shown in FIG. 1, data sources are screened and optimized based on internet information through the following steps:
Step 1: set the input keyword to 'rice seed information' and, in this embodiment, enter it into four different search engines: Baidu, Sogou, 360 and Bing. The top 20 search results of each search engine are placed in a content resource list, which after de-duplication is as follows:
[https://ricedata.cn/,https://www.ricedata.cn/variety/,https://www.cgris.net/,https://zhuanlan.zhihu.com/p/374483809,https://baike.baidu.hk/item/%E6%B0%B4%E7%A8%BB/21285,http://www.zys.moa.gov.cn/mhsh/202104/t20210422_6366373.htm,https://baike.baidu.com/item/%E7%A8%BB/4417005,https://www.ricedata.cn/variety/superice.htm,https://www.gov.cn/xinwen/2022-12/05/content_5730461.htm,http://www.jiangdu.gov.cn/jdqxxgk/nyncj/202304/9585364ff7644872a192aa4e764acbd2.shtml,...,https://www.baidu.com/linkurl=mqtDoXWwXYVLdKcQWTGUgzJODBEum5ZwKuGHls3NrfKKlgdy2N-5kfUU9Abxpw4w&wd=&eqid=8e799a1c00002480000000046497f78d,https://www.baidu.com/linkurl=mqtDoXWwXYVLdKcQWTGUgrK3K0aILqMtbYseQAn6vP2-5lVLOgsNpBv4RoklwWfcvNVoWN6OXLGcq3BtRJP_oWtzZritn37lyIlYvPn4fYDFgtxTvg7uqrzcMgWV3bkyRkgqVZEObUtkqLB3m1iUwWAzK3wAnFZXppTYghXeYDUC3pLMHonrqWLeRDJ7KcXKiqTtTRhJtZfzExYxI3mSVr4e8vLxhUSCsuL9doVU6TB0VeGXmp8QLVmkB8-HGBHCwxOUKVFM4f56y-lExxW4U_&wd=&eqid=8e799a1c00002480000000046497f78d,https://www.baidu.com/linkurl=mMw2X75qEAIbS7UaWryrE30mmDQC2vfgEAU1SUVbxG9FcbNBsXgj8I8_2eBtePgQGUP49x7a0L1-uFMfzuAXOw77M9u0awzhoN6a0gmyGqy&wd=&eqid=8e799a1c00002480000000046497f78d,https://www.baidu.com/linkurl=mMw2X75qEAIbS7UaWryrEEZJFrDq5Q8gbyA3LHePwBA6AkxTlgFSzbpcesUaRiFHhXCXi-xOUgwhJ__3SS16zZonqACOiHu99BsG9XVxrGS&wd=&eqid=8e799a1c00002480000000046497f78d]。
The search engines used in the invention are not limited to those listed above; any existing search engine capable of information retrieval is applicable.
Step 2: take the content resource list as the input for screening and optimization and initialize the weight of each search result in it, that is, initialize to 0 the Value associated with each content resource website used as a dictionary key. Then score the content resource websites in the content resource list according to the scoring rule to obtain a dictionary with the content resource website as the key and the weighted score as the Value, as shown below:
[https://www.ricedata.cn/variety/:9,https://ricedata.cn/:8.4,https://www.ricedata.cn/variety/superice.htm:8,https://www.cgris.net/:7.8,http://www.zys.moa.gov.cn/mhsh/202104/t20210422_6366373.html:7.2,...,https://baike.baidu.com/item/%E7%A8%BB/4417005:5.8,https://baike.baidu.hk/item/%E6%B0%B4%E7%A8%BB/21285:5.8,https://zhuanlan.zhihu.com/p/374483809:4.6]。
The dictionary is sorted by Value from high to low, and the top 20 content resource websites are output to the result list.
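Step 2 can be sketched in Python as below: initialize every Value to 0, fill in the scores, sort in descending order and keep the top m entries. The function name `screen_data_sources` and the `score_fn` callback are hypothetical; the demo simply looks up a few of the Values shown in the example dictionary rather than recomputing them.

```python
from typing import Callable


def screen_data_sources(
    content_resource_list: list[str],
    score_fn: Callable[[str], float],
    m: int,
) -> list[str]:
    """Return the m highest-scoring content resource websites."""
    # Initialize the Value of every content resource website to 0.
    values = {url: 0.0 for url in content_resource_list}
    # Score each website according to the scoring rule supplied by the caller.
    for url in values:
        values[url] = score_fn(url)
    # Sort by Value from high to low; ties keep their original order.
    ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _ in ranked[:m]]


if __name__ == "__main__":
    # A few Values copied from the example dictionary, used as a stand-in scoring rule.
    example_scores = {
        "https://zhuanlan.zhihu.com/p/374483809": 4.6,
        "https://ricedata.cn/": 8.4,
        "https://www.cgris.net/": 7.8,
        "https://www.ricedata.cn/variety/": 9,
    }
    print(screen_data_sources(list(example_scores), example_scores.__getitem__, m=2))
```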
In the invention, the weight scoring calculation is shown in the formula (1):
$$\mathrm{Value} = V_1 a_1 + V_2 a_2 + \dots + V_n a_n \tag{1}$$
where $V_n$ is the score of the content resource website in the n-th dimension, $a_n$ is the weight ratio of the n-th dimension, and $a_1 + a_2 + \dots + a_n = 1$.
The scoring rule in Example 1 is as follows: the weighted score of a content resource website is calculated along three dimensions, credibility, matching degree and popularity, that is, n = 3. The weight score tables for the dimensions are designed as follows:
Table 1 Credibility weight distribution table
Table 2 Matching degree weight distribution table
                   | Keyword matching | Category matching | Domain matching | Industry matching | Weight ratio
Degree of matching | 10               | 8                 | 6               | 4                 | 0.3
Table 3 Popularity (generality) weight distribution table
                   | National standard | Industry standard | Local standard | Enterprise standard | Weight ratio
Popularity         | 10                | 8                 | 6              | 4                   | 0.2
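The tables can be turned into lookup dictionaries, as in the following Python sketch. The matching and popularity scores and their weight ratios (0.3 and 0.2) come from Tables 2 and 3. The credibility weight ratio of 0.5 is not stated in the text and is inferred here from the requirement that the weight ratios sum to 1; the per-type credibility scores are not available, so the credibility score is passed in as a plain number. The names `score_website`, `MATCH_SCORES`, `POPULARITY_SCORES` and `WEIGHTS` are hypothetical.

```python
# Scores per type, taken from Tables 2 and 3.
MATCH_SCORES = {"keyword": 10, "category": 8, "domain": 6, "industry": 4}
POPULARITY_SCORES = {"national": 10, "industry": 8, "local": 6, "enterprise": 4}

# 0.3 and 0.2 are from the tables; 0.5 is inferred from a_1 + a_2 + a_3 = 1 (assumption).
WEIGHTS = {"credibility": 0.5, "matching": 0.3, "popularity": 0.2}


def score_website(credibility_score: float, match_type: str, standard_type: str) -> float:
    """Weighted score over credibility, matching degree and popularity."""
    value = (
        credibility_score * WEIGHTS["credibility"]
        + MATCH_SCORES[match_type] * WEIGHTS["matching"]
        + POPULARITY_SCORES[standard_type] * WEIGHTS["popularity"]
    )
    return round(value, 2)


if __name__ == "__main__":
    # Hypothetical credibility score of 8 with keyword matching and a national standard.
    print(score_website(8, "keyword", "national"))
```

With a hypothetical credibility score of 8, keyword matching and a national standard, the weighted sum is 8 × 0.5 + 10 × 0.3 + 10 × 0.2 = 9. The source does not state how the individual Values in its example dictionary were derived, so this arithmetic is only an illustration of the rule.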
Step 3: according to the rice seedling data information, expected data information items are set, 22 data items are obtained in total, and the table is shown as follows:
table 4 desired data information entry
Step 4: according to the expected data information items, content information crawling is carried out on the content resource websites in the result list obtained in the step 2, ratio calculation is carried out on the crawled data information items and the expected data information items, so that the value degree of the content resource websites is obtained, the value degree is used for evaluating the quality degree of the screening optimization method), and the calculation formula is as follows: content resource website value = information item crawled by the website/data item desired.
The evaluation criteria of the data source screening and optimizing method can be set according to the actual demands of users, for example: taking the value of the content resource website as a measurement standard, it can be considered that the content resource website is better than 85%, preferably 75% -85%, generally 60% -75% and not better than 60%.
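The example thresholds can be expressed as a small Python helper. The function name `grade_value` and the band labels are illustrative paraphrases, and the handling of the exact boundaries (85% counted in the second band, 75% and 60% counted in the higher of their two bands) is an assumption, since the text does not say which side the boundary values fall on.

```python
def grade_value(value: float) -> str:
    """Map a content resource website value in [0, 1] to an example rating band."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be between 0 and 1")
    if value > 0.85:
        return "excellent"
    if value >= 0.75:  # boundary handling is an assumption
        return "good"
    if value >= 0.60:
        return "fair"
    return "poor"


if __name__ == "__main__":
    # For instance, 19 of the 22 expected items crawled (hypothetical count).
    print(grade_value(19 / 22))
```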
If the evaluation result is poor, the screening and optimizing method needs to be adjusted, for example by adding dimensions or further subdividing the measurement indicators of each dimension.
The foregoing merely illustrates a preferred embodiment of the invention. It should be noted that modifications and adaptations made by those skilled in the art are also intended to fall within the scope of protection of the invention.

Claims (8)

1. A data source screening and optimizing method based on internet information, characterized by comprising the following steps:
S1: searching for keywords in several different internet search engines, selecting the first n search results returned by each search engine, placing them in a content resource list, and de-duplicating the list to obtain the input for screening and optimization;
S2: initializing the weight of each search result in the content resource list;
S3: giving each content resource website in the content resource list a weighted score according to the scoring rule, so as to obtain a dictionary with the content resource website as the key and the weighted score as the Value;
S4: sorting the dictionary by Value from high to low, and outputting the top m content resource websites to a result list as the screened and optimized data sources.
2. The data source screening and optimizing method based on internet information according to claim 1, wherein the result list obtained in step S4 is further verified and evaluated as follows:
the content of the content resource websites in the result list is crawled according to the expected data information items, and the ratio of crawled data information items to expected data information items is calculated, the ratio being used to measure the value of the content resource website:
content resource website value = number of data information items crawled from the website / number of expected data information items.
3. The data source screening and optimizing method based on internet information according to claim 1 or 2, wherein in S3, weight scoring calculation is performed on the content resource website from three dimensions of credibility, matching degree and popularity.
4. The internet information-based data source screening optimization method according to claim 3, wherein in S3, the content resource websites are weighted according to formula (1):
$$\mathrm{Value} = V_1 a_1 + V_2 a_2 + \dots + V_n a_n \tag{1}$$
where $V_n$ is the score of the content resource website in the n-th dimension, $a_n$ is the weight ratio of the n-th dimension, and $a_1 + a_2 + \dots + a_n = 1$.
5. The internet information-based data source screening and optimizing method according to claim 4, wherein in S3, the weight of the credibility is distributed according to the type of the information release website, the weight of the matching degree is distributed according to the type of the information matching, and the weight of the popularity is distributed according to the type of the information application standard.
6. The internet information-based data source screening optimization method of claim 5, wherein the information publishing website types include ministry-level official releases, releases by units subordinate to ministries, provincial and local official data releases, releases by local units, official websites of industry-leading enterprises, official websites of ordinary enterprises in the industry, third-party statistics websites, and e-commerce websites.
7. The internet information-based data source screening optimization method of claim 5, wherein the information matching types include keyword matching, category matching, domain matching, and industry matching.
8. The internet information-based data source screening optimization method of claim 5, wherein the information applicable standard types include national standards, industry standards, local standards, and enterprise standards.
CN202311063341.9A 2023-08-23 2023-08-23 Data source screening and optimizing method based on internet information Active CN117076773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311063341.9A CN117076773B (en) 2023-08-23 2023-08-23 Data source screening and optimizing method based on internet information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311063341.9A CN117076773B (en) 2023-08-23 2023-08-23 Data source screening and optimizing method based on internet information

Publications (2)

Publication Number Publication Date
CN117076773A (en) 2023-11-17
CN117076773B (en) 2024-05-28

Family

ID=88714825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311063341.9A Active CN117076773B (en) 2023-08-23 2023-08-23 Data source screening and optimizing method based on internet information

Country Status (1)

Country Link
CN (1) CN117076773B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101639838A (en) * 2008-07-31 2010-02-03 深圳龙媒网络技术有限公司 Method and system for searching resource
CN102023996A (en) * 2009-09-21 2011-04-20 英业达股份有限公司 System and method for sequencing websites based on contents of articles on websites
US20120047120A1 (en) * 2010-08-23 2012-02-23 Vistaprint Technologies Limited Search engine optimization assistant
CN104008210A (en) * 2014-06-20 2014-08-27 李玉坤 Web information retrieval method based on multiple search engines
CN104111888A (en) * 2014-07-03 2014-10-22 曹建楠 Code evaluation method, device and system for teaching
WO2015070673A1 (en) * 2013-11-15 2015-05-21 北京奇虎科技有限公司 Method for browser-side network search and browser
WO2015089860A1 (en) * 2013-12-18 2015-06-25 孙燕群 Search engine ranking method based on user participation
CN110175280A (en) * 2019-04-30 2019-08-27 广东鼎义互联科技股份有限公司 A kind of crawler analysis platform based on government affairs big data
CN110968511A (en) * 2019-11-29 2020-04-07 车智互联(北京)科技有限公司 Recommendation engine testing method, device, computing equipment and system
CN111177514A (en) * 2019-12-31 2020-05-19 沈阳航空航天大学 Information source evaluation method and device based on website characteristic analysis, storage equipment and program
US20200192951A1 (en) * 2018-12-13 2020-06-18 Microsoft Technology Licensing, Llc Personalized search result rankings
CN112417299A (en) * 2020-12-08 2021-02-26 西安联乘智能科技有限公司 Webpage recommendation method, computer storage medium and computing device
CN113722572A (en) * 2021-10-11 2021-11-30 上海易路软件有限公司 Distributed deep crawling method, device and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
靳嘉林;王曰芬;郑小昌;: "面向网页信息筛选的可信度评估研究", 情报理论与实践, vol. 40, no. 5, 15 May 2017 (2017-05-15), pages 116 - 121 *

Also Published As

Publication number Publication date
CN117076773B (en) 2024-05-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant