CN103544314A - Searching data quality statistical method - Google Patents

Info

Publication number
CN103544314A
Authority
CN
China
Prior art keywords
index
individual event
item
quality
index item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310539908.5A
Other languages
Chinese (zh)
Other versions
CN103544314B (en)
Inventor
张淼
杨杭
李小军
康治理
许国栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongsou Cloud Business Network Technology Co ltd
Original Assignee
Beijing Zhongsou Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongsou Network Technology Co ltd
Priority to CN201310539908.5A
Publication of CN103544314A
Application granted
Publication of CN103544314B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a search data quality statistical method. The method comprises: (1) deduplicating, through a quality data test, the information and data captured and indexed by the search crawler, to obtain the indicators that have a relatively large influence on the overall judgment of search data quality; (2) classifying the obtained indicators according to their properties; (3) setting a base score and a score weighting formula for the indicators; (4) obtaining the individual scores, category scores and overall score of the indicators; (5) judging the quality level of the search data from the calculated results according to preset criteria. The method makes it possible to quantify keyword quality concretely and to score keywords from different angles, so that keyword quality can be improved and optimized in a more targeted way, giving users a better experience.

Description

A search data quality statistical method
Technical field
The invention belongs to the field of data statistics and specifically relates to a search data quality statistical method.
Background
In current information search, users have to pick the information they want out of a large number of search results. To make the results more accurate and of higher quality, and so reduce the screening work left to the user, the quality of the data must be improved.
However, as the volume of information on the internet grows rapidly, large amounts of redundant search data accumulate, and invalid or low-quality search data hinders users from finding the information they need. Against this background, the quality of existing search data needs to be analyzed so that, as far as possible, the higher-quality information is shown to users.
Summary of the invention
To address the deficiencies of the prior art, the invention provides a search data quality statistical method that quantifies the quality of the search information and data captured and indexed by the crawler and formulates indicator items. The search information and data captured and indexed by the crawler are deduplicated through a quality-influence data test to obtain the indicator items that have a relatively large influence on the overall judgment of search data quality.
The obtained indicator items are then classified according to their properties:
Comprehensiveness factor indicators: "average number of news widget data items pushed to the front end" ...
Accuracy factor indicators: "ratio of dead page links" ...
Intelligence factor indicators: "unique visitors of visited pages" ...
Interactivity factor indicators: "ratio of the total number of interactive widget columns (Q&A widget, rating widget, voting widget) to the total number of SRP columns" ...
Aesthetics factor indicators: "ratio of dead picture links (including video thumbnails)" ...
Base scores are set for the individual indicator items, for the indicator categories and for all indicator items as a whole. Score weight formulas are likewise set for the individual indicator items, for the indicator categories and for all indicator items as a whole.
Data is then pushed and the data feedback for each individual indicator item is obtained. Using the preset indicator weight formulas and the preset base scores, the individual indicator item scores, the category scores and the overall score are cross-calculated. Finally, the calculated results are judged against predetermined criteria to determine the quality grade of the search data.
The object of the invention is achieved by the following technical solution:
A search data quality statistical method, the improvement of which is that the method comprises:
(1) deduplicating, through a quality data test, the search information and data captured and indexed by the crawler, to obtain the indicator items that have a relatively large influence on the overall judgment of search data quality;
(2) classifying the obtained indicator items according to their properties;
(3) setting the base scores of the indicator items and the indicator item score weight formulas;
(4) obtaining the individual indicator item scores, the category scores and the overall score;
(5) judging the quality grade of the search data from the calculated results according to predetermined criteria.
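As a minimal sketch of how steps (3) and (4) might fit together, the following Python fragment multiplies each item's weight by its base score and then sums the results per category and overall. The class name `IndicatorItem`, its field names and the two helper functions are illustrative assumptions. The product of weight and base score and the summing of all item scores into an overall score follow the method's own description; summing item scores to obtain a category score is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndicatorItem:
    """One individual indicator item (hypothetical structure)."""
    name: str
    category: str      # e.g. "comprehensive", "interactive"
    base_score: float  # base score of the item, e.g. 10, 20, 30 or 40
    weight: float      # weight derived from measured data, 0.0 to 1.0

    @property
    def score(self) -> float:
        # Item score = indicator weight x item base score.
        return self.weight * self.base_score


def category_scores(items):
    """Sum item scores per category (assumed aggregation rule)."""
    totals = {}
    for item in items:
        totals[item.category] = totals.get(item.category, 0.0) + item.score
    return totals


def overall_score(items):
    """Sum of all item scores, used as the overall score."""
    return sum(item.score for item in items)
```

For instance, a 20-point item with a 60% weight contributes 12 points to its category and to the overall score.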
Preferably, the classification in step (2) comprises five categories: comprehensiveness factor indicators, aesthetics factor indicators, intelligence factor indicators, accuracy factor indicators and interactivity factor indicators.
Preferably, step (4) comprises multiplying the indicator weight by the base score of the individual indicator item to obtain the final indicator item score.
Preferably, the base score of an individual indicator item takes one of four levels: 10, 20, 30 or 40 points.
Preferably, the base score of an individual indicator item has a minimum of 10 points and a maximum of 40 points; it varies with the importance of the individual indicator within the overall quality evaluation system and may be changed at any time.
Preferably, the base score of an indicator category has a minimum of 10 points but no maximum; it varies with the number of individual indicators in the category, or with adjustments to the base scores of those indicators, and may be changed at any time.
Preferably, the base score of all indicator items as a whole has a minimum of 10 points but no maximum; it varies with the total number of individual indicators, or with adjustments to the base scores of all individual indicators, and may be changed at any time.
Preferably, step (5) comprises:
if the sum of all individual item scores is greater than or equal to 0 but less than 60% of the minimum individual-item benchmark score, the result is a rejection warning;
if the sum of all individual item scores is greater than or equal to 60% of the minimum individual-item benchmark score but less than the minimum individual-item benchmark score, the result is an improvement prompt;
if the sum of all individual item scores is greater than or equal to the minimum individual-item benchmark score but less than the maximum individual-item benchmark score, the result is acceptable;
if the sum of all individual item scores is greater than or equal to the maximum individual-item benchmark score, the result is excellent quality.
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention makes it possible to quantify keyword quality concretely and to score keywords from different angles, so that keywords can be improved and optimized in a more targeted way, raising keyword quality and giving users a better experience.
The indicator data obtained by the algorithm of the invention is reliable. Applied to Zhongsou comprehensive search (http://www.zhongsou.com), it has cumulatively identified tens of thousands of search results of unsatisfactory quality and has played an irreplaceable role in improving the search data quality of Zhongsou comprehensive search (http://www.zhongsou.com).
Brief description of the drawings
Fig. 1 is a structural diagram of the search data quality statistical method provided by the invention.
Fig. 2 is a schematic diagram of the results of the search data quality statistical method provided by the invention.
Detailed description
Specific embodiments of the invention are described in further detail below with reference to the drawings.
The comprehensiveness, intelligence and interactivity factor indicators are taken as examples:
1. Comprehensiveness factor indicators: the category total is 50 points, divided into 4 individual indicators.
1) Average number of news widget data items pushed to the front end
The cycle is 72 hours and the individual item score is 20 points. The timeliness of pushing is used to judge whether the search results are complete and comprehensive. The formula is Yc = YC/AA, where YC is the total number of news widget data items pushed to the front end within 72 hours and AA is the total number of news widget columns. If 0 ≤ Yc < 1, the indicator weight is 0; if 1 ≤ Yc < 2, the indicator weight is 60%; if 2 ≤ Yc < 3, the indicator weight is 80%; if Yc ≥ 3, the indicator weight is 100%. The final score is the indicator weight multiplied by the individual item score.
2) Average number of web page widget data items pushed to the front end
The cycle is 72 hours and the individual item score is 20 points. The timeliness of pushing is used to judge whether the search results are complete and comprehensive. The formula is Zc = ZC/BB, where ZC is the total number of web page widget data items pushed to the front end within 72 hours and BB is the total number of web page widget columns. If 0 ≤ Zc < 1, the indicator weight is 0; if 1 ≤ Zc < 2, the indicator weight is 60%; if 2 ≤ Zc < 3, the indicator weight is 80%; if Zc ≥ 3, the indicator weight is 100%. The final score is the indicator weight multiplied by the individual item score.
3) Average number of picture widget data items pushed to the front end
The cycle is 72 hours and the individual item score is 10 points. The timeliness of pushing is used to judge whether the search results are complete and comprehensive. The formula is Xc = XC/CC, where XC is the total number of picture widget data items pushed to the front end within 72 hours and CC is the total number of picture widget columns. If 0 ≤ Xc < 1, the indicator weight is 0; if 1 ≤ Xc < 2, the indicator weight is 60%; if 2 ≤ Xc < 3, the indicator weight is 80%; if Xc ≥ 3, the indicator weight is 100%. The final score is the indicator weight multiplied by the individual item score.
4) Average number of picture widget data items pushed to the front end
The cycle is 72 hours and the individual item score is 10 points. The timeliness of pushing is used to judge whether the search results are complete and comprehensive. The formula is Rc = RC/DD, where RC is the total number of widget data items pushed to the front end within 72 hours and DD is the total number of widget columns. If 0 ≤ Rc < 1, the indicator weight is 0; if 1 ≤ Rc < 2, the indicator weight is 60%; if 2 ≤ Rc < 3, the indicator weight is 80%; if Rc ≥ 3, the indicator weight is 100%. The final score is the indicator weight multiplied by the individual item score.
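The stepped weighting shared by the four push-average indicators above can be sketched in Python as follows. This is an illustration only: the function names `push_average_weight` and `item_score` and their parameter names are assumptions, while the thresholds (1, 2, 3) and the weights (0, 60%, 80%, 100%) are taken from the text.

```python
def push_average_weight(pushed_total: int, column_total: int) -> float:
    """Return the indicator weight for a push-average indicator.

    pushed_total: items pushed to the front end within the 72-hour cycle.
    column_total: total number of columns of the same type.
    """
    if column_total <= 0:
        raise ValueError("column_total must be positive")
    if pushed_total < 0:
        raise ValueError("pushed_total must not be negative")

    ratio = pushed_total / column_total  # e.g. Yc = YC / AA
    if ratio < 1:
        return 0.0
    if ratio < 2:
        return 0.6
    if ratio < 3:
        return 0.8
    return 1.0


def item_score(weight: float, base_score: float) -> float:
    """Final item score = indicator weight x individual item score."""
    return weight * base_score
```

For example, 25 items pushed across 10 columns give a ratio of 2.5 and a weight of 80%, so a 20-point indicator scores 16 points.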
2. Intelligence factor indicators: the category total is 50 points, divided into 4 individual indicators.
1) Number of outstanding entry improvement prompts
The cycle is 24 hours and the individual item score is 20 points. The indicator total is denoted E, and the E mean is the average, on the statistics day, of the indicator totals of keywords in the same final-level category. If E < 0.6 × E mean, the indicator weight is 0; if 0.6 × E mean ≤ E < 0.8 × E mean, the indicator weight is 60%; if 0.8 × E mean ≤ E < 1 × E mean, the indicator weight is 80%; if E ≥ 1 × E mean, the indicator weight is 100%. The final score is the indicator weight multiplied by the individual item score.
2) Bounce rate
The cycle is 24 hours and the individual item score is 10 points. This indicator belongs to user behavior analysis. The formula is N = PV/UV, where PV is the number of searches of exit pages and UV is the number of unique visitors of visited pages. If N < 0.6 × N mean, the indicator weight is 0; if 0.6 × N mean ≤ N < 0.8 × N mean, the indicator weight is 60%; if 0.8 × N mean ≤ N < 1 × N mean, the indicator weight is 80%; if N ≥ 1 × N mean, the indicator weight is 100%. The final score is the indicator weight multiplied by the individual item score.
3) Unique visitors of visited pages
The cycle is 24 hours and the individual item score is 10 points. The number of unique visitors of visited pages is denoted UV. If UV < 0.6 × UV mean, the indicator weight is 0; if 0.6 × UV mean ≤ UV < 0.8 × UV mean, the indicator weight is 60%; if 0.8 × UV mean ≤ UV < 1 × UV mean, the indicator weight is 80%; if UV ≥ 1 × UV mean, the indicator weight is 100%. The final score is the indicator weight multiplied by the individual item score.
4) New unique visitors of visited pages
The cycle is 24 hours and the individual item score is 10 points. The number of new unique visitors of visited pages is denoted UV(NEW). If UV(NEW) < 0.6 × UV(NEW) mean, the indicator weight is 0; if 0.6 × UV(NEW) mean ≤ UV(NEW) < 0.8 × UV(NEW) mean, the indicator weight is 60%; if 0.8 × UV(NEW) mean ≤ UV(NEW) < 1 × UV(NEW) mean, the indicator weight is 80%; if UV(NEW) ≥ 1 × UV(NEW) mean, the indicator weight is 100%. The final score is the indicator weight multiplied by the individual item score.
3. Interactivity factor indicators: the category total is 20 points, divided into 1 individual indicator.
1) Ratio of the total number of interactive widget columns (Q&A widget, rating widget, voting widget) to the total number of SRP columns
The cycle is 24 hours and the individual item score is 10 points. The formula is T = T1/T0, where T1 is the total number of interactive widget columns and T0 is the total number of widget columns of the SRP. If T < 0.6 × T mean, the indicator weight is 0; if 0.6 × T mean ≤ T < 0.8 × T mean, the indicator weight is 60%; if 0.8 × T mean ≤ T < 1 × T mean, the indicator weight is 80%; if T ≥ 1 × T mean, the indicator weight is 100%. The final score is the indicator weight multiplied by the individual item score.
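The indicators with a 24-hour cycle all compare a measured value with its mean using the same 0.6 / 0.8 / 1.0 bands. A minimal Python sketch of that shared rule follows; the function name `relative_weight` and its parameters are assumptions, and how the mean is collected (for example, over keywords of the same category) is left to the caller.

```python
def relative_weight(value: float, mean_value: float) -> float:
    """Return the indicator weight for a value judged against its mean.

    Bands taken from the text:
      value < 0.6 x mean          -> 0
      0.6 x mean <= value < 0.8   -> 60%
      0.8 x mean <= value < mean  -> 80%
      value >= mean               -> 100%
    """
    if mean_value <= 0:
        raise ValueError("mean_value must be positive")

    ratio = value / mean_value
    if ratio < 0.6:
        return 0.0
    if ratio < 0.8:
        return 0.6
    if ratio < 1.0:
        return 0.8
    return 1.0
```

For example, a value of 90 against a mean of 100 gives a ratio of 0.9 and a weight of 80%, so a 10-point indicator scores 8 points.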
As shown in Fig. 2, the sum of all individual item scores is the overall score, which is judged as follows:
1. If the sum of all individual item scores is greater than or equal to 0 but less than 60% of the minimum individual-item benchmark score, the result is a rejection warning. This is a serious negative assessment of the existing search result quality: the status quo cannot be maintained and the results must be taken offline.
2. If the sum of all individual item scores is greater than or equal to 60% of the minimum individual-item benchmark score but less than the minimum individual-item benchmark score, the result is an improvement prompt. This is a mild negative assessment of the existing search result quality: the status quo can be maintained, but optimization is required.
3. If the sum of all individual item scores is greater than or equal to the minimum individual-item benchmark score but less than the maximum individual-item benchmark score, the result is acceptable. This is a basic approval of the existing search result quality: the status quo can be maintained and no optimization is needed.
4. If the sum of all individual item scores is greater than or equal to the maximum individual-item benchmark score, the result is excellent quality. This is a full approval of the existing search result quality: the status quo can be maintained and the results must be recommended.
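The four-level judgment above can be sketched in Python as follows. The function name `quality_grade`, its parameter names and the English grade labels are assumptions made for illustration. The sketch also assumes that the third level is bounded above by the maximum benchmark score, that is, "less than" the maximum, since otherwise the third and fourth levels would overlap.

```python
def quality_grade(total_score: float,
                  min_benchmark: float,
                  max_benchmark: float) -> str:
    """Map the sum of all individual item scores to a quality grade.

    min_benchmark / max_benchmark: the minimum and maximum
    individual-item benchmark scores (hypothetical parameter names).
    """
    if total_score < 0:
        raise ValueError("total_score must not be negative")
    if min_benchmark > max_benchmark:
        raise ValueError("min_benchmark must not exceed max_benchmark")

    if total_score < 0.6 * min_benchmark:
        return "rejection warning"    # must be taken offline
    if total_score < min_benchmark:
        return "improvement prompt"   # keep, but must optimize
    if total_score < max_benchmark:   # assumed upper bound of level 3
        return "acceptable"           # keep, no optimization needed
    return "excellent quality"        # keep, and recommend
```

For example, with a minimum benchmark of 10 and a maximum of 40, a total of 8 points yields an improvement prompt and a total of 40 points yields excellent quality.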
Example
On August 1, 2013, this search data quality statistical algorithm was applied to Zhongsou comprehensive search (http://www.zhongsou.com). After repeated verification against nearly two months of data, the indicator data obtained by the algorithm proved reliable. Applied to Zhongsou comprehensive search (http://www.zhongsou.com), it has cumulatively identified tens of thousands of search results of unsatisfactory quality and has played an irreplaceable role in improving the search data quality of Zhongsou comprehensive search (http://www.zhongsou.com).
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently replaced, and that any modification or equivalent replacement that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.

Claims (8)

1. A search data quality statistical method, characterized in that the method comprises:
(1) deduplicating, through a quality data test, the search information and data captured and indexed by the crawler, to obtain the indicator items that have a relatively large influence on the overall judgment of search data quality;
(2) classifying the obtained indicator items according to their properties;
(3) setting the base scores of the indicator items and the indicator item score weight formulas;
(4) obtaining the individual indicator item scores, the category scores and the overall score;
(5) judging the quality grade of the search data from the calculated results according to predetermined criteria.
2. The search data quality statistical method according to claim 1, characterized in that the classification in step (2) comprises five categories: comprehensiveness factor indicators, aesthetics factor indicators, intelligence factor indicators, accuracy factor indicators and interactivity factor indicators.
3. The search data quality statistical method according to claim 1, characterized in that step (4) comprises multiplying the indicator weight by the base score of the individual indicator item to obtain the final indicator item score.
4. The search data quality statistical method according to claim 1, characterized in that the base score of an individual indicator item takes one of four levels: 10, 20, 30 or 40 points.
5. The search data quality statistical method according to claim 1, characterized in that the base score of an individual indicator item has a minimum of 10 points and a maximum of 40 points, varies with the importance of the individual indicator within the overall quality evaluation system, and may be changed at any time.
6. The search data quality statistical method according to claim 1, characterized in that the base score of an indicator category has a minimum of 10 points but no maximum, varies with the number of individual indicators in the category or with adjustments to the base scores of those indicators, and may be changed at any time.
7. The search data quality statistical method according to claim 1, characterized in that the base score of all indicator items as a whole has a minimum of 10 points but no maximum, varies with the total number of individual indicators or with adjustments to the base scores of all individual indicators, and may be changed at any time.
8. The search data quality statistical method according to claim 1, characterized in that step (5) comprises:
if the sum of all individual item scores is greater than or equal to 0 but less than 60% of the minimum individual-item benchmark score, the result is a rejection warning;
if the sum of all individual item scores is greater than or equal to 60% of the minimum individual-item benchmark score but less than the minimum individual-item benchmark score, the result is an improvement prompt;
if the sum of all individual item scores is greater than or equal to the minimum individual-item benchmark score but less than the maximum individual-item benchmark score, the result is acceptable;
if the sum of all individual item scores is greater than or equal to the maximum individual-item benchmark score, the result is excellent quality.
CN201310539908.5A 2013-11-04 2013-11-04 Searching data quality statistical method Expired - Fee Related CN103544314B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310539908.5A CN103544314B (en) 2013-11-04 2013-11-04 Searching data quality statistical method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310539908.5A CN103544314B (en) 2013-11-04 2013-11-04 Searching data quality statistical method

Publications (2)

Publication Number Publication Date
CN103544314A 2014-01-29
CN103544314B 2017-12-12

Family

ID=49967766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310539908.5A Expired - Fee Related CN103544314B (en) 2013-11-04 2013-11-04 Searching data quality statistical method

Country Status (1)

Country Link
CN (1) CN103544314B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834731A (en) * 2015-05-15 2015-08-12 百度在线网络技术(北京)有限公司 Recommendation method and device for self-media information
CN106326447A (en) * 2016-08-26 2017-01-11 北京量科邦信息技术有限公司 Detection method and system of data captured by crowd sourcing network crawlers
CN106649338A (en) * 2015-10-30 2017-05-10 中国移动通信集团公司 Information filtering policy generation method and apparatus
CN108829750A (en) * 2018-05-24 2018-11-16 国信优易数据有限公司 A kind of quality of data determines system and method
CN109815403A (en) * 2019-01-29 2019-05-28 北京奇艺世纪科技有限公司 A kind of screening sample method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102339296A (en) * 2010-07-26 2012-02-01 阿里巴巴集团控股有限公司 Method and device for sorting query results
US20120158702A1 (en) * 2010-12-15 2012-06-21 Microsoft Corporation Classifying Results of Search Queries
CN102567475A (en) * 2010-12-15 2012-07-11 微软公司 User interface for interactive query reformulation

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834731A (en) * 2015-05-15 2015-08-12 百度在线网络技术(北京)有限公司 Recommendation method and device for self-media information
CN104834731B (en) * 2015-05-15 2019-02-26 百度在线网络技术(北京)有限公司 A kind of recommended method and device from media information
CN106649338A (en) * 2015-10-30 2017-05-10 中国移动通信集团公司 Information filtering policy generation method and apparatus
CN106649338B (en) * 2015-10-30 2020-08-21 中国移动通信集团公司 Information filtering strategy generation method and device
CN106326447A (en) * 2016-08-26 2017-01-11 北京量科邦信息技术有限公司 Detection method and system of data captured by crowd sourcing network crawlers
CN108829750A (en) * 2018-05-24 2018-11-16 国信优易数据有限公司 A kind of quality of data determines system and method
CN109815403A (en) * 2019-01-29 2019-05-28 北京奇艺世纪科技有限公司 A kind of screening sample method and device

Also Published As

Publication number Publication date
CN103544314B (en) 2017-12-12

Similar Documents

Publication Publication Date Title
CN103544314A (en) Searching data quality statistical method
CN106407484B (en) Video tag extraction method based on barrage semantic association
CN103699626B (en) Method and system for analysing individual emotion tendency of microblog user
CN108460082B (en) Recommendation method and device and electronic equipment
CN103984681B (en) News event evolution analysis method based on time sequence distribution information and topic model
US8375073B1 (en) Identification and ranking of news stories of interest
CN108259929B (en) Prediction and caching method for video active period mode
CN103116605A (en) Method and system of microblog hot events real-time detection based on detection subnet
CN103310003A (en) Method and system for predicting click rate of new advertisement based on click log
US20110004573A1 (en) Identifying training documents for a content classifier
CN102750320B (en) Method, device and system for calculating network video real-time attention
CN103049435A (en) Text fine granularity sentiment analysis method and text fine granularity sentiment analysis device
CN104994424B (en) A kind of method and apparatus for building audio and video standard data set
CN102880712A (en) Method and system for sequencing searched network videos
CN107273295B (en) Software problem report classification method based on text chaos
CN105718587A (en) Network content resource evaluation method and evaluation system
CA2777506A1 (en) System and method for grouping multiple streams of data
CN106156372A (en) The sorting technique of a kind of internet site and device
WO2010030982A3 (en) Associating an entity with a category
CN102637172B (en) Webpage blocking marking method and system
US10467255B2 (en) Methods and systems for analyzing reading logs and documents thereof
CN102591977A (en) Method and system for sequencing search results
CN103744838B (en) A kind of Chinese emotion digest system and method for measuring main flow emotion information
CN106204103A (en) The method of similar users found by a kind of moving advertising platform
CN105224955A (en) Based on the method for microblogging large data acquisition network service state

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20170426

Address after: 100086 Beijing, Haidian District, North Third Ring Road West, No. 43, building 5, floor 08-09, No. 2

Applicant after: BEIJING ZHONGSOU CLOUD BUSINESS NETWORK TECHNOLOGY Co.,Ltd.

Address before: Room 0902, Shou Heng Technology Building, No. 51 Xueyuan Road, Haidian District, Beijing 100191

Applicant before: BEIJING ZHONGSOU NETWORK TECHNOLOGY Co.,Ltd.

GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171212

Termination date: 20211104