CN103310014B

CN103310014B - A method for improving the accuracy of retrieval results

Info

Publication number: CN103310014B
Application number: CN201310276040.4A
Authority: CN
Inventors: 王宝会; 王洪军
Original assignee: Beihang University
Current assignee: Beijing easy to use Lianyou Technology Co.,Ltd.
Priority date: 2013-07-02
Filing date: 2013-07-02
Publication date: 2016-06-29
Anticipated expiration: 2033-07-02
Also published as: CN103310014A

Abstract

A method for improving the accuracy of retrieval results, comprising the following steps: (1) classify the HTML tags and set a weight coefficient for each class of tag; (2) structure the HTML page content according to the classification of step (1) to form structured data, and generate index data for each class of tag; (3) using the index data generated in step (2) and the weight coefficients obtained in step (1), calculate with a weighting algorithm the relevance between the search terms and each HTML page, and derive from the relevance calculation the frequency with which the tag occurs in the HTML page. The invention improves the retrieval rate on HTML pages, achieves a high degree of matching with accurate data under comparable conditions, and solves the problem that searches are inaccurate and slow when an HTML page contains many kinds of tags.

Description

A method for improving the accuracy of retrieval results

Technical field

The present invention relates to the fields of full-text search and text mining, and in particular to optimizing search results for web pages and analyzing web page content in full-text search.

Background technology

In full-text search, the relevance between a search term and a document is typically calculated with the TF-IDF algorithm. TF-IDF is a statistical method for assessing how important a word is to one document in a document set or corpus. The greater the importance, the more relevant the word is considered to be to that document, and the higher the relevance, the nearer the top the document is ranked in the final list of retrieval results.

The theoretical foundation of TF-IDF derives from Shannon's information theory. Its main idea is that the importance of a word or phrase to a document is directly proportional to the frequency with which it occurs in that document (TF: Term Frequency) and inversely proportional to the frequency with which it occurs in other documents (Inverse Document Frequency, abbreviated IDF); that is, TFIDF = TF × IDF.
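The idea can be illustrated with a minimal Python sketch. The corpus, the whitespace-tokenized documents and the function name `tf_idf` are illustrative assumptions; the smoothed base-10 form log(D / (1 + d)) is borrowed from the worked example later in this document and is only one of several common IDF variants.

```python
import math

def tf_idf(term, doc, corpus):
    """Plain TF-IDF of `term` for one tokenized document in a corpus."""
    tf = doc.count(term) / len(doc)               # frequency inside this document
    df = sum(1 for d in corpus if term in d)      # number of documents containing the term
    idf = math.log10(len(corpus) / (1 + df))      # rarer across the corpus -> larger value
    return tf * idf

# Hypothetical four-document corpus, already split into words.
corpus = [
    ["search", "engine", "index"],
    ["web", "page", "title"],
    ["web", "page", "layout"],
    ["cooking", "recipe", "book"],
]
score = tf_idf("search", corpus[0], corpus)
```

Note that every occurrence counts equally here, wherever in the document it appears.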

For ordinary text documents, which have no distinctions of position or structure, this algorithm solves the relevance calculation well. For certain types of document such as HTML pages, however, feature words in different positions reflect the document content to different degrees, so they should also carry different weights in the relevance calculation; the TF-IDF algorithm does not take the structural features of the document into account when calculating relevance.

Summary of the invention

Problem solved by the invention: to overcome the deficiencies of the prior art by providing a method for improving the accuracy of retrieval results that improves the retrieval rate on HTML pages, achieves a high degree of matching with accurate data under comparable conditions, and solves the problem that searches are inaccurate and slow when an HTML page contains many kinds of tags.

Technical solution of the invention: a method for improving the accuracy of retrieval results, implemented in the following steps:

1. Classify the HTML tags and set a weight coefficient for each class of tag.

(11) Classify the tags according to the HTML specification;

(12) Set a weight coefficient for each tag according to the meaning and importance of the tags classified in (11);

(13) Output the tag weighting results in the form (tag name, weight value).

Taking the classification of semantically meaningful tags under the HTML specification as an example, the HTML tags are classified as h1, h2, em, caption, li, th, title, ..., and a weight coefficient is set for each class of tag; the more important the tag, the higher its weight coefficient. For example:

(title, 0.8), (h1, 0.7), (h2, 0.65), (h3, 0.6), (h4, 0.5), (li, 0.5), (h5, 0.45), (h6, 0.4), (th, 0.4), (caption, 0.4), (em, 0.3), (strong, 0.3), (b, 0.3), ...

In addition, the weight coefficient for content in other regions of the page is 0.2.
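One possible way to hold these (tag name, weight value) pairs in code is a simple lookup table. The sketch below uses the example weights listed above; the names `TAG_WEIGHTS`, `DEFAULT_WEIGHT` and `weight_for` are illustrative assumptions, not part of the described method.

```python
# Example weights from the text, keyed by lowercase tag name.
TAG_WEIGHTS = {
    "title": 0.8, "h1": 0.7, "h2": 0.65, "h3": 0.6, "h4": 0.5, "li": 0.5,
    "h5": 0.45, "h6": 0.4, "th": 0.4, "caption": 0.4,
    "em": 0.3, "strong": 0.3, "b": 0.3,
}
DEFAULT_WEIGHT = 0.2  # content in other regions of the page

def weight_for(tag):
    """Return the weight coefficient for a tag, falling back to the default."""
    return TAG_WEIGHTS.get(tag.lower(), DEFAULT_WEIGHT)

# Output in (tag name, weight value) form, most important tag first.
weighted = sorted(TAG_WEIGHTS.items(), key=lambda kv: kv[1], reverse=True)
```

Keeping the weights in one table makes it easy to change the classification or the coefficients without touching the scoring code.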

2. Structure the HTML page content according to the above classification to form structured data, and generate index data for each class of tag (Fig. 1).

The specific steps for structuring the HTML page content and generating the index data are:

2.1: Analyze the HTML page content and, according to the tag classification set in step 1, convert the HTML content into structured data. The structured data is represented as a two-dimensional table whose columns are the tag classes; each HTML page is converted into one record of the table. If a page contains several pieces of data under the same class of tag, they are merged into a single field and delimited by a separator. For example, if an HTML page has several h2 tags, the contents of these h2 tags are merged, delimited by a specific separator such as "^", and placed in the h2 field of that record.
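Step 2.1 could be sketched with Python's standard `html.parser` module as follows. The class `FieldExtractor`, the function `structure_page`, the `fulltext` field name and the rule that text is attributed only to the innermost open weighted tag are all assumptions made for this illustration.

```python
from html.parser import HTMLParser

WEIGHTED_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6",
                 "li", "th", "caption", "em", "strong", "b"}
SEPARATOR = "^"

class FieldExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []       # weighted tags that are currently open
        self.fields = {}      # tag name -> list of text pieces, one per element
        self.full_text = []   # every text piece, for the full-text field

    def handle_starttag(self, tag, attrs):
        if tag in WEIGHTED_TAGS:
            self.stack.append(tag)
            self.fields.setdefault(tag, []).append("")

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed inner tags).
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.full_text.append(text)
        if self.stack:
            tag = self.stack[-1]  # innermost open weighted tag only
            self.fields[tag][-1] = (self.fields[tag][-1] + " " + text).strip()

def structure_page(html):
    """Convert one HTML page into one record (a dict of field -> text)."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    record = {tag: SEPARATOR.join(p for p in parts if p)
              for tag, parts in parser.fields.items()}
    record["fulltext"] = " ".join(parser.full_text)
    return record

page = ("<html><head><title>News</title></head><body>"
        "<h2>First</h2><p>Body text</p><h2>Second</h2></body></html>")
record = structure_page(page)
```

In this sketch the two h2 elements end up in a single h2 field joined by "^", mirroring the merging rule described above.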

2.2: According to the tag classification, index every row of the structured data field by field; for example, build an h1 field index over all records, an h2 field index, a full-text index, and so on. The index data for each field is called a field index database. It records, for each word, which records contain the word in this field, how many times it occurs, and at what position; it also records the number of records in which each word occurs.
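A field index database of this kind might look like the following sketch. The function name `build_field_index`, the dictionary layout and the whitespace tokenization are assumptions for illustration; Chinese text in particular would need a proper word segmenter.

```python
from collections import defaultdict

def build_field_index(records, field, separator="^"):
    """Index one field across all records.

    records: {record_id: {field_name: text}}
    Returns postings (word -> {record_id: (occurrences, first_position)}),
    the word count of the field in each record, and the total record count.
    """
    postings = defaultdict(dict)
    lengths = {}
    for rec_id, record in records.items():
        words = record.get(field, "").replace(separator, " ").split()
        lengths[rec_id] = len(words)
        for pos, word in enumerate(words, start=1):
            count, first = postings[word].get(rec_id, (0, pos))
            postings[word][rec_id] = (count + 1, first)
    return {"postings": dict(postings), "lengths": lengths,
            "total_records": len(records)}

# Hypothetical structured records keyed by record number.
records = {
    1: {"h2": "first exhibit^island news"},
    2: {"h2": "exhibit opens"},
    3: {"title": "other"},
}
h2_index = build_field_index(records, "h2")
exhibit_record_count = len(h2_index["postings"]["exhibit"])
```

The number of records in which a word occurs is simply the size of its postings entry, so it does not need to be stored separately in this layout.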

3. At retrieval time, calculate the relevance between the search terms and each HTML page according to the weighting algorithm.

The relevance is calculated as follows:

3.1 Suppose the user enters several search terms in one query. Searching the full-text field first yields the set of hit records; the relevance between each search term and each hit record is then calculated, as follows:

3.2 First calculate the relevance between each search term and one field of a record, as follows:

3.3 In the index data, look up the number of times a word occurs in a given field of the record, denoted n_i, and the total number of words in that field of the record, denoted N_i; then calculate the TF value with the formula TF_i = n_i / N_i.

3.4 In the index data, look up the number of records in which the word occurs, denoted d_j, and the total number of records, denoted D; then calculate the IDF value with the formula IDF_j = log(D / (1 + d_j)) (logarithm to base 10).

3.5 Look up the weight coefficient set for this field in step 1, denoted W_k; then calculate the relevance TIW_x of the word to one field of the record with the formula TIW_x = TF_i × IDF_j × W_k.

The relevance of all search terms to one record is then calculated with the formula

G = Σ_{y=1}^{n} ( (1/m) Σ_{x=1}^{m} TIW_x )

where x indexes the fields that a search term hits in the record and runs from 1 to m, and y indexes the search terms entered in one query and runs from 1 to n. The calculation is carried out in the following two steps.

3.6 Repeat steps 3.3 to 3.5 to calculate the relevance of the search term to the other hit fields, then calculate the relevance of the search term to the record with the formula

GC = (1/m) Σ_{x=1}^{m} TIW_x

3.7 For each search term entered by the user, repeat step 3.6 to calculate the relevance GC_y of each search term to each hit record, then use the formula

G = Σ_{y=1}^{n} GC_y

to calculate the relevance between each hit record and the user's query.
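Steps 3.3 to 3.7 can be put together in a short Python sketch. The function names are illustrative assumptions, and the TF, IDF and per-term averaging forms follow the arithmetic of the worked example later in this document (TF = n/N, IDF = log10(D / (1 + d)), GC as the mean over hit fields).

```python
import math

def tiw(n_i, N_i, d_j, D, W_k):
    """Weighted relevance of one word to one field of one record (step 3.5)."""
    tf = n_i / N_i                       # step 3.3
    idf = math.log10(D / (1 + d_j))      # step 3.4
    return tf * idf * W_k

def term_relevance(hit_field_scores):
    """GC: mean TIW over the m fields the term hits in a record (step 3.6)."""
    if not hit_field_scores:
        return 0.0
    return sum(hit_field_scores) / len(hit_field_scores)

def record_relevance(term_scores):
    """G: sum of GC_y over the n search terms of the query (step 3.7)."""
    return sum(term_scores)

# Hypothetical term hitting a title field (weight 0.8) and an h1 field (weight 0.7).
field_scores = [tiw(1, 3, 3, 100, 0.8), tiw(1, 1, 3, 100, 0.7)]
gc = term_relevance(field_scores)
g = record_relevance([gc, 0.01, 0.02])   # two further hypothetical GC values
```

Hit records would then be sorted by G in descending order to produce the result list.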

The advantage of the invention is that, for HTML documents specifically, a weighted relevance algorithm based on HTML tags is used to calculate the relevance between the user's query and the hit records, so that the relevance value reflects the structural features of the HTML document. The retrieval results are then sorted by relevance and presented to the user, giving the user a better experience.

Brief description of the drawings

Fig. 1 is a diagram of the HTML page structuring and index building process;

Fig. 2 is a flowchart of the invention.

Detailed description of the invention

Embodiments of the invention are described in detail below with reference to a concrete example.

HTML (HyperText Markup Language) is a markup language designed for "creating web pages and other information that can be viewed in a web browser". HTML is used to structure information such as headings, paragraphs and lists, and can also to some extent describe the appearance and semantics of a document. It was created in 1982 by Tim Berners-Lee; the IETF developed HTML further using a simplified SGML (Standard Generalized Markup Language) syntax, and it later became an international standard maintained by the World Wide Web Consortium (W3C). The W3C currently recommends writing web pages to the XHTML 1.1, XHTML 1.0 or HTML 4.01 standards, but many web pages have already switched to the newer HTML5 (Google, for example).

Analyzing the HTML standard yields the complete list of HTML tags, shown in the following table:

Tag	Description	DTD
<!--...-->	Defines a comment.	STF
<!DOCTYPE>	Defines the document type.	STF
<a>	Defines an anchor.	STF
<abbr>	Defines an abbreviation.	STF
<acronym>	Defines an acronym.	STF
<address>	Defines contact information for the author or owner of the document.	STF
<applet>	Deprecated. Defines an embedded applet.	TF
<area>	Defines an area inside an image map.	STF
<b>	Defines bold text.	STF
<base>	Defines the default address or default target for all links on the page.	STF
<basefont>	Deprecated. Defines the default font, color or size of text on the page.	TF
<bdo>	Defines the text direction.	STF
<big>	Defines big text.	STF
<blockquote>	Defines a long quotation.	STF
<body>	Defines the body of the document.	STF
<br>	Defines a simple line break.	STF
<button>	Defines a button (push button).	STF
<caption>	Defines a table caption.	STF
<center>	Deprecated. Defines centered text.	TF
<cite>	Defines a citation.	STF
<code>	Defines computer code text.	STF
<col>	Defines attribute values for one or more columns in a table.	STF
<colgroup>	Defines a group of columns in a table for formatting.	STF
<dd>	Defines the description of an item in a definition list.	STF
<del>	Defines deleted text.	STF
<dir>	Deprecated. Defines a directory list.	TF
<div>	Defines a section in a document.	STF
<dfn>	Defines a definition term.	STF
<dl>	Defines a definition list.	STF
<dt>	Defines an item in a definition list.	STF
<em>	Defines emphasized text.	STF
<fieldset>	Defines a border around elements in a form.	STF
<font>	Deprecated. Defines the font, size and color of text.	TF
<form>	Defines an HTML form for user input.	STF
<frame>	Defines a window or frame in a frameset.	F
<frameset>	Defines a frameset.	F
<h1> to <h6>	Defines HTML headings.	STF
<head>	Defines information about the document.	STF
<hr>	Defines a horizontal rule.	STF
<html>	Defines an HTML document.	STF
<i>	Defines italic text.	STF
<iframe>	Defines an inline frame.	TF
<img>	Defines an image.	STF
<input>	Defines an input control.	STF
<ins>	Defines inserted text.	STF
<isindex>	Deprecated. Defines a searchable index related to the document.	TF
<kbd>	Defines keyboard text.	STF
<label>	Defines a label for an input element.	STF
<legend>	Defines a caption for a fieldset element.	STF
<li>	Defines a list item.	STF
<link>	Defines the relationship between the document and an external resource.	STF
<map>	Defines an image map.	STF
<menu>	Deprecated. Defines a menu list.	TF
<meta>	Defines meta information about the HTML document.	STF
<noframes>	Defines alternate content for users whose browsers do not support frames.	TF
<noscript>	Defines alternate content for users whose browsers do not support client-side scripts.	STF
<object>	Defines an embedded object.	STF
<ol>	Defines an ordered list.	STF
<optgroup>	Defines a group of related options in a selection list.	STF
<option>	Defines an option in a selection list.	STF
<p>	Defines a paragraph.	STF
<param>	Defines a parameter for an object.	STF
<pre>	Defines preformatted text.	STF
<q>	Defines a short quotation.	STF
<s>	Deprecated. Defines strikethrough text.	TF
<samp>	Defines sample computer code.	STF
<script>	Defines a client-side script.	STF
<select>	Defines a selection list (drop-down list).	STF
<small>	Defines small text.	STF
<span>	Defines a section in a document.	STF
<strike>	Deprecated. Defines strikethrough text.	TF
<strong>	Defines emphasized text.	STF
<style>	Defines style information for the document.	STF
<sub>	Defines subscript text.	STF
<sup>	Defines superscript text.	STF
<table>	Defines a table.	STF
<tbody>	Defines the body content of a table.	STF
<td>	Defines a cell in a table.	STF
<textarea>	Defines a multi-line text input control.	STF
<tfoot>	Defines the footer content of a table.	STF
<th>	Defines a header cell in a table.	STF
<thead>	Defines the header content of a table.	STF
<title>	Defines the title of the document.	STF
<tr>	Defines a row in a table.	STF
<tt>	Defines teletype text.	STF
<u>	Deprecated. Defines underlined text.	TF
<ul>	Defines an unordered list.	STF
<var>	Defines the variable part of a text.	STF
<xmp>	Deprecated. Defines preformatted text.

DTD: indicates in which XHTML 1.0 DTD the tag is allowed. S = Strict, T = Transitional, F = Frameset.

As shown in Fig. 2, the method is implemented through the following steps:

Step 1: Classify the HTML tags in the table above and set a weight coefficient for each class of tag. The classification is as follows (tag name, weight coefficient): (title, 0.8), (h1, 0.7), (h2, 0.65), (h3, 0.6), (h4, 0.5), (li, 0.5), (h5, 0.45), (h6, 0.4), (th, 0.4), (caption, 0.4), (em, 0.3), (strong, 0.3), (b, 0.3), ... In addition, the weight coefficient for content in other regions of the page is 0.2.

Step 2.1: When structuring the HTML page content, store the page content in a two-dimensional data table with the storage structure shown in the following table:

Step 2.2: Build the index databases tag by tag. Taking the table above as an example, the following index databases are built:

Title index database:

Diaoyudaoite: {3, 100}; (1, 1, 1); (2, 1, 2); (3, 1, 1)

First: {20, 100}; (2, 1, 11); (3, 1, 6)

Put on display: {5, 100}; (3, 1, 8)

……

H1 index database:

Diaoyudaoite: {3, 100}; (1, 1, 1); (2, 1, 2); (3, 1, 1)

First: {20, 100}; (2, 1, 11); (3, 1, 6)

Put on display: {5, 100}; (3, 1, 8)

……

H2 index database:

First: {10, 100}; (1, 1, 13)

Put on display: {2, 100}; (1, 1, 15)

……

Index databases for the other fields ...

Full-text field index database.

Each record of an index database consists of three parts: (1) the word or phrase; (2) the IDF data inside the braces, giving the number of documents in which the word occurs and the total number of documents; (3) the TF data inside the parentheses, expressed as tuples whose values are, in order, the document number, the number of occurrences, and the position of the first occurrence in the document.
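The three-part record layout could be represented in code roughly as follows. The `IndexEntry` class, its field names and the `occurrences` helper are illustrative assumptions; the sample entry reuses the figures listed for the word "First" in the title index database and contains only the postings that are actually listed there.

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    doc_count: int    # first value in braces: documents containing the word
    total_docs: int   # second value in braces: total number of documents
    postings: dict    # document number -> (occurrences, first position)

# "First: {20, 100}; (2, 1, 11); (3, 1, 6)" from the title index database.
title_index = {
    "first": IndexEntry(doc_count=20, total_docs=100,
                        postings={2: (1, 11), 3: (1, 6)}),
}

def occurrences(index, word, doc_no):
    """Number of times `word` occurs in this field of document `doc_no`."""
    entry = index.get(word)
    if entry is None:
        return 0
    return entry.postings.get(doc_no, (0, None))[0]

hits_in_doc_2 = occurrences(title_index, "first", 2)
```

Keeping the document counts next to the postings means both the TF and the IDF inputs for a word can be read from a single lookup.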

Step 3.1: Suppose the user's query is "Diaoyudaoite first put on display".

Step 3.2: Searching the full-text field yields the set of hit records; then calculate in turn the relevance of each word to each hit field of each hit record. For example, the relevance of the word "Diaoyudaoite" to the title field of record 1 is calculated as follows:

Step 3.3: First calculate the TF value. From the index, TF = 1 (occurrence) / 3 (total words) = 0.333.

Step 3.4: Then calculate the IDF value. From the index, IDF = log(100 / (1 + 3)) = 1.398.

Step 3.5: Calculate the weighted relevance of this field: TIW(title) = 0.333 × 1.398 × 0.8 = 0.3724.

Step 3.6: Repeat steps 3.3 to 3.5 to calculate the relevance of "Diaoyudaoite" to the other fields of record 1: TIW(h1) = 1.0 × log(100 / (1 + 3)) × 0.7 = 0.9786; TIW(h2) = 0. Finally, calculate the overall relevance of "Diaoyudaoite" to record 1: GC = (TIW(title) + TIW(h1)) / 2 = (0.3724 + 0.9786) / 2 = 0.6755.

Step 3.7: Repeat steps 3.3 to 3.6 to calculate the relevance of "put on display" and "first" to record 1: 0.0079 and 0.0123 respectively. Finally, calculate the relevance of the user's query to record 1 = 0.6755 + 0.0079 + 0.0123 = 0.6958.

Step 3.8: Repeat steps 3.3 to 3.7 to calculate the relevance of records 2 and 3 to the user's query: 0.7958 and 0.8741 respectively. The retrieval results are therefore ranked (3, 2, 1).
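The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the helper name `tiw` and the variable names are assumptions, and the per-record totals are taken directly from the text rather than recomputed.

```python
import math

def tiw(tf, d_j, D, W_k):
    """TIW = TF x log10(D / (1 + d_j)) x W_k, as used in steps 3.3 to 3.5."""
    return tf * math.log10(D / (1 + d_j)) * W_k

# First query word against record 1: it occurs in 3 of 100 documents.
tiw_title = tiw(1 / 3, 3, 100, 0.8)   # title field, weight 0.8
tiw_h1 = tiw(1.0, 3, 100, 0.7)        # h1 field, weight 0.7
gc_first_word = (tiw_title + tiw_h1) / 2

# Query-to-record relevance values as stated in steps 3.7 and 3.8.
totals = {1: 0.6958, 2: 0.7958, 3: 0.8741}
ranking = sorted(totals, key=totals.get, reverse=True)
```

With unrounded intermediate values the results agree with the figures in the text to about three decimal places; the small differences come from the text rounding TF to 0.333 and IDF to 1.398 before multiplying.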

As can be seen, the invention calculates well the relevance between each relevant document and the user's query, so that the retrieval results reflect the structural features of the HTML while also letting the user obtain the desired results more accurately, giving a better user experience.

The above are embodiments of the invention and do not limit its scope of protection. It should be understood that those skilled in the art may make improvements and variations without departing from the principles of the invention, for example applying it to other types of document (including but not limited to PDF, doc/docx, xls/xlsx and ppt/pptx documents), changing the classification of tags, or changing the weight coefficient values of the tag classes; such improvements and variations are also considered to fall within the scope of protection of the invention.

Claims

1. A method for improving the accuracy of retrieval results, characterized in that it is implemented in the following steps:

(1) classifying the HTML tags and setting a weight coefficient for each class of tag;

(2) structuring the HTML page content according to the classification of step (1) to form structured data, and generating index data for each class of tag;

(3) according to the index data generated in step (2) and the weight coefficients obtained in step (1), calculating with a weighting algorithm the relevance between the several search terms entered by the user and each HTML page, the relevance value reflecting the structural features of the HTML document, and then sorting the retrieval results by relevance and presenting them to the user, thereby improving the accuracy of the retrieval results;

the steps of classifying the HTML tags and setting the weight coefficient of each class of tag in step (1) being:

(11) classifying the tags according to the HTML specification;

(12) setting a weight coefficient for each tag according to the meaning and importance of the tags classified in step (11);

(13) outputting the tag weighting results in the form (tag name, weight value);

the method of structuring the HTML page content and forming the index data in step (2) being, specifically:

(21) analyzing the HTML page content and, according to the tag classification set in step (1), converting the HTML page content into structured data; the structured data is represented as a two-dimensional table whose columns are the tag classes, each HTML page being converted into one record of the table; if a page contains several pieces of data under the same class of tag, they are merged into a single field and delimited by a separator, each field representing one classified tag;

(22) according to the tag classification, indexing every row of the structured data field by field, the index data of each field being called a field index database; the field index database records, for each word, in which records the word occurs in this field, the number of occurrences and the positions; the field index database also records the number of records in which each word occurs; the field index databases thus built together constitute the full-text field index database;

the relevance calculation method in step (3) being:

the user enters several search terms in one query; the full-text field index database is first searched to obtain the set of hit records, and the relevance of each search term to each hit record is then calculated, as follows:

(3.1) in the index data, looking up the number of times a word occurs in a given field of the record, denoted n_i, and the total number of words in that field of the record, denoted N_i, then calculating the TF value with the formula TF_i = n_i / N_i;

(3.2) in the index data, looking up the number of records in which the word occurs, denoted d_j, and the total number of records, denoted D, then calculating the IDF value with the formula IDF_j = log(D / (1 + d_j));

(3.3) looking up the weight coefficient set for this field in step (1), denoted W_k, then calculating the relevance TIW_x of the word to one hit field of a hit record with the formula TIW_x = TF_i × IDF_j × W_k;

(3.4) repeating steps (3.1) to (3.3) to calculate the relevance of the search term to the other hit fields of the record, then calculating the relevance of the search term to the record with the formula GC = (1/m) Σ_{x=1}^{m} TIW_x;

(3.5) for each search term entered by the user, repeating step (3.4) to calculate the relevance GC_y of each search term to each hit record, then calculating the relevance of all search terms to one record with the formula G = Σ_{y=1}^{n} ( (1/m) Σ_{x=1}^{m} TIW_x ), where x indexes the fields hit by a search term in the record and runs from 1 to m, and y indexes the search terms entered in one query and runs from 1 to n; that is, the relevance of each hit record to the user's query is calculated with the formula G = Σ_{y=1}^{n} GC_y.

(3.5) each term to user's input, circulation step (3.4), calculates the degree of association GC of each term and each hit record_y, then pass through formulaCalculating the degree of association during all terms record at one, x is the field quantity of hit during term records at, and value is 1 to m, y is the term quantity once inputted, and value is 1 to n；Namely according to formulaCalculate the degree of association of each hit record and user input content.