CN106168982A - Data retrieval method for particular topic

Info

Publication number: CN106168982A
Application number: CN201610630951.6A
Authority: CN
Inventors: 赖真霖; 文君
Original assignee: Chengdu Sixiang Lianchuang Technology Co Ltd
Current assignee: Chengdu Sixiang Lianchuang Technology Co Ltd
Priority date: 2016-08-03
Filing date: 2016-08-03
Publication date: 2016-11-30

Abstract

The invention provides a data retrieval method for a particular topic. The method includes: based on user reviews of commodities, computing the additional attributes of each commodity by score calculation, and realizing semantic-level commodity search by searching over those additional attributes. The proposed method overcomes the bottleneck of string-matching search, improves the accuracy of search results, achieves intelligent and efficient search, and can adapt to the needs of many kinds of business.

Description

Data retrieval method for particular topic

Technical field

The present invention relates to data retrieval, and in particular to a data retrieval method for a particular topic.

Background technology

As users demand ever more accurate data from the Internet, professional (vertical) search engines have emerged to meet this demand. Such a search engine integrates the information of a specific domain according to the many types of data involved, for example commodity search, financial search, and video search. Compared with general-purpose search engines, professional search engines have richer search rules and are more accurate and more specialized. Existing vertical search technology and products nevertheless have technical shortcomings. First, existing e-commerce search engines generally rank results by the combined score of the search terms in a document; if results must also be ranked by visit volume, the whole result set is sorted a second time, which disturbs the first ranking and greatly affects the user experience. Second, existing search engines generally work by character matching of search words; they can only perform simple string matching, cannot truly understand the meaning of the objects being searched, and rely on human subjective judgment for refinement. Furthermore, as web technology changes rapidly, the regular expressions of an e-commerce search engine have to be rewritten, which clearly cannot keep up with real-time processing of massive web-wide data.

Summary of the invention

To solve the above problems of the prior art, the present invention proposes a data retrieval method for a particular topic, including:

based on user reviews of commodities, computing the additional attributes of each commodity by score calculation, and realizing semantic-level commodity search by searching over the additional attributes.

Preferably, the method further includes:

automatically extracting the attribute values of commodities from user reviews, and searching by attribute value to obtain commodities of a certain type that fit a specific scenario, where the automatic extraction includes:

(1). Structure the commodity reviews;

(2). Segment into words the content of all comments made by the same user on the same commodity, filter out predefined stop words after segmentation, then for repeated words keep the occurrence with the latest comment time, finally obtaining that user's attribute values for that commodity;

(3). Following step 2, compute the attribute values of all users for the same commodity, and aggregate identical attribute values;

(4). Following steps 2 and 3, obtain the attribute values derived from all users' comments on all commodities;

The attribute values are then classified; the commodity types obtained serve as the dimensions of the attributes; attribute values whose repetition count exceeds a predefined threshold become the values within a dimension; the respective score weights are computed through the mutual dependence between commodities and users, that is, all reviews are analyzed to obtain the list of commodities each user is interested in; from the commodity side, attribute values are obtained from reviews and scored, and the list of commodities under each attribute value is computed from the attribute values;

Define the dimension set D; the dimension value set V; the list of reviewed commodities SU(p₁, p₂, ..., p_n); the list of users who took part in reviewing UU(u₁, u₂, ..., u_m); the dimension list of a commodity DU{d₁, d₂, ..., d_k}; the value list of any attribute in DU, VU{v₁, v₂, ..., v_o}; the attribute list SMU(pm₁, pm₂, ..., pm_x), holding the values of the elements of SU; and the attribute classification list UMU(um₁, um₂, ..., um_y), holding the values of the elements of UU. Assume a dimension A{a₁, a₂, ..., a_n}, a user set U{U₁, U₂, ..., U_m}, and a commodity set P{P₁, P₂, ..., P_k}.

(1) The commodity score is computed jointly from the number of reviewing users and the weights of those users, as follows:

$$SP(\mathrm{product} \mid a_x) = \sum_{i=1}^{n} \left( U_{i\,\{U_i.A = a_x\}} / \mathrm{cnt}_M / \mathrm{cnt}_v \times \theta \right) \times \mathrm{cnt}_{vx} / \mathrm{sum}$$

Here a_x is one value of dimension A, and SP(product | a_x) is the score of the commodity product for value a_x in dimension A;

the condition U_i.A = a_x selects all users who rated product as a_x in dimension A;

cnt_M is the total number of attributes; sum is the total number of reviews of product in dimension A by all users; cnt_vx is the total number of reviews by all users that give product the value a_x in dimension A; cnt_vx / sum is the weight coefficient of value a_x for product in dimension A over all users; cnt_v is the number of reviews the user has made on this value of this dimension; θ is a weight-decay factor determined by the latest and earliest times at which the user reviewed product in dimension A;

(2) The score of a user is computed from the attribute scores of the commodities. Assume the set of the user's classifications of commodities is DV(D_i V_j | D_i ∈ DU, V_j ∈ VU). Define pdv as the score of commodity p on value v of dimension d, and pdv' = pdv / cnt_pdv, where cnt_pdv is the number of users who voted for value v in dimension d of commodity p. The user score SP(U_u) is computed as follows:

$$SP(U_u) = \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{o} \frac{P_i D_j V_k}{\mathrm{cnt}_{P_i D_j V_k}}$$

(3) Build the weight equation system.

From the score computations SP(product | a_x) and SP(U_u) for commodity attributes and users, set up a system of linear equations in M+N*V unknowns, where N is the total number of commodities, M is the total number of users, and V is the number of elements in the value set of each dimension. Solve the system iteratively to obtain the weight of each attribute of each commodity and the weight of each user.

Compared with the prior art, the present invention has the following advantages:

The invention proposes a data retrieval method for a particular topic that overcomes the bottleneck of string-matching search, improves the accuracy of search results, achieves intelligent and efficient search, and can adapt to the needs of many kinds of e-commerce business.

Brief description of the drawings

Fig. 1 is a flowchart of the data retrieval method for a particular topic according to an embodiment of the present invention.

Detailed description of the invention

A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses many alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may also be practiced according to the claims without some or all of them.

One aspect of the present invention provides a data retrieval method for a particular topic. Fig. 1 is a flowchart of the data retrieval method for a particular topic according to an embodiment of the present invention.

The invention implements a search engine architecture for a professional domain. It uses hierarchical ranking and builds additional attributes through score calculation over a two-dimensional space in order to perform deep intelligent search; it establishes a multi-dimensional constraint data extraction scheme to extract page content intelligently; it generates and updates search word expansion content; and, especially for search words in very long text, it highlights search words based on hash lookup.

The professional-domain search engine architecture includes the following modules. The acquisition module is responsible for collecting and receiving data, saving it under a specific folder, and providing web pages. The data storage module is responsible for organizing the received data into the data format required for indexing. It has self-recovery and rollback functions. A rollback cannot be undone: once the store has been rolled back to a specific date, at the next update the data before that date is retained and the data after that date is deleted. The data indexing module is responsible for building the index from the data; the index also has a backup mechanism. The search calling interface module publishes the search engine as an HTTP service. The logging and monitoring module monitors the running state of each of the above subsystems. The data analysis module analyzes the data-service part of web page content. The user modification module modifies search results from outside, including adding, deleting, and changing results and adjusting the ranking. The data search module is responsible for data search and automatically picks up the latest data from the index system.

Based on the salient features of a specific website, the data analysis module first identifies and finds all web pages. Then, using the semantics that URLs on a web page carry for the concept of a search, it compares the sizes of the entity sets contained in the web page and in each page its URLs point to, and thereby finds the relevant URLs of that web page. Finally, it maps the link text of each URL onto the entities contained in the page the URL points to, adding it to the attribute set of those entities. To avoid unnecessary repetition while discovering hidden attributes, a pruning mechanism is set up on the search B-tree. Each node of the search B-tree represents one web page; the edge from a parent node to a leaf node represents the subordinate relation between the corresponding web pages, and the value on the edge is the corresponding hidden attribute. All hidden attributes on the path from the root to a leaf node make up the hidden attribute set of that leaf node. Lower-level leaf nodes are first generated depth-first according to the semantics of the subordinate URLs. Then, for each newly generated leaf node, its hidden attribute set is compared with those of the existing leaf nodes; if an identical set exists, the new leaf node is discarded. This completes the crawling of attributes. When crawling ends, all object pages are obtained without repetition, and all attribute information is available to the page information extraction procedure.
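The pruning rule described above can be illustrated with a minimal sketch. It assumes the site is already available as a tree whose edges carry a hidden attribute; the `SiteTree` structure, the page names, and the `attribute=value` strings are hypothetical and serve only as an example.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical site tree: page -> list of (hidden attribute on the edge, child page).
SiteTree = Dict[str, List[Tuple[str, str]]]

def crawl_unique_leaves(tree: SiteTree, root: str) -> Dict[FrozenSet[str], str]:
    """Depth-first walk that keeps one leaf page per distinct hidden-attribute set."""
    kept: Dict[FrozenSet[str], str] = {}

    def visit(page: str, path_attrs: FrozenSet[str]) -> None:
        children = tree.get(page, [])
        if not children:
            # Leaf: keep it only if no earlier leaf has the same attribute set.
            kept.setdefault(path_attrs, page)
            return
        for attr, child in children:
            visit(child, path_attrs | {attr})

    visit(root, frozenset())
    return kept

tree = {
    "home": [("brand=acme", "acme"), ("color=red", "red")],
    "acme": [("color=red", "item-1")],
    "red": [("brand=acme", "item-1-dup"), ("brand=zen", "item-2")],
}
leaves = crawl_unique_leaves(tree, "home")
```

Here `item-1-dup` is reached through the same two attributes as `item-1`, only in a different order, so it is dropped and the crawl returns two distinct leaves.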

The data analysis module of the invention divides the web pages of an e-commerce website into three kinds: result pages, object pages, and other pages. One search corresponds to a series of result pages, and an object page contains the information of one independent entity, such as a commodity. Pages that belong to neither of these two kinds are classified as other pages. Each entity is described by a set of attributes, which form the conditions of a search. Each entity has one and only one object page. The e-commerce website is described by an undirected graph: P denotes the vertex set, in which each vertex represents a page, and L is the edge set, in which each edge represents a URL from one page to another. R denotes the set of all result pages, O the set of all object pages, and Q the set of all searches. All attributes among searches, result pages, and object pages make up an attribute space, which is clustered on the basis of the link structure between pages.

The goal is to find, for each entity, the attribute information hidden in searches. This requires finding the full set of searches, the attribute-value key-value pairs that correspond to each search, and the entities that satisfy each search. Let q be a search; q is represented by the set of result pages δ(q) consistent with the search. The specific steps are:

1. Crawl all pages of the website, identify each page by its URL, and extract all URLs from each page.

2. Identify the type of each page: result page, object page, or other page. For page type identification, object pages are recognized with an SVM-based page classification method that relies on the similarity of the HTML structure of object pages within the same website. A greedy rule is then applied: any non-object page that contains a URL pointing to an object page is classified as a result page.

3. Cluster the result pages into sets by search, with one set per search. For each page p in R, let t(p) be the set of all result pages that p points to. The distance between two pages is given by the symmetric difference of their sets t(p). A distance threshold d is introduced; when the distance is less than d, the two pages are taken to belong to the same search.

4. Find the relations between searches. Check the URLs of every page of each result page set s. If a search URL points to a page in another result page set r, check the subset relation between the entity pages w_s and w_r associated with the queries contained in s and r respectively. If the subset relation holds, extract the URL between s and r as an attribute, and create an attribute key-value pair that uses its hypertext as the attribute value and the enclosing HTML element as the attribute name.

5. Extract the union of the attribute-value key-value pairs that satisfy all searches, as the hidden attributes of the entity.
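Steps 2 and 3 above can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the set of object pages is taken as given (the SVM classifier is not reproduced), the distance is taken to be the size of the symmetric difference, and the clustering is a simple greedy pass in which a page joins the first cluster containing a page closer than `d`. All page names are hypothetical.

```python
from typing import Dict, List, Set

def classify_pages(links: Dict[str, Set[str]], object_pages: Set[str]) -> Dict[str, str]:
    """Greedy typing: a non-object page that links to any object page is a result page."""
    kinds: Dict[str, str] = {}
    for page, targets in links.items():
        if page in object_pages:
            kinds[page] = "object"
        elif targets & object_pages:
            kinds[page] = "result"
        else:
            kinds[page] = "other"
    return kinds

def cluster_result_pages(t: Dict[str, Set[str]], d: int) -> List[Set[str]]:
    """Group result pages whose link sets t(p) differ by fewer than d pages."""
    clusters: List[Set[str]] = []
    for page in sorted(t):
        for cluster in clusters:
            # Distance between two pages = size of the symmetric difference of t(p).
            if any(len(t[page] ^ t[other]) < d for other in cluster):
                cluster.add(page)
                break
        else:
            clusters.append({page})
    return clusters

links = {
    "r1": {"r2", "r3", "o1"}, "r2": {"r1", "r3", "o2"}, "r3": {"r1", "r2", "o1"},
    "r4": {"r5", "o3"}, "r5": {"r4", "o3"},
    "o1": set(), "o2": set(), "o3": set(), "about": {"r1"},
}
kinds = classify_pages(links, {"o1", "o2", "o3"})
result_pages = {p for p, k in kinds.items() if k == "result"}
t = {p: links[p] & result_pages for p in result_pages}
clusters = cluster_result_pages(t, d=3)
```

With `d=3` the pagination pages `r1` to `r3` fall into one search and `r4`, `r5` into another, while `about` is classified as an other page because it links to no object page.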

The data search module of the search engine architecture includes: a ranking module, a search module based on attribute weights, a search word expansion module, a web page intelligent processing module, and a search word highlighting module. The ranking module performs hierarchical ranking: each level holds several sorting logics of equal weight, and the logic of each layer performs a sort within its level. Real-time visit volume is also used as a reference for ranking. The overall flow comprises sorting logic grading, sorting logic integration, ranking result block division, ranking result integration, and ranking result set storage. Sorting logic is graded by priority, in the form of a matrix, according to the actual needs of the search business. Ranking results are divided by level, with each sorting logic layer corresponding to one ranking result set. Within a level, results are sorted by the sorting logic of that level, and real-time visit volume data is then used as a sorting factor for a secondary sort inside the level. After integration, a suitable subset of ranking results is taken from the ordered ranking result layers and returned to the user.

The search module based on attribute weights computes the additional attributes of each commodity from user reviews by score calculation, and handles semantic commodity search by searching over attribute weights. It covers dynamic generation of attribute values, attribute score calculation, multi-attribute ranking of commodities, and commodity attribute search.

The search word expansion module, after the user has typed part of a search word, suggests a list of search words the user may need, and the user searches by selecting any search word in that list. The invention divides web page objects and stores them in memory, and generates the search word expansion list by traversing and dividing the web pages, for use in searching and updating search words.

The web page intelligent processing module uses ordinary pages as a training set to determine the set of constraint rules for a given type of page, and then applies these constraint rules directly to extract the corresponding information. Node division rules can also be adjusted manually. A node division rule describes the most basic attributes of a node from different aspects, and pages of the same type need only one class of node division rules, which meets the needs of existing search engines.

The search word highlighting module is a general method for displaying search word information in content, designed for the problem of displaying search words in long text. Using a purpose-built in-memory data structure, the position information of the search words obtained by parsing the content is first stored in memory as an inverted index. Hash lookup on this position inverted index then speeds up the loading of search word information and locates the positions of a specified search word, which determines the highlighting range. The module comprises search word parsing, content parsing, search word information loading, display content integration, and a display unit.
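A minimal sketch of this idea is shown below, using a Python dictionary as the hash-based position index. It assumes whitespace-delimited text tokenized with a regular expression (Chinese text would need a word segmenter instead), and the `**` marker and function names are illustrative choices, not part of the source.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

Span = Tuple[int, int]

def build_position_index(content: str) -> Dict[str, List[Span]]:
    """In-memory inverted index: lower-cased word -> list of (start, end) offsets."""
    index: Dict[str, List[Span]] = defaultdict(list)
    for match in re.finditer(r"\w+", content):
        index[match.group().lower()].append(match.span())
    return index

def highlight(content: str, index: Dict[str, List[Span]], words: List[str], mark: str = "**") -> str:
    """Wrap every occurrence of the given words, located by hash lookup, in `mark`."""
    spans = sorted({span for w in words for span in index.get(w.lower(), [])})
    out, cursor = [], 0
    for start, end in spans:
        out.append(content[cursor:start])
        out.append(f"{mark}{content[start:end]}{mark}")
        cursor = end
    out.append(content[cursor:])
    return "".join(out)

text = "Red shoes and red socks"
index = build_position_index(text)
result = highlight(text, index, ["red"])
```

The content is parsed once; each later highlight request costs one dictionary lookup per search word instead of a scan of the whole text.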

The ranking module specifically comprises units for sorting logic grading, sorting logic integration, ranking result block division, ranking result integration, and ranking result set storage; the invention describes these units in detail in further embodiments.

The logic grading unit grades the sorting logic according to the actual needs of the user, forming a matrix model of sorting logic. The elements of one row of the matrix represent several logics of the same level, different rows represent different levels, and the weights of different layers differ. Assume an N*M matrix made up of N sorting logic levels, each consisting of M sorting logics, from which some levels and some of the logics within those levels are selected. In the chosen sorting logic grading matrix, the first P rows are set as search logic layers whose priority increases or decreases in order, and a given search logic layer takes a subset of 1 to M logics as the sorting logic of that layer. Each logic is mapped to a number, which converts the search logic matrix into a numeric matrix.

The sorting logic integration unit integrates the sorting logic in the M*N hierarchical sorting logic matrix into a set of searches and scans all documents to complete all searches, producing several result sets that are each ordered within their level. The ranking result block division unit partitions the results according to the hierarchy model, one block per layer, generating M data blocks, that is, sorted data layers; each data block forms one data region.
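One way to picture the sorted data layers is the sketch below. It assumes each level is described by a membership test and an in-level sort key, and that a document belongs to the first level it matches; these `Level` pairs and the document fields are hypothetical, since the text does not specify how a sorting logic is encoded.

```python
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, float]
# Hypothetical encoding of one level: (membership test, in-level sort key).
Level = Tuple[Callable[[Doc], bool], Callable[[Doc], float]]

def build_sorted_layers(docs: List[Doc], levels: List[Level]) -> List[List[Doc]]:
    """Assign each document to the first level it matches, then sort inside each level."""
    layers: List[List[Doc]] = [[] for _ in levels]
    for doc in docs:
        for i, (matches, _key) in enumerate(levels):
            if matches(doc):
                layers[i].append(doc)
                break
    for layer, (_matches, key) in zip(layers, levels):
        layer.sort(key=key, reverse=True)  # best score first within the level
    return layers

levels: List[Level] = [
    (lambda d: d["title_hit"] == 1, lambda d: d["score"]),  # higher-priority level
    (lambda d: True, lambda d: d["score"]),                 # catch-all level
]
docs = [
    {"id": 1, "title_hit": 0, "score": 9.0},
    {"id": 2, "title_hit": 1, "score": 2.0},
    {"id": 3, "title_hit": 1, "score": 5.0},
]
layers = build_sorted_layers(docs, levels)
layer_ids = [[d["id"] for d in layer] for layer in layers]
```

Document 1 has the highest score but stays in the second block, because level membership takes precedence over the in-level order.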

The ranking result integration unit takes a certain number of result subsets from each data block according to the parameter passed in, and then merges them into one complete result set. The parameter passed in is a region value, and the integration flow is as follows:

1. From the start and end addresses of the region, determine the sorted data layers that hold the search results to be returned;

2. Determine whether the start and end addresses lie in the same sorted data layer; otherwise go to step 8;

3. Take the data subset at the bottom of the first sorted data layer;

4. Determine whether the number of sorted data layers is greater than 2; if it is greater than 2, go to step 6;

5. Take all result subsets of the middle sorted data layers;

6. Take the data subset at the top of the last sorted data layer;

7. Merge the result sets taken out, in order;

8. Return the result set.
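The net effect of the flow above can be sketched as a range query over the concatenated layers. The sketch assumes the region is a half-open interval of global positions `[start, end)` and reads the branching by its evident intent: the tail of the first layer, every middle layer in full, and the head of the last layer. The function name and sample data are illustrative.

```python
from typing import List, Sequence

def integrate_results(layers: Sequence[Sequence[str]], start: int, end: int) -> List[str]:
    """Return results [start, end) from layers treated as one concatenated ranking."""
    merged: List[str] = []
    offset = 0  # global position of the first element of the current layer
    for layer in layers:
        lo = max(start - offset, 0)
        hi = min(end - offset, len(layer))
        if lo < hi:  # the requested region overlaps this layer
            merged.extend(layer[lo:hi])
        offset += len(layer)
        if offset >= end:
            break  # the region ends inside this layer
    return merged

layers = [["a", "b", "c"], ["d", "e"], ["f", "g", "h"]]
page = integrate_results(layers, 2, 7)
same_layer = integrate_results(layers, 0, 2)
```

A region that falls inside a single layer is served from that layer alone, which corresponds to the short path through the numbered steps.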

Sorting within a level by the sorting logic of that level, and using real-time visit volume data as a sorting factor for a secondary sort inside the level, further includes:

The real-time sorting visit-volume matrix corresponds to the hierarchical sorting logic matrix, and each level of logic is associated with several external sorting visit volumes that serve as the reference for real-time sorting. The secondary sort is performed according to the values of the real-time sorting visit-volume matrix and includes: locating, from the parameters, the data block and the region within the block to be sorted; fetching the values of the real-time ranking factors from the database; and sorting that region;
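The secondary sort can be sketched as re-ranking only a located region of one data block. In this sketch the database lookup is replaced by a plain dictionary of visit counts, and the names `block`, `lo`, `hi`, and `visits` are assumptions made for illustration.

```python
from typing import Dict, List

def secondary_sort(block: List[str], lo: int, hi: int, visits: Dict[str, int]) -> List[str]:
    """Re-rank only block[lo:hi] by real-time visit count, highest first."""
    # sorted() is stable, so documents with equal visits keep their first-pass order.
    region = sorted(block[lo:hi], key=lambda doc: visits.get(doc, 0), reverse=True)
    return block[:lo] + region + block[hi:]

block = ["d1", "d2", "d3", "d4"]
visits = {"d2": 5, "d3": 9, "d4": 100}
reranked = secondary_sort(block, 1, 3, visits)
```

Because only the region inside one block is reordered, a heavily visited document such as `d4` cannot jump ahead of results outside the region, so the first-pass ordering between levels is left intact.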

For the search module based on attribute weights, the invention automatically extracts the attribute values of commodities from user reviews and searches by attribute value to obtain commodities of a certain type that fit a specific scenario, approaching the goal of semantic search. The automatic extraction includes:

1. Structure the commodity reviews;

2. Segment into words the content of all comments made by the same user on the same commodity, filter out predefined stop words after segmentation, then for repeated words keep the occurrence with the latest comment time, finally obtaining that user's attribute values for that commodity;

3. Following step 2, compute the attribute values of all users for the same commodity, and aggregate identical attribute values;

4. Following steps 2 and 3, obtain the attribute values derived from all users' comments on all commodities.

After these steps, each commodity has a number of user-defined attributes. The attribute values are then classified. The commodity types obtained serve as the dimensions of the attributes, and attribute values whose repetition count exceeds a predefined threshold become the values within a dimension.
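The extraction steps and the threshold rule above can be sketched as follows. The `Review` tuple layout is an assumed structure, whitespace splitting stands in for a real word segmenter, and the repetition count is taken to be the number of users who used a word; none of these details are fixed by the text.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Review = Tuple[str, str, int, str]  # (user, product, timestamp, text): an assumed structure
UserKey = Tuple[str, str]

def user_attributes(reviews: Iterable[Review], stop_words: Set[str]) -> Dict[UserKey, Dict[str, int]]:
    """Per (user, product): word -> time of the latest comment that contains it."""
    latest: Dict[UserKey, Dict[str, int]] = {}
    for user, product, ts, text in reviews:
        words = latest.setdefault((user, product), {})
        for word in text.lower().split():  # stand-in for a real word segmenter
            if word not in stop_words:
                words[word] = max(ts, words.get(word, ts))
    return latest

def product_attributes(latest: Dict[UserKey, Dict[str, int]], threshold: int) -> Dict[str, List[str]]:
    """Aggregate across users and keep values repeated more than `threshold` times."""
    counts: Dict[str, Counter] = {}
    for (_user, product), words in latest.items():
        counts.setdefault(product, Counter()).update(words.keys())
    return {p: sorted(w for w, n in c.items() if n > threshold) for p, c in counts.items()}

reviews = [
    ("u1", "p1", 1, "the fabric is warm"),
    ("u1", "p1", 5, "warm and light"),
    ("u2", "p1", 3, "very warm"),
    ("u2", "p1", 4, "light"),
    ("u3", "p1", 2, "cheap"),
]
latest = user_attributes(reviews, {"the", "is", "and", "very"})
attributes = product_attributes(latest, threshold=1)
```

In this example only "warm" and "light" are used by more than one user, so they become the attribute values of `p1`, while "fabric" and "cheap" fall below the threshold.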

The respective score weights are then computed through the mutual dependence between commodities and users; that is, all reviews are analyzed to obtain the list of commodities each user is interested in. From the commodity side, attribute values are obtained from reviews and scored, and the list of commodities under each attribute value is computed from the attribute values, reflecting how strongly users support the commodity under that attribute.

Define the dimension set D; the dimension value set V; the list of reviewed commodities SU(p₁, p₂, ..., p_n); the list of users who took part in reviewing UU(u₁, u₂, ..., u_m); the dimension list of a commodity DU{d₁, d₂, ..., d_k}; the value list of any attribute in DU, VU{v₁, v₂, ..., v_o}; the attribute list SMU(pm₁, pm₂, ..., pm_x), holding the values of the elements of SU; and the attribute classification list UMU(um₁, um₂, ..., um_y), holding the values of the elements of UU.

Assume a dimension A{a₁, a₂, ..., a_n}, a user set U{U₁, U₂, ..., U_m}, and a commodity set P{P₁, P₂, ..., P_k}.

(1) The commodity score is computed jointly from the number of reviewing users and the weights of those users, as follows:

$$SP(\mathrm{product} \mid a_x) = \sum_{i=1}^{n} \left( U_{i\,\{U_i.A = a_x\}} / \mathrm{cnt}_M / \mathrm{cnt}_v \times \theta \right) \times \mathrm{cnt}_{vx} / \mathrm{sum}$$

The condition U_i.A = a_x selects all users who rated product as a_x in dimension A.

cnt_M is the total number of attributes; sum is the total number of reviews of product in dimension A by all users; cnt_vx is the total number of reviews by all users that give product the value a_x in dimension A; cnt_vx / sum is the weight coefficient of value a_x for product in dimension A over all users; cnt_v is the number of reviews the user has made on this value of this dimension; θ is a weight-decay factor determined by the latest and earliest times at which the user reviewed product in dimension A.

(2) The score of a user is computed from the attribute scores of the commodities:

Assume the set of the user's classifications of commodities is DV(D_i V_j | D_i ∈ DU, V_j ∈ VU). Define pdv as the score of commodity p on value v of dimension d, and pdv' = pdv / cnt_pdv, where cnt_pdv is the number of users who voted for value v in dimension d of commodity p. The user score SP(U_u) is computed as follows:

$$SP(U_u) = \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{o} \frac{P_i D_j V_k}{\mathrm{cnt}_{P_i D_j V_k}}$$

(3) Build the weight equation system.
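As a rough illustration of the commodity score in item (1), the sketch below evaluates SP(product | a_x) for one commodity and one dimension value. The `Vote` record and its field names are hypothetical, and because the text gives no closed form for θ, the decay factor is simply passed in as a number. The iterative solution of the full equation system is not shown.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Vote:
    user_weight: float  # U_i: current weight of a user who rated the product as a_x
    cnt_v: int          # that user's number of reviews on this value of this dimension
    theta: float        # decay factor; its exact form is not given, so it is an input

def product_score(votes: List[Vote], cnt_m: int, cnt_vx: int, total: int) -> float:
    """SP(product | a_x) = sum_i(U_i / cnt_M / cnt_v * theta) * cnt_vx / sum."""
    inner = sum(v.user_weight / cnt_m / v.cnt_v * v.theta for v in votes)
    return inner * cnt_vx / total

votes = [Vote(user_weight=1.0, cnt_v=2, theta=1.0),
         Vote(user_weight=0.5, cnt_v=1, theta=0.8)]
score = product_score(votes, cnt_m=4, cnt_vx=3, total=6)
```

With these sample numbers the two users contribute 0.125 and 0.1, and the coefficient cnt_vx / sum = 3 / 6 halves their total.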

The search word expansion module first generates web page objects, each corresponding to one record in the search engine's web page set. An object has three parts: the data ID, which is the reference address of the record; the data value, which is the actual data; and the sorting attribute list, which holds the multidimensional list of sorting attribute values corresponding to the graded sorting logic, reduced to a one-dimensional sorting attribute list. These sorting attribute values are stored in an array from high to low priority level, and two sorting attribute lists are compared in priority order. The web page object array is a common data pool: each record in it is referenced by data ID, and a hash table of web page objects is maintained with the data value of each web page object as key.

A search word object is then generated with the following elements: the search word, the data ID object list, and the data ID object candidate list. The search word is obtained by dividing the data value attribute of a web page object in the common data pool; each data value is divided, by increasing length, into multiple search words. A data ID object consists of two elements, the web page ID and the list of sorting data values. The data ID object list is the list of valid data ID objects for a search word, and the data ID object candidate list is used to replenish the data ID object list.

Search word expansion content is generated while traversing the web pages: each web page is divided one by one according to the rule of increasing search word length, the search words produced are converted during division to form the search word list, and each search word is stored in a hash table as a key. In detail:

1. Store the web pages in memory as required by the in-memory structure, and traverse the list of web pages to be searched;

2. Convert and divide each web page to form the search word list;

3. Using the sorting attribute value list of each search word, decide whether the corresponding web page ID is inserted into the data ID list or into the data ID candidate list;

4. Generate the search word object hash table for the web pages; this hash table contains the filled data ID lists and data ID candidate lists.

The division of each record is the core of this process and proceeds as follows:

Convert the data value of the web page object into a set of data values of several types; divide each data value in the set by increasing search word length; look up the search word hash table using each search word in the divided list as key; if the lookup succeeds, go to step 3 above; otherwise create a search word object according to the in-memory data structure and add it to the hash table.
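The division and hash table construction can be sketched as below. Several details are assumptions made for illustration: search words are taken to be prefixes of increasing length, each record has a single numeric rank in place of the sorting attribute list, and the `top_k` best-ranked IDs stay in the main list while the rest go to the candidate list. The class and function names are hypothetical.

```python
from typing import Dict, List

class SearchWord:
    """Search word object: the word plus its data ID list and candidate list."""
    def __init__(self, word: str) -> None:
        self.word = word
        self.data_ids: List[int] = []
        self.candidates: List[int] = []

def build_expansion_table(records: Dict[int, str], rank: Dict[int, int],
                          top_k: int = 2) -> Dict[str, SearchWord]:
    table: Dict[str, SearchWord] = {}
    for data_id, value in records.items():
        # Divide the data value into search words of increasing length (prefixes).
        for length in range(1, len(value) + 1):
            word = value[:length]
            # Hash lookup; create the search word object when the key is new.
            entry = table.setdefault(word, SearchWord(word))
            entry.data_ids.append(data_id)
            # Best-ranked IDs stay in the main list; overflow moves to the candidates.
            entry.data_ids.sort(key=lambda i: rank[i], reverse=True)
            entry.candidates.extend(entry.data_ids[top_k:])
            del entry.data_ids[top_k:]
    return table

records = {1: "red shoes", 2: "red socks", 3: "red hat"}
rank = {1: 10, 2: 30, 3: 20}
table = build_expansion_table(records, rank, top_k=2)
```

Looking up the typed prefix in `table` then returns the suggestion entry in a single hash access, and the candidate list is available to refill the main list if an ID is later removed.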

The detailed steps by which the web page intelligent processing module generates the information constraint sets, and the optimization of that process, include:

1. First parse a sample into a set of document object tree nodes:

Spot_U{Spot₁, Spot₂, ..., Spot_N}, where Spot_N ∈ document object tree nodes;

Divide the information into dimensions by field or type:

Info_dim(Dim₁, Dim₂, ..., Dim_M), where Dim_M denotes the M-th field of the information;

Then express the information nodes corresponding to these dimensions as the set

U_Info{SpotX₁, SpotX₂, SpotX₃, ..., SpotX_m}, SpotX_i ∈ Spot_U;

The set U_Info is the final result node set of the information extraction;

2. Analyze the attributes of every node in the set Spot_U in terms of node distribution region, node presentation form, and intra-node organization rules, and partition the set into equivalence classes according to the differences in these attributes;

3. Compute the constraints of each node in the set U_Info itself: record the value each node in U_Info takes on the attribute defined by each division, then compute, for each node in the node set corresponding to the dimensions Info_dim, which of the sets from step 2 it appears in, obtaining the constraint set of the nodes of U_Info;

4. Compute the constraints between dimensions: take any two nodes in U_Info, choose an attribute of a node, and compute each binary distance relation defined on this attribute:

|Dim(i)_Attr - Dim(j)_Attr| < σ

where i and j denote any two dimensions, Attr denotes each attribute of a dimension, and σ is a preset threshold that is adjusted automatically through training;

5. Complete the computation for all samples following the steps above. Two kinds of sets result: (1) the set of constraints on the range of values that a specific dimension of the information takes on a specific attribute, that is, the constraints of a node or dimension itself; (2) the set of constraints between dimensions, that is, the set of binary relations among multiple dimensions on a specific node attribute;

6. Merge the equivalence relations of the nodes within a dimension on an attribute or value attribute. Steps 3 to 5 record the values that a specific dimension takes on a specific attribute over all samples, denoted

Value_Cnt{(V₁, cnt₁), (V₂, cnt₂), ..., (V_N, cnt_N)}, where N is the number of distinct values;

For the merging of equivalence relations, the computation falls into two types:

(1) For a discrete attribute value, the probability P_vi that a node of this dimension takes this value on this attribute is computed as a statistical probability,

where i ranges over [0, N].

(2) A continuous attribute value is taken to follow a normal distribution with expectation μ and standard deviation δ, where:

μ = V₁*P_V1 + V₂*P_V2 + … + V_n*P_Vn

$$\delta = \sqrt{\frac{\sum_{i=1}^{N} \left( V_i - \frac{\sum_{j=1}^{N} V_j}{N} \right)^2}{N}}$$

7. Analyze the relations that may exist between different dimensions. For the merging of comparison relations, greater-than, less-than, and equal-to are taken as the enumerated values of a discrete numeric attribute. The probabilities of the comparison relations are obtained from step 4; cases whose distribution is identical across different sample pages are merged, and relations that differ are removed. For the merging of distance relations, each distance value is taken as a value point, that is, as a sample value of a continuous numeric attribute. The probabilities of the distance relations are obtained from step 4, the range covered by the distance values is computed, and the region of the distribution is determined. The relations that exist between dimensions are determined from the perspective of all sample sets.

8. Check each sample against the equivalence-relation constraint set and the constraints on relations among multiple dimensions on the same attribute.

(1) Assume that the number of elements in each dimension's value set is 1; the result set obtained is:

Result{ U_di(N_ij | j ∈ {1, …, m}) | i ∈ {1, …, n} }

If the result set on dimension d1 is U_d1(N_x1, …, N_xm), the result set on dimension d2 is U_d2(N_y1, …, N_yn), and d1 and d2 have the binary relation set UR(R_1, …, R_n), take the combination of U_d1 and U_d2:

U_d12{ (N_xi, N_yj) | i ∈ {1, …, m}, j ∈ {1, …, n} }

Traverse the above node pairs and define the set of all node pairs that satisfy every binary relation in UR.

(2) Enumerate every combination of two dimensions chosen from all the dimensions, traverse these combinations, and repeat step (1) for each pair of dimensions;

(3) If the final result set contains exactly one result, it is determined that, on the partition set above, the equivalence relations and the binary relations between dimensions can correctly identify each dimension of the information; if there is more than one result, more constraints are added.
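The check in step 8 can be sketched as a filter over candidate node pairs. The node representation (dictionaries with `top` and `font_size` fields), the two example relations and the function names are all hypothetical and serve only to illustrate the idea; the source does not specify how nodes or binary relations are encoded.

```python
from itertools import combinations, product


def consistent_pairs(candidates_d1, candidates_d2, relations):
    """Keep the node pairs that satisfy every binary relation of the two dimensions."""
    return [
        (a, b)
        for a, b in product(candidates_d1, candidates_d2)
        if all(relation(a, b) for relation in relations)
    ]


def check_all_dimension_pairs(candidates, relations):
    """Repeat the pair check for every combination of two dimensions."""
    results = {}
    for d1, d2 in combinations(candidates, 2):
        rules = relations.get((d1, d2), [])
        results[(d1, d2)] = consistent_pairs(candidates[d1], candidates[d2], rules)
    return results


# Hypothetical candidate nodes for two dimensions of a product page.
candidates = {
    "title": [
        {"id": "n1", "top": 10, "font_size": 24},
        {"id": "n2", "top": 300, "font_size": 12},
    ],
    "price": [{"id": "n3", "top": 60, "font_size": 18}],
}
# Hypothetical binary relations: the title sits above the price and uses a larger font.
relations = {
    ("title", "price"): [
        lambda a, b: a["top"] < b["top"],
        lambda a, b: a["font_size"] > b["font_size"],
    ]
}
results = check_all_dimension_pairs(candidates, relations)
# Exactly one surviving pair means the constraints identify the dimensions.
recognizable = all(len(pairs) == 1 for pairs in results.values())
```

When more than one pair survives for some combination of dimensions, the sketch reports the dimensions as not yet identified, which corresponds to the case where more constraints have to be added.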

9. If step 8 cannot yield the correct result, the result is determined by taking the maximum or minimum of a comparison sequence of values. For each node of the result set, the comparable attributes are obtained, and the actual value is drawn from the result set by a limited extreme-value sequence.

10. If the common extreme-value sequence set U_info obtained by calculating over all samples is not empty, the information Info_dim is considered recognizable on the partition set. If U_info is empty and the results drawn from the other two kinds of constraints outnumber the actual results, Info_dim is considered unrecognizable on the partition set. In that case, refine the partition or add a new partition.

11. If the information Info_dim is recognizable on the partition set, output the three kinds of constraint sets above. If it is unrecognizable, output all result sets obtained from the other two kinds of constraints; by manually comparing these result sets with the correct result, obtain the knowledge that distinguishes them, add it to the partition set, and recalculate.

Through the above calculation and analysis, a group of rule constraint sets related to the information retrieval dimensions is finally obtained. These constraint sets and the dimension information are configured into a template for use in information retrieval.

On the basis of the constraint sets, the page to be parsed is divided into nodes, and suitable information nodes are then screened according to the constraint sets generated by training, thereby completing the extraction of information.

First, generate the information set:

1. Parse the input page into a document tree;

2. Traverse all nodes of the document tree;

3. Obtain a node of the document tree;

4. Determine whether this node is a comment node; if so, return to step 3, otherwise perform the next step;

5. Add this node to the information set;

6. Determine whether the document tree still has nodes that have not been traversed; if so, return to step 3, otherwise perform the next step;

7. Output the resulting information set U(N_1, N_2, …, N_n). According to predefined node classification rules, store each element in the subset to which it belongs; then classify and merge, merging the different feature values of the same node, to generate a look-up table in which the element is the key and the feature tuple is the value.
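The seven steps above can be sketched with the Python standard library. This is only an illustration under stated assumptions: the page is assumed to be well-formed XHTML so that `xml.dom.minidom` can parse it, the classification rule (element, text, other) and the feature tuple (class, node name) are hypothetical stand-ins for the predefined rules, and the look-up table is keyed by traversal position rather than by the node object itself.

```python
from xml.dom import minidom
from xml.dom.minidom import Node


def build_information_set(page):
    """Steps 1 to 6: parse the page and collect every node except comment nodes."""
    document = minidom.parseString(page)           # step 1: build the document tree
    information_set = []
    pending = [document.documentElement]           # step 2: start the traversal
    while pending:                                 # step 6: stop when nothing is left
        node = pending.pop()                       # step 3: obtain one node
        if node.nodeType == Node.COMMENT_NODE:     # step 4: skip comment nodes
            continue
        information_set.append(node)               # step 5: add to the information set
        pending.extend(reversed(node.childNodes))  # keep document order
    return information_set


def build_lookup_table(information_set):
    """Step 7: classify each node and store its feature tuple in a look-up table."""
    classes = {Node.ELEMENT_NODE: "element", Node.TEXT_NODE: "text"}
    lookup = {}
    for position, node in enumerate(information_set):
        node_class = classes.get(node.nodeType, "other")
        lookup[position] = (node_class, node.nodeName)
    return lookup


page = "<html><!-- note --><body><p>price</p></body></html>"
information_set = build_information_set(page)
lookup = build_lookup_table(information_set)
names = [node.nodeName for node in information_set]
```

Real pages are rarely well-formed, so a tolerant HTML parser would be needed in practice; the traversal and the comment-node test stay the same.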

Then, in aggregating the candidate nodes contained in each dimension, the nodes are first classified according to the constraint rules; the multiple blocks obtained after classification are each sorted internally according to the specified ordering rule; and the Top N elements of each block are then taken, according to the configured condition, as the candidate result set. Specifically:

Read the ordering constraint conditions of each dimension; then classify and screen this dimension to select the node set that satisfies the ordering rule; store the node set in the ordering-constraint partition look-up table; determine whether any dimension remains unprocessed; if so, repeat the classification and screening step, otherwise output the resulting ordering-constraint partition look-up table.
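The classify, sort and take-Top-N sequence can be sketched as below. The node fields (`tag`, `text`, `font_size`), the classification by tag name and the ordering by font size are hypothetical examples of a constraint rule and an ordering rule; the source leaves both to configuration.

```python
from collections import defaultdict


def top_n_candidates(nodes, classify, sort_key, n, reverse=True):
    """Classify nodes into blocks, sort each block, and keep its first n elements."""
    blocks = defaultdict(list)
    for node in nodes:
        blocks[classify(node)].append(node)
    return {
        label: sorted(block, key=sort_key, reverse=reverse)[:n]
        for label, block in blocks.items()
    }


# Hypothetical candidate nodes of one page.
nodes = [
    {"tag": "h1", "text": "Kettle", "font_size": 24},
    {"tag": "span", "text": "9.90", "font_size": 14},
    {"tag": "span", "text": "ad", "font_size": 10},
    {"tag": "h1", "text": "Deals", "font_size": 18},
]
candidates = top_n_candidates(
    nodes,
    classify=lambda node: node["tag"],        # hypothetical constraint rule
    sort_key=lambda node: node["font_size"],  # hypothetical ordering rule
    n=1,
)
```

Returning a dictionary keyed by block label mirrors the partition look-up table described above, with one entry per block.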

During extraction, the candidate node set of a dimension is obtained from the resulting associated-constraint look-up table. It is determined whether the number of elements in the set is 1. If so, the relevant content of this node is extracted as required, i.e. page markup and related style information are removed, and the information is saved into the information set with the dimension as the key and the node content as the value; the resulting information set is output, completing the information retrieval and ending the process. Otherwise, the page link, the dimension identifier and the candidate node set are written to the error-handling log.
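The extraction step can be sketched as follows, assuming that each candidate node is available as an HTML fragment string and that a dimension whose candidate set does not contain exactly one element is the case sent to the error log. The regular expression is a deliberately simple way to strip markup, and the logger name, function name and sample data are hypothetical.

```python
import logging
import re

logger = logging.getLogger("extraction_errors")  # hypothetical logger name
TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_information(page_url, candidate_table):
    """Keep dimensions with exactly one candidate node; log the others."""
    information = {}
    for dimension, candidates in candidate_table.items():
        if len(candidates) == 1:
            # Remove page markup and keep only the text content of the node.
            information[dimension] = TAG_PATTERN.sub("", candidates[0]).strip()
        else:
            logger.error(
                "page=%s dimension=%s candidates=%r",
                page_url, dimension, candidates,
            )
    return information


# Hypothetical candidate table: the title is unambiguous, the price is not.
candidate_table = {
    "title": ["<h1 class='t'>Red kettle</h1>"],
    "price": ["<span>9.90</span>", "<b>12.00</b>"],
}
information = extract_information("http://example.com/item", candidate_table)
```

The logged entries carry the page link, the dimension and the candidate set, which is the material needed to refine the constraints for the ambiguous dimension later.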

In summary, the present invention proposes a data retrieval method for a particular topic that overcomes the bottleneck of string-matching search, improves the accuracy of search results, achieves intelligent and efficient search, and can adapt to the needs of all kinds of businesses.

Obviously, those skilled in the art should understand that each module or each step of the present invention described above can be implemented with a general-purpose computing system. They can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. Thus, the present invention is not restricted to any specific combination of hardware and software.

It should be understood that the above specific embodiments of the present invention are used only to illustrate or explain the principle of the present invention by way of example, and do not limit the present invention. Therefore, any modification, equivalent substitution, improvement and the like made without departing from the spirit and scope of the present invention shall be included within the scope of protection of the present invention. In addition, the appended claims of the present invention are intended to cover all changes and modifications that fall within the scope and boundary of the appended claims, or equivalents of such scope and boundary.

Claims

1. A data retrieval method for a particular topic, characterized by comprising:

according to users' evaluations of commodities, calculating the weighted attributes corresponding to the commodities by way of score calculation, and realizing semantic-level commodity search based on searching the weighted attributes.

2. The method according to claim 1, characterized in that the method further comprises:

obtaining the attribute values of commodities by automatic extraction from users' evaluations of the commodities, and searching the attribute values to obtain commodities of a certain type that suit a particular scenario, the automatic extraction comprising:

(1) structuring the commodity evaluations;

(2) segmenting into words the content of all comments by the same user on the same commodity, filtering out predefined stop words after word segmentation, then, for repeated words, keeping the one whose comment time is the latest, and finally obtaining the attribute values of the same user for the same commodity;

(3) calculating, according to step (2), the attribute values of all users for the same commodity, and aggregating identical attribute values;

then classifying the attribute values; using the commodity types obtained as the dimensions of the attributes; taking the attribute values whose number of repetitions exceeds a predefined threshold as the values in a dimension; calculating the respective score weights in a manner in which commodities and users complement each other, i.e. analyzing all evaluations to obtain the list of commodities in which every user is interested; starting from the commodity dimension, obtaining the attribute values from the evaluations and calculating them, and calculating the commodity list under each attribute value from the attribute values;

Define the dimension set D; the dimension value set V; the list of evaluated commodities SU(p_1, p_2, …, p_n); the list of users participating in the evaluation UU(u_1, u_2, …, u_m); the dimension list of the commodities DU{d_1, d_2, …, d_k}; for any dimension in DU, the attribute value list VU{v_1, v_2, …, v_o}; the attribute list SMU(pm_1, pm_2, …, pm_x), corresponding to the values of the elements of SU; the attribute classification list UMU(um_1, um_2, …, um_y), corresponding to the values of the elements of UU. Assume that a certain dimension is A{a_1, a_2, …, a_n}, the user set is U{U_1, U_2, …, U_m}, and the commodity set is P{P_1, P_2, …, P_k}.

SP(product | a_x) = \sum_{i=1}^{n} \left( U_{i \mid U_i.A = a_x} / cnt_M / cnt_v \times \theta \right) \times cnt_{vx} / sum

where a_x is a dimension value in dimension A; SP(product | a_x) is the score of the commodity when its value in dimension A is a_x;

cnt_M is the total number of attributes; sum is the total number of evaluations by all users of the product in dimension A; cnt_vx is the total number of evaluations by all users in which the product's value in dimension A is a_x; cnt_vx / sum is the weight coefficient, over all users, of the value a_x of the product in dimension A; cnt_v is the number of evaluations by the user on this value of this dimension; θ is the down-weighting factor, determined by the latest and earliest times at which the user evaluated the product in dimension A;

(2) The score of a user is calculated from the attribute scores of the corresponding commodities: assume that the user's set of categories for commodities is DV(D_i V_j | D_i ∈ DU, V_j ∈ VU); define pdv as the score of commodity p on value v of dimension d, and pdv′ = pdv / cnt_pdv, where cnt_pdv is the number of users who voted for value v in dimension d of commodity p. The user score SP(U_u) is calculated as follows:

SP(U_u) = \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{o} P_i D_j V_k / cnt_{P_i D_j V_k}

(3) Build the weight equation system

According to the scores SP(product | a_x) and SP(U_u) calculated above for the commodity attributes and the users, a system of linear equations in M + N*V unknowns is established, where N is the total number of commodities, M is the total number of users, and V is the number of elements in the dimension value set of each dimension; the weight equation system is solved iteratively to obtain the weight of each commodity's attributes and the weight of each user.
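As an illustration of the commodity score used in claim 2, the sketch below evaluates SP(product | a_x) for one product and one dimension. It rests on several assumptions that the claim does not spell out: each record holds one user's evaluations of the product in dimension A, `cnt_v` is that user's number of evaluations for the recorded value, `theta` is the user's down-weighting factor, `sum` is the total of all `cnt_v`, and the user weights U_i are given. The function name, field names and all numbers are hypothetical.

```python
def product_score(evaluations, user_weight, cnt_m, a_x):
    """Evaluate SP(product | a_x) from per-user evaluation records of one product."""
    total = sum(e["cnt_v"] for e in evaluations)  # "sum" in the formula
    if total == 0:
        return 0.0
    matching = [e for e in evaluations if e["value"] == a_x]
    cnt_vx = sum(e["cnt_v"] for e in matching)
    # Contribution of every user whose value in dimension A is a_x.
    inner = sum(
        user_weight[e["user"]] / cnt_m / e["cnt_v"] * e["theta"]
        for e in matching
    )
    return inner * cnt_vx / total


# Hypothetical evaluations of one product in a "durability" dimension.
evaluations = [
    {"user": "u1", "value": "durable", "cnt_v": 2, "theta": 1.0},
    {"user": "u2", "value": "durable", "cnt_v": 1, "theta": 0.5},
    {"user": "u3", "value": "fragile", "cnt_v": 1, "theta": 1.0},
]
user_weight = {"u1": 1.0, "u2": 2.0, "u3": 1.0}  # assumed current user weights
score = product_score(evaluations, user_weight, cnt_m=4, a_x="durable")
```

In an iterative solution of the weight equation system, a function of this kind would be re-evaluated with updated user weights on each pass until the commodity and user weights stop changing; that loop is not shown because the claim does not fix its update rule.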