CN106168982A - Data retrieval method for particular topic - Google Patents
- Publication number
- CN106168982A (application number CN201610630951.6A)
- Authority
- CN
- China
- Prior art keywords
- dimension
- commodity
- value
- user
- search
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a data retrieval method for a particular topic. Based on users' reviews of commodities, the method calculates the weighted attributes of each commodity by score calculation and performs semantic-level commodity search by searching on those attribute weights. The method overcomes the bottleneck of string-matching search, improves the accuracy of search results, achieves intelligent and efficient search, and can adapt to the needs of various kinds of business.
Description
Technical field
The present invention relates to data retrieval, and in particular to a data retrieval method for a particular topic.
Background art
As users demand ever more accurate data from the Internet, professional (vertical) search engines have emerged to meet this need. Such an engine integrates the domain-specific information of one kind of data, as in commodity search, financial search and video search. Compared with a general-purpose search engine, a professional search engine has richer search rules and is more accurate and more specialized.
Existing vertical search technology and products nevertheless have several technical shortcomings. First, existing e-commerce search engines generally rank results by the composite score of the search terms in a document; if results must also be ranked by visit volume, the whole result set is sorted a second time, which disrupts the first ranking and badly affects the user experience. Second, existing search engines generally work by character matching on the search words; they can only perform simple character matching and cannot truly understand the meaning of the searched object, which can only be refined through human subjective judgment. Furthermore, as web technology changes rapidly, regular expressions have to be rewritten for the e-commerce search engine, which is clearly ill-suited to real-time processing of massive web-wide data.
Summary of the invention
To solve the problems of the prior art described above, the present invention proposes a data retrieval method for a particular topic, comprising:
calculating, based on users' reviews of commodities, the weighted attributes of each commodity by score calculation, and performing semantic-level commodity search by searching on attribute weights.
Preferably, the method further comprises:
automatically extracting the attribute values of commodities from users' reviews, and searching by attribute value to obtain commodities of a certain type that fit a specific scenario, the automatic extraction comprising:
(1) structuring the commodity reviews;
(2) segmenting into words the content of all comments made by the same user on the same commodity, filtering out predefined stop words after segmentation, keeping for each repeated word the occurrence with the latest comment time, and thereby obtaining the attribute values given by that user to that commodity;
(3) computing, as in step (2), the attribute values given by all users to the same commodity, and aggregating identical attribute values;
(4) obtaining, by steps (2) and (3), the attribute values derived from all users' comments on all commodities.
The attribute values are then classified. The resulting commodity types serve as the dimensions of the attributes, and an attribute value whose number of repetitions exceeds a predefined threshold becomes a value in its dimension. The respective score weights are calculated through the mutual dependence between commodities and users; that is, all reviews are analyzed to obtain the list of commodities each user is interested in. From the commodity side, attribute values are obtained from the reviews and scored, and the list of commodities under each attribute value is computed from the attribute values.
Define the dimension set $D$; the dimension value set $V$; the list of reviewed commodities $SU(p_1, p_2, \dots, p_n)$; the list of reviewing users $UU(u_1, u_2, \dots, u_m)$; the dimension list of commodities $DU\{d_1, d_2, \dots, d_k\}$; the attribute value list $VU\{v_1, v_2, \dots, v_o\}$ of any dimension in $DU$; the attribute list $SMU(pm_1, pm_2, \dots, pm_x)$, holding the values of the corresponding $SU$ elements; and the attribute classification list $UMU(um_1, um_2, \dots, um_y)$, holding the values of the corresponding $UU$ elements. Suppose a dimension is $A\{a_1, a_2, \dots, a_n\}$, the user set is $U\{U_1, U_2, \dots, U_m\}$ and the commodity set is $P\{P_1, P_2, \dots, P_k\}$.
(1) The commodity score is calculated jointly from the number of reviewing users and the weights of those users, as follows:
[formula omitted in source]
where $a_x$ is a dimension value in dimension $A$; $product|a_x$ denotes the score of the commodity for dimension value $a_x$ in dimension $A$; $U_i.A = a_x$ covers all users who rated the product as $a_x$ in dimension $A$; $cnt_M$ is the total count over all attributes; $sum$ is the total number of reviews of the product by all users in dimension $A$; $cnt_{vx}$ is the total number of reviews by all users in which the product's value in dimension $A$ is $a_x$; $cnt_{vx}/sum$ is the weight coefficient of dimension value $a_x$ for the product in dimension $A$ across all users; $cnt_v$ is the number of reviews by the user for this value of this dimension; and $\theta$ is a weight-decay factor determined by the latest and earliest times at which users reviewed the product in dimension $A$.
(2) The score of a user is calculated from the attribute scores of the corresponding commodities. Suppose the set of a user's classifications of commodities is $DV(D_i V_j \mid D_i \in DU, V_j \in VU)$. Define $pdv$ as the score of commodity $p$ for dimension value $v$ of dimension $d$, and $pdv' = pdv / cnt_{pdv}$, where $cnt_{pdv}$ is the number of users who voted for value $v$ in dimension $d$ of commodity $p$. The user score $SP(U_u)$ is calculated as follows:
[formula omitted in source]
(3) Build the weight equation system.
From the score computations $SP(product|a_x)$ and $SP(U_u)$ for commodity attributes and users above, a system of linear equations in $M + N \cdot V$ unknowns is set up, where $N$ is the total number of commodities, $M$ is the total number of users and $V$ is the number of dimension values in each dimension. The system is solved iteratively to obtain the weight of each commodity attribute and the weight of each user.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a data retrieval method for a particular topic that overcomes the bottleneck of string-matching search, improves the accuracy of search results, achieves intelligent and efficient search, and can adapt to the needs of various kinds of e-commerce business.
Brief description of the drawings
Fig. 1 is a flow chart of the data retrieval method for a particular topic according to an embodiment of the present invention.
Detailed description of the invention
A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses many alternatives, modifications and equivalents. Many specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of them.
One aspect of the present invention provides a data retrieval method for a particular topic. Fig. 1 is a flow chart of the data retrieval method for a particular topic according to an embodiment of the present invention.
The present invention implements a search engine architecture for a professional domain. It uses hierarchical sorting and builds weighted attributes through two-dimensional score calculation to perform deep intelligent search; it establishes a multi-dimensional constraint data extraction scheme to extract page content intelligently; it generates and updates search word extension content; and, especially for search words in very long text, it highlights search words based on hash lookup.
The professional-domain search engine architecture comprises the following modules. The acquisition module is responsible for collecting and receiving data, saving it under a specific folder, and providing web pages. The data storage module organizes the received data into the data format required for indexing; it supports self-recovery and rollback. A rollback cannot be undone: once the data is rolled back to a specific date, at the next update the data before that date is retained and the data after that date is deleted. The data indexing module builds the index from the data, and the index also has a rollback mechanism. The search call interface module publishes the search engine as an HTTP service. The logging and monitoring module monitors the running status of each of the above systems. The data analysis module analyzes the business data in web page content. The user modification module revises search results from outside, including adding, deleting and changing results and modifying their order. The data search module is responsible for data search and automatically fetches the latest data from the index system.
Based on the salient features of a specific website, the data analysis module identifies and finds all web pages. Then, according to the semantics of the search concepts behind the URLs on a web page, it compares the sizes of the sets of entities contained in the web page and in each page its URLs point to, and so finds the URLs of that web page. Finally, the link text of a URL is mapped onto the entities contained in the web page the URL points to and is added to the attribute set of those entities.
To avoid unnecessary repetition during hidden-attribute discovery, a pruning mechanism is applied to the search B-tree. Each node of the search B-tree represents a web page; an edge from a parent node to a leaf node represents the subordinate relation between the corresponding web pages; the value on an edge is the corresponding hidden attribute; and all hidden attributes on the path from the root to a leaf constitute the hidden attribute set of that leaf. Lower-level leaf nodes are first generated depth-first according to the semantics of the subordinate URLs. Then, for each newly generated leaf node, its hidden attribute set is compared with those of the existing leaf nodes; if an identical set exists, the new leaf node is discarded. This completes the crawling of attributes. When crawling ends, all target pages have been obtained without duplicates, and all attribute information is passed to the page information extraction program.
The data analysis module of the present invention divides the web pages of an e-commerce website into three kinds: results pages, object pages and other pages. One search corresponds to a series of results pages; an object page contains the information of one independent entity, such as a commodity; pages that belong to neither of these two kinds are classified as other pages. Each entity is described by a set of attributes, which form the conditions of a search. Each entity has one and only one object page. The e-commerce website is described as an undirected graph: P is the vertex set, in which each vertex represents a page, and L is the edge set, in which each edge represents a URL from one page to another. R denotes the set of all results pages, O the set of all object pages and Q the set of all searches. All attributes among searches, results pages and object pages form an attribute space, which is clustered on the basis of the link structure between pages.
The goal is to find, for each entity, the attribute information hidden in searches. This requires finding the full set of searches and, for each search, the attribute-value key-value pairs that correspond to it and are satisfied by the entities matching that search. Let q be a search; q is represented by the set of results pages δ(q) that match it. The specific steps are:
1. Crawl all pages of the website, identify each page by its URL, and extract all URLs from each page.
2. Identify the type of each page: results page, object page or other page. Because object pages within the same website have similar HTML structure, an SVM-based page classification method is used to identify object pages. A greedy rule is then applied: any non-object page that contains a URL pointing to an object page is classified as a results page.
3. Cluster the results pages into sets according to search, each set corresponding to one search. For each page p in the set R, let t(p) be the set of all pages that p points to; the distance between two pages is the size of the symmetric difference between their sets t(p). A distance threshold d is introduced: when the distance is less than d, the two pages are taken to belong to the same search.
4. Find the relations between searches. Check the URLs of each page in each results page set s. If a URL points to a page in another results page set r, check the subset relation between the entity page sets $w_s$ and $w_r$ associated with the searches that s and r represent. If [condition omitted in source], extract the URL between s and r as an attribute, and create an attribute key-value pair with its hypertext as the attribute value and the enclosing HTML element as the attribute name.
5. Extract the union of the attribute-value key-value pairs of all searches that the entity satisfies, as the hidden attributes of the entity.
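Step 3 above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `cluster_result_pages`, the dictionary layout and the greedy single-link grouping are assumptions, and t(p) is taken to be the set of URLs linked from page p.

```python
from typing import Dict, List, Set


def cluster_result_pages(links: Dict[str, Set[str]], d: int) -> List[List[str]]:
    """Group results pages whose link sets are close to each other.

    links maps a results page URL to t(p), the set of URLs it points to.
    The distance between two pages is the size of the symmetric
    difference of their link sets; a page joins the first cluster that
    already holds a page at distance < d (a greedy, hypothetical policy).
    """
    clusters: List[List[str]] = []
    for page in sorted(links):
        for cluster in clusters:
            if any(len(links[page] ^ links[other]) < d for other in cluster):
                cluster.append(page)
                break
        else:
            # No existing cluster is close enough: start a new search group.
            clusters.append([page])
    return clusters


pages = {
    "r1": {"o1", "o2", "o3"},
    "r2": {"o1", "o2", "o4"},
    "r3": {"o7", "o8"},
}
clusters = cluster_result_pages(pages, d=3)
```

Here `r1` and `r2` differ in two links and fall into the same group, while `r3` shares nothing with them and forms its own group.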
The data search module of the search engine architecture comprises a sorting module, an attribute-weight-based search module, a search word extension module, a web page intelligent processing module and a search word highlighting module. The sorting module performs hierarchical sorting: each level holds several sorting logics of equal weight, and within-level sorting is performed for every level of logic, while real-time visit volume serves as a reference for sorting. The overall flow comprises sorting logic classification, sorting logic integration, ranking result block division, ranking result integration and ranking result set storage. According to the actual needs of the search business, the search logics are graded by priority in the form of a matrix. The ranking results are divided by level, each sorting logic layer corresponding to one ranking result set. Within-level sorting is then performed according to the sorting logic of the same level, real-time visit volume data is used as a sorting factor for a secondary sort within the level, and after integration a suitable subset of ranking results is taken from the ordered ranking result layers and returned to the user.
The attribute-weight-based search module calculates, from users' reviews of commodities, the weighted attributes of each commodity by score calculation, and solves semantic commodity search by searching on attribute weights. It covers dynamic generation of attribute values, attribute score calculation, multi-attribute sorting of commodities and commodity attribute search.
After the user enters part of a search word, the search word extension module suggests a list of search terms the user may want, and the user searches by selecting any search word in the list. The present invention divides web page objects and stores them in memory, and generates the search word extension list by traversing and dividing the web pages, for use in searching and updating search words.
The web page intelligent processing module takes ordinary pages as a training set and determines the set of constraint rules for a given type of page; these constraint rules are then applied directly to extract the corresponding information. Node division rules may also be adjusted manually. Node division rules describe the most basic attributes of a node from different aspects, and pages of the same type need only one class of node division rules, which meets the needs of existing search engines.
The search word highlighting module is a general method of displaying search word information content, designed for the problem of displaying search words in long text. Using a purpose-designed in-memory data structure, it first parses the information content and stores the resulting positional inverted index of the search words in memory; it then looks up the positional inverted index by hash to speed up loading of search word information, and locates the positions of the specified search word to determine the highlighting range. It comprises search word parsing, information content parsing, search word information loading, display content integration and a display unit.
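The idea of a positional inverted index with hash lookup can be illustrated with a short sketch. The function names, the `<em>` markers and the use of a plain Python dictionary as the in-memory hash structure are assumptions made for illustration; the source does not specify the data structure.

```python
from typing import Dict, List


def build_position_index(text: str, terms: List[str]) -> Dict[str, List[int]]:
    """Parse the content once and record every start offset of each term."""
    index: Dict[str, List[int]] = {}
    for term in terms:
        start = text.find(term)
        while start != -1:
            index.setdefault(term, []).append(start)
            start = text.find(term, start + 1)
    return index


def highlight(text: str, index: Dict[str, List[int]], term: str,
              open_tag: str = "<em>", close_tag: str = "</em>") -> str:
    """Wrap each indexed occurrence of term; positions come from a hash lookup."""
    parts: List[str] = []
    cursor = 0
    for pos in index.get(term, []):
        if pos < cursor:
            continue  # skip an occurrence that overlaps the previous one
        parts.append(text[cursor:pos])
        parts.append(open_tag + term + close_tag)
        cursor = pos + len(term)
    parts.append(text[cursor:])
    return "".join(parts)


content = "red shoes and red hats"
position_index = build_position_index(content, ["red", "hats"])
marked = highlight(content, position_index, "red")
```

Because the offsets are computed once and kept in memory, highlighting a long text for a given search word only needs a dictionary lookup rather than a new scan.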
The sorting module specifically comprises units for sorting logic classification, sorting logic integration, ranking result block division, ranking result integration and ranking result set storage; these units are described in detail in the following embodiment.
The logic classification unit grades the sorting logics according to the user's actual needs, forming a matrix sorting logic model. Elements in the same row of the matrix are logics of the same level, different rows are different levels, and the weights differ between layers. Suppose an N*M matrix consisting of N sorting logic levels, each with M sorting logics; a subset of the levels, and a subset of the logics within those levels, is selected from it. To choose the sorting logic classification matrix, the first P rows of the matrix are set as search logic layers whose priority increases or decreases monotonically, and each search logic layer takes a subset of 1 to M logics as the sorting logic of that layer. Each logic is mapped to a number, which converts the search logic matrix into a numeric matrix.
The sorting logic integration unit combines the sorting logics in the M*N hierarchical logic matrix into a set of searches, scans all documents to complete all searches, and forms multiple result sets that are ordered within each level. The ranking result block division unit partitions the results according to the hierarchy model, one block per layer, generating M data blocks (sorting data layers), each data block forming a data region.
The ranking result integration unit takes a number of result subsets from the data blocks according to the parameter passed in, and then merges them into one complete result set. The parameter passed in is a region (range) value, and the integration flow is as follows:
1. Determine, from the start and end addresses of the region, the sorting data layers in which the requested search results lie.
2. Determine whether the start and end addresses lie in the same sorting data layer; if they do, take the requested subset from that layer and go to step 8.
3. Take the data subset at the bottom of the first sorting data layer.
4. Determine whether the number of sorting data layers spanned is greater than 2; if it is not, go to step 6.
5. Take all result subsets of the intermediate sorting data layers.
6. Take the data subset at the top of the last sorting data layer.
7. Merge the extracted result sets in order.
8. Return the result set.
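The integration flow amounts to slicing a range out of several ordered layers without concatenating them first. A minimal sketch, assuming each sorting data layer is an in-memory list already ordered internally and that the region is a half-open index range `[start, end)` over the layers taken in priority order (the function name `take_range` is hypothetical):

```python
from typing import List


def take_range(layers: List[List[str]], start: int, end: int) -> List[str]:
    """Return the results in [start, end) across ordered sorting data layers.

    The tail of the first layer touched, every intermediate layer in full,
    and the head of the last layer touched are merged in layer order.
    """
    result: List[str] = []
    offset = 0  # global index of the first element of the current layer
    for layer in layers:
        lo = max(start - offset, 0)
        hi = min(end - offset, len(layer))
        if lo < hi:
            result.extend(layer[lo:hi])
        offset += len(layer)
        if offset >= end:
            break  # the requested region is fully covered
    return result


layers = [["a", "b", "c"], ["d", "e"], ["f", "g", "h"]]
page = take_range(layers, 2, 7)
```

In this example the region spans all three layers: one element from the end of the first, the whole second layer, and two elements from the start of the third.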
Performing within-level sorting according to the sorting logic of the same level, with real-time visit volume data as a sorting factor for the secondary sort within the level, further comprises:
mapping a real-time sorting visit volume matrix onto the hierarchical sorting logic matrix, so that each level of logic has several external visit volumes as the reference for real-time sorting. The secondary sort according to the values of the real-time sorting visit volume matrix comprises: locating, from the parameters, the data block and the region within the block to be sorted; fetching the values of the real-time sorting factors from the database; and sorting the region.
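One way to keep the secondary sort from disturbing the first ranking is to make visit volume a tie-breaker that never crosses level boundaries. The sketch below assumes each result carries a level number (lower means higher priority) and a within-level score, and that visit counts are available as a dictionary; these shapes and the name `tiered_sort` are illustrative assumptions, not the patent's data model.

```python
from typing import Dict, List, Tuple

# (document id, level, score within the level)
Result = Tuple[str, int, float]


def tiered_sort(items: List[Result], visits: Dict[str, int]) -> List[str]:
    """Order by level, then by level score, then by real-time visit volume.

    Visit volume only reorders results that tie on level and score, so a
    heavily visited document in a lower-priority level cannot jump ahead.
    """
    ordered = sorted(
        items,
        key=lambda it: (it[1], -it[2], -visits.get(it[0], 0)),
    )
    return [doc_id for doc_id, _, _ in ordered]


results = [("a", 1, 0.9), ("b", 1, 0.9), ("c", 2, 0.99), ("d", 1, 0.5)]
visit_counts = {"a": 10, "b": 50, "c": 1000}
ranking = tiered_sort(results, visit_counts)
```

Document `c` has by far the most visits but stays last because it belongs to level 2, while `b` overtakes `a` inside level 1 on visit volume alone.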
For the attribute-weight-based search module, the present invention automatically extracts the attribute values of commodities from users' reviews and searches by attribute value to obtain commodities of a certain type that fit a specific scenario, achieving the goal of quasi-semantic search. The automatic extraction comprises:
1. Structure the commodity reviews.
2. Segment into words the content of all comments made by the same user on the same commodity, filter out predefined stop words after segmentation, keep for each repeated word the occurrence with the latest comment time, and thereby obtain the attribute values given by that user to that commodity.
3. Compute, as in step 2, the attribute values given by all users to the same commodity, and aggregate identical attribute values.
4. Obtain, by steps 2 and 3, the attribute values derived from all users' comments on all commodities.
After these steps, each commodity has multiple attributes defined by users. The attribute values are then classified. The resulting commodity types serve as the dimensions of the attributes, and an attribute value whose number of repetitions exceeds a predefined threshold becomes a value in its dimension.
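The extraction steps can be sketched as below. Several simplifications are assumed: whitespace splitting stands in for a real word segmenter, "number of repetitions" is read as the number of distinct users who used the word for a commodity, and the record layout and the name `extract_attribute_values` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# (user, commodity, comment timestamp, comment text)
Review = Tuple[str, str, int, str]


def extract_attribute_values(reviews: List[Review], stop_words: Set[str],
                             threshold: int) -> Dict[str, Set[str]]:
    """Return, per commodity, the words kept as attribute values."""
    # Step 2: per user and commodity, keep each word once with its latest time.
    latest: Dict[Tuple[str, str, str], int] = {}
    for user, item, timestamp, text in reviews:
        for word in text.lower().split():  # stand-in for word segmentation
            if word in stop_words:
                continue
            key = (user, item, word)
            latest[key] = max(timestamp, latest.get(key, timestamp))
    # Steps 3 and 4: aggregate identical values over all users and commodities.
    counts: Dict[str, Counter] = {}
    for (_user, item, word) in latest:
        counts.setdefault(item, Counter())[word] += 1
    # Keep only values repeated more often than the predefined threshold.
    return {
        item: {word for word, cnt in counter.items() if cnt > threshold}
        for item, counter in counts.items()
    }


sample_reviews = [
    ("u1", "p1", 1, "very warm and warm"),
    ("u2", "p1", 2, "warm but heavy"),
    ("u1", "p1", 3, "warm"),
]
values = extract_attribute_values(sample_reviews, {"very", "and", "but"}, threshold=1)
```

With a threshold of 1, "warm" survives because two different users used it for `p1`, while "heavy" is dropped.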
The respective score weights are then calculated through the mutual dependence between commodities and users; that is, all reviews are analyzed to obtain the list of commodities each user is interested in. From the commodity side, attribute values are obtained from the reviews and scored, and the list of commodities under each attribute value is computed from the attribute values, reflecting how strongly users support the commodity under that attribute.
Define the dimension set $D$; the dimension value set $V$; the list of reviewed commodities $SU(p_1, p_2, \dots, p_n)$; the list of reviewing users $UU(u_1, u_2, \dots, u_m)$; the dimension list of commodities $DU\{d_1, d_2, \dots, d_k\}$; the attribute value list $VU\{v_1, v_2, \dots, v_o\}$ of any dimension in $DU$; the attribute list $SMU(pm_1, pm_2, \dots, pm_x)$, holding the values of the corresponding $SU$ elements; and the attribute classification list $UMU(um_1, um_2, \dots, um_y)$, holding the values of the corresponding $UU$ elements.
Suppose a dimension is $A\{a_1, a_2, \dots, a_n\}$, the user set is $U\{U_1, U_2, \dots, U_m\}$ and the commodity set is $P\{P_1, P_2, \dots, P_k\}$.
(1) The commodity score is calculated jointly from the number of reviewing users and the weights of those users, as follows:
[formula omitted in source]
where $a_x$ is a dimension value in dimension $A$; $product|a_x$ denotes the score of the commodity for dimension value $a_x$ in dimension $A$; $U_i.A = a_x$ covers all users who rated the product as $a_x$ in dimension $A$; $cnt_M$ is the total count over all attributes; $sum$ is the total number of reviews of the product by all users in dimension $A$; $cnt_{vx}$ is the total number of reviews by all users in which the product's value in dimension $A$ is $a_x$; $cnt_{vx}/sum$ is the weight coefficient of dimension value $a_x$ for the product in dimension $A$ across all users; $cnt_v$ is the number of reviews by the user for this value of this dimension; and $\theta$ is a weight-decay factor determined by the latest and earliest times at which users reviewed the product in dimension $A$.
(2) The score of a user is calculated from the attribute scores of the corresponding commodities.
Suppose the set of a user's classifications of commodities is $DV(D_i V_j \mid D_i \in DU, V_j \in VU)$. Define $pdv$ as the score of commodity $p$ for dimension value $v$ of dimension $d$, and $pdv' = pdv / cnt_{pdv}$, where $cnt_{pdv}$ is the number of users who voted for value $v$ in dimension $d$ of commodity $p$. The user score $SP(U_u)$ is calculated as follows:
[formula omitted in source]
(3) Build the weight equation system.
From the score computations $SP(product|a_x)$ and $SP(U_u)$ for commodity attributes and users above, a system of linear equations in $M + N \cdot V$ unknowns is set up, where $N$ is the total number of commodities, $M$ is the total number of users and $V$ is the number of dimension values in each dimension. The system is solved iteratively to obtain the weight of each commodity attribute and the weight of each user.
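The iterative solution of mutually dependent user and attribute weights can be illustrated with a simple fixed-point loop. The patent's own score formulas are not reproduced in this text, so the update rules below are a substitute: a plain normalized mutual-reinforcement iteration that ignores the decay factor $\theta$ and the count coefficients. The name `solve_weights` and the vote layout are assumptions.

```python
from typing import Dict, List, Tuple

# (user, commodity, dimension value the user assigned to the commodity)
Vote = Tuple[str, str, str]


def solve_weights(votes: List[Vote], iterations: int = 20):
    """Alternate between attribute scores and user scores until they settle.

    Assumed update rules (not the patent's formulas):
      attribute score = sum of the weights of the users who voted for it
      user score      = sum of the scores of the attributes the user voted for
    Both sides are renormalized to sum to 1 after every pass.
    """
    users = {user for user, _, _ in votes}
    user_w: Dict[str, float] = {user: 1.0 for user in users}
    attr_w: Dict[Tuple[str, str], float] = {}
    for _ in range(iterations):
        attr_w = {}
        for user, item, value in votes:
            attr_w[(item, value)] = attr_w.get((item, value), 0.0) + user_w[user]
        total = sum(attr_w.values())
        attr_w = {key: w / total for key, w in attr_w.items()}

        new_user_w = {user: 0.0 for user in users}
        for user, item, value in votes:
            new_user_w[user] += attr_w[(item, value)]
        total = sum(new_user_w.values())
        user_w = {user: w / total for user, w in new_user_w.items()}
    return attr_w, user_w


sample_votes = [("u1", "p1", "warm"), ("u2", "p1", "warm"), ("u3", "p1", "thin")]
attribute_weights, user_weights = solve_weights(sample_votes)
```

In this toy run the value backed by two users ends up with a higher weight than the value backed by one, and the users who agree with the majority gain weight in turn. A real implementation would substitute the patent's score formulas for the two update rules.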
The search word extension module first generates web page objects, each corresponding to one record in the search engine's web page set. The object has three parts: a data ID, which is the reference address of the data; a data value, which is the concrete data; and a sort attribute list, which is the multi-dimensional list of sort attribute values corresponding to the hierarchical sorting logic, reduced to a one-dimensional sort attribute list. These sort attribute values are stored in an array from high to low level priority, and two sort attribute lists are compared in priority order. The web page object array is a shared data pool: each item inside is referenced by its data ID, and a web page object hash table is maintained with the data value of the web page object as key.
A search word object is then generated with the following elements: the search word, a data ID object list and a data ID object candidate list. The search word is obtained by dividing the data value attribute of a web page object in the shared data pool; each data value is divided by increasing length into several search words. A data ID object consists of two elements, the web page ID and the sort data value list. The data ID object list is the list of valid data ID objects for a search word, and the data ID object candidate list is used to replenish the data ID object list.
Search word extension content is generated while traversing the web pages: each web page is divided by the rule of increasing search word length, the divided search words are transformed during division to form a search word list, and each search word is stored as a key in the hash table. In detail:
1. Store the web pages in memory as required by the memory structure, and traverse the web page list.
2. Transform and divide each web page to form a search word list.
3. According to the sort attribute value list, decide for each search word whether the corresponding web page ID is inserted into the data ID list or into the data ID candidate list.
4. Generate the search word object hash table of the web pages; this hash table contains the filled data ID lists and data ID candidate lists.
The division flow for each item of data is the core, and is as follows:
Transform the data value of the web page object into a set of data values of several types; divide each data value in the set by increasing search word length; look up the search word hash table with each divided search word as key, and if the lookup succeeds, go to step 3 above; otherwise create a search word object according to the memory data structure and add it to the hash table.
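The division by increasing length and the two lists per search word can be sketched as follows. The class name `SearchWord`, the fixed `capacity` of the data ID list and the representation of the sort attribute list as a tuple are assumptions; Python compares tuples element by element, which matches comparing sort attribute lists in priority order.

```python
from typing import Dict, List, Tuple

SortKey = Tuple[int, ...]  # sort attribute values, highest priority first


class SearchWord:
    """One hash table entry: a prefix with its data ID list and candidates."""

    def __init__(self, word: str, capacity: int) -> None:
        self.word = word
        self.capacity = capacity
        self.ids: List[Tuple[SortKey, int]] = []         # best entries first
        self.candidates: List[Tuple[SortKey, int]] = []  # overflow, best first

    def insert(self, page_id: int, sort_key: SortKey) -> None:
        self.ids.append((sort_key, page_id))
        self.ids.sort(reverse=True)
        if len(self.ids) > self.capacity:
            # The weakest entry moves to the candidate list.
            self.candidates.append(self.ids.pop())
            self.candidates.sort(reverse=True)


def build_suggestions(pages: List[Tuple[int, str, SortKey]],
                      capacity: int = 2) -> Dict[str, SearchWord]:
    """Divide every data value into prefixes of increasing length."""
    table: Dict[str, SearchWord] = {}
    for page_id, value, sort_key in pages:
        for length in range(1, len(value) + 1):
            prefix = value[:length]
            if prefix not in table:  # hash lookup failed: create the entry
                table[prefix] = SearchWord(prefix, capacity)
            table[prefix].insert(page_id, sort_key)
    return table


sample_pages = [(1, "red", (2, 5)), (2, "rose", (3, 1)), (3, "ring", (1, 9))]
suggestions = build_suggestions(sample_pages, capacity=2)
```

Typing "r" then returns the two best-ranked pages directly from `suggestions["r"].ids`, with the third held in the candidate list for replenishment.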
The detailed steps by which the web page intelligent processing module generates the information constraint sets, and optimizes them, comprise:
1. First parse the sample into a set of document object tree nodes:
$Spot\_U\{Spot_1, Spot_2, \dots, Spot_N\}$, where $Spot_N$ is a document object tree node.
Divide dimensions by field or type:
$Info\_dim(Dim_1, Dim_2, \dots, Dim_M)$, where $Dim_M$ denotes the M-th field of the information.
The information nodes corresponding to these dimensions are then expressed as the set
$U\_Info\{Spot_{X1}, Spot_{X2}, Spot_{X3}, \dots, Spot_{Xm}\}$, $Spot_{Xi} \in Spot\_U$.
The set $U\_Info$ is the final result node set of the information extraction.
2. Analyze the attributes of each node in the set $Spot\_U$ in terms of node distribution region, node presentation form and internal node organization rules, and partition the set into equivalence classes according to the differences in attributes.
3. Compute the constraint relation of each node in $U\_Info$ itself: record the value of each node in $U\_Info$ on the attribute defined by each partition, then determine, for each node in the node set corresponding to dimension $Info\_dim$, which of the sets from step 2 it appears in, obtaining the constraint set of the nodes of $U\_Info$.
4. Compute the constraint relations between dimensions: take any two nodes in $U\_Info$, choose an attribute of a node, and compute each binary distance relation defined on this attribute:
$|Dim(i)_{Attr} - Dim(j)_{Attr}| < \sigma$
where $i$ and $j$ are any two dimensions, $Attr$ is each attribute of a dimension, and $\sigma$ is a preset threshold that is adjusted automatically through training.
5. Perform the above steps on all samples. Two kinds of sets result: (1) the set of constraints on the range of values that a specific dimension of the information takes on a specific attribute, that is, the constraints on a node or dimension itself; (2) the set of constraints between dimensions, that is, the set of binary relations among multiple dimensions on specific node attributes.
6. Merge the equivalence relations of the nodes within a dimension on an attribute or value attribute. Steps 3 to 5 record, for all samples, the values of a specific dimension on a specific attribute, denoted
$Value\_Cnt\{(V_1, cnt_1), (V_2, cnt_2), \dots, (V_N, cnt_N)\}$, where $N$ is the number of distinct values.
The merging of equivalence relations is computed in two cases:
(1) For a discrete attribute value, statistical probability is used to compute the probability $P_{V_i}$ that a node of this dimension takes this value on this attribute:
[formula omitted in source]
where $i$ ranges over $[0, N]$.
(2) For a continuous attribute value, the value is taken to follow a normal distribution with expectation $\mu$ and standard deviation $\delta$, where:
$\mu = V_1 P_{V_1} + V_2 P_{V_2} + \dots + V_N P_{V_N}$
7. Analyze the relations that may exist between different dimensions. To merge comparison relations, treat greater than, less than, and equal to as a discrete numerical attribute with enumerated values. Obtain the probability values of the comparison relations according to step 4, merge the cases that have the same distribution across different sample pages, and remove the relations that differ. To merge distance relations, treat each distance value as a point value, i.e. as a continuous numerical attribute of the sample values. Obtain the probability values of the distance relations according to step 4, compute the range covered by the distance values, and determine the region of the distribution. The relations that exist between dimensions are determined from the perspective of all sample sets.
8. Check each sample against the equivalence-relation constraint set and the constraints on the relations between multiple dimensions on the same attribute.
(1) Assuming that the number of elements in each dimension's value set is 1, the resulting set is:
Result{Udi(Nij | j ∈ [1, m]) | i ∈ [1, n]}
If the result set in dimension d1 is Ud1(Nx1, …, Nxm), the result set in dimension d2 is Ud2(Ny1, …, Nyn), and d1 and d2 have the binary relation set UR(R1 ... Rn), take the combination of Ud1 and Ud2:
Ud12{(Nxi, Nyj) | i ∈ [1, m], j ∈ [1, n]}
Traverse these node pairs and define the set of node pairs that satisfy all binary relations in UR.
(2) Enumerate every choice of two dimensions from all dimensions, traverse these combinations, and repeat step (1) for each pair of dimensions;
(3) If the final result set contains only one element, it is determined that, on the partition set above, the equivalence relations and the binary relations between dimensions can correctly identify each dimension of the information; if there is more than one result, add more constraints.
9. If step 8 cannot yield the correct result, determine it by taking the maximum or minimum of a comparison sequence of values. For each node of the result set, obtain a comparable attribute and derive the actual value from the result set through a limited extreme-value ordering.
10. If the common extreme-value ordering set U_info obtained by computing over all samples is the empty set, the information Info_dim is considered recognizable on the partition set. If U_info is the empty set and the results obtained through the other two kinds of constraints exceed the actual results, the information Info_dim is considered unrecognizable on the partition set. In that case, refine the partition or add new partitions.
11. If the information Info_dim is recognizable on the partition set, output the three constraint sets above; if it is not recognizable, present all result sets obtained from the other two kinds of constraints, compare the result sets with the correct results by manual inspection to learn how they differ, and add this knowledge to the partition set and recompute.
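The value statistics of step 6 can be illustrated with a minimal sketch. The text does not reproduce the formula for the probability of a discrete value, so the relative frequency cnt_i divided by the total count is used here as an assumption; the standard deviation is likewise computed as the probability-weighted spread around μ, which is also an assumption. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def value_probabilities(values):
    """Estimate P_Vi for each distinct value as its relative frequency.

    Assumption: the source does not show the estimator, so cnt_i / total
    is used here as the plain 'statistical probability'.
    """
    counts = Counter(values)          # plays the role of Value_Cnt
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def normal_parameters(values):
    """Return (mu, std) for a continuous attribute.

    mu follows the text: mu = V1*P_V1 + ... + VN*P_VN.
    The standard deviation is an assumed probability-weighted spread.
    """
    probabilities = value_probabilities(values)
    mu = sum(v * p for v, p in probabilities.items())
    variance = sum(p * (v - mu) ** 2 for v, p in probabilities.items())
    return mu, sqrt(variance)
```

For example, an attribute observed as 10, 10, 20, 20 across four samples gives each value a probability of 0.5, an expectation of 15, and a standard deviation of 5.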
Through the above computation and analysis, a set of rule constraints related to the information retrieval dimensions is finally obtained; these constraint sets and the dimension information are configured into a template and used for information retrieval.
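How the binary distance relation of step 4 could be used to filter node pairs in step 8 is sketched below. This is an illustration under stated assumptions, not the method's actual implementation: nodes are represented as plain dictionaries of numeric attributes, and the helper names `within_distance` and `consistent_pairs` are hypothetical.

```python
from itertools import product


def within_distance(node_a, node_b, attr, sigma):
    """Binary distance relation |Dim(i)Attr - Dim(j)Attr| < sigma."""
    return abs(node_a[attr] - node_b[attr]) < sigma


def consistent_pairs(candidates_d1, candidates_d2, relations):
    """Keep the node pairs (Nxi, Nyj) that satisfy every relation in UR.

    `relations` is a list of two-argument predicates; the pairs are the
    Cartesian product of the candidate sets of two dimensions.
    """
    return [
        (x, y)
        for x, y in product(candidates_d1, candidates_d2)
        if all(relation(x, y) for relation in relations)
    ]
```

If exactly one pair survives for every pair of dimensions, the constraints identify the dimensions; if several survive, step 8 (3) calls for more constraints.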
On the basis of the constraint sets, the page to be parsed is processed and partitioned by node division, and suitable information nodes are then screened according to the constraint sets generated by training, which completes the extraction of the information.
The information set is generated first:
1. Parse the input page into a document tree;
2. Traverse all nodes of the document tree;
3. Take one node of the document tree;
4. Determine whether this node is a comment node; if so, return to step 3; otherwise, go to the next step;
5. Add this node to the information set;
6. Determine whether the document tree still has untraversed nodes; if so, return to step 3; otherwise go to the next step;
7. Output the resulting information set U(N1, N2 ... Nn) and, according to the predefined node classification rules, store each element in the subset to which it belongs; then classify and merge, combining the different feature values of the same node, to generate a lookup table with the element as key and the feature tuple as value.
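The seven steps above can be sketched with Python's standard `html.parser` module. This is a simplified illustration: it collects element nodes in document order rather than building a full tree, the feature tuple (tag, depth, class label, text) is an assumed choice, and `NodeCollector`, `build_lookup_table`, and the `classify` callback are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not kept open.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NodeCollector(HTMLParser):
    """Collect element nodes in document order; comment nodes are skipped
    because handle_comment is left at its default no-op behaviour."""

    def __init__(self):
        super().__init__()
        self.nodes = []    # the information set U(N1 ... Nn)
        self._open = []    # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs),
                "depth": len(self._open), "text": ""}
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self._open.append(node)

    def handle_endtag(self, tag):
        # Close up to the matching open element, if there is one.
        if any(node["tag"] == tag for node in self._open):
            while self._open.pop()["tag"] != tag:
                pass

    def handle_data(self, data):
        if self._open and data.strip():
            self._open[-1]["text"] += data.strip()


def build_lookup_table(nodes, classify):
    """Store each node in its subset and build element -> feature tuple."""
    subsets, table = {}, {}
    for index, node in enumerate(nodes):
        label = classify(node)
        subsets.setdefault(label, []).append(index)
        table[index] = (node["tag"], node["depth"], label, node["text"])
    return subsets, table
```

A caller would feed the page source to `NodeCollector.feed` and pass the collected `nodes` together with a classification rule to `build_lookup_table`.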
Then, for the candidate node set of each dimension, the nodes are first classified according to the constraint rules; the blocks obtained after classification are each sorted internally according to the specified ordering rule; and the Top-N elements of each block are then taken, according to the configured conditions, as the candidate result set. Specifically:
Read the ordering constraints of each dimension; classify and filter this dimension to select the node set that satisfies the ordering rule; store the node set in the ordering-constraint partition lookup table; determine whether any dimension remains unprocessed; if so, repeat the classify-and-filter step; otherwise output the resulting ordering-constraint partition lookup table.
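The classify, sort, and Top-N selection just described might look like the following sketch. The classification rule and the ordering rule are passed in as callbacks because the text leaves them to configuration; the function name and parameters are hypothetical.

```python
from collections import defaultdict


def top_n_candidates(nodes, block_key, sort_key, n, reverse=False):
    """Classify nodes into blocks, sort inside each block, keep the first n.

    block_key: classification rule mapping a node to its block label.
    sort_key:  ordering rule applied inside each block.
    """
    blocks = defaultdict(list)
    for node in nodes:
        blocks[block_key(node)].append(node)
    return {
        label: sorted(members, key=sort_key, reverse=reverse)[:n]
        for label, members in blocks.items()
    }
```

Running this once per dimension, with that dimension's ordering constraint, yields one entry of the ordering-constraint partition lookup table.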
During extraction, the candidate node set of a dimension is obtained from the resulting association-constraint lookup table. Determine whether the number of elements in the set is 1; if so, extract the relevant content of this node as required, i.e. remove the page markup and related style information, and save it into the information set with the dimension as the key and the node content as the value; then output the resulting information set, which completes the information retrieval and ends this process. Otherwise, write the page link, the dimension identifier, and the candidate node set to the error-handling log.
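A minimal sketch of this extraction step follows. It assumes each candidate is held as a raw HTML fragment and strips markup with a simple regular expression, which is a simplification that a real page parser would replace; the logger name, `strip_markup`, and `extract` are hypothetical.

```python
import logging
import re
from html import unescape

logger = logging.getLogger("extraction")
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(fragment):
    """Remove tags, decode entities, and collapse whitespace."""
    return " ".join(unescape(TAG_RE.sub(" ", fragment)).split())


def extract(page_url, candidates_by_dimension):
    """Keep a dimension only when exactly one candidate node remains."""
    info = {}
    for dimension, candidates in candidates_by_dimension.items():
        if len(candidates) == 1:
            info[dimension] = strip_markup(candidates[0])
        else:
            # Ambiguous or missing candidates go to the error log.
            logger.error("page=%s dimension=%s candidates=%r",
                         page_url, dimension, candidates)
    return info
```

Dimensions with zero or several candidates are left out of the result, so the error log shows which constraints need refinement.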
In summary, the present invention proposes a data retrieval method for a particular topic that overcomes the bottleneck of string-matching search, improves the accuracy of search results, and achieves intelligent, efficient search that can adapt to the needs of all kinds of businesses.
Obviously, those skilled in the art should understand that each module or step of the present invention described above can be implemented with a general-purpose computing system; they can be concentrated on a single computing system or distributed over a network formed by multiple computing systems; optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by a computing system. Thus, the present invention is not limited to any specific combination of hardware and software.
It should be understood that the above specific embodiments of the present invention are used only to illustrate or explain the principles of the present invention and do not limit it. Therefore, any modification, equivalent substitution, improvement, and the like made without departing from the spirit and scope of the present invention shall fall within its scope of protection. Furthermore, the appended claims of the present invention are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.
Claims (2)
1. A data retrieval method for a particular topic, characterized by comprising:
computing, from user evaluations of commodities and by means of score calculation, the weighted attributes corresponding to the commodities, and realizing semantic-level commodity search based on searching the weighted attributes.
2. The method according to claim 1, characterized in that the method further comprises:
automatically extracting the attribute values of commodities from user evaluations of the commodities, and searching the attribute values to obtain commodities of a certain type that fit a specific scenario, the automatic extraction comprising:
(1) structuring the commodity evaluations;
(2) segmenting into words the content of all comments by the same user on the same commodity, filtering out the predefined stop words after segmentation, then, for repeated words, keeping the one whose comment time is the latest, and finally obtaining the attribute values given by the same user to the same commodity;
(3) computing, according to step (2), the attribute values given by all users to the same commodity, and aggregating identical attribute values;
(4) obtaining, according to steps (2) and (3), the attribute values derived from the comments of all users on all commodities;
then classifying the attribute values; taking the obtained commodity types as the dimensions of the attributes; taking the attribute values whose repetition count exceeds a predefined threshold as the values within a dimension; computing the respective score weights through the mutual dependence between commodities and users, i.e. analyzing all evaluations to obtain the list of commodities in which all users are interested; and, from the commodity perspective, obtaining the attribute values from the evaluations and computing the commodity list under each attribute value from the attribute values;
defining a dimension set D; a dimension value set V; the list of evaluated commodities SU(p1, p2 ... pn); the list of users participating in evaluation UU(u1, u2 ... um); the dimension list of the commodities DU{d1, d2 ... dk}; the attribute value list VU{v1, v2 ... vo} for any dimension in DU; the attribute list SMU(pm1, pm2 ... pmx), whose values correspond to the elements of SU; the attribute classification list UMU(um1, um2 ... umy), whose values correspond to the elements of UU; and assuming that a certain dimension is A{a1, a2 ... an}, the user set is U{U1, U2 ... Um}, and the commodity set is P{P1, P2 ... Pk};
(1) the commodity score is computed jointly from the number of evaluating users and the weights of the evaluating users, the process being as follows:
where ax is a dimension value in dimension A; Product|ax denotes the score of the commodity for the dimension value ax in dimension A; Ui.A=ax covers all users who evaluated the product as ax in dimension A;
cntM is the sum over all attributes; sum denotes the total number of evaluations of the product by all users in dimension A; cntvx is the total number of evaluations by all users in which the dimension value of the product in dimension A is ax; cntvx/sum is the weight coefficient of all users for the dimension value ax of the product in dimension A; cntv is the number of evaluations by the user on this value of this dimension; θ is a weight-reduction factor determined by the latest time and the earliest time at which users evaluated the product in dimension A;
(2) the score of a user is computed from the scores of the corresponding commodity attributes: assume that the user's classification set for commodities is DV(DiVj | Di ∈ DU, Vj ∈ VU), and define pdv as the score of commodity p on the value v of dimension d, with pdv' = pdv/cntpdv, where cntpdv is the number of users who voted for the value v in dimension d on commodity p; the user score SP(Uu) is calculated as follows:
(3) building the weight equation system:
according to the above weight scores SP(product|ax) of the commodity attributes and SP(Uu) of the users, a system of linear equations in M+N*V unknowns is set up, where N is the total number of commodities, M is the total number of users, and V is the number of elements of the dimension value set in each dimension; the weight equation system is solved iteratively to obtain the weights of the attributes corresponding to each commodity and the weights of the users.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610630951.6A CN106168982A (en) | 2016-08-03 | 2016-08-03 | Data retrieval method for particular topic |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610630951.6A CN106168982A (en) | 2016-08-03 | 2016-08-03 | Data retrieval method for particular topic |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106168982A true CN106168982A (en) | 2016-11-30 |
Family
ID=58065635
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610630951.6A Pending CN106168982A (en) | 2016-08-03 | 2016-08-03 | Data retrieval method for particular topic |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106168982A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109948048A (en) * | 2019-01-28 | 2019-06-28 | 广州大学 | A kind of commercial articles searching, sequence, methods of exhibiting and system |
CN110569420A (en) * | 2019-08-22 | 2019-12-13 | 上海摩库数据技术有限公司 | Search method based on chemical industry |
Non-Patent Citations (1)
Title |
---|
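Yuan Fengyun (袁凤云): "Research and Implementation of Key Technologies for Vertical Search Engines" (垂直搜索引擎关键技术研究与实现), China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库》) * |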
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161130 |
RJ01 | Rejection of invention patent application after publication |