CN103324761B

CN103324761B - Method and system for forming a product database based on internet data

Info

Publication number: CN103324761B
Application number: CN201310292303.0A
Authority: CN
Inventors: 张丽
Original assignee: ZOOM COMMERCE TECHNOLOGY Co Ltd
Current assignee: ZOOM COMMERCE TECHNOLOGY Co Ltd
Filing date: 2013-07-11
Publication date: 2016-11-30
Anticipated expiration: 2033-07-11

Abstract

The invention discloses a method and a system for forming a product database based on internet data. The method is as follows: using focused (topic) crawler technology, crawl web page data whose topic relevance is higher than a preset threshold; store the crawled web page data in structured form; automatically classify the structured web page data by product category; count the number of occurrences and the time of occurrence of each product attribute in the classified web page data, weight the occurrence count and the time of occurrence according to preset weights to obtain a product attribute decision value, and determine the ordering of the product attributes according to that value. The system comprises a data capture module, a structured storage module, a data classification module and an attribute decision module. With this method and system, users can obtain comprehensive, integrated product information without collecting and collating it from the internet themselves; the data are also kept timely, meeting users' need for up-to-date information.

Description

Method and system for forming a product database based on internet data

Technical field

The present invention relates to the technical field of internet data processing, and in particular to a method and system for forming a product database based on internet data.

Background art

At present, the product catalogues of mainstream websites are built by applying a fixed product-publishing template for each industry to form the description of a product. Moreover, different websites adopt different standards for describing the same product. Because product-publishing formats are not unified while buyers' requirement standards vary widely, and because product description formats differ across the major websites, consolidating product information is very difficult, and a buyer cannot readily obtain comprehensive information on the products that meet a given requirement standard. Selecting products against requirement standards, especially when purchasing many product types in bulk, usually means reading a massive number of web pages, which is inefficient.

In summary, the related art lacks a unified product description standard, which makes product information difficult to collate.

Summary of the invention

The object of the present invention is to provide a method and system for forming a product database based on internet data, so as to solve the above problem.

An embodiment of the present invention provides a method for forming a product database based on internet data, comprising the steps of:

Step A: using focused crawler technology, crawl web page data whose topic relevance is higher than a preset threshold, wherein the topic relevance is calculated by a content evaluation module and a link evaluation module;

Step B: store the crawled web page data in structured form;

Step C: automatically classify the structured web page data by product category;

Step D: count the number of occurrences and the time of occurrence of each product attribute in the classified web page data, weight the occurrence count and the time of occurrence according to preset weights to obtain a product attribute decision value, and determine the ordering of the product attributes according to that value;

Wherein the number of occurrences of a product attribute is denoted F, the time of occurrence of the product attribute is denoted T, and the weight of the data source is denoted W; the product attribute decision value is obtained by the formula (F+T) * W.

Wherein step A comprises the steps of:

Analyze the web page data after content feature extraction and determine whether the relevance between the page content and the specified topic reaches the preset threshold; if so, retain the page, and if not, filter it out; and/or, compute on the hyperlink information extracted from the pages to obtain the relevance between the page each URL points to and the specified topic, and retain the pages whose relevance reaches the preset threshold;

Add the URLs of the retained pages to the crawl queue and sort them by topic relevance;

According to the URLs in the crawl queue, establish a network connection and download the content of the pages they point to.

Wherein step B comprises the step of:

Analyze the page tags of the crawled web page data; for each product page, obtain the product entity information through entity tags and form a record, obtain the corresponding product attribute information and attribute values through attribute tags, and store them in structured form.

Wherein step C comprises the steps of:

Extract the text information from the web page data, determine the feature item set to be used for automatic classification, re-describe the training text vectors according to that feature item set, and determine the training text set;

When a current text arrives, analyze it according to the feature words in the feature item set and determine its vector representation;

Select from the training text set the K texts most similar to the current text, using the formula:

\mathrm{sim}(\vec{d}_i, \vec{d}_j) = \frac{\sum_{k=1}^{M} W_{ik} \times W_{jk}}{\sqrt{\left(\sum_{k=1}^{M} W_{ik}^{2}\right)\left(\sum_{k=1}^{M} W_{jk}^{2}\right)}}

Here W_i denotes the feature vector of the i-th document, W_j denotes the feature vector of the j-th document, M is the dimension of the feature vectors, sim(d_i, d_j) is the similarity between the i-th and j-th documents, and k indexes the k-th dimension of the text vector;

Among the K texts most similar to the current text, compute the weight of each class in turn, using the formula:

p(\vec{x}, C_j) = \sum_{\vec{d}_i \in KNN} \mathrm{sim}(\vec{x}, \vec{d}_i) \, y(\vec{d}_i, C_j)

Here x is the point to be classified, C_j is a known class, d_i is one of the K nearest neighbours of x, sim(x, d_i) is the similarity between vector x and vector d_i, and y(d_i, C_j) is the category attribute function;

According to the weights obtained, the similarity between the current text and the K texts is evaluated, and the category of the current text is determined from that similarity.

Wherein step C comprises the steps of:

Establish the category vector space in advance from the training samples and the classification system;

When classifying a sample, compute the similarity between the sample and each category vector, then select the category with the greatest similarity as the category of that sample.

Wherein step C comprises the step of:

Automatically classify the web page data using the SVM algorithm and/or the Bayes algorithm.

Wherein, after step D, the method further comprises the step of:

According to a product attribute keyword entered by the user, retrieve the matching product information and display it as a list ordered by product attribute decision value.

The embodiment of the present invention also provides for a kind of based on internet data formation product database system, including data grabber mould Block, structured storage module, data categorization module and attribute decision-making module；

Described data capture module, is used for using Theme Crawler of Content technology, captures with degree of subject relativity higher than predetermined threshold value Web data, wherein, described degree of subject relativity is calculated by content Controlling UEP and link Controlling UEP；

Described structured storage module, for carrying out structured storage by the described web data captured；

Described data categorization module, for carrying out certainly according to product generic the web data of described structured storage Dynamic classification；

Described attribute decision-making module, for adding up in automatic sorted web data the occurrence number of product attribute and going out Between Xian Shi, according to default weight, product attribute occurrence number and time of occurrence are weighted, obtain product attribute certainly According to described product attribute decision value, plan value, determines that product attribute puts in order；

Wherein the data capture module is configured to:

Wherein the structured storage module is configured to:

The method and system for forming a product database based on internet data according to the above embodiments proceed through several steps: data capture, structured storage, automatic classification and attribute decision value calculation. The product information in massive web page data is stored in structured form and then classified, and each attribute of a product is then scored to obtain the order in which the product's attributes are displayed. In this way, product descriptions that were previously inconsistent are consolidated. When a user wants specific information about a product, the relevant data can be retrieved by product attribute without reading a massive number of web pages, so the user can obtain comprehensive, integrated information without collecting and collating product information from the internet. In addition, because the product attribute decision value is computed by weighting both the number of occurrences and the time of occurrence of each attribute, the data are kept timely, meeting most users' need for up-to-date information.

Brief description of the drawings

Fig. 1 is a flow chart of an embodiment of the method for forming a product database based on internet data according to the present invention;

Fig. 2 is a schematic diagram of the principle of the SVM algorithm used in an embodiment of the method for forming a product database based on internet data according to the present invention;

Fig. 3 is a structural diagram of an embodiment of the system for forming a product database based on internet data according to the present invention.

Detailed description of the invention

The present invention is described in further detail below through specific embodiments and with reference to the accompanying drawings.

An embodiment of the present invention provides a method for forming a product database based on internet data which, as shown in Fig. 1, comprises the steps of:

Step S110: using focused crawler technology, crawl web page data whose topic relevance is higher than a preset threshold.

This embodiment uses focused crawler technology, relying on a topic-based search engine to gather information. Such an engine typically consists of functional modules such as a crawl queue, a network connector, a topic model, a content evaluation module and a link evaluation module.

The crawl queue consists of a series of URLs (Uniform Resource Locators, i.e. web page addresses) with high topic relevance. Unless otherwise stated, URL in the present invention always refers to a web page address.

When the topic search engine first starts a topic search, the crawl queue consists of seed sites; these seed sites may be provided by experts in the relevant industry or generated automatically from authoritative websites.

After the search starts, the system discovers new URLs, ranks them by topic relevance and adds them to the crawl queue. The network connector then, according to the URLs in the crawl queue, establishes a connection to the network and downloads the content of the pages they point to.

The topic model is implemented by a topic modeling method; the keyword method is a commonly used one. The keyword method represents topic content, including the user's requested topic and document content, by a set of feature keywords. A topic keyword may be a single word or a phrase and carries attributes such as weight and language; a commonly used relevance algorithm is word frequency.

Topic relevance may be calculated by the content evaluation module and the link evaluation module.

In the content evaluation module, the system analyzes the web page data after content feature extraction, determines how relevant the page content is to the specified topic, filters out irrelevant pages, and retains the pages whose relevance reaches the threshold.

In the link evaluation module, the system computes on the hyperlink information extracted from the pages to obtain the relevance between the page each URL points to and the specified topic, adds the URLs that meet the relevance requirement to the crawl queue, and orders them by crawl priority so that highly relevant pages are retrieved first.

The preset threshold is the quantified relevance cutoff used to decide, from the relevance between the data on a page and the topic, whether to retain that page's data. It can be set by those skilled in the art according to the actual situation and is not enumerated here. If relevance is expressed on a hundred-point scale, the preset threshold may be 60 to 100.
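The crawl loop of step S110 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the relevance score (weighted share of topic keywords found in the text, on a hundred-point scale), the function names `relevance` and `focused_crawl`, and the in-memory `pages` dictionary standing in for the network connector are all hypothetical.

```python
import heapq

def relevance(text, topic_keywords):
    """Hypothetical hundred-point score: weighted share of topic keywords present in the text."""
    words = text.lower().split()
    total = sum(topic_keywords.values())
    if not words or not total:
        return 0.0
    hit = sum(weight for kw, weight in topic_keywords.items() if kw in words)
    return 100.0 * hit / total

def focused_crawl(seeds, pages, topic_keywords, threshold=60.0):
    """pages maps url -> (page_text, [(anchor_text, target_url), ...]);
    it stands in for the network connector so the sketch runs offline."""
    # Crawl queue ordered by topic relevance (scores negated for a max-heap).
    queue = [(-100.0, url) for url in seeds]
    heapq.heapify(queue)
    seen = set(seeds)
    kept = []
    while queue:
        _, url = heapq.heappop(queue)
        if url not in pages:
            continue
        text, links = pages[url]
        # Content evaluation: drop pages below the preset threshold.
        if relevance(text, topic_keywords) < threshold:
            continue
        kept.append(url)
        # Link evaluation: score each hyperlink (here by its anchor text)
        # and enqueue only those that reach the threshold.
        for anchor, target in links:
            score = relevance(anchor, topic_keywords)
            if target not in seen and score >= threshold:
                seen.add(target)
                heapq.heappush(queue, (-score, target))
    return kept
```

Using a heap keeps the most relevant URL at the front of the crawl queue, which matches the requirement that highly relevant pages be retrieved first.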

Step S111: store the crawled web page data in structured form.

In this embodiment, the page tags of the crawled data are analyzed to form a tag library, and the crawled web page data are stored in structured form.

For each product page, the product entity is obtained through entity tags and a record is formed; the corresponding product attributes and attribute values are obtained through attribute tags and stored in structured form.
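The tag-driven extraction of step S111 might look like the sketch below. The patent does not say which tags mark entities and attributes, so the page layout is an assumption made only for illustration: the product entity sits in `<h1 class="product-name">` and attributes are `<dt>`/`<dd>` pairs. The class `ProductPageParser` and the function `parse_product_page` are hypothetical names.

```python
from html.parser import HTMLParser

class ProductPageParser(HTMLParser):
    """Builds one structured record per product page (assumed tag layout)."""

    def __init__(self):
        super().__init__()
        self.record = {"entity": None, "attributes": {}}
        self._field = None         # which kind of tag we are currently inside
        self._pending_name = None  # attribute name waiting for its value

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and dict(attrs).get("class") == "product-name":
            self._field = "entity"       # assumed entity tag
        elif tag == "dt":
            self._field = "attr_name"    # assumed attribute-name tag
        elif tag == "dd":
            self._field = "attr_value"   # assumed attribute-value tag

    def handle_endtag(self, tag):
        if tag in ("h1", "dt", "dd"):
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "entity":
            self.record["entity"] = text
        elif self._field == "attr_name":
            self._pending_name = text
        elif self._field == "attr_value" and self._pending_name is not None:
            self.record["attributes"][self._pending_name] = text
            self._pending_name = None

def parse_product_page(html):
    parser = ProductPageParser()
    parser.feed(html)
    parser.close()
    return parser.record
```

The resulting dictionary (one entity plus an attribute-to-value mapping) is the kind of record that could then be written to whatever structured store is used.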

Step S112: automatically classify the structured web page data by product category.

There are several ways to perform automatic classification; some embodiments are listed below:

One method is based on the following classification rule:

In making the class decision, the method determines the category of the sample to be classified solely from the category of the nearest sample or the few nearest samples.

The specific algorithm steps are as follows:

Re-describe the training text vectors according to the feature item set;

When a current text arrives, segment it according to the feature words and determine its vector representation;

Select from the training text set the K texts most similar to the current text, using the formula:

\mathrm{sim}(\vec{d}_i, \vec{d}_j) = \frac{\sum_{k=1}^{M} W_{ik} \times W_{jk}}{\sqrt{\left(\sum_{k=1}^{M} W_{ik}^{2}\right)\left(\sum_{k=1}^{M} W_{jk}^{2}\right)}}

Here W_i denotes the feature vector of the i-th document, W_j denotes the feature vector of the j-th document, M is the dimension of the feature vectors, sim(d_i, d_j) is the similarity between the i-th and j-th documents, and k indexes the k-th dimension of the vector;

Among the K neighbours of the current text, compute the weight of each class in turn, using the formula:

p(\vec{x}, C_j) = \sum_{\vec{d}_i \in KNN} \mathrm{sim}(\vec{x}, \vec{d}_i) \, y(\vec{d}_i, C_j)

Here x is the point to be classified, C_j is a known class, d_i is one of the K nearest neighbours of x, sim(x, d_i) is the similarity between vector x and vector d_i, and y(d_i, C_j) is the category attribute function: if d_i belongs to class C_j its value is 1, otherwise 0.

Then, according to the weights obtained, the similarity between the current text and the K texts is evaluated, and the category of the current text is determined from that similarity.
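The nearest-neighbour procedure above can be sketched in a few lines. This is a minimal illustration that assumes documents have already been turned into equal-length feature vectors; the names `cosine` and `knn_classify` and the default `k=3` are assumptions, not part of the patent.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """sim(d_i, d_j): inner product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(x, training, k=3):
    """training is a list of (feature_vector, class_label) pairs."""
    # Select the K training texts most similar to the current text.
    neighbours = sorted(training, key=lambda item: cosine(x, item[0]), reverse=True)[:k]
    # p(x, C_j): sum of sim(x, d_i) over neighbours d_i; y(d_i, C_j) is 1
    # only for the neighbour's own class, so each neighbour adds to one class.
    weights = defaultdict(float)
    for vec, label in neighbours:
        weights[label] += cosine(x, vec)
    # The class with the largest accumulated weight wins.
    return max(weights, key=weights.get)
```

For example, with two "phone" vectors close to the query and one distant "laptop" vector among the three neighbours, the accumulated weight for "phone" dominates and that label is returned.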

Another approach represents a document as a weighted feature vector, D = D(T1, W1; T2, W2; …; Tn, Wn), and then determines the category of the sample to be classified by computing text similarity. When texts are represented in the vector space model, the similarity between texts can be expressed as the inner product of their feature vectors.

This approach generally builds the category vector space in advance from the training samples in the corpus and the classification system. When a sample needs to be classified, it is only necessary to compute the similarity, i.e. the inner product, between the sample and each category vector, and then select the category with the greatest similarity as the category of that sample.
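A minimal sketch of this category-vector approach follows. The patent does not state how a category vector is derived from the training samples; taking the mean of each category's training vectors is an assumption made here for illustration, as are the function names.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def build_category_vectors(samples):
    """samples maps category label -> list of training feature vectors.
    Assumption: each category vector is the mean of its training vectors."""
    return {label: centroid(vecs) for label, vecs in samples.items()}

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def classify(x, category_vectors):
    """Pick the category whose vector has the largest inner product with x."""
    return max(category_vectors, key=lambda label: inner_product(x, category_vectors[label]))
```

Because the category vectors are built once in advance, classifying a new sample costs one inner product per category rather than one similarity per training text.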

In addition, the SVM algorithm and/or the Bayes algorithm may be used to classify the web page data automatically.

The SVM algorithm, illustrated in Fig. 2, was developed from the optimal separating surface in the linearly separable case. The basic idea is shown in the figure: dividing line 1 and dividing line 2 both separate the two classes of samples correctly, and there are infinitely many such dividing lines; dividing line 1, however, maximizes the gap between the two classes and is called the optimal separating line (in higher dimensions, the optimal separating surface or optimal hyperplane).
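The idea of preferring the dividing line with the largest gap can be illustrated by computing the geometric margin of candidate lines. The sketch below is not an SVM trainer; it only compares given lines of the form w·x + b = 0, and the function name, sample points and the two candidate lines are made up for illustration.

```python
import math

def geometric_margin(w, b, samples):
    """Smallest signed distance from any sample to the line w.x + b = 0.
    samples is a list of (point, label) with label +1 or -1; the result is
    negative if the line misclassifies at least one sample."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        label * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, label in samples
    )

# Two linearly separable classes on either side of the vertical strip 0 < x < 2.
samples = [((0, 0), -1), ((0, 1), -1), ((2, 0), 1), ((2, 1), 1)]

line_1 = ((1, 0), -1.0)   # the line x = 1, midway between the classes
line_2 = ((1, 0), -0.5)   # the line x = 0.5, closer to the negative class

margin_1 = geometric_margin(*line_1, samples)
margin_2 = geometric_margin(*line_2, samples)
```

Both lines separate the samples, but `line_1` leaves the larger margin, which is the property that singles out the optimal separating line.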

The Bayes algorithm is a pattern classification method for the case where the prior probabilities and class-conditional probabilities are known; the classification result for a sample depends on all the samples in each class domain.

Suppose the training sample set is divided into M classes, denoted C = {c1, ..., ci, ..., cM}, and the prior probability of each class is P(ci), i = 1, 2, ..., M. When the sample set is very large, one may take P(ci) = (number of samples in class ci) / (total number of samples). For a sample x to be classified, its class-conditional probability for class ci is P(x/ci); then, by Bayes' theorem, the posterior probability P(ci/x) of class ci is:

P(ci/x) = P(x/ci) P(ci) / P(x) (formula 1-1)

If P(ci/x) = Max_j P(cj/x), i = 1, 2, ..., M, j = 1, 2, ..., M, then x ∈ ci (formula 1-2)

Formula (1-2) is the maximum a posteriori decision rule. Substituting formula (1-1) into formula (1-2) gives:

If P(x/ci) P(ci) = Max_j [P(x/cj) P(cj)], i = 1, 2, ..., M, j = 1, 2, ..., M, then x ∈ ci.
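The decision rule can be sketched directly: estimate each prior from class sample counts and pick the class that maximizes P(x/ci) P(ci). The function name `map_classify` and the example numbers are illustrative assumptions; how the class-conditional probabilities are estimated is outside this sketch.

```python
def map_classify(likelihoods, class_counts):
    """likelihoods maps class -> P(x/ci) for the sample x being classified;
    class_counts maps class -> number of training samples in that class."""
    total = sum(class_counts.values())
    # Prior P(ci) = samples in class ci / total samples.
    priors = {c: n / total for c, n in class_counts.items()}
    # P(x) is the same for every class, so comparing P(x/ci) * P(ci) suffices.
    scores = {c: likelihoods[c] * priors[c] for c in class_counts}
    best = max(scores, key=scores.get)
    return best, scores

label, scores = map_classify(
    likelihoods={"phone": 0.02, "laptop": 0.05},
    class_counts={"phone": 60, "laptop": 40},
)
```

Here "laptop" is chosen even though "phone" has the larger prior, because its class-conditional probability is high enough to outweigh it.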

Step S113: count the number of occurrences and the time of occurrence of each product attribute in the classified web page data, weight the occurrence count and the time of occurrence according to preset weights to obtain a product attribute decision value, and determine the ordering of the product attributes according to that value.

The attribute decision involves two parameters, the number of occurrences of the attribute (F) and the time of occurrence of the attribute (T), together with the weight of the data source (W); the attribute decision value is obtained by the formula (F+T) × W. Attributes are selected and ordered according to this value.

The weight of the attribute's time of occurrence and the weight of its occurrence count can each be set according to the actual situation; in general, the older the data source, the smaller the weight of the time of occurrence for that data.
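The ranking in step S113 can be sketched as below. The patent does not specify how the time of occurrence T is converted into a number, so the sketch assumes it has already been mapped to a recency score in which newer data scores higher; the class `AttributeStats`, its field names and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AttributeStats:
    name: str
    occurrences: int       # F: how often the attribute appears
    recency_score: float   # T: assumed numeric score, newer data scores higher
    source_weight: float   # W: weight of the data source

def decision_value(stats):
    """Product attribute decision value (F + T) * W."""
    return (stats.occurrences + stats.recency_score) * stats.source_weight

def rank_attributes(all_stats):
    """Order attributes from highest to lowest decision value."""
    return sorted(all_stats, key=decision_value, reverse=True)

ranked = rank_attributes([
    AttributeStats("color", occurrences=10, recency_score=2.0, source_weight=0.5),
    AttributeStats("material", occurrences=6, recency_score=8.0, source_weight=1.0),
    AttributeStats("length", occurrences=3, recency_score=1.0, source_weight=1.0),
])
```

In this example "material" ranks first: it appears less often than "color", but it comes from more recent data and a more heavily weighted source.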

An embodiment of the present invention also provides a system for forming a product database based on internet data which, as shown in Fig. 3, comprises a data capture module 1, a structured storage module 2, a data classification module 3 and an attribute decision module 4.

The data capture module 1 is configured to use focused crawler technology to crawl web page data whose topic relevance is higher than a preset threshold.

The structured storage module 2 is configured to store the crawled web page data in structured form.

The data classification module 3 is configured to automatically classify the structured web page data by product category.

The attribute decision module 4 is configured to count the number of occurrences and the time of occurrence of each product attribute in the classified web page data, weight the occurrence count and the time of occurrence according to preset weights to obtain a product attribute decision value, and determine the ordering of the product attributes according to that value.

The database system should also be provided with a retriever and a management platform.

The retriever provides users with a query interface, searches the index database according to the query the user submits, ranks the results by relevance, and returns the page links and related information to the user.

The management platform is responsible for monitoring and managing the whole system. Its main functions are determining the topic, initializing the crawler, controlling the crawling process, coordinating and optimizing functions across modules, and user interaction. For a complete search engine, the management platform should also provide a cross-platform network service application interface.

In one embodiment, the data capture module 1 is configured to: analyze the web page data after content feature extraction and determine whether the relevance between the page content and the specified topic reaches the preset threshold; if so, retain the page, and if not, filter it out; and/or, compute on the hyperlink information extracted from the pages to obtain the relevance between the page each URL points to and the specified topic, and retain the pages whose relevance reaches the preset threshold; add the URLs of the retained pages to the crawl queue and sort them by topic relevance; and, according to the URLs in the crawl queue, establish a network connection and download the content of the pages they point to.

Preferably, in one embodiment, the structured storage module 2 is configured to: analyze the page tags of the crawled web page data; for each product page, obtain the product entity information through entity tags and form a record, obtain the corresponding product attribute information and attribute values through attribute tags, and store them in structured form.

In summary, the method and system provided by the embodiments of the present invention mainly use web crawler technology to crawl massive numbers of web pages, chiefly general e-commerce websites, vertical e-commerce websites, manufacturer websites and purchaser websites, and extract up-to-date, valid product data and related data. The crawled data are then stored in structured form using structured data storage technology to establish an e-commerce data source. Data classification technology is next applied to classify the crawled data: learning sample data are established for each category, and automatic classification is achieved through intelligent techniques such as corpus processing, named entity recognition, semantic understanding and sample optimization, supplemented by manual correction. Finally, a comprehensive attribute evaluation system analyzes the frequency and time of attribute occurrences, combined with an analysis of users' input habits, to form the attribute ordering rules for each category and generate a description standard for each category.

In this way, through the combined use of the above technologies, a unified standard for product descriptions in each industry is formed. By collecting purchasers' standards, a product description standard tailored to a particular purchaser can be formed, and product descriptions can be converted between multiple standards so that different purchasers can view them. The system can also be connected to purchasing systems, automatically initializing order content through an interface, which greatly improves processing efficiency.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for forming a product database based on internet data, characterized by comprising the steps of:

Step A: using focused crawler technology, crawl web page data whose topic relevance is higher than a preset threshold, wherein the topic relevance is calculated by a content evaluation module and a link evaluation module;

Step B: store the crawled web page data in structured form;

Step D: count the number of occurrences and the time of occurrence of each product attribute in the classified web page data, weight the occurrence count and the time of occurrence according to preset weights to obtain a product attribute decision value, and determine the ordering of the product attributes according to that value;

Wherein the number of occurrences of a product attribute is denoted F, the time of occurrence of the product attribute is denoted T, and the weight of the data source is denoted W; the product attribute decision value is obtained by the formula (F+T) * W.

The method for forming a product database based on internet data according to claim 1, characterized in that step A comprises the steps of:

Analyze the web page data after content feature extraction and determine whether the relevance between the page content and the specified topic reaches the preset threshold; if so, retain the page, and if not, filter it out; and/or, compute on the hyperlink information extracted from the pages to obtain the relevance between the page each URL points to and the specified topic, and retain the pages whose relevance reaches the preset threshold;

The method for forming a product database based on internet data according to claim 1, characterized in that step B comprises the step of:

Analyze the page tags of the crawled web page data; for each product page, obtain the product entity information through entity tags and form a record, obtain the corresponding product attribute information and attribute values through attribute tags, and store them in structured form.

The method for forming a product database based on internet data according to claim 1, characterized in that step C comprises the steps of:

Extract the text information from the web page data, determine the feature item set to be used for automatic classification, re-describe the training text vectors according to that feature item set, and determine the training text set;

When a current text arrives, analyze it according to the feature words in the feature item set and determine its vector representation;

Here W_i denotes the feature vector of the i-th document, W_j denotes the feature vector of the j-th document, M is the dimension of the feature vectors, sim(d_i, d_j) is the similarity between the i-th and j-th documents, k indexes the k-th dimension of the text vector, \vec{d}_i is the i-th document vector, and \vec{d}_j is the j-th document vector;

Here x is the point to be classified, C_j is a known class, d_i is one of the K nearest neighbours of x, sim(\vec{x}, \vec{d}_i) is the similarity between vector \vec{x} and vector \vec{d}_i, y(\vec{d}_i, C_j) is the category attribute function, \vec{x} is the vector of the point, KNN denotes the K-nearest-neighbour algorithm, and p(\vec{x}, C_j) is the weight of any one text among the K neighbours of the current text;

According to the weights obtained, the similarity between the current text and the K texts is evaluated, and the category of the current text is determined from that similarity.

When classifying a sample, compute the similarity between the sample and each category vector, then select the category with the greatest similarity as the category of that sample.

The method for forming a product database based on internet data according to claim 1, characterized in that, after step D, the method further comprises the step of:

According to a product attribute keyword entered by the user, retrieve the matching product information and display it as a list ordered by product attribute decision value.

8. A system for forming a product database based on internet data, characterized by comprising a data capture module, a structured storage module, a data classification module and an attribute decision module;

The data capture module is configured to use focused crawler technology to crawl web page data whose topic relevance is higher than a preset threshold, wherein the topic relevance is calculated by a content evaluation module and a link evaluation module;

The data classification module is configured to automatically classify the structured web page data by product category;

The attribute decision module is configured to count the number of occurrences and the time of occurrence of each product attribute in the classified web page data, weight the occurrence count and the time of occurrence according to preset weights to obtain a product attribute decision value, and determine the ordering of the product attributes according to that value;

The system for forming a product database based on internet data according to claim 8, characterized in that the data capture module is configured to:

The system for forming a product database based on internet data according to claim 8, characterized in that the structured storage module is configured to: