CN104063453A - Method for extracting key words of marketing based on URL (uniform resource locator) analysis - Google Patents


Info

Publication number
CN104063453A
Authority
CN
China
Prior art keywords
url
uniform resource
resource locator
website
marketing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410285743.8A
Other languages
Chinese (zh)
Inventor
汤奇峰
刘作涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Original Assignee
ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Priority to CN201410285743.8A
Publication of CN104063453A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for extracting marketing keywords based on URL (uniform resource locator) analysis. The method comprises the following steps: (1) presetting a database, wherein the database contains a plurality of structured texts and an established correspondence between a plurality of website URL structures and the structured texts in the database, and the structured texts comprise at least marketing keywords; (2) analyzing at least one website URL and capturing at least the website name and the path of the website URL; (3) looking up in the database index, according to the website name and the path of the website URL, whether a matching structured text exists, and if so, performing step (4); (4) obtaining the structured text that matches the website URL. With this method, a large number of URLs can be analyzed quickly, and the corresponding marketing keywords can be extracted and stored.

Description

A method for extracting marketing keywords based on URL analysis
Technical field
The present invention relates to the field of network technology, and in particular to a method for extracting marketing keywords based on URL analysis.
Background
A URL (Uniform Resource Locator) is a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the standard address of a resource on the Internet. Each file on the Internet has a unique URL, which indicates where the file is located and how a browser should handle it.
For most advertising and marketing campaigns, identifying potential target users among a vast population is very difficult. To target users accurately, it is necessary to capture each user's historical behavior and to extract the user's interests from it, especially keyword information relevant to marketing campaigns.
For example, a user may be interested in cars priced at 80,000 to 100,000 yuan, or in economy hotels in the PVG area. Here "cars priced at 80,000 to 100,000 yuan" and "economy hotels in the PVG area" are two different marketing keywords.
The present invention analyzes collected user information and behavior records to obtain a list of the marketing keywords each user is interested in. During advertising, the invention can then deliver an advertisement only to the people interested in particular keywords, achieving precise targeting. Compared with traditional ad delivery, precise targeting reaches more potential customers at lower cost and thereby creates value for advertisers. At the same time, because users are shown only advertisements they are likely to be interested in, precise targeting also improves the user experience and reduces the disturbance caused by irrelevant advertisements.
The most suitable data for analyzing marketing keywords are users' browsing records, especially their historical behavior on vertical industry websites such as Ctrip, SouFun, Taobao, and Autohome. An Internet advertising service provider can collect users' browsing records by deploying JS code on each partner website.
However, the URLs that users visit take many forms and follow no unified standard, which makes extracting marketing keywords very difficult.
Examples include a Taobao navigation page for Huawei mobile phones, or an Autohome navigation page for the Audi A4L. Neither URL states explicitly what information it contains, so extracting marketing keyword information requires in-depth analysis and mining of the URL.
A common approach is to crawl the HTML text corresponding to a URL and then parse the crawled HTML to obtain the required text. This approach requires a crawler to fetch a large number of URLs. Many URLs contain user authentication information, and many websites block unrestricted crawling, so the crawler approach is both inefficient and prone to a high failure rate. In addition, because HTML pages are complex, extracting marketing keywords from the crawled HTML text is itself a very difficult task.
To address this, the present invention proposes a method and system that automatically extract the marketing keywords a user is interested in from the user's URL access history, for use in the precise targeting of an ad delivery system.
Summary of the invention
The present invention provides a method for extracting marketing keywords based on URL analysis that overcomes the difficulties of the prior art. With this method, a large number of URLs can be analyzed quickly, and the corresponding marketing keywords can be extracted and stored.
The present invention adopts the following technical solution:
The present invention provides a method for extracting marketing keywords based on URL analysis, comprising:
(1) presetting a database, the database containing a plurality of structured texts and an established correspondence between a plurality of website URL structures and the structured texts in the database, the structured texts comprising at least marketing keywords;
(2) analyzing at least one website URL and capturing at least the website name and the path of the website URL;
(3) looking up in the database index, according to the website name and the path of the website URL, whether a matching structured text exists, and if so, performing step (4); and
(4) obtaining the structured text that matches the website URL.
Preferably, in step (2), the website name and the path of the website URL are captured by a website URL parser.
Preferably, a tree-shaped index of website URL structures is pre-stored in the website URL parser in step (2).
Preferably, in step (2), the website, the subdomain name, the URL path, and the URL parameter list of the website URL are extracted.
Preferably, step (3) comprises:
(31) checking whether the website name of the website URL is in the index, and if so, performing step (32); and
(32) checking whether the path of the website URL is in the index, and if so, performing step (4).
Preferably, the website URL in step (2) is one or more website URLs in a user's access history.
Preferably, the database is a Key-Value database.
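As an illustration only, steps (1) to (4) and the two-stage check of steps (31) and (32) might be sketched as follows. The dictionary layout, the use of the first path segment as the matched structure, and the names `DATABASE`, `SITE_INDEX`, and `lookup` are assumptions made for this sketch and are not specified by the text.

```python
from urllib.parse import urlsplit

# Step (1): a preset "database". The entries and the key layout
# (site name, first path segment) are illustrative assumptions.
DATABASE = {
    ("www.autohome.com.cn", "692"): {"industry": "automobile", "keyword": "Audi A4L"},
}
SITE_INDEX = {site for site, _ in DATABASE}


def capture_site_and_path(url):
    """Step (2): capture the website name and the path of a URL."""
    parts = urlsplit(url)
    return parts.hostname or "", parts.path


def lookup(url):
    """Steps (3) and (4): return the matching structured text, or None."""
    site, path = capture_site_and_path(url)
    if site not in SITE_INDEX:  # step (31): is the website indexed?
        return None
    segments = [s for s in path.split("/") if s]
    if not segments:
        return None
    # Step (32) checks the path; a hit yields the structured text (step 4).
    return DATABASE.get((site, segments[0]))


history = [
    "http://www.autohome.com.cn/692/#pvareaid=103177",
    "http://example.com/unknown/page",
]
keywords = [entry["keyword"] for entry in map(lookup, history) if entry]
print(keywords)
```

URLs whose website or path is absent from the index are simply skipped, which mirrors the "if so" conditions in steps (3), (31), and (32).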
The method of the present invention for extracting marketing keywords based on URL analysis is aimed at precise user targeting in the field of ad delivery, and provides an efficient, general-purpose way of extracting marketing keywords from URLs. The method has the following advantages:
(1) the method focuses on automatically analyzing the structure of the URL itself and does not require large-scale crawling of URLs, so it consumes fewer system resources;
(2) because the HTML page corresponding to a URL may have expired or become invalid, crawling a URL fails with some probability; compared with crawling and parsing HTML pages, the method therefore has a higher success rate;
(3) because large search engines such as Baidu demote URLs that change frequently, the URL structures of most websites remain unchanged for long periods, so the URL-analysis-based method proposed by the invention is stable;
(4) the invention builds an efficient knowledge base index and a URL parser for the URL analysis process, which gives the method high execution efficiency.
The present invention is further described below with reference to the drawing and the embodiments.
Brief description of the drawing
Fig. 1 is a flowchart of the method of the present invention for extracting marketing keywords based on URL analysis.
Detailed description
A specific embodiment of the present invention is described below with reference to Fig. 1.
As shown in Fig. 1, the method of the present invention for extracting marketing keywords based on URL analysis comprises the following steps:
(1) presetting a database, the database containing a plurality of structured texts and an established correspondence between a plurality of website URL structures and the structured texts in the database, the structured texts comprising at least marketing keywords;
(2) analyzing at least one website URL and capturing at least the website name and the path of the website URL;
(3) looking up in the database index, according to the website name and the path of the website URL, whether a matching structured text exists, and if so, performing step (4); and
(4) obtaining the structured text that matches the website URL.
In step (2), the website name and the path of the website URL are captured by a website URL parser.
A tree-shaped index of website URL structures is pre-stored in the website URL parser in step (2).
In step (2), the website, the subdomain name, the URL path, and the URL parameter list of the website URL are extracted.
Step (3) comprises:
(31) checking whether the website name of the website URL is in the index, and if so, performing step (32); and
(32) checking whether the path of the website URL is in the index, and if so, performing step (4).
The website URL in step (2) is one or more website URLs in a user's access history.
The database is a Key-Value database.
The present invention needs to build an industry-related knowledge base that contains structured text information for each industry. For example, "Audi A4L" is a vehicle model in the automobile industry.
The present invention needs to obtain the correspondence between the URL structures of each website and the entries in the knowledge base. For example, on the website www.autohome.com.cn, URL directories beginning with /692/ correspond to information about the "Audi A4L" in the automobile industry.
The present invention needs to build an efficient knowledge base index. The knowledge base may be very large, containing more than one million concrete entries, so keyword extraction requires an efficient index to reduce lookup time.
The present invention needs to build a URL parser that, for each different URL, can quickly capture the structures in the URL that correspond to the knowledge base.
With these four components, the present invention can quickly locate and extract the marketing keywords for each URL: first obtain the structures in the URL that correspond to the knowledge base, then obtain the corresponding structured text through the knowledge base index.
Embodiments of the present invention are as follows:
1. Building a structured industry knowledge base
We want to extract only the marketing keywords that are useful for ad delivery, so the marketing keywords must map to specific industries and carry clear semantic information. We therefore need to build an industry-related knowledge base that represents the structured text information of each industry.
For example, the structure of a tourism industry knowledge base, with examples, is shown in the following table. Each industry corresponds to several different products, and each product can have several different fields.
The industry knowledge base can be built by manual curation combined with targeted crawling of the web. First, the structure of the knowledge base is defined manually for each industry; then the field names and corresponding values we need are crawled from the leading websites of that industry.
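The industry, product, and field hierarchy described above could be modeled roughly as follows. This is a sketch under assumptions: the class names and the sample products and field values are hypothetical placeholders, since the table's contents are not reproduced in this text.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    name: str
    fields: dict = field(default_factory=dict)  # field name -> value


@dataclass
class Industry:
    name: str
    products: list = field(default_factory=list)


# Hypothetical sample data; the real table's contents are not reproduced here.
travel = Industry(
    name="travel",
    products=[
        Product("hotel", {"city": "Shanghai", "class": "economy"}),
        Product("flight", {"origin": "Shanghai", "destination": "Beijing"}),
    ],
)


def marketing_keywords(industry):
    """Flatten each product's fields into one structured keyword string."""
    return [
        f"{industry.name}/{p.name}: "
        + ", ".join(f"{k}={v}" for k, v in p.fields.items())
        for p in industry.products
    ]


print(marketing_keywords(travel))
```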
2. Obtaining the correspondence between URL structures and the knowledge base
Each industry has many vertical websites, or channels of general-purpose websites, that correspond to it. For the tourism industry, for example, Ctrip, Qunar, Tuniu, Taobao and others have a large number of corresponding URLs. We need to obtain the correspondence between the URL structures of each website and the entries in the knowledge base. For example, on the website www.autohome.com.cn, URL directories beginning with /692/ correspond to information about the "Audi A4L" in the automobile industry. So when we encounter the URL http://www.autohome.com.cn/692/#pvareaid=103177, we can extract the marketing keyword "automobile model: Audi A4L".
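One way to express such a correspondence is a per-site rule that captures the identifying part of the path. The sketch below assumes a regular expression per website and a small table from captured id to knowledge base entry; `SITE_RULES` and `keyword_for` are hypothetical names, and only the /692/ example comes from the text.

```python
import re
from urllib.parse import urlsplit

# Hypothetical per-site rules: a regex over the path plus a table from
# the captured id to a knowledge base entry.
SITE_RULES = {
    "www.autohome.com.cn": {
        "pattern": re.compile(r"^/(\d+)/"),
        "entries": {"692": "automobile model: Audi A4L"},
    },
}


def keyword_for(url):
    """Return the marketing keyword for a URL, or None if no rule matches."""
    parts = urlsplit(url)
    rule = SITE_RULES.get(parts.hostname)
    if rule is None:
        return None
    match = rule["pattern"].match(parts.path)
    if match is None:
        return None
    return rule["entries"].get(match.group(1))


print(keyword_for("http://www.autohome.com.cn/692/#pvareaid=103177"))
```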
To obtain this correspondence, we need to analyze each website and perform targeted crawling and parsing. For example, the page http://car.autohome.com.cn/zhaoche/pinpai/?pvareaid=101451 contains the correspondence between each Autohome car model and its URL structure. We can crawl this page and, by parsing the HTML, extract the correspondence between the different numeric ids in structures such as /692/ and the car models.
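A rough sketch of the parsing half of this step is shown below, using Python's standard `html.parser`. The markup in `SAMPLE_HTML` is invented for illustration (the real page layout is not given in the text), and the id 999 is a placeholder; only the /692/ to Audi A4L pairing comes from the description.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup; the real page layout is not given in the text.
SAMPLE_HTML = """
<ul>
  <li><a href="/692/">Audi A4L</a></li>
  <li><a href="/999/">Some Other Model</a></li>
  <li><a href="/about.html">About</a></li>
</ul>
"""

ID_LINK = re.compile(r"^/(\d+)/$")


class ModelLinkParser(HTMLParser):
    """Collect {numeric id: link text} for links shaped like /<digits>/."""

    def __init__(self):
        super().__init__()
        self.mapping = {}
        self._current_id = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            match = ID_LINK.match(dict(attrs).get("href") or "")
            self._current_id = match.group(1) if match else None

    def handle_data(self, data):
        # Only text inside a matching <a> element is recorded.
        if self._current_id and data.strip():
            self.mapping[self._current_id] = data.strip()

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_id = None


parser = ModelLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.mapping)
```

In practice the fetched page would replace `SAMPLE_HTML`, and the resulting mapping would be written into the knowledge base index.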
3. Building an efficient knowledge base index
The knowledge base may be very large, containing more than one million concrete entries, and each entry may correspond to several URL structures across several websites. Because the knowledge base structure may differ across industries and changes frequently, it is difficult to maintain a simple and reliable relational database for retrieving the knowledge base entry that corresponds to a URL structure.
We therefore adopt a Key-Value database. Each record in the knowledge base corresponds to one Value and to several different Keys. A Key is composed of several possible fields, such as the site name, the type of the URL structure, the value of the URL structure, and the context of the URL structure; its ultimate purpose is to locate one specific Value.
Because Key-Value databases are efficient, our index structure supports very fast queries.
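The many-Keys-to-one-Value layout might look like the following sketch. A plain dictionary stands in for the Key-Value database, and the `|`-joined key format, the field values, and the second site name are assumptions for illustration; the text names the fields of a Key but not their encoding.

```python
def make_key(site, structure_type, value, context=""):
    """Join the Key fields into one string; the layout is an assumption."""
    return "|".join((site, structure_type, value, context))


# A plain dict stands in for the Key-Value database.
kv_store = {}

entry = {"industry": "automobile", "model": "Audi A4L"}
# Several Keys may point at the same Value.
kv_store[make_key("autohome.com.cn", "path_prefix", "692")] = entry
kv_store[make_key("example-cars.test", "query_param", "a4l", "series")] = entry


def find_entry(site, structure_type, value, context=""):
    """Look up the knowledge base entry for one URL structure."""
    return kv_store.get(make_key(site, structure_type, value, context))


print(find_entry("autohome.com.cn", "path_prefix", "692"))
```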
4. A URL parser covering multiple industries
We need to build a general-purpose URL parser that, for a URL from any website, can quickly determine whether the URL contains structured information and capture the structures in the URL that correspond to the knowledge base.
For each URL, we need to extract the corresponding website, subdomain name, URL path, and URL parameter list. For example, for the URL
http://spu.taobao.com/spu/3c/spulist.htm?spm=1.7274553.204.9.gMHLWH&cat=1512&page=&q=&pidvid=20000%3A11813&sniperv=#content
the website is taobao.com;
the subdomain name is spu.taobao.com;
the URL path is /spu/3c/spulist.htm;
the URL parameter list is spm=1.7274553.204.9.gMHLWH, cat=1512, pidvid=20000%3A11813, sniperv=#content.
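The decomposition above can be approximated with Python's standard `urllib.parse`, as in this sketch. Taking the last two host labels as the "website" is a simplifying assumption that works for taobao.com but not for suffixes such as .com.cn, which would need a public suffix list.

```python
from urllib.parse import parse_qsl, urlsplit


def decompose(url):
    """Split a URL into website, subdomain, path, parameters, and fragment."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Naive assumption: the website is the last two labels of the host.
    # This is wrong for suffixes such as .com.cn, which need a suffix list.
    website = ".".join(host.split(".")[-2:])
    return {
        "website": website,
        "subdomain": host,
        "path": parts.path,
        "params": parse_qsl(parts.query, keep_blank_values=True),
        "fragment": parts.fragment,
    }


url = ("http://spu.taobao.com/spu/3c/spulist.htm"
       "?spm=1.7274553.204.9.gMHLWH&cat=1512&page=&q="
       "&pidvid=20000%3A11813&sniperv=#content")
result = decompose(url)
print(result)
```

Note that a standards-based parser reports `sniperv` as an empty parameter and `content` as the URL fragment, whereas the list above groups them together; it also percent-decodes `20000%3A11813` to `20000:11813`.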
Our URL parser stores a tree-shaped index of the valid URL structures. For example, it first checks whether the website of the URL is in the index; if so, it reads the sub-index under that website and checks whether the structure of the URL path is in the index, and so on. This finally determines which URL structures are useful, and the Key of the corresponding knowledge base index is assembled from those URL structures.
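A minimal sketch of such a tree-shaped index, using nested dictionaries, is given below. The `"/key"` leaf marker, the Key strings, and the decision to index on the full host name are assumptions for this example; the text does not specify the node layout.

```python
from urllib.parse import urlsplit

# Nested dicts form the tree: host -> path segment -> ... -> "/key".
# "/key" cannot collide with a path segment because segments never
# contain "/". The Key strings themselves are hypothetical.
TREE_INDEX = {
    "www.autohome.com.cn": {
        "692": {"/key": "autohome.com.cn|path_prefix|692"},
    },
    "spu.taobao.com": {
        "spu": {"3c": {"/key": "taobao.com|category|3c"}},
    },
}


def resolve_key(url):
    """Walk the tree level by level; return the deepest Key found, or None."""
    parts = urlsplit(url)
    node = TREE_INDEX.get(parts.hostname)
    if node is None:  # website not in the index
        return None
    key = None
    for segment in filter(None, parts.path.split("/")):
        node = node.get(segment)
        if node is None:  # path structure not in the index below this level
            break
        key = node.get("/key", key)
    return key


print(resolve_key("http://spu.taobao.com/spu/3c/spulist.htm?cat=1512"))
```

Because the walk stops at the first unknown segment, URLs from unindexed sites or with unindexed paths are rejected after only a few dictionary lookups.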
5. Quickly locating and extracting keywords
For each URL to be processed, we first use the URL parser to determine whether a marketing keyword can be extracted. If it can, we obtain a Key of the knowledge base index, use this Key to obtain the corresponding knowledge base entry, and output the structured marketing keyword.
With this method, we can analyze a large number of URLs quickly, and extract and store the corresponding marketing keywords.
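Applied to a user access log, the flow might be sketched as follows. The two lookup tables are hypothetical stand-ins for the URL parser index and the knowledge base, and counting keywords per user is one possible way of "storing" the result that the text does not prescribe.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Hypothetical stand-ins for the URL parser index and the knowledge base.
KEY_BY_SITE_AND_SEGMENT = {("www.autohome.com.cn", "692"): "auto:692"}
KNOWLEDGE_BASE = {"auto:692": "automobile model: Audi A4L"}


def extract_keyword(url):
    """Parser step, then index lookup; None if no keyword can be extracted."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        return None
    key = KEY_BY_SITE_AND_SEGMENT.get((parts.hostname, segments[0]))
    return KNOWLEDGE_BASE.get(key) if key else None


def keywords_per_user(visits):
    """visits: iterable of (user_id, url) pairs from the access log."""
    result = defaultdict(Counter)
    for user_id, url in visits:
        keyword = extract_keyword(url)
        if keyword:
            result[user_id][keyword] += 1
    return result


visits = [
    ("u1", "http://www.autohome.com.cn/692/#pvareaid=103177"),
    ("u1", "http://www.autohome.com.cn/692/"),
    ("u2", "http://example.com/"),
]
profile = keywords_per_user(visits)
print(dict(profile))
```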
In summary, the method for the marketing keyword extraction of analyzing based on url of the present invention is precisely directed for the user in advertisement putting field, has proposed, efficiently the market method of keyword extraction that analyze, general based on url.Method of the present invention has following advantage:
(1) method of the present invention lays particular emphasis on the structure of url itself is carried out to automatic analysis, does not need url to capture on a large scale, thereby only takies less system resource;
(2) because the html page corresponding to url may be expired or lost efficacy, cause the crawl of url to have certain probability failure, thereby compare with the method that captures the html page and resolve, the inventive method has higher success ratio;
(3) because the large search engines such as Baidu can fall power to the url of frequent change, the url structure of most of websites remains unchanged for a long time, thereby the method for analyzing based on url that the present invention proposes has good stability;
(4) in the process that the present invention analyzes at url, set up efficient knowledge base index and url resolver, made the inventive method there is very high execution efficiency.
Above-described embodiment is only for illustrating technological thought and the feature of this patent, its object is the content that makes those skilled in the art can understand this patent and implements according to this, can not only with the present embodiment, limit the scope of the claims of this patent, be equal variation or the modification that all spirit disclosing according to this patent is done, still drop in the scope of the claims of this patent.

Claims (7)

1. A method for extracting marketing keywords based on URL analysis, characterized by comprising:
(1) presetting a database, the database containing a plurality of structured texts and an established correspondence between a plurality of website URL structures and the structured texts in the database, the structured texts comprising at least marketing keywords;
(2) analyzing at least one website URL and capturing at least the website name and the path of the website URL;
(3) looking up in the database index, according to the website name and the path of the website URL, whether a matching structured text exists, and if so, performing step (4); and
(4) obtaining the structured text that matches the website URL.
2. The method for extracting marketing keywords based on URL analysis according to claim 1, characterized in that: in step (2), the website name and the path of the website URL are captured by a website URL parser.
3. The method for extracting marketing keywords based on URL analysis according to claim 2, characterized in that: a tree-shaped index of website URL structures is pre-stored in the website URL parser in step (2).
4. The method for extracting marketing keywords based on URL analysis according to claim 3, characterized in that: in step (2), the website, the subdomain name, the URL path, and the URL parameter list of the website URL are extracted.
5. The method for extracting marketing keywords based on URL analysis according to claim 4, characterized in that step (3) comprises:
(31) checking whether the website name of the website URL is in the index, and if so, performing step (32); and
(32) checking whether the path of the website URL is in the index, and if so, performing step (4).
6. The method for extracting marketing keywords based on URL analysis according to claim 1, characterized in that: the website URL in step (2) is one or more website URLs in a user's access history.
7. The method for extracting marketing keywords based on URL analysis according to claim 1, characterized in that: the database is a Key-Value database.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410285743.8A CN104063453A (en) 2014-06-24 2014-06-24 Method for extracting key words of marketing based on URL (uniform resource locator) analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410285743.8A CN104063453A (en) 2014-06-24 2014-06-24 Method for extracting key words of marketing based on URL (uniform resource locator) analysis

Publications (1)

Publication Number Publication Date
CN104063453A true CN104063453A (en) 2014-09-24

Family

ID=51551167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410285743.8A Pending CN104063453A (en) 2014-06-24 2014-06-24 Method for extracting key words of marketing based on URL (uniform resource locator) analysis

Country Status (1)

Country Link
CN (1) CN104063453A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006520939A (en) * 2003-03-18 2006-09-14 エヌエイチエヌ コーポレーション Internet user's connection intention determination method, and Internet advertisement method and system using the same
CN101511055A (en) * 2009-02-19 2009-08-19 华为技术有限公司 Method and device for delivering advertisement
CN102143224A (en) * 2011-01-25 2011-08-03 张金海 Mobile phone Internet accessing-based user behavior analysis method and device
CN102663022A (en) * 2012-03-21 2012-09-12 浙江盘石信息技术有限公司 Classification recognition method based on URL (uniform resource locator)
CN103164521A (en) * 2013-03-11 2013-06-19 亿赞普(北京)科技有限公司 Keyword calculation method and device based on user browse and search actions

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866909A (en) * 2015-04-29 2015-08-26 国网智能电网研究院 Method and system for finishing air ticket booking function URL
CN110737851A (en) * 2018-07-03 2020-01-31 百度在线网络技术(北京)有限公司 Method, device and equipment for semantization of hyperlink and computer readable storage medium
CN110737851B (en) * 2018-07-03 2022-09-09 百度在线网络技术(北京)有限公司 Hyper-link semantization method, device, equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN1955963B (en) System and method for searching dates in electronic documents
CN101329687B (en) Method for positioning news web page
CN101908071B (en) Method and device thereof for improving search efficiency of search engine
CN103218431B (en) A kind ofly can identify the system that info web gathers automatically
CN106095979B (en) URL merging processing method and device
CN102710795B (en) Hotspot collecting method and device
CN102567494B (en) Website classification method and device
CN104615627B (en) A kind of event public feelings information extracting method and system based on microblog
CN101676907A (en) Method and system of directionally acquiring Internet resources
CN104063454A (en) Search push method and device for mining user demands
CN103051637A (en) User identification method and device
CN1963816A (en) Automatization processing method of rating of merit of search engine
CN102486799B (en) World wide web (WWW) page processing method and device
CN102169496A (en) Anchor text analysis-based automatic domain term generating method
CN104869009A (en) Website data statistics system and method
CN103023714A (en) Activeness and cluster structure analyzing system and method based on network topics
CN102073960A (en) Method for assessing operation effect in website marketing process
CN103116635B (en) Field-oriented method and system for collecting invisible web resources
CN102819580A (en) Monitoring method and system of advertisements of internet third-part media website
CN104462397A (en) Promotion information processing method and promotion information processing device
CN103177036A (en) Method and system for label automatic extraction
CN102811207A (en) Network information pushing method and system
CN107203588A (en) A kind of data classification managing system
CN103605742A (en) Method and device for recognizing network resource entity content page
CN101807187A (en) Browsing information-based instant search method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20140924