CN104915422A - Webpage collecting method and device based on browser - Google Patents


Info

Publication number
CN104915422A
Authority
CN
China
Prior art keywords
webpage
characteristic information
collection
search engine
web pages
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510316329.3A
Other languages
Chinese (zh)
Inventor
赵俊博
陈庆伟
王阳
胡海涛
郭俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anyi Hengtong Beijing Technology Co Ltd
Original Assignee
Anyi Hengtong Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anyi Hengtong Beijing Technology Co Ltd
Priority to CN201510316329.3A
Publication of CN104915422A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9562: Bookmark management

Abstract

The invention discloses a webpage collecting method and device based on a browser. The method includes: receiving a webpage collecting instruction; extracting webpage feature information; automatically saving the feature information into a collecting directory, wherein the webpage information comprises at least one of a keyword set, a title and an abstract. By the method, effectiveness of information collected by a user can be guaranteed, and instantaneity and accuracy of the information collected by the user can be increased.

Description

Webpage collection method and device based on a browser
Technical field
The present application relates to the field of computer technology, specifically to the field of terminal technology, and in particular to a browser-based webpage collection method and device.
Background art
Current browsers collect webpages based on the uniform resource locator (URL) of the webpage. The browser saves the URL of a webpage that interests the user to the collection (favorites), thereby preserving the information on that webpage. The information that interests the user is therefore tied to the collected URL.
This approach has the following defects. If a collected URL becomes invalid (for example, the address expires or the website closes), the information the user wanted to keep is lost. When the information on the webpage of interest changes, the URL saved in the collection cannot reflect the updated webpage information, so the user cannot obtain the latest information when revisiting the webpage through the collected URL, which impairs the accuracy of the collected information.
Summary of the invention
In view of this, it is desirable to provide a method of dynamically collecting webpages that can obtain valid information in real time through the collection. It is further desirable that the provided method can obtain more information from the collected webpages. To solve one or more of the above problems, the present application provides a browser-based webpage collection method and device.
In a first aspect, the present application provides a browser-based webpage collection method. The method comprises: receiving an instruction to collect a webpage; extracting characteristic information of the webpage; and automatically saving the characteristic information to a collection directory. The characteristic information comprises at least one of: a keyword set, a title, and a summary.
In some implementations, extracting the characteristic information of the webpage comprises at least one of: extracting high-frequency words appearing in the webpage based on statistical features and screening the high-frequency words based on semantic features to obtain the keyword set of the webpage; parsing the webpage based on text density to obtain the title of the webpage; and extracting the summary of the webpage based on semantic features.
In some implementations, the browser-based webpage collection method further comprises: in response to a user's selection instruction for characteristic information in the collection directory, retrieving the characteristic information with a search engine to determine a target webpage; and jumping to the target webpage.
In some implementations, retrieving the characteristic information with a search engine to determine a target webpage comprises: sending a search command containing the characteristic information to the search engine; obtaining, from the search results of the search engine, at least one matching candidate webpage and its corresponding matching degree value; checking in turn, in order of matching degree value, whether each candidate webpage is available; and taking the available candidate webpage with the highest matching degree value as the target webpage.
In some implementations, the browser-based webpage collection method further comprises: providing links to other available candidate webpages in a preset area of the target webpage.
In further implementations, the browser-based webpage collection method further comprises: in response to a user's click on a link to one of the other available candidate webpages, reporting data related to the click to the search engine, so as to increase the matching degree value of the clicked candidate webpage with respect to the characteristic information.
In further implementations, the data related to the click comprises the click time and the number of clicks.
In a second aspect, the present application provides a browser-based webpage collection device. The device comprises: a receiving unit configured to receive an instruction to collect a webpage; an extraction unit configured to extract characteristic information of the webpage; and a storage unit configured to automatically save the characteristic information to a collection directory. The characteristic information comprises at least one of: a keyword set, a title, and a summary.
In some implementations, the extraction unit is configured to extract the characteristic information of the webpage in at least one of the following ways: extracting high-frequency words appearing in the webpage based on statistical features and screening the high-frequency words based on semantic features to obtain the keyword set of the webpage; parsing the webpage based on text density to obtain the title of the webpage; and extracting the summary of the webpage based on semantic features.
In some implementations, the browser-based webpage collection device further comprises: a retrieval unit configured to, in response to a user's selection instruction for characteristic information in the collection directory, retrieve the characteristic information with a search engine to determine a target webpage, and jump to the target webpage.
In some implementations, the retrieval unit is configured to determine the target webpage as follows: sending a search command containing the characteristic information to the search engine; obtaining, from the search results returned by the search engine, at least one matching candidate webpage and its corresponding matching degree value; checking in turn, in order of matching degree value, whether each candidate webpage is available; and taking the available candidate webpage with the highest matching degree value as the target webpage.
In further implementations, the browser-based webpage collection device further comprises: a recommendation unit configured to provide links to other available candidate webpages in a preset area of the target webpage.
In further implementations, the browser-based webpage collection device further comprises: an adjustment unit configured to, in response to a user's click on a link to one of the other available candidate webpages, report data related to the click to the search engine, so as to increase the matching degree value of the clicked candidate webpage with respect to the characteristic information.
In further implementations, the data related to the click comprises the click time and the number of clicks.
With the browser-based webpage collection method and device provided by the present application, characteristic information is extracted from the webpage to be collected and automatically saved to the collection directory. This ensures the validity of the information the user collects, and the webpage information that the user obtains through the information saved in the collection directory is up to date. The method and device therefore improve the accuracy of the collected information.
Brief description of the drawings
Other features, objects and advantages of the present application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 shows an exemplary flowchart of a browser-based webpage collection method according to an embodiment of the present application;
Fig. 2 shows an exemplary flowchart of a browser-based webpage collection method according to another embodiment of the present application;
Fig. 3 shows an exemplary flowchart of a method of determining a target webpage with a search engine according to an embodiment of the present application;
Fig. 4 shows a schematic diagram of the effect of accessing a collected webpage using the method provided by an embodiment of the present application; and
Fig. 5 shows a schematic structural diagram of a browser-based webpage collection device according to an embodiment of the present application.
Detailed description of the embodiments
The present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
In the following description, numerous specific details are set forth to provide a thorough description of embodiments of the invention. Those skilled in the art will understand, however, that embodiments of the present application may be practiced without these specific details.
Referring to Fig. 1, an exemplary flowchart of a browser-based webpage collection method according to an embodiment of the present application is shown. For ease of understanding, this embodiment is described in connection with an electronic device having a network communication function. Those skilled in the art will understand that the electronic device may include, but is not limited to, a smartphone, a tablet computer, a smart watch, an e-book reader, a laptop computer and a desktop computer.
As shown in Fig. 1, in step 101, an instruction to collect a webpage is received.
While browsing webpages in a browser, a user who is interested in the content of a webpage can send the browser an instruction to collect the webpage, and the browser receives that instruction. The user may send the instruction by clicking the browser's favorites folder or collection icon. For example, in some browsers the instruction can be sent by right-clicking in the webpage and choosing "Add to collection" from the drop-down menu. In some implementations the webpage itself may contain an option for collecting it, such as "Collect this site"; the user can then send the instruction by clicking that option. In some optional implementations, when the electronic device has a voice input device, the user can issue a voice instruction. The electronic device parses the voice instruction and sends it to the browser, which receives the parsed instruction to collect the webpage.
In step 102, the characteristic information of the webpage is extracted.
In response to the instruction to collect the webpage, the browser performs collection processing on the webpage of interest. In this embodiment, the browser preserves the information in the webpage by extracting the webpage's characteristic information, which may comprise at least one of: a keyword set, a title, and a summary.
The browser can analyze the text content of the webpage and extract characteristic information from it. In some implementations, the browser may use machine learning to extract the keyword set, title and summary of the webpage. For example, a document keyword model, trained on a large number of documents and webpages, may be used to extract the keywords of the webpage.
In some implementations, the keyword set may be extracted as follows: high-frequency words appearing in the webpage are extracted based on statistical features and then screened based on semantic features to obtain the keyword set. The browser first identifies the text content of the webpage. When the text is in a language such as Chinese that does not separate words with spaces, the text can be segmented into words based on a preset dictionary or on term frequencies, where the term frequencies may be statistics gathered from a large number of documents or webpages. The browser then counts how often each word in the segmentation result occurs in the webpage to be collected and takes the words whose frequency exceeds a preset threshold as high-frequency words; alternatively, it sorts the words by frequency and takes a preset number of top-ranked words as high-frequency words. Optionally, the browser may merge segmented words by semantic similarity before counting, for example treating "artistic illustration" and "illustration" as the same word. The high-frequency words may include conjunctions, articles and other words without practical meaning, so the browser can also screen them based on semantic features, for example by filtering out such words, to obtain the keyword set of the webpage.
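As a non-limiting illustration, the frequency-based extraction described above might be sketched in Python as follows. The stop-word list, the function name `extract_keywords` and the parameters `top_n` and `min_count` are assumptions introduced for this sketch and are not part of the application. Word segmentation for Chinese text and merging by semantic similarity are omitted, and the semantic screening is reduced to a stop-word filter.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a practical list would be much larger.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "with"}

def extract_keywords(text, top_n=5, min_count=2):
    """Return high-frequency words of the text after removing stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # Statistical step: keep words that occur at least min_count times.
    frequent = [(word, n) for word, n in counts.most_common() if n >= min_count]
    # Simplified semantic screening: drop words without practical meaning.
    keywords = [word for word, _ in frequent if word not in STOP_WORDS]
    return keywords[:top_n]

page_text = (
    "Machine learning and the neural network. "
    "The neural network is a machine learning model. "
    "Programming the network is an application of machine learning."
)
keywords = extract_keywords(page_text)
```

In this example the words "the" and "is" are frequent but are removed by the screening step, so only content words remain in `keywords`.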
In further implementations, the keyword set may also be extracted from emphasis markers in the webpage. In some webpages, important passages are marked with a font (such as bold), a color (such as highlighting) or special symbols (such as "#" before and after the text) that differ from the rest of the content. The marked text can be extracted directly, segmented into words, and filtered to remove words without grammatical meaning, yielding at least part of the keyword set.
Optionally, in addition to the keywords extracted from the text content based on semantic and/or statistical features, the keyword set may include an abbreviation of the website name. The browser can analyze the address of the webpage, or extract the abbreviated website name from the webpage, for example "Patent Office", and add it to the keyword set.
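One possible way to collect such marked text is sketched below with Python's standard `html.parser` module. The choice of tags treated as emphasis (`b`, `strong`, `em`, `mark`), the class name `EmphasisCollector` and the function name `emphasized_text` are assumptions made for illustration; color-based highlighting through style sheets is not handled.

```python
import re
from html.parser import HTMLParser

class EmphasisCollector(HTMLParser):
    """Collect the text found inside emphasis tags."""
    EMPHASIS_TAGS = {"b", "strong", "em", "mark"}  # assumed set of tags

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many emphasis tags are currently open
        self.fragments = []   # emphasized text fragments, in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.EMPHASIS_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.EMPHASIS_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.fragments.append(data.strip())

def emphasized_text(html):
    """Return text marked by emphasis tags or wrapped in '#' symbols."""
    collector = EmphasisCollector()
    collector.feed(html)
    collector.close()
    hashed = re.findall(r"#([^#<>]+)#", html)
    return collector.fragments + [h.strip() for h in hashed]

sample = "<p>Intro to <b>neural networks</b> and #deep learning# today.</p>"
marked = emphasized_text(sample)
```

The resulting fragments would still need word segmentation and filtering, as described above, before they are added to the keyword set.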
In some implementations, the title may be extracted by parsing the webpage based on text density. The title is usually located at a specific position on the page, such as the top or the left side, and its text density is much lower than that of the body. The browser can locate the title through text-density detection and then extract the title from that position. In other implementations, the browser may use a trained webpage title extraction model to obtain the title; the training data for this model may be random webpages.
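The application does not define how text density is computed. The sketch below is a deliberately simplified heuristic that assumes the page has already been split into visible text blocks in page order, uses the character count of a block as its density, and picks the sparsest block near the top of the page. The function name `guess_title` and the parameters `top_k` and `min_chars` are hypothetical.

```python
def guess_title(blocks, top_k=3, min_chars=5):
    """Pick the sparsest sufficiently long block among the first top_k blocks."""
    candidates = [b.strip() for b in blocks[:top_k] if len(b.strip()) >= min_chars]
    if not candidates:
        return None
    # The title is assumed to be much shorter (less dense) than body text.
    return min(candidates, key=len)

blocks = [
    "Home",
    "Browser-based webpage collection",
    "Current browsers collect webpages based on the URL of the webpage, "
    "which ties the information of interest to an address that may expire.",
    "A later paragraph that lies outside the region searched for a title.",
]
title = guess_title(blocks)
```

The `min_chars` limit is there only to skip very short navigation labels such as "Home"; a practical implementation would need a more robust density measure.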
In some implementations, the summary may be extracted based on the semantic features of the webpage's text content. Specifically: feature words are extracted from the text (for Chinese webpages, word segmentation is performed first); the weight of each feature word is determined from word-frequency statistics; and the weight of each sentence is determined from the weights of the feature words it contains. Similar sentences are then merged based on semantic features, and the sentences are joined according to their weights to form the summary. In some optional implementations, the summary may be a one-sentence description, produced by extracting keywords from the webpage and then joining them into a sentence according to semantic features, adding conjunctions and similar words as needed.
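The sentence-weighting steps above can be illustrated with the following minimal sketch. It assumes English text, takes the feature words as an input, uses raw word frequency as the feature-word weight, and skips the merging of similar sentences. The function name `summarize` and the parameter `max_sentences` are assumptions for this example.

```python
import re
from collections import Counter

def summarize(text, feature_words, max_sentences=2):
    """Join the highest-weighted sentences, heaviest first."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(re.findall(r"[a-z]+", text.lower()))
    # Weight of a feature word: how often it occurs in the whole text.
    weights = {word: frequency[word] for word in feature_words}

    def sentence_weight(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Weight of a sentence: sum of the weights of its feature words.
        return sum(weights.get(token, 0) for token in tokens)

    ranked = sorted(sentences, key=sentence_weight, reverse=True)
    return " ".join(ranked[:max_sentences])

text = (
    "Cats sleep a lot. Browsers collect webpages. "
    "Browsers save webpages as keywords."
)
summary = summarize(text, ["browsers", "webpages", "keywords"])
```

Here the sentence about cats contains no feature words, receives a weight of zero, and is left out of the summary.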
In step 103, the characteristic information is automatically saved to the collection directory.
In conventional collection, the browser saves the webpage's URL to the collection directory and assigns the URL a name, automatically or as chosen by the user, so that the user can open the corresponding webpage by that name. In this embodiment, the browser does not save the URL. Instead, it automatically saves the extracted characteristic information to the collection directory; that is, the directory stores the keyword set, title or summary extracted from the webpage and no hyperlink to it. When the user later accesses the collected information through the collection directory, the browser does not navigate directly to a webpage by URL but accesses the webpage through the characteristic information in the directory. The information of interest is thus associated with the characteristic information extracted from the webpage rather than with the URL. When the user looks up saved information in the collection directory, the lookup result is characteristic information. Because the characteristic information does not change when the webpage's URL changes, its content is updated or its address becomes invalid, the collected information is more timely and accurate than with the conventional approach of saving a URL.
With the browser-based webpage collection method provided by the above embodiment, characteristic information is extracted from the webpage to be collected and automatically saved to the collection directory. When the URL of a collected webpage becomes invalid, the information the user wanted to keep is not lost, which ensures the validity of the collected information. When the information on the webpage of interest changes, the updated webpage can be found through the saved characteristic information and the latest information obtained, so that the information the user obtains through the collection directory is accurate and up to date.
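To make the difference from URL-based collection concrete, the sketch below shows one hypothetical shape for an entry in the collection directory. The class name `CollectedEntry`, the function `save_entry` and the use of an in-memory list serialized as JSON are assumptions for illustration; the application does not specify a storage format.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class CollectedEntry:
    """Characteristic information of one collected webpage (no URL field)."""
    keywords: list = field(default_factory=list)
    title: str = ""
    summary: str = ""

def save_entry(directory, entry):
    """Append the entry to the collection directory and return the directory."""
    directory.append(asdict(entry))
    return directory

collection_directory = []
save_entry(
    collection_directory,
    CollectedEntry(
        keywords=["machine learning", "artificial neural network"],
        title="Introduction to machine learning",
    ),
)
serialized = json.dumps(collection_directory)
```

Note that the stored record deliberately has no address field, so nothing in the directory can become a dead link.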
Referring further to Fig. 2, an exemplary flowchart of a browser-based webpage collection method according to another embodiment of the present application is shown.
As shown in Fig. 2, in step 201, an instruction to collect a webpage is received.
In this embodiment, the browser can receive the instruction to collect a webpage sent by the user. The instruction may be received in response to the user clicking the browser's collection icon or selecting the favorites folder. Alternatively, a voice collection instruction may be captured by the voice input device of the electronic device and converted by an audio parsing module into a collection instruction the browser can recognize.
In step 202, the characteristic information of the webpage is extracted.
The characteristic information may comprise at least one of: a keyword set, a title, and a summary. In this embodiment, the browser can extract the characteristic information from the webpage content based on statistical features and/or semantic features. For example, high-frequency words in the webpage text may be extracted based on statistical features and then processed according to semantic features to obtain the keyword set; the title may be determined based on text-density statistics; and the summary may be generated based on semantic features.
In step 203, the characteristic information is automatically saved to the collection directory.
In this embodiment, the browser can automatically add the characteristic information to the collection directory. Optionally, when the characteristic information exceeds a preset amount (for example, the number of keywords exceeds a preset number or the summary exceeds a preset length), a label may be set for the characteristic information. When the user accesses the related webpage through the collection directory, the characteristic information can be found from the corresponding label.
Steps 201, 202 and 203 of the flow described with reference to Fig. 2 are the same as steps 101, 102 and 103 of the previous embodiment, respectively, and are not described again here.
Then, in step 204, in response to the user's selection instruction for characteristic information in the collection directory, the characteristic information is retrieved with a search engine to determine a target webpage.
When the user accesses a related webpage through characteristic information in the collection directory, the browser can, in response to the user's selection instruction for that characteristic information, use a search engine to retrieve the characteristic information and thereby determine the target webpage. The user can obtain the information of interest from the target webpage.
In some implementations, the user can send an instruction to access the related webpage by clicking the characteristic information in the collection directory, and the browser can automatically call the search engine to perform the search. The search results may contain multiple webpages. The user can manually select a webpage of interest from them as the target webpage, or the browser can select one as the target webpage based on a predetermined rule. The predetermined rule may be that the webpage's update time is closest to the current access time and/or that the webpage has the highest matching degree with the characteristic information.
In this embodiment, because the browser searches by characteristic information automatically when the user requests access, the retrieved webpages are current. That is, when the content of a collected webpage has changed, the user can reach the changed webpage through the search engine. The target webpage provides the information the user is interested in, and the webpages in a search engine's results are generally valid, so the problem of a URL in the collection directory pointing to an invalid webpage is avoided.
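How the two criteria of the predetermined rule are combined is left open in the description. The sketch below shows one possible combination, assumed here for illustration only: prefer the highest matching degree and break ties by the most recent update. The function name `pick_target`, the dictionary keys and the example addresses are hypothetical.

```python
from datetime import datetime

def pick_target(results, now):
    """Prefer the highest matching degree; break ties by the most recent update."""
    def rank(result):
        age_seconds = (now - result["updated"]).total_seconds()
        # A larger matching degree wins first, then a smaller age.
        return (result["match"], -age_seconds)
    return max(results, key=rank)

now = datetime(2015, 6, 10, 12, 0)
results = [
    {"url": "https://a.example/page", "match": 0.8, "updated": datetime(2015, 6, 1)},
    {"url": "https://b.example/page", "match": 0.9, "updated": datetime(2015, 5, 1)},
    {"url": "https://c.example/page", "match": 0.9, "updated": datetime(2015, 6, 9)},
]
target = pick_target(results, now)
```

Other orderings, such as weighting the update time more heavily than the matching degree, would be equally consistent with the rule as described.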
Referring further to Fig. 3, an exemplary flowchart of a method of determining a target webpage with a search engine according to an embodiment of the present application is shown.
As shown in Fig. 3, in step 301, a search command containing the characteristic information is sent to the search engine.
In this embodiment, the browser can start the search engine and send it the characteristic information, which the search engine uses as search keywords. The precision of the search results depends on the accuracy of the saved characteristic information: the more accurate the characteristic information, the more accurate the retrieved webpages. Under its retrieval mechanism, the search engine can filter webpages so that invalid webpages are removed from the results before they are presented in the browser. Searching through a search engine therefore helps ensure that the retrieved webpages are valid. Moreover, because the search engine obtains webpage information from servers in real time, the content of the retrieved webpages is also current.
In step 302, at least one matching candidate webpage and its corresponding matching degree value are obtained from the search results of the search engine.
The search engine can take the characteristic information in the search command as the search terms and preprocess them quickly, for example performing word segmentation (specific to languages such as Chinese), removing stop words, deciding whether to trigger integrated search, and checking for misspellings or wrong characters. After processing the search terms, the search engine finds all webpages containing them in its index database as matching candidate webpages. The search engine can also calculate a matching degree value for each webpage from how well the webpage matches the search terms, the position and frequency of the search terms, the quality of the webpage's links, and so on. In some implementations, the search engine also obtains each webpage's update time and combines it with these factors when calculating the matching degree value.
In this embodiment, the browser can obtain the webpages found by the search engine together with the matching degree value of each webpage.
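As an illustration of sending the characteristic information as a search command, the sketch below builds a query address from a saved keyword set using Python's standard `urllib.parse` module. The endpoint `https://search.example.com/search` and the query parameter name `q` are placeholders assumed for this example and do not refer to any particular search engine.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_search_url(keywords, base="https://search.example.com/search"):
    """Join the saved keywords into one query string and encode it."""
    query = " ".join(keywords)
    return base + "?" + urlencode({"q": query})

saved_keywords = ["machine learning", "artificial neural network", "programming", "application"]
search_url = build_search_url(saved_keywords)

# Decoding the address again recovers the original query text.
decoded_query = parse_qs(urlparse(search_url).query)["q"][0]
```

A title or summary saved in the collection directory could be sent in the same way in place of, or together with, the keyword set.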
In step 303, the candidate webpages are checked in turn, in order of matching degree value, to see whether they are available.
The search engine can sort the candidate webpages by matching degree value and return the sorted results. The browser or the search engine can then check each candidate webpage in turn, in the search engine's order, to see whether it is available. In some implementations, checking availability comprises checking whether the candidate webpage is accessible at the current time; if so, the candidate webpage is determined to be available.
In step 304, the available candidate webpage with the highest matching degree value is taken as the target webpage.
In this embodiment, the first available candidate webpage detected in step 303, that is, the available candidate webpage with the highest matching degree value, is taken as the target webpage.
In the embodiment described with reference to Fig. 3, the browser uses the search engine to analyze a large volume of webpages and determine the target webpage related to the characteristic information. The determined target webpage is valid and contains current information, so it can provide the user with more accurate and timely content.
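Steps 303 and 304 can be sketched as follows. The availability check is passed in as a function so that the example stays self-contained; in practice it might issue a network request to the candidate address, which is an assumption and not something the description specifies. The function name `choose_target` and the example addresses are hypothetical. Unlike a strict first-hit search, this sketch checks every candidate so that the remaining available pages can also be returned.

```python
def choose_target(candidates, is_available):
    """Return the best available URL and the other available URLs.

    candidates: list of (url, matching_degree_value) pairs.
    is_available: function taking a URL and returning True if it is reachable.
    """
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    available = [url for url, _ in ordered if is_available(url)]
    if not available:
        return None, []
    return available[0], available[1:]

candidates = [
    ("https://b.example/y", 0.8),
    ("https://a.example/x", 0.9),
    ("https://c.example/z", 0.5),
]
dead_pages = {"https://a.example/x"}  # pretend this page no longer exists
target_url, other_urls = choose_target(candidates, lambda url: url not in dead_pages)
```

Here the candidate with the highest matching degree value is unavailable, so the next one in the ordering becomes the target webpage.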
Returning to Fig. 2, in step 205, the browser jumps to the target webpage.
After the target webpage is determined, the browser can obtain its URL, automatically open a new window or a new tab, and jump to the target webpage. The user can browse the target webpage in the newly opened page.
In step 206, links to other available candidate webpages are provided in a preset area of the target webpage.
In this embodiment, the browser can display the content of the target webpage in the new page. Links to other available candidate webpages can also be presented in a preset area of the target webpage (for example, a blank area). In some implementations, these links may be presented in a floating window. When the user clicks a link, the browser automatically jumps to the webpage the link points to.
Optionally, if the user does not click any of the presented links while browsing the target webpage, the browser may hide the presented links, or select one or more other links from the retrieved available candidate webpages and present those instead.
In step 207, in response to the user's click on a link to one of the other available candidate webpages, data related to the click is reported to the search engine, so as to increase the matching degree value of the clicked candidate webpage with respect to the characteristic information.
In this embodiment, the browser can detect the user's browsing behavior, such as clicking and dragging, in real time. If the user clicks a link to another candidate webpage presented in the target webpage, the user can be considered interested in the clicked link. In response to the click, the browser reports data to the search engine. After receiving the reported data, the search engine can increase the matching degree value of the clicked candidate webpage with respect to the characteristic information. For example, when the browser detects that the user has clicked a webpage link, it can send the search engine a request to increase the matching degree value between the clicked link and the characteristic information the user selected in the collection directory. The search engine can obtain a large amount of click data from multiple browsers and use it to update the matching degree values of candidate webpages with respect to the characteristic information.
In some optional implementations, the data related to the click may comprise the click time and the number of clicks. The search engine can determine the increment of the candidate webpage's matching degree value with respect to the characteristic information according to the click time and/or the number of clicks. For example, the increment may be determined by the following rule: the more clicks, the larger the increment; and the closer the click time is to the time at which the user accessed the webpage through the characteristic information in the collection, the larger the increment.
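The description gives only the direction of the rule, not a formula. The sketch below is one formula that satisfies both conditions, assumed purely for illustration: the increment grows linearly with the number of clicks and decays exponentially with the delay between accessing the webpage and clicking. The function name `match_increment` and the constants `base` and `half_life` are hypothetical.

```python
def match_increment(click_count, seconds_after_access, base=0.01, half_life=60.0):
    """Increment of the matching degree value for a clicked candidate webpage.

    More clicks give a larger increment; a click that follows the access
    more closely in time also gives a larger increment.
    """
    if click_count <= 0:
        return 0.0
    delay = max(seconds_after_access, 0.0)
    recency = 0.5 ** (delay / half_life)  # halves every half_life seconds
    return base * click_count * recency

quick_click = match_increment(1, 5)
slow_click = match_increment(1, 50)
many_clicks = match_increment(3, 5)
```

With these assumed constants, a click made 5 seconds after access contributes more than one made 50 seconds after access, and three clicks contribute three times as much as one.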
For above-described embodiment of the application, the scene of application can be: user is browsing in webpage process, if find interested information, can send the instruction of collection webpage, the characteristic information of webpage can be saved to collection catalogue by browser automatically.When user needs again to access this webpage, can click the characteristic information in collection catalogue, browser can be that search word is searched for characteristic information by search engine, draws the webpage of the multiple couplings through sequence.Afterwards, browser can carry out validation checking to the webpage of coupling, and automatically jumps to effective and that sequence sequence number is minimum webpage.In some scenes, other pages mated with characteristic information can also be recommended in the webpage newly opened to select for user.If user clicks the page recommended, then to search engine reported data, the degree of association of the page that adding users is clicked and characteristic information.
With further reference to Fig. 4, it shows a schematic diagram of the effect of accessing a collected webpage in the manner provided by the embodiments of the application. As shown in Fig. 4, the collection bar 411 of the browser 410 contains three collected information items 4111, 4112 and 4113. The content of each collected information item 4111, 4112 and 4113 can be the keyword set of the webpage collected by the user, or a one-sentence description of it. For example, the content of collected information item 4111 can be "machine learning, artificial neural network, programming, application". When the user clicks collected information item 4111, the browser 410 can invoke the search engine to search with "machine learning", "artificial neural network", "programming" and "application" as the search keywords. After the search is complete, the browser can add a new tab to the tab bar 412 and display, in the opened page 413, the webpage in the search results that is available and has the highest matching degree. In Fig. 4, the page 413 can include a region 4131 for displaying links to the other available webpages in the search results. The user can click the webpage links shown in region 4131.
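The way a collected item becomes a set of search keywords can be sketched as follows, assuming the item is stored as a delimiter-separated string like the example given for item 4111. The function name `search_keywords` and the set of accepted delimiters are assumptions made for illustration; the application does not describe a storage format.

```python
import re
from typing import List


def search_keywords(collected_item: str) -> List[str]:
    """Split a stored keyword string into individual search keywords."""
    # Accept ASCII and full-width commas and semicolons as separators.
    parts = re.split(r"[,;，；]", collected_item)
    # Trim surrounding whitespace and drop empty fragments.
    return [part.strip() for part in parts if part.strip()]


item_4111 = "machine learning, artificial neural network, programming, application"
keywords = search_keywords(item_4111)
```

The resulting list would then be passed to the search engine as the query terms.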
With further reference to Fig. 5, it shows a schematic structural diagram of a browser-based webpage collecting device according to an embodiment of the application. As shown in Fig. 5, the browser-based webpage collecting device 500 can include a receiving unit 501, an extraction unit 502 and a storage unit 503. The receiving unit 501 can be configured to receive an instruction to collect a webpage; the extraction unit 502 can be configured to extract the characteristic information of the webpage; and the storage unit 503 can be configured to automatically save the characteristic information extracted by the extraction unit 502 to the collection catalogue.
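The cooperation of the three units can be pictured with a small sketch. It assumes that the collection catalogue is an in-memory list and that the extraction unit is a pluggable function; the class name `WebpageCollectingDevice`, its method name and the helper `first_line_as_title` are hypothetical and serve only to illustrate the data flow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WebpageCollectingDevice:
    # Extraction unit 502: turns page text into characteristic information.
    extract: Callable[[str], Dict[str, str]]
    # Collection catalogue written to by storage unit 503 (an in-memory list here).
    catalogue: List[Dict[str, str]] = field(default_factory=list)

    def receive_collect_instruction(self, page_text: str) -> Dict[str, str]:
        """Receiving unit 501: handle an instruction to collect a webpage."""
        info = self.extract(page_text)   # extraction unit 502
        self.catalogue.append(info)      # storage unit 503 saves automatically
        return info


def first_line_as_title(page_text: str) -> Dict[str, str]:
    """Hypothetical extractor: uses the first non-empty line as the title."""
    for line in page_text.splitlines():
        if line.strip():
            return {"title": line.strip()}
    return {"title": ""}


device = WebpageCollectingDevice(extract=first_line_as_title)
device.receive_collect_instruction("\nNeural networks\nbody text")
```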
In some optional implementations, the extraction unit 502 can be configured to extract the characteristic information of the webpage in at least one of the following ways: extracting the high-frequency words that occur in the webpage based on statistical features and screening the high-frequency words based on semantic features to obtain the keyword set of the webpage; parsing the webpage based on text density to obtain the title of the webpage; and extracting the summary of the webpage based on semantic features.
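The application does not specify which statistical or semantic techniques are used. As a rough sketch of the keyword-set path only, the following counts word frequencies and then screens the most frequent words. A small stop-word list stands in for the semantic screening, which is a deliberate simplification; the names `keyword_set` and `STOPWORDS`, the tokenizer and the `top_n` parameter are all assumptions.

```python
import re
from collections import Counter
from typing import FrozenSet, List

# Stand-in for semantic screening: words that carry no topical meaning.
STOPWORDS: FrozenSet[str] = frozenset(
    {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}
)


def keyword_set(text: str, top_n: int = 5,
                stopwords: FrozenSet[str] = STOPWORDS) -> List[str]:
    """Return up to `top_n` frequent words that survive the screening step."""
    # Statistical step: tokenize on ASCII letters and count occurrences.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # Screening step: walk the words from most to least frequent and keep
    # those that are not stop words.
    kept = [word for word, _ in counts.most_common() if word not in stopwords]
    return kept[:top_n]


sample = "The network learns. The network of neurons and the learning network."
top = keyword_set(sample, top_n=1)
```

A real implementation for Chinese webpages would also need word segmentation, which this ASCII tokenizer does not attempt.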
In some embodiments, the browser-based webpage collecting device 500 can further include a retrieval unit 504, a jump unit 505, a recommendation unit 506 and an adjustment unit 507 (not shown). The retrieval unit 504 can be configured to, in response to the user's instruction selecting characteristic information in the collection catalogue, retrieve the characteristic information with a search engine to determine a target webpage. The jump unit 505 can be configured to jump to the target webpage. The recommendation unit 506 can be configured to provide links to other available candidate webpages in a preset region of the target webpage. The adjustment unit 507 can be configured to, in response to the user's click behavior on a link to one of the other available candidate webpages, report data related to the click behavior to the search engine so as to increase the matching degree value of the clicked candidate webpage with respect to the characteristic information. Optionally, the data related to the click behavior includes the click time and the number of clicks.
In some optional implementations, the retrieval unit 504 can be configured to determine the target webpage as follows: sending a search command containing the characteristic information to the search engine; obtaining, from the search results returned by the search engine, at least one matching candidate webpage and the corresponding matching degree value; checking in turn, in order of matching degree value, whether each candidate webpage is available; and taking the available candidate webpage with the highest matching degree value as the target webpage.
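The selection steps above can be sketched as follows, assuming the search results arrive as (URL, matching degree value) pairs and that availability is decided by a caller-supplied check. The function name `pick_target` and the `is_available` callback are hypothetical; in practice the check might issue a network request, which is left out here so that the example stays self-contained.

```python
from typing import Callable, Iterable, Optional, Tuple


def pick_target(candidates: Iterable[Tuple[str, float]],
                is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the available candidate with the highest matching degree value."""
    # Sort so that the best-matching candidate is checked first.
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    for url, _score in ranked:
        # Stop at the first candidate that passes the availability check.
        if is_available(url):
            return url
    return None  # no candidate is available


results = [("http://example.com/a", 0.9),
           ("http://example.com/b", 0.8),
           ("http://example.com/c", 0.5)]
dead_links = {"http://example.com/a"}
target = pick_target(results, lambda url: url not in dead_links)
```

Checking candidates in ranked order means the availability test runs only until the first usable page is found, rather than for every result.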
The browser-based webpage collecting device provided by the above embodiments of the application can automatically save the characteristic information of the webpage collected by the user to the collection catalogue, which ensures that the collected information remains valid. Further, through the information saved in the collection catalogue, the user can obtain information that is more up to date and more accurate.
It should be understood that the units recorded in the browser-based webpage collecting device 500 correspond to the steps of the method described with reference to Figs. 1-3. Therefore, the operations and features described above for the method apply equally to the browser-based webpage collecting device 500 and the units it contains, and are not repeated here.
As another aspect, the application also provides a computer-readable storage medium. The computer-readable storage medium can be the one included in the device described in the above embodiments, or it can exist on its own without being assembled into a terminal device. The computer-readable storage medium stores one or more programs, and the programs can include program code for performing the method shown in the flowcharts.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions and operations of systems, devices, methods and computer program products according to various embodiments of the invention. In this regard, each block in a flowchart or block diagram can represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks can occur in an order different from that marked in the drawings. For example, two blocks shown in succession can in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and any combination of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The above description covers only the preferred embodiments of the application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the application is not limited to technical solutions formed by the particular combination of the above technical features; it also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions that are disclosed in (but not limited to) the application.

Claims (14)

1. A browser-based webpage collecting method, characterized in that the method comprises:
receiving an instruction to collect a webpage;
extracting characteristic information of the webpage; and
automatically saving the characteristic information to a collection catalogue;
wherein the characteristic information comprises at least one of the following: a keyword set, a title and a summary.
2. The method according to claim 1, characterized in that extracting the characteristic information of the webpage comprises at least one of the following:
extracting high-frequency words that occur in the webpage based on statistical features, and screening the high-frequency words based on semantic features to obtain the keyword set of the webpage;
parsing the webpage based on text density to obtain the title of the webpage; and
extracting the summary of the webpage based on semantic features.
3. method according to claim 1 and 2, is characterized in that, described method also comprises:
In response to user to collection catalogue in characteristic information choose instruction, utilize characteristic information described in search engine retrieving, to determine target web; And
Jump to described target web.
4. The method according to claim 3, characterized in that retrieving the characteristic information with the search engine to determine the target webpage comprises:
sending a search command containing the characteristic information to the search engine;
obtaining, from the search results of the search engine, at least one matching candidate webpage and the corresponding matching degree value;
checking in turn, in order of the matching degree value, whether each candidate webpage is available; and
taking the available candidate webpage with the highest matching degree value as the target webpage.
5. The method according to claim 4, characterized in that the method further comprises:
providing links to other available candidate webpages in a preset region of the target webpage.
6. The method according to claim 5, characterized in that the method further comprises:
in response to the user's click behavior on a link to one of the other available candidate webpages, reporting data related to the click behavior to the search engine, so as to increase the matching degree value of the clicked candidate webpage with respect to the characteristic information.
7. The method according to claim 6, characterized in that the data related to the click behavior comprises a click time and a number of clicks.
8. A browser-based webpage collecting device, characterized in that the device comprises:
a receiving unit, configured to receive an instruction to collect a webpage;
an extraction unit, configured to extract characteristic information of the webpage, the characteristic information comprising at least one of the following: a keyword set, a title and a summary; and
a storage unit, configured to save the characteristic information to a collection catalogue.
9. The device according to claim 8, characterized in that the extraction unit is configured to extract the characteristic information of the webpage in at least one of the following ways:
extracting high-frequency words that occur in the webpage based on statistical features, and screening the high-frequency words based on semantic features to obtain the keyword set of the webpage;
parsing the webpage based on text density to obtain the title of the webpage; and
extracting the summary of the webpage based on semantic features.
10. The device according to claim 8 or 9, characterized in that the device further comprises:
a retrieval unit, configured to, in response to a user's instruction selecting characteristic information in the collection catalogue, retrieve the characteristic information with a search engine to determine a target webpage; and
a jump unit, configured to jump to the target webpage.
11. The device according to claim 10, characterized in that the retrieval unit is configured to determine the target webpage as follows:
sending a search command containing the characteristic information to the search engine;
obtaining, from the search results returned by the search engine, at least one matching candidate webpage and the corresponding matching degree value;
checking in turn, in order of the matching degree value, whether each candidate webpage is available; and
taking the available candidate webpage with the highest matching degree value as the target webpage.
12. The device according to claim 11, characterized in that the device further comprises:
a recommendation unit, configured to provide links to other available candidate webpages in a preset region of the target webpage.
13. The device according to claim 12, characterized in that the device further comprises:
an adjustment unit, configured to, in response to the user's click behavior on a link to one of the other available candidate webpages, report data related to the click behavior to the search engine, so as to increase the matching degree value of the clicked candidate webpage with respect to the characteristic information.
14. The device according to claim 13, characterized in that the data related to the click behavior comprises a click time and a number of clicks.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510316329.3A CN104915422A (en) 2015-06-10 2015-06-10 Webpage collecting method and device based on browser

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510316329.3A CN104915422A (en) 2015-06-10 2015-06-10 Webpage collecting method and device based on browser

Publications (1)

Publication Number Publication Date
CN104915422A true CN104915422A (en) 2015-09-16

Family

ID=54084485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510316329.3A Pending CN104915422A (en) 2015-06-10 2015-06-10 Webpage collecting method and device based on browser

Country Status (1)

Country Link
CN (1) CN104915422A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1912869A (en) * 2005-08-11 2007-02-14 腾讯科技(深圳)有限公司 Implementing method of network profile
CN102663064A (en) * 2012-03-30 2012-09-12 奇智软件(北京)有限公司 Method and device for processing favorite data
CN102831186A (en) * 2012-08-02 2012-12-19 深圳市同洲电子股份有限公司 Method and device for storing and searching webpage
CN103631827A (en) * 2012-08-29 2014-03-12 腾讯科技(深圳)有限公司 Method and system for synchronizing webpage information

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893584A (en) * 2016-04-03 2016-08-24 北京设集约科技有限公司 Method, client and system for displaying website label of favorites
CN108280106A (en) * 2017-03-08 2018-07-13 广州市动景计算机科技有限公司 Processing method, device and the mobile terminal of search key
CN107343104A (en) * 2017-07-19 2017-11-10 北京小米移动软件有限公司 Handle the method, apparatus and terminal device of Information on Collection
CN110020335A (en) * 2017-07-28 2019-07-16 北京搜狗科技发展有限公司 The treating method and apparatus of collection
CN110020335B (en) * 2017-07-28 2022-04-26 北京搜狗科技发展有限公司 Favorite processing method and device
CN109508430A (en) * 2018-09-27 2019-03-22 努比亚技术有限公司 Browser network address label management method, terminal and computer readable storage medium
CN113268692A (en) * 2021-05-18 2021-08-17 五八到家有限公司 Method and system for automatically collecting customer options, electronic equipment and storage medium
CN113282817A (en) * 2021-05-31 2021-08-20 武汉野途电子商务有限公司 Webpage content intelligent collection processing method and system based on webpage search engine data analysis and computer storage medium
CN113282817B (en) * 2021-05-31 2022-08-23 喀斯玛(北京)科技有限公司 Webpage content collection processing method and processing system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150916
