CN107544980B - Method and device for searching webpage - Google Patents

Info

Publication number
CN107544980B
Authority
CN
China
Prior art keywords
vector
webpage
similarity
page
vectorization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610474660.2A
Other languages
Chinese (zh)
Other versions
CN107544980A (en
Inventor
王天祎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201610474660.2A priority Critical patent/CN107544980B/en
Publication of CN107544980A publication Critical patent/CN107544980A/en
Application granted granted Critical
Publication of CN107544980B publication Critical patent/CN107544980B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the invention provide a method and a device for searching for a webpage. The method comprises: acquiring an identifier of a first webpage; looking up, in a preset page vectorization database, a first vector corresponding to the identifier of the first webpage; searching the preset page vectorization database for at least one second vector whose similarity to the first vector meets a preset similarity; and outputting the identifier of the second webpage corresponding to each of the at least one second vector in the preset page vectorization database. The preset page vectorization database stores the correspondence between webpage identifiers and vectors, and the similarity between vectors indicates the page similarity of the corresponding webpages. By using the similarity between vectors in the preset page vectorization database, at least one second webpage whose page similarity to the first webpage meets the preset requirement can be found quickly, which improves the accuracy of searching for similar webpages.

Description

Method and device for searching webpage
Technical Field
The invention relates to the technical field of internet, in particular to a method and a device for searching a webpage.
Background
With the development of the internet, a large number of users access various websites at every moment. Knowing the page similarity between webpages is of great significance to a website, where page similarity refers to how similar the contents of the webpages are.
To find a web page B similar to a web page A, the conventional approach is to inspect the content of the web pages manually, determine the page similarity of the two web pages from the relationship between the observed contents, and thereby find the web page B similar to web page A. This method of finding similar web pages is very labor intensive.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method and a device for searching a webpage, so that the accuracy of the method for searching similar webpages can be improved.
Therefore, the technical scheme for solving the technical problem is as follows:
the first aspect of the present invention provides a method for searching a web page, where the method includes:
acquiring an identifier of a first webpage;
searching a first vector corresponding to the identifier of the first webpage from a preset page vectorization database, wherein the preset page vectorization database stores the corresponding relation between the identifier of the webpage and the vector, and the similarity between the vectors is used for indicating the page similarity of the webpage;
searching at least one second vector which is in accordance with the preset similarity with the first vector from the preset page vectorization database;
and outputting the identifier of the second webpage corresponding to the at least one second vector in the preset page vectorization database.
In a first possible implementation manner of the first aspect of the present invention, the preset page vectorization database is configured by the following method:
acquiring a browsing path of a user, wherein the browsing path comprises an identifier of a webpage browsed by the user and a browsing sequence of the webpage;
taking the identification of the webpage as a word in a word vector model, and training the browsing path by using the word vector model to obtain the vector of the webpage;
and establishing a corresponding relation between the identification of the webpage and the vector of the webpage as the preset page vectorization database.
With reference to the first aspect or the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the method further includes:
outputting the at least one second vector;
and obtaining the similarity between the at least one second webpage and the first webpage according to the at least one second vector and the first vector.
With reference to the second possible implementation manner of the first aspect of the present invention, in a third possible implementation manner of the first aspect, the searching, from the preset page vectorization database, at least one second vector that meets a preset similarity with the first vector includes:
and searching at least one second vector with highest similarity and/or lowest similarity with the first vector from the preset page vectorization database.
In a fourth possible implementation form of the first aspect of the invention, the method further comprises:
acquiring an identifier of a third webpage;
searching a third vector corresponding to the identifier of the third webpage from a preset page vectorization database;
and obtaining the similarity of the first webpage and the third webpage according to the similarity of the first vector and the third vector.
A second aspect of the present invention provides an apparatus for searching a web page, the apparatus comprising:
the first acquiring unit is used for acquiring the identifier of the first webpage;
the first searching unit is used for searching a first vector corresponding to the identifier of the first webpage from a preset page vectorization database, the page vectorization database stores the corresponding relation between the identifier of the webpage and the vector, and the similarity between the vectors is used for indicating the page similarity of the webpage;
the second searching unit is used for searching at least one second vector which is consistent with the first vector in a preset similarity from the preset page vectorization database;
and the first output unit is used for outputting the identification of the second webpage corresponding to the at least one second vector in the preset page vectorization database.
In a first possible implementation of the second aspect of the invention, the apparatus further comprises:
the second acquisition unit is used for acquiring a browsing path of a user, wherein the browsing path comprises an identifier of a webpage browsed by the user and a browsing sequence of the webpage;
the training unit is used for taking the identification of the webpage as a word in a word vector model, and training the browsing path by using the word vector model to obtain the vector of the webpage;
and the establishing unit is used for establishing the corresponding relation between the identification of the webpage and the vector of the webpage as the preset page vectorization database.
With reference to the second aspect or the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the apparatus further includes:
a second output unit for outputting the at least one second vector;
a first obtaining unit, configured to obtain a similarity between the at least one second web page and the first web page according to the at least one second vector and the first vector.
In combination with the second possible embodiment of the second aspect of the present invention, in a third possible embodiment of the second aspect,
the second searching unit is configured to search the preset page vectorization database for at least one second vector with the highest similarity and/or the lowest similarity to the first vector.
In a fourth possible implementation of the second aspect of the invention, the apparatus further comprises:
a third obtaining unit, configured to obtain an identifier of a third web page;
a third searching unit, configured to search a third vector corresponding to the identifier of the third web page from a preset page vectorization database;
and the second obtaining unit is used for obtaining the similarity between the first webpage and the third webpage according to the similarity between the first vector and the third vector.
According to the technical scheme, the invention has the following beneficial effects:
the embodiment of the invention provides a method for searching a webpage, which comprises the steps of obtaining an identifier of a first webpage; searching a first vector corresponding to the identifier of the first webpage from a preset page vectorization database, wherein the page vectorization database stores the corresponding relation between the identifier of the webpage and the vector, and the similarity between the vectors is used for indicating the page similarity of the webpage; searching at least one second vector which is in accordance with the preset similarity with the first vector from the preset page vectorization database; and outputting the identifier of the second webpage corresponding to the at least one second vector in the preset page vectorization database. The preset page vectorization database stores the corresponding relation between the identification of the web page and the vectors, the similarity between the vectors is used for indicating the page similarity of the web page, and at least one second web page with the page similarity meeting the preset requirement with the first web page can be quickly found by utilizing the similarity between the vectors in the preset page vectorization database, so that the accuracy of the method for searching the similar web page is improved.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a diagram illustrating an example similarity between word vectors according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for searching a web page according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram illustrating an apparatus for searching a web page according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In accordance with an embodiment of the present application, there is provided a method embodiment of a method for finding a web page, it should be noted that the steps illustrated in the flowchart of the figure may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than that presented herein.
In order to provide an implementation scheme for improving the accuracy of a method for searching similar webpages, embodiments of the present invention provide a method and an apparatus for searching webpages, and a preferred embodiment of the present invention is described below with reference to drawings of the specification.
A word vector (Word Vector) model is a sequence-based learning model that is widely applied in fields such as Natural Language Processing (NLP). A sentence is formed from words, its smallest units, arranged in a certain order. When a word vector model is trained on the sentences of a corpus (a text file made up of many sentences), each word in the corpus is mapped to a vector ω of real values in several dimensions, and the similarity between vectors represents the similarity between words: when the similarity between two vectors is high, the distance between them is short; when the similarity is low, the distance is long. The vector ω of each word reflects the positional relationship of that word to the other words across many sentences.
For example: across different sentences, the word "Apple" always has a context similar to that of the word "Apple phone", such as: word A → word B → Apple → word C → word D, and word A → word B → Apple phone → word C → word D. After training with the word vector model, the vector ω1 of the word "Apple" and the vector ω2 of the word "Apple phone" are close, i.e., the similarity between ω1 and ω2 is high, which indicates that the similarity between the word "Apple" and the word "Apple phone" is high.
The word vector model can map words that are literally different but have the same or similar semantics to vectors ω with a high degree of similarity. This is because the positional relationship of each word across the many sentences in the corpus determines the degree of similarity between words. As another example, as shown in FIG. 1, the vector of dog is very close to the vector of puppy, so the two vectors are highly similar, which indicates that the words dog and puppy are highly similar and often occur in similar contexts. Likewise, the vector of cat is very close to the vector of kitten, so the words cat and kitten are highly similar. The vector of dog is far from the vector of cat, so their similarity is low, and the similarity between the word dog and the word cat is low. This is understandable: a puppy is a little dog, so the word puppy often appears in sentences whose context resembles that of dog, whereas cat and dog are two different animals, so the words cat and dog often appear in sentences whose contexts differ greatly.
Based on the theory of the word vector model, in the embodiment of the invention the identifier of a webpage is treated as a word in the word vector model. After training with the word vector model, each webpage in users' browsing paths is mapped to a real-valued vector ω, and the correspondence between the identifier of the webpage and the vector of the webpage is established.
Fig. 2 is a flowchart illustrating a method for searching a web page according to an embodiment of the present invention, where the method includes:
201: Acquiring the identifier of the first webpage.
202: searching a first vector corresponding to the identifier of the first webpage from a preset page vectorization database, wherein the preset page vectorization database stores the corresponding relation between the identifier of the webpage and the vector, and the similarity between the vectors is used for indicating the page similarity of the webpage.
When a user needs to find at least one second webpage that meets a preset similarity with the first webpage, the user inputs the identifier of the first webpage. The identifier of the first webpage may be the Uniform Resource Locator (URL) address of the first webpage or a user-defined name of the first webpage, as long as it uniquely identifies the first webpage.
In one example, the preset page vectorization database is set by the following method:
acquiring a browsing path of a user, wherein the browsing path comprises an identifier of a webpage browsed by the user and a browsing sequence of the webpage;
taking the identification of the webpage as a word in a word vector model, and training the browsing path by using the word vector model to obtain the vector of the webpage;
and establishing a corresponding relation between the identification of the webpage and the vector of the webpage as the preset page vectorization database.
A large number of users browse various web pages at any moment, and generally, each user browses a plurality of web pages. And arranging a plurality of webpages browsed by a user in a session according to the browsing sequence by taking the session as a reference to obtain the browsing path of the user in the session. Each browsing path is an identifier of a group of webpages arranged by a user in a session according to the browsing sequence. A large number of browsing paths formed by browsing web pages by a large number of users at different moments are obtained.
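The session-based grouping described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the log record layout `(session_id, timestamp, page_id)` and the function name `build_browsing_paths` are hypothetical and do not come from the original text.

```python
from collections import defaultdict

def build_browsing_paths(log_records):
    """Group page views by session and order them by browsing time.

    log_records: iterable of (session_id, timestamp, page_id) tuples
    (a hypothetical log layout chosen for this sketch).
    Returns a list of browsing paths, each a list of page identifiers.
    """
    by_session = defaultdict(list)
    for session_id, timestamp, page_id in log_records:
        by_session[session_id].append((timestamp, page_id))
    paths = []
    for session_id in sorted(by_session):
        # Sort the views of one session into browsing order.
        views = sorted(by_session[session_id], key=lambda v: v[0])
        paths.append([page_id for _, page_id in views])
    return paths

records = [
    ("s1", 3, "A3"), ("s1", 1, "A1"), ("s1", 2, "A2"),
    ("s2", 1, "C1"), ("s2", 2, "C2"),
]
paths = build_browsing_paths(records)
```

Each resulting path plays the role of one sentence in a corpus, with page identifiers in place of words.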
For example, the obtained browsing path is as follows:
identification of web page A1 → identification of web page A2 → identification of web page A3 → identification of web page A4 → identification of web page A5;
identification of web page A1 → identification of web page A2 → identification of web page A6 → identification of web page A4 → identification of web page A5;
identification of web page B1 → identification of web page B2 → identification of web page B3 → identification of web page B4 → identification of web page B5 → identification of web page B6 → identification of web page B7;
identification of web page C1 → identification of web page C2 → identification of web page C3
……
The identifiers of the webpages in the browsing paths are treated as words in a word vector model, and the word vector model is trained on a large number of browsing paths, so that the identifier of each webpage in the browsing paths is mapped to a real-valued vector, with each identifier corresponding to one vector. The similarity of the vectors of the web pages is used to characterize the similarity between the web pages. For words trained with the word vector model, the shorter the distance between the vectors of two words, the higher the similarity of their vectors and the more similar the semantics of the words. Similarly, in the preset page vectorization database obtained by training the word vector model on the browsing paths, web pages whose vectors are highly similar are highly similar pages, and web pages whose vectors have low similarity are dissimilar pages.
For example, following the way a word vector model trains on the words in sentences, suppose a large number of browsing paths of the following form exist:
identification of web page A1 → identification of web page A2 → identification of web page A3 → identification of web page A4 → identification of web page A5;
identification of web page A1 → identification of web page A2 → identification of web page A6 → identification of web page A4 → identification of web page A5.
Then the distance between the vector corresponding to the identifier of web page A3 and the vector corresponding to the identifier of web page A6 is short, the similarity between the two vectors is high, and this indicates that web page A3 and web page A6 are highly similar.
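The intuition that pages sharing the same neighbours end up with similar vectors can be checked with a small sketch. Note that this is not the word vector model the text describes: it is a simple count-based stand-in (each identifier is represented by counts of the identifiers seen near it), used here only because it runs without any training library. The function names and the `window` parameter are assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict

def context_vectors(paths, window=2):
    """Count-based stand-in for a word vector model: each page identifier
    is represented by counts of the identifiers seen within `window`
    positions of it across all browsing paths."""
    contexts = defaultdict(Counter)
    for path in paths:
        for i, page in enumerate(path):
            lo, hi = max(0, i - window), min(len(path), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[page][path[j]] += 1
    return contexts

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

paths = [
    ["A1", "A2", "A3", "A4", "A5"],
    ["A1", "A2", "A6", "A4", "A5"],
]
vectors = context_vectors(paths)
sim_a3_a6 = cosine(vectors["A3"], vectors["A6"])  # identical contexts
sim_a3_a1 = cosine(vectors["A3"], vectors["A1"])  # mostly different contexts
```

With these two paths, A3 and A6 have exactly the same neighbours, so their stand-in vectors coincide, while A3 and A1 overlap only partially. A trained word vector model would produce dense vectors rather than counts, but it relies on the same contextual signal.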
After training on a large number of browsing paths, the word vector model yields a vector for the identifier of each webpage in the browsing paths. The page vectorization database is generated in advance and stores the correspondence between webpage identifiers and vectors, such as: the correspondence between the identifier of web page A1 and vector a1, between the identifier of web page A2 and vector a2, between the identifier of web page A3 and vector a3, between the identifier of web page A4 and vector a4, between the identifier of web page A5 and vector a5, between the identifier of web page A6 and vector a6, etc.
It should be noted that the preset page vectorization database may be edited according to actual needs, where the editing includes any one or more of deletion, modification, and addition.
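One simple way to picture the database and its editing operations is an in-memory mapping from identifier to vector. The sketch below is an assumption for illustration only: the text does not say how the database is stored, and the vector values and helper names are made up.

```python
# Hypothetical in-memory page vectorization database: identifier -> vector.
page_vector_db = {
    "A1": [0.12, 0.80],   # made-up numbers, for illustration only
    "A2": [0.10, 0.75],
}

def add_or_modify(db, page_id, vector):
    """Add a new correspondence or overwrite an existing one."""
    db[page_id] = list(vector)

def delete(db, page_id):
    """Remove a correspondence; returns True if something was removed."""
    return db.pop(page_id, None) is not None

def lookup(db, page_id):
    """Return the vector stored for an identifier, or None if absent."""
    return db.get(page_id)

add_or_modify(page_vector_db, "A3", [0.55, 0.20])   # addition
add_or_modify(page_vector_db, "A1", [0.15, 0.82])   # modification
removed = delete(page_vector_db, "A2")              # deletion
```

A production system would more likely use a persistent key-value store or a vector index, but the identifier-to-vector correspondence is the same.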
In one example, at intervals of a preset time period, a large number of users' browsing paths within that period are obtained, the identifiers of the web pages in the browsing paths are treated as words in the word vector model, and the word vector model is trained again on these browsing paths to obtain the vectors corresponding to the identifiers of the web pages in the browsing paths. The preset page vectorization database is then updated with the newly trained vectors. For example: if the original vector stored in the preset page vectorization database for the identifier of a web page differs from the newly trained vector, the original vector is replaced by the newly trained vector. If no vector is stored in the preset page vectorization database for the identifier of a web page, the vector corresponding to that identifier is added to the preset page vectorization database.
The preset time period can be specifically set according to actual needs. For example: three months or half a year, etc.
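The replace-or-add rule of the periodic update can be sketched as a merge of two mappings. The function name `update_database` and the returned bookkeeping lists are assumptions made for this example; the vector values are made up.

```python
def update_database(db, retrained):
    """Merge vectors from a new training run into the database.

    Returns (replaced, added): identifiers whose stored vector changed and
    identifiers that were not stored before.
    """
    replaced, added = [], []
    for page_id, new_vector in retrained.items():
        if page_id not in db:
            added.append(page_id)        # no vector stored yet: add it
        elif db[page_id] != new_vector:
            replaced.append(page_id)     # stored vector differs: replace it
        else:
            continue                     # identical vector: nothing to do
        db[page_id] = new_vector
    return replaced, added

db = {"A1": [0.1, 0.2], "A2": [0.3, 0.4]}
retrained = {"A1": [0.1, 0.2], "A2": [0.35, 0.41], "A7": [0.9, 0.1]}
replaced, added = update_database(db, retrained)
```

Identifiers that appear in the database but not in the new training run are left untouched here, which matches the two cases the text lists; whether stale entries should be removed is not specified.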
In one example, if the first vector corresponding to the identifier of the first web page cannot be found in the preset page vectorization database, the browsing path containing the first web page is obtained and, together with the previously obtained browsing paths of a large number of users, is used to retrain the word vector model on demand. The first vector corresponding to the first web page is thereby obtained, and the correspondence between the identifier of the first web page and the first vector is added to the preset page vectorization database.
Of the two examples above, one automatically trains the word vector model on browsing paths at every preset time period, and the other trains the word vector model on browsing paths on demand according to actual needs. The page vectorization database is continuously updated by the approaches in these two examples.
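The on-demand path might be organised as in the sketch below. It is a hedged illustration: `get_or_train_vector` is a hypothetical name, and `toy_train` is a placeholder that merely counts occurrences so the example can run; it is not a word vector model.

```python
def get_or_train_vector(db, page_id, stored_paths, new_path, train):
    """Return the vector for page_id, retraining on demand if it is missing.

    train: callable taking a list of browsing paths and returning a dict
    {page_id: vector}; it stands in for the word vector model here.
    """
    if page_id in db:
        return db[page_id]
    # Retrain with the path containing the missing page included.
    vectors = train(stored_paths + [new_path])
    db[page_id] = vectors[page_id]   # add the new correspondence
    return db[page_id]

def toy_train(paths):
    """Placeholder trainer: maps each identifier to [number of occurrences].
    Not a real word vector model; it only makes the sketch executable."""
    counts = {}
    for path in paths:
        for page in path:
            counts[page] = counts.get(page, 0) + 1
    return {page: [float(n)] for page, n in counts.items()}

db = {"A1": [2.0]}
stored = [["A1", "A2"], ["A1", "A3"]]
vec = get_or_train_vector(db, "B9", stored, ["A1", "B9"], toy_train)
```

Only the missing identifier is written back in this sketch, mirroring the text, which adds the correspondence for the first web page.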
203: Searching the preset page vectorization database for at least one second vector that meets the preset similarity with the first vector.
Searching for at least one second vector that meets the preset similarity with the first vector amounts to searching for at least one second webpage that meets the preset similarity with the first webpage.
In one example, at least one second vector with the highest similarity to the first vector is searched from the preset page vectorization database.
That is, at least one second page with the highest similarity to the first page is found. In one example, the similarity between the first vector and every other vector in the preset page vectorization database is calculated to obtain the at least one second vector with the highest similarity to the first vector. The number of second vectors searched for equals the number of second webpages to be found: if one second webpage with the highest similarity to the first webpage is sought, one second vector with the highest similarity to the first vector is searched for; if several second webpages with the highest similarity to the first webpage are sought, the same number of second vectors with the highest similarity to the first vector are searched for.
In another example, at least one second vector having the lowest similarity to the first vector is searched from the preset page vectorization database.
That is, at least one second page with the lowest similarity to the first page is found. In one example, the similarity between the first vector and every other vector in the preset page vectorization database is calculated to obtain the at least one second vector with the lowest similarity to the first vector. The number of second vectors searched for equals the number of second webpages to be found: if one second webpage with the lowest similarity to the first webpage is sought, one second vector with the lowest similarity to the first vector is searched for; if several second webpages with the lowest similarity to the first webpage are sought, the same number of second vectors with the lowest similarity to the first vector are searched for.
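The highest- and lowest-similarity searches can be sketched with a single brute-force ranking function. This is an illustration under assumptions: cosine similarity is chosen as the measure, the database is a plain dictionary, and the vectors are made-up 2-D values.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_pages(db, first_id, k=1, highest=True):
    """Return the k identifiers whose vectors are most (or least) similar
    to the vector of first_id, comparing against every other entry."""
    first_vec = db[first_id]
    scored = ((cosine(first_vec, vec), page_id)
              for page_id, vec in db.items() if page_id != first_id)
    pick = heapq.nlargest if highest else heapq.nsmallest
    return [page_id for _, page_id in pick(k, scored)]

db = {  # made-up 2-D vectors, for illustration only
    "A3": [1.0, 0.0],
    "A6": [0.9, 0.1],
    "B1": [0.0, 1.0],
    "C1": [0.5, 0.5],
}
most_similar = rank_pages(db, "A3", k=2, highest=True)
least_similar = rank_pages(db, "A3", k=1, highest=False)
```

Here `k` corresponds to the number of second web pages to be found. Comparing against every stored vector is linear in the size of the database; a large deployment would probably use an approximate nearest-neighbour index instead.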
In yet another example, at least one second vector, the similarity of which with the first vector meets a preset interval, is searched from the preset page vectorization database.
That is, at least one second page whose similarity to the first page falls within a preset interval is found. In one example, the similarity between the first vector and every other vector in the preset page vectorization database is calculated to obtain the at least one second vector whose similarity to the first vector falls within the preset interval. The number of second vectors searched for equals the number of second webpages to be found. For example: to find N second web pages whose similarity to the first web page is greater than 50%, N second vectors whose similarity to the first vector is greater than 50% are searched for.
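The interval-based search can be sketched as a filter over all stored vectors. The closed interval `[low, high]`, the use of cosine similarity, and the made-up vectors are assumptions for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pages_in_interval(db, first_id, low, high, n=None):
    """Return up to n identifiers whose similarity to first_id lies in
    [low, high], most similar first. n=None returns all matches."""
    first_vec = db[first_id]
    matches = []
    for page_id, vec in db.items():
        if page_id == first_id:
            continue
        sim = cosine(first_vec, vec)
        if low <= sim <= high:
            matches.append((sim, page_id))
    matches.sort(reverse=True)
    return [page_id for _, page_id in matches[:n]]

db = {  # made-up 2-D vectors, for illustration only
    "A3": [1.0, 0.0],
    "A6": [0.9, 0.1],
    "B1": [0.0, 1.0],
    "C1": [0.5, 0.5],
}
above_half = pages_in_interval(db, "A3", 0.5, 1.0)
top_one = pages_in_interval(db, "A3", 0.5, 1.0, n=1)
```

The parameter `n` plays the role of N in the text: when more pages fall inside the interval than are requested, this sketch keeps the most similar ones, which is one possible tie-breaking choice rather than something the text prescribes.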
There are many possible ways to calculate the similarity between vectors. For example, the Euclidean distance between the vectors can be calculated and used to measure their similarity: the shorter the Euclidean distance, the higher the similarity between the vectors; the longer the Euclidean distance, the lower the similarity. As another example, the cosine similarity between the vectors can be calculated.
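Both measures are standard and can be written in a few lines. The sketch below uses made-up 2-D vectors named after pages A3, A6, and B1 purely for illustration.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; larger means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

a3 = [1.0, 2.0]    # made-up vectors, for illustration only
a6 = [1.0, 2.5]
b1 = [-2.0, 0.5]

d_near = euclidean_distance(a3, a6)
d_far = euclidean_distance(a3, b1)
c_near = cosine_similarity(a3, a6)
c_far = cosine_similarity(a3, b1)
```

Note the opposite orientation of the two measures: a similar pair has a small Euclidean distance but a large cosine similarity, so code that ranks by distance must sort in ascending order.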
204: Outputting the identifier of the second webpage corresponding to the at least one second vector in the preset page vectorization database.
After at least one second vector that meets the preset similarity with the first vector has been found, the identifier of the second webpage corresponding to each second vector is looked up in the preset page vectorization database. That is, if only one second vector meeting the preset similarity with the first vector is found, the identifier of the one second webpage corresponding to that second vector is output; if several second vectors meeting the preset similarity with the first vector are found, the identifier of the second webpage corresponding to each second vector is output. Each output second webpage is a webpage that meets the preset similarity with the first webpage.
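Steps 201 to 204 can be tied together in one function. This is a minimal sketch, not the claimed implementation: it assumes the "preset similarity" is a minimum cosine similarity, that the database is a dictionary, and that the vectors are the made-up values shown.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def find_similar_pages(db, first_page_id, threshold):
    # 201: the identifier of the first web page is passed in by the caller.
    # 202: look up the first vector in the page vectorization database.
    first_vec = db.get(first_page_id)
    if first_vec is None:
        raise KeyError(f"no vector stored for {first_page_id!r}")
    # 203: find second vectors meeting the preset similarity (assumed here
    #      to mean cosine similarity >= threshold).
    # 204: output the identifiers of the corresponding second web pages.
    return [page_id for page_id, vec in db.items()
            if page_id != first_page_id
            and cosine(first_vec, vec) >= threshold]

db = {  # made-up 2-D vectors, for illustration only
    "A3": [1.0, 0.0],
    "A6": [0.9, 0.1],
    "B1": [0.0, 1.0],
    "C1": [0.5, 0.5],
}
result = find_similar_pages(db, "A3", threshold=0.9)
```

Raising `KeyError` for an unknown identifier is a design choice of this sketch; a caller could catch it and fall back to retraining.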
In one example, the method further comprises:
outputting the at least one second vector;
and obtaining the similarity between the at least one second webpage and the first webpage according to the at least one second vector and the first vector.
And outputting the searched at least one second vector, and calculating the similarity between each second vector and the first vector to serve as the similarity between the second webpage corresponding to the second vector and the first webpage. This may quantify the similarity of the first web page to the at least one second web page. The method for calculating the similarity between the first vector and the at least one second vector refers to the above description, and is not repeated here.
In one example, the method further comprises:
acquiring an identifier of a third webpage;
searching a third vector corresponding to the identifier of the third webpage from a preset page vectorization database;
and obtaining the similarity of the first webpage and the third webpage according to the similarity of the first vector and the third vector.
In the embodiment of the invention, at least one other webpage that meets the preset similarity with a given webpage can be found. In addition, the identifiers of two webpages can be input, the vectors corresponding to the two identifiers looked up, and the similarity of the two vectors calculated, thereby quantifying the similarity of the two webpages.
For example:
inputting an identifier of a webpage X and an identifier of a webpage Y; searching a vector X corresponding to the identifier of the webpage X from a preset page vectorization database, and searching a vector Y corresponding to the identifier of the webpage Y; and calculating the similarity of the vector X and the vector Y as the similarity of the webpage X and the webpage Y.
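The webpage X and webpage Y example can be sketched as below. This is a hedged sketch: the dictionary-backed database, the function name `page_similarity`, and the use of cosine similarity are all assumptions made for illustration.

```python
import math


def page_similarity(page_db, page_x_id, page_y_id):
    """Look up both vectors and return their cosine similarity.

    Returns None when either identifier is not in the database.
    """
    vector_x = page_db.get(page_x_id)
    vector_y = page_db.get(page_y_id)
    if vector_x is None or vector_y is None:
        return None
    dot = sum(a * b for a, b in zip(vector_x, vector_y))
    norm_x = math.sqrt(sum(a * a for a in vector_x))
    norm_y = math.sqrt(sum(b * b for b in vector_y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical vectors for webpage X and webpage Y.
page_db = {"X": [3.0, 4.0], "Y": [4.0, 3.0]}
score = page_similarity(page_db, "X", "Y")
```

Returning `None` for an unknown identifier is a design choice of this sketch; the source does not say how a missing identifier is handled.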
Fig. 3 is a schematic structural diagram of an apparatus for searching a web page according to an embodiment of the present invention, including:
a first acquisition unit 301, configured to acquire an identifier of a first webpage.
A first searching unit 302, configured to search a first vector corresponding to the identifier of the first web page from a preset page vectorization database, where a corresponding relationship between the identifier of the web page and the vector is stored in the preset page vectorization database, and a similarity between the vectors is used to indicate a page similarity of the web page.
A second searching unit 303, configured to search at least one second vector that matches a preset similarity with the first vector from the preset page vectorization database.
The second searching unit is configured to search the preset page vectorization database for at least one second vector having the highest similarity and/or the lowest similarity with the first vector.
A first output unit 304, configured to output an identifier of a second web page corresponding to the at least one second vector in the preset page vectorization database.
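The highest-similarity and/or lowest-similarity search performed by the second searching unit might look like the following sketch. It assumes a dictionary-backed database and cosine similarity; `nearest_and_farthest` and the parameter `k` are hypothetical names.

```python
import heapq
import math


def _cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_and_farthest(page_db, first_page_id, k=1):
    """Return (k most similar page ids, k least similar page ids)."""
    first_vector = page_db[first_page_id]
    scored = [
        (_cosine(first_vector, vector), page_id)
        for page_id, vector in page_db.items()
        if page_id != first_page_id
    ]
    most = heapq.nlargest(k, scored)
    least = heapq.nsmallest(k, scored)
    return [pid for _, pid in most], [pid for _, pid in least]


# Hypothetical database: page identifier -> vector.
page_db = {
    "/a": [1.0, 0.0],
    "/b": [1.0, 0.1],
    "/c": [0.0, 1.0],
    "/d": [-1.0, 0.0],
}
most_similar, least_similar = nearest_and_farthest(page_db, "/a", k=1)
```

A linear scan is used here for clarity; for a large database an approximate nearest-neighbour index would be a natural substitute, which the source does not discuss.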
In one example, the apparatus further comprises:
the second acquisition unit is used for acquiring a browsing path of a user, wherein the browsing path comprises an identifier of a webpage browsed by the user and a browsing sequence of the webpage;
the training unit is used for taking the identification of the webpage as a word in a word vector model, and training the browsing path by using the word vector model to obtain the vector of the webpage;
and the establishing unit is used for establishing the corresponding relation between the identification of the webpage and the vector of the webpage as the preset page vectorization database.
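To illustrate how page identifiers can play the role of words, the sketch below turns browsing paths into (center, context) training pairs of the kind a skip-gram style word vector model consumes. The source only says "word vector model"; the skip-gram variant, the window size, and the name `skipgram_pairs` are assumptions, and the training step itself is not shown.

```python
def skipgram_pairs(browsing_paths, window=2):
    """Build (center page, context page) pairs from ordered browsing paths.

    Each path is a list of page identifiers in browsing order, treated
    like a sentence whose words are page identifiers.
    """
    pairs = []
    for path in browsing_paths:
        for i, center in enumerate(path):
            start = max(0, i - window)
            end = min(len(path), i + window + 1)
            for j in range(start, end):
                if j != i:
                    pairs.append((center, path[j]))
    return pairs


# One hypothetical user's browsing path.
paths = [["/home", "/products", "/cart"]]
pairs = skipgram_pairs(paths, window=1)
```

Pages that tend to be browsed close together share many context pages, which is why their trained vectors end up similar.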
In one example, the apparatus further comprises:
a second output unit for outputting the at least one second vector;
a first obtaining unit, configured to obtain a similarity between the at least one second web page and the first web page according to the at least one second vector and the first vector.
In one example, the apparatus further comprises:
a third obtaining unit, configured to obtain an identifier of a third web page;
a third searching unit, configured to search a third vector corresponding to the identifier of the third web page from a preset page vectorization database;
and the second obtaining unit is used for obtaining the similarity between the first webpage and the third webpage according to the similarity between the first vector and the third vector.
The apparatus for searching for a web page shown in fig. 3 is a device corresponding to the method for searching for a web page shown in fig. 2, and the specific implementation manner is similar to the method for searching for a web page shown in fig. 2, and reference is made to the description in the method shown in fig. 2, which is not repeated here.
The device for searching the webpage comprises a processor and a memory, wherein the first acquisition unit, the first searching unit, the second searching unit, the first output unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and by adjusting the kernel parameters, at least one second webpage whose page similarity with the first webpage meets the preset requirement can be quickly found, thereby improving the accuracy of the method for searching for similar webpages.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps:
acquiring an identifier of a first webpage;
searching a first vector corresponding to the identifier of the first webpage from a preset page vectorization database, wherein the preset page vectorization database stores the corresponding relation between the identifier of the webpage and the vector, and the similarity between the vectors is used for indicating the page similarity of the webpage;
searching at least one second vector which is in accordance with the preset similarity with the first vector from the preset page vectorization database;
and outputting the identifier of the second webpage corresponding to the at least one second vector in the preset page vectorization database.
In the computer program product, the preset page vectorization database is established by program code for the following method steps:
acquiring a browsing path of a user, wherein the browsing path comprises an identifier of a webpage browsed by the user and a browsing sequence of the webpage;
taking the identification of the webpage as a word in a word vector model, and training the browsing path by using the word vector model to obtain the vector of the webpage;
and establishing a corresponding relation between the identification of the webpage and the vector of the webpage as the preset page vectorization database.
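The last step, storing the correspondence between page identifiers and vectors, can be sketched with an in-memory SQLite table. The storage engine, the table name `page_vectors`, the JSON encoding of vectors, and the function names are all assumptions; the source does not specify how the database is stored.

```python
import json
import sqlite3


def build_page_vector_db(page_vectors):
    """Store identifier -> vector correspondences in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_vectors (page_id TEXT PRIMARY KEY, vector TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO page_vectors VALUES (?, ?)",
        [(page_id, json.dumps(vector)) for page_id, vector in page_vectors.items()],
    )
    conn.commit()
    return conn


def lookup_vector(conn, page_id):
    """Return the vector stored for page_id, or None if it is absent."""
    row = conn.execute(
        "SELECT vector FROM page_vectors WHERE page_id = ?", (page_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# Hypothetical trained vectors keyed by page identifier.
conn = build_page_vector_db({"/home": [0.1, 0.2], "/cart": [0.3, 0.4]})
home_vector = lookup_vector(conn, "/home")
```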
The computer program product described above further comprises program code for the following method steps:
outputting the at least one second vector;
and obtaining the similarity between the at least one second webpage and the first webpage according to the at least one second vector and the first vector.
In the computer program product, the program code for the method step of searching the preset page vectorization database for at least one second vector meeting a preset similarity with the first vector comprises:
and searching at least one second vector with highest similarity and/or lowest similarity with the first vector from the preset page vectorization database.
The computer program product described above further comprises program code for the following method steps:
acquiring an identifier of a third webpage;
searching a third vector corresponding to the identifier of the third webpage from a preset page vectorization database;
and obtaining the similarity of the first webpage and the third webpage according to the similarity of the first vector and the third vector.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (8)

1. A method for searching a web page, the method comprising:
acquiring an identifier of a first webpage;
searching a first vector corresponding to the identifier of the first webpage from a preset page vectorization database, wherein the preset page vectorization database stores the corresponding relation between the identifier of the webpage and the vector, the vector is formed by sequencing the identifiers of the webpages in the browsing path, and the similarity between the vectors is used for indicating the page similarity of the webpages;
searching at least one second vector which is in accordance with the preset similarity with the first vector from the preset page vectorization database;
outputting the identifier of a second webpage corresponding to the at least one second vector in the preset page vectorization database;
the preset page vectorization database is set by adopting the following method:
acquiring a browsing path of a user, wherein the browsing path comprises an identifier of a webpage browsed by the user and a browsing sequence of the webpage;
taking the identification of the webpage as a word in a word vector model, and training the browsing path by using the word vector model to obtain the vector of the webpage;
and establishing a corresponding relation between the identification of the webpage and the vector of the webpage as the preset page vectorization database.
2. The method of claim 1, further comprising:
outputting the at least one second vector;
and obtaining the similarity between the at least one second webpage and the first webpage according to the at least one second vector and the first vector.
3. The method of claim 2, wherein the searching at least one second vector from the preset page vectorization database that meets a preset similarity with the first vector comprises:
and searching at least one second vector with highest similarity and/or lowest similarity with the first vector from the preset page vectorization database.
4. The method of claim 1, further comprising:
acquiring an identifier of a third webpage;
searching a third vector corresponding to the identifier of the third webpage from a preset page vectorization database;
and obtaining the similarity of the first webpage and the third webpage according to the similarity of the first vector and the third vector.
5. An apparatus for searching for a web page, the apparatus comprising:
the first acquiring unit is used for acquiring the identifier of the first webpage;
the first searching unit is used for searching a first vector corresponding to the identifier of the first webpage from a preset page vectorization database, the page vectorization database stores the corresponding relation between the identifier of the webpage and the vector, the vector is formed by the identifier serialization of the webpage in a browsing path, and the similarity between the vectors is used for indicating the page similarity of the webpage;
the second searching unit is used for searching at least one second vector which is consistent with the first vector in a preset similarity from the preset page vectorization database;
a first output unit, configured to output an identifier of a second web page corresponding to the at least one second vector in the preset page vectorization database;
the device further comprises:
the second acquisition unit is used for acquiring a browsing path of a user, wherein the browsing path comprises an identifier of a webpage browsed by the user and a browsing sequence of the webpage;
the training unit is used for taking the identification of the webpage as a word in a word vector model, and training the browsing path by using the word vector model to obtain the vector of the webpage;
and the establishing unit is used for establishing the corresponding relation between the identification of the webpage and the vector of the webpage as the preset page vectorization database.
6. The apparatus of claim 5, further comprising:
a second output unit for outputting the at least one second vector;
a first obtaining unit, configured to obtain a similarity between the at least one second web page and the first web page according to the at least one second vector and the first vector.
7. The apparatus of claim 6,
the second searching unit is configured to search the preset page vectorization database for at least one second vector having the highest similarity and/or the lowest similarity with the first vector.
8. The apparatus of claim 5, further comprising:
a third obtaining unit, configured to obtain an identifier of a third web page;
a third searching unit, configured to search a third vector corresponding to the identifier of the third web page from a preset page vectorization database;
and the second obtaining unit is used for obtaining the similarity between the first webpage and the third webpage according to the similarity between the first vector and the third vector.
CN201610474660.2A 2016-06-24 2016-06-24 Method and device for searching webpage Active CN107544980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610474660.2A CN107544980B (en) 2016-06-24 2016-06-24 Method and device for searching webpage

Publications (2)

Publication Number Publication Date
CN107544980A (en) 2018-01-05
CN107544980B (en) 2020-07-24

Family

ID=60959879

Country Status (1)

Country Link
CN (1) CN107544980B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing
Applicant after: Beijing Guoshuang Technology Co.,Ltd.
Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing
Applicant before: Beijing Guoshuang Technology Co.,Ltd.
GR01 Patent grant