WO2022262632A1 - Web page search method, apparatus, and storage medium - Google Patents

Web page search method, apparatus, and storage medium

Info

Publication number
WO2022262632A1
Authority
WO
WIPO (PCT)
Prior art keywords
webpage
vector
semantic
webpages
information
Prior art date
Application number
PCT/CN2022/097818
Other languages
English (en)
French (fr)
Inventor
蒋昊
曹朝
张鑫宇
伍永康
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022262632A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • The present invention relates to the technical field of artificial intelligence, and in particular to a web page search method, apparatus, and storage medium.
  • Search is one of the key technologies in the Internet field and directly affects how efficiently users obtain information.
  • Search is also a key application in the ecosystem strategy of Internet giants such as Google and Baidu. For example, Google's business revenue in 2019 totaled US$160.743 billion, of which Google search advertising revenue reached US$98.115 billion, accounting for 61.0%.
  • Webpage search mainly includes the following steps: analyzing the webpages in the webpage library and indexing them into a certain space; analyzing the user input online and projecting it into the same space as the webpage library; matching the user input against the webpages within that space; and sorting by matching degree and feeding the search results back to the user.
  • the present application provides a web page search method, device and storage medium, which constructs feature information of each web page by aggregating semantic information of web pages with the same theme, thereby improving the accuracy of web page search.
  • An embodiment of the present application provides a webpage search method, including: obtaining the semantic vector of a query statement; determining, according to the semantic vector of the query statement and the feature information of each of multiple webpages, the similarity between the query statement and each webpage, where the feature information of each webpage represents first semantic aggregation information and at least one piece of second semantic aggregation information of that webpage, the first semantic aggregation information is obtained by semantically aggregating the semantic information of the multiple webpages, the at least one piece of second semantic aggregation information is obtained by semantically aggregating the semantic information of the webpages, among the multiple webpages, that have the same theme as that webpage, and, in the semantic aggregation performed for each webpage, the weight of that webpage is greater than the weights of the other webpages participating in the aggregation; and obtaining, according to the similarity between the query statement and each webpage, a query result for the query statement, where the query result is at least one of the multiple webpages.
  • The process of performing semantic aggregation for each webpage refers to the process of obtaining the first semantic aggregation information and the at least one piece of second semantic aggregation information of that webpage, that is, semantically aggregating the semantic information of the multiple webpages to obtain its first semantic aggregation information, and semantically aggregating the semantic information of the webpages that have the same theme as that webpage to obtain its second semantic aggregation information. In this process, the weight of the webpage whose first and second semantic aggregation information is being calculated is greater than the weights of the other webpages among the multiple webpages.
  • When the first semantic aggregation information of the first webpage is obtained, the semantic information of the multiple webpages may be semantically aggregated according to the weight of each of the multiple webpages. It should be understood that, during this aggregation, the weight of the first webpage is greater than the weights of the other webpages among the multiple webpages.
  • For example, the weight of the first webpage is 1; the weight of each other webpage that has a direct link relationship with the first webpage is r (r is less than 1); and the weight of each webpage that has no link relationship (direct or indirect) with the first webpage is 0.
  • Similarly, among the webpages with the same theme, the weight of the first webpage is 1 and the weights of the other webpages are less than 1. Then, based on the weights of the webpages that have the same theme as the first webpage (including the weight of the first webpage itself), the semantic information of these webpages is semantically aggregated to obtain the second semantic aggregation information of the first webpage.
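The weighting scheme described above can be illustrated with a minimal sketch. It assumes that semantic aggregation is a plain weighted sum of per-page vectors, which the text implies but does not state outright; the function names `link_weights` and `aggregate`, the value r = 0.5, and the toy vectors are all hypothetical.

```python
def link_weights(first_page, direct_neighbors, r=0.5):
    """Weights as in the example: 1 for the page itself, r (< 1) for directly
    linked pages. Pages that are not listed get weight 0 in aggregate()."""
    weights = {first_page: 1.0}
    for page in direct_neighbors:
        weights[page] = r
    return weights


def aggregate(vectors, weights):
    """Weighted sum of page vectors (an assumed form of semantic aggregation)."""
    dim = len(next(iter(vectors.values())))
    result = [0.0] * dim
    for page, vec in vectors.items():
        w = weights.get(page, 0.0)  # unlinked pages contribute nothing
        for k, x in enumerate(vec):
            result[k] += w * x
    return result


# Toy data: page A links to page B; page D has no link relationship with A.
vectors = {"A": [1.0, 0.0], "B": [0.0, 2.0], "D": [4.0, 4.0]}
weights = link_weights("A", ["B"], r=0.5)
aggregated_A = aggregate(vectors, weights)
print(aggregated_A)
```

Because page D has weight 0, its vector has no effect on the result for page A, while page A's own vector dominates.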
  • The feature information of each webpage includes the first semantic aggregation information, obtained by semantic aggregation over the multiple webpages, and the second semantic aggregation information, obtained by semantic aggregation over webpages with the same theme. The feature information of each webpage is therefore not built from that webpage's semantic information in isolation but also includes the semantic information of related webpages, which makes it richer and more accurate. This improves the matching accuracy between the query statement and the webpage, and thereby the accuracy of webpage search.
  • The first webpage is any one of the multiple webpages. The first semantic aggregation information of the first webpage is represented by a second vector, which is obtained by semantically aggregating the first vectors of the multiple webpages; the first vector of each webpage represents the semantic information of that webpage. The at least one piece of second semantic aggregation information of the first webpage is represented by at least one third vector. Each third vector corresponds to one theme included in the first webpage, and different third vectors correspond to different themes. Each third vector is obtained by semantically aggregating the first vector of the first webpage and the first vector of a second webpage, where the second webpage is a webpage, among the multiple webpages, that includes the theme corresponding to that third vector.
  • Semantic aggregation is performed on the first vectors of the multiple webpages to obtain the second vector of the first webpage, that is, the first semantic aggregation information; semantic aggregation is performed on the first vectors of webpages with the same theme to obtain the at least one third vector of the first webpage, that is, the at least one piece of second semantic aggregation information. Because only the first vectors of webpages with the same theme are aggregated when the second semantic aggregation information of each webpage is obtained, no noise is introduced during aggregation, the resulting second semantic aggregation information is relatively accurate, and the accuracy of web search improves.
  • the at least one third vector of the first webpage is further related to a topological graph, and the topological graph indicates an association relationship between multiple webpages.
  • Webpages with the same theme as each webpage can be found quickly based on the constructed topology graph, so the at least one piece of second semantic aggregation information of each webpage can be constructed quickly, which improves the efficiency of constructing the feature information of each webpage.
  • The topology graph includes at least one sub-topology graph; that is, each sub-topology graph is composed of webpages, extracted from the topology graph, that contain a theme of the first webpage. Each third vector of the at least one third vector corresponds to one sub-topology graph of the at least one sub-topology graph, and different third vectors correspond to different sub-topology graphs. The webpages in the sub-topology graph corresponding to each third vector include the first webpage and the second webpage, and each third vector is obtained by performing semantic aggregation on the first webpage and the second webpage in its corresponding sub-topology graph.
  • At least one sub-topology graph containing a theme of the first webpage is obtained from the topology graph, and semantic aggregation is then performed on the webpages in each sub-topology graph to obtain the at least one piece of second semantic aggregation information of the first webpage.
  • The sub-topology graph can be extracted directly from the topology graph without being constructed from scratch, so the at least one piece of second semantic aggregation information of the first webpage can be obtained quickly.
  • Each sub-topology graph of the at least one sub-topology graph corresponds to one webpage group of at least one webpage group, and different sub-topology graphs correspond to different webpage groups. Each webpage group is composed of the webpages that contain the topic corresponding to that group, and each sub-topology graph is formed by extracting, from the topology graph, the webpages in its corresponding webpage group.
  • The multiple webpages are first grouped by topic, and the third vector of each webpage under each webpage group is obtained; the at least one third vector of the first webpage can then be obtained quickly according to the at least one webpage group to which the first webpage belongs. Sub-topology graphs involving multiple webpages therefore do not need to be constructed repeatedly, which improves the efficiency of constructing the characteristic information of webpages.
  • For example, consider webpage A and webpage B, which share a topic. If the process starts from the topics contained in each webpage, then to obtain the third vector of webpage A under the topic, the sub-topology graph of webpage A under the topic must first be constructed, and the webpages in that sub-topology graph are then semantically aggregated to obtain the third vector of webpage A under the topic. To construct the third vector of webpage B under the same topic, the sub-topology graph must be constructed again and its webpages aggregated again.
  • By contrast, if the multiple webpages are grouped by topic first, the sub-topology graph corresponding to the webpage group of the topic can be obtained directly, and the third vectors of webpage A and webpage B can both be obtained from this one sub-topology graph, which improves the efficiency of constructing the characteristic information of webpages.
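The grouping idea can be sketched as follows: pages are grouped by topic once, and one sub-topology graph per group is extracted from the full edge set, so pages sharing a topic reuse the same subgraph. This is only an illustration under assumptions; the helper names, the edge-set representation, and the toy topics are hypothetical.

```python
from collections import defaultdict


def group_by_topic(page_topics):
    """Map each topic to the set of pages that contain it."""
    groups = defaultdict(set)
    for page, topics in page_topics.items():
        for topic in topics:
            groups[topic].add(page)
    return dict(groups)


def extract_subgraph(edges, pages):
    """Keep only the edges whose two endpoints both belong to the group."""
    return {(u, v) for (u, v) in edges if u in pages and v in pages}


# Toy data (hypothetical): an undirected topology graph given as an edge set.
page_topics = {
    "A": {"physics", "education"},
    "B": {"physics"},
    "C": {"education"},
    "D": {"politics"},
}
edges = {("A", "B"), ("B", "C"), ("A", "C")}

groups = group_by_topic(page_topics)
# One subgraph per topic, built once and shared by every page in the group.
subgraphs = {topic: extract_subgraph(edges, pages) for topic, pages in groups.items()}
print(subgraphs)
```

Here pages A and B both read the single "physics" subgraph instead of each rebuilding it.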
  • the characteristic information of each webpage further includes a first vector of each webpage, and the first vector of each webpage indicates semantic information of each webpage.
  • the characteristic information of each webpage also includes the first vector of each webpage, that is, contains the semantic information of each webpage itself, so that the characteristic information of each webpage is more abundant and accurate. Thereby, the accuracy of subsequent web page searches is improved.
  • The characteristic information of each webpage is represented by a matrix. The method further includes: converting the matrix corresponding to each webpage into a target vector according to the weight of each vector in the matrix, where the target vector indicates the characteristic information of that webpage.
  • Determining the similarity between the query statement and each webpage according to the semantic vector of the query statement and the feature information of each of the multiple webpages then includes: calculating the similarity between the semantic vector of the query statement and the target vector, where the similarity indicates the relevance between the query statement and the webpage corresponding to the target vector.
  • The multiple vectors of each webpage are represented in the form of a matrix so that they can later be converted into a target vector, which prepares for calculating the match between the query statement and the webpage.
  • Before the matrix corresponding to each webpage is converted into a target vector according to the weight of each vector in the matrix, the method further includes: determining the similarity between the semantic vector of the query statement and each vector in the matrix; and determining the weight of each vector in the matrix according to that similarity.
  • The weight of each of the multiple vectors of each webpage is determined through the self-attention mechanism, so the vectors that match the semantic vector of the query statement are retained. The resulting target vector of each webpage therefore matches the query statement better, which improves the accuracy of webpage search.
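A minimal sketch of this query-conditioned fusion is shown below. The text only says that weights come from the similarity between the query vector and each row of the matrix; using cosine similarity followed by softmax normalization is an assumption made here for illustration, and the names `cosine` and `fuse` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fuse(query_vec, matrix):
    """Collapse a page's matrix (one row per vector) into one target vector.
    Weights are a softmax over query-row similarities (assumed normalization)."""
    sims = [cosine(query_vec, row) for row in matrix]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(matrix[0])
    target = [sum(w * row[k] for w, row in zip(weights, matrix)) for k in range(dim)]
    return target, weights


query = [1.0, 0.0]
page_matrix = [[1.0, 0.0], [0.0, 1.0]]  # e.g. one row per aggregated vector
target, weights = fuse(query, page_matrix)
score = cosine(query, target)  # relevance of this page to the query
print(weights, target, score)
```

Rows that align with the query receive larger weights, so the target vector leans toward the part of the page representation that matches the query.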
  • the embodiment of the present application provides a web page search device, and the beneficial effect can be referred to the description of the first aspect, which will not be repeated here.
  • the webpage search device has the function of realizing the actions in the method example of the first aspect above. Functions can be realized by hardware, and can also be realized by executing corresponding software through hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • The webpage search device includes an acquisition unit and a processing unit. The acquisition unit is used to acquire the query statement. The processing unit is used to: acquire the semantic vector of the query statement; determine, according to the semantic vector of the query statement and the characteristic information of each of multiple webpages, the similarity between the query statement and each webpage, where the characteristic information of each webpage characterizes the first semantic aggregation information and at least one piece of second semantic aggregation information of that webpage, the first semantic aggregation information is obtained by semantically aggregating the semantic information of the multiple webpages, the at least one piece of second semantic aggregation information is obtained by semantically aggregating the semantic information of the webpages, among the multiple webpages, that have the same theme as that webpage, and the weight of each webpage in the semantic aggregation performed for it is greater than the weights of the other webpages participating in the aggregation; and obtain, according to the similarity between the query statement and each webpage, the query result of the query statement, where the query result is at least one of the multiple webpages.
  • The first webpage is any one of the multiple webpages. The first semantic aggregation information of the first webpage is represented by a second vector, which is obtained by semantically aggregating the first vectors of the multiple webpages; the first vector of each webpage represents the semantic information of that webpage. The at least one piece of second semantic aggregation information of the first webpage is represented by at least one third vector. Each third vector corresponds to one theme included in the first webpage, and different third vectors correspond to different themes. Each third vector is obtained by semantically aggregating the first vector of the first webpage and the first vector of a second webpage, where the second webpage is a webpage, among the multiple webpages, that includes the theme corresponding to that third vector.
  • the at least one third vector of the first webpage is further related to a topological graph, and the topological graph indicates an association relationship between multiple webpages.
  • The topology graph includes at least one sub-topology graph. Each third vector of the at least one third vector corresponds to one sub-topology graph of the at least one sub-topology graph, and different third vectors correspond to different sub-topology graphs. The webpages in the sub-topology graph corresponding to each third vector include the first webpage and the second webpage, and each third vector is obtained by performing semantic aggregation on the first webpage and the second webpage in its corresponding sub-topology graph.
  • Each sub-topology graph of the at least one sub-topology graph corresponds to one webpage group of at least one webpage group, and different sub-topology graphs correspond to different webpage groups. Each webpage group is composed of the webpages that contain the topic corresponding to that group, and each sub-topology graph is formed by extracting, from the topology graph, the webpages in its corresponding webpage group.
  • the characteristic information of each webpage further includes a first vector of each webpage, and the first vector of each webpage indicates semantic information of each webpage.
  • The feature information of each webpage is represented by a matrix. The processing unit is further used to convert the matrix corresponding to each webpage into a target vector according to the weight of each vector in the matrix, where the target vector indicates the feature information of that webpage.
  • In determining the similarity between the query statement and each webpage according to the semantic vector of the query statement and the feature information of each of the multiple webpages, the processing unit is specifically used to calculate the similarity between the semantic vector of the query statement and the target vector, where the similarity indicates the relevance between the query statement and the webpage corresponding to the target vector.
  • Before converting the matrix corresponding to each webpage into a target vector according to the weight of each vector in the matrix, the processing unit is further used to: determine the similarity between the semantic vector of the query statement and each vector in the matrix; and determine the weight of each vector in the matrix according to that similarity.
  • An embodiment of the present application provides a web page search device, including: a memory for storing programs; and a processor for executing the programs stored in the memory. When the programs stored in the memory are executed, the processor is used to implement the method in the first aspect above.
  • An embodiment of the present application provides a computer-readable medium that stores program code for execution by a device, where the program code includes instructions for implementing the method in the first aspect above.
  • the embodiment of the present application provides a computer program product containing instructions, and when the computer program product is run on a computer, the computer is enabled to implement the method in the first aspect above.
  • the embodiment of the present application provides a chip, the chip includes a processor and a data interface, and the processor reads the instructions stored in the memory through the data interface to implement the method in the first aspect above.
  • the chip may further include a memory, in which instructions are stored, and the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to implement the method in the first aspect above.
  • FIG. 1 is an architecture diagram of a web page search system provided by an embodiment of the present application
  • FIG. 2 is a schematic flow diagram of constructing feature information of a webpage provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a topology diagram provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a sub-topology diagram provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of obtaining the first vector of the Wikipedia page of Albert Einstein provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a topology graph that includes the Wikipedia page of Albert Einstein provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of a sub-topology graph that includes the Wikipedia page of Albert Einstein provided by an embodiment of the present application;
  • FIG. 8 is a schematic flow diagram of a web page search method provided in an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a webpage search device provided in an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of another web page search device provided by an embodiment of the present application.
  • a web page in this application can be understood as a web page document that can establish an association relationship.
  • A web page document can be a web page seen when searching for information, or it can be a document, paper, or similar item that has a citation relationship with others.
  • This application mainly uses, as an example, a web page document that is a web page seen during an information search.
  • FIG. 1 is an architecture diagram of a webpage search system provided by an embodiment of the present application.
  • the web search system 10 includes an offline module 101 and an online module 102;
  • the offline module 101 includes one or more of the following functions:
  • Data cleaning means that the offline module 101 cleans the webpage data and screens out a plurality of high-quality webpages, that is, high-quality text;
  • Initial representation generation means that the offline module 101 performs vectorized representation on each of the plurality of web pages to obtain the first vector of each web page, wherein the first vector of each web page is used to represent the semantic information of each web page;
  • Full graph construction means that the offline module 101 constructs a topology map between multiple web pages based on the association relationship between multiple web pages;
  • Subgraph construction means that the offline module 101 selects webpages with the same theme from the topology graph of the multiple webpages based on the theme of each webpage, and constructs a sub-topology graph of these webpages based on the association relationships between them;
  • Variable-length representation generation means that the offline module 101 calculates the semantic aggregation information of each webpage under the topology graph and the semantic aggregation information of each webpage under the sub-topology graph corresponding to each topic contained in the webpage, and stacks the two kinds of semantic aggregation information to generate a variable-length representation of each webpage, which serves as the characteristic information of that webpage;
  • Vector index construction means that the offline module 101 builds an index for the variable-length representation of each webpage, so as to efficiently query the variable-length representation of each webpage.
  • The online module 102 mainly includes the following functions: preprocessing, query statement representation generation, variable-length representation fusion, similarity calculation, and webpage sorting.
  • Preprocessing means that the online module 102 preprocesses the query statement (Query) input by the user to obtain a high-quality query statement. For example, the preprocessing can remove special characters from the query statement, where a special character can be a garbled character or a non-semantic character such as "@", "#", or "*";
  • the generation of query statement representation means that the online module 102 vectorizes the preprocessed query statement to obtain the semantic vector of the preprocessed query statement;
  • variable-length representation fusion means that the online module 102 fuses the variable-length representations of each webpage based on the semantic vector of the query statement to obtain the target vector of each webpage.
  • Calculation of similarity means that the online module 102 calculates the similarity between the semantic vector of the query statement and the target vector of each web page, wherein the similarity calculation methods of vectors include Euclidean distance, cosine similarity, etc.;
  • Webpage sorting means that the online module 102 sorts the similarity between multiple webpages and the query statement. For example, it can sort according to the order of similarity from large to small, so as to facilitate subsequent output of the webpage with the highest similarity.
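The online steps above (preprocessing, similarity calculation, and sorting) can be sketched as follows. This is a simplified illustration, not the described system: the query and target vectors are hard-coded stand-ins for the output of a real encoder, only the three example characters "@", "#", and "*" are stripped, and the function names are hypothetical.

```python
import math
import re


def preprocess(query):
    """Remove the example non-semantic characters and collapse whitespace."""
    cleaned = re.sub(r"[@#*]", "", query)
    return " ".join(cleaned.split())


def cosine(u, v):
    """Cosine similarity, one of the similarity measures named in the text."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_vec, target_vectors):
    """Sort pages by similarity to the query, from largest to smallest."""
    scored = [(page, cosine(query_vec, vec)) for page, vec in target_vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


clean_query = preprocess("einstein @#relativity* ")
query_vec = [1.0, 0.0]  # placeholder for the encoder output of clean_query
target_vectors = {"A": [1.0, 0.0], "B": [1.0, 1.0], "C": [0.0, 1.0]}
ranking = rank(query_vec, target_vectors)
print(clean_query, [page for page, _ in ranking])
```

The first entry of `ranking` is the webpage with the highest similarity, which is the one the sorting step makes easy to output.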
  • FIG. 2 is a schematic flowchart of constructing feature information of a webpage provided by an embodiment of the present application. The method is applied to the above-mentioned web page search system. The method includes the following steps:
  • the plurality of webpages may be all the webpages in the webpage library, or may be some webpages in the webpage library, which is not limited in this application.
  • the association relationship between webpages may be determined through hyperlinks between webpages. For example, if webpage A contains a hyperlink to webpage B, then it is determined that there is an association relationship between webpage A and webpage B.
  • the association relationship between webpages can also be determined through the text descriptions in the webpages. For example, if there is a text description about webpage B in webpage A, it is determined that there is an association relationship between webpage A and webpage B.
  • the association relationship between webpages can also be determined according to the upper-level webpages between the webpages. For example, if the upper-level webpages of webpage A and webpage B are both webpage C, then it is determined that there is an association relationship between webpage A and webpage B.
  • this application does not limit the type of association relationship between two web pages.
  • Each webpage in the plurality of webpages is used as a node. If there is an association relationship between two webpages, an edge is constructed between the two nodes corresponding to them; if there is no association relationship between two webpages, no edge is constructed between the corresponding nodes. In this way, a topology graph corresponding to the multiple webpages is obtained.
  • For example, if the plurality of webpages include webpage A, webpage B, webpage C, and webpage D, where webpage A is associated with webpage B, webpage B is associated with webpage C, and webpage D is not associated with any webpage, then the topology graph shown in FIG. 3 can be constructed according to the association relationships between the webpages.
  • The edges between associated nodes in the topology graph shown in FIG. 3 may be directed or undirected; that is, the topology graph may be a directed graph or an undirected graph.
  • Likewise, the sub-topology graph mentioned later may be a directed graph or an undirected graph, which is not limited in this application. In this application, an undirected graph is taken as an example for illustration.
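The four-page example can be written down as an undirected adjacency structure. The sketch below is illustrative only; the function name `build_topology` and the dictionary-of-sets representation are assumptions, not something the text prescribes.

```python
def build_topology(pages, associations):
    """Build an undirected graph: one node per page, one edge per association."""
    graph = {page: set() for page in pages}
    for u, v in associations:
        graph[u].add(v)
        graph[v].add(u)  # undirected: store the edge in both directions
    return graph


# The example from the text: A-B and B-C are associated, D is isolated.
graph = build_topology(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
print(graph)
```

Webpage D still appears as a node with an empty neighbor set, matching the statement that it is not associated with any webpage.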
  • A topic recognition model is invoked to acquire at least one topic included in each webpage.
  • the topic recognition model may be a Latent Dirichlet Allocation (LDA) model.
  • at least one theme included in each webpage may be "politics”, “economy”, “education”, “medical care”, and so on. It can be understood that identifying at least one theme of each webpage is the process of labeling each webpage, that is, labeling each webpage with labels such as "politics”, “economy”, “education”, and “medical care”, so each webpage At least one theme of can be indicated by at least one label stamped on each webpage.
  • the characteristic information of each webpage includes first semantic aggregation information of each webpage and at least one second semantic aggregation information corresponding to each webpage.
  • The following takes the first webpage as an example to illustrate how the first semantic aggregation information and the at least one piece of second semantic aggregation information of a webpage are obtained. The process for the other webpages is similar to that for the first webpage and is not described again.
  • the first webpage may be any webpage among multiple webpages.
  • The first semantic aggregation information of the first webpage is represented by a second vector.
  • Semantic aggregation is performed on the first vectors of the multiple webpages in the topology graph to obtain the second vector of each webpage, and thus the second vector of the first webpage. The first vector of each webpage represents the semantic information of that webpage and can be obtained by extracting the semantic information of the webpage with a trained semantic information extraction model; for example, the semantic information extraction model can be a Bert model.
  • Semantic aggregation of the first vectors of the multiple webpages is performed according to the weights of the multiple webpages in the topological graph.
  • When the second vector of the first webpage is obtained, the weight of the first webpage is 1, and the weights of the other webpages are determined by their connection relationship with the first webpage and their distance from the first webpage in the topological graph. A webpage that has no connection with the first webpage has a weight of 0.
  • The webpages having a connection relationship with the first webpage include webpages with a direct connection and webpages with an indirect connection.
  • For example, if the first webpage is webpage A in FIG. 3, webpage B is directly connected to webpage A and webpage C is indirectly connected to webpage A.
  • The distance between two webpages in the topological graph can be understood as the number of webpages between them.
  • For example, the distance between webpage C and webpage A is 1, because webpage B lies between them; the distance between webpage A and webpage B is 0, because no webpage lies between them.
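  • The second vector of the first webpage is given by formula (1), which, consistent with the symbol definitions below, is a weighted sum: $y = \sum_{i=1}^{n} \alpha_i e_i$.
  • Here $y$ is the second vector of the first webpage, $\alpha_i$ is the weight of the i-th webpage among the multiple webpages, $e_i$ is the first vector of the i-th webpage, and $n$ is the number of the multiple webpages.
  • When the i-th webpage is the first webpage, $\alpha_i = 1$; when the i-th webpage has no association with the first webpage, $\alpha_i = 0$; when the i-th webpage is associated with the first webpage, $\alpha_i = \gamma^{m+1}$, where $\gamma$ is a preset parameter less than 1 and $m$ is the number of webpages between the i-th webpage and the first webpage.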
  • The at least one second semantic aggregation information of the first webpage is represented by at least one third vector.
  • The at least one third vector may be determined according to the topological graph and the first vector of each of the multiple webpages. Each third vector corresponds to one topic included in the first webpage, and different third vectors correspond to different topics; that is, the at least one third vector is in one-to-one correspondence with the at least one topic included in the first webpage.
  • Each third vector is obtained by semantically aggregating the first vector of the first webpage and the first vector of a second webpage, where the second webpage is a webpage, among the multiple webpages, that contains the topic corresponding to that third vector.
  • The topics of each webpage in the topological graph are traversed to determine the second webpages that contain topic E, where topic E is any one of the at least one topic contained in the first webpage.
  • The second webpages containing topic E and the first webpage are extracted from the topological graph to obtain the sub-topological graph corresponding to topic E. Each sub-topological graph therefore contains the first webpage and the second webpages that share a topic with it. The same operation is performed for every topic of the first webpage, yielding at least one sub-topological graph corresponding to the first webpage.
  • Finally, for each sub-topological graph, semantic aggregation is performed on the first vectors of the webpages in it, that is, on the first vector of the first webpage and the first vectors of the second webpages, to obtain the third vector of the first webpage in that sub-topological graph. This yields the at least one third vector of the first webpage, in one-to-one correspondence with the at least one sub-topological graph.
  • All the topics of the multiple webpages are merged and deduplicated to obtain a topic set. The webpages containing a first topic are then placed in the same group, yielding multiple webpage groups, where the first topic is any topic in the topic set. In other words, in a manner similar to an inverted index, each topic in the topic set is used as a key to group the multiple webpages.
  • For example, the multiple webpages include webpage 1 and webpage 2, where webpage 1 contains topic 1, topic 2, and topic 3, and webpage 2 contains topic 1 and topic 2. Merging and deduplicating the topics gives the topic set consisting of topic 1, topic 2, and topic 3.
  • The webpage group for topic 1 consists of webpage 1 and webpage 2.
  • The webpage group for topic 2 consists of webpage 1 and webpage 2.
  • The webpage group for topic 3 consists of webpage 1.
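The inverted-index style grouping can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function name `group_pages_by_topic` and the dictionary layout are hypothetical and chosen only for the example.

```python
def group_pages_by_topic(page_topics):
    """Group webpages by topic, in the manner of an inverted index.

    page_topics maps a webpage name to the list of topics it contains.
    Returns a dict mapping each topic to the list of webpages containing it.
    """
    groups = {}
    for page, topics in page_topics.items():
        for topic in topics:
            # setdefault both deduplicates topics and creates the group on first use
            groups.setdefault(topic, []).append(page)
    return groups


# The example from the text: webpage 1 has topics 1-3, webpage 2 has topics 1-2.
page_topics = {
    "webpage 1": ["topic 1", "topic 2", "topic 3"],
    "webpage 2": ["topic 1", "topic 2"],
}
groups = group_pages_by_topic(page_topics)
topic_set = set(groups)  # the merged, deduplicated topic set
```

Each key of `groups` then identifies one webpage group, from which the corresponding sub-topological graph can be extracted once and shared by every webpage in the group.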
  • Semantic aggregation is performed on the first vectors of the webpages in each sub-topological graph, that is, on the first vector of the first webpage and the first vector of the second webpage in each sub-topological graph, to obtain the third vector of the first webpage in each sub-topological graph, and thus the at least one third vector of the first webpage.
  • the second vector of the first webpage is combined with at least one third vector corresponding to the first webpage to obtain feature information of the first webpage.
  • The second vector and the at least one third vector of the first webpage may be combined in the form of a matrix, and the combined matrix is used as the feature information of the first webpage.
  • Webpage A, webpage B, and webpage D can be extracted from the topological graph to obtain the sub-topological graph corresponding to the webpage group, that is, the sub-topological graph shown in FIG. 4.
  • Semantic aggregation of the semantic information of the webpages can be implemented with a graph neural network, such as a graph convolutional network (GCN) or a graph attention network (GAT).
  • The semantic information (that is, the first vectors) of the multiple webpages in the topological graph is aggregated through the graph neural network to obtain the second vector of each webpage. The second vector of each webpage is still obtained by aggregating the first vectors according to the webpage weights described above, which is not described again.
  • The feature information of the first webpage may further include the first vector of the first webpage; that is, the second vector of the first webpage, the at least one third vector under the at least one topic included in the first webpage, and the first vector of the first webpage together constitute the feature information of the first webpage. Because the feature information then includes the semantic information of the webpage itself, it is more accurate, which further improves the accuracy of subsequent webpage searches.
  • Data cleaning is performed on each webpage to obtain its high-quality text, and the high-quality text is input into the semantic information extraction model to obtain the first vector of the webpage. The high-quality text of a webpage is the text that is semantically complete and whose perplexity is below a threshold.
  • data cleaning may be performed on the webpage, so as to screen out multiple high-quality webpages from the webpage database, that is, multiple webpages of the present application.
  • Step 1: Download the latest Wikipedia webpage data to obtain multiple webpages.
  • Step 2: As shown in FIG. 5, obtain the text of each of the multiple webpages through data processing, then input the text of each webpage into the BERT model to obtain the first vector of each webpage.
  • Step 3: As shown in FIG. 6, construct a topological graph based on the hyperlinks among the multiple webpages.
  • The underlined words in FIG. 6 are hyperlinks in the Albert Einstein Wikipedia page. That page is thus associated with other webpages among the multiple webpages through hyperlinks, and connecting the nodes of the hyperlinked webpages yields the topological graph.
  • Each node in the topological graph carries the first vector of the corresponding webpage. In FIG. 6, the black node represents the first vector of the Albert Einstein Wikipedia page.
  • The hyperlinks between the Albert Einstein Wikipedia page and other webpages are represented by edges between nodes in the topological graph.
  • Step 4: Identify the topics of each webpage in the topological graph with the LDA topic identification model.
  • Step 5: As shown in FIG. 7, the webpages containing each topic of the Albert Einstein Wikipedia page are extracted from the topological graph to form sub-topological graphs. The number of sub-topological graphs equals the number of topics contained in the Albert Einstein Wikipedia page.
  • Sub-topological graph 1, ..., sub-topological graph n, corresponding to topic 1, ..., topic n, are extracted from the topological graph.
  • A graph neural network performs semantic aggregation on the first vectors of the webpages in the topological graph to obtain the second vector of the Albert Einstein Wikipedia page, and on the webpages in each sub-topological graph to obtain the third vector of the Albert Einstein Wikipedia page under each sub-topological graph.
  • The second vector of the Albert Einstein Wikipedia page under the topological graph and its third vectors under the sub-topological graphs are combined to obtain the feature information of the Albert Einstein Wikipedia page.
  • The variable-length nature of a webpage's representation (its feature information) lies mainly in the fact that the number of vectors in the feature information depends on the number of topics of the webpage.
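A variable-length representation of this kind can be sketched as a list of row vectors. The sketch below is only an illustration under assumptions: the helper name `build_feature_matrix` is hypothetical, and the optional `first_vector` argument reflects the variant in which the webpage's own first vector is also part of the feature information.

```python
def build_feature_matrix(second_vector, third_vectors, first_vector=None):
    """Stack a webpage's vectors into a matrix (a list of rows).

    Row 0 is the second vector (aggregation over the whole topological graph);
    the following rows are the third vectors, one per topic of the webpage.
    If first_vector is given, the webpage's own semantic vector is appended.
    """
    rows = [list(second_vector)]
    rows.extend(list(v) for v in third_vectors)
    if first_vector is not None:
        rows.append(list(first_vector))
    return rows


# A webpage with two topics yields three rows; one with a single topic yields two.
two_topics = build_feature_matrix([0.1, 0.2], [[0.3, 0.4], [0.5, 0.6]])
one_topic = build_feature_matrix([0.1, 0.2], [[0.3, 0.4]])
```

The row count varies with the number of topics while the row width (the embedding dimension) stays fixed, which is what makes the representation variable-length.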
  • FIG. 8 is a schematic flow chart of a webpage search method provided by an embodiment of the present application. The method is applied to the above-mentioned web page search system. The method includes the following steps:
  • The query statement (query) input by the user is obtained and vectorized to obtain its semantic vector, which represents the semantic information of the query statement. The vectorization can be implemented with a semantic information extraction model, for example the BERT model described above.
  • the characteristic information of each webpage among the plurality of webpages can be obtained through the characteristic information construction method shown in FIG. 2 , which will not be described again.
  • The feature information of each webpage is converted into a target vector; for example, the vectors in the feature information of each webpage are weighted according to their respective weights to obtain the target vector of the webpage. The similarity between the semantic vector of the query statement and the target vector of each webpage is then calculated to obtain the similarity between the query statement and each webpage; for example, the cosine similarity between the semantic vector of the query statement and the target vector of each webpage can be used as the similarity between the query statement and that webpage.
  • The following describes how the similarity between the query statement and the first webpage is determined.
  • First, the similarity between the semantic vector of the query statement and the second vector, and its similarity with each third vector, are determined. These similarities are then normalized to obtain the weight of the second vector and the weight of each third vector. The second vector and the at least one third vector are weighted by these weights to obtain the target vector of the first webpage. Finally, the similarity between the target vector of the first webpage and the semantic vector of the query statement is determined, which gives the similarity between the query statement and the first webpage.
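The fusion procedure just described can be sketched in Python as follows. This is a hedged illustration, not the application's implementation: the text only says that the similarities are "normalized", so the use of softmax for normalization and of cosine similarity for the per-vector similarities are assumptions, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse_feature_matrix(query_vec, feature_matrix):
    """Fuse a webpage's vectors into one target vector, guided by the query.

    1. Compute the similarity between the query vector and each row.
    2. Normalize the similarities into weights (softmax: an assumption).
    3. Return the weighted sum of the rows and the weights used.
    """
    sims = [cosine(query_vec, row) for row in feature_matrix]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query_vec)
    target = [sum(w * row[k] for w, row in zip(weights, feature_matrix))
              for k in range(dim)]
    return target, weights


# Toy example: row 0 stands in for the second vector, row 1 for one third vector.
query = [1.0, 0.0]
feature_matrix = [[1.0, 0.0], [0.0, 1.0]]
target, weights = fuse_feature_matrix(query, feature_matrix)
score = cosine(query, target)  # similarity between the query and the webpage
```

Because the weights depend on the query, rows that match the query dominate the target vector, while rows for unrelated topics contribute little.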
  • The query result is at least one of the multiple webpages.
  • The multiple webpages are sorted by their similarity to the query statement, and the first K webpages are used as the query result of the query statement; the query result can be displayed on a visual interface.
  • K is an integer greater than or equal to 1.
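Selecting the first K webpages can be sketched as below. This is a minimal illustration; the function name `top_k_pages` and the dictionary of similarity scores are assumptions made for the example, and the sort is assumed to be in descending order of similarity.

```python
def top_k_pages(similarities, k):
    """Return the k webpages most similar to the query, best first.

    similarities maps a webpage name to its similarity with the query statement.
    """
    if k < 1:
        raise ValueError("K must be an integer greater than or equal to 1")
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [page for page, _ in ranked[:k]]


result = top_k_pages({"A": 0.2, "B": 0.9, "C": 0.5}, k=2)
```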
  • When the feature information of each webpage is constructed, the semantic information of associated webpages is also fused in, rather than using only the webpage's own semantic information. In addition, only the semantic information of webpages sharing a topic with the webpage is fused, so no noise (for example, the semantic information of irrelevant webpages) is introduced during fusion, and the constructed feature information is relatively accurate. Because the feature information is accurate, matching between the query statement and the webpages is more precise, which improves webpage search accuracy and the user's search experience.
  • FIG. 9 is a schematic structural diagram of a webpage search device provided by an embodiment of the present application.
  • a web page search apparatus 900 includes an acquisition unit 901 and a processing unit 902;
  • An acquisition unit 901 configured to acquire a semantic vector of a query statement
  • The processing unit 902 is configured to determine the similarity between the query statement and each webpage according to the semantic vector of the query statement and the feature information of each of multiple webpages. The feature information of each webpage characterizes the first semantic aggregation information and at least one second semantic aggregation information of the webpage, where the first semantic aggregation information is obtained by semantically aggregating the semantic information of the multiple webpages, and the at least one second semantic aggregation information is obtained by semantically aggregating the semantic information of those webpages, among the multiple webpages, that share a topic with the webpage.
  • During semantic aggregation for each webpage, the weight of that webpage is greater than the weights of the other webpages participating in the aggregation. The processing unit 902 is further configured to obtain the query result of the query statement according to the similarity between the query statement and each webpage, where the query result is at least one of the multiple webpages.
  • FIG. 10 is a schematic structural diagram of another webpage search device provided by an embodiment of the present application.
  • the webpage search device 1000 may be the above-mentioned webpage search device; or, it may be a chip or a chip system in the above-mentioned webpage search device.
  • the web search device 1000 shown in FIG. 10 includes a memory 1001 , a processor 1002 , a communication interface 1003 and a bus 1004 .
  • the memory 1001 , the processor 1002 , and the communication interface 1003 are connected to each other through a bus 1004 .
  • the memory 1001 may be a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device or a random access memory (Random Access Memory, RAM).
  • The memory 1001 may store a program. When the program stored in the memory 1001 is executed by the processor 1002, the processor 1002 and the communication interface 1003 are used to execute the steps of the webpage search method of the embodiments of the present application.
  • The processor 1002 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is used to execute related programs to implement the functions required by the units in the webpage search device of the embodiments of the present application, or to execute the webpage search method of the method embodiments of the present application.
  • The processor 1002 may also be an integrated circuit chip with signal processing capability. During implementation, each step of the webpage search method of the present application may be completed by an integrated logic circuit of hardware in the processor 1002 or by instructions in the form of software.
  • The processor 1002 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • Various methods, steps, and logic block diagrams disclosed in the embodiments of the present application may be implemented or executed.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • The storage medium is located in the memory 1001. The processor 1002 reads the information in the memory 1001 and, in combination with its hardware, completes the functions required by the units included in the webpage search device of the embodiments of the present application, or executes the steps of the webpage search method of the method embodiments of the present application.
  • The communication interface 1003 may be a transceiving apparatus such as a transceiver, to implement communication between the webpage search device 1000 and other devices or communication networks. The communication interface 1003 may also be an input/output interface, to implement data transmission between the webpage search device 1000 and input/output devices, which include but are not limited to keyboards, mice, display screens, USB flash drives, and hard disks.
  • The bus 1004 may include a path for transmitting information between the components of the webpage search device 1000 (for example, the memory 1001, the processor 1002, and the communication interface 1003).
  • processing unit 902 is equivalent to the processor 1002 in the webpage search apparatus 1000 .
  • Although the webpage search device 1000 shown in FIG. 10 shows only a memory, a processor, and a communication interface, those skilled in the art should understand that, in a specific implementation, the webpage search device 1000 also includes other devices. Depending on specific needs, the webpage search device 1000 may also include hardware devices that implement other additional functions. In addition, the webpage search device 1000 may include only the components necessary to implement the embodiments of the present application, and does not necessarily include all the components shown in FIG. 10.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • The division of the units is only a logical function division; in actual implementation there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • “at least one” means one or more, and “multiple” means two or more.
  • “And/or” describes the association relationship of associated objects, indicating that there may be three types of relationships, for example, A and/or B, which can mean: A exists alone, A and B exist simultaneously, and B exists alone, where A, B can be singular or plural.
  • The character "/" generally indicates an "or" relationship between the objects before and after it; in the formulas of this application, the character "/" indicates a "division" relationship between the objects before and after it.
  • If the functions described above are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

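The present application discloses a webpage search method, apparatus, and storage medium. The method includes: obtaining a semantic vector of a query statement; determining the similarity between the query statement and each of multiple webpages according to the semantic vector of the query statement and the feature information of each webpage, where the feature information of each webpage characterizes first semantic aggregation information and at least one second semantic aggregation information of the webpage, the first semantic aggregation information is obtained by semantically aggregating the semantic information of the multiple webpages, the at least one second semantic aggregation information is obtained by semantically aggregating the semantic information of those webpages, among the multiple webpages, that share a topic with the webpage, and during semantic aggregation for each webpage the weight of that webpage is greater than the weights of the other webpages participating in the aggregation; and obtaining a query result of the query statement according to the similarity between the query statement and each webpage. The embodiments of the present application help improve webpage search accuracy.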

Description

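Webpage search method, apparatus, and storage medium
This application claims priority to Chinese Patent Application No. 202110683570.5, filed with the China Patent Office on June 18, 2021 and entitled "Webpage search method, apparatus, and storage medium", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular to a webpage search method, apparatus, and storage medium.
Background
Search is one of the key technologies of the Internet and directly affects how efficiently users obtain information. Search is also a key application in the ecosystems of major Internet companies such as Google and Baidu. For example, Google's revenue in 2019 totaled USD 160.743 billion, of which Google Search advertising revenue was USD 98.115 billion, or 61.0%.
Webpage search mainly includes the following steps: analyzing the webpages in a webpage library and indexing them into a space; analyzing the user input online and projecting it into the same space as the webpage library; matching the user input against the webpages in that space; and sorting by matching degree and returning the search results to the user.
Conventional webpage search performs keyword analysis and similarity calculation on the text characters of candidate webpages and the user input, which is slow and has low accuracy. To keep improving the search experience and product competitiveness, webpage search technology has been continuously iterated and improved, gradually evolving from symbolic search based on text matching to deep semantic search based on semantic matching. In deep semantic search, candidate webpages and user input are represented by a deep representation model built on a deep neural network (for example, a BERT model). The deep representation model turns explicit information, such as the text characters of both, into implicit semantic vectors, and the matching degree between semantic vectors is computed in the semantic space to complete search ranking.
Although deep semantic search can handle search in some complex semantic scenarios, the semantic vector of each webpage is determined in isolation, which affects the calculation of the matching degree and results in low search quality.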
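Summary
The present application provides a webpage search method, apparatus, and storage medium that construct the feature information of each webpage by aggregating the semantic information of webpages sharing a topic, thereby improving webpage search accuracy.
According to a first aspect, an embodiment of the present application provides a webpage search method, including: obtaining a semantic vector of a query statement; determining the similarity between the query statement and each of multiple webpages according to the semantic vector of the query statement and the feature information of each webpage, where the feature information of each webpage characterizes first semantic aggregation information and at least one second semantic aggregation information of the webpage, the first semantic aggregation information is obtained by semantically aggregating the semantic information of the multiple webpages, the at least one second semantic aggregation information is obtained by semantically aggregating the semantic information of those webpages, among the multiple webpages, that share a topic with the webpage, and during semantic aggregation for each webpage the weight of that webpage is greater than the weights of the other webpages participating in the aggregation; and obtaining a query result of the query statement according to the similarity between the query statement and each webpage, where the query result is at least one of the multiple webpages.
It should be noted that "during semantic aggregation for each webpage" refers to the process of obtaining the first semantic aggregation information and the at least one second semantic aggregation information of each webpage. That is, it covers the process of semantically aggregating the semantic information of the multiple webpages to obtain the first semantic aggregation information of "each webpage", and the process of semantically aggregating the semantic information of those webpages that share a topic with "each webpage" to obtain the at least one second semantic aggregation information. In both sub-processes, the weight of the webpage whose first and second semantic aggregation information is being computed is greater than the weights of the other webpages among the multiple webpages.
For example, consider a first webpage, which is any one of the multiple webpages. When the semantic information of the multiple webpages is aggregated to obtain the first semantic aggregation information of the first webpage, the aggregation can be performed according to the weight of each of the multiple webpages. It should be understood that during this aggregation the weight of the first webpage is greater than the weights of all other webpages. For instance, the weight of the first webpage is 1, the weight of a webpage directly linked to the first webpage is r (r less than 1), and the weight of a webpage with no link relationship (direct or indirect) to the first webpage is 0. Likewise, when the second semantic aggregation information of the first webpage is obtained, the weight of the first webpage is 1 and the weights of the other webpages sharing the topic are less than 1; the semantic information of the webpages sharing the topic (including the first webpage) is then aggregated according to these weights to obtain the second semantic aggregation information of the first webpage.
It can be seen that, in the embodiments of the present application, the feature information of each webpage includes first semantic aggregation information obtained by aggregating the multiple webpages and second semantic aggregation information obtained by aggregating webpages sharing a topic. The feature information of a webpage is therefore not built from its own semantic information in isolation but also contains the semantic information of related webpages, which makes it richer and more accurate, improves the matching accuracy between the query statement and webpages, and thus improves webpage search accuracy.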
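In some possible implementations, for a first webpage, which is any one of the multiple webpages: the first semantic aggregation information of the first webpage is represented by a second vector, which is obtained by semantically aggregating the first vector of each of the multiple webpages, where the first vector of each webpage represents the semantic information of that webpage; the at least one second semantic aggregation information of the first webpage is represented by at least one third vector; each third vector corresponds to one topic included in the first webpage, and different third vectors correspond to different topics; and each third vector is obtained by semantically aggregating the first vector of the first webpage and the first vector of a second webpage, where the second webpage is a webpage, among the multiple webpages, that contains the topic corresponding to that third vector.
It can be seen that, in this implementation, the first vectors of the multiple webpages are aggregated to obtain the second vector of the first webpage, that is, the first semantic aggregation information; then the first vectors of the webpages that share a topic with the first webpage are aggregated to obtain the at least one third vector of the first webpage, that is, the at least one second semantic aggregation information. When the second semantic aggregation information of each webpage is obtained, only the first vectors of webpages sharing a topic are aggregated, so no noise is introduced during aggregation, the second semantic aggregation information is relatively accurate, and webpage search accuracy improves.
In some possible implementations, the at least one third vector of the first webpage is also related to a topological graph, which indicates the association relationships among the multiple webpages.
It can be seen that, in this implementation, building a topological graph makes it possible to quickly find the webpages that share a topic with each webpage and to quickly construct the at least one second semantic aggregation information of each webpage, improving the efficiency of building the feature information of each webpage.
In some possible implementations, the topological graph includes at least one sub-topological graph; that is, the webpages containing a topic of the first webpage are extracted from the topological graph to form a sub-topological graph, yielding at least one sub-topological graph. Each third vector corresponds to one sub-topological graph, and different third vectors correspond to different sub-topological graphs. The webpages in the sub-topological graph corresponding to each third vector include the first webpage and the second webpage, and each third vector is obtained by semantically aggregating the first webpage and the second webpage in its corresponding sub-topological graph.
It can be seen that, in this implementation, at least one sub-topological graph containing a topic of the first webpage is obtained from the topological graph, and the webpages in each sub-topological graph are then aggregated to obtain the at least one second semantic aggregation information of the first webpage. Sub-topological graphs can be extracted directly from the topological graph without being rebuilt, so the at least one second semantic aggregation information of the first webpage can be obtained quickly.
In some possible implementations, each sub-topological graph corresponds to one webpage group among at least one webpage group, and different sub-topological graphs correspond to different webpage groups. Each webpage group consists of the webpages, among the multiple webpages, that contain the topic corresponding to that group, and each sub-topological graph is obtained by extracting the webpages of its corresponding webpage group from the topological graph.
It can be seen that the multiple webpages are first grouped by topic, so that the third vectors of the webpages in each group can be obtained; then, according to the webpage groups to which the first webpage belongs, the at least one third vector of the first webpage is obtained quickly without repeatedly building sub-topological graphs for the multiple webpages, which improves the efficiency of building feature information. For example, consider webpage A and webpage B. If one starts from the topics of each webpage, obtaining the third vector of webpage A under a topic requires first building the sub-topological graph of webpage A under that topic and then aggregating the webpages in it; obtaining the third vector of webpage B under the same topic requires building that sub-topological graph again and aggregating again. Grouping the webpages by topic first gives the sub-topological graph for the group directly and yields the third vectors of webpage A and webpage B under it at the same time, which improves the efficiency of building feature information.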
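In some possible implementations, the feature information of each webpage further includes the first vector of the webpage, which indicates the semantic information of the webpage.
It can be seen that, in this implementation, the feature information of each webpage also includes its first vector, that is, its own semantic information, which makes the feature information richer and more precise and further improves the accuracy of subsequent webpage search.
In some possible implementations, the feature information of each webpage is represented as a matrix. Before the similarity between the query statement and each webpage is determined according to the semantic vector of the query statement and the feature information of each webpage, the method further includes: converting the matrix of each webpage into a target vector according to the weights of the vectors in the matrix, where the target vector indicates the feature information of the webpage. Determining the similarity between the query statement and each webpage according to the semantic vector of the query statement and the feature information of each webpage includes: calculating the similarity between the semantic vector of the query statement and the target vector, where the similarity indicates the relevance between the query statement and the webpage corresponding to the target vector.
It can be seen that, in this implementation, representing the multiple vectors of each webpage as a matrix makes it easy to convert them into a target vector later, which sets up the calculation of the matching degree between the query statement and the webpage.
In some possible implementations, before the matrix of each webpage is converted into a target vector according to the weights of the vectors in the matrix, the method further includes: determining the similarity between the semantic vector of the query statement and each vector in the matrix; and determining the weight of each vector in the matrix according to these similarities.
It can be seen that, in this implementation, the weights of the vectors of each webpage are determined through a self-attention mechanism, so the vectors that match the semantic vector of the query statement are retained, the resulting target vector of each webpage matches the query statement more closely, and webpage search accuracy improves.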
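According to a second aspect, an embodiment of the present application provides a webpage search apparatus; for its beneficial effects, refer to the description of the first aspect, which is not repeated here. The webpage search apparatus has the function of implementing the behavior in the method examples of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function. In a possible design, the webpage search apparatus includes an acquisition unit and a processing unit. The acquisition unit is configured to obtain a query statement. The processing unit is configured to: obtain a semantic vector of the query statement; determine the similarity between the query statement and each of multiple webpages according to the semantic vector of the query statement and the feature information of each webpage, where the feature information of each webpage characterizes first semantic aggregation information and at least one second semantic aggregation information of the webpage, the first semantic aggregation information is obtained by semantically aggregating the semantic information of the multiple webpages, the at least one second semantic aggregation information is obtained by semantically aggregating the semantic information of those webpages, among the multiple webpages, that share a topic with the webpage, and during semantic aggregation for each webpage the weight of that webpage is greater than the weights of the other webpages participating in the aggregation; and obtain a query result of the query statement according to the similarity between the query statement and each webpage, where the query result is at least one of the multiple webpages.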
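In some possible implementations, for a first webpage, which is any one of the multiple webpages: the first semantic aggregation information of the first webpage is represented by a second vector, which is obtained by semantically aggregating the first vector of each of the multiple webpages, where the first vector of each webpage represents the semantic information of that webpage; the at least one second semantic aggregation information of the first webpage is represented by at least one third vector; each third vector corresponds to one topic included in the first webpage, and different third vectors correspond to different topics; and each third vector is obtained by semantically aggregating the first vector of the first webpage and the first vector of a second webpage, where the second webpage is a webpage, among the multiple webpages, that contains the topic corresponding to that third vector.
In some possible implementations, the at least one third vector of the first webpage is also related to a topological graph, which indicates the association relationships among the multiple webpages.
In some possible implementations, the topological graph includes at least one sub-topological graph. Each third vector corresponds to one sub-topological graph, and different third vectors correspond to different sub-topological graphs. The webpages in the sub-topological graph corresponding to each third vector include the first webpage and the second webpage, and each third vector is obtained by semantically aggregating the first webpage and the second webpage in its corresponding sub-topological graph.
In some possible implementations, each sub-topological graph corresponds to one webpage group among at least one webpage group, and different sub-topological graphs correspond to different webpage groups. Each webpage group consists of the webpages, among the multiple webpages, that contain the topic corresponding to that group, and each sub-topological graph is obtained by extracting the webpages of its corresponding webpage group from the topological graph.
In some possible implementations, the feature information of each webpage further includes the first vector of the webpage, which indicates the semantic information of the webpage.
In some possible implementations, the feature information of each webpage is represented as a matrix. Before determining the similarity between the query statement and each webpage according to the semantic vector of the query statement and the feature information of each webpage, the processing unit is further configured to convert the matrix of each webpage into a target vector according to the weights of the vectors in the matrix, where the target vector indicates the feature information of the webpage. In determining the similarity between the query statement and each webpage according to the semantic vector of the query statement and the feature information of each webpage, the processing unit is specifically configured to calculate the similarity between the semantic vector of the query statement and the target vector, where the similarity indicates the relevance between the query statement and the webpage corresponding to the target vector.
In some possible implementations, before converting the matrix of each webpage into a target vector according to the weights of the vectors in the matrix, the processing unit is further configured to: determine the similarity between the semantic vector of the query statement and each vector in the matrix; and determine the weight of each vector in the matrix according to these similarities.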
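According to a third aspect, an embodiment of the present application provides a webpage search apparatus, including: a memory configured to store a program; and a processor configured to execute the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to implement the method of the first aspect.
According to a fourth aspect, an embodiment of the present application provides a computer-readable medium that stores program code for execution by a device, where the program code includes instructions for implementing the method of the first aspect.
According to a fifth aspect, an embodiment of the present application provides a computer program product containing instructions that, when the computer program product runs on a computer, cause the computer to implement the method of the first aspect.
According to a sixth aspect, an embodiment of the present application provides a chip that includes a processor and a data interface, where the processor reads, through the data interface, instructions stored in a memory to implement the method of the first aspect.
Optionally, in an implementation, the chip may further include a memory that stores instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor implements the method of the first aspect.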
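Brief Description of the Drawings
FIG. 1 is an architecture diagram of a webpage search system provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of constructing the feature information of webpages provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a topological graph provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a sub-topological graph provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of obtaining the first vector of the Albert Einstein Wikipedia page provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of constructing a topological graph containing the Albert Einstein Wikipedia page provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of sub-topological graphs containing the Albert Einstein Wikipedia page provided by an embodiment of the present application;
FIG. 8 is a schematic flowchart of a webpage search method provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a webpage search apparatus provided by an embodiment of the present application;
FIG. 10 is a schematic structural diagram of another webpage search apparatus provided by an embodiment of the present application.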
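Detailed Description
First, a webpage in the present application can be understood as a web document for which association relationships can be established. For example, a web document may be a webpage seen during information search, or a document or paper with citation relationships, and so on. The present application mainly uses webpages seen during information search as an example.
Refer to FIG. 1, which is an architecture diagram of a webpage search system provided by an embodiment of the present application. The webpage search system 10 includes an offline module 101 and an online module 102.
The offline module 101 provides one or more of the following functions:
data cleaning, initial representation generation, full-graph construction, subgraph construction, variable-length representation generation, and vector index construction.
Data cleaning means that the offline module 101 cleans the webpages to screen out multiple high-quality webpages, which may be some or all of the webpages in the webpage library, and cleans the text of the webpages to screen out high-quality text.
Initial representation generation means that the offline module 101 vectorizes each of the multiple webpages to obtain its first vector, where the first vector of each webpage represents the semantic information of that webpage.
Full-graph construction means that the offline module 101 builds the topological graph among the multiple webpages based on the association relationships among them.
Subgraph construction means that the offline module 101, based on the topics of each webpage, selects from the topological graph the webpages sharing a topic and constructs a subgraph of these webpages based on the association relationships among them.
Variable-length representation generation means that the offline module 101 computes the semantic aggregation information of each webpage under the topological graph and under the sub-topological graphs corresponding to the topics the webpage contains, and stacks the two kinds of semantic aggregation information to generate the variable-length representation of the webpage, that is, its feature information.
Vector index construction means that the offline module 101 builds an index over the variable-length representations of the webpages so that they can be queried efficiently.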
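The online module 102 mainly provides the following functions: preprocessing, query representation generation, variable-length representation fusion, similarity calculation, and webpage ranking.
Preprocessing means that the online module 102 preprocesses the query statement (query) input by the user to obtain a high-quality query statement. For example, preprocessing may remove special characters from the query statement, where special characters may be garbled characters or characters without semantic meaning, such as "@", "#", and "*".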
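The preprocessing step can be sketched in Python as follows. This is only an illustration under assumptions: the character set is limited to the three examples named in the text ("@", "#", "*"), the function name `preprocess_query` is hypothetical, and replacing each special character with a space before collapsing whitespace is a design choice of this sketch rather than something the application specifies.

```python
import re

# Only the three example characters from the text; a real system would likely
# use a broader, application-specific set.
SPECIAL_CHARS = re.compile(r"[@#*]")


def preprocess_query(query):
    """Remove special characters from a query and collapse extra whitespace."""
    cleaned = SPECIAL_CHARS.sub(" ", query)
    return " ".join(cleaned.split())


clean_query = preprocess_query("  Albert @Einstein ## theory* ")
```

Replacing with a space rather than deleting keeps words that were separated only by a special character from being fused together.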
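Query representation generation means that the online module 102 vectorizes the preprocessed query statement to obtain its semantic vector.
Variable-length representation fusion means that the online module 102 fuses the variable-length representation of each webpage based on the semantic vector of the query statement to obtain the target vector of each webpage.
Similarity calculation means that the online module 102 calculates the similarity between the semantic vector of the query statement and the target vector of each webpage; vector similarity measures include Euclidean distance, cosine similarity, and so on.
Webpage ranking means that the online module 102 sorts the multiple webpages by their similarity to the query statement, for example in descending order of similarity, so that the most similar webpages can be output afterwards.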
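The following describes in detail, with reference to the accompanying drawings, the offline processing of webpages to obtain their feature information.
Refer to FIG. 2, which is a schematic flowchart of constructing feature information of a webpage according to an embodiment of this application. The method is applied to the foregoing webpage search system and includes the following steps:
201: Construct a topology graph corresponding to a plurality of webpages based on the association relationships among the webpages.
The plurality of webpages may be all or some of the webpages in the webpage library; this is not limited in this application.
Optionally, an association relationship between webpages may be determined from hyperlinks. For example, if webpage A contains a hyperlink to webpage B, an association relationship is determined to exist between webpage A and webpage B.
Optionally, an association relationship between webpages may also be determined from textual descriptions in the webpages. For example, if webpage A contains a textual description of webpage B, an association relationship is determined to exist between webpage A and webpage B.
Optionally, an association relationship between webpages may also be determined from their parent webpages. For example, if webpage C is the parent webpage of both webpage A and webpage B, an association relationship is determined to exist between webpage A and webpage B.
Therefore, this application does not limit the type of association relationship between two webpages.
Optionally, each of the plurality of webpages is used as a node. If an association relationship exists between two webpages, an edge is constructed between the two corresponding nodes; if no association relationship exists, no edge is constructed. This yields the topology graph corresponding to the plurality of webpages.
For example, suppose the plurality of webpages includes webpage A, webpage B, webpage C, and webpage D, where webpage A is associated with webpage B, webpage B is associated with webpage C, and webpage D is not associated with any webpage. The topology graph shown in FIG. 3 can then be constructed from these association relationships.
For example, the edges between associated nodes in the topology graph shown in FIG. 3 may be directed or undirected; that is, the topology graph may be a directed graph or an undirected graph. The sub-topology graphs described later may likewise be directed or undirected. None of this is limited in this application, which uses undirected graphs as an example.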
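Step 201 can be sketched as building an undirected adjacency map, using the four-page example above. The function name `build_topology` and the representation (a dictionary of sets) are illustrative assumptions, not something the application specifies.

```python
def build_topology(pages, relations):
    """Return an undirected adjacency map: page -> set of associated pages.

    Every page becomes a node, even if it has no association relationship.
    """
    graph = {page: set() for page in pages}
    for a, b in relations:
        # Undirected graph: record the edge in both directions.
        graph[a].add(b)
        graph[b].add(a)
    return graph


# The example from the text: A-B and B-C are associated, D is isolated.
pages = ["A", "B", "C", "D"]
relations = [("A", "B"), ("B", "C")]
graph = build_topology(pages, relations)
print(graph)
```

Creating a node for every page before adding edges ensures that an isolated page such as D still appears in the graph.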
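202: Determine the feature information of each webpage based on the topology graph and the at least one topic included in each webpage.
For example, a topic identification model is invoked to obtain the at least one topic included in each webpage. The topic identification model may be a Latent Dirichlet Allocation (LDA) model. For example, the topics of a webpage may be "politics", "economy", "education", "healthcare", and so on. Identifying the topics of a webpage amounts to labeling the webpage with tags such as "politics", "economy", "education", and "healthcare"; the at least one topic of each webpage can therefore be indicated by the at least one tag attached to it.
For example, the feature information of each webpage includes the first semantic aggregation information of the webpage and at least one piece of second semantic aggregation information corresponding to the webpage.
The following uses a first webpage as an example to describe how its first semantic aggregation information and at least one piece of second semantic aggregation information are obtained. The process is similar for the other webpages and is not repeated. The first webpage may be any one of the plurality of webpages.
Optionally, the first semantic aggregation information of the first webpage is represented by a second vector.
Specifically, the first vectors of the plurality of webpages in the topology graph are semantically aggregated to obtain a second vector for each webpage, including the second vector of the first webpage. The first vector of a webpage represents the semantic information of that webpage and may be obtained by extracting semantic information from the webpage with a trained semantic information extraction model, for example a Bert model.
For example, semantic aggregation of the first vectors means aggregating the first vectors of the plurality of webpages according to the weights of the webpages in the topology graph. When the second vector of the first webpage is computed, the weight of the first webpage is 1, and the weight of every other webpage is determined by whether it is connected to the first webpage and by its distance from the first webpage in the topology graph. Specifically, a webpage that has no connection to the first webpage has a weight of 0, and a webpage that is connected to the first webpage has a weight determined by its distance from the first webpage in the topology graph. Webpages connected to the first webpage include both directly connected and indirectly connected webpages. For example, if the first webpage is webpage A in FIG. 3, webpage B is directly connected to webpage A and webpage C is indirectly connected to webpage A. The distance between two webpages in the topology graph is the number of webpages that lie between them: the distance between webpage C and webpage A is 1 (webpage B lies between them), and the distance between webpage A and webpage B is 0 (no webpage lies between them).
Therefore, the second vector of the first webpage can be expressed by formula (1):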
$$y = \sum_{i=1}^{n} \alpha_i e_i \tag{1}$$
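where y is the second vector of the first webpage, $\alpha_i$ is the weight of the i-th webpage among the plurality of webpages, $e_i$ is the first vector of the i-th webpage, and n is the number of webpages.
For example, when the i-th webpage is the first webpage, $\alpha_i = 1$; when the i-th webpage has no association relationship with the first webpage, $\alpha_i = 0$; when the i-th webpage has an association relationship (direct or indirect) with the first webpage, $\alpha_i = \gamma^{m+1}$, where $\gamma$ is a preset parameter less than 1 and m is the number of webpages that lie between the i-th webpage and the first webpage.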
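The weighting rule can be made concrete with a small sketch. It assumes that the number of intervening webpages m is measured along the shortest path, so a page h edges away has m = h - 1 and weight γ^h, and the page itself (h = 0) has weight 1. The shortest-path reading, the value γ = 0.5, the two-dimensional vectors, and the names `hop_counts` and `second_vector` are all illustrative assumptions.

```python
from collections import deque


def hop_counts(graph, start):
    """Breadth-first search: number of edges from start to each reachable page."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def second_vector(graph, first_vectors, page, gamma=0.5):
    """Weighted sum of first vectors with alpha = gamma ** (m + 1), m = hops - 1.

    Unreachable pages are absent from hop_counts, which is the same as weight 0.
    """
    y = [0.0] * len(first_vectors[page])
    for other, hops in hop_counts(graph, page).items():
        alpha = gamma ** hops  # hops == 0 (the page itself) gives alpha == 1
        for k, value in enumerate(first_vectors[other]):
            y[k] += alpha * value
    return y


# Topology of the running example: A-B, B-C, D isolated.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}
# Made-up first vectors for illustration only.
first_vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0], "D": [5.0, 5.0]}

print(second_vector(graph, first_vectors, "A"))  # 1*A + 0.5*B + 0.25*C
print(second_vector(graph, first_vectors, "D"))  # isolated: equals its first vector
```

For webpage A this gives [1.25, 0.75]: webpage B contributes with weight 0.5, webpage C with weight 0.25, and webpage D not at all.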
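Note that the subsequent semantic aggregation of the first vectors of the webpages in a sub-topology graph is similar to the foregoing semantic aggregation over the topology graph and is not described again.
Optionally, the at least one piece of second semantic aggregation information of the first webpage is represented by at least one third vector, which may be determined based on the topology graph and the first vector of each of the plurality of webpages. Each third vector corresponds to one topic included in the first webpage, and different third vectors correspond to different topics; that is, the at least one third vector is in one-to-one correspondence with the at least one topic included in the first webpage. Each third vector is obtained by semantically aggregating the first vector of the first webpage and the first vectors of second webpages, where a second webpage is a webpage, among the plurality of webpages, that contains the topic corresponding to that third vector.
In one implementation of this application, the topics of each webpage in the topology graph are traversed to determine the second webpages that contain topic E, where topic E is any one of the at least one topic contained in the first webpage. The second webpages that contain topic E and the first webpage are extracted from the topology graph to obtain the sub-topology graph corresponding to topic E. Each sub-topology graph therefore contains the first webpage and the second webpages that share a topic with it. The same operation is performed for every topic of the first webpage, yielding at least one sub-topology graph corresponding to the first webpage. Finally, the first vectors of the webpages in each sub-topology graph, that is, the first vector of the first webpage and the first vectors of the second webpages, are semantically aggregated to obtain the third vector of the first webpage in that sub-topology graph, and thus at least one third vector of the first webpage across the at least one sub-topology graph. Each third vector corresponds to one sub-topology graph, and different third vectors correspond to different sub-topology graphs; that is, the at least one third vector is in one-to-one correspondence with the at least one sub-topology graph.
In another implementation of this application, the topics of all of the plurality of webpages are merged and deduplicated to obtain a topic set. The webpages that contain a first topic are then placed in the same group, yielding a plurality of webpage groups, where the first topic is any topic in the topic set. In other words, similar to an inverted index, each topic in the topic set is used as a key for grouping the webpages. For example, suppose the plurality of webpages includes webpage 1 and webpage 2, where webpage 1 includes topic 1, topic 2, and topic 3, and webpage 2 includes topic 1 and topic 2. Merging and deduplicating the topics gives the topic set {topic 1, topic 2, topic 3}. The webpage group for topic 1 consists of webpage 1 and webpage 2, the webpage group for topic 2 consists of webpage 1 and webpage 2, and the webpage group for topic 3 consists of webpage 1.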
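The inverted-index style grouping can be sketched as follows, reusing the two-page example from the text. The function name `group_by_topic` and the string identifiers are hypothetical.

```python
def group_by_topic(page_topics):
    """Map each topic to the list of pages that contain it (inverted index)."""
    groups = {}
    for page, topics in page_topics.items():
        for topic in topics:
            # The dictionary keys form the merged, deduplicated topic set.
            groups.setdefault(topic, []).append(page)
    return groups


page_topics = {
    "page1": ["topic1", "topic2", "topic3"],
    "page2": ["topic1", "topic2"],
}
groups = group_by_topic(page_topics)
print(groups)
```

The keys of the resulting dictionary are the topic set, and each value is the webpage group for that topic.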
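Next, the at least one webpage group corresponding to the first webpage is determined among the plurality of webpage groups based on the at least one topic of the first webpage. The webpages in each of these webpage groups are then extracted from the topology graph to obtain the sub-topology graph corresponding to that group, and thus at least one sub-topology graph corresponding to the at least one webpage group. Each sub-topology graph corresponds to one webpage group, and different sub-topology graphs correspond to different webpage groups; that is, the at least one webpage group is in one-to-one correspondence with the at least one sub-topology graph. Finally, the first vectors of the webpages in each sub-topology graph, that is, the first vector of the first webpage and the first vectors of the second webpages, are semantically aggregated to obtain the third vector of the first webpage in that sub-topology graph, and thus at least one third vector of the first webpage across the at least one sub-topology graph.
Finally, the second vector of the first webpage is combined with the at least one third vector corresponding to the first webpage to obtain the feature information of the first webpage. For example, the second vector and the at least one third vector of the first webpage may be combined in the form of a matrix, and the resulting matrix is used as the feature information of the webpage.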
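A minimal sketch of the matrix form follows, assuming the second vector is the first row and each third vector is a further row. The row order, the topic names "physics" and "biography", the vector values, and the name `build_feature_matrix` are illustrative assumptions.

```python
def build_feature_matrix(second_vector, third_vectors):
    """Stack the second vector and the per-topic third vectors as matrix rows."""
    dim = len(second_vector)
    for topic, vec in third_vectors.items():
        if len(vec) != dim:
            raise ValueError(
                f"vector for {topic!r} has length {len(vec)}, expected {dim}"
            )
    return [list(second_vector)] + [list(vec) for vec in third_vectors.values()]


# Hypothetical page with two topics: the matrix has 1 + 2 = 3 rows.
features = build_feature_matrix(
    [1.25, 0.75],
    {"physics": [1.0, 0.5], "biography": [0.5, 1.0]},
)
print(len(features))
```

The number of rows is one plus the number of topics of the page, so pages with different numbers of topics have matrices of different sizes.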
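It should be noted that extracting webpages from the topology graph to form a sub-topology graph does not change the association relationships among those webpages in the topology graph.
For example, if a webpage group contains webpage A, webpage B, and webpage D, these webpages can be extracted from the topology graph to obtain the sub-topology graph corresponding to that group, as shown in FIG. 4.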
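Extraction that preserves the existing association relationships corresponds to taking an induced subgraph. The sketch below assumes the adjacency-map representation used for illustration (page mapped to a set of neighbors); the function name `extract_subgraph` is hypothetical.

```python
def extract_subgraph(graph, group):
    """Induced subgraph: keep only pages in the group and the edges among them."""
    keep = set(group)
    return {page: graph[page] & keep for page in graph if page in keep}


# Topology of the running example: A-B, B-C, D isolated.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}
subgraph = extract_subgraph(graph, ["A", "B", "D"])
print(subgraph)
```

In the result, the edge between A and B is kept, the edge from B to C disappears because C is not in the group, and D remains isolated. No edge is added that was absent from the topology graph.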
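In one implementation of this application, semantic aggregation of the semantic information of webpages may be implemented by a graph neural network, which may be a graph convolutional network (Graph Convolutional Network, GCN), a graph attention network (Graph Attention Network, GAT), or the like. For example, to semantically aggregate the first vectors of the webpages in the topology graph, the topology graph (that is, the association relationships among the webpages) and the first vector of each webpage in it are used as input to the graph neural network, which aggregates the semantic information (that is, the first vectors) of the webpages to produce the second vector of each webpage. The second vector of each webpage is likewise obtained by aggregating the first vectors according to the webpage weights described above; details are not repeated.
It should be understood that during semantic aggregation by the graph neural network, only the semantic information of nodes that are directly or indirectly connected to a given node is aggregated into that node. As shown in FIG. 3, for webpage A in the topology graph, the semantic information of webpage B and webpage C is aggregated with that of webpage A to obtain the second vector of webpage A, and the semantic information of webpage D is not aggregated. For a completely isolated webpage in the topology graph, such as webpage D, the second vector is simply its first vector.
In one implementation of this application, the feature information of the first webpage further includes the first vector of the first webpage. That is, the second vector of the first webpage, the at least one third vector under the at least one topic included in the first webpage, and the first vector of the first webpage together form the feature information of the first webpage. Because the feature information then contains the semantic information of the webpage itself, the constructed feature information is more precise, which further improves the accuracy of subsequent webpage search.
In one implementation of this application, before the first vector of each webpage is obtained, the webpage is cleaned to obtain its high-quality text, and the high-quality text is input into the semantic information extraction model to obtain the first vector of the webpage. The high-quality text of a webpage is text that is semantically complete and whose perplexity is below a threshold.
In one implementation of this application, before the feature information of each webpage is constructed, the webpages may first be cleaned to select a plurality of high-quality webpages from the webpage library; these are the plurality of webpages referred to in this application.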
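The following uses the Wikipedia page of Albert Einstein as the first webpage to illustrate how the feature information of a webpage is constructed.
Step 1: Download the latest Wikipedia webpage data to obtain a plurality of webpages.
Step 2: As shown in FIG. 5, obtain the text of each webpage through data processing, and then input the text of each webpage into the Bert model to obtain the first vector of each webpage.
Step 3: As shown in FIG. 6, construct the topology graph based on the hyperlinks of the webpages. Each underlined word in FIG. 6 is a hyperlink in the Wikipedia page of Albert Einstein. The Albert Einstein page is thus associated with other webpages through hyperlinks, and connecting its node to the nodes of the webpages it is hyperlinked with yields the topology graph. Each node in the topology graph carries the first vector of the corresponding webpage; for example, the black node in FIG. 6 represents the first vector of the Albert Einstein page. The hyperlinks between the Albert Einstein page and other webpages appear in the topology graph as edges between nodes.
Step 4: Identify the topics of each webpage in the topology graph with the LDA topic identification model.
Step 5: As shown in FIG. 7, for each topic contained in the Albert Einstein page, extract from the topology graph the webpages that contain that topic to form a sub-topology graph; the number of sub-topology graphs equals the number of topics contained in the Albert Einstein page. As shown in FIG. 7, sub-topology graph 1, ..., sub-topology graph n, corresponding to topic 1, ..., topic n respectively, are extracted from the topology graph. A graph neural network semantically aggregates the first vectors of the webpages in the topology graph to obtain the second vector of the Albert Einstein page, and semantically aggregates the webpages in each sub-topology graph to obtain the third vector of the Albert Einstein page under that sub-topology graph. The second vector under the topology graph and the third vectors under the sub-topology graphs are then combined to obtain the feature information of the Albert Einstein page.
As the construction of the feature information of the Albert Einstein page shows, the representation (feature information) of a webpage is variable-length in that the number of vectors it contains depends on the number of topics of the webpage.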
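Refer to FIG. 8, which is a schematic flowchart of a webpage search method according to an embodiment of this application. The method is applied to the foregoing webpage search system and includes the following steps:
801: Obtain the semantic vector of a query statement.
For example, the query statement (Query) entered by the user is obtained and converted into a vector to obtain its semantic vector, which represents the semantic information of the query statement. The conversion may be performed by a semantic information extraction model, for example the Bert model described above.
802: Determine the similarity between the query statement and each webpage based on the semantic vector of the query statement and the feature information of each of the plurality of webpages.
The feature information of each of the plurality of webpages may be obtained by the feature information construction method shown in FIG. 2, which is not described again.
For example, the feature information of each webpage is converted into a target vector according to the weights of the vectors in the feature information: the vectors in the feature information are weighted by their respective weights to obtain the target vector of the webpage. The similarity between the semantic vector of the query statement and the target vector of each webpage is then computed to obtain the similarity between the query statement and that webpage. For example, the cosine similarity between the semantic vector of the query statement and the target vector of each webpage may be computed and used as the similarity between the query statement and the webpage.
Specifically, the similarity between the semantic vector of the query statement and each semantic vector in the feature information of the webpage is determined, these similarities are normalized, and the normalized results are used as the weights of the respective semantic vectors.
The following describes how the similarity between the query statement and the first webpage is determined, using as an example the case in which the feature information of the first webpage includes the second vector and at least one third vector corresponding to the first webpage.
For example, the similarity between the semantic vector of the query statement and the second vector, and the similarity between the semantic vector of the query statement and each third vector, are determined. These similarities are then normalized to obtain the weight corresponding to the second vector and the weight corresponding to each third vector. The second vector and the at least one third vector are weighted by their corresponding weights to obtain the target vector of the first webpage. Finally, the similarity between the target vector of the first webpage and the semantic vector of the query statement is determined to obtain the similarity between the query statement and the first webpage.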
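The fusion procedure above can be sketched in a few lines. The text only says the similarities are "normalized"; the sketch assumes softmax normalization and cosine similarity, and the names `cosine`, `softmax`, and `page_similarity` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(scores):
    """Numerically stable softmax (assumed normalization; the text does not name one)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def page_similarity(query_vec, feature_matrix):
    """Fuse the rows of a feature matrix into a target vector and score it."""
    # 1. Similarity between the query and every vector in the feature information.
    sims = [cosine(query_vec, row) for row in feature_matrix]
    # 2. Normalize the similarities into weights.
    weights = softmax(sims)
    # 3. Weighted sum of the rows gives the target vector.
    target = [
        sum(w * row[k] for w, row in zip(weights, feature_matrix))
        for k in range(len(query_vec))
    ]
    # 4. Similarity between the query and the target vector.
    return cosine(query_vec, target)


score = page_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(round(score, 3))
```

Because the weights depend on the query statement, the same feature matrix produces a different target vector for each query, emphasizing the rows closest to what the user asked.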
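803: Obtain the query result of the query statement based on the similarity between the query statement and each webpage.
The query result is at least one of the plurality of webpages. For example, the plurality of webpages are sorted in descending order of their similarity to the query statement, the top K webpages are used as the query result, and the query result may be displayed in a visual interface, where K is an integer greater than or equal to 1.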
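Selecting the top K webpages can be sketched with the standard library. The function name `top_k` and the example scores are made up for illustration.

```python
import heapq


def top_k(similarities, k):
    """Return the k (page, similarity) pairs with the highest similarity, best first."""
    if k < 1:
        raise ValueError("k must be an integer greater than or equal to 1")
    return heapq.nlargest(k, similarities.items(), key=lambda item: item[1])


# Hypothetical similarity scores between one query statement and three pages.
similarities = {"A": 0.2, "B": 0.9, "C": 0.5}
print(top_k(similarities, 2))
```

`heapq.nlargest` avoids sorting the whole collection when K is much smaller than the number of webpages; a full sort in descending order gives the same result.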
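It can be seen that, in the embodiments of this application, the feature information of each webpage is constructed by also fusing the semantic information of the webpages associated with it, rather than using only the semantic information of the webpage itself. Moreover, only the semantic information of webpages that share a topic with the webpage is fused, so that no noise (for example, the semantic information of unrelated webpages) is introduced during fusion, and the constructed feature information is therefore relatively precise. Because the constructed feature information is relatively precise, matching between the query statement and the webpages is more accurate, which improves webpage search accuracy and the user's search experience.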
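Refer to FIG. 9, which is a schematic structural diagram of a webpage search apparatus according to an embodiment of this application. As shown in FIG. 9, the webpage search apparatus 900 includes an obtaining unit 901 and a processing unit 902.
The obtaining unit 901 is configured to obtain the semantic vector of a query statement.
The processing unit 902 is configured to determine the similarity between the query statement and each webpage based on the semantic vector of the query statement and the feature information of each of a plurality of webpages, where the feature information of each webpage represents the first semantic aggregation information and at least one piece of second semantic aggregation information of the webpage, the first semantic aggregation information is obtained by semantically aggregating the semantic information of the plurality of webpages, the at least one piece of second semantic aggregation information is obtained by semantically aggregating the semantic information of the webpages, among the plurality of webpages, that have the same topic as the webpage, and during semantic aggregation for each webpage the weight of that webpage is greater than the weights of the other webpages participating in the semantic aggregation; and to obtain the query result of the query statement based on the similarity between the query statement and each webpage, where the query result is at least one of the plurality of webpages.
For a more detailed description of the obtaining unit 901 and the processing unit 902, refer to the related description in the foregoing method embodiments; details are not repeated here.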
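Refer to FIG. 10, which is a schematic structural diagram of another webpage search apparatus according to an embodiment of this application. The webpage search apparatus 1000 may be the foregoing webpage search apparatus, or a chip or chip system in the foregoing webpage search apparatus.
The webpage search apparatus 1000 shown in FIG. 10 includes a memory 1001, a processor 1002, a communication interface 1003, and a bus 1004. The memory 1001, the processor 1002, and the communication interface 1003 are communicatively connected to one another through the bus 1004.
The memory 1001 may be a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM). The memory 1001 may store a program. When the program stored in the memory 1001 is executed by the processor 1002, the processor 1002 and the communication interface 1003 are configured to perform the steps of the webpage search method in the embodiments of this application.
The processor 1002 may be a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a graphics processing unit (Graphics Processing Unit, GPU), or one or more integrated circuits, and is configured to execute a related program to implement the functions to be performed by the units in the webpage search apparatus in the embodiments of this application, or to perform the webpage search method in the method embodiments of this application.
The processor 1002 may also be an integrated circuit chip with signal processing capability. In an implementation process, the steps of the webpage search method in this application may be completed by an integrated logic circuit of hardware in the processor 1002 or by instructions in the form of software. The processor 1002 may also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1001. The processor 1002 reads the information in the memory 1001 and, in combination with its hardware, completes the functions to be performed by the units included in the webpage search apparatus in the embodiments of this application, or performs the steps of the webpage search method in the method embodiments of this application.
The communication interface 1003 may be a transceiver apparatus such as a transceiver, which implements communication between the webpage search apparatus 1000 and other devices or communication networks. The communication interface 1003 may also be an input-output interface, which implements data transmission between the webpage search apparatus 1000 and input-output devices, where the input-output devices include but are not limited to a keyboard, a mouse, a display, a USB flash drive, and a hard disk.
The bus 1004 may include a path for transmitting information between the components of the webpage search apparatus 1000 (for example, the memory 1001, the processor 1002, and the communication interface 1003).
It should be understood that the foregoing processing unit 902 corresponds to the processor 1002 in the webpage search apparatus 1000.
It should be noted that although the webpage search apparatus 1000 shown in FIG. 10 shows only a memory, a processor, and a communication interface, a person skilled in the art should understand that, in a specific implementation, the webpage search apparatus 1000 further includes other components necessary for normal operation. A person skilled in the art should also understand that, depending on specific needs, the webpage search apparatus 1000 may further include hardware components that implement other additional functions. In addition, a person skilled in the art should understand that the webpage search apparatus 1000 may include only the components necessary for implementing the embodiments of this application, and need not include all the components shown in FIG. 10.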
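In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in an actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
In this application, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate the following cases: only A exists, both A and B exist, and only B exists, where A and B may be singular or plural. In the text of this application, the character "/" generally indicates an "or" relationship between the associated objects; in the formulas of this application, the character "/" indicates a "division" relationship between the associated objects.
It can be understood that the various numerals in the embodiments of this application are merely used for differentiation for ease of description and are not intended to limit the scope of the embodiments of this application. The sequence numbers of the foregoing processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.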

Claims (20)

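  1. A webpage search method, comprising:
    obtaining a semantic vector of a query statement;
    determining a similarity between the query statement and each of a plurality of webpages based on the semantic vector of the query statement and feature information of each webpage, wherein the feature information of each webpage represents first semantic aggregation information and at least one piece of second semantic aggregation information of the webpage, the first semantic aggregation information is obtained by semantically aggregating semantic information of the plurality of webpages, the at least one piece of second semantic aggregation information is obtained by semantically aggregating semantic information of webpages, among the plurality of webpages, that have a same topic as the webpage, and during semantic aggregation for each webpage, a weight of the webpage is greater than weights of the other webpages participating in the semantic aggregation; and
    obtaining a query result of the query statement based on the similarity between the query statement and each webpage, wherein the query result is at least one of the plurality of webpages.
  2. The method according to claim 1, wherein, for a first webpage, the first webpage being any one of the plurality of webpages:
    the first semantic aggregation information of the first webpage is represented by a second vector, the second vector is obtained by semantically aggregating a first vector of each of the plurality of webpages, and the first vector of each webpage represents the semantic information of the webpage; and
    the at least one piece of second semantic aggregation information of the first webpage is represented by at least one third vector; each of the at least one third vector corresponds to one topic comprised in the first webpage, and the topics corresponding to the third vectors are all different; wherein each of the at least one third vector is obtained by semantically aggregating the first vector of the first webpage and a first vector of a second webpage, and the second webpage is a webpage, among the plurality of webpages, that contains the topic corresponding to the third vector.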
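  3. The method according to claim 2, wherein
    the at least one third vector of the first webpage is further related to a topology graph, and the topology graph indicates association relationships among the plurality of webpages.
  4. The method according to claim 3, wherein
    the topology graph comprises at least one sub-topology graph, each of the at least one third vector corresponds to one of the at least one sub-topology graph, and the sub-topology graphs corresponding to the third vectors are all different; the webpages in the sub-topology graph corresponding to each third vector comprise the first webpage and the second webpage; and
    each third vector is obtained by semantically aggregating the first webpage and the second webpage in the sub-topology graph corresponding to the third vector.
  5. The method according to claim 4, wherein
    each of the at least one sub-topology graph corresponds to one of at least one webpage group, and the webpage groups corresponding to the sub-topology graphs are all different, wherein each of the at least one webpage group consists of the webpages, among the plurality of webpages, that contain the topic corresponding to the webpage group; and
    each of the at least one sub-topology graph is obtained by extracting, from the topology graph, the webpages in the webpage group corresponding to the sub-topology graph.
  6. The method according to any one of claims 1 to 5, wherein
    the feature information of each webpage further comprises a first vector of the webpage, and the first vector of each webpage indicates the semantic information of the webpage.
  7. The method according to any one of claims 1 to 6, wherein the feature information of each webpage is represented by a matrix, and before the determining a similarity between the query statement and each of a plurality of webpages based on the semantic vector of the query statement and feature information of each webpage, the method further comprises:
    converting, based on a weight of each vector in the matrix, the matrix corresponding to each webpage into a target vector, wherein the target vector indicates the feature information of the webpage; and
    the determining a similarity between the query statement and each of a plurality of webpages based on the semantic vector of the query statement and feature information of each webpage comprises:
    computing a similarity between the semantic vector of the query statement and the target vector, wherein the similarity indicates a relevance between the query statement and the webpage corresponding to the target vector.
  8. The method according to claim 7, wherein before the converting, based on a weight of each vector in the matrix, the matrix corresponding to each webpage into a target vector, the method further comprises:
    determining a similarity between the semantic vector of the query statement and each vector in the matrix; and
    determining the weight of each vector in the matrix based on the similarity between the semantic vector of the query statement and each vector in the matrix.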
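  9. A webpage search apparatus, comprising an obtaining unit and a processing unit, wherein
    the obtaining unit is configured to obtain a query statement; and
    the processing unit is configured to: obtain a semantic vector of the query statement; determine a similarity between the query statement and each of a plurality of webpages based on the semantic vector of the query statement and feature information of each webpage, wherein the feature information of each webpage represents first semantic aggregation information and at least one piece of second semantic aggregation information of the webpage, the first semantic aggregation information is obtained by semantically aggregating semantic information of the plurality of webpages, the at least one piece of second semantic aggregation information is obtained by semantically aggregating semantic information of webpages, among the plurality of webpages, that have a same topic as the webpage, and during semantic aggregation for each webpage, a weight of the webpage is greater than weights of the other webpages participating in the semantic aggregation; and obtain a query result of the query statement based on the similarity between the query statement and each webpage, wherein the query result is at least one of the plurality of webpages.
  10. The apparatus according to claim 9, wherein
    for a first webpage, the first webpage being any one of the plurality of webpages:
    the first semantic aggregation information of the first webpage is represented by a second vector, the second vector is obtained by semantically aggregating a first vector of each of the plurality of webpages, and the first vector of each webpage represents the semantic information of the webpage; and
    the at least one piece of second semantic aggregation information of the first webpage is represented by at least one third vector; each of the at least one third vector corresponds to one topic comprised in the first webpage, and the topics corresponding to the third vectors are all different; wherein each of the at least one third vector is obtained by semantically aggregating the first vector of the first webpage and a first vector of a second webpage, and the second webpage is a webpage, among the plurality of webpages, that contains the topic corresponding to the third vector.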
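  11. The apparatus according to claim 10, wherein
    the at least one third vector of the first webpage is further related to a topology graph, and the topology graph indicates association relationships among the plurality of webpages.
  12. The apparatus according to claim 9 or 10, wherein
    the topology graph comprises at least one sub-topology graph, each of the at least one third vector corresponds to one of the at least one sub-topology graph, and the sub-topology graphs corresponding to the third vectors are all different; the webpages in the sub-topology graph corresponding to each third vector comprise the first webpage and the second webpage; and
    each third vector is obtained by semantically aggregating the first webpage and the second webpage in the sub-topology graph corresponding to the third vector.
  13. The apparatus according to claim 9 or 10, wherein
    each of the at least one sub-topology graph corresponds to one of at least one webpage group, and the webpage groups corresponding to the sub-topology graphs are all different, wherein each of the at least one webpage group consists of the webpages, among the plurality of webpages, that contain the topic corresponding to the webpage group; and
    each of the at least one sub-topology graph is obtained by extracting, from the topology graph, the webpages in the webpage group corresponding to the sub-topology graph.
  14. The apparatus according to any one of claims 9 to 13, wherein
    the feature information of each webpage further comprises a first vector of the webpage, and the first vector of each webpage indicates the semantic information of the webpage.
  15. The apparatus according to any one of claims 9 to 14, wherein
    the feature information of each webpage is represented by a matrix, and before the processing unit determines the similarity between the query statement and each webpage based on the semantic vector of the query statement and the feature information of each of the plurality of webpages, the processing unit is further configured to convert, based on a weight of each vector in the matrix, the matrix corresponding to each webpage into a target vector, wherein the target vector indicates the feature information of the webpage; and
    in determining the similarity between the query statement and each webpage based on the semantic vector of the query statement and the feature information of each of the plurality of webpages, the processing unit is specifically configured to:
    compute a similarity between the semantic vector of the query statement and the target vector, wherein the similarity indicates a relevance between the query statement and the webpage corresponding to the target vector.
  16. The apparatus according to claim 15, wherein before the processing unit converts, based on the weight of each vector in the matrix, the matrix corresponding to each webpage into the target vector, the processing unit is further configured to:
    determine a similarity between the semantic vector of the query statement and each vector in the matrix; and
    determine the weight of each vector in the matrix based on the similarity between the semantic vector of the query statement and each vector in the matrix.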
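  17. A webpage search apparatus, comprising: a memory, configured to store a program; and a processor, configured to execute the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to implement the method according to any one of claims 1 to 8.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores program code to be executed by a device, and the program code comprises instructions for implementing the method according to any one of claims 1 to 8.
  19. A computer program product, wherein when the computer program product runs on a computer, the computer is caused to perform the method according to any one of claims 1 to 8.
  20. A chip, wherein the chip comprises a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to implement the method according to any one of claims 1 to 8.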
PCT/CN2022/097818 2021-06-18 2022-06-09 Webpage search method, apparatus, and storage medium WO2022262632A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110683570.5A CN115495636A (zh) 2021-06-18 2021-06-18 Webpage search method, apparatus, and storage medium
CN202110683570.5 2021-06-18

Publications (1)

Publication Number Publication Date
WO2022262632A1 true WO2022262632A1 (zh) 2022-12-22

Family

ID=84463984

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/097818 WO2022262632A1 (zh) 2021-06-18 2022-06-09 Webpage search method, apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN115495636A (zh)
WO (1) WO2022262632A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975410A (zh) * 2023-09-22 2023-10-31 北京中关村科金技术有限公司 Webpage data collection method and apparatus, electronic device, and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120259882A1 (en) * 2011-04-06 2012-10-11 Google Inc. Mining for Product Classification Structures for Intenet-Based Product Searching
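CN104504138A (zh) * 2014-12-31 2015-04-08 广州索答信息科技有限公司 Person-based information aggregation method and apparatus
CN106021346A (zh) * 2016-05-09 2016-10-12 北京百度网讯科技有限公司 Retrieval processing method and apparatus
CN107220307A (zh) * 2017-05-10 2017-09-29 清华大学 Webpage search method and apparatus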

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
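CN116975410A (zh) * 2023-09-22 2023-10-31 北京中关村科金技术有限公司 Webpage data collection method and apparatus, electronic device, and readable storage medium
CN116975410B (zh) * 2023-09-22 2023-12-19 北京中关村科金技术有限公司 Webpage data collection method and apparatus, electronic device, and readable storage medium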

Also Published As

Publication number Publication date
CN115495636A (zh) 2022-12-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22824112

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE