CN117909560A - Search method, training device, training equipment, training medium and training program product - Google Patents

Publication number
CN117909560A
Authority
CN
China
Prior art keywords
block
webpage
sample
matched
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410101634.XA
Other languages
Chinese (zh)
Inventor
崔自鑫
叶超
朱坤鸿
郭宗仁
张人愉
张斌杰
国智
李双龙
贺登武
刘林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu com Times Technology Beijing Co Ltd
Original Assignee
Baidu com Times Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu com Times Technology Beijing Co Ltd

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The disclosure provides a search method, a training method of a deep learning model, an apparatus, an electronic device, a storage medium and a program product, and relates to the technical field of artificial intelligence, in particular to the technical fields of human-computer interaction, large models, large language models, Transformers, dialogue models, generative models and the like. The specific implementation scheme is as follows: in response to receiving search content, determining respective webpage information of a plurality of webpages to be matched, wherein the webpage information comprises block attributes and block contents of layout blocks of the webpages to be matched, the block attributes of the layout blocks are used for representing structural information of the webpages to be matched, and the block contents of the layout blocks comprise page contents of the webpages to be matched; and determining a target webpage matched with the search content from the plurality of webpages to be matched based on the search content and the webpage information of each of the plurality of webpages to be matched.

Description

Search method, training device, training equipment, training medium and training program product
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the technical fields of human-computer interaction, large models, large language models, Transformers, conversational models, generative models, and the like, and more particularly to a search method, a training method of a deep learning model, an apparatus, an electronic device, a storage medium, and a program product.
Background
Human-machine interaction is the way in which a human interacts with a machine. With the continuous development of artificial intelligence technology, machines have become able to understand information entered by humans, grasp its underlying meaning, and give corresponding feedback. In these operations, accurate understanding of semantics, speed of feedback, and the provision of relevant comments or suggestions all affect how smoothly human-machine interaction proceeds.
Disclosure of Invention
The disclosure provides a search method, a training method of a deep learning model, a device, an electronic device, a storage medium and a program product.
According to an aspect of the present disclosure, there is provided a search method including: in response to receiving the search content, determining respective webpage information of a plurality of webpages to be matched, wherein the webpage information comprises block attributes and block contents of layout blocks of the webpages to be matched, the block attributes of the layout blocks are used for representing structural information of the webpages to be matched, and the block contents of the layout blocks comprise page contents of the webpages to be matched; and determining a target webpage matched with the search content from the plurality of webpages to be matched based on the search content and the webpage information of each of the plurality of webpages to be matched.
According to another aspect of the present disclosure, there is provided a training method of a deep learning model, including: determining sample webpage information of a sample webpage, wherein the sample webpage information comprises sample block attributes and sample block contents of sample layout blocks of the sample webpage, the sample block attributes of the sample layout blocks are used for representing a page layout structure of the sample webpage, and the sample block contents of the sample layout blocks comprise page contents of the sample webpage; training the deep learning model based on the sample search content, sample webpage information of the sample webpage and a relevance label to obtain a trained deep learning model, wherein the relevance label is used for representing the relevance between the sample search content and the sample webpage.
According to another aspect of the present disclosure, there is provided a search apparatus including: the webpage determining module is used for determining respective webpage information of a plurality of webpages to be matched in response to receiving search content, wherein the webpage information comprises block attributes and block contents of layout blocks of the webpages to be matched, the block attributes of the layout blocks are used for representing structural information of the webpages to be matched, and the block contents of the layout blocks comprise page contents of the webpages to be matched; and the matching module is used for determining a target webpage matched with the search content from the plurality of webpages to be matched based on the search content and the webpage information of each of the plurality of webpages to be matched.
According to another aspect of the present disclosure, there is provided a training apparatus of a deep learning model, including: the sample webpage determining module is used for determining sample webpage information of the sample webpage, wherein the sample webpage information comprises sample block attributes and sample block contents of sample layout blocks of the sample webpage, the sample block attributes of the sample layout blocks are used for representing a page layout structure of the sample webpage, and the sample block contents of the sample layout blocks comprise page contents of the sample webpage; and the training module is used for training the deep learning model based on the sample search content, the sample webpage information of the sample webpage and the relevance label to obtain a trained deep learning model, wherein the relevance label is used for representing the relevance between the sample search content and the sample webpage.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods as disclosed herein.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as disclosed herein.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as disclosed herein.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture to which search methods and apparatus may be applied, according to embodiments of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a search method according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a schematic diagram of a web page to be matched according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram of a hierarchical relationship according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a schematic diagram of a web page to be matched according to another embodiment of the present disclosure;
FIG. 6 schematically illustrates a schematic diagram of determining correlation according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow chart of a training method of a deep learning model according to an embodiment of the disclosure;
FIG. 8 schematically illustrates a flow chart of a training method of a deep learning model according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a search apparatus according to an embodiment of the present disclosure;
FIG. 10 schematically illustrates a block diagram of a training apparatus of a deep learning model in accordance with an embodiment of the present disclosure; and
FIG. 11 schematically illustrates a block diagram of an electronic device adapted to implement a search method according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In implementing the concepts of the present disclosure, it was found that, in a search scenario, the way web page content is presented can affect the accuracy with which search results are determined. For example, where the page type of the web page to be matched is a consultation type, the web page consists mainly of dialogue. Where the page type is a content page, the web page focuses on introducing goods or services. Where the page type is a download type, the web page carries little information and mainly provides a download service.
In a related approach, only the web page content of the web page to be matched is used as reference data to determine whether the web page to be matched matches the search content input by the user, and the matched web page is fed back to the client.
Using only the web page content, or keywords in the web page content, as reference data may to some extent lose the structured information of the web page to be matched, making it difficult to take the differing structures of different page types into account.
To solve this technical problem, the present disclosure provides a search method including: in response to receiving search content, determining web page information of each of a plurality of web pages to be matched, where the web page information includes block attributes and block contents of layout blocks of the web page to be matched, the block attributes of the layout blocks are used to characterize structured information of the web page to be matched, and the block contents of the layout blocks include page contents of the web page to be matched; and determining, based on the search content and the web page information of each of the plurality of web pages to be matched, a target web page matching the search content from the plurality of web pages to be matched.
By utilizing the searching method provided by the disclosure, the block attribute in the webpage information can be utilized to represent the structural information of the webpage to be matched, and the block content in the webpage information is utilized to represent the page content of the webpage to be matched, so that the structural information of the webpage to be matched is fused into the page content, thereby enriching the information diversity of the webpage information, and improving the accuracy of determining the target webpage.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and application of users' personal information all comply with relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
Fig. 1 schematically illustrates an exemplary system architecture to which search methods and apparatuses may be applied, according to embodiments of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios. For example, in another embodiment, an exemplary system architecture to which the search method and apparatus may be applied may include a terminal device, but the terminal device may implement the search method and apparatus provided by the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as a knowledge reading class application, a web browser application, a search class application, an instant messaging tool, a mailbox client and/or social platform software, etc. (as examples only).
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for content browsed by the user using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the search method provided by the embodiments of the present disclosure may be generally performed by the terminal device 101, 102, or 103. Accordingly, the search apparatus provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103.
Alternatively, the search method provided by the embodiments of the present disclosure may generally be performed by the server 105. Accordingly, the search apparatus provided by the embodiments of the present disclosure may generally be provided in the server 105. The search method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and that is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the search apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
For example, when a user searches for a web page, the terminal device 101, 102, or 103 may acquire search content input by the user and transmit it to the server 105. In response to receiving the search content, the server 105 determines web page information of each of a plurality of web pages to be matched, and determines a target web page matching the search content from the plurality of web pages to be matched based on the search content and the web page information of each of the plurality of web pages to be matched. Alternatively, these operations may be performed, and the target web page finally determined, by a server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
It should be noted that the sequence numbers of the respective operations in the following methods are merely representative of the operations for the purpose of description, and should not be construed as representing the order of execution of the respective operations. The method need not be performed in the exact order shown unless explicitly stated.
Fig. 2 schematically illustrates a flow chart of a search method according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S210 to S220.
In operation S210, in response to receiving the search content, web page information of each of the plurality of web pages to be matched is determined.
In operation S220, a target web page matching the search content is determined from the plurality of web pages to be matched based on the search content and the web page information of each of the plurality of web pages to be matched.
In an example, the web page information may include block attributes and block contents of layout blocks of the web page to be matched, the block attributes of the layout blocks being used to characterize structural information of the web page to be matched, the block contents of the layout blocks including page contents of the web page to be matched.
In one example, a web page to be matched may include a plurality of layout blocks arranged in accordance with structured information.
In an example, the structural information of the web page to be matched may also be referred to as structural layout information, which may refer to information that a plurality of web page components, such as layout blocks, in the web page to be matched have a hierarchical layout or arrangement between each other.
In an example, the structured information of the web page to be matched may be embodied by the block attributes of the layout blocks, and the web page content of the web page to be matched may be embodied by the block contents of the layout blocks.
For example, the title, the related list, the image, the subject introduction, and the like of the web page to be matched may each be used as a layout block. Block attributes and block contents of each layout block are determined based on the web page to be matched. The block attributes may include, but are not limited to, a layout block type, and may also include a layout block location; any information that can characterize a layout block may serve as a block attribute. The block content may include text content.
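As an illustration only, the web page information described above could be represented as follows. The class and field names (`LayoutBlock`, `block_type`, `position`, `WebPageInfo`, `url`) and the sample values are assumptions made for this sketch and do not appear in the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class LayoutBlock:
    # Block attributes: characterize the structured information of the page.
    block_type: str                              # e.g. "title", "list", "body"
    position: Optional[Tuple[int, int]] = None   # optional page coordinates
    # Block content: the page content carried by this block.
    content: str = ""


@dataclass
class WebPageInfo:
    url: str  # hypothetical identifier of the web page to be matched
    blocks: List[LayoutBlock] = field(default_factory=list)


# Illustrative values only.
page = WebPageInfo(
    url="https://example.com/page",
    blocks=[
        LayoutBlock("title", (0, 0), "Example page title"),
        LayoutBlock("list", (0, 40), "Related item 1 | Related item 2"),
    ],
)
```

Keeping the attributes and the content side by side in each block is what lets later steps use structure and text together.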
According to the embodiment of the disclosure, the web page information of the web page to be matched is used as the basis for judging whether the web page to be matched is related to the search content. The block attributes in the web page information embody the page's structured information, and the block contents fully represent the page's semantic information, so that the web page to be matched is analyzed at a finer granularity and more comprehensively, and the precision of relevance discrimination between the search content and the web page to be matched is improved.
According to a related example, the web page content of a web page to be matched can be obtained, and a degree of correlation between the web page to be matched and the search content is determined based on the search content and the web page content, so as to determine whether to present the web page to be matched to the user as a search result.
Compared with a search method that determines relevance using only the web page content of the web page to be matched, the search method provided by the embodiments of the present disclosure analyzes the web page to be matched more comprehensively by using the block attributes in the web page information, which reflect the structured information, thereby improving the precision of relevance discrimination between the search content and the web page to be matched and improving the user experience.
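The flow of operations S210 and S220 can be sketched roughly as below. The scorer is a naive weighted token-overlap stand-in chosen for this example; it is not the relevance computation of the disclosure, and the names `BLOCK_WEIGHTS`, `score`, and `search`, as well as the weights themselves, are hypothetical.

```python
from typing import Dict, List, Tuple

# Each page is a list of (block_attribute, block_content) pairs.
PageInfo = List[Tuple[str, str]]

# Hypothetical weights: how much a match in each block type counts.
BLOCK_WEIGHTS = {"title": 3.0, "body": 1.0, "list": 0.5}


def score(query: str, page: PageInfo) -> float:
    """Naive stand-in scorer: weighted count of query tokens found per block."""
    tokens = set(query.lower().split())
    total = 0.0
    for block_type, content in page:
        words = set(content.lower().split())
        total += BLOCK_WEIGHTS.get(block_type, 1.0) * len(tokens & words)
    return total


def search(query: str, pages: Dict[str, PageInfo]) -> str:
    """S210: read the web page information; S220: pick the best-matching page."""
    return max(pages, key=lambda url: score(query, pages[url]))


pages = {
    "page_a": [("title", "dental hospital"), ("body", "opening hours and address")],
    "page_b": [("title", "download center"), ("body", "dental hospital brochure download")],
}
best = search("dental hospital", pages)
```

Both pages contain the query words, but the block attribute tells the scorer that the match in `page_a` sits in the title, which is the kind of structural signal a content-only comparison would miss.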
According to an embodiment of the present disclosure, before performing operation S210 as shown in fig. 2, determining web page information of each of a plurality of web pages to be matched in response to receiving search content, the search method may further include: a set of web page information is generated.
In an example, the web page information may include: page attributes, block content, and content attributes.
In one example, the page attributes may include, but are not limited to, a page type (also referred to as a web page type), such as a list page type, a detail page type, or a consultation page type, and may also include a web page title, such as a "picture detail" title or a "merchandise introduction" title. Any attribute information that can characterize the web page to be matched may be used.
In one example, the block attributes may include, but are not limited to, a layout block type, such as a layout subtitle, a content list, or a text, and may also include a layout block location, such as page coordinates. Any information that can characterize a layout block may be used.
In an example, the block content may include text content, such as the text content of a title.
In an example, the content attributes may include information characterizing a content type, such as merchandise, ratings, novels, or games.
In one example, web page information for each web page to be matched may be determined, and a set of web page information may be generated based on the web page information for each of the plurality of web pages to be matched.
According to the embodiment of the disclosure, the webpage information set is constructed in advance, and under the condition that a user sends a search request through the terminal equipment, the webpage information of the webpage to be matched can be rapidly obtained from the webpage information set, so that the processing efficiency and the response efficiency are improved.
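A minimal sketch of the precomputation idea, assuming the set is a dictionary keyed by a page identifier. The function names (`build_info_set`, `lookup`, `fake_extract`) and the record layout are assumptions for illustration; a real extraction step would replace `fake_extract`.

```python
from typing import Callable, Dict, Iterable, List

# A record holding page attributes, a block list, and content attributes.
WebPageInfo = Dict[str, object]


def build_info_set(
    urls: Iterable[str],
    extract: Callable[[str], WebPageInfo],
) -> Dict[str, WebPageInfo]:
    """Offline step: extract web page information once per page."""
    return {url: extract(url) for url in urls}


def lookup(info_set: Dict[str, WebPageInfo], candidates: List[str]) -> List[WebPageInfo]:
    """Online step: fetch precomputed information for the candidate pages."""
    return [info_set[url] for url in candidates if url in info_set]


def fake_extract(url: str) -> WebPageInfo:
    # Placeholder for real extraction; returns fixed illustrative fields.
    return {
        "page_type": "detail",
        "content_type": "merchandise",
        "blocks": [{"block_type": "title", "content": f"Title of {url}"}],
    }


info_set = build_info_set(["u1", "u2"], fake_extract)
found = lookup(info_set, ["u2", "u3"])
```

Because extraction happens before any search request arrives, the query-time cost is reduced to a dictionary lookup per candidate page.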
Fig. 3 schematically shows a schematic diagram of a web page to be matched according to an embodiment of the disclosure.
As shown in fig. 3, the web page 300 to be matched includes a title 310, an information bar navigation 320 set below the title 310, a dialog box 330 set on the right side of the information bar navigation 320, an article 340 set on the upper right side of the dialog box 330, and the like. The title 310 and its icon may be divided into layout block 1, the information bar navigation 320 into layout block 2, the dialog box 330 into layout block 3, and the article 340 into layout block 4. The block attributes and block content of each layout block may be determined based on the web page 300 to be matched. For layout block 1, the block attribute "title" and the block content "×× Hospital" may be determined. For layout block 2, the block attribute "list" and the block content "information column navigation | introduction 1 | catalog | introduction 2" may be determined. For layout block 3, the block attribute "dialog box" and the block content "Hello, how can I help you? | Hello" may be determined. For layout block 4, the block attribute "body" and the corresponding block content may be determined.
According to an embodiment of the present disclosure, generating the set of web page information may include: determining layout blocks from the web page to be matched; obtaining web page information based on the block attributes and block contents of the layout blocks; and generating the set of web page information based on a plurality of pieces of web page information.
For example, the web page information of the web page to be matched may be obtained based on the block attribute of the layout block 1, the block content of the layout block 1, the block attribute of the layout block 2, the block content of the layout block 2, the block attribute of the layout block 3, the block content of the layout block 3, the block attribute of the layout block 4, and the block content of the layout block 4 as shown in fig. 3, but the disclosure is not limited thereto. The web page information may also be obtained based on the page attribute of the web page to be matched together with the block attributes and block contents of the layout blocks.
According to the embodiment of the disclosure, each piece of webpage information in the webpage information set generated in advance comprises the respective block attributes and the respective block contents of the plurality of layout blocks, so that the webpage information is standardized through the combination form of the block attributes and the block contents of the layout blocks while the page structural information and the page contents of the webpage to be matched are reflected.
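One way to picture the "combination form" of block attributes and block contents is a flat string in which each block's content is prefixed by its attribute. The bracketed tag format and the `serialize` helper are assumptions of this sketch, and the block values are placeholders loosely modeled on the fig. 3 example.

```python
from typing import List, Tuple


def serialize(blocks: List[Tuple[str, str]]) -> str:
    """Join (block_attribute, block_content) pairs into one standardized string."""
    return " ".join(f"[{attr}] {content}" for attr, content in blocks)


# Placeholder values for illustration.
blocks = [
    ("title", "Example Hospital"),
    ("list", "Information column navigation | Introduction 1 | Catalog | Introduction 2"),
    ("dialog box", "Hello, how can I help you?"),
]
text = serialize(blocks)
```

A uniform format of this kind means every page, whatever its type, reaches the matching step in the same shape.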
According to an embodiment of the present disclosure, determining a layout block from a web page to be matched may include: and determining page elements of the web page to be matched. Based on the page elements, a layout block is determined from the web pages to be matched.
In an example, for a web page to be matched that is described in HTML (HyperText Markup Language) or CSS (Cascading Style Sheets), page elements of the web page to be matched may be obtained through a developer tool, but the disclosure is not limited thereto. Any code used to describe the web page to be matched, which a developer can consult or modify, may serve as a page element of the web page to be matched.
For example, page elements may include layout block tags <div>, inline tags <span>, image tags <img>, hyperlink tags <a>, and attribute names such as title, alignment, and coordinates.
In one example, layout blocks may be determined from the web page to be matched based on the page elements. For example, keyword matching is performed on the page elements to obtain preset page sub-elements that identify layout blocks, and the layout blocks are determined from the web page to be matched based on the preset page sub-elements.
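The keyword-matching step could look roughly like the following, using Python's standard `html.parser`. It assumes that layout blocks are `<div>` elements and that their `class` attribute contains one of a preset list of keywords; both the keyword list (`BLOCK_KEYWORDS`) and the `BlockExtractor` class are hypothetical and only stand in for whatever preset page sub-elements an implementation would use.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Hypothetical preset keywords that identify layout blocks via the class attribute.
BLOCK_KEYWORDS = ("title", "nav", "dialog", "article")


class BlockExtractor(HTMLParser):
    """Collects (block_type, text) for <div> elements whose class matches a keyword."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[Tuple[str, str]] = []
        self._current: Optional[str] = None  # block type being collected
        self._depth = 0                      # <div> nesting depth inside the block
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._current is not None:
            self._depth += 1  # nested <div> inside the block being collected
            return
        css_class = dict(attrs).get("class") or ""
        for keyword in BLOCK_KEYWORDS:
            if keyword in css_class:
                self._current, self._depth, self._text = keyword, 1, []
                break

    def handle_endtag(self, tag):
        if tag != "div" or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:  # the matched <div> itself has closed
            self.blocks.append((self._current, " ".join(self._text)))
            self._current = None

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self._text.append(data.strip())


html = (
    '<div class="page-title">Example Hospital</div>'
    '<div class="nav-list"><a>Introduction 1</a><a>Catalog</a></div>'
    '<div class="footer">ignored</div>'
)
parser = BlockExtractor()
parser.feed(html)
parser.close()
```

Only the first two `<div>` elements match a keyword, so `parser.blocks` holds a title block and a navigation block, and the footer is skipped.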
In another example of the present disclosure, the layout blocks may also be determined from the web pages to be matched according to a page map that matches the web pages to be matched. For example, image recognition is performed on the page map to obtain each layout block.
According to the embodiment of the disclosure, the web page to be matched is analyzed at the granularity of layout blocks, exploiting its structured layout design, in which components are distributed across different organizational structures. This makes reasonable use of features common to web pages to be matched and of existing resources such as their page elements, so that fine-grained analysis is achieved while processing difficulty and processing cost are reduced.
According to an embodiment of the present disclosure, determining, based on page elements, a layout block for characterizing a page layout structure of a web page to be matched from the web page to be matched may include: determining the structured information of the web page to be matched based on the page type of the web page to be matched and a layout mapping relationship; and determining the layout block from the web page to be matched based on the structured information and the page elements of the web page to be matched.
In an example, the structured information may include relevant information for characterizing the structural layout of the web page to be matched. For example, the number of layout blocks, the arrangement manner of a plurality of layout blocks with each other, the identification of each layout block or the block attribute of the layout block, and the like.
In one example, web pages to be matched of different page types have different structured information. For example, for a web page to be matched of page type 1, the structured information includes 3 layout blocks. For a web page to be matched of page type 2, the structured information includes 4 layout blocks.
In an example, a layout mapping relationship for characterizing an association relationship between a page type and structured information may be constructed in advance. Based on the page type, the structural information of the web pages to be matched is determined. And determining the layout blocks from the webpages to be matched based on the structural information and the page elements of the webpages to be matched.
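A layout mapping relationship of this kind can be pictured as a lookup table from page type to the block types expected on that page. The table contents, the element representation (dictionaries with `type` and `text` keys), and the function name `select_blocks` are assumptions for illustration.

```python
from typing import Dict, List

# Hypothetical layout mapping: page type -> expected layout block types.
LAYOUT_MAPPING: Dict[str, List[str]] = {
    "consultation": ["title", "list", "dialog box"],      # 3 layout blocks
    "detail": ["title", "list", "image", "body"],         # 4 layout blocks
}


def select_blocks(page_type: str, elements: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the page elements that the structured information expects."""
    expected = LAYOUT_MAPPING[page_type]
    return [element for element in elements if element["type"] in expected]


elements = [
    {"type": "title", "text": "Example Hospital"},
    {"type": "image", "text": ""},
    {"type": "dialog box", "text": "Hello, how can I help you?"},
]
consultation_blocks = select_blocks("consultation", elements)
```

The same page elements yield different layout blocks under different page types, because the structured information is looked up from the mapping before any element is examined.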
According to an exemplary embodiment of the present disclosure, in a case where the web page to be matched includes a plurality of layout blocks, determining the layout blocks from the web page to be matched based on the structured information of the web page to be matched and the page elements may include: determining, based on the page elements and the structured information of the web page to be matched, page sub-elements for identifying the layout blocks; and determining the layout blocks from the web page to be matched based on the page sub-elements.
In an example, the structured information may include layout block identification, such as a page sub-element, but is not limited thereto, and may include semantic information for characterizing the layout block, which may be matched to the page sub-element based on the semantic information. It is sufficient if the page sub-element for identifying the layout block can be determined from the page elements by the structured information.
According to the embodiment of the disclosure, structured information is determined for the web pages to be matched of each page type, so that the structural characteristics of the web pages to be matched can be extracted and their common characteristics summarized. Using these common characteristics to establish the layout mapping relationship between page types and structured information speeds up the determination of layout blocks and makes the way layout blocks are determined more targeted, which in turn improves the accuracy of the determination.
According to an embodiment of the present disclosure, determining a layout block from a web page to be matched based on a page subelement may include: based on the page sub-elements, an initial layout block is determined from the web pages to be matched. In the case where the block type of the initial layout block is determined to be a predetermined block type, the layout block is determined based on the initial layout block.
In an example, a layout block obtained from a web page to be matched based on the page subelement can be directly taken as a final layout block. But is not limited thereto. The layout block obtained from the web page to be matched based on the page subelement can also be used as an initial layout block. And performing block type identification on the initial layout block, and determining the layout block based on the initial layout block under the condition that the block type of the initial layout block is determined to be a preset block type.
In an example, the predetermined block type may be predetermined based on the structured information.
According to the embodiment of the disclosure, the layout blocks obtained from the webpages to be matched based on the page sub-elements are used as the initial layout blocks, and the secondary identification is carried out on the initial layout blocks, so that the analysis precision of the layout blocks can be improved, and the reduction of the precision of determining the target webpages based on the webpage information due to the fact that the layout blocks are determined to be wrong is avoided.
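The secondary identification step can be sketched as a filter over the initial layout blocks. This is only an illustrative sketch: `classify_block_type` is a hypothetical stand-in for whatever block type identification is actually used, and the set of predetermined block types is an assumption.

```python
# Assumed set of predetermined block types, for illustration only.
PREDETERMINED_BLOCK_TYPES = {"image", "text", "control"}


def classify_block_type(block):
    """Hypothetical block type identification; here it just reads a field."""
    return block.get("type", "unknown")


def confirm_layout_blocks(initial_blocks, allowed_types=PREDETERMINED_BLOCK_TYPES):
    """Keep only the initial layout blocks whose identified type is predetermined."""
    return [
        block
        for block in initial_blocks
        if classify_block_type(block) in allowed_types
    ]
```

An initial block whose identified type falls outside the predetermined set is dropped rather than passed on as a layout block.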
According to an embodiment of the present disclosure, before performing the generating of the web page information set, the search method may further include: and determining the structural information of each webpage to be matched.
According to an embodiment of the present disclosure, determining the structured information of each web page to be matched may include: determining at least one piece of structured information to be evaluated for an evaluation web page. For each piece of structured information to be evaluated, an evaluation layout block of the evaluation web page is determined from the evaluation web page based on the structured information to be evaluated and the evaluation page elements of the evaluation web page. The layout mapping relationship between the page type of the evaluation web page and the structured information to be evaluated is determined based on the evaluation search content, the evaluation block attribute of the evaluation layout block, and the evaluation block content, where the relevance between the evaluation search content and the evaluation web page is known. The structured information of the web page to be matched is then determined based on the page type of the web page to be matched and the layout mapping relationship.
In an example, the evaluation web page may include a web page to be matched, but is not limited thereto. It may also be a web page of the same page type as the web page to be matched.
In an example, a plurality of structural information to be evaluated may be determined in advance for each evaluation web page. And determining an evaluation page subelement for identifying an evaluation layout block from the evaluation webpage based on the structural information to be evaluated and the evaluation page element of the evaluation webpage. And determining an evaluation layout block from the evaluation webpage based on the evaluation page subelement. And determining the evaluation webpage information of the evaluation webpage based on the evaluation block attribute and the evaluation block content of the evaluation layout block. And determining the evaluation correlation between the evaluation webpage and the search content based on the evaluation search content, the evaluation search attribute of the evaluation search content and the evaluation webpage information. And determining target evaluation correlation matched with the real correlation from the plurality of evaluation correlations based on the plurality of evaluation correlations corresponding to the plurality of structured information to be evaluated one by one and the known real correlation between the evaluation search content and the evaluation webpage. And taking the target to-be-evaluated structured information corresponding to the target evaluation correlation degree as the structured information of the best matching of the evaluation webpage. And determining the association relation between the page type of the evaluation webpage and the target to-be-evaluated structured information. And obtaining a layout mapping relation by determining target to-be-evaluated structural information of different page types.
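The selection step above can be sketched as choosing, for each page type, the candidate whose evaluation relevance is closest to the known real relevance. The absolute difference as the closeness measure, the `score_fn` callback, and the shape of the inputs are assumptions made for this illustration; the source does not specify how "matching" the real relevance is measured.

```python
def select_structured_info(candidates, score_fn, true_relevance):
    """Pick the candidate whose evaluation relevance is closest to the known relevance.

    candidates: candidate structured information objects (any hashable labels here).
    score_fn: hypothetical callback returning the evaluation relevance for a candidate.
    """
    return min(candidates, key=lambda c: abs(score_fn(c) - true_relevance))


def build_layout_mapping(evaluations, score_fn):
    """Build page type -> selected structured information.

    evaluations: iterable of (page_type, candidates, true_relevance) tuples.
    """
    return {
        page_type: select_structured_info(candidates, score_fn, true_relevance)
        for page_type, candidates, true_relevance in evaluations
    }
```

For instance, with candidate scores of 0.2, 0.8, and 0.5 and a known relevance of 0.9, the candidate scoring 0.8 is selected.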
According to the embodiment of the disclosure, at least one piece of preset structured information is used as the structured information to be evaluated, the structured information to be evaluated is evaluated through the evaluation webpage, whether the structured information to be evaluated is matched with the page type of the evaluation webpage or not is determined according to the comparison relation between the evaluation correlation and the real correlation, and therefore the accuracy of construction of the layout mapping relation is improved through the evaluation step, and the accuracy of determining the layout block is further improved.
According to an embodiment of the present disclosure, for operation S220 as shown in fig. 2, determining a target web page matching the search content from among a plurality of web pages to be matched based on the search content and web page information of each of the plurality of web pages to be matched may include: and generating an information sequence based on the search content, the search attribute of the search content and the webpage information aiming at the webpage information of each webpage to be matched. And determining the correlation degree between the search content and the webpage to be matched based on the information sequence to obtain a plurality of correlation degrees. And determining target webpages matched with the search content from the webpages to be matched based on the relevance degrees.
In one example, the search content and the web page information are ordered in a predetermined order to obtain the information sequence, but this is not limiting. The information sequence may also be generated from the search content, the search attribute of the search content, and the web page information. The search attribute of the search content may refer to the search field, or may be related information that helps explain the search, for example the website being searched.
In an example, the information sequence may be input into a relevance discrimination model to obtain a relevance between the search content and the web page to be matched. And sequencing the plurality of webpages to be matched according to the sequence of the correlation degree from high to low to obtain a sequencing result. And taking the preset number of webpages to be matched which are ranked in front as target webpages according to the ranking result.
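The ranking step can be sketched as below, assuming the relevance degrees have already been produced by some relevance discrimination model. The function name and the dictionary input are hypothetical.

```python
def select_target_pages(relevance_by_page, top_k):
    """Sort pages by relevance, highest first, and keep the first top_k."""
    ranked = sorted(relevance_by_page.items(), key=lambda item: item[1], reverse=True)
    return [page for page, _ in ranked[:top_k]]
```

Because Python's sort is stable, pages with equal relevance keep their original relative order.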
In an example, the relevance discrimination model may include one or more of a convolutional neural network, a recurrent neural network, a long short-term memory network, or the like, but is not limited thereto. It may also include a large language model (Large Language Model, LLM), such as one or more of GPT (Generative Pre-trained Transformer), ChatGPT (Chat Generative Pre-trained Transformer), GLM (General Language Model), or the like.
In a related example, features may be extracted from search content and search attributes of the search content, resulting in first features. And extracting the features from the webpage information to obtain second features. And carrying out vector similarity calculation on the first feature and the second feature to obtain the correlation. Any manner is possible as long as the degree of correlation between the search content and the web page to be matched can be determined based on the search content, the search attribute of the search content, and the web page information.
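One common choice for the vector similarity calculation is cosine similarity; the source does not name a specific measure, so this is an assumption. The sketch below takes the first and second features as plain lists of floats.

```python
import math


def cosine_similarity(first_feature, second_feature):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(first_feature, second_feature))
    norm_a = math.sqrt(sum(x * x for x in first_feature))
    norm_b = math.sqrt(sum(y * y for y in second_feature))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat an all-zero vector as unrelated instead of dividing by zero.
        return 0.0
    return dot / (norm_a * norm_b)
```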
According to an embodiment of the present disclosure, an information sequence is generated based on the search content, the search attribute of the search content, and the web page information, and the relevance between the search content and the web page to be matched is determined based on the information sequence. Associating the search content with the web page information in a single information sequence allows the model to learn the contextual information between them, which improves the accuracy of the relevance determination.
According to an embodiment of the present disclosure, the search method may further include: search attributes of the search content are determined.
According to an embodiment of the present disclosure, determining the search attribute of the search content may include: in response to receiving the search content, determining at least one initial search attribute of the search content; determining user attributes and user historical attention information; and determining the search attribute of the search content from the at least one initial search attribute based on the user attributes and the user historical attention information.
In an example, the user attributes may include identity information of the user's age, occupation, gender, etc. The user historical attention information may include click information, collection information, attention information, browsing duration, forwarding, and the like.
For example, "apple" may refer to a fruit, a film title, or a mobile phone brand. Based on the user attributes and the user historical attention information, it may be determined that the user has recently searched for movies to watch every weekend. It can thus be determined that the search attribute of the search content "apple" may include "movie".
In an example, the user attribute and the user historical attention information can be utilized to determine the search attribute of the search content from at least one initial search attribute, so that the search difficulty is reduced, and the matching difficulty of the web pages to be matched is prevented from being increased due to the determination problem of the search attribute.
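The disambiguation can be sketched as picking the initial search attribute that appears most often in the user's history. This simplified sketch uses only the historical attention information, represented as a hypothetical list of topic labels; user attributes such as age or occupation are omitted, and the counting rule is an assumption.

```python
from collections import Counter


def select_search_attribute(initial_attributes, history_topics):
    """Return the initial search attribute most frequent in the user's history."""
    counts = Counter(history_topics)
    # Counter returns 0 for unseen topics; on a tie, max keeps the earliest attribute.
    return max(initial_attributes, key=lambda attribute: counts[attribute])
```

With initial attributes "fruit", "movie", and "phone brand" and a history dominated by movie-related activity, the selected attribute is "movie".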
According to an exemplary embodiment of the present disclosure, generating an information sequence based on search content, search attributes of the search content, and web page information may include: and ordering the search content, the search attribute of the search content and the webpage information according to a preset hierarchical relationship to obtain an information sequence.
In an example, the predetermined hierarchical relationship may be used to characterize the structured information of the web page to be matched. For example, it may include the order of the search content, the search attribute, and the web page information relative to each other, but is not limited thereto. It may also include the order of the block attributes and the block contents of the layout blocks within the web page information relative to each other.
According to the embodiment of the disclosure, the information sequence is standardized and serialized, so that the large language model can conveniently learn the structural information of the webpage to be matched in the information sequence, and further the training speed and the training precision of the large language model are improved.
According to an embodiment of the present disclosure, there may be a plurality of layout blocks. The predetermined hierarchical relationship may include a layout order of a plurality of page sub-elements in one-to-one correspondence with the plurality of layout blocks.
According to embodiments of the present disclosure, the layout order of the plurality of page sub-elements may characterize the structured layout of the web page to be matched. Because the predetermined hierarchical relationship includes the layout order, the information sequence generated from it reflects the structured information of the web page to be matched, which makes it easier for the large language model to learn that structured information from the information sequence and improves the matching precision of the large language model.
According to an exemplary embodiment of the present disclosure, the web page information may include: page attributes, block content, and content attributes. According to a predetermined hierarchical relationship, the search content, the search attribute of the search content and the webpage information are ordered to obtain an information sequence, which may include: and ordering the search content, the search attribute of the search content, the page attribute, the block content and the content attribute according to a preset hierarchical relationship to obtain an information sequence.
Fig. 4 schematically illustrates a flow diagram of a hierarchical relationship according to an embodiment of the disclosure.
As shown in FIG. 4, an analysis may be performed from the page level 410 to determine the page type of the web page to be matched. Based on the page type and layout mapping relationship, analysis is performed from the layout block level. Structural information corresponding to page type 1 of the web page to be matched is determined, and the content of the layout block level 420 is determined from the web page to be matched based on the structural information and the page elements. For example, layout block 1, layout block 2, and layout block m. For each layout block, an analysis of the content level 430 is performed. Block attributes and block contents of the layout blocks are determined. For example, for layout block 1, block content 1 and block content 2 of the layout block, and block attribute 1 and block attribute 2 of block content 1 and block content 2 are determined.
In one example, the items may be arranged according to the predetermined hierarchical relationship of search content, search attributes of the search content, page attributes, block attributes, content attributes, and block content: search content such as query: "xxx"; search attributes of the search content such as query industry: "xxx"; page attributes such as landing page header: "xxx" and landing page type: "xxx"; block attributes such as block type: "xxx"; content attributes such as content tags: "xxx"; and block content such as content: "xxx". This yields an information sequence such as: query: "xxx", query industry: "xxx" [SEP] landing page header: "xxx", landing page type: "xxx", block 1: "block type: xxx, content tags: xxx, content: xxx", block 2: "block type: xxx, content tags: xxx, content: xxx", …, landing page industry: "xxx" [SEP]. Here [SEP] is an identifier.
According to the embodiment of the disclosure, the structural information and page content of the webpage to be matched are embodied from different levels and different aspects by utilizing a plurality of parameters in the information sequence in a natural language description mode, so that the target webpage determined by utilizing the information sequence is accurate and effective.
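The serialization described above can be sketched as a small formatting function. The field names follow the example sequence in the text, while the function name, parameter names, and the dictionary keys used for each block are hypothetical.

```python
def build_information_sequence(query, query_industry, page_header, page_type,
                               blocks, page_industry):
    """Serialize search content and web page information in a fixed hierarchical order.

    blocks: list of dicts with 'block_type', 'content_tags', and 'content' keys,
    given in the layout order of the page.
    """
    block_parts = [
        f'block {index}: "block type: {block["block_type"]}, '
        f'content tags: {block["content_tags"]}, content: {block["content"]}"'
        for index, block in enumerate(blocks, start=1)
    ]
    search_part = f'query: "{query}", query industry: "{query_industry}"'
    page_part = ", ".join(
        [f'landing page header: "{page_header}"',
         f'landing page type: "{page_type}"',
         *block_parts,
         f'landing page industry: "{page_industry}"']
    )
    # [SEP] closes the search part and the web page part, as in the example sequence.
    return f"{search_part} [SEP] {page_part} [SEP]"
```

Keeping the order fixed means two pages of the same page type always produce sequences with the same shape, differing only in their field values.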
Fig. 5 schematically shows a schematic diagram of a web page to be matched according to another embodiment of the present disclosure.
As shown in FIG. 5, the web page 500 to be matched includes a first layout block 510 of an image type, a second layout block 520 of a text type, and a third layout block 530 of a control type. The block content of the first layout block 510 includes an image.
In an example, where it is determined that the block content includes an image, the image is subjected to semantic analysis to obtain semantic text describing the image. Based on the semantic text, block content is obtained.
In one example, as one form of image recognition, the image may be input into an image processing model to obtain the semantic text, and the semantic text may be used as the block content, but this is not limiting. Characters in the image may also be recognized using OCR (Optical Character Recognition) technology to obtain image text, and the image text is taken as the semantic text. Any semantic text that can characterize the semantics of the image may be used.
According to the embodiment of the disclosure, the content type of the block content is judged in advance, and the format conversion is performed on the non-text block content, so that the information format of the webpage information obtained based on the block content is unified and standardized, and the webpage information is described by converting the webpage information into the natural language, thereby being beneficial to learning and application of a large language model and improving the processing efficiency of the large language model.
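The format conversion can be sketched as a dispatch on the content type of each block. The `describe_image` callback is a hypothetical placeholder for an image processing model or an OCR step; no particular library is implied.

```python
def normalize_block_content(block, describe_image):
    """Return block content as text, converting image content through describe_image.

    block: dict with 'type' and 'content' keys (hypothetical representation).
    describe_image: callable that maps image data to semantic text.
    """
    if block["type"] == "image":
        return describe_image(block["content"])
    # Text content is already in natural language form and passes through unchanged.
    return block["content"]
```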
According to the embodiment of the disclosure, the webpage information and the search content can be processed by using a large language model, so that the correlation degree between the webpage to be matched and the search content is obtained. A specific implementation may employ a correlation determination method as shown in fig. 6.
Fig. 6 schematically illustrates a schematic diagram of determining a degree of correlation according to an embodiment of the present disclosure.
As shown in fig. 6, the search content, the search attribute of the search content, the page attribute, the block content, the content attribute are ordered according to a predetermined hierarchical relationship to obtain an information sequence 610. The information sequence 610 is input to the correlation discrimination model M610, and a correlation degree 620 is obtained.
According to the embodiment of the disclosure, the structural information of the webpage to be matched is introduced into the correlation discrimination model by utilizing the information sequence, so that the correlation discrimination model can learn knowledge contained in the structural information, the advantage of the correlation discrimination model on natural language understanding is fully utilized, the webpages to be matched of different structural information are learned, and the universality of the correlation discrimination model is improved under the condition that the discrimination precision is ensured.
According to an embodiment of the present disclosure, the relevance discrimination model may be a deep learning model trained with the training method shown in fig. 7, so that the trained deep learning model can be applied to the search method shown in fig. 2, improving the accuracy with which the search method determines the target web page.
Fig. 7 schematically illustrates a flowchart of a training method of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 7, the method includes operations S710 to S720.
In operation S710, sample web page information of a sample web page is determined.
In operation S720, a deep learning model is trained based on sample search content, sample web page information of sample web pages, and relevance labels, resulting in a trained deep learning model.
In an example, the sample web page information includes sample block attributes and sample block contents of sample layout blocks of the sample web page, the sample block attributes of the sample layout blocks being used to characterize a page layout structure of the sample web page, the sample block contents of the sample layout blocks including page contents of the sample web page.
In an example, a relevance tag is used to characterize relevance between sample search content and a sample web page.
In an example, sample search content and sample web page information may be input into a deep learning model to obtain a relevance result. And obtaining a loss value based on the correlation result and the correlation label. And adjusting parameters of the deep learning model based on the loss value until the deep learning model converges. The deep learning model that reaches convergence is taken as a trained deep learning model.
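The loop of predicting, computing a loss against the relevance label, and adjusting parameters until convergence can be illustrated with a deliberately tiny stand-in model. The sketch below swaps the deep learning model for a logistic regression over numeric features and uses plain gradient descent with a loss-change threshold as the convergence test; all of these choices, and every name in the code, are assumptions made for illustration.

```python
import math


def train_until_converged(samples, lr=0.5, tol=1e-6, max_epochs=10000):
    """Fit a logistic model by gradient descent until the loss stops changing.

    samples: list of (features, label) pairs, label 1 for relevant and 0 otherwise.
    Returns the weights, the bias, and the last computed mean loss.
    """
    n_features = len(samples[0][0])
    weights, bias = [0.0] * n_features, 0.0
    previous_loss = float("inf")
    loss = previous_loss
    for _ in range(max_epochs):
        loss, grad_w, grad_b = 0.0, [0.0] * n_features, 0.0
        for features, label in samples:
            z = sum(w * x for w, x in zip(weights, features)) + bias
            prob = 1.0 / (1.0 + math.exp(-z))  # relevance result
            # Binary cross-entropy between the relevance result and the label.
            loss -= label * math.log(prob + 1e-12) + (1 - label) * math.log(1 - prob + 1e-12)
            for i, x in enumerate(features):
                grad_w[i] += (prob - label) * x
            grad_b += prob - label
        loss /= len(samples)
        # Adjust parameters using the averaged gradient of the loss.
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
        if abs(previous_loss - loss) < tol:
            break  # treated as convergence
        previous_loss = loss
    return weights, bias, loss
```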
In an example, the sample search content, sample web page, and sample web page information shown in fig. 7 are defined and processed in the same or a similar way as the search content, web page to be matched, and web page information shown in fig. 2; they differ only in the embodiment in which they appear. Likewise, terms in the training method of the deep learning model that differ from terms in the search method only by the word "sample" are interpreted and processed in the same or a similar way, and are not described again here.
According to the embodiment of the disclosure, page structured information is converted into sample webpage information in a natural language form, fine tuning training is performed on a deep learning model such as a generated model, the sample form can be kept consistent with pre-training as much as possible, the processing capability of the deep learning model such as the generated model on natural language data is fully exerted, the problem that external features are difficult to introduce based on a large-scale pre-training model is solved, and training speed and training precision are improved.
Fig. 8 schematically illustrates a flowchart of a training method of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 8, the deep learning model may be trained based on the sample search content, the sample web page information of the sample web page, and the relevance label, resulting in a trained deep learning model.
As shown in fig. 8, the sample search content and the sample web page information of a sample web page are used as a sample information sequence 810, and the sample information sequence 810 is input into the deep learning model M810 to obtain a relevance result 820 and a reason result 830. Based on the relevance result 820 and the relevance label 840, a first loss value 850 is obtained. Based on the reason result 830 and the reason label 860, a second loss value 870 is obtained. The reason label is used to characterize the reason why the sample search content is or is not relevant to the sample web page. Based on the first loss value 850 and the second loss value 870, the deep learning model is trained, resulting in a trained deep learning model.
According to the embodiment of the disclosure, generating the relevance result, which indicates whether the sample search content and the sample web page are relevant, can be taken as one training task, and generating the reason result can be taken as another training task, and the multiple tasks can be trained simultaneously, so that the accuracy of the relevance determination task is improved through multi-task training.
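How the two loss values are combined is not specified in the text. One possible combination, shown here purely as an assumption, is a weighted sum of a binary cross-entropy on the relevance result and a mean negative log-likelihood over the tokens of the reason label. All function and parameter names are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)


def relevance_loss(predicted_relevance, relevance_label):
    """First loss: binary cross-entropy between relevance result and label (0 or 1)."""
    return -(relevance_label * math.log(predicted_relevance + EPS)
             + (1 - relevance_label) * math.log(1 - predicted_relevance + EPS))


def reason_loss(reason_token_probs):
    """Second loss: mean negative log-likelihood of the reason label's tokens.

    reason_token_probs: probability the model assigns to each token of the reason label.
    """
    return -sum(math.log(p + EPS) for p in reason_token_probs) / len(reason_token_probs)


def multitask_loss(predicted_relevance, relevance_label, reason_token_probs,
                   reason_weight=1.0):
    """Weighted sum of the two task losses (the weighting is an assumption)."""
    return (relevance_loss(predicted_relevance, relevance_label)
            + reason_weight * reason_loss(reason_token_probs))
```

Setting `reason_weight` to 0 reduces this to single-task training on relevance alone, which makes the contribution of the reason task easy to isolate.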
According to an embodiment of the present disclosure, generating the sample web page information set may include: sample layout blocks are determined from the sample web page. And obtaining sample webpage information based on the sample block attribute and the sample block content of the sample layout block. A sample web page information set is generated based on the plurality of sample web page information.
According to an embodiment of the present disclosure, determining, based on the sample page element, a sample layout block for characterizing a page layout structure of the sample web page from the sample web page may include: and determining sample structural information of the sample webpage based on the sample page type and the sample layout mapping relation of the sample webpage. Sample layout blocks are determined from the sample web page based on sample structural information of the sample web page and sample page elements.
According to an embodiment of the present disclosure, determining a sample layout block from a sample web page based on sample page sub-elements may include: based on the sample page sub-elements, a sample initial layout block is determined from the sample web page. In the case where the sample block type of the sample initial layout block is determined to be the sample predetermined block type, the sample layout block is determined based on the sample initial layout block.
According to an embodiment of the present disclosure, determining sample structured information for each sample web page may include: and determining at least one sample to-be-evaluated structural information of the sample evaluation webpage. Structured information is evaluated for each sample. And determining a sample evaluation layout block of the sample evaluation webpage from the sample evaluation webpage based on the sample to-be-evaluated structural information and the sample evaluation page element of the sample evaluation webpage. And determining a sample layout mapping relation between the sample page type of the sample evaluation webpage and sample to-be-evaluated structural information based on the sample evaluation search content, the sample evaluation block attribute of the sample evaluation layout block and the sample evaluation block content. Sample relevance between sample evaluation search content and sample evaluation web page is known. And determining the structural information of the sample webpage based on the sample page type and sample layout mapping relation of the sample webpage.
According to an embodiment of the present disclosure, determining a sample target web page matching sample search content from a plurality of sample web pages based on sample search content and sample web page information of each of the plurality of sample web pages may include: for web page information of each sample web page, a sample information sequence is generated based on the sample search content, the search attribute of the sample search content, and the sample web page information. And inputting the sample information sequence into the deep learning model to obtain a correlation result and a reason result.
According to an embodiment of the present disclosure, determining sample search attributes of sample search content may include: at least one sample initial search attribute of sample search content is determined. Sample user attributes and sample user historical attention information are determined. Sample search attributes of sample search content are determined from the at least one sample initial search attribute based on the sample user attributes and the sample user historical interest information.
According to an exemplary embodiment of the present disclosure, generating a sample information sequence based on sample search content, sample search attributes of the sample search content, and sample web page information may include: and according to the sample preset hierarchical relationship, sequencing the sample searching content, the sample searching attribute of the sample searching content and the sample webpage information to obtain a sample information sequence.
According to embodiments of the present disclosure, there may be a plurality of sample layout blocks. The sample predetermined hierarchical relationship may include a sample layout order of a plurality of sample page sub-elements in one-to-one correspondence with the plurality of sample layout blocks.
According to an exemplary embodiment of the present disclosure, the sample web page information may include: sample page attributes, sample block content, and sample content attributes. According to a predetermined hierarchical relationship of the sample, sorting the sample search content, the sample search attribute of the sample search content, and the sample web page information to obtain a sample information sequence may include: and according to the sample preset hierarchical relationship, sorting the sample search content, the sample search attribute of the sample search content, the sample page attribute, the sample block content and the sample content attribute to obtain a sample information sequence.
In an example, the training method further comprises: in a case where the sample block content includes a sample image, performing semantic analysis on the sample image to obtain sample semantic text describing the sample image; and obtaining the sample block content based on the sample semantic text.
Fig. 9 schematically shows a block diagram of a search apparatus according to an embodiment of the present disclosure.
As shown in fig. 9, the search apparatus 900 includes: a web page determination module 910 and a matching module 920.
The web page determining module 910 is configured to determine, in response to receiving the search content, web page information of each of the plurality of web pages to be matched. The webpage information comprises block attributes and block contents of layout blocks of the webpage to be matched, wherein the block attributes of the layout blocks are used for representing structural information of the webpage to be matched, and the block contents of the layout blocks comprise page contents of the webpage to be matched.
The matching module 920 is configured to determine, from the plurality of webpages to be matched, a target webpage that matches the search content based on the search content and the webpage information of each of the plurality of webpages to be matched.
According to an embodiment of the present disclosure, the search apparatus further includes: an element determination module, a block determination module, and an information determination module.
The element determination module is configured to determine page elements of the web page to be matched.
The block determination module is configured to determine the layout blocks from the web page to be matched based on the page elements.
The information determination module is configured to obtain the web page information based on the block attributes and the block contents of the layout blocks.
According to an embodiment of the present disclosure, the block determination module includes: the structure determination submodule and the block determination submodule.
The structure determination submodule is used for determining structural information of the webpage to be matched based on the page type and the layout mapping relation of the webpage to be matched. The layout mapping relationship is used for representing the association relationship between the page type and the structural information.
And the block determination submodule is used for determining the layout block from the webpage to be matched based on the structural information and the page elements of the webpage to be matched.
According to an embodiment of the present disclosure, the search apparatus further includes an evaluation block determination module and a relation determination module.
The evaluation block determination module is configured to determine evaluation layout blocks of an evaluation webpage from the evaluation webpage based on structural information to be evaluated and evaluation page elements of the evaluation webpage.
The relation determination module is configured to determine the mapping relationship between the page type of the evaluation webpage and the structural information to be evaluated based on evaluation search content, evaluation block attributes of the evaluation layout blocks, and evaluation block contents. The correlation between the evaluation search content and the evaluation webpage is known.
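One way to read the evaluation step is as a selection among candidate structures: each candidate is used to split the evaluation webpages into blocks, and the candidate whose blocks best reproduce the known correlations is recorded in the mapping. The sketch below follows that reading. The token-overlap scorer, the 0.5 threshold, the accuracy criterion, and every function name are assumptions made for illustration; the source does not specify how candidates are compared.

```python
def overlap_score(query, blocks):
    """Toy relevance proxy: fraction of query tokens found in the block contents."""
    query_tokens = set(query.lower().split())
    page_tokens = set(" ".join(b["content"] for b in blocks).lower().split())
    return len(query_tokens & page_tokens) / len(query_tokens) if query_tokens else 0.0


def choose_structure(candidates, evaluation_set, split_blocks, threshold=0.5):
    """Pick the candidate structure that best reproduces the known correlations.

    candidates:     name -> structural information to be evaluated
    evaluation_set: list of (evaluation query, evaluation page, known_relevant)
    split_blocks:   callable(page, structure) -> list of evaluation layout blocks
    """

    def accuracy(structure):
        hits = 0
        for query, page, known_relevant in evaluation_set:
            blocks = split_blocks(page, structure)
            predicted = overlap_score(query, blocks) >= threshold
            hits += predicted == known_relevant
        return hits / len(evaluation_set)

    return max(candidates, key=lambda name: accuracy(candidates[name]))


# Usage with pages represented as region -> text (also an assumption).
def split_by_regions(page, structure):
    return [{"attribute": r, "content": page[r]} for r in structure if r in page]


page = {"title": "Blue widget", "body": "price 10 dollars", "sidebar": "red gadget sale"}
evaluation_set = [("blue widget price", page, True), ("red gadget", page, False)]
candidates = {
    "with_sidebar": ["title", "body", "sidebar"],
    "main_only": ["title", "body"],
}
best = choose_structure(candidates, evaluation_set, split_by_regions)
```

In this toy data, including the sidebar makes the page look relevant to an unrelated query, so the structure that excludes it is preferred.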
According to an embodiment of the present disclosure, the block determination submodule includes an element determination unit and a block determination unit.
The element determination unit is configured to determine page subelements for identifying the layout blocks based on the page elements and the structural information of the webpage to be matched.
The block determination unit is configured to determine the layout blocks from the webpage to be matched based on the page subelements.
According to an embodiment of the present disclosure, the block determination unit includes a first block determination subunit and a second block determination subunit.
The first block determination subunit is configured to determine an initial layout block from the webpage to be matched based on the page subelements.
The second block determination subunit is configured to determine the layout block based on the initial layout block in a case where the block type of the initial layout block is determined to be a predetermined block type.
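The two-stage determination amounts to a filter over the initial layout blocks. A minimal sketch follows, assuming that each initial block carries a `type` field and that the predetermined block types are a fixed set; both the field name and the listed types are hypothetical, since the source does not enumerate them.

```python
# Hypothetical predetermined block types; the source does not list them.
PREDETERMINED_BLOCK_TYPES = frozenset({"title", "body", "table"})


def select_layout_blocks(initial_blocks, allowed_types=PREDETERMINED_BLOCK_TYPES):
    """Keep only the initial layout blocks whose type is a predetermined type."""
    return [block for block in initial_blocks if block["type"] in allowed_types]


initial = [
    {"type": "title", "content": "Blue Widget"},
    {"type": "advertisement", "content": "Buy now"},
    {"type": "body", "content": "A sturdy widget."},
]
layout_blocks = select_layout_blocks(initial)
```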
According to an embodiment of the present disclosure, the search apparatus further includes a semantic conversion module and a content determination module.
The semantic conversion module is configured to perform, in a case where the block content is determined to include an image, semantic analysis on the image to obtain a semantic text describing the image.
The content determination module is configured to obtain the block content based on the semantic text.
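The conversion of image content into text can be sketched as below. The captioning step is passed in as a callable because the source does not name a model; `fake_describe_image` is a stand-in that derives a description from the file name, and the `kind`, `text`, and `source` fields are assumptions made for the example.

```python
def block_content_as_text(items, describe_image):
    """Build text-only block content, replacing each image by its semantic text."""
    texts = []
    for item in items:
        if item["kind"] == "image":
            texts.append(describe_image(item["source"]))
        else:
            texts.append(item["text"])
    return " ".join(t for t in texts if t)


def fake_describe_image(source):
    """Stand-in for an image captioning model (assumption, not a real model)."""
    name = source.rsplit("/", 1)[-1].split(".")[0].replace("_", " ")
    return f"image showing {name}"


items = [
    {"kind": "text", "text": "Our best seller:"},
    {"kind": "image", "source": "img/blue_widget.png"},
]
content = block_content_as_text(items, fake_describe_image)
```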
According to an embodiment of the present disclosure, the matching module includes a generation submodule, a correlation determination submodule, and a matching submodule.
The generation submodule is configured to generate, for the webpage information of each webpage to be matched, an information sequence based on the search content, a search attribute of the search content, and the webpage information.
The correlation determination submodule is configured to determine, based on the information sequence, a degree of correlation between the search content and the webpage to be matched, so as to obtain a plurality of degrees of correlation.
The matching submodule is configured to determine, from the plurality of webpages to be matched, the target webpage that matches the search content based on the plurality of degrees of correlation.
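The scoring and matching steps can be sketched as a rank-and-select over per-page information sequences. In the sketch the relevance model is an arbitrary callable; `toy_relevance_model`, the `[PAGE]` separator, and the top-k selection rule are assumptions for illustration only and stand in for the deep learning model described in the disclosure.

```python
def match_target_webpages(information_sequences, relevance_model, top_k=1):
    """Score every information sequence and return the top-k page ids.

    information_sequences: page id -> information sequence (a string here)
    relevance_model:       callable(sequence) -> degree of correlation
    """
    relevances = {
        page_id: relevance_model(sequence)
        for page_id, sequence in information_sequences.items()
    }
    # Highest degree of correlation first; ties broken by page id for determinism.
    ranked = sorted(relevances, key=lambda page_id: (-relevances[page_id], page_id))
    return ranked[:top_k], relevances


def toy_relevance_model(sequence):
    """Token overlap between the query part and the page part (illustrative only)."""
    query, _, page = sequence.partition(" [PAGE] ")
    query_tokens = set(query.lower().split())
    page_tokens = set(page.lower().split())
    return len(query_tokens & page_tokens) / len(query_tokens) if query_tokens else 0.0


sequences = {
    "page_a": "blue widget [PAGE] red gadget on sale",
    "page_b": "blue widget [PAGE] the blue widget ships today",
}
targets, relevances = match_target_webpages(sequences, toy_relevance_model)
```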
According to an embodiment of the present disclosure, the generation submodule includes a generation unit.
The generation unit is configured to sort the search content, the search attribute of the search content, and the webpage information according to a predetermined hierarchical relationship to obtain the information sequence.
According to an embodiment of the present disclosure, there are a plurality of layout blocks, and the predetermined hierarchical relationship includes a layout order of a plurality of page subelements that correspond one-to-one to the plurality of layout blocks.
According to an embodiment of the present disclosure, the webpage information includes page attributes, block content, and content attributes.
According to an embodiment of the present disclosure, the generation unit includes a generation subunit.
The generation subunit is configured to sort the search content, the search attribute of the search content, the page attributes, the block content, and the content attributes according to the predetermined hierarchical relationship to obtain the information sequence.
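The ordering can be sketched as a simple serialization: query-level fields first, then page-level fields, then the layout blocks in their layout order. The bracketed marker tokens, the `order` field, and the function name are assumptions chosen for readability; the source only requires that the parts follow a predetermined hierarchical relationship.

```python
def build_information_sequence(query, search_attribute, page_attributes, blocks):
    """Serialize the query, its attribute, and the webpage information in a fixed order."""
    parts = [
        f"[QUERY] {query}",
        f"[QUERY_ATTR] {search_attribute}",
        f"[PAGE_ATTR] {page_attributes}",
    ]
    # Blocks follow the layout order of their page subelements.
    for block in sorted(blocks, key=lambda b: b["order"]):
        parts.append(f"[BLOCK:{block['attribute']}] {block['content']}")
        if block.get("content_attribute"):
            parts.append(f"[CONTENT_ATTR] {block['content_attribute']}")
    return " ".join(parts)


sequence = build_information_sequence(
    "blue widget",
    "shopping",
    "product page",
    [
        {"order": 2, "attribute": "body", "content": "A sturdy widget.",
         "content_attribute": "text"},
        {"order": 1, "attribute": "title", "content": "Blue Widget"},
    ],
)
```

A fixed order lets the model learn where each kind of information appears, which is presumably why the disclosure ties the sequence to the layout hierarchy.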
According to an embodiment of the present disclosure, the search apparatus further includes a response module, an attention module, and an attribute determination module.
The response module is configured to determine at least one initial search attribute of the search content in response to receiving the search content.
The attention module is configured to determine a user attribute and user historical attention information.
The attribute determination module is configured to determine the search attribute of the search content from the at least one initial search attribute based on the user attribute and the user historical attention information.
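Selecting one search attribute from several candidates can be sketched as a weighted vote. The weighting below (one point per occurrence in the user's history plus one point if the attribute is among the user's declared interests) is purely an assumption, as are the `interests` key and the function name.

```python
from collections import Counter


def select_search_attribute(initial_attributes, user_attributes, history):
    """Pick the initial search attribute best supported by the user's profile.

    initial_attributes: candidate attributes of the search content
    user_attributes:    dict with an optional "interests" list (hypothetical schema)
    history:            attributes the user paid attention to in the past
    """
    counts = Counter(history)
    interests = set(user_attributes.get("interests", []))

    def weight(attribute):
        return counts[attribute] + (1 if attribute in interests else 0)

    # max() returns the first candidate on ties, so candidate order is the tiebreak.
    return max(initial_attributes, key=weight)


attribute = select_search_attribute(
    ["food", "electronics"],
    {"interests": ["electronics"]},
    ["electronics", "food", "electronics"],
)
```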
Fig. 10 schematically illustrates a block diagram of a training apparatus of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 10, the training apparatus 1000 for the deep learning model includes a sample webpage determination module 1010 and a training module 1020.
The sample webpage determination module 1010 is configured to determine sample webpage information of a sample webpage. The sample webpage information includes sample block attributes and sample block contents of sample layout blocks of the sample webpage, where the sample block attributes of the sample layout blocks characterize the page layout structure of the sample webpage, and the sample block contents of the sample layout blocks include the page contents of the sample webpage.
The training module 1020 is configured to train the deep learning model based on sample search content, the sample webpage information of the sample webpage, and a relevance label to obtain a trained deep learning model. The relevance label characterizes the relevance between the sample search content and the sample webpage.
According to an embodiment of the present disclosure, the training module includes an input submodule, a first loss submodule, a second loss submodule, and a training submodule.
The input submodule is configured to input the sample search content and the sample webpage information of the sample webpage into the deep learning model to obtain a relevance result and a reason result.
The first loss submodule is configured to obtain a first loss value based on the relevance result and the relevance label.
The second loss submodule is configured to obtain a second loss value based on the reason result and a reason label, where the reason label characterizes the reason why the sample search content is or is not relevant to the sample webpage.
The training submodule is configured to train the deep learning model based on the first loss value and the second loss value to obtain the trained deep learning model.
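How the two loss values might be combined can be sketched numerically. The sketch assumes that the relevance result is a probability scored with binary cross-entropy, that the reason result is a distribution over a fixed set of reasons scored with cross-entropy, and that the two are combined by a weighted sum. None of these choices is stated in the source, which only says that training is based on both loss values.

```python
import math

EPS = 1e-12  # guards against log(0)


def binary_cross_entropy(probability, label):
    """First loss: relevance result vs. relevance label (label is 0 or 1)."""
    p = min(max(probability, EPS), 1.0 - EPS)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def cross_entropy(probabilities, label_index):
    """Second loss: reason result vs. reason label (index into a fixed reason set)."""
    return -math.log(max(probabilities[label_index], EPS))


def total_loss(relevance_prob, relevance_label, reason_probs, reason_label,
               reason_weight=1.0):
    """Weighted sum of the two losses; the weighting scheme is an assumption."""
    first = binary_cross_entropy(relevance_prob, relevance_label)
    second = cross_entropy(reason_probs, reason_label)
    return first + reason_weight * second, first, second


loss, first_loss, second_loss = total_loss(0.5, 1, [0.5, 0.5], 0)
```

In an actual training loop the combined value would be backpropagated through the model; the sketch stops at the loss computation.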
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as in an embodiment of the present disclosure.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium stores computer instructions for causing a computer to perform a method as in an embodiment of the present disclosure.
According to an embodiment of the present disclosure, a computer program product includes a computer program which, when executed by a processor, implements a method as in an embodiment of the present disclosure.
Fig. 11 illustrates a schematic block diagram of an example electronic device 1100 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 11, the device 1100 includes a computing unit 1101 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a Random Access Memory (RAM) 1103. The RAM 1103 can also store various programs and data required for the operation of the device 1100. The computing unit 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
Various components in device 1100 are connected to an input/output (I/O) interface 1105, including: an input unit 1106 such as a keyboard, a mouse, etc.; an output unit 1107 such as various types of displays, speakers, and the like; a storage unit 1108, such as a magnetic disk, optical disk, etc.; and a communication unit 1109 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1101 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1101 performs the respective methods and processes described above, such as a search method. For example, in some embodiments, the search method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 1108. In some embodiments, some or all of the computer programs may be loaded and/or installed onto device 1100 via ROM 1102 and/or communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the search method described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform the search method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (31)

1. A search method, comprising:
determining respective webpage information of a plurality of webpages to be matched in response to receiving search content, wherein the webpage information comprises block attributes and block contents of layout blocks of the webpages to be matched, the block attributes of the layout blocks are used for representing structural information of the webpages to be matched, and the block contents of the layout blocks comprise page contents of the webpages to be matched; and
determining, from the plurality of webpages to be matched, a target webpage that matches the search content based on the search content and the respective webpage information of the plurality of webpages to be matched.
2. The method of claim 1, further comprising:
determining page elements of the webpage to be matched;
determining the layout block from the webpage to be matched based on the page elements; and
obtaining the webpage information based on the block attribute and the block content of the layout block.
3. The method of claim 2, wherein the determining the layout block from the webpage to be matched based on the page elements comprises:
determining the structural information of the webpage to be matched based on a page type of the webpage to be matched and a layout mapping relationship, wherein the layout mapping relationship is used for representing an association relationship between the page type and the structural information; and
determining the layout block from the webpage to be matched based on the structural information of the webpage to be matched and the page elements.
4. The method of claim 3, further comprising:
determining an evaluation layout block of an evaluation webpage from the evaluation webpage based on structural information to be evaluated and evaluation page elements of the evaluation webpage; and
determining a mapping relationship between a page type of the evaluation webpage and the structural information to be evaluated based on evaluation search content, an evaluation block attribute of the evaluation layout block, and evaluation block content, wherein a correlation between the evaluation search content and the evaluation webpage is known.
5. The method according to claim 3 or 4, wherein the determining the layout block from the web page to be matched based on the structural information of the web page to be matched and the page element comprises:
determining page subelements for identifying layout blocks based on the page elements and the structural information of the web pages to be matched; and
determining the layout block from the webpage to be matched based on the page subelements.
6. The method of claim 5, wherein the determining the layout block from the webpage to be matched based on the page subelements comprises:
determining an initial layout block from the webpage to be matched based on the page subelements; and
determining the layout block based on the initial layout block in a case where a block type of the initial layout block is determined to be a predetermined block type.
7. The method of any one of claims 1 to 6, further comprising:
performing, in a case where the block content is determined to include an image, semantic analysis on the image to obtain a semantic text describing the image; and
obtaining the block content based on the semantic text.
8. The method of any of claims 1-7, wherein the determining, from the plurality of web pages to be matched, a target web page that matches the search content based on the search content and web page information for each of the plurality of web pages to be matched, comprises:
generating, for the webpage information of each webpage to be matched, an information sequence based on the search content, a search attribute of the search content, and the webpage information;
determining, based on the information sequence, a correlation degree between the search content and the webpage to be matched, to obtain a plurality of correlation degrees; and
determining, from the plurality of webpages to be matched, the target webpage that matches the search content based on the plurality of correlation degrees.
9. The method of claim 8, wherein the generating an information sequence based on the search content, the search attribute of the search content, and the webpage information comprises:
sorting the search content, the search attribute of the search content, and the webpage information according to a predetermined hierarchical relationship to obtain the information sequence.
10. The method of claim 9, wherein there are a plurality of layout blocks, and the predetermined hierarchical relationship includes a layout order of a plurality of page subelements in one-to-one correspondence with the plurality of layout blocks.
11. The method of claim 9 or 10, wherein the webpage information comprises: page attributes, block content, and content attributes;
the sorting the search content, the search attribute of the search content, and the webpage information according to a predetermined hierarchical relationship to obtain the information sequence comprises:
sorting the search content, the search attribute of the search content, the page attributes, the block content, and the content attributes according to the predetermined hierarchical relationship to obtain the information sequence.
12. The method of any one of claims 8 to 11, further comprising:
determining at least one initial search attribute of the search content in response to receiving the search content;
determining a user attribute and user historical attention information; and
determining the search attribute of the search content from the at least one initial search attribute based on the user attribute and the user historical attention information.
13. A training method of a deep learning model, comprising:
determining sample webpage information of a sample webpage, wherein the sample webpage information comprises sample block attributes and sample block contents of sample layout blocks of the sample webpage, the sample block attributes of the sample layout blocks are used for representing a page layout structure of the sample webpage, and the sample block contents of the sample layout blocks comprise page contents of the sample webpage; and
training the deep learning model based on sample search content, the sample webpage information of the sample webpage, and a relevance label to obtain a trained deep learning model, wherein the relevance label is used for representing the relevance between the sample search content and the sample webpage.
14. The method of claim 13, wherein the training the deep learning model based on the sample search content, the sample webpage information of the sample webpage, and the relevance label to obtain a trained deep learning model comprises:
inputting the sample search content and the sample webpage information of the sample webpage into the deep learning model to obtain a relevance result and a reason result;
obtaining a first loss value based on the relevance result and the relevance label;
obtaining a second loss value based on the reason result and a reason label, wherein the reason label is used for representing the reason why the sample search content is or is not relevant to the sample webpage; and
training the deep learning model based on the first loss value and the second loss value to obtain the trained deep learning model.
15. A search apparatus comprising:
a webpage determination module configured to determine, in response to receiving search content, respective webpage information of a plurality of webpages to be matched, wherein the webpage information comprises block attributes and block contents of layout blocks of the webpages to be matched, the block attributes of the layout blocks are used for representing structural information of the webpages to be matched, and the block contents of the layout blocks comprise page contents of the webpages to be matched; and
a matching module configured to determine, from the plurality of webpages to be matched, a target webpage that matches the search content based on the search content and the respective webpage information of the plurality of webpages to be matched.
16. The apparatus of claim 15, further comprising:
an element determination module configured to determine page elements of the webpage to be matched;
a block determination module configured to determine the layout block from the webpage to be matched based on the page elements; and
an information determination module configured to obtain the webpage information based on the block attribute and the block content of the layout block.
17. The apparatus of claim 16, wherein the block determination module comprises:
a structure determination submodule configured to determine the structural information of the webpage to be matched based on a page type of the webpage to be matched and a layout mapping relationship, wherein the layout mapping relationship is used for representing an association relationship between the page type and the structural information; and
a block determination submodule configured to determine the layout block from the webpage to be matched based on the structural information of the webpage to be matched and the page elements.
18. The apparatus of claim 17, further comprising:
an evaluation block determination module configured to determine an evaluation layout block of an evaluation webpage from the evaluation webpage based on structural information to be evaluated and evaluation page elements of the evaluation webpage; and
a relation determination module configured to determine a mapping relationship between a page type of the evaluation webpage and the structural information to be evaluated based on evaluation search content, an evaluation block attribute of the evaluation layout block, and evaluation block content, wherein a correlation between the evaluation search content and the evaluation webpage is known.
19. The apparatus of claim 17 or 18, wherein the block determination submodule comprises:
an element determination unit configured to determine page subelements for identifying layout blocks based on the page elements and the structural information of the webpage to be matched; and
a block determination unit configured to determine the layout block from the webpage to be matched based on the page subelements.
20. The apparatus of claim 19, wherein the block determination unit comprises:
a first block determination subunit configured to determine an initial layout block from the webpage to be matched based on the page subelements; and
a second block determination subunit configured to determine the layout block based on the initial layout block in a case where a block type of the initial layout block is determined to be a predetermined block type.
21. The apparatus of any one of claims 15 to 20, further comprising:
a semantic conversion module configured to perform, in a case where the block content is determined to include an image, semantic analysis on the image to obtain a semantic text describing the image; and
a content determination module configured to obtain the block content based on the semantic text.
22. The apparatus of any of claims 15 to 21, wherein the matching module comprises:
a generation submodule configured to generate, for the webpage information of each webpage to be matched, an information sequence based on the search content, a search attribute of the search content, and the webpage information;
a correlation determination submodule configured to determine, based on the information sequence, a correlation degree between the search content and the webpage to be matched, to obtain a plurality of correlation degrees; and
a matching submodule configured to determine, from the plurality of webpages to be matched, the target webpage that matches the search content based on the plurality of correlation degrees.
23. The apparatus of claim 22, wherein the generation submodule comprises:
a generation unit configured to sort the search content, the search attribute of the search content, and the webpage information according to a predetermined hierarchical relationship to obtain the information sequence.
24. The apparatus of claim 23, wherein there are a plurality of layout blocks, and the predetermined hierarchical relationship comprises a layout order of a plurality of page subelements in one-to-one correspondence with the plurality of layout blocks.
25. The apparatus of claim 23 or 24, wherein the webpage information comprises: page attributes, block content, and content attributes;
the generation unit comprises:
a generation subunit configured to sort the search content, the search attribute of the search content, the page attributes, the block content, and the content attributes according to the predetermined hierarchical relationship to obtain the information sequence.
26. The apparatus of any one of claims 22 to 25, further comprising:
a response module configured to determine at least one initial search attribute of the search content in response to receiving the search content;
an attention module configured to determine a user attribute and user historical attention information; and
an attribute determination module configured to determine the search attribute of the search content from the at least one initial search attribute based on the user attribute and the user historical attention information.
27. A training device for a deep learning model, comprising:
a sample webpage determination module configured to determine sample webpage information of a sample webpage, wherein the sample webpage information comprises sample block attributes and sample block contents of sample layout blocks of the sample webpage, the sample block attributes of the sample layout blocks are used for representing a page layout structure of the sample webpage, and the sample block contents of the sample layout blocks comprise page contents of the sample webpage; and
a training module configured to train the deep learning model based on sample search content, the sample webpage information of the sample webpage, and a relevance label to obtain a trained deep learning model, wherein the relevance label is used for representing the relevance between the sample search content and the sample webpage.
28. The apparatus of claim 27, wherein the training module comprises:
an input submodule configured to input the sample search content and the sample webpage information of the sample webpage into the deep learning model to obtain a relevance result and a reason result;
a first loss submodule configured to obtain a first loss value based on the relevance result and the relevance label;
a second loss submodule configured to obtain a second loss value based on the reason result and a reason label, wherein the reason label is used for representing the reason why the sample search content is or is not relevant to the sample webpage; and
a training submodule configured to train the deep learning model based on the first loss value and the second loss value to obtain the trained deep learning model.
29. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 14.
30. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 14.
31. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 14.
CN202410101634.XA 2024-01-24 2024-01-24 Search method, training device, training equipment, training medium and training program product Pending CN117909560A (en)

Priority Applications (1)

| Application Number | Publication | Priority Date | Filing Date | Title |
| --- | --- | --- | --- | --- |
| CN202410101634.XA | CN117909560A (en) | 2024-01-24 | 2024-01-24 | Search method, training device, training equipment, training medium and training program product |

Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN117909560A (en) | 2024-04-19 |

Family

ID=90690551

Family Applications (1)

| Application Number | Status | Publication | Priority Date | Filing Date | Title |
| --- | --- | --- | --- | --- | --- |
| CN202410101634.XA | Pending | CN117909560A (en) | 2024-01-24 | 2024-01-24 | Search method, training device, training equipment, training medium and training program product |

Country Status (1)

| Country | Link |
| --- | --- |
| CN (1) | CN117909560A (en) |

Legal Events

| Code | Title |
| --- | --- |
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |