CN117009621A - Information searching method, device, electronic equipment, storage medium and program product - Google Patents

Information searching method, device, electronic equipment, storage medium and program product

Info

Publication number
CN117009621A
Authority
CN
China
Prior art keywords
information
rewriting
query
model
query information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211530257.9A
Other languages
Chinese (zh)
Inventor
康战辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211530257.9A
Publication of CN117009621A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The embodiment of the application provides an information searching method, an information searching device, electronic equipment, a storage medium and a program product, relates to the technical field of artificial intelligence, and can be applied to information searching scenarios. The method specifically comprises the following steps: acquiring input query information; inputting the query information into an information rewriting model, and rewriting the query information through the information rewriting model to obtain rewritten query information corresponding to the query information, wherein the information rewriting model is trained on information rewriting sample pairs, and an information rewriting sample pair comprises a pair of query information samples whose viewing operations correspond to the same query result; and searching in a search engine based on the rewritten query information to obtain at least one search result, wherein the query result is at least one result viewed among the search results. The application rewrites the query information through the information rewriting model, which enriches the expression diversity of the rewritten query information while preserving semantic relevance and improves the recall capability of the search engine.

Description

Information searching method, device, electronic equipment, storage medium and program product
Technical Field
The present application relates to the field of artificial intelligence technology and information searching technology, and in particular, to an information searching method, apparatus, electronic device, storage medium and program product.
Background
Search engines have become increasingly popular as tools through which operation objects obtain information of interest from the Internet and various types of UGC (User Generated Content). Since the query words input by the operation object may contain miswritten words or may deviate semantically from the desired target result to some extent, it is difficult for the search engine to effectively return the target result the operation object desires. In the prior art, this technical problem is generally addressed with automatic query term rewriting; to ensure that the rewriting does not change the content originally intended by the operation object, relatively precise synonym replacement techniques are often adopted to rewrite the query words input by the operation object, for example by replacing a term in the input query with an equivalent term from a collected synonym dictionary.
The existing query word rewriting technology based on synonym dictionary replacement has the following problems: on the one hand, the synonym dictionary it depends on is mainly collected manually and has limited coverage, and further expanding that coverage greatly increases labor cost; on the other hand, although rewriting based on synonym replacement keeps the semantics consistent before and after rewriting, the rewritten query words are so uniform that the results recalled by the search engine for the query words before and after rewriting converge and lack content diversity.
Disclosure of Invention
The embodiment of the application provides an information searching method, an information searching device, electronic equipment, a storage medium and a program product for solving at least one technical problem. The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides an information searching method, including:
acquiring input query information;
inputting the query information into an information rewriting model, and rewriting the query information through the information rewriting model to obtain rewritten query information corresponding to the query information; the information rewriting model is trained on information rewriting sample pairs; an information rewriting sample pair comprises a pair of query information samples whose viewing operations correspond to the same query result;
searching in a search engine based on the rewritten query information to obtain at least one search result;
wherein the query result is at least one result viewed in the search result.
In a possible embodiment, the information-rewriting sample pair includes a positive sample pair and a negative sample pair;
the information rewriting sample pair is constructed by:
determining a training sample based on the historical search data;
forming different historical query information corresponding to the same historical query result in the training samples into a positive sample pair; a group of positive sample pairs includes two different pieces of historical query information;
randomly forming a negative sample pair from historical query information corresponding to different historical query results in the training samples; a group of negative sample pairs includes two different pieces of historical query information.
In a possible embodiment, the determining training samples based on historical search data includes:
acquiring historical search data;
and screening, from the historical search data, historical query results whose view counts reach a preset threshold and the historical query information corresponding to those historical query results, to obtain training samples.
In a possible embodiment, each positive sample pair has a corresponding sample similarity level;
the construction of the information-rewriting sample pairs further includes, for each positive sample pair, performing the following operations:
determining the semantic similarity of the positive sample pair based on the minimum view count on the historical query result in the positive sample pair and the maximum search count of the historical query information;
and determining a sample similarity level corresponding to the positive sample pair based on the semantic similarity through a mapping relation between a preset similarity value range and a preset similarity level.
In a possible embodiment, the information rewriting model is trained based on a pre-training model, and the pre-training model is a double-tower language model with a random inactivation layer;
training the information rewriting model according to the information rewriting sample pair, including:
inputting the same information rewriting sample pair into the information rewriting model at least twice to obtain at least two different vector representations;
network parameters of the information rewriting model are adjusted based on the at least two different vector representations.
In a possible embodiment, inputting the information rewriting sample pair into the information rewriting model to obtain a vector representation includes:
performing word segmentation on each historical query information in the information rewriting sample pair to obtain two word segmentation results;
extracting feature information of each piece of historical query information based on the word segmentation results, and determining the sub-vector representation corresponding to each piece of historical query information based on the feature information;
calculating a sum and a difference between the two sub-vector representations;
and determining a vector representation of the information-rewritten sample pair based on the difference value, the sum value, and the two sub-vector representations.
In a possible embodiment, said adjusting network parameters of said information rewriting model based on at least two different vector representations comprises:
Determining the corresponding cross entropy loss values and relative entropy loss values of the at least two different vector representations;
determining a predicted loss value based on the cross entropy loss value and the relative entropy loss value;
and adjusting network parameters of the information rewriting model based on the predicted loss value.
In a possible embodiment, the determining cross entropy loss values and relative entropy loss values corresponding to the at least two different vector representations comprises:
determining a corresponding prediction distribution based on the vector representation, and determining, based on the prediction distribution, a probability value of the information rewriting sample pair for its corresponding sample similarity level; or, taking the vector representations as predicted similarities, ranking the information rewriting sample pairs, and determining, based on the ranking result, a probability value of each information rewriting sample pair for its corresponding sample similarity level;
determining, based on the probability values and the similarity levels, the cross entropy loss values and relative entropy loss values corresponding to the at least two different vector representations.
In a second aspect, an embodiment of the present application provides an information search apparatus, including:
the acquisition module is used for acquiring input query information;
The rewriting module is used for inputting the query information into an information rewriting model, and rewriting the query information through the information rewriting model to obtain rewritten query information corresponding to the query information; the information rewriting model is trained on information rewriting sample pairs; an information rewriting sample pair comprises a pair of query information samples whose viewing operations correspond to the same query result;
the searching module is used for searching in a search engine based on the rewritten query information to obtain at least one search result;
wherein the query result is at least one result viewed in the search result.
In a third aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a memory, a processor, and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the information searching method provided in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the information search method provided in the first aspect described above.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the information search method provided in the first aspect.
The technical scheme provided by the embodiment of the application has the beneficial effects that:
the embodiment of the application provides an information searching method, which specifically comprises the steps that after input query information is acquired, the query information can be input into an information rewriting model, and the query information is rewritten through the information rewriting model to obtain rewritten query information corresponding to the query information; then, searching can be carried out in the search engine based on the rewritten query information to obtain at least one search result fed back by the search engine; the information rewriting model is trained according to an information rewriting sample pair, and the information rewriting sample pair comprises a pair of query information samples corresponding to the same query result in a checking operation; wherein the query result is at least one result viewed in the search result. That is, the rewritten query information obtained based on the query information rewriting is determined after the viewing operation in the learning and searching process of the information rewriting model, and the trained information rewriting model is used for rewriting the query information, so that the expression diversity of the rewritten query information can be improved on the basis of ensuring higher consistency of semantics before and after the query information is rewritten. On the basis, when searching is performed in the search engine based on the rewritten query information, the diversity of the recall search result content of the search engine can be improved to a certain extent.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a flowchart of an information searching method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an operating environment of an information searching method according to an embodiment of the present application;
FIG. 3 is an exemplary diagram of semantic dependencies provided by an embodiment of the present application;
FIG. 4 is an exemplary diagram of expression diversity provided by an embodiment of the present application;
FIG. 5a is an exemplary diagram of a comparative study provided by an embodiment of the present application;
FIG. 5b is an exemplary diagram of another comparative study provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a pre-training model versus learning provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a two-tower language sub-model according to an embodiment of the present application;
FIG. 8a is a schematic diagram of an interactive interface according to an embodiment of the present application;
FIG. 8b is a schematic diagram of another interactive interface provided by an embodiment of the present application;
fig. 9 is a schematic structural diagram of an information searching apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and the technical solutions of the embodiments of the present application are not limited.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, as will be understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" indicates at least one of the items it defines; for example, "A and/or B" may be implemented as "A", as "B", or as "A and B".
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Embodiments of the present application relate to artificial intelligence (AI), which is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, enabling the machines to perceive, reason and make decisions. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technology. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, intelligent transportation and other directions.
The information searching method provided by the embodiment of the application particularly relates to Machine Learning (ML), a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Specifically, the application can adopt contrastive learning, a discriminative representation learning framework (or method) based on the idea of comparison: a sample is compared with instances semantically similar to it (positive examples) and instances semantically dissimilar to it (negative examples), so that the representations of semantically similar instances are drawn closer in the representation space and the representations of semantically dissimilar instances are pushed farther apart. As shown in fig. 5a and 5b, the black spheres represent positive examples and the white spheres represent negative examples.
The information searching method provided by the embodiment of the application can be applied to the scene of information searching through a search engine.
The technical solutions of the embodiments of the present application and technical effects produced by the technical solutions of the present application are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be referred to, or combined with each other, and the description will not be repeated for the same terms, similar features, similar implementation steps, and the like in different embodiments.
Fig. 2 is a schematic diagram of an operation environment of the information searching method according to the embodiment of the present application, where the environment may include a terminal 20 and a server 10.
Wherein the terminal 20 may run a client or a service platform. Terminals (which may also be referred to as devices) may be, but are not limited to, smartphones, tablet computers, notebook computers, desktop computers, intelligent voice interaction devices (e.g., smart speakers), wearable electronic devices (e.g., smart watches), vehicle terminals, smart appliances (e.g., smart televisions), AR/VR devices, and the like. Alternatively, the terminal 20 may perform the information searching method provided by the embodiment of the present application.
The server 10 may be an independent physical server, a server cluster or a distributed system (such as a distributed cloud storage system) formed by a plurality of physical servers, or a cloud server that provides cloud computing services. Alternatively, the server 10 may train the information rewrite model, or may perform a search based on the rewritten query information. In one example, the operation object may initiate a search operation through the terminal 20 and transmit rewritten query information corresponding to the search operation to the server 10 through the network 30 to acquire a search result fed back by the server 10.
In a possible embodiment, the terminal 20 and the server 10 may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
In a possible embodiment, the operating environment may also include a database that may be used to store historical search data stored by the server 10 during the information search process.
The following describes an information searching method provided by the embodiment of the present application.
Specifically, as shown in fig. 1, the information searching method provided by the embodiment of the present application includes the following steps S101 to S103:
step S101: and acquiring the input query information.
As shown in fig. 8a, in the information search interface, an operation object may input the query information "what to do about a cold" in the search box. Optionally, when at least one character input by the operation object is detected in the search box, the current information in the search box can be acquired in response to the input operation. Optionally, semantic recognition can be performed synchronously; if it is determined that the information input in the search box constitutes a complete word, a complete sentence, or a complete semantic expression, the input operation can be responded to immediately to obtain the corresponding input query information.
Step S102: inputting the query information into an information rewriting model, and rewriting the query information through the information rewriting model to obtain rewritten query information corresponding to the query information; the information rewriting model is trained according to the information rewriting sample pair; the information rewriting sample pair comprises a pair of query information samples corresponding to the same query result in a viewing operation.
The rewritten query information is information determined based on the query information after the information rewriting model has learned the viewing relations between different historical query information and historical query results. It will be appreciated that the historical query information relates to historical query results and to the operation behavior of the operation object in historical searches: for example, upon input of query information by the operation object, the search engine may feed back at least one corresponding search result, from which the operation object may determine the desired query result. The viewing relation may represent whether different historical query information corresponds to the same historical query result. The training steps of the information rewriting model will be described in the following embodiments.
Step S103: searching in a search engine based on the rewritten query information to obtain at least one search result.
As shown in fig. 8b, when the operation object initiates a search operation with the "cold treatment method" selected from among a plurality of pieces of rewritten query information, the terminal may send the rewritten query information to the server; the search engine screens out the search results corresponding to the rewritten query information and feeds them back to the terminal for display. As shown in fig. 8b, several search results may be obtained, and in the information search interface the operation object may view the query results it needs from among the displayed search results.
The search results may be content such as documents (doc), expressions, videos and the like. The query result may be an item of content selected by the operation object from the search results. Optionally, in the subsequent embodiments, the query information or the rewritten query information may be denoted as query or Q, and the query result may be denoted as doc or D.
As can be seen by comparing fig. 8a and 8b, the search results recalled by the search engine differ for different expressions of the query information. The information rewriting model provided by the embodiment of the application, having learned from a large number of user search operation behaviors in history, rewrites the query information input by the operation object to obtain rewritten query information whose semantic expression for searching is more accurate and clear. The embodiment of the application thereby ensures the semantic relevance of the query information before and after rewriting while enriching the expression diversity of the query information. Referring to fig. 3 and fig. 4: fig. 3 shows examples in which the query information before and after rewriting lacks semantic relevance, where the original query information may be the query information input by the operation object and the candidate query information may be the rewritten query information; for the original query information "junior middle school Chinese materials", the rewritten candidate query information "primary school Chinese materials" refers to a different schooling stage; for the original query information "Shenzhen housing market", the rewritten candidate query information "Guangzhou housing market" refers to a different region. Fig. 4 shows examples of whether the query information before and after rewriting satisfies expression diversity; for the original query information "what to do about a cold", the rewritten candidate query information "cold treatment method" satisfies expression diversity, while a rewritten candidate that merely restates "what to do about a cold" with a synonym does not.
The following describes a construction process of a training sample in an embodiment of the present application.
Optionally, training samples determined from historical search data may be used to construct the information rewriting sample pairs. This construction suits a contrastive-learning training mode: the information rewriting sample pairs may include positive sample pairs and negative sample pairs, where a positive sample pair comprises different historical query information each corresponding to the same historical query result; that is, historical query information items corresponding to the same historical query result form positive sample pairs. A negative sample pair can be obtained by randomly selecting historical query information from the historical search data. Any piece of historical query information corresponds to a number of historical search results, and the historical query result is at least one query result selected by the operation object from those historical search results.
In a possible embodiment, the information rewriting sample pairs are constructed by the following operations of step A1 to step A3:
step A1: training samples are determined based on historical search data.
The training samples can be constructed from acquired historical search data; the historical search data may include historical query information (such as query information input by the operation object and rewritten query information), historical search results (results recalled by the search engine based on the historical query information), and historical query results (results determined from the historical search results as corresponding to the historical query information).
Optionally, training samples may be selected from the full amount of historical search data, or from the historical search data of a specified period; the range of the historical search data can be adjusted as required, which is not limited in this application. The historical search data may be data stored locally by the terminal covering the search operations of all operation objects using the terminal, or data stored by the server covering the search operations of all operation objects using the search engine.
Optionally, determining the training samples in step A1 based on the historical search data includes steps A11-A12:
step A11: historical search data is obtained.
Step A12: screening, from the historical search data, historical query results whose view counts reach a preset threshold and the historical query information corresponding to those historical query results, to obtain the training samples.
The training samples can be obtained from historical search data. To improve learning efficiency and model performance, the training samples can be constructed by filtering, from the historical search data (such as historical search logs), high-confidence search log records whose view counts are greater than a certain threshold, in the following format:

<query, docid> \t search count, view count

For a particular <query, docid> pair, the view rate can be calculated as shown in the following formula (1):

view rate = view count / search count    (1)

The view count refers to the number of times the historical query result was selected, and the search count refers to the number of times a search was initiated with the historical query information corresponding to that historical query result.
In the embodiment of the application, the corresponding historical query information can be obtained in reverse from the historical query results: historical query results with higher view counts are screened out, and the corresponding historical query information is then collected. After the training samples screened from the historical search data in step A12 are obtained, information rewriting sample pairs <query1, query2> can be constructed based on the screening result for contrastive learning.
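To make steps A11-A12 concrete, the following Python sketch filters high-confidence records from a search log; the tab-separated field layout, the function name and the threshold value are illustrative assumptions, since the embodiment only fixes the general format <query, docid> \t search count, view count.

    from collections import defaultdict

    VIEW_THRESHOLD = 3  # assumed preset threshold for the view count

    def build_training_samples(log_lines):
        # Keep only high-confidence records whose view count reaches the
        # threshold; group the retained historical query information by the
        # historical query result (docid) it was viewed against.
        # Per formula (1), view rate = view_count / search_count.
        samples = defaultdict(list)
        for line in log_lines:
            query, docid, search_count, view_count = line.rstrip("\n").split("\t")
            search_count, view_count = int(search_count), int(view_count)
            if view_count >= VIEW_THRESHOLD:
                samples[docid].append((query, search_count, view_count))
        return samples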
Step A2: different historical query information corresponding to the same historical query result in the training sample is formed into a positive sample pair; a set of positive sample pairs includes two different historical query information.
In this embodiment of the application, if two different pieces of query information (queries) jointly correspond to the same query result (doc), e.g., in different search operations the operation object selects the same query result for the different query information, then the two queries may form a candidate positive sample pair. For example, if there are two records in the search log:
<Q1,D1>\t qv1,click1
<Q2,D1>\t qv2,click2
then <Q1, Q2> is a candidate rewriting positive sample pair: since Q1 and Q2 both have a high view count (number of click records) on the query result D1, the target doc being searched for is the same, and although Q1 and Q2 are written differently, they should have high semantic similarity.
Step A3: randomly forming a negative sample pair from historical query information corresponding to different historical query results in the training sample; a set of negative sample pairs includes two different historical query information.
In contrast to a positive sample pair, the different historical query information in a negative sample pair corresponds to different historical query results; it can be understood that the semantic similarity between the two pieces of historical query information in a negative sample pair is very low or 0 (i.e., they have no semantic similarity).
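A minimal sketch of steps A2-A3, assuming the `samples` mapping produced above; the function name and the negative-pair budget are illustrative. Queries that share a viewed docid form positive pairs, and queries drawn at random from different docids form negative pairs.

    import itertools
    import random

    def build_sample_pairs(samples, num_negative_pairs=1000):
        # Step A2: two different queries viewed against the same historical
        # query result form a positive sample pair.
        positive_pairs = []
        for docid, records in samples.items():
            queries = sorted({q for q, _, _ in records})
            positive_pairs.extend(itertools.combinations(queries, 2))

        # Step A3: randomly pair queries whose historical query results differ.
        flat = [(q, docid) for docid, recs in samples.items() for q, _, _ in recs]
        negative_pairs = []
        attempts = 0
        while len(negative_pairs) < num_negative_pairs and attempts < 100 * num_negative_pairs:
            attempts += 1
            if len(flat) < 2:
                break
            (q1, d1), (q2, d2) = random.sample(flat, 2)
            if d1 != d2:  # must come from different query results
                negative_pairs.append((q1, q2))
        return positive_pairs, negative_pairs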
In a possible embodiment, each positive sample pair may also have a corresponding sample similarity level. The construction of the information rewriting sample pairs then further includes performing the following operations of step A4 to step A5 for each positive sample pair:
step A4: and determining the semantic similarity of the positive sample pair based on the minimum checking times of the historical query results and the maximum searching times of the historical query information in the positive sample pair.
Specifically, the semantic similarity of the positive sample pair <Q1, Q2> may be represented by the co-view rate CTR, calculated as shown in the following formula (2):

CTR = min(click1, click2) / max(qv1, qv2)    (2)

where min(click1, click2) is the minimum of the view counts of Q1 and Q2 on the same query result D1, and max(qv1, qv2) is the maximum of the search counts of Q1 and Q2.
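Formula (2) as a one-line helper; click1/click2 are the view counts of Q1 and Q2 on the shared result D1, and qv1/qv2 their search counts (variable names follow the log records above).

    def co_view_rate(click1, click2, qv1, qv2):
        # Semantic similarity of a positive pair <Q1, Q2> per formula (2):
        # minimum shared view count over maximum search count.
        return min(click1, click2) / max(qv1, qv2)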
Step A5: and determining a sample similarity level corresponding to the positive sample pair based on the semantic similarity through a mapping relation between a preset similarity value range and a preset similarity level.
To facilitate better training of the model, the embodiment of the application maps the co-view rate onto discrete levels; for example, it can be converted into a problem over 5 similarity levels, where similarity level 5 means that the co-view rate of Q1 and Q2 on D1 is high, implying that their semantic similarity is probably the highest, and similarity level 1 means that the co-view rate of Q1 and Q2 on D1 is low, in which case the two may be similar only in some minor aspect while the semantics of the main subject may not be highly similar.
Optionally, the rule for setting the similarity level is as follows in table 1:
TABLE 1
It will be appreciated that there may be more or fewer similarity levels, adjusted according to the needs of model training; the application is not limited in this respect.
The sample similarity level determined in step A5 is the sample similarity level actually corresponding to the positive sample pair when the positive sample pair is used as a training sample for the information rewriting model. Optionally, sample similarity levels may be distinguished by labels.
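A sketch of step A5's range-to-level mapping. The concrete ranges belong to Table 1, whose body is not reproduced in this text, so the thresholds below are placeholders.

    # Placeholder mapping from co-view rate ranges to the 5 similarity
    # levels; the actual ranges are given by Table 1 and may differ.
    LEVEL_BOUNDS = [(0.8, 5), (0.6, 4), (0.4, 3), (0.2, 2), (0.0, 1)]

    def similarity_level(ctr):
        for lower_bound, level in LEVEL_BOUNDS:
            if ctr >= lower_bound:
                return level
        return 1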
The following describes a training method of the model in the embodiment of the present application.
In a possible embodiment, the information rewriting model is a double-tower language model with a random deactivation (dropout) layer. The training method applies contrastive learning on top of pre-training, and the information rewriting model before training can be a pre-trained model.
The pre-training model may be a BERT (Bidirectional Encoder Representation from Transformers) model, a pre-trained language representation model that employs a masked language model (MLM) so as to produce deep bidirectional language representations.
With a random deactivation (dropout) layer, a neural network training unit can be temporarily removed from the network with a certain probability during deep learning training; because of this random dropping, each mini-batch in stochastic gradient descent effectively trains a different network. It will be appreciated that with dropout, each neuron has a certain probability of being removed in each training pass, which makes the training of one neuron less dependent on any other and reduces the co-adaptation between features.
Training the information rewriting model according to the information rewriting sample pair, wherein the training comprises the steps of B1-B2:
step B1: and (3) inputting the same information rewritten sample pair into the pre-training model at least twice to obtain at least two different vector representations.
As shown in fig. 6, the <query1, query2> sample pairs (information rewriting sample pairs, including positive and negative sample pairs) constructed above may be input into a bert double-tower model with dropout to obtain vector representations. It will be appreciated that since the model has a random deactivation layer, inputting the same information rewriting sample pair into the model N times yields N different vector representations; that is, after the dropout processing, the information rewriting sample pair can be regarded as having passed through N slightly different models, where N is greater than or equal to 2.
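Step B1 in code: a minimal sketch, assuming `model` is a double-tower network like the one sketched after the description of fig. 7 below and `batch` an already-tokenized sample pair. Keeping the model in training mode leaves dropout active, so repeated passes over the identical input yield different vector representations.

    def n_dropout_views(model, batch, n=2):
        # model.train() keeps the random deactivation (dropout) layers
        # active, so each forward pass goes through a slightly different
        # sub-network and produces a different representation.
        model.train()
        return [model(*batch) for _ in range(n)]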
Step B2: adjusting the network parameters of the pre-training model based on the at least two different vector representations to obtain the trained information rewriting model.
The training samples can yield a plurality of information rewriting sample pairs, and each information rewriting sample pair, when input into the pre-training model at least twice, outputs at least two different vector representations; therefore, when adjusting the network parameters of the information rewriting model, the network parameters can be adjusted based on the total loss value accumulated, over all information rewriting sample pairs, from the loss values determined by the multiple different vector representations.
The embodiment of the application adopts pre-training overlaid with contrastive learning. Pre-training is reflected in the model parameters: parameters obtained in advance on a relatively general task (one shared by multiple tasks) spare the training from starting from zero. For the current query information rewriting task, some pre-trained parameters may not be suitable, so contrastive learning can be combined on the basis of the current parameters to adjust the network parameters and obtain a model better suited to the current task, thereby achieving better model performance. This approach greatly reduces learning time and improves the training efficiency of the model. Specifically, on the basis of pre-training, the information rewriting model is trained by contrastive learning on the training samples, the network parameters can be fine-tuned based on the learning results, and the trained information rewriting model is finally obtained.
In a possible embodiment, the step B1 of inputting the information rewriting sample pair into the information rewriting model to obtain a vector representation includes the steps B11-B14:
step B11: and performing word segmentation on each historical query information in the information rewriting sample pair to obtain two word segmentation results.
Step B12: and extracting characteristic information of each piece of historical query information based on the word segmentation result, and determining sub-vector representations corresponding to each piece of historical query information based on the characteristic information.
Step B13: a sum and a difference between the two sub-vector representations are calculated.
Step B14: and determining a vector representation of the information-rewritten sample pair based on the difference value, the sum value, and the two sub-vector representations.
In steps B11 to B14, a procedure of rewriting the model for one information rewriting sample to obtain one vector representation will be described.
As shown in fig. 7, within the framework of the information rewriting model, the double-tower bert sub-model may include a word segmenter, a feature extraction layer, a pooling layer and an interaction layer. After the information rewriting sample pair <query1, query2> is input into the word segmenters (a general-purpose word segmenter can be adopted, or a word segmenter matching the language of the query information can be selected), features are extracted through a general bert base model and then input into the pooling layer for pooling, generating the corresponding sub-vector representations q and t, where q is the sub-vector representation of query1 and t is the sub-vector representation of query2; finally, the interaction layer directly splices together the difference (q-t) between the sub-vectors q and t, their sum (q+t), and the sub-vector representations q and t themselves. Optionally, the vector representation output by the interaction layer may serve as the predicted similarity of query1 and query2.
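A minimal PyTorch sketch of the double-tower sub-model of fig. 7, under stated assumptions: the checkpoint name "bert-base-chinese" is illustrative (the text only requires a general bert base model), mean pooling stands in for the pooling layer, and the interaction layer concatenates q, t, q+t and q-t before scoring the 5 similarity levels.

    import torch
    import torch.nn as nn
    from transformers import AutoModel

    class TwoTowerRewriteModel(nn.Module):
        def __init__(self, name="bert-base-chinese", num_levels=5):
            super().__init__()
            # Shared bert base encoder; its internal dropout layers supply
            # the random deactivation used during training.
            self.encoder = AutoModel.from_pretrained(name)
            hidden = self.encoder.config.hidden_size
            self.classifier = nn.Linear(4 * hidden, num_levels)

        def encode(self, enc):
            out = self.encoder(**enc).last_hidden_state       # (B, L, H)
            mask = enc["attention_mask"].unsqueeze(-1).float()
            return (out * mask).sum(1) / mask.sum(1)          # pooling layer

        def forward(self, enc1, enc2):
            q, t = self.encode(enc1), self.encode(enc2)
            # Interaction layer: splice q, t, their sum and their difference.
            pair = torch.cat([q, t, q + t, q - t], dim=-1)
            return self.classifier(pair)                      # logits over levels

In use, enc1 and enc2 would be tokenizer outputs (input_ids, attention_mask) for query1 and query2; sharing one encoder between the two towers is a common design choice and an assumption here.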
In a possible embodiment, in step B2, the network parameters of the information rewriting model are adjusted based on at least two different vector representations, including the following steps B21-B23:
step B21: determining that the at least two different vectors represent corresponding cross entropy loss values, and relative entropy loss values.
Step B22: a predicted loss value is determined based on the cross entropy loss value and the relative entropy loss value.
Step B23: and adjusting network parameters of the information rewriting model based on the predicted loss value.
As shown in fig. 6, a prediction distribution can be obtained from the vector representation output by the bert double-tower model; if the same information rewriting sample pair is input into the model twice, two different prediction distributions can be obtained, denoted P_θ1(y|x) and P_θ2(y|x). The loss function Loss at this time can be expressed as shown in the following formula (3):

Loss = L_ce + α · L_kl    (3)

Formula (3) is composed of two parts. The first is the two cross entropy losses (cross entropy loss), as shown in formula (4) below:

L_ce = -log P_θ1(y|x) - log P_θ2(y|x)    (4)

The second is the KL divergence (Kullback-Leibler divergence) of the two prediction distributions, as shown in formula (5) below:

L_kl = (1/2) · [ KL(P_θ1(y|x) ‖ P_θ2(y|x)) + KL(P_θ2(y|x) ‖ P_θ1(y|x)) ]    (5)

In the above formulas, x is query1 in an information rewriting sample pair and y is query2 in the same information rewriting sample pair; query1 may be the original query information and query2 may be the rewritten candidate query information. The α in the total loss is a hyperparameter for adjusting the relative weights of the two cross entropy losses and the relative entropy loss, and can be flexibly adjusted according to the sample training situation, which is not limited by the embodiment of the application.
Wherein the prediction distribution may indicate a probability that the semantic similarity of each information-rewritten sample pair corresponds to a respective similarity level.
Cross entropy can be used to measure the difference between two probability distributions, and can be used to measure the performance of a language model: its meaning is the difficulty of recognizing text with the model or, from a compression perspective, how many bits on average are needed to encode each word. Relative entropy, also known as the Kullback-Leibler divergence or information divergence, is an asymmetric measure of the difference between two probability distributions.
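Formulas (3)-(5) in code, under the symmetric-KL assumption made above; `logits1`/`logits2` come from two dropout-perturbed passes over the same sample pair, and `labels` is assumed to hold the 0-indexed sample similarity levels.

    import torch.nn.functional as F

    def total_loss(logits1, logits2, labels, alpha=1.0):
        # Formula (4): two cross entropy losses, one per dropout pass.
        ce = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
        # Formula (5): symmetric KL divergence between the two prediction
        # distributions obtained from the same input.
        logp1 = F.log_softmax(logits1, dim=-1)
        logp2 = F.log_softmax(logits2, dim=-1)
        kl = 0.5 * (F.kl_div(logp1, logp2, log_target=True, reduction="batchmean")
                    + F.kl_div(logp2, logp1, log_target=True, reduction="batchmean"))
        # Formula (3): alpha weights the relative entropy term.
        return ce + alpha * kl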
In a possible embodiment, in step B21, determining the cross entropy loss values and relative entropy loss values corresponding to the at least two different vector representations comprises the following steps C1-C2:
Step C1: determining a corresponding prediction distribution based on the vector representation, and determining, based on the prediction distribution, the probability value of the information rewriting sample pair for its corresponding sample similarity level; or, taking the vector representations as predicted similarities, ranking the information rewriting sample pairs, and determining, based on the ranking result, the probability value of each information rewriting sample pair for its corresponding sample similarity level.
Step C2: determining, based on the probability values and the similarity levels, the cross entropy loss values and relative entropy loss values corresponding to the at least two different vector representations.
For one information rewriting sample pair, the sample similarity level corresponding to its actual semantic similarity (which may be identified by a label) may be determined as in steps A4-A5, and the vector representation obtained after the sample query information is input into the Bert double-tower model with dropout may be determined as in steps B11-B14; the corresponding prediction distribution can then be obtained from the vector representation, and the probability value corresponding to the sample similarity level can be read from the prediction distribution, so that the loss value of the configured loss function can be calculated from that probability value.
The following describes a calculation procedure for the cross entropy loss value:
taking 5 similar grades shown in the table 1 as an example, calculating a logic function to perform softmax processing to obtain the classification condition of the information rewritten sample pair, on the basis, the classification task can be optimized by using the cross soil moisture ce loss, and the calculation of the loss value can be shown in the following formula (6):
in formula (6), yL is the set of information-rewritten sample pairs constructed for all trained samples, Y lf Rewriting the actual label (sample similarity level) of the sample pair on the f-th classification label for the l-th information, Z lf Rewriting a sample pair vector for the first information to predict a probability value belonging to the f classification label; ln is the cross-entropy logarithm. In the example shown in Table 1, f is a value of 1 to 5.
Alternatively, the prediction distribution may be left undetermined, and the probability prediction may instead be treated as a pairwise ranking problem, learned so as to optimize the relative ranking relation between different information rewriting sample pairs. After steps B11-B14 are performed, each information rewriting sample pair can be expressed as a predicted similarity (i.e., the similarity between the queries predicted by the model); all information rewriting sample pairs in the training samples can be ranked by predicted similarity, and the probability value corresponding to each similarity level can be determined from the ranking result, so that the loss value can be calculated, as shown in the sketch below.
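A sketch of the pairwise alternative, using a margin ranking loss as one concrete choice (the embodiment does not name a specific pairwise loss): when sample pair A carries a higher sample similarity level than sample pair B, A's predicted similarity should exceed B's by a margin.

    import torch
    import torch.nn.functional as F

    def pairwise_rank_loss(pred_sim_a, pred_sim_b, margin=0.1):
        # pred_sim_a / pred_sim_b: predicted similarities of sample pairs
        # whose ground-truth similarity levels order A above B.
        target = torch.ones_like(pred_sim_a)  # A should rank above B
        return F.margin_ranking_loss(pred_sim_a, pred_sim_b, target, margin=margin)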
The training of the information rewriting model adopts contrastive learning on top of pre-training; that is, the information rewriting model before training is a pre-trained model. The training samples for the information rewriting model include information rewriting sample pairs determined from historical query information; the information rewriting sample pairs constructed by the application may include positive sample pairs and negative sample pairs, where a positive sample pair may include different historical query information each corresponding to the same historical query result. It can be appreciated that any piece of historical query information corresponds to a number of historical search results, and the historical query result is at least one query result that the operation object viewed among those historical search results. In other words, the application trains the information rewriting model with query information corresponding to query results jointly selected by operation objects, so that when the trained information rewriting model rewrites original query information, the rewritten results enrich the diversity of expression while preserving semantic relevance; this also helps improve the recall capability of the search engine on the rewritten results, raising the probability that a query result is viewed among the search results and thereby improving the intelligent search experience.
Through the model improvements of the embodiment of the application, the search recall rate is markedly improved compared with the traditional rewriting method based on manually mined synonym replacement. As the examples in Table 2 below show, the method achieves an effective improvement: part of the query information that originally could not be rewritten can now be rewritten well and with high semantic similarity. The implementation of the embodiment of the application maintains the search view rate of the rewritten query information, meaning the rewritten query information stays basically consistent with the semantics of the original query information while raising the probability that the operation object determines a query result among the search results returned by the search engine.
TABLE 2

| Original query information | Prior art | Scheme of the application | Similarity rating |
| --- | --- | --- | --- |
| How to draw a rabbit | How to draw a rabbit | Teach drawing a rabbit | 5 |
| Henan provident fund policy | Not rewritten | Henan Province provident fund regulations | 5 |
| Postpartum confinement taboos | Confinement taboos | Major postpartum confinement taboos | 4 |
| What gift to give for Qixi Festival | What gift to give for Qixi Festival | Qixi Festival gift | 4 |
In the examples of Table 2, the similarity ratings follow the level classification shown in Table 1. Taking the original query information "how to draw a rabbit" as an example: when the query information is rewritten by the prior art, the candidate query information obtained simply replaces the word "how" with a synonym, so the search results recalled by the search engine barely change; when the query information is rewritten by the model trained in the embodiment of the application, the candidate query information (rewritten query information) "teach drawing a rabbit" can be obtained, in which "how to draw" is rewritten as "teach drawing". On the one hand, on the basis of ensuring semantic similarity, this expands the diversity of query expression, and in terms of the results recalled by the search engine, the results obtained with "teach drawing a rabbit" can be more relevant to the operation object's intention in searching for information; on the other hand, comparing the search results recalled in the prior art with those recalled in the embodiment of the application, the content of the search results obtained by the embodiment of the application is richer and more expressive.
In order to better illustrate the information searching method provided by the embodiment of the present application, a possible application example is given below.
If the operation object searches for information through a client loaded on the terminal, when the operation object inputs the query information "what to do about a cold" for the content to be searched into the search box, the terminal can respond to the input operation, obtain a plurality of pieces of rewritten query information based on the information rewriting model trained by the embodiment of the application, and display them on the information search interface (e.g., through a drop-down box). Optionally, before the search operation is initiated, if the operation object modifies the input query information, the rewritten query information obtained from it changes accordingly.
Optionally, the process of determining the rewritten query information based on the query information may also be implemented by the server: for example, when the terminal obtains the query information, it sends the query information to the server; that is, the trained model runs in the server, and the server determines a plurality of corresponding pieces of rewritten query information through the model and feeds them back to the terminal for display.
When the operation object initiates a search operation from one of the plurality of pieces of rewritten query information, the terminal may send a search request to the server; the terminal may send the query information and the rewritten query information to the server together for the information search, or may send only the rewritten query information. The server, acting as the search engine, feeds back corresponding search results based on the search request sent by the terminal; the search results may be displayed as shown in fig. 8b, and the operation object may select the content (query result) to be viewed from the search results. It can be understood that, when determining query results, the operation object may return to the page shown in fig. 8b several times to select multiple query results; that is, one piece of query information may correspond to multiple query results.
Optionally, if a plurality of pieces of rewritten query information are determined based on the query information, the search operation may be triggered when the operation object selects one of them; if only one piece of rewritten query information is determined based on the query information, the operation object may initiate the search operation through the "search" function control shown in fig. 8b.
Optionally, when determining the query result from the search results, the operation object may do so by clicking; for the content shown in fig. 8b, if the operation object clicks to view the search result corresponding to the "cold treatment method", it is determined that the operation object has selected that search result as the query result. Accordingly, the view count described in the above embodiments may be a click count, the view rate may be a click-through rate, and the co-view rate may be a co-click rate.
It should be noted that, in the alternative embodiments of the present application, when the above embodiments are applied to a specific product or technology, the related data (such as the query information, the search results, and the query results) can be collected, used, and processed only with the license or consent of the user, and in compliance with the relevant laws, regulations, and standards of the relevant countries and regions. That is, if data related to a subject is involved in the embodiments of the present application, the data needs to be obtained with the subject's authorization and consent, and in accordance with the relevant laws, regulations, and standards of the relevant countries and regions.
An embodiment of the present application provides an information search apparatus. As shown in fig. 9, the information search apparatus 100 may include: an acquisition module 101, a rewriting module 102, and a search module 103.
The acquisition module 101 is configured to acquire input query information. The rewriting module 102 is configured to input the query information into an information rewriting model, and to rewrite the query information through the information rewriting model to obtain rewritten query information corresponding to the query information; the information rewriting model is trained according to information rewriting sample pairs, where an information rewriting sample pair comprises a pair of query information samples each corresponding to a viewing operation on the same query result. The search module 103 is configured to search in a search engine based on the rewritten query information to obtain at least one search result; the query result is at least one result viewed among the search results.
In a possible embodiment, the information rewriting sample pairs include positive sample pairs and negative sample pairs;
the apparatus 100 further comprises a training module, specifically configured to perform the following operations to construct the information rewriting sample pairs:
determining a training sample based on the historical search data;
forming different pieces of historical query information corresponding to the same historical query result in the training samples into positive sample pairs, where a positive sample pair includes two different pieces of historical query information;
randomly forming negative sample pairs from historical query information corresponding to different historical query results in the training samples, where a negative sample pair includes two different pieces of historical query information.
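A minimal sketch of this pair construction, assuming the training samples have already been grouped by viewed historical query result (a grouping like the one produced by the screening sketch that follows the next embodiment); all names here are illustrative:

    import random
    from itertools import combinations

    def build_sample_pairs(result_to_queries):
        # result_to_queries maps each historical query result to the set of
        # different historical queries whose searches led to viewing it.
        positive_pairs = []
        for queries in result_to_queries.values():
            # Two different queries that led to the same viewed result
            # form a positive sample pair.
            positive_pairs.extend(combinations(sorted(queries), 2))

        # Queries belonging to different viewed results are paired at random
        # to form negative sample pairs (roughly one negative per positive).
        tagged = [(r, q) for r, qs in result_to_queries.items() for q in qs]
        negative_pairs = []
        for _ in range(len(positive_pairs)):
            (r1, q1), (r2, q2) = random.sample(tagged, 2)
            if r1 != r2 and q1 != q2:
                negative_pairs.append((q1, q2))
        return positive_pairs, negative_pairs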
In a possible embodiment, the training module, when used to perform determining training samples based on historical search data, is specifically configured to:
acquiring historical search data;
and screening, from the historical search data, historical query results whose number of views reaches a preset threshold, together with the historical query information corresponding to those historical query results, to obtain the training samples.
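A sketch of this screening step, under the same illustrative data layout assumed above (the threshold value is likewise illustrative):

    def build_training_samples(history, view_threshold=10):
        # history: iterable of (query, viewed_result, view_count) records
        # drawn from the historical search data.
        result_to_queries = {}
        for query, result, view_count in history:
            # Keep only query results whose number of views reaches the
            # preset threshold, together with their query information.
            if view_count >= view_threshold:
                result_to_queries.setdefault(result, set()).add(query)
        return result_to_queries

The returned mapping is exactly the grouping consumed by build_sample_pairs above.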
In a possible embodiment, each positive sample pair has a corresponding sample similarity level; the training module, when constructing the information rewriting sample pairs, is further configured to perform the following operations for each positive sample pair:
determining the semantic similarity of the positive sample pair based on the minimum number of views of the historical query result in the positive sample pair and the maximum number of searches of the historical query information;
and determining the sample similarity level corresponding to the positive sample pair from the semantic similarity, through a preset mapping relation between similarity value ranges and similarity levels.
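The exact formula combining the two counts is not spelled out here; the sketch below assumes, as one plausible reading, that the semantic similarity is the ratio of the minimum view count to the maximum search count, and the value ranges in the mapping are illustrative stand-ins for those of Table 1:

    def sample_similarity_level(min_view_count, max_search_count,
                                bands=((0.6, 3), (0.3, 2), (0.0, 1))):
        # Semantic similarity of the positive pair (assumed ratio form; the
        # embodiment only names the two inputs, not how they are combined).
        similarity = min_view_count / max(max_search_count, 1)
        # Preset mapping from similarity value ranges to sample similarity levels.
        for lower_bound, level in bands:
            if similarity >= lower_bound:
                return level
        return 0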
In a possible embodiment, the information rewriting model is trained based on a pre-training model, where the pre-training model is a dual-tower language model with a dropout (random inactivation) layer; the training module is further configured to train the information rewriting model according to the information rewriting sample pairs, and is specifically configured to:
inputting the same information rewriting sample pair into the information rewriting model at least twice to obtain at least two different vector representations;
and adjusting network parameters of the information rewriting model based on the at least two different vector representations.
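Because the dropout layer inactivates a different random subset of units on each forward pass, feeding the same sample pair through the model twice yields two different vector representations. A minimal PyTorch-style sketch (the model object is assumed, not the embodiment's code):

    import torch

    def double_forward(model, pair_batch):
        # The model must be in training mode so that its dropout
        # (random inactivation) layer is active.
        model.train()
        vec_1 = model(pair_batch)  # first vector representation of the pair
        vec_2 = model(pair_batch)  # second one, differing only through dropout
        return vec_1, vec_2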
In a possible embodiment, the training module, when inputting an information rewriting sample pair into the information rewriting model to obtain a vector representation, is specifically configured to:
performing word segmentation on each piece of historical query information in the information rewriting sample pair to obtain two word segmentation results;
extracting feature information of each piece of historical query information based on its word segmentation result, and determining the sub-vector representation corresponding to each piece of historical query information based on the feature information;
calculating the sum and the difference between the two sub-vector representations;
and determining the vector representation of the information rewriting sample pair based on the difference, the sum, and the two sub-vector representations.
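A sketch of this combination step; concatenation is assumed as the way the four components are merged, since the embodiment names the components but not the merging operation:

    import torch

    def pair_representation(sub_vec_u, sub_vec_v):
        # Each sub-vector representation encodes one historical query of the
        # pair, produced by its tower from the segmented, featurized text.
        vec_sum = sub_vec_u + sub_vec_v    # the sum value
        vec_diff = sub_vec_u - sub_vec_v   # the difference value
        # The pair's vector representation combines the two sub-vectors
        # with their sum and difference (concatenation assumed).
        return torch.cat([sub_vec_u, sub_vec_v, vec_sum, vec_diff], dim=-1)

This is the interaction scheme familiar from sentence-pair encoders, where the difference term exposes the contrast between the two queries to the layers above.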
In a possible embodiment, the training module, when adjusting the network parameters of the information rewriting model based on the at least two different vector representations, is specifically configured to:
determining the cross entropy loss values and relative entropy loss values corresponding to the at least two different vector representations;
determining a predicted loss value based on the cross entropy loss values and the relative entropy loss values;
and adjusting network parameters of the information rewriting model based on the predicted loss value.
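Relative entropy is the KL divergence between the prediction distributions obtained from the two forward passes. A sketch of the combined objective; the weighting coefficient is an assumed hyperparameter, and the logits are assumed to come from a classification head over the sample similarity levels:

    import torch.nn.functional as F

    def prediction_loss(logits_1, logits_2, level_labels, kl_weight=1.0):
        # Cross entropy against the sample similarity levels, averaged over
        # the two vector representations of the same sample pair.
        ce = 0.5 * (F.cross_entropy(logits_1, level_labels)
                    + F.cross_entropy(logits_2, level_labels))
        # Symmetric relative entropy between the two prediction
        # distributions, encouraging them to agree despite dropout.
        log_p = F.log_softmax(logits_1, dim=-1)
        log_q = F.log_softmax(logits_2, dim=-1)
        kl = 0.5 * (F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
                    + F.kl_div(log_q, log_p, log_target=True, reduction="batchmean"))
        # The predicted loss value combines the two terms.
        return ce + kl_weight * kl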
In a possible embodiment, the training module, when determining the cross entropy loss values and relative entropy loss values corresponding to the at least two different vector representations, is specifically configured to:
determining a corresponding prediction distribution based on each vector representation, and determining, based on the prediction distribution, the probability value of the information rewriting sample pair for its corresponding sample similarity level; or, taking each vector representation as a predicted similarity, ordering the information rewriting sample pairs, and determining the probability value of each pair for its corresponding sample similarity level based on the ordering result;
and determining, based on the probability values and the similarity levels, the cross entropy loss values and relative entropy loss values corresponding to the at least two different vector representations.
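A sketch of the first alternative, in which a classification head turns the pair's vector representation into a prediction distribution over similarity levels (the head is an assumed component; the second, ordering-based alternative is only outlined in the text and is not fixed enough to sketch faithfully):

    import torch.nn.functional as F

    def level_probability(pair_vector, classifier, level_label):
        # classifier: a linear head mapping the pair's vector representation
        # to one logit per sample similarity level.
        distribution = F.softmax(classifier(pair_vector), dim=-1)
        # Probability value the pair is assigned for its own similarity level.
        return distribution[..., level_label]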
The apparatus of the embodiments of the present application may perform the method provided by the embodiments of the present application, and its implementation principle is similar. The actions performed by each module of the apparatus correspond to the steps of the method of the embodiments of the present application; for detailed functional descriptions of each module, reference may be made to the descriptions of the corresponding methods shown above, which are not repeated here.
The modules involved in the embodiments of the present application may be implemented in software. In some cases, the name of a module does not limit the module itself; for example, the first display module may also be described as "a module for displaying query information".
The query information, the search results, the query results, and the like according to the embodiments of the present application may be stored using blockchain technology. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. It is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, where each block contains a batch of processed data used to verify the validity of its information (tamper resistance) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The embodiment of the present application provides an electronic device, comprising a memory, a processor, and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the information searching method. Compared with the related art, the following can be achieved: after the input query information is obtained, the query information can be input into an information rewriting model, and the query information is rewritten through the information rewriting model to obtain rewritten query information corresponding to the query information; a search can then be performed in the search engine based on the rewritten query information to obtain at least one search result fed back by the search engine. The information rewriting model is trained according to information rewriting sample pairs, where an information rewriting sample pair comprises a pair of query information samples each corresponding to a viewing operation on the same query result, and the query result is at least one result viewed among the search results. That is, because the information rewriting model learns from rewrites validated by viewing operations in actual search processes, rewriting query information with the trained model improves the expression diversity of the rewritten query information while keeping the semantics before and after rewriting highly consistent. On this basis, when a search is performed in the search engine based on the rewritten query information, the diversity of the content recalled by the search engine can be improved to a certain extent, which increases the probability that the operation object selects a query result from the search results and improves the intelligent search experience.
In an alternative embodiment, an electronic device is provided. As shown in fig. 10, the electronic device 4000 includes a processor 4001 and a memory 4003, where the processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as sending and/or receiving data. It should be noted that, in practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 4002 may include a path to transfer information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like, and can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 10, but this does not mean there is only one bus or one type of bus.
The memory 4003 may be, but is not limited to, a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disk storage, optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer.
The memory 4003 is used for storing a computer program for executing the embodiments of the present application, whose execution is controlled by the processor 4001. The processor 4001 is configured to execute the computer program stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.
The electronic device includes, but is not limited to, a server and a terminal.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the foregoing method embodiments and corresponding content.
The embodiment of the application also provides a computer program product, which comprises a computer program, wherein the computer program can realize the steps and corresponding contents of the embodiment of the method when being executed by a processor.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that, although the flowcharts of the embodiments of the present application indicate the operation steps with arrows, the order in which these steps are performed is not limited to the order indicated by the arrows. Unless explicitly stated herein, in some implementation scenarios of the embodiments of the present application, the steps in the flowcharts may be performed in other orders as required. Furthermore, depending on the actual implementation scenario, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages. Some or all of these sub-steps or stages may be performed at the same moment, or each may be performed at a different moment; where the execution moments differ, the execution order of these sub-steps or stages can be flexibly configured as required, which is not limited by the embodiments of the present application.
The foregoing is merely an optional implementation of some implementation scenarios of the present application. It should be noted that, for those skilled in the art, other similar implementations adopted based on the technical ideas of the present application, without departing from the technical ideas of the scheme of the present application, also fall within the protection scope of the embodiments of the present application.

Claims (12)

1. An information search method, comprising:
acquiring input query information;
inputting the query information into an information rewriting model, and rewriting the query information through the information rewriting model to obtain rewritten query information corresponding to the query information; wherein the information rewriting model is trained according to information rewriting sample pairs, and an information rewriting sample pair comprises a pair of query information samples each corresponding to a viewing operation on the same query result;
searching in a search engine based on the rewritten query information to obtain at least one search result;
wherein the query result is at least one result viewed among the search results.
2. The method of claim 1, wherein the information rewriting sample pairs include positive sample pairs and negative sample pairs;
and the information rewriting sample pairs are constructed by:
determining training samples based on historical search data;
forming different pieces of historical query information corresponding to the same historical query result in the training samples into positive sample pairs, wherein a positive sample pair includes two different pieces of historical query information;
and randomly forming negative sample pairs from historical query information corresponding to different historical query results in the training samples, wherein a negative sample pair includes two different pieces of historical query information.
3. The method of claim 2, wherein the determining training samples based on historical search data comprises:
acquiring historical search data;
and screening, from the historical search data, historical query results whose number of views reaches a preset threshold, together with the historical query information corresponding to those historical query results, to obtain the training samples.
4. The method of claim 2, wherein each positive sample pair has a corresponding sample similarity level;
and the construction of the information rewriting sample pairs further comprises performing, for each positive sample pair, the following operations:
determining the semantic similarity of the positive sample pair based on the minimum number of views of the historical query result in the positive sample pair and the maximum number of searches of the historical query information;
and determining the sample similarity level corresponding to the positive sample pair from the semantic similarity, through a preset mapping relation between similarity value ranges and similarity levels.
5. The method of claim 1, wherein the information rewriting model is trained based on a pre-training model, the pre-training model being a dual-tower language model with a dropout (random inactivation) layer;
and training the information rewriting model according to the information rewriting sample pairs comprises:
inputting the same information rewriting sample pair into the pre-training model at least twice to obtain at least two different vector representations;
and adjusting network parameters of the pre-training model based on the at least two different vector representations to obtain the trained information rewriting model.
6. The method of claim 5, wherein inputting an information rewriting sample pair into the information rewriting model to obtain a vector representation comprises:
performing word segmentation on each piece of historical query information in the information rewriting sample pair to obtain two word segmentation results;
extracting feature information of each piece of historical query information based on its word segmentation result, and determining the sub-vector representation corresponding to each piece of historical query information based on the feature information;
calculating the sum and the difference between the two sub-vector representations;
and determining the vector representation of the information rewriting sample pair based on the difference, the sum, and the two sub-vector representations.
7. The method of claim 5, wherein adjusting network parameters of the information rewriting model based on at least two different vector representations comprises:
determining the cross entropy loss values and relative entropy loss values corresponding to the at least two different vector representations;
determining a predicted loss value based on the cross entropy loss values and the relative entropy loss values;
and adjusting network parameters of the information rewriting model based on the predicted loss value.
8. The method of claim 7, wherein determining the cross entropy loss values and relative entropy loss values corresponding to the at least two different vector representations comprises:
determining a corresponding prediction distribution based on each vector representation, and determining, based on the prediction distribution, the probability value of the information rewriting sample pair for its corresponding sample similarity level; or, taking each vector representation as a predicted similarity, ordering the information rewriting sample pairs, and determining the probability value of each pair for its corresponding sample similarity level based on the ordering result;
and determining, based on the probability values and the similarity levels, the cross entropy loss values and relative entropy loss values corresponding to the at least two different vector representations.
9. An information search apparatus, comprising:
the acquisition module is used for acquiring input query information;
the rewriting module is used for inputting the query information into an information rewriting model, and rewriting the query information through the information rewriting model to obtain rewritten query information corresponding to the query information; wherein the information rewriting model is trained according to information rewriting sample pairs, and an information rewriting sample pair comprises a pair of query information samples each corresponding to a viewing operation on the same query result;
The searching module is used for searching in a search engine based on the rewritten query information to obtain at least one search result;
wherein the query result is at least one result viewed among the search results.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1-8.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-8.
12. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any of claims 1-8.
CN202211530257.9A 2022-11-30 2022-11-30 Information searching method, device, electronic equipment, storage medium and program product Pending CN117009621A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211530257.9A CN117009621A (en) 2022-11-30 2022-11-30 Information searching method, device, electronic equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211530257.9A CN117009621A (en) 2022-11-30 2022-11-30 Information searching method, device, electronic equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN117009621A true CN117009621A (en) 2023-11-07

Family

ID=88562541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211530257.9A Pending CN117009621A (en) 2022-11-30 2022-11-30 Information searching method, device, electronic equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN117009621A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117271851A (en) * 2023-11-22 2023-12-22 北京小米移动软件有限公司 Vertical type searching method and device, searching system and storage medium



Legal Events

Date Code Title Description
PB01 Publication