CN116069903A - Class search method, system, electronic equipment and storage medium

Class search method, system, electronic equipment and storage medium

Info

Publication number
CN116069903A
Authority
CN
China
Prior art keywords
case
encoder
search
vector representation
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310187729.3A
Other languages
Chinese (zh)
Inventor
邹游
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Terminus Technology Group Co Ltd
Original Assignee
Terminus Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Terminus Technology Group Co Ltd filed Critical Terminus Technology Group Co Ltd
Priority to CN202310187729.3A
Publication of CN116069903A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the disclosure provide a case retrieval method, a system, an electronic device, and a storage medium, belonging to the field of case retrieval. The method comprises the following steps: receiving a case retrieval request from a user; inputting the retrieval case sentence into a pre-trained case retrieval encoder to obtain a corresponding sentence vector representation; and obtaining cases similar to the retrieval case sentence according to the sentence vector representation and a set of case vector representations. The case retrieval method, system, electronic device, and storage medium solve the problem of training on case-library data that carry no labels, avoiding the labor cost and time of manual labeling, and improve the quality of the retrieval sentence vector representation, thereby improving retrieval accuracy.

Description

Class search method, system, electronic equipment and storage medium
Technical Field
Embodiments of the present disclosure belong to the field of data retrieval, and in particular relate to a case retrieval method, a case retrieval system, an electronic device, and a storage medium.
Background
Case retrieval often has to operate without labeled data, and in that situation a supervised model cannot be trained. Current practice either retrieves similar cases in a supervised manner with manually added labels, which increases labor cost, or simply retrieves cases by keyword matching, which yields low accuracy.
When an unsupervised approach is used instead, the sentence vector representations are often of low quality, so the accuracy of case retrieval is hard to guarantee. Current practice simply composes word vectors into a sentence vector, resulting in very poor sentence representation quality.
In addition, when the case library holds a large amount of data, retrieving similar cases is also seriously time-consuming.
Disclosure of Invention
Embodiments of the present disclosure aim to solve at least one of the technical problems in the prior art, and provide a case retrieval method, a system, an electronic device, and a storage medium.
One aspect of the present disclosure provides a case retrieval method, including:
receiving a case retrieval request from a user; wherein the retrieval request includes a retrieval case sentence;
inputting the retrieval case sentence into a pre-trained case retrieval encoder to obtain a corresponding sentence vector representation; wherein the case retrieval encoder is pre-trained in an unsupervised manner based on contrastive learning;
obtaining cases similar to the retrieval case sentence according to the sentence vector representation and a set of case vector representations; wherein the set of case vector representations is obtained by processing the case set with the case retrieval encoder.
Optionally, the case retrieval encoder is trained by the following steps:
setting up the case retrieval encoder and a momentum encoder;
updating the network weights of the encoder and the momentum encoder according to a loss function $L_i$, to obtain the trained case retrieval encoder.
Optionally, the loss function $L_i$ satisfies the following relation:

$$L_i = -\log \frac{e^{\mathrm{sim}(h_i,\, h_i^{+})}}{e^{\mathrm{sim}(h_i,\, h_i^{+})} + \sum_{j=1}^{N} e^{\mathrm{sim}(h_i,\, h_j^{-})} + \sum_{m=1}^{M} e^{\mathrm{sim}(h_i,\, h_m^{-})}} \qquad (1)$$

wherein $M$ represents the queue size of the momentum encoder, $h_m^{-}$ represents the encoded representation of a negative sample in the momentum encoder's queue, $N$ represents the size of each mini-batch, $h_j^{-}$ represents the encoded representation of a negative sample in the mini-batch, $h_i$ and $h_i^{+}$ represent each sample and its augmented positive sample respectively, and $\mathrm{sim}(\cdot,\cdot)$ denotes the similarity between two vector representations.
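By way of illustration only, a loss of this form can be computed as in the following minimal sketch. It assumes cosine similarity for sim(·,·) and a temperature hyperparameter tau, neither of which is fixed by the text, so it is an interpretation rather than the patented implementation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h, h_pos, queue, tau=0.05):
    """InfoNCE-style loss over in-batch and queue negatives.

    h     : (N, d) encoded mini-batch samples h_i
    h_pos : (N, d) their augmented positive views h_i^+
    queue : (M, d) negative encodings held in the momentum encoder's queue
    tau   : temperature (assumed hyperparameter, not given in the text)
    """
    h, h_pos, queue = (F.normalize(t, dim=-1) for t in (h, h_pos, queue))
    batch_logits = h @ h_pos.T / tau   # (N, N): diagonal entries are positives
    queue_logits = h @ queue.T / tau   # (N, M): queue negatives
    logits = torch.cat([batch_logits, queue_logits], dim=1)
    labels = torch.arange(h.size(0), device=h.device)  # positive index = i
    return F.cross_entropy(logits, labels)             # mean of L_i over batch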
Optionally, the updating of the network weights of the encoder and the momentum encoder includes:

updating the network weight of the encoder by back propagation;

updating the network weight of the momentum encoder by the following formula (2):

$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q \qquad (2)$$

wherein $\theta_k$ is the network weight of the momentum encoder, $\theta_q$ is the network weight of the encoder, and $m \in [0, 1)$ is the momentum coefficient.

During the update of the momentum encoder, each latest mini-batch of data enters a queue and the oldest data leaves the queue; when each mini-batch is trained, the encodings held in the queue are used as negative samples for contrastive learning.
Optionally, the obtaining of cases similar to the retrieval case sentence according to the sentence vector representation and the set of case vector representations includes:
calculating the similarity between the sentence vector representation and each case vector representation in the set respectively, and selecting the several cases with the highest similarity as the similar cases.
Optionally, the calculating of the similarity between the sentence vector representation and each case vector representation, and the selecting of the cases with the highest similarity as the similar cases, include:
distributing the case vector representations across a corresponding plurality of nodes;
for each node, calculating the cosine similarity between the sentence vector representation and the case vector representations held by that node, obtaining the case selection results of the plurality of nodes;
merging the case selection results of all nodes, sorting them in descending order of similarity, and selecting the top K cases as the similar cases.
Another aspect of the present disclosure provides a case retrieval system, the system comprising:
a receiving module, configured to receive a case retrieval request from a user, wherein the retrieval request includes a retrieval case sentence;
an encoding module, configured to input the retrieval case sentence into a pre-trained case retrieval encoder to obtain a corresponding sentence vector representation, wherein the case retrieval encoder is pre-trained in an unsupervised manner based on contrastive learning;
a calculation module, configured to obtain cases similar to the retrieval case sentence according to the sentence vector representation and the set of case vector representations, wherein the set of case vector representations is obtained by processing the case set with the case retrieval encoder.
Optionally, the system further comprises a training module, configured to:
set up the case retrieval encoder and a momentum encoder;
update the network weights of the encoder and the momentum encoder according to a loss function $L_i$, to obtain the trained case retrieval encoder.
Another aspect of the present disclosure provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor, storing one or more programs that, when executed by the at least one processor, cause the at least one processor to implement the case retrieval method described above.
A final aspect of the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the case retrieval method described above.
The case retrieval method, system, electronic device, and storage medium of the present disclosure solve the problem of training on case-library data that carry no labels, avoiding the labor cost and time of manual labeling, and improve the quality of the retrieval sentence vector representation, thereby improving retrieval accuracy.
Drawings
FIG. 1 is a flow chart of a case retrieval method according to an embodiment of the disclosure;
FIG. 2 is a schematic diagram of a case retrieval system according to another embodiment of the disclosure;
FIG. 3 is a schematic structural diagram of an electronic device according to another embodiment of the disclosure.
Detailed Description
In order that those skilled in the art will better understand the technical solutions of the present disclosure, the present disclosure will be described in further detail with reference to the accompanying drawings and detailed description.
As shown in FIG. 1, an embodiment of the present disclosure provides a case retrieval method including the following steps.
step S11, training a case search encoder.
Taking a large number of cases as a training set, and performing unsupervised training based on contrast learning. In the embodiment of the disclosure, each sample x of the training set can only be set to 256 for a graphics card with 12G video memory because of the limitation of video memory. To increase the negative sample for comparison, the present embodiment adopts a contrast learning method of Momentum Contrast.
Specifically, a Chinese-version DeBERTa model is used as the case retrieval encoder, and a second copy is additionally set up as the momentum encoder, whose initial weights are copied from the encoder. The momentum encoder does not back-propagate gradients; its network weights are updated by the following formula (2):

$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q \qquad (2)$$

wherein $\theta_k$ is the network weight of the momentum encoder, $\theta_q$ is the network weight of the encoder, and $m \in [0, 1)$ is the momentum coefficient. Meanwhile, $\theta_q$ is updated by back propagation.

The momentum encoder maintains a queue of size $M$, equal to $Z$ times the batch size. During training, each latest mini-batch of data enters the queue and the oldest data leaves it; when each mini-batch is trained, the encodings held in the queue serve as negative samples for contrastive learning.
The loss function here is as follows (1):

$$L_i = -\log \frac{e^{\mathrm{sim}(h_i,\, h_i^{+})}}{e^{\mathrm{sim}(h_i,\, h_i^{+})} + \sum_{j=1}^{N} e^{\mathrm{sim}(h_i,\, h_j^{-})} + \sum_{m=1}^{M} e^{\mathrm{sim}(h_i,\, h_m^{-})}} \qquad (1)$$

wherein $M$ represents the queue size of the momentum encoder, $h_m^{-}$ represents the encoded representation of a negative sample in the momentum encoder's queue, $N$ represents the size of each mini-batch, $h_j^{-}$ represents the encoded representation of a negative sample in the mini-batch, $h_i$ and $h_i^{+}$ represent each sample and its augmented positive sample respectively, and $\mathrm{sim}(\cdot,\cdot)$ denotes the similarity between two vector representations.
Self-supervised contrastive training is performed with this loss function; the encoder weights obtained after the final training are the final weights of the pre-trained language model.
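Illustratively, the pieces above can be combined into the following condensed training sketch, which reuses the contrastive_loss, momentum_update, and dequeue_and_enqueue helpers from the earlier sketches. The checkpoint name, the use of the [CLS] vector as the sentence representation, dropout as the augmentation that produces the positive view, and all hyperparameters other than the batch size of 256 are assumptions, since the text specifies only a Chinese-version DeBERTa and Momentum Contrast:

```python
import copy
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "IDEA-CCNL/Erlangshen-DeBERTa-v2-97M-Chinese"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)
momentum_encoder = copy.deepcopy(encoder)   # initial weights copied from the encoder
for p in momentum_encoder.parameters():
    p.requires_grad = False                 # the momentum branch gets no gradients

BATCH, Z = 256, 16                          # queue size M = Z * BATCH (Z assumed)
DIM = encoder.config.hidden_size
queue = F.normalize(torch.randn(Z * BATCH, DIM), dim=-1)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=3e-5)
encoder.train()                             # keep dropout active so two passes differ

def embed(model, sentences):
    toks = tokenizer(sentences, padding=True, truncation=True,
                     max_length=256, return_tensors="pt")
    return model(**toks).last_hidden_state[:, 0]       # [CLS] sentence vector

case_loader = [["示例案情描述"] * BATCH]     # placeholder; real training iterates the case library

for sentences in case_loader:
    h = embed(encoder, sentences)
    h_pos = embed(encoder, sentences)       # second dropout pass = positive view (assumption)
    with torch.no_grad():                   # keys for the queue come from the momentum branch
        k = F.normalize(embed(momentum_encoder, sentences), dim=-1)
    loss = contrastive_loss(h, h_pos, queue)
    optimizer.zero_grad()
    loss.backward()                         # back propagation updates the encoder
    optimizer.step()
    momentum_update(encoder, momentum_encoder)          # formula (2)
    queue = dequeue_and_enqueue(queue, k)   # newest batch in, oldest out
```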
Step S12, receiving a case retrieval request from a user.
Specifically, when the user needs to retrieve similar cases, the keyword sentence to be searched is entered on a computer or other device to form a retrieval case sentence, which completes the receipt of the user's retrieval request.
Step S13, inputting the retrieval case sentence into the pre-trained case retrieval encoder to obtain the corresponding sentence vector representation.
Specifically, in an initialization stage, all cases in the case library are first encoded by the case retrieval encoder obtained from the self-supervised contrastive training, yielding the vector representations $v_1, v_2, \ldots$ of all cases in the library. Each time the user searches, the retrieval sentence is passed through the case retrieval encoder to obtain its sentence vector representation $q$.
Step S14, obtaining cases similar to the retrieval case sentence according to the sentence vector representation and the set of case vector representations.
Specifically, the similarity between the sentence vector representation $q$ of the user's retrieval sentence and each case vector representation $v_i$ in the case library is calculated using the cosine function, and the top $K$ cases ($K \in \mathbb{N}$) with the highest similarity are the similar cases to be returned.
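Illustratively, steps S13 and S14 can be sketched as follows, reusing the embed helper and trained encoder from the training sketch above; the vectors are pre-normalised so that a dot product equals cosine similarity, and the function names are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_case_index(encoder, case_texts):
    """Initialization stage: encode every case in the library once."""
    encoder.eval()                              # deterministic inference encoding
    vecs = torch.cat([embed(encoder, case_texts[i:i + 64])
                      for i in range(0, len(case_texts), 64)])
    return F.normalize(vecs, dim=-1)            # (S, d), unit-length rows

@torch.no_grad()
def search_similar_cases(encoder, case_index, query, k=10):
    """Per-query stage: encode the retrieval sentence, rank by cosine."""
    q = F.normalize(embed(encoder, [query]), dim=-1).squeeze(0)
    sims = case_index @ q                       # cosine similarity per case
    top = torch.topk(sims, min(k, sims.numel()))
    return list(zip(top.indices.tolist(), top.values.tolist()))
```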
Embodiments of the present disclosure thus provide an unsupervised, contrastive-learning-based case retrieval method that solves the problem of training when the case-library data carry no labels, avoids the labor cost and time of manual labeling, and improves the quality of the retrieval sentence vector representation, thereby improving retrieval accuracy.
Illustratively, to speed up the similarity computation in step S14, embodiments of the present disclosure provide a distributed computation method, including:
distributing the case vector representations across a corresponding plurality of nodes;
for each node, calculating the cosine similarity between the sentence vector representation and the case vector representations held by that node, obtaining the case selection results of the plurality of nodes;
merging the case selection results of all nodes, sorting them in descending order of similarity, and selecting the top K cases as the similar cases.
Specifically, the case vector representations $v_i$ are distributed across $n$ nodes: master, node1, node2, node3, …. Suppose the number of cases to be searched is $S$; the case vectors are distributed evenly so that each node holds $S/n$ of them. Each node then computes in parallel, by cosine similarity, the similarity between the sentence vector representation $q$ and the case vectors it holds, and takes its local top-$K$ results. The master node merges the partial results of all $n$ nodes, sorts the merged results by similarity from high to low, and returns the top $K$. Ideally, the computation time is inversely proportional to the number of nodes $n$.
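Illustratively, the scatter/compute/merge scheme can be sketched as follows; a thread pool stands in for the master and node1…noden machines of a real cluster, and all names are illustrative:

```python
import heapq
import torch
import torch.nn.functional as F
from concurrent.futures import ThreadPoolExecutor

def node_top_k(vecs, offset, q, k):
    """One node's share of the work: cosine similarity over its shard of
    case vectors, returning that node's local top-K (score, global index)."""
    sims = F.normalize(vecs, dim=-1) @ F.normalize(q, dim=0)
    top = torch.topk(sims, min(k, sims.numel()))
    return [(s.item(), offset + i.item())
            for s, i in zip(top.values, top.indices)]

def distributed_search(case_vectors, q, n_nodes=4, k=10):
    # Scatter: split the S case vectors evenly, roughly S/n per node.
    shards = list(torch.chunk(case_vectors, n_nodes))
    offsets, acc = [], 0
    for s in shards:
        offsets.append(acc)
        acc += len(s)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:  # stand-in for real nodes
        futures = [pool.submit(node_top_k, s, o, q, k)
                   for s, o in zip(shards, offsets)]
        partial = [hit for f in futures for hit in f.result()]
    # Master node: merge the n partial lists, sort descending, keep the top K.
    return heapq.nlargest(k, partial)
```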
Embodiments of the present disclosure thus provide a distributed similarity calculation method, which solves the problem of serious time consumption when searching a large number of cases and improves retrieval efficiency.
Another embodiment of the present disclosure provides a case retrieval system, as shown in FIG. 2, comprising:
a receiving module 201, configured to receive a case retrieval request from a user, wherein the retrieval request includes a retrieval case sentence;
an encoding module 202, configured to input the retrieval case sentence into the pre-trained case retrieval encoder to obtain a corresponding sentence vector representation, wherein the case retrieval encoder is pre-trained in an unsupervised manner based on contrastive learning;
a calculation module 203, configured to obtain cases similar to the retrieval case sentence according to the sentence vector representation and the set of case vector representations, wherein the set of case vector representations is obtained by processing the case set with the case retrieval encoder.
Specifically, when the user needs to retrieve similar cases, the keyword sentence to be searched is entered on a computer or other device to form a retrieval case sentence, and the receiving module 201 receives it. In the initialization stage, the encoding module 202 passes all cases in the case library through the contrastively trained case retrieval encoder to obtain the vector representations $v_i$ of all cases in the library. Each time the user searches, the retrieval sentence is passed through the case retrieval encoder to obtain its sentence vector representation $q$. The calculation module 203 computes the similarity between $q$ and each case vector representation $v_i$ in the case library using the cosine function; the top $K$ cases ($K \in \mathbb{N}$) with the highest similarity are the similar cases to be returned.
Illustratively, the system further includes a training module 204, configured to:
set up the case retrieval encoder and a momentum encoder;
update the network weights of the encoder and the momentum encoder according to a loss function $L_i$, to obtain the trained case retrieval encoder.
Specifically, the training module 204 performs the training method of the case retrieval encoder described in step S11, thereby obtaining a trained case retrieval encoder for use by the encoding module 202.
The case retrieval system of this embodiment, by means of the unsupervised contrastive-learning-based retrieval method, solves the problem of training on case-library data that carry no labels, avoiding the labor cost and time of manual labeling, and improves the quality of the retrieval sentence vector representation, thereby improving retrieval accuracy.
As shown in fig. 3, another embodiment of the present disclosure provides an electronic device, including:
at least one processor 301, and a memory 302 communicatively coupled to the at least one processor 301 for storing one or more programs that, when executed by the at least one processor 301, enable the at least one processor 301 to implement a class retrieval method as described above.
Where the memory and the processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting the various circuits of the one or more processors and the memory together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or may be a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over the wireless medium via the antenna, which further receives the data and transmits the data to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory may be used to store data used by the processor in performing operations.
The electronic device of this embodiment, by implementing the above case retrieval method, solves the problem of training on case-library data that carry no labels, avoiding the labor cost and time of manual labeling, and improves the quality of the retrieval sentence vector representation, thereby improving retrieval accuracy.
Another embodiment of the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the case retrieval method described above.
The computer readable storage medium may be included in the system and the electronic device of the present disclosure, or may exist alone.
A computer-readable storage medium may be any tangible medium that contains or stores a program; it may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device. More specific examples include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, an optical fiber, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof.
The computer readable storage medium may also include a data signal propagated in baseband or as part of a carrier wave, with the computer readable program code embodied therein, specific examples of which include, but are not limited to, electromagnetic signals, optical signals, or any suitable combination thereof.
It is to be understood that the above embodiments are merely exemplary embodiments employed to illustrate the principles of the present disclosure, however, the present disclosure is not limited thereto. Various modifications and improvements may be made by those skilled in the art without departing from the spirit and substance of the disclosure, and are also considered to be within the scope of the disclosure.

Claims (10)

1. A case retrieval method, the method comprising:
receiving a case retrieval request from a user; wherein the retrieval request includes a retrieval case sentence;
inputting the retrieval case sentence into a pre-trained case retrieval encoder to obtain a corresponding sentence vector representation; wherein the case retrieval encoder is pre-trained in an unsupervised manner based on contrastive learning;
obtaining cases similar to the retrieval case sentence according to the sentence vector representation and a set of case vector representations; wherein the set of case vector representations is obtained by processing the case set with the case retrieval encoder.
2. The case retrieval method according to claim 1, wherein the case retrieval encoder is trained by:
setting up the case retrieval encoder and a momentum encoder;
updating the network weights of the encoder and the momentum encoder according to a loss function $L_i$, to obtain the trained case retrieval encoder.
3. The case retrieval method according to claim 2, wherein the loss function $L_i$ satisfies the following relation:

$$L_i = -\log \frac{e^{\mathrm{sim}(h_i,\, h_i^{+})}}{e^{\mathrm{sim}(h_i,\, h_i^{+})} + \sum_{j=1}^{N} e^{\mathrm{sim}(h_i,\, h_j^{-})} + \sum_{m=1}^{M} e^{\mathrm{sim}(h_i,\, h_m^{-})}} \qquad (1)$$

wherein $M$ represents the queue size of the momentum encoder, $h_m^{-}$ represents the encoded representation of a negative sample in the momentum encoder's queue, $N$ represents the size of each mini-batch, $h_j^{-}$ represents the encoded representation of a negative sample in the mini-batch, $h_i$ and $h_i^{+}$ represent each sample and its augmented positive sample respectively, and $\mathrm{sim}(\cdot,\cdot)$ denotes the similarity between two vector representations.
4. The case retrieval method according to claim 2, wherein the updating of the network weights of the encoder and the momentum encoder includes:
updating the network weight of the encoder by back propagation;
updating the network weight of the momentum encoder by the following formula (2):

$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q \qquad (2)$$

wherein $\theta_k$ is the network weight of the momentum encoder, $\theta_q$ is the network weight of the encoder, and $m \in [0, 1)$ is the momentum coefficient; and
during the update of the momentum encoder, each latest mini-batch of data enters a queue and the oldest data leaves the queue, and when each mini-batch is trained, the encodings held in the queue are used as negative samples for contrastive learning.
5. The case retrieval method according to claim 1, wherein the obtaining of cases similar to the retrieval case sentence according to the sentence vector representation and the set of case vector representations includes:
calculating the similarity between the sentence vector representation and each case vector representation in the set respectively, and selecting the several cases with the highest similarity as the similar cases.
6. The case retrieval method according to claim 5, wherein the calculating of the similarity between the sentence vector representation and each case vector representation, and the selecting of the cases with the highest similarity as the similar cases, include:
distributing the case vector representations across a corresponding plurality of nodes;
for each node, calculating the cosine similarity between the sentence vector representation and the case vector representations held by that node, obtaining the case selection results of the plurality of nodes;
merging the case selection results of all nodes, sorting them in descending order of similarity, and selecting the top K cases as the similar cases.
7. A case retrieval system, the system comprising:
a receiving module, configured to receive a case retrieval request from a user, wherein the retrieval request includes a retrieval case sentence;
an encoding module, configured to input the retrieval case sentence into a pre-trained case retrieval encoder to obtain a corresponding sentence vector representation, wherein the case retrieval encoder is pre-trained in an unsupervised manner based on contrastive learning;
a calculation module, configured to obtain cases similar to the retrieval case sentence according to the sentence vector representation and the set of case vector representations, wherein the set of case vector representations is obtained by processing the case set with the case retrieval encoder.
8. The case retrieval system of claim 7, further comprising a training module configured to:
set up the case retrieval encoder and a momentum encoder;
update the network weights of the encoder and the momentum encoder according to a loss function $L_i$, to obtain the trained case retrieval encoder.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor, storing one or more programs that, when executed by the at least one processor, cause the at least one processor to implement the case retrieval method of any one of claims 1 to 6.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the case retrieval method of any one of claims 1 to 6.
CN202310187729.3A 2023-03-02 2023-03-02 Class search method, system, electronic equipment and storage medium Pending CN116069903A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310187729.3A CN116069903A (en) 2023-03-02 2023-03-02 Class search method, system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310187729.3A CN116069903A (en) 2023-03-02 2023-03-02 Class search method, system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116069903A 2023-05-05

Family

ID=86180194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310187729.3A Pending CN116069903A (en) 2023-03-02 2023-03-02 Class search method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116069903A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837370A (en) * 2021-10-20 2021-12-24 北京房江湖科技有限公司 Method and apparatus for training a model based on contrast learning
CN113934830A (en) * 2021-10-19 2022-01-14 平安国际智慧城市科技股份有限公司 Text retrieval model training, question and answer retrieval method, device, equipment and medium
CN114154518A (en) * 2021-12-02 2022-03-08 泰康保险集团股份有限公司 Data enhancement model training method and device, electronic equipment and storage medium
CN114201581A (en) * 2021-11-29 2022-03-18 中国科学院深圳先进技术研究院 Long text retrieval model based on contrast learning
CN114443891A (en) * 2022-01-14 2022-05-06 北京有竹居网络技术有限公司 Encoder generation method, fingerprint extraction method, medium, and electronic device
CN114881043A (en) * 2022-07-11 2022-08-09 四川大学 Deep learning model-based legal document semantic similarity evaluation method and system
CN115495555A (en) * 2022-09-26 2022-12-20 中国科学院深圳先进技术研究院 Document retrieval method and system based on deep learning
CN115640799A (en) * 2022-09-07 2023-01-24 天津工业大学 Sentence vector characterization method based on enhanced momentum contrast learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113934830A (en) * 2021-10-19 2022-01-14 平安国际智慧城市科技股份有限公司 Text retrieval model training, question and answer retrieval method, device, equipment and medium
CN113837370A (en) * 2021-10-20 2021-12-24 北京房江湖科技有限公司 Method and apparatus for training a model based on contrast learning
CN114201581A (en) * 2021-11-29 2022-03-18 中国科学院深圳先进技术研究院 Long text retrieval model based on contrast learning
CN114154518A (en) * 2021-12-02 2022-03-08 泰康保险集团股份有限公司 Data enhancement model training method and device, electronic equipment and storage medium
CN114443891A (en) * 2022-01-14 2022-05-06 北京有竹居网络技术有限公司 Encoder generation method, fingerprint extraction method, medium, and electronic device
CN114881043A (en) * 2022-07-11 2022-08-09 四川大学 Deep learning model-based legal document semantic similarity evaluation method and system
CN115640799A (en) * 2022-09-07 2023-01-24 天津工业大学 Sentence vector characterization method based on enhanced momentum contrast learning
CN115495555A (en) * 2022-09-26 2022-12-20 中国科学院深圳先进技术研究院 Document retrieval method and system based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TING CHEN et al.: "A Simple Framework for Contrastive Learning of Visual Representations", Retrieved from the Internet <URL:http://arxiv.org> *

Similar Documents

Publication Publication Date Title
US20190057164A1 (en) Search method and apparatus based on artificial intelligence
CN110598078B (en) Data retrieval method and device, computer-readable storage medium and electronic device
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN111382270A (en) Intention recognition method, device and equipment based on text classifier and storage medium
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
CN114385780A (en) Program interface information recommendation method and device, electronic equipment and readable medium
CN110634050B (en) Method, device, electronic equipment and storage medium for identifying house source type
CN114494709A (en) Feature extraction model generation method, image feature extraction method and device
CN113591490B (en) Information processing method and device and electronic equipment
CN114490926A (en) Method and device for determining similar problems, storage medium and terminal
CN112487813A (en) Named entity recognition method and system, electronic equipment and storage medium
CN104615620A (en) Map search type identification method and device and map search method and system
JP2022541832A (en) Method and apparatus for retrieving images
CN116069903A (en) Class search method, system, electronic equipment and storage medium
CN113240089B (en) Graph neural network model training method and device based on graph retrieval engine
CN114880991A (en) Knowledge map question-answer entity linking method, device, equipment and medium
CN110688508B (en) Image-text data expansion method and device and electronic equipment
CN115146033A (en) Named entity identification method and device
CN112417260B (en) Localized recommendation method, device and storage medium
CN109325198B (en) Resource display method and device and storage medium
EP4127957A1 (en) Methods and systems for searching and retrieving information
CN112784600A (en) Information sorting method and device, electronic equipment and storage medium
CN111949765A (en) Similar text searching method, system, equipment and storage medium based on semantics
CN110781227B (en) Information processing method and device
CN116049414B (en) Topic description-based text clustering method, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 2023-05-05