CN115794999A - Patent document query method based on diffusion model and computer equipment - Google Patents


Info

Publication number
CN115794999A
CN115794999A
Authority
CN
China
Prior art keywords
diffusion
model
diffusion model
retrieval
abstract
Prior art date
Legal status
Granted
Application number
CN202310048755.8A
Other languages
Chinese (zh)
Other versions
CN115794999B (en)
Inventor
尤元岳
徐青伟
严长春
裴非
范娥媚
Current Assignee
Beijing Xinghe Zhiyuan Technology Co ltd
Zhiguagua Tianjin Big Data Technology Co ltd
Original Assignee
Zhiguagua Tianjin Big Data Technology Co ltd
Beijing Zhiguquan Technology Service Co ltd
Priority date
Filing date
Publication date
Application filed by Zhiguagua Tianjin Big Data Technology Co ltd and Beijing Zhiguquan Technology Service Co ltd
Priority to CN202310048755.8A
Publication of CN115794999A
Application granted
Publication of CN115794999B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a patent document query method based on a diffusion model, and computer equipment, aiming to solve the problem that the completeness and accuracy of existing patent retrieval are not ideal. In the method, a short text input by a user is segmented into a plurality of keywords, which are respectively fed into three diffusion models for diffusion generation, with the clusters of the keywords in the segmentation result jointly serving as control signals that constrain the direction of diffusion generation. The training corpora of the three diffusion models are derived from the abstract, the claims and the specification respectively, so that each model generates sentences resembling the expression form of the corresponding section. After retrieval, the three resulting groups of patent documents are weighted and integrated, and the patent documents with the highest weighted similarity are selected and output as the user's intended retrieval result. The retrieval result is thus comprehensive and closer to the user's real retrieval intention, improving the completeness and accuracy of patent retrieval.

Description

Patent document query method based on diffusion model and computer equipment
Technical Field
The application belongs to the technical field of document retrieval, and particularly relates to a patent document query method and computer equipment.
Background
Patent retrieval is used for patent duplication checking and infringement detection, and is a key link in the processes of patent application and rights maintenance; achieving accurate and efficient retrieval has therefore become an important part of patent system construction.
The conventional patent retrieval method is generally based on matching and ranking between the retrieval keywords input by a user and patent texts. Particularly in scenarios such as simple retrieval and semantic retrieval, the retrieval keywords input by the user may touch on several topics at once, so that the short input text cannot fully express the user's real retrieval intention; the limited information in the short text is mismatched with the rich semantic content of patent documents, and the completeness and accuracy of the final retrieval are therefore not ideal.
Meanwhile, traditional query expansion relies on general-domain similar-word lists, word vectors and the like; however, general-domain similar words cannot effectively capture the semantic similarity between technical terms in the patent field. Such methods cannot adapt to diverse, efficient retrieval in dynamic, unseen (zero-shot) patent retrieval scenarios, and retrieval texts generated automatically by such query expansion do little to improve overall retrieval accuracy.
Disclosure of Invention
The application provides a patent document query method based on a diffusion model and computer equipment, and aims to solve the problem that the completeness and accuracy of the current patent retrieval are not ideal.
Therefore, the following technical scheme is provided in the application:
a patent document query method based on a diffusion model comprises the following steps:
receiving text content input by a user;
if the text content input by the user for retrieval exceeds a preset length threshold, segmenting the text content, and then respectively feeding the segmentation results into three diffusion models for diffusion generation, wherein the clusters of all keywords in the segmentation results jointly serve as control signals of the diffusion models to constrain their generation direction; the three diffusion models are respectively denoted the first diffusion model, the second diffusion model and the third diffusion model, and their training corpora are respectively derived from the abstract, the claims and the specification, so as to correspondingly generate sentences similar in expression form to abstract, claim and specification sentences;
the sentences generated by the three diffusion models are sent into a retrieval system, and the abstract, the claim and the specification are respectively and correspondingly taken as retrieval ranges to retrieve patent documents, so that three groups of patent documents are obtained;
and performing weighted integration on the three groups of patent documents, and selecting a plurality of weighted patent documents with the highest similarity as an intention retrieval result of the user and outputting the intention retrieval result.
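The weighted-integration step above can be sketched as follows; the section weights, document identifiers and similarity scores are illustrative assumptions, not values fixed by the application:

```python
from collections import defaultdict

def merge_results(groups, weights, k):
    """Weighted integration of per-section retrieval results.

    groups : three lists of (doc_id, similarity) pairs, one per retrieval
             range (abstract, claims, specification).
    weights: one weight per group (illustrative values).
    k      : number of documents to return (K < 3N in the application).
    """
    scores = defaultdict(float)
    for group, w in zip(groups, weights):
        for doc_id, sim in group:
            # a document retrieved in several ranges accumulates weight
            scores[doc_id] += w * sim
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

abstract_hits = [("CN1", 0.92), ("CN2", 0.85)]
claim_hits    = [("CN2", 0.80), ("CN3", 0.75)]
desc_hits     = [("CN1", 0.70), ("CN3", 0.65)]
top = merge_results([abstract_hits, claim_hits, desc_hits],
                    weights=[0.4, 0.35, 0.25], k=2)
```

Because scores accumulate across ranges, a document found in several sections naturally rises in the final ranking.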
Optionally, the patent document query method further includes:
if the text content input by the user search does not exceed the preset length threshold, the text content input by the user search is directly sent to a search system, and patent documents are searched respectively by taking the abstract, the claim and the specification as search ranges to obtain three groups of patent documents.
Optionally, the three groups of patent documents are of the same size; of course, their sizes may also differ.
Preferably, the training method of the three diffusion models comprises the following steps:
Noise is gradually added to the training corpus, continuously destroying the corpus information, and the corpus information at each step of the destruction process is stored until the original corpus information has been destroyed into completely random Gaussian noise; this process is recorded as the noising process. The completely random Gaussian noise is then denoised: using the corrupted corpus information stored during the noising process as label data, a generative model denoises continuously until the original corpus information is finally recovered, so that the generative model learns, through the denoising process, the ability to generate the corresponding corpus.
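The noising half of this training scheme can be sketched as follows; the fixed per-step noise amount `beta` and the toy embedding vector are illustrative assumptions (a real schedule would vary beta over steps):

```python
import math
import random

def noising_schedule(x0, T, beta):
    """Progressively corrupt an embedding vector x0 with Gaussian noise,
    storing every intermediate state (the 'noising process' above); the
    stored states later serve as label data for the denoiser."""
    xs = [x0]
    x = x0
    for _ in range(T):
        x = [math.sqrt(1 - beta) * xi + math.sqrt(beta) * random.gauss(0, 1)
             for xi in x]
        xs.append(x)
    return xs

random.seed(0)
# after enough steps the signal term shrinks toward pure Gaussian noise
steps = noising_schedule([1.0, -0.5, 0.3], T=50, beta=0.1)
```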
Optionally, the generating model adopts a Transformer model or a GPT model.
Preferably, the method for generating the corpus comprises:
extracting sentences from the abstract, the claim and the specification of the published patent document respectively, and recording the sentences as a first sentence, a second sentence and a third sentence;
and respectively segmenting the first sentence, the second sentence and the third sentence by means of a text word segmenter, wherein the corresponding word segmentation results are the training corpora used for the first diffusion model, the second diffusion model and the third diffusion model.
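A minimal sketch of this corpus-generation step; the whitespace tokenizer is a hypothetical stand-in for the trained Chinese word segmenter (e.g. jieba) the method would actually use, and the sample sentences are invented:

```python
def tokenize(sentence):
    """Trivial stand-in tokenizer; the application would use a trained
    Chinese word segmenter such as jieba here."""
    return sentence.split()

def segment_corpora(first, second, third):
    """Segment the sentences extracted from abstracts, claims and
    specifications; each token list becomes training corpus for the
    first, second and third diffusion model respectively."""
    return {
        "first_model":  [tokenize(s) for s in first],
        "second_model": [tokenize(s) for s in second],
        "third_model":  [tokenize(s) for s in third],
    }

corpora = segment_corpora(
    first=["an artificial intelligence automobile"],
    second=["a method comprising"],
    third=["the module is configured"],
)
```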
Preferably, in the three diffusion models, each diffusion model performs a diffusion generation process, which specifically includes:
performing word segmentation and stop-word removal on the text content input by the user to obtain a plurality of keywords;
respectively searching a domain word list containing each keyword; the field word list is generated in advance based on a clustering algorithm;
and taking the other words in the domain word list to which each keyword belongs as target words semantically similar to the keyword; training a classifier corresponding to the category of the domain word list to obtain the diffusion model's probability for each target word in the list; then updating the hidden variables of the diffusion model by gradient steps, repeating the diffusion over multiple steps, and mapping the finally generated hidden variables to text through a softmax function, obtaining sentences in the direction controlled by the keywords.
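The gradient update on the hidden variables can be sketched as classifier-guided steering of a latent vector. This is a toy, assumption-laden sketch: the linear classifier weights `W` stand in for a classifier trained on the domain word lists (here they are random for illustration), and the latent is a plain vector rather than a full diffusion state:

```python
import math
import random

def softmax(zs):
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def matvec(W, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

def guided_update(latent, W, target, lr=0.05, steps=50):
    """Nudge a hidden variable toward a keyword's domain class by gradient
    ascent on the classifier's log-probability for that class. For a linear
    classifier, d(log p[target])/d(latent) = W[target] - sum_c p_c * W[c]."""
    for _ in range(steps):
        p = softmax(matvec(W, latent))
        grad = [W[target][j] - sum(p[c] * W[c][j] for c in range(len(W)))
                for j in range(len(latent))]
        latent = [xi + lr * g for xi, g in zip(latent, grad)]
    return latent

random.seed(0)
W = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]  # 3 domains
x = [random.gauss(0, 1) for _ in range(4)]                      # toy latent
before = softmax(matvec(W, x))[1]
after = softmax(matvec(W, guided_update(x, W, target=1)))[1]
```

After the guided steps, the classifier assigns the latent a higher probability for the target domain, which is the sense in which the keyword clusters "limit the generation direction".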
Preferably, the retrieval system performs similarity calculation between the sentence generated by the first diffusion model and the abstract text vectors of patent documents, between the sentence generated by the second diffusion model and the claim text vectors, and between the sentence generated by the third diffusion model and the specification text vectors, using a BM25 model or BERT word-vector representations, and returns the N patent documents with the highest similarity in each case.
Computer equipment comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the above diffusion-model-based patent document query method.
A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the above-mentioned diffusion model-based patent document query method.
The application has at least the following beneficial effects:
the method comprises the steps of obtaining a plurality of keywords through word segmentation for a short text input by a user, and respectively sending the keywords into three diffusion models for diffusion generation, wherein clusters of the keywords in word segmentation results are jointly used as control signals of the diffusion models to limit the generation direction of the diffusion models; the training corpora of the three diffusion models are respectively derived from the abstract, the claim and the specification and are used for correspondingly generating sentences similar to the sentence expression forms of the abstract, the claim and the specification; sending the documents into a retrieval system for retrieval to obtain three groups of patent documents, performing weighted integration on the three groups of patent documents, and selecting and outputting a plurality of patent documents with the highest similarity after weighting as the intention retrieval result of the user; therefore, the retrieval result is more comprehensive and more accords with the real retrieval intention of the user, and the completeness and the accuracy of patent retrieval are improved.
Drawings
FIG. 1 is a schematic diagram illustrating a basic principle of a patent document query method based on a diffusion model according to an embodiment of the present application;
FIG. 2 is a diagram illustrating the training process of a diffusion model in an embodiment of the present application (taking the abstract diffusion model as an example);
FIG. 3 is a schematic diagram of a training method for three diffusion models according to an embodiment of the present application;
FIG. 4 is a diagram illustrating the sentence generation process of a diffusion model in an embodiment of the present application (taking the abstract diffusion model as an example);
FIG. 5 is a diagram illustrating a sentence generation method for three diffusion models according to an embodiment of the present application;
FIG. 6 is a diagram illustrating an extended search and integration process according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, the application scenario is as follows: a user enters a search statement on a website or APP that provides patent search services, either through "simple search" (typically the search box presented on the home page) or by selecting "semantic search" (which supports longer text content).
As shown in fig. 1, there is provided a patent document query method based on a diffusion model, comprising the following steps:
receiving text content input by a user;
if the text content input by the user for retrieval does not exceed the preset length threshold, it is sent directly to the retrieval system (possibly after appropriate preprocessing), and patent documents are retrieved with the abstract, the claims and the specification respectively as retrieval ranges, obtaining three groups of patent documents;
if the text content input by the user for retrieval exceeds the preset length threshold, the text content is segmented, and the segmentation result is then respectively fed into three diffusion models (an abstract diffusion model, a claim diffusion model and a specification diffusion model) for diffusion generation, wherein the clusters of all keywords in the segmentation result jointly serve as control signals of the diffusion models to constrain their generation direction; the three diffusion models can be abbreviated as the first diffusion model, the second diffusion model and the third diffusion model, and their training corpora are respectively derived from the abstract, the claims and the specification, so as to correspondingly generate sentences similar in expression form to abstract, claim and specification sentences; the sentences generated by the three diffusion models are sent to the retrieval system, and patent documents are retrieved with the abstract, the claims and the specification correspondingly as retrieval ranges;
note that the "claim" and the "claims" referred to herein are different concepts: the former emphasizes an individual claim (each claim expresses an independent meaning and can yield a text vector; for the same semantics, the expression form of a claim sentence may differ from that of sentences in the abstract and the specification), while the latter is one of the basic components of a patent document (the target range for similarity calculation);
in addition, similarity of sentence expression form is a different concept from the similarity calculation used in patent document retrieval: the former concerns the form of expression, aiming to make the sentences generated by the three models look more like abstract sentences, claim sentences and specification sentences respectively, while the latter concerns semantic closeness;
performing weighted integration on the three groups of patent documents, and selecting the K patent documents with the highest weighted similarity as the user's intended retrieval result and outputting them; K is a preset number of documents to return and is less than 3N.
Specifically, retrieving patent documents with the abstract, the claims and the specification respectively as retrieval ranges comprises: performing similarity calculation between the sentence generated by the first diffusion model and the abstract text vectors of the patent documents in the patent library, between the sentence generated by the second diffusion model and the claim text vectors, and between the sentence generated by the third diffusion model and the specification text vectors, and returning the N most similar patent documents in each case, obtaining three groups of patent documents (3N patent documents in total). Of course, the sizes of the three groups may also differ, for example: the first group may be set to 100 documents, the second to 80 and the third to 50, giving 230 documents in total; the three groups are then weighted and integrated, and the 150 patents with the highest weighted similarity are selected.
Note that retrieving patent documents with a specific section (abstract, claims or specification) as the retrieval range (approximate search) is prior art in this field.
The diffusion model is a deep generative model. Automatically generating query expansions conditioned on the input has the advantages of strong robustness, high sampling efficiency, semantic closeness and sample diversity, and the generated results are to some extent interpretable. Incrementally generating query content through the diffusion model gives the retrieval results more complete coverage and improves recall; since the cost of missing a related patent is high, the recall metric is very important. In addition, new words and terms appear quickly in patents, and the diffusion model can generate interpretable expanded queries with wide, diverse coverage, which aids the user's intuitive understanding. The diffusion model thus plays a key role in the whole process: this embodiment mainly trains a diffusion model that can be controlled by keywords and applies this controllable diffusion model to an expanded retrieval system for the patent field.
The purpose of the first part, training, is to give the diffusion model the ability to generate sentences in arbitrary text directions. For example, after training on data from various fields, the diffusion model can generate sentences oriented toward many fields, including artificial intelligence, computing, transportation and so on. At this stage, however, there is no way to control the field of the specific sentences generated: the model may produce a sentence in the artificial intelligence field or one in the computing field, completely at random; the first step only gives the model the ability to generate sentences across all fields. In this step, to make the expanded patent retrieval more accurate, three models are trained respectively on corpora from the abstract, the claims and the specification, so that the sentences generated by the three models better resemble sentences from those three sections.
The method for training the diffusion model in the embodiment mainly comprises the following steps:
extracting sentences from the abstract, the claim and the specification of the published patent document respectively, wherein the sentences can be marked as a first sentence, a second sentence and a third sentence; the first sentence, the second sentence and the third sentence are respectively subjected to word segmentation by adopting a text word segmentation device, and the corresponding word segmentation result is training corpora used for the first diffusion model, the second diffusion model and the third diffusion model;
Noise is gradually added to the training corpus corresponding to each diffusion model, continuously destroying the corpus information, and the corpus information at each step of the destruction process is stored until the original corpus information has been destroyed into completely random Gaussian noise; this process is recorded as the noising process. The completely random Gaussian noise is then denoised: using the corrupted corpus information stored during the noising process as label data, a generative model denoises continuously until the original corpus information is finally recovered, so that the model learns, through the denoising process, the ability to generate the corresponding corpus.
The second part, generation, constrains the generation direction of the diffusion model through the keywords input by the user, so that the diffusion model generates sentences with a fixed textual direction. For example, if the user wants to expand and retrieve sentences in the artificial intelligence field and therefore inputs the keyword "artificial intelligence", then during generation the model gradually migrates the direction of the sentences it generates toward the artificial intelligence field according to the keyword, finally producing sentences in that field. These sentences can serve as richer search conditions expanding the user's "artificial intelligence" keyword query. Based on the models trained in the first step on abstracts, claims and specifications, the user's keyword is expanded, and the three diffusion models generate artificial-intelligence-oriented abstract, claim and specification sentences respectively.
The specific process of the diffusion generation by the diffusion model in this embodiment mainly includes:
Word segmentation and stop-word removal are performed on the text content input by the user to obtain a plurality of keywords; the domain word list containing each keyword is looked up (each domain word list is generated in advance based on a clustering algorithm); the other words in the domain word list to which each keyword belongs are taken as target words semantically similar to the keyword, and a classifier corresponding to the category of the domain word list is trained to obtain the diffusion model's probability for each target word in the list; the hidden variables of the diffusion model are then updated by gradient steps, the diffusion is repeated over multiple steps, and the finally generated hidden variables are mapped to text through a softmax function, obtaining sentences in the direction controlled by the keywords.
In the third step, retrieval, the abstract, claim and specification sentences in the artificial intelligence field obtained in the generation part are used to search the abstracts, claims and specifications, respectively, of the patent documents in the retrieval system. That is, the abstract sentences generated by the diffusion model are compared against the abstract information of patent documents, and the retrieval system returns the top N patents based on the similarity between patent abstracts and the generated sentences. Similarly, the claim sentences and the specification sentences are searched against the claims and specifications of patent documents in the search engine, and the top N patents by similarity are likewise returned for each. The resulting 3N patents are then weighted and tallied to find the top K patents most relevant to the user's input, and the retrieval system returns these K patents as the result for the user's "artificial intelligence" keyword.
Thus, the whole process can be summarized in the following three steps:
1. The diffusion model training process, in which the diffusion model acquires the ability to generate sentences in random directions; trained separately on corpora from different patent sections, each diffusion model can generate sentences for the corresponding section. This process is not part of the expanded-retrieval flow itself but is a precondition for it.
2. Sentence generation by the diffusion models. This is part of the expanded-retrieval process: each diffusion model gradually generates sentences in the direction of the keywords input by the user, and the models trained on different patent sections respectively generate abstract, claim and specification sentences in the keyword direction.
3. The expanded retrieval and integration process: the generated same-field sentences for the three sections are passed to the retrieval system to search the three sections of patents respectively, and the retrieval results are weighted and tallied to find the top K most similar patents, completing the expanded retrieval.
These three steps are described in further detail below.
1. The diffusion model training process aims to give the diffusion model the ability to generate sentences in random domain directions. The training comprises the following steps: corpus construction, abstract diffusion model training, claim diffusion model training and specification diffusion model training.
Step 1, corpus construction: since the expanded retrieval targets the patent field and the final goal is for the models to generate sentences related to the abstract, the claims and the specification, the text content of these three sections is extracted from the collected patent documents; the content of each section is split by periods, semicolons and the like, and the resulting sentences serve as the preliminary training corpora of the three diffusion models. Three different diffusion models are then prepared and trained with the three corpora.
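The punctuation-based sentence splitting described above can be sketched as a small helper; handling both ASCII and CJK full-width punctuation is an assumption about the corpus, not something the text specifies:

```python
import re

def split_sentences(section_text):
    """Split a section (abstract / claims / specification) into preliminary
    training sentences on periods and semicolons, including the CJK
    full-width forms."""
    parts = re.split(r"[.;。；]", section_text)
    return [p.strip() for p in parts if p.strip()]
```

For example, `split_sentences` applied to one section's text yields the sentence list that feeds the corresponding diffusion model's corpus.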
To avoid repeated description, the following takes the abstract sentence corpus as an example and trains the abstract diffusion model; the claim and specification diffusion models must likewise be trained as preconditions for the subsequent generation process.
Step 2, abstract diffusion model training: as shown in fig. 2, the overall idea of the training process is to gradually add noise to the collected abstract corpus, continuously destroying the information of the whole corpus and storing the information at each step of the destruction, until the corpus information has been destroyed into completely random Gaussian noise. This process is called forward propagation, i.e. the noising process. The random Gaussian noise obtained after the noising process must then be denoised step by step: using the corrupted corpus information stored during the noising process as label data, a generative model such as a Transformer or GPT model continuously reduces the noise until the initial corpus information is recovered, through which the generative model learns the ability to generate the corresponding corpus. The specific training process is as shown in fig. 2:
(1) The abstract sentences acquired in the corpus construction step serve as input data for the diffusion model; in this example, the abstract sentence "an artificial intelligence automobile, comprising an automatic route-searching method and a danger-prediction module" is used as the input text. The input text is then segmented; the text segmenter can be pre-trained for this purpose, or a ready-trained segmenter such as the jieba segmenter can be used directly. The segmentation result is w, the word list obtained by segmenting the input sentence. Assuming the input sentence contains n words after segmentation,

\( w = [w_1, w_2, \ldots, w_n] \)

and the segmentation result in this example is the word list of the example sentence above.

(2) The segmentation result w is passed to a word-vector embedding layer EMB, so that the discrete words are mapped into a continuous space; the resulting word embedding is

\( \mathrm{EMB}(w) = [\mathrm{EMB}(w_1), \ldots, \mathrm{EMB}(w_n)] \in \mathbb{R}^{nd} \)

i.e., the n words are mapped to n d-dimensional vectors.
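Step (2) can be sketched with a toy embedding layer; this is a hypothetical stand-in for a trained EMB layer, with random rather than learned vectors:

```python
import random

class EmbeddingLayer:
    """Toy EMB layer: maps each discrete token to a d-dimensional vector,
    so n tokens become n d-dimensional vectors as in step (2)."""
    def __init__(self, d, seed=0):
        self.d = d
        self.table = {}
        self.rng = random.Random(seed)

    def __call__(self, tokens):
        out = []
        for t in tokens:
            if t not in self.table:  # lazily create one vector per token
                self.table[t] = [self.rng.gauss(0, 1) for _ in range(self.d)]
            out.append(self.table[t])
        return out

emb = EmbeddingLayer(d=16)
vectors = emb(["artificial", "intelligence", "automobile"])
```

The same token always maps to the same vector, which is the property the diffusion model's rounding step later relies on.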
(3) The embedded word vectors are then converted, through a Markov chain, into the hidden variables of the diffusion model: the probabilistic model

\( q_\phi(x_0 \mid w) = \mathcal{N}\big(\mathrm{EMB}(w),\ \sigma_0 I\big) \)

generates the corresponding hidden variable \( x_0 \), where \( q_\phi(x_0 \mid w) \) denotes the probability of generating \( x_0 \) through word-vector encoding and the Markov chain, given \( w \). \( \mathcal{N}(\mathrm{EMB}(w), \sigma_0 I) \) is a normal distribution with \( \mathrm{EMB}(w) \) as mean and \( \sigma_0 I \) as variance, and the value of \( x_0 \) is sampled from this normal distribution. In the reverse process, a trainable rounding step \( p_\theta(w \mid x_0) \) is added to map the hidden variable back to the original segmented text; the mapping relation is

\( p_\theta(w \mid x_0) = \prod_{i=1}^{n} p_\theta(w_i \mid x_i) \)

where \( p_\theta(w_i \mid x_i) \) is a softmax distribution, and \( p_\theta(w \mid x_0) \) denotes the probability of obtaining \( w \) from the softmax distribution given \( x_0 \). In the following, for ease of understanding, \( q \) is used as the probabilistic representation of feed-forward propagation and \( p_\theta \) as the probabilistic representation of the reverse denoising process.
(4) During the feed-forward propagation process, the intermediate hidden variables z_1, z_2, ..., z_T are constructed. The feed-forward propagation adds Gaussian noise to z_0 step by step until T steps have been added; at step T, z_T is close to pure Gaussian noise. Each step's transfer from z_{t-1} to z_t is sampled from

q(z_t | z_{t-1}) = N(z_t; sqrt(1 - β_t) z_{t-1}, β_t I),

where β_t is the amount of Gaussian noise added at step t. β_t is a hyperparameter, so the feed-forward process q contains no trainable parameters; it defines the training objective: noisy data are generated according to the predefined feed-forward process q, and the model is trained to reverse that process and reconstruct the data.
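The step-wise noising described above can be sketched directly. The linear β schedule, the number of steps T = 50, and the toy 8-dimensional starting vector are assumptions; the patent does not specify a schedule.

```python
import math
import random

rng = random.Random(0)
T = 50
# Linear beta schedule (assumed): small noise early, more noise later.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

def q_step(z_prev, beta):
    # Sample z_t ~ N(sqrt(1 - beta) * z_{t-1}, beta * I), coordinate-wise.
    return [math.sqrt(1 - beta) * x + math.sqrt(beta) * rng.gauss(0, 1)
            for x in z_prev]

z = [1.0] * 8          # z_0: a toy embedded vector
trajectory = [z]
for beta in betas:     # after T steps, z_T approaches pure Gaussian noise
    z = q_step(z, beta)
    trajectory.append(z)
```

Because β_t is fixed in advance, this forward pass has no trainable parameters; only the reverse process is learned.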
(5) In back propagation, the diffusion model gradually denoises the Gaussian noise z_T, thereby gradually reconstructing z_0. The whole process is as follows: during reconstruction, the model starts from the Gaussian noise z_T and gradually denoises it, generating a series of hidden variables z_{T-1}, ..., z_1, z_0 and thereby approaching samples of the target distribution. The initial state is z_T ~ N(0, I), and each step's denoising from z_t to z_{t-1} is obtained from

p_θ(z_{t-1} | z_t) = N(z_{t-1}; μ_θ(z_t, t), Σ_θ(z_t, t)),

where the mean μ_θ(z_t, t) and variance Σ_θ(z_t, t) can be calculated and learned by the model. The data of the training process are: the input is z_0, and the outputs z_1, ..., z_T are obtained by the forward process of diffusion; the model is used to learn the mean and variance of the distributions in the feed-forward propagation. For example, in the denoising process, the input of p_θ is z_t and the output is the predicted z_{t-1}, and this predicted z_{t-1} is to approach the z_{t-1} given by the posterior q(z_{t-1} | z_t, z_0) of the feed-forward propagation. Taking the difference between the feed-forward posterior mean μ̂(z_t, z_0) and the denoising mean μ_θ(z_t, t) as the loss ||μ_θ(z_t, t) - μ̂(z_t, z_0)||², back propagation of this loss function makes μ_θ and Σ_θ learn the mean and variance of the current distribution.
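The training target μ̂(z_t, z_0) can be computed in closed form in a DDPM-style model; the sketch below computes that posterior mean and the squared-error term it is compared against. The constant β schedule, toy vectors, and stand-in network prediction are assumptions for illustration.

```python
import math

def posterior_mean(z_t, z_0, t, betas):
    # mu_hat for q(z_{t-1} | z_t, z_0) in a DDPM-style model:
    #   mu_hat = sqrt(abar_{t-1}) * beta_t / (1 - abar_t) * z_0
    #          + sqrt(alpha_t) * (1 - abar_{t-1}) / (1 - abar_t) * z_t
    alphas = [1 - b for b in betas]
    abars, abar = [], 1.0
    for a in alphas:
        abar *= a
        abars.append(abar)
    a_t, ab_t = alphas[t], abars[t]
    ab_prev = abars[t - 1] if t > 0 else 1.0
    c0 = math.sqrt(ab_prev) * betas[t] / (1 - ab_t)
    ct = math.sqrt(a_t) * (1 - ab_prev) / (1 - ab_t)
    return [c0 * x0 + ct * xt for x0, xt in zip(z_0, z_t)]

def mse(u, v):
    # || mu_theta - mu_hat ||^2, the per-step training loss term.
    return sum((a - b) ** 2 for a, b in zip(u, v))

betas = [0.1] * 10          # constant schedule (assumed)
z_0 = [1.0, -1.0]
z_t = [0.5, -0.5]
mu_hat = posterior_mean(z_t, z_0, t=3, betas=betas)
mu_theta = [0.6, -0.6]      # stand-in for the network's predicted mean
loss = mse(mu_theta, mu_hat)
```

During training, gradients of this loss flow back into the network producing μ_θ, so it learns the mean of the feed-forward posterior.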
(6) The diffusion model is trained by maximizing the log-likelihood of the data, E_{z_0 ~ p(z_0)}[log p_θ(z_0)], and the canonical training target is the variational lower bound of log p_θ(z_0); the loss function of the diffusion model therefore becomes:

L_vlb(z_0) = E_q [ log (q(z_T | z_0) / p_θ(z_T)) + Σ_{t=2}^{T} log (q(z_{t-1} | z_t, z_0) / p_θ(z_{t-1} | z_t)) - log p_θ(z_0 | z_1) ].

However, this training objective is not stable and requires a great deal of optimization skill, so a simple alternative objective is devised: L_vlb is expanded and re-weighted to obtain a mean-squared-error loss, so the loss function of the diffusion model becomes

L_simple(z_0) = Σ_{t=1}^{T} E_{q(z_t | z_0)} || μ_θ(z_t, t) - μ̂(z_t, z_0) ||²,

where μ̂(z_t, z_0) is the mean of the posterior q(z_{t-1} | z_t, z_0), which is close to Gaussian, and μ_θ(z_t, t) is the mean predicted by the neural network.
(7) Bringing the process q_φ(z_0 | w) of mapping word vectors to hidden variables from step (3) and the rounding process p_θ(w | z_0) of remapping the reconstructed z_0 back to words into the loss function of step (6) finally yields the end-to-end training loss function:

L_e2e_vlb(w) = E_{q_φ(z_0 | w)} [ L_vlb(z_0) + log q_φ(z_0 | w) - log p_θ(w | z_0) ];

it can also be optimized as:

L_e2e_simple(w) = E_{q_φ(z_{0:T} | w)} [ L_simple(z_0) + || EMB(w) - μ_θ(z_1, 1) ||² - log p_θ(w | z_0) ].

Both training loss functions are initially equivalent;
the diffusion model is trained through the loss function with back propagation, completing the training of a single diffusion model.
Step 3: to improve retrieval accuracy, the claim diffusion model and the specification diffusion model are trained again according to processes (1) to (7) of step 2, with the corpus information taken from sentences of the claims and of the specification respectively; three diffusion models are thus obtained, as shown in fig. 3.
2. Diffusion model sentence-generation process. The aim of this process is, for the keywords input by the user, to perform content diffusion generation in the keyword direction with each of the three trained diffusion models, thereby generating sentences that correspond to the three parts of a patent and belong to the keyword's field direction. The process can be divided into diffusion model preparation, abstract sentence generation, claim sentence generation, and specification sentence generation.
Step 1, diffusion model preparation: first, a domain vocabulary is pre-trained, or an existing pre-trained domain vocabulary is used; for example, the artificial intelligence domain may include keywords such as "artificial intelligence" and "neural network". Each individual domain vocabulary is then treated as a bag of words for the nbow model. To build the domain vocabularies, the text content of all Chinese patents is segmented, de-duplicated, and stripped of stop words, and each word is encoded into a vector; the encoding can use an existing word vector library or a model such as bert. Using the encoded word vectors, the words contained in all Chinese patents are clustered by KNN or K-means to obtain clusters (the "clustering words" in fig. 4), and each cluster's word list obtained by clustering is regarded as a domain vocabulary.
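The clustering of encoded word vectors into domain vocabularies can be sketched with a minimal K-means. The 2-d toy vectors, the two clusters, and the initial centroids are assumptions standing in for real word embeddings and a real cluster count.

```python
def kmeans(points, centroids, iters=10):
    # Minimal K-means: assign each vector to its nearest centroid, then
    # recompute centroids; each resulting cluster is a domain vocabulary.
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return clusters, centroids

# Toy word vectors: two semantic groups ("AI" words vs "image" words).
vecs = {"artificial": [0.9, 0.1], "intelligence": [1.0, 0.0],
        "neural": [0.8, 0.2], "image": [0.1, 0.9], "pixel": [0.0, 1.0]}
clusters, cents = kmeans(list(vecs.values()),
                         centroids=[[1.0, 0.0], [0.0, 1.0]])
```

In practice a library implementation (e.g. scikit-learn's KMeans) over bert-encoded vectors would replace this toy loop.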
Step 2, the diffusion model generates an abstract sentence. The main purpose of this step is to use the phrase input by the user to control the abstract diffusion model to generate sentences related to the phrase's direction. The first input is the keyword phrase input by the user, used as control information to steer the diffusion model toward generating sentence text in the keyword direction; the second input is random Gaussian noise, which the diffusion model continuously denoises so as to generate a fluent sentence.
The overall flow of the generation phase is as shown in fig. 4. First, the phrase text input by the user is taken; for example, if the user inputs "artificial intelligence image", the user probably wants to retrieve artificial-intelligence content involving image recognition or image processing methods. The phrase is segmented and stop words are removed, yielding the two keywords "artificial intelligence" and "image". Each keyword is looked up among the clusters to find the clusters containing it; as shown in fig. 4, cluster 1 contains "artificial intelligence" and cluster 2 contains "image". Clusters 1 and 2 are used as control signals of the diffusion model, controlling it to generate words from these two clusters. The control process trains a classifier whose classes are the cluster classes: the classifier predicts the class of the denoising result at each diffusion step, the deviation between this prediction and the user's keyword clusters forms a loss function, and back propagation modifies the current-step hidden variable of the abstract diffusion model by gradient update, so that the modified hidden variable is biased toward the directions of clusters 1 and 2. A single step of diffusion and back propagation cannot at once produce a fluent sentence highly related to clusters 1 and 2, so the process must be repeated for multiple steps, gradually migrating the hidden variables toward the cluster directions until a fluent sentence is generated. The number of steps is a hyperparameter; the invention sets it to 200 steps.
After 200 steps of diffusion generation and direction migration, the generated hidden variables are mapped to text with a softmax function, thereby obtaining the generated sentences. These sentences are the output of this step: the abstract diffusion model has successfully generated abstract-like sentences related to artificial intelligence and images. The specific implementation and formula logic are as follows.
The diffusion model generation stage starts from Gaussian noise z_T and gradually denoises it to generate the hidden variable z_0 of a fluent sentence; the rounding model p_θ(w | z_0) from step (3) of the diffusion model training process then remaps z_0 to a text sentence. This is the generation process of a general diffusion model. It can also be seen, however, that the whole process starts from Gaussian noise and generates a random, uncontrolled sentence, so a general diffusion model has no way to control the direction of the generated sentence. This embodiment therefore controls the diffusion model to generate sentences with a keyword direction, i.e., it controls the generation direction of the hidden variables in the diffusion model, that is, the values the hidden variables z_t take. In this embodiment, the control is realized through the domain vocabulary: the domain vocabulary is the clustering vocabulary generated in step 1, so the control process can be expressed by the probability formula p(z_{t-1} | z_t, c), where c represents the control condition, i.e., the keywords, and the formula represents the probability of generating the hidden variable z_{t-1} given the keywords. The hidden variable z_{t-1} of each diffusion step is generated from the previous step's z_t combined with the control condition (keywords); by the Bayesian formula, p(z_{t-1} | z_t, c) ∝ p(z_{t-1} | z_t) · p(c | z_{t-1}, z_t), which by the conditional-independence assumption can be simplified to p(z_{t-1} | z_t) · p(c | z_{t-1}). Taking the generation of z_{t-1} from z_t as an example: first, z_t is passed into the network trained by the diffusion model (usually a Transformer), which predicts z_{t-1}; the generated z_{t-1} is then input into a classifier, which predicts the probability p(c | z_{t-1}); the classification loss is back-propagated to update z_{t-1}, at which point z_{t-1} has shifted one step toward the target direction. The shifted z_{t-1} is input into the Transformer again to predict z_{t-2}, and so on by analogy, repeating until step T is reached and z_0 is obtained; z_0 is put through softmax for text prediction to obtain the corresponding text result. The text result obtained at this point is the text result after control toward the target direction.
Thus, for the t-th step of the diffusion process, the value of z_{t-1} can be updated by the following formula:

∇_{z_{t-1}} log p(z_{t-1} | z_t, c) = ∇_{z_{t-1}} log p(z_{t-1} | z_t) + ∇_{z_{t-1}} log p(c | z_{t-1}),

where ∇ log p(z_{t-1} | z_t) is obtained through the diffusion model, its main function being to generate fluent text, and ∇ log p(c | z_{t-1}) is obtained through a neural-network classifier, its main function being to generate text controlled toward the condition (keyword) direction. In addition, to generate more fluent text, a hyperparameter λ is added to balance the fluency of the text against the direction of the text, so this gradient update may become

∇_{z_{t-1}} log p(z_{t-1} | z_t, c) = λ ∇_{z_{t-1}} log p(z_{t-1} | z_t) + ∇_{z_{t-1}} log p(c | z_{t-1}).
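The balanced gradient update can be illustrated with Gaussian log-densities, whose gradients are analytic. The "fluency" and "condition" targets, the step size, the iteration count, and λ = 0.5 are all toy assumptions; in the real system these gradients come from the Transformer and the classifier.

```python
def grad_log_gaussian(z, mean):
    # For log N(z; mean, I), the gradient with respect to z is (mean - z).
    return [m - x for m, x in zip(mean, z)]

def guided_update(z, fluency_mean, cond_mean, lam=0.5, step=0.1, n_steps=200):
    # Gradient ascent on lam * log p(z | z_t) + log p(c | z):
    # lam balances text fluency against the keyword direction.
    for _ in range(n_steps):
        g_flu = grad_log_gaussian(z, fluency_mean)
        g_cond = grad_log_gaussian(z, cond_mean)
        z = [x + step * (lam * gf + gc)
             for x, gf, gc in zip(z, g_flu, g_cond)]
    return z

z0 = [0.0, 0.0]
z = guided_update(z0, fluency_mean=[1.0, 0.0], cond_mean=[0.0, 1.0])
```

With these toy targets the iterate settles near (1/3, 2/3), the λ-weighted compromise between the fluency direction and the condition direction, which is exactly the trade-off the hyperparameter λ expresses.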
As described hereinbefore, p(z_{t-1} | z_t) can be obtained from the diffusion model trained in the diffusion model training process; for p(c | z_{t-1}), a classifier is required to obtain the corresponding probability value. p(c | z_{t-1}) means: given the hidden variable z_{t-1}, the probability of judging that this hidden variable matches the control condition c. Usually a classifier must be trained to obtain this probability value, but since too many keywords may appear, it is difficult to take all keywords as labels and have a classifier output the probability that a hidden variable corresponds to a keyword. This embodiment therefore adopts an nbow model to calculate the probability value, taking the domain vocabulary obtained in step 1 as the bag of words. First, the field or fields to which the keyword belongs are found; then the words in those fields are taken as target words semantically similar to the keyword, so the probability value is the sum of the logarithms of the diffusion language model's probabilities for each word in the domain vocabulary:

log p(c | z_{t-1}) = Σ_{w ∈ V_c} log p(w | z_{t-1}),

where w ranges over the words in the domain vocabulary V_c, and p(w | z_{t-1}) is the probability of generating the word w during reconstruction. In this way, the probability value given by the classifier can be obtained from the current hidden variable z_{t-1}. The hidden variable of the diffusion model is gradient-updated through the probability values of the domain vocabulary and the diffusion model, so that the next z_{t-1} is a hidden vector closer to the control condition. The z_0 generated by the T-th diffusion step is the final hidden variable; the obtained z_0 is input into the rounding model of the diffusion model to obtain the corresponding sentence text, thereby generating the corresponding sentence. T is a hyperparameter that may be set to 200.
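The nbow probability, i.e., summing the log probabilities the diffusion language model assigns to each word in the domain vocabulary, can be sketched as follows. The toy vocabulary and the fixed logits stand in for the model's real per-word distribution, and the domain word lists are illustrative.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of scores.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

vocab = ["artificial", "image", "pixel", "automobile"]
logits = [2.0, 1.0, 0.5, -1.0]      # p(w | z_{t-1}) as logits (assumed)
logp = dict(zip(vocab, log_softmax(logits)))

def nbow_log_p(domain_words):
    # log p(c | z_{t-1}) = sum over the domain vocabulary of log p(w | z_{t-1})
    return sum(logp[w] for w in domain_words)

score_ai = nbow_log_p(["artificial"])
score_img = nbow_log_p(["image", "pixel"])
```

A higher score for a domain indicates the current hidden variable is more likely to reconstruct into that domain's words, which is the signal the gradient update follows.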
Step 3, the diffusion model generates claim and specification sentences: the claim diffusion model and the specification diffusion model repeat the generation of step 2, finally yielding abstract, claim, and specification sentences related to artificial intelligence and images, as shown in fig. 5.
3. Extended search and integration, as shown in FIG. 6:
Expanded retrieval: the abstract, claim, and specification sentences generated by the diffusion models are each sent into the retrieval system for retrieval and compared against the patent abstracts, claims, and specifications in the retrieval system's patent library. Taking the abstract sentence as an example, it is searched against the abstract part of the patents in the retrieval system, and the top N patents most similar to the input abstract sentence are returned. The retrieval system can calculate the similarity between the sentences generated by the diffusion models and the text vectors of the abstract, claims, and specification parts of each patent in a bm25-model or bert-model word-vector-representation manner, returning the topN patents with the highest similarity.
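The bm25 scoring of a generated sentence against each patent's abstract can be sketched as follows. The tiny tokenized corpus, the query, and the standard k1/b constants are illustrative; a production system would index the full patent library.

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Classic BM25: idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len/avg_len))
    N = len(docs)
    avg_len = sum(len(d) for d in docs) / N
    scores = []
    for d in docs:
        s = 0.0
        for term in query:
            tf = d.count(term)
            df = sum(1 for doc in docs if term in doc)
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(s)
    return scores

# Toy patent abstracts, already segmented into word lists.
docs = [["artificial", "intelligence", "image", "recognition"],
        ["chemical", "battery", "electrode"],
        ["image", "compression", "method"]]
query = ["artificial", "image"]     # tokens of a generated abstract sentence
scores = bm25_scores(query, docs)
top = max(range(len(docs)), key=lambda i: scores[i])
```

The document sharing the most query terms (weighted by rarity) ranks first, which is the behavior the retrieval step relies on.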
Integration: the similarities of the 3N patents obtained from the three searches are weighted, and the top K patents with the highest weighted similarity are selected as the expanded retrieval result. The weighting coefficients may be set as required; for example, the patent similarities returned for each part may be assigned equal weight, and the top K patent documents are then taken from the 3N patents ranked from high to low by weighted similarity.
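The integration step can be sketched as a weighted merge of the three result lists. The patent IDs, similarity values, and equal weights below are illustrative assumptions.

```python
def integrate(results_by_part, weights, k):
    # results_by_part: {part: [(patent_id, similarity), ...]} per search.
    # A patent returned by several parts accumulates its weighted scores.
    combined = {}
    for part, hits in results_by_part.items():
        w = weights[part]
        for pid, sim in hits:
            combined[pid] = combined.get(pid, 0.0) + w * sim
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

results = {
    "abstract":      [("CN1", 0.9), ("CN2", 0.7)],
    "claims":        [("CN1", 0.8), ("CN3", 0.6)],
    "specification": [("CN2", 0.5), ("CN3", 0.4)],
}
weights = {"abstract": 1 / 3, "claims": 1 / 3, "specification": 1 / 3}
top_k = integrate(results, weights, k=2)
```

A patent that scores well in several parts (here CN1, matched in both abstract and claims) naturally rises above patents matched in only one part.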
This embodiment performs diffusion generation on the short query keywords input by the user, generating longer and more diversified sentences that correspond to the abstract, claim, and specification parts of a patent, so the retrieval system can retrieve more accurately by comparing similarity against all three parts of each patent. The patent retrieval system thereby obtains more accurate information about the user's intent and retrieves the content the user wants. Combined with iterative supplementation from a user-interaction mechanism to generate retrieval text similar to patent abstracts, and with a multi-stage text-similarity matching and ranking algorithm, patent retrieval is realized; the lack of fine-grained retrieval in the prior art is overcome, the accuracy of patent retrieval is improved, and the purposes of freeing manpower, reducing cost, and improving efficiency are achieved.
This embodiment can be realized by software; the product form can be a computer device loaded with the corresponding software, or a computer-readable storage medium. For example:
A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the above diffusion-model-based patent document query method.
A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the above-mentioned diffusion model-based patent document query method.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A patent document query method based on a diffusion model is characterized by comprising the following steps:
receiving text content input by a user;
if the text content input by the user through retrieval exceeds a preset length threshold, performing word segmentation on the text content, and then respectively sending word segmentation results into three diffusion models for diffusion generation, wherein clusters of all keywords in the word segmentation results are jointly used as control signals of the diffusion models to limit the generation direction of the diffusion models; the three diffusion models are respectively marked as a first diffusion model, a second diffusion model and a third diffusion model, and the training corpora of the three diffusion models are respectively derived from the abstract, the claims and the specification and are used for correspondingly generating sentences similar in expression form to abstract, claim and specification sentences;
the sentences generated by the three diffusion models are sent into a retrieval system, and the abstract, the claim and the specification are respectively and correspondingly taken as retrieval ranges to retrieve patent documents, so that three groups of patent documents are obtained;
and performing weighted integration on the three groups of patent documents, and selecting a plurality of weighted patent documents with the highest similarity as the intention retrieval result of the user and outputting the intention retrieval result.
2. The diffusion model-based patent document query method according to claim 1, further comprising:
if the text content input by the user search does not exceed the preset length threshold, the text content input by the user search is directly sent to a search system, and patent documents are searched respectively by taking the abstract, the claim and the specification as search ranges to obtain three groups of patent documents.
3. The diffusion model-based patent document query method of claim 2, wherein the three groups of patent documents have the same number of sections.
4. The diffusion model-based patent document query method according to claim 1, wherein the training methods of the three diffusion models each comprise the steps of:
gradually adding noise into the training corpus, continuously destroying corpus information, and storing the corpus information at each step of the destroying process until the original corpus information is destroyed into completely random Gaussian noise, the process being recorded as a noise adding process; and then denoising the completely random Gaussian noise, using the damaged corpus information stored in the noise adding process as tag data, and continuously denoising with a generative model to finally obtain the original corpus information, so that the generative model learns the capacity of generating the corresponding corpus through the denoising process.
5. The diffusion model-based patent document query method according to claim 4, wherein the generative model is a Transformer model or a GPT model.
6. The method according to claim 4, wherein the method of generating the training corpus comprises:
extracting sentences from the abstract, the claim and the specification of the published patent document respectively, and recording the sentences as a first sentence, a second sentence and a third sentence;
and performing word segmentation on the first sentence, the second sentence and the third sentence by adopting a text word segmentation device respectively, wherein the corresponding word segmentation result is a training corpus used for the first diffusion model, the second diffusion model and the third diffusion model.
7. The diffusion model-based patent document query method according to claim 4, wherein in the three diffusion models, each diffusion model performs a diffusion generation process, and specifically comprises:
performing word segmentation and word deactivation on text contents input by a user to obtain a plurality of keywords;
respectively searching a domain word list containing each keyword; the field word list is generated in advance based on a clustering algorithm;
and taking other words in the domain word list to which each keyword belongs as target words with similar semantics to the keywords, training a classifier corresponding to the class of the domain word list to obtain the probability of the diffusion model for each target word in the domain word list, further performing gradient updating on the hidden variables of the diffusion model, repeating multi-step diffusion, and mapping the hidden variables generated finally to texts through a softmax function to obtain sentences in the control direction of the keywords.
8. The diffusion model-based patent document query method according to claim 4,
the retrieval system adopts a bm25 model or a bert model word vector representation mode to calculate the similarity between sentences generated by the first diffusion model and abstract text vectors of patent documents, calculate the similarity between sentences generated by the second diffusion model and claim text vectors of the patent documents, calculate the similarity between sentences generated by the third diffusion model and specification text vectors of the patent documents, and return N patent documents with the highest similarity.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of the diffusion model based patent document query method according to any one of claims 1 to 8.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the diffusion model-based patent document query method according to any one of claims 1 to 8.
CN202310048755.8A 2023-02-01 2023-02-01 Patent document query method based on diffusion model and computer equipment Active CN115794999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310048755.8A CN115794999B (en) 2023-02-01 2023-02-01 Patent document query method based on diffusion model and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310048755.8A CN115794999B (en) 2023-02-01 2023-02-01 Patent document query method based on diffusion model and computer equipment

Publications (2)

Publication Number Publication Date
CN115794999A true CN115794999A (en) 2023-03-14
CN115794999B CN115794999B (en) 2023-04-11

Family

ID=85429384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310048755.8A Active CN115794999B (en) 2023-02-01 2023-02-01 Patent document query method based on diffusion model and computer equipment

Country Status (1)

Country Link
CN (1) CN115794999B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115951883A (en) * 2023-03-15 2023-04-11 日照市德衡信息技术有限公司 Service component management system and method of distributed micro-service architecture
CN116431838A (en) * 2023-06-15 2023-07-14 北京墨丘科技有限公司 Document retrieval method, device, system and storage medium
CN116501899A (en) * 2023-06-30 2023-07-28 粤港澳大湾区数字经济研究院(福田) Event skeleton diagram generation method, system, terminal and medium based on diffusion model
CN117131187A (en) * 2023-10-26 2023-11-28 中国科学技术大学 Dialogue abstracting method based on noise binding diffusion model
CN117251539A (en) * 2023-08-11 2023-12-19 北京中知智慧科技有限公司 Patent intelligent retrieval system using generative artificial intelligence
CN117421393A (en) * 2023-12-18 2024-01-19 知呱呱(天津)大数据技术有限公司 Generating type retrieval method and system for patent

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765779A (en) * 2015-03-20 2015-07-08 浙江大学 Patent document inquiry extension method based on YAGO2s
CN106156111A (en) * 2015-04-03 2016-11-23 北京中知智慧科技有限公司 Patent document search method, device and system
CN107609142A (en) * 2017-09-21 2018-01-19 合肥集知网知识产权运营有限公司 A kind of big data patent retrieval method based on Extended Boolean Retrieval model
CN112036177A (en) * 2020-07-28 2020-12-04 中译语通科技股份有限公司 Text semantic similarity information processing method and system based on multi-model fusion
CN112507109A (en) * 2020-12-11 2021-03-16 重庆知识产权大数据研究院有限公司 Retrieval method and device based on semantic analysis and keyword recognition
WO2022134759A1 (en) * 2020-12-21 2022-06-30 深圳壹账通智能科技有限公司 Keyword generation method and apparatus, and electronic device and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG Shuai; WANG Feng; LIN Lanfen; ZHU Xiaowei; XIE Fei: "Patent document retrieval method based on automatic query expansion"

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115951883A (en) * 2023-03-15 2023-04-11 日照市德衡信息技术有限公司 Service component management system and method of distributed micro-service architecture
CN116431838A (en) * 2023-06-15 2023-07-14 北京墨丘科技有限公司 Document retrieval method, device, system and storage medium
CN116431838B (en) * 2023-06-15 2024-01-30 北京墨丘科技有限公司 Document retrieval method, device, system and storage medium
CN116501899A (en) * 2023-06-30 2023-07-28 粤港澳大湾区数字经济研究院(福田) Event skeleton diagram generation method, system, terminal and medium based on diffusion model
CN117251539A (en) * 2023-08-11 2023-12-19 北京中知智慧科技有限公司 Patent intelligent retrieval system using generative artificial intelligence
CN117251539B (en) * 2023-08-11 2024-04-02 北京中知智慧科技有限公司 Patent intelligent retrieval system using generative artificial intelligence
CN117131187A (en) * 2023-10-26 2023-11-28 中国科学技术大学 Dialogue abstracting method based on noise binding diffusion model
CN117131187B (en) * 2023-10-26 2024-02-09 中国科学技术大学 Dialogue abstracting method based on noise binding diffusion model
CN117421393A (en) * 2023-12-18 2024-01-19 知呱呱(天津)大数据技术有限公司 Generating type retrieval method and system for patent
CN117421393B (en) * 2023-12-18 2024-04-09 知呱呱(天津)大数据技术有限公司 Generating type retrieval method and system for patent

Also Published As

Publication number Publication date
CN115794999B (en) 2023-04-11

Similar Documents

Publication Publication Date Title
CN108984724B (en) Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation
CN115794999B (en) Patent document query method based on diffusion model and computer equipment
CN109753566B (en) Model training method for cross-domain emotion analysis based on convolutional neural network
Xiang et al. A convolutional neural network-based linguistic steganalysis for synonym substitution steganography
CN111046179B (en) Text classification method for open network question in specific field
CN111783462A (en) Chinese named entity recognition model and method based on dual neural network fusion
CN110263325B (en) Chinese word segmentation system
CN110619034A (en) Text keyword generation method based on Transformer model
Shalaby et al. An lstm approach to patent classification based on fixed hierarchy vectors
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN111984791B (en) Attention mechanism-based long text classification method
CN110807324A (en) Video entity identification method based on IDCNN-crf and knowledge graph
CN114428850B (en) Text retrieval matching method and system
CN110688834A (en) Method and equipment for rewriting intelligent manuscript style based on deep learning model
CN113609284A (en) Method and device for automatically generating text abstract fused with multivariate semantics
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism
Tao et al. News text classification based on an improved convolutional neural network
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN113962228A (en) Long document retrieval method based on semantic fusion of memory network
CN112925907A (en) Microblog comment viewpoint object classification method based on event graph convolutional neural network
CN116956228A (en) Text mining method for technical transaction platform
CN116955616A (en) Text classification method and electronic equipment
Duan et al. Bayesian deep embedding topic meta-learner
CN115577111A (en) Text classification method based on self-attention mechanism
CN113255344B (en) Keyword generation method integrating theme information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee after: Beijing Zhiguagua Technology Co.,Ltd.

Patentee after: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

Address before: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee before: Beijing Zhiguquan Technology Service Co.,Ltd.

Patentee before: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

CP01 Change in the name or title of a patent holder
CP03 Change of name, title or address

Address after: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee after: Beijing Xinghe Zhiyuan Technology Co.,Ltd.

Country or region after: China

Patentee after: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

Address before: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee before: Beijing Zhiguagua Technology Co.,Ltd.

Country or region before: China

Patentee before: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

CP03 Change of name, title or address
TR01 Transfer of patent right

Effective date of registration: 20240514

Address after: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee after: Beijing Xinghe Zhiyuan Technology Co.,Ltd.

Country or region after: China

Address before: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee before: Beijing Xinghe Zhiyuan Technology Co.,Ltd.

Country or region before: China

Patentee before: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

TR01 Transfer of patent right