Disclosure of Invention
The generative search method and system for patents provided by this application perform semantic identification coding of text content through an auto-encoder, achieve self-supervised model training without labeled data, and perform unified fusion coding by means of the IPC multi-level classification system. They overcome the rigid word-and-grammar matching of traditional TF-IDF/BM25 and the computational-efficiency problem of dense vector retrieval based on language models, generalize well, improve retrieval accuracy for newly published and newly filed patents, and raise both the recall and precision of patent retrieval.
In a first aspect, a generative search method for patents is provided, the method comprising S1 model training, S2 patent encoding, S3 query encoding, and S4 search matching;
S1, model training: construct a training data set based on the patent texts and IPC multi-level classification data in a patent library; train, on this data set, a model built on a pre-trained language model combined with a codebook data structure, using a combined loss function of reconstruction loss, IPC loss, and commitment loss, until convergence, to obtain a patent text semantic identification coding model;
S2, patent encoding: perform semantic identification coding on all patents in the patent library using the patent text semantic identification coding model trained to convergence, and store the generated identification sequence data in an index database;
S3, query encoding: encode the query text entered by a user using the patent text semantic identification coding model trained to convergence, generating a semantic identification sequence;
S4, search matching: search the patent code index library for patents matching the query using a tree-based longest-match algorithm, and return the Top-K results with the highest similarity after sorting by similarity.
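The four steps above can be sketched end to end in a few lines. This is a minimal illustration only: the toy encode() function and all names are hypothetical stand-ins, not the trained T5-based coding model G described below, and prefix ranking here is a linear stand-in for the tree-based longest-match algorithm.

```python
# Minimal sketch of the four stages (S1-S4); encode() is hypothetical.

def encode(text):
    # Stand-in for model G: map a text to a short identifier sequence.
    return [sum(map(ord, tok)) % 16 for tok in text.lower().split()[:4]]

def build_index(patents):
    # S2: encode every patent in the library and store the codes.
    return {pid: encode(txt) for pid, txt in patents.items()}

def search(query, index, top_k=2):
    # S3 + S4: encode the query, rank patents by longest shared prefix.
    q = encode(query)
    def prefix_len(code):
        n = 0
        for a, b in zip(q, code):
            if a != b:
                break
            n += 1
        return n
    ranked = sorted(index, key=lambda pid: prefix_len(index[pid]), reverse=True)
    return ranked[:top_k]

patents = {"P1": "battery electrode coating", "P2": "battery electrode welding",
           "P3": "image retrieval network"}
index = build_index(patents)
print(search("battery electrode coating method", index))  # → ['P1', 'P2']
```

The query shares a three-symbol prefix with P1 and a two-symbol prefix with P2, so P1 ranks first.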
Optionally, the S1 model training specifically includes:
S11, training data set construction: extract patent document texts and IPC classification data from the patent library; first clean the data by segmenting the patent title, abstract, claims, description, and drawings and filtering out paragraph labels and drawing-description data; then combine the remaining text fields in order and associate the combined text with the IPC classification of the patent to form the training data;
S12, model design: the model specifically comprises a patent text semantic identification coding model and a text reconstruction model. The text reconstruction model assists in training the patent text semantic identification coding model, so that the semantic identification sequence generated by the coding model can characterize the semantic information of the patent text. The patent text semantic identification coding model comprises an encoding layer, a decoding layer, and a codebook; the encoding and decoding layers come from a model architecture that takes a Transformer-based pre-trained language model as its base model, and the codebook is a codebook data structure designed for the characteristics of patent text. Specifically, the encoding layer of the Transformer-based pre-trained language model T5 is selected as the encoding layer, the decoding layer of T5 is selected as the decoding layer, and the codebook is initialized using patent data. The text reconstruction model likewise takes T5 as its base model, and its architecture comprises an encoding layer and a decoding layer. The patent text semantic identification coding model is denoted by G, its encoding layer by E_1, its decoding layer by D_1, the codebook by E_t, and the text reconstruction model by R;
S13, training: train the model architecture on the data set prepared in the training data set construction step; this comprises initializing the codebook, designing the semantic identification sequence, training the model, and optimizing the combined loss function, continuing until the model converges, after which the trained model parameters are saved and output.
Optionally, the training performed in S13 specifically includes:
initializing the codebook: perform text clustering over all patent texts in the data set using the K-Means algorithm to obtain K categories, combine them with the L categories formed by the sections, classes, and subclasses of the patent IPC classification, and finally construct the codebook index structure E_t;
semantic identification sequence design: design the structure of the semantic identification sequence using the sections, classes, and subclasses of the patent IPC classification; specifically, the first three positions of the sequence represent the section, class, and subclass respectively, while the identifiers at the remaining positions represent the semantic information of the patent text.
Optionally, the training performed in S13 further includes:
model training: train using the prepared training data, which consists of data pairs (d, d_ipc), where d denotes the patent text and d_ipc the three-level IPC classification of the patent. During training the overall input is (d, d_ipc); the patent text d is fed into the patent text semantic identification coding model G. In the data flow of each time step t, d first passes through the encoding layer E_1 and decoding layer D_1 of G, producing the output d_t:

d_t = D_1(E_1(d), z_{<t})

where E_1 denotes the encoding layer of G, D_1 its decoding layer, d_t the output after passing through E_1 and D_1 at the current time step t, and z_{<t} the outputs of G before time step t. The d_t generated at each time step is fed into the codebook E_t of G, producing z_t:

z_t = E_t(d_t)
Training on a single datum runs through T time steps in total, over which the model G outputs, for the received input patent text d, a semantic identification sequence Z:

Z = {z_1, z_2, ..., z_T}
the model G as a whole thus generates the semantic identification sequence Z over T time steps; Z is then processed to produce Z_{4→T}, the fourth through T-th positions of the sequence. Feeding Z_{4→T} into the text reconstruction model R yields:

d̂ = R(Z_{4→T})
the reconstruction model R reconstructs the patent text from the semantic identification sequence Z_{4→T}; its prediction is d̂. The reconstruction decoding is expected to produce a result ever closer to the original input patent text d, so that d̂ gradually approaches d during learning;
optimizing the combined loss function: compute the losses jointly and update the model as a whole. Three losses are designed, and the model is optimized with a combined loss function of reconstruction loss, IPC loss, and commitment loss. The overall loss serves as the optimization objective; it is minimized by stochastic gradient descent, and the model parameters are optimized through back-propagation until the model converges, yielding the desired training result.
Optionally, the S2 patent encoding specifically includes:
S21, patent library document semantic identification coding: using the patent text semantic identification coding model G, take the set of patent texts in DOC as input, traverse each patent text in DOC, feed it to G to generate a semantic identification sequence, and finally output the semantic identification sequences of all patent texts. Specifically, each patent text D_i extracted from DOC is fed into G, which outputs the corresponding semantic identification sequence Z_i, Z_i = G(D_i), until all patent texts have been traversed and their semantic identification sequences generated;
S22, semantic identification index library construction: build a corresponding index structure for the semantic identification sequences of all patents generated by the patent library document semantic identification coding, and store it in the index database.
Optionally, the S3 query encoding specifically includes:
generating the semantic identification code of the query text: specifically, apply the patent text semantic identification coding model G to the query text Q entered by the user to generate a query semantic identification sequence.
Optionally, the S4 search matching specifically includes:
searching the patent code index library for patents matching the query semantic identification sequence using the tree-based longest-match algorithm, with the distance between the input query semantic identification sequence and the candidate sequences in the index library as the measure, and returning the Top-K candidate patent results closest to the input sequence. Specifically, the query semantic identification sequence Z_Q generated in the query encoding stage is matched by the tree-based longest-match algorithm against the semantic identification sequences of all patents in the patent code index library, and the Top-K results with the highest similarity are returned after sorting by similarity, where Top-K is a preset value.
In a second aspect, a generative search system for patents is provided, the system comprising a model training module, a patent encoding module, a query encoding module, and a search matching module, wherein:
the model training module is used to construct a training data set based on the patent texts and IPC multi-level classification data in the patent library, and to train, on this data set, a model built on a pre-trained language model combined with a codebook data structure, using a combined loss function of reconstruction loss, IPC loss, and commitment loss, until convergence, obtaining the patent text semantic identification coding model;
the patent encoding module is used to perform semantic identification coding on all patents in the patent library using the patent text semantic identification coding model trained to convergence, and to store the generated identification sequence data in the index database;
the query encoding module is used to encode the query text entered by the user using the patent text semantic identification coding model trained to convergence, generating a semantic identification sequence;
and the search matching module is used to search the patent code index library for patents matching the query using the tree-based longest-match algorithm, and to return the Top-K results with the highest similarity after sorting by similarity.
In a third aspect, a computer device is provided, comprising a memory storing a computer program and a processor that, when executing the computer program, implements the generative search method for patents of any of the first aspects described above.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the generative search method for patents of any of the first aspects described above.
The invention provides a generative search method and system for patents. Based on a pre-trained language model, a loss function is designed for the characteristics of patents; semantic identification coding of text content through an auto-encoder achieves self-supervised model training without labeled data, and unified fusion coding is performed by means of the IPC multi-level classification system. This overcomes the rigid word-and-grammar matching of traditional TF-IDF/BM25 and the computational-efficiency problem of dense vector retrieval based on language models, retains the strong generalization of generative models, and achieves accurate retrieval of newly published or newly filed patents. Through a pre-coding mechanism combined with a large-scale index database, the invention effectively reduces the computation of the search service, lowers service latency, combines the efficiency of traditional search methods with the semantic-understanding advantage of deep models, and improves the recall and precision of large-scale patent search.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In the description of the present application, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements, but may include other steps or elements not expressly listed, inherent to such process, method, article, or apparatus, or added in further optimizations based on the inventive concept.
The invention provides a generative search method and system for patents. A generative patent coding model is trained for patent characteristics on the basis of a pre-trained language model, performing unified fusion coding that combines the characteristics of patent text with the IPC multi-level classification system, and the model is trained to convergence. All documents in the patent library are then encoded with the trained generative patent coding model; in the patent retrieval stage, the text to be queried is encoded with the same model, search matching is performed in the patent code library based on the query code and the patent codes, and the Top-K results are returned after sorting. The overall flow divides into model training, patent encoding, query encoding, and search matching.
Model training: includes training data set construction, model design, and training. The training data set is constructed from the patent texts and IPC multi-level classification data in the patent library; the model design is based on a pre-trained language model combined with a codebook data structure; the model is trained with a combined loss function of reconstruction loss, IPC loss, and commitment loss until convergence; the final model is the patent text semantic identification coding model.
Patent code: the method comprises the steps of patent library document semantic identification coding and semantic identification index table construction. And carrying out semantic identification coding on all patents in the patent library by utilizing the patent text semantic identification coding model trained to be converged, and storing the generated identification sequence coding data into an index database.
Query encoding: and coding the query text input by the user by utilizing the patent text semantic identification coding model trained to be converged to generate a semantic identification sequence.
Searching and matching: and searching patents matched with the query in a patent code index library by using a longest matching algorithm based on the tree, and returning a Top-K result with highest similarity after sorting according to the similarity.
As shown in fig. 1, a generative search method for patents is provided; the method can be applied to a server and comprises S1 model training, S2 patent encoding, S3 query encoding, and S4 search matching;
the method specifically comprises the following steps:
The S1 model training comprises S11 training data set construction, S12 model design, and S13 training;
S11, training data set construction: extract patent document texts and IPC classification data from the patent library; first perform data cleaning by segmenting the patent title, abstract, claims, description, and drawings and filtering out paragraph labels, drawing descriptions, and similar data; then combine the remaining text fields in order and associate the combined text with the IPC classification of the patent. The resulting training data, comprising the cleaned patent text content and its corresponding IPC classification, form the training data set for model training;
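The S11 construction step can be sketched as follows. The field names and cleaning rules (e.g. dropping paragraph labels such as "[0012]") are illustrative assumptions, not the exact pipeline of the application:

```python
# Sketch of S11-style training-pair construction; field names are assumed.
import re

def build_training_pair(doc):
    def clean(s):
        # Drop paragraph labels such as "[0012]" and collapse whitespace.
        s = re.sub(r"\[\d{4}\]", " ", s)
        return re.sub(r"\s+", " ", s).strip()
    # Combine title, abstract, claims and description in order.
    text = " ".join(clean(doc[f]) for f in ("title", "abstract", "claims", "description"))
    # Associate the combined text with the IPC section/class/subclass.
    return (text, doc["ipc"])

doc = {"title": "Battery electrode", "abstract": "[0001] A coated electrode.",
       "claims": "1. An electrode ...", "description": "[0012] The coating ...",
       "ipc": ("H", "H01", "H01M")}
d, d_ipc = build_training_pair(doc)
```

Each resulting pair (d, d_ipc) then serves as one training datum.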
The S12 model design: the model consists of a patent text semantic identification coding model and a text reconstruction model. The text reconstruction model assists in training the patent text semantic identification coding model so that richer semantic information of the patent text is extracted, and ultimately the semantic identification sequence generated by the coding model can characterize the semantic information of the patent text. The patent text semantic identification coding model comprises an encoding layer, a decoding layer, and a codebook. The encoding layer (Encoder) and decoding layer (Decoder) come from a model architecture taking a Transformer-based pre-trained language model as its base model, and the codebook is a codebook data structure designed for the characteristics of patent text. The encoding layer of the Transformer-based pre-trained language model T5 is selected as the encoding layer, the decoding layer of T5 is selected as the decoding layer, and the codebook is initialized using patent data. The text reconstruction model likewise takes T5 as its base model; its architecture comprises an encoding layer and a decoding layer. The patent text semantic identification coding model is denoted by G, its encoding layer by E_1, its decoding layer by D_1, the codebook by E_t, and the text reconstruction model by R;
S13, training: train the model architecture on the data set prepared during training data set construction; this comprises initializing the codebook, designing the semantic identification sequence, training the model, and optimizing the combined loss function. Training continues until the model converges, and the trained model parameters are saved and output.
Initializing the codebook: for all patent texts in the data set, perform text clustering with the K-Means algorithm into K categories, combine them with the L categories formed by the sections, classes, and subclasses of the patent IPC classification, and finally construct the codebook index structure E_t, E_t ∈ R^{M×D}, where M = K + L. In this application K takes the value 2000, L takes the value 786, and D is the model's hidden-layer dimension, 768.
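The codebook-initialization idea can be illustrated at toy scale: K cluster centroids (here from a tiny hand-rolled K-Means on 2-D points standing in for text embeddings) are stacked with L IPC-category vectors to form a codebook of M = K + L rows. The dimensions below are toy values, not the K=2000, L=786, D=768 of the application, and the IPC rows are illustrative stand-ins:

```python
# Toy codebook initialization: K centroids + L IPC rows = M x D table.

def kmeans(points, k, iters=10):
    centroids = points[:k]                      # simple initialization
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared distance).
            j = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            groups[j].append(p)
        # Recompute centroids; keep the old one if a group is empty.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids

points = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
K, L, D = 2, 3, 2
ipc_rows = [[float(i), 0.0] for i in range(L)]   # stand-ins for IPC entries
codebook = kmeans(points, K) + ipc_rows          # E_t, with M = K + L rows
```

At production scale the clustering would run over patent-text embeddings rather than raw 2-D points.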
The semantic identification sequence design combines a characteristic of patent text, namely its IPC classification: the structure of the semantic identification sequence is designed using the sections, classes, and subclasses of the IPC classification, with the first three positions representing the section, class, and subclass, while the identifiers at the remaining positions represent the semantic information of the patent text.
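The sequence layout can be sketched directly. The identifier values and length are illustrative; in the application the sequence length is fixed by the number of time steps T:

```python
# Sketch of the sequence layout: positions 1-3 carry the IPC section,
# class and subclass; the remaining positions carry content identifiers.

def make_sequence(ipc, content_ids, length=8):
    section, ipc_class, subclass = ipc           # e.g. ("H", "H01", "H01M")
    seq = [section, ipc_class, subclass] + list(content_ids)
    return seq[:length]

Z = make_sequence(("H", "H01", "H01M"), [17, 3, 42, 9, 28])
```

Slicing off the first three positions of Z yields the purely semantic part used for reconstruction.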
As shown in fig. 2, a diagram of the model training architecture in the present application is given. Model training uses the prepared training data, which consist of data pairs (d, d_ipc), where d denotes the patent text and d_ipc the three-level IPC classification of the patent. During training the overall input is (d, d_ipc); the patent text d is fed into the patent text semantic identification coding model G. In the data flow of each time step t, d first passes through the encoding layer E_1 and decoding layer D_1 of G, producing the output d_t:

d_t = D_1(E_1(d), z_{<t})

where E_1 denotes the encoding layer of G, D_1 its decoding layer, d_t the output after passing through E_1 and D_1 at the current time step t, and z_{<t} the outputs of G before time step t. The d_t generated at each time step is fed into the codebook E_t of G, producing z_t:

z_t = E_t(d_t)
Training on a single datum runs through T time steps in total; in the invention T is set to 64, i.e., the length of the semantic identification sequence is also 64. For the received input patent text d, the model G outputs the semantic identification sequence Z:

Z = {z_1, z_2, ..., z_T}
The model G as a whole thus generates the semantic identification sequence Z over T time steps; Z is then processed to produce Z_{4→T}, the fourth through T-th positions of the sequence, i.e., the semantic identification sequence without the first three positions that represent the IPC classes.
Feeding Z_{4→T} into the text reconstruction model R yields:

d̂ = R(Z_{4→T})
The reconstruction model R reconstructs the patent text from Z_{4→T}; its prediction is d̂. The reconstruction decoding is expected to produce a result ever closer to the original input patent text d, so that d̂ gradually approaches d during learning, and the semantic identification sequence generated by the model G ultimately carries richer, finer, and more accurate semantic information.
The combined loss function optimization integrates the losses designed for the overall task and updates the model as a whole. The invention designs three losses and optimizes the model with a combined loss function of reconstruction loss (Reconstruction Loss), IPC loss, and commitment loss (Commitment Loss):

Loss = L_Rec + α·L_Com + β·L_IPC

where in the invention α takes the value 0.95 and β the value 0.8. Training takes the overall loss as the optimization objective, minimizes it by stochastic gradient descent, and optimizes the model parameters through back-propagation until the model converges, yielding the desired result: the finally trained patent text semantic identification coding model G can generate a semantic identification sequence Z that fully represents the semantic information of the patent text.
The reconstruction loss L_Rec is expressed as:

L_Rec = CE(d, d̂) = CE(d, R(Z_{4→T}))

where CE is the cross entropy (Cross Entropy). Its design purpose is that the semantic identification sequence generated by the model G contain richer semantic information.
The IPC loss L_IPC is expressed as:

L_IPC = CE(d_ipc, Z_{1,2,3})

where Z_{1,2,3} denotes the first three positions of the semantic identification sequence, representing the IPC section, class, and subclass, and CE is again the cross entropy. Its design purpose is to make the learning of the IPC section, class, and subclass in the semantic identification sequence generated by the model G more accurate.
The commitment loss L_Com is expressed as:

L_Com = Σ_t ||d_t − sg(z_t)||²

where sg(·) denotes the stop-gradient operation. Its design purpose is to keep the model G from forgetting and to improve generation quality when generating the semantic identification sequence, so that the model updates and converges better.
The S2 patent encoding comprises S21 patent library document semantic identification coding and S22 semantic identification index library construction.
S21, patent library document semantic identification coding applies the patent text semantic identification coding model G to generate a semantic identification sequence for each patent text in the patent library. The set of patent texts in DOC is input; each patent text in DOC is traversed and fed to G to generate a semantic identification sequence, and finally the semantic identification sequences of all patent texts are output. Specifically, each patent text D_i extracted from DOC is fed into G, which outputs the corresponding semantic identification sequence Z_i, Z_i = G(D_i), until all patent texts have been traversed and their semantic identification sequences generated. Here DOC is the text set of all patent documents in the patent library, DOC = {D_1, ..., D_n}, n denotes the number of patents in the DOC library, D_i is the element of DOC representing a single patent text, and Ẑ = {Z_1, ..., Z_n} denotes the set of semantic identification sequences finally generated for all patent texts.
S22, semantic identification index library construction builds a corresponding index structure for the semantic identification sequences Ẑ of all patents generated in S21 and stores it in the index database; that is, an index relation is built between each semantic identification sequence Z_i and its patent text data, finally generating the patent code index library.
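The S22 index relation can be sketched with a plain mapping from code sequence to patent ids; a production store would live in a database, and the code values here are illustrative:

```python
# Sketch of S22: index each patent's identification sequence to its id.

def build_code_index(codes):
    index = {}
    for pid, seq in codes.items():
        # Hashable tuple key; identical code sequences share one slot.
        index.setdefault(tuple(seq), []).append(pid)
    return index

codes = {"P1": [4, 7, 2, 9], "P2": [4, 7, 2, 9], "P3": [4, 1, 8, 3]}
code_index = build_code_index(codes)
```

Looking up a full code sequence then returns every patent that encodes to it.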
A patent coding flow chart provided by an embodiment of the present application is shown in fig. 3.
S3, query encoding generates the semantic identification code of the query text. Specifically, the patent text semantic identification coding model G is applied to the query text Q entered by the user to generate a query semantic identification sequence: Q is input to G, and G outputs the query semantic identification sequence Z_Q, Z_Q = G(Q). A query encoding flow chart provided by an embodiment of the present application is shown in fig. 4.
S4, search matching searches the patent code index library for patents matching the query semantic identification sequence using the tree-based longest-match algorithm, with the distance between the input query semantic identification sequence and the candidate sequences in the index library as the measure, returning the Top-K candidate patent results closest to the input sequence. Specifically, the query semantic identification sequence Z_Q generated in the query encoding stage is matched by the tree-based longest-match algorithm against the semantic identification sequences of all patents in the patent code index library, and the Top-K results with the highest similarity are returned after sorting by similarity. Top-K is a preset value; in the present application Top-K takes the value 50. A search matching flowchart provided in an embodiment of the present application is shown in fig. 5.
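The tree-based longest-match step can be sketched with a trie over the patent code sequences: the query sequence Z_Q is walked down the trie as far as its symbols match, and the patents stored under the reached node are returned as candidates. The code values, the candidate ordering, and the use of shared-prefix length as similarity are illustrative assumptions:

```python
# Sketch of S4: trie storage of patent codes + longest-prefix search.

def build_trie(codes):
    root = {}
    for pid, seq in codes.items():
        node = root
        for sym in seq:
            node = node.setdefault(sym, {})
        node.setdefault("$", []).append(pid)   # terminal marker: patent ids
    return root

def collect(node, out):
    # Gather every patent id stored under this subtree.
    for key, child in node.items():
        if key == "$":
            out.extend(child)
        else:
            collect(child, out)
    return out

def longest_match(trie, query, top_k=2):
    node = trie
    for sym in query:                          # follow the longest prefix
        if sym not in node:
            break
        node = node[sym]
    return collect(node, [])[:top_k]           # candidates under that node

codes = {"P1": [4, 7, 2, 9], "P2": [4, 7, 5, 1], "P3": [3, 3, 3, 3]}
trie = build_trie(codes)
print(longest_match(trie, [4, 7, 2, 8]))  # → ['P1']
```

The query [4, 7, 2, 8] follows the trie down the prefix [4, 7, 2], and the only patent below that node is P1, so it is returned first.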
In summary, the invention provides a generative search method and system for patents that achieve self-supervised model training without labeled data, perform unified fusion coding by means of the IPC multi-level classification system, construct semantic identification code indexes for the full patent library through an auto-encoder-based generative model, effectively fuse the efficiency and accuracy of traditional word-vector retrieval and deep-model semantic retrieval, and realize large-scale semantic retrieval based on the similarity of query code sequences and patent code sequences.
In one embodiment, as shown in fig. 6, the present application provides a generative search system for patents, the system comprising a model training module, a patent coding module, a query coding module, and a search matching module, wherein:
the model training module is used to construct a training data set based on the patent texts and IPC multi-level classification data in the patent library, and to train, on this data set, a model built on a pre-trained language model combined with a codebook data structure, using a combined loss function of reconstruction loss, IPC loss, and commitment loss, until convergence, obtaining the patent text semantic identification coding model;
the patent encoding module is used to perform semantic identification coding on all patents in the patent library using the patent text semantic identification coding model trained to convergence, and to store the generated identification sequence data in the index database;
the query encoding module is used to encode the query text entered by the user using the patent text semantic identification coding model trained to convergence, generating a semantic identification sequence;
and the search matching module is used to search the patent code index library for patents matching the query using the tree-based longest-match algorithm, and to return the Top-K results with the highest similarity after sorting by similarity.
For the specific implementation of each module, reference may be made to the limitations of the generative search method for patents above, which are not repeated here.
In one embodiment, a computer device is provided, which may be a server, whose internal structure may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor provides computing and control capability, the network interface communicates with external terminals over a network connection, and the computer device implements the generative search method for patents by loading and running the computer program.
It will be appreciated by those skilled in the art that the structure shown in fig. 7 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer-readable storage medium is also provided, on which a computer program is stored which, when executed by a processor, implements all or part of the flow of the method in the above embodiments.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination is described; nevertheless, any combination of these technical features that contains no contradiction should be considered within the scope of this description.