CN117421393B - Generating type retrieval method and system for patent - Google Patents

Generative retrieval method and system for patents

Info

Publication number
CN117421393B
CN117421393B (application CN202311732921.2A)
Authority
CN
China
Prior art keywords
model
coding
text
semantic identification
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311732921.2A
Other languages
Chinese (zh)
Other versions
CN117421393A (en)
Inventor
谢鑫
徐青伟
范娥媚
裴非
严长春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xinghe Zhiyuan Technology Co.,Ltd.
Original Assignee
Zhiguagua Tianjin Big Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhiguagua Tianjin Big Data Technology Co ltd
Priority to CN202311732921.2A
Publication of CN117421393A
Application granted
Publication of CN117421393B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/0895 - Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method combines the characteristics of patent text with the IPC multi-level classification system of patents to perform unified fusion coding, and designs a comprehensive loss function to train the model to convergence. All documents in the patent library are encoded with the trained generative patent coding model; in the patent retrieval stage, the same trained model performs query coding on the text to be queried, retrieval matching is performed in the patent code library based on the query code and the patent codes, and the ranked results are returned. Through a pre-coding mechanism combined with a large-scale index database, the invention effectively reduces the computation and the latency of the retrieval service, combines the efficiency of traditional retrieval methods with the semantic-understanding strengths of deep models, and improves both the recall and the precision of large-scale patent retrieval.

Description

Generative retrieval method and system for patents
Technical Field
The application relates to the technical field of patent retrieval, and in particular to a generative retrieval method and system for patents.
Background
As society's awareness of intellectual-property protection continues to grow, the demand for accurate and efficient patent retrieval keeps increasing. Patent search for novelty examination and infringement detection is a core link in the patent application and rights-enforcement processes, so achieving accurate and efficient retrieval has become an important part of building the patent system. Current patent retrieval is typically implemented as similarity matching between a user's text query and patent text, and can no longer accommodate the ever-growing retrieval needs of patent examiners and applicants. How to retrieve relevant patents accurately and efficiently from user-entered text has therefore become an important research topic in the field of patent retrieval.
Traditional patent retrieval is implemented through term-vector matching based on TF-IDF and BM25. It has the advantage of efficient retrieval over massive data, but its strict term-matching also causes defects such as missing semantics and weak generalization. More recent patent retrieval methods, based on vectorized coding from a pre-training language model, are implemented as serial modules in a pipeline, which makes end-to-end multi-module joint training difficult; moreover, their contrastive-learning training objective is inconsistent with the retrieval task objective, so the knowledge acquired in the model pre-training stage cannot be fully exploited. Existing methods also underuse the associated semantic information among the internal constituent units of a patent document, so the similarity measure between the input and the target during retrieval is inaccurate, and the completeness and accuracy of the final patent retrieval are low overall.
Disclosure of Invention
The generative patent retrieval method and system provided by this application perform semantic identification coding of text content through an auto-encoder, realize self-supervised model training without labeled data, and perform unified fusion coding with the help of the IPC multi-level classification system. This overcomes the strict term-matching defect of traditional TF-IDF/BM25 and the computational-efficiency problem of language-model-based dense vector retrieval, generalizes well, improves retrieval accuracy for newly published and newly filed patents, and improves both the recall and the precision of patent retrieval.
In a first aspect, a generative retrieval method for patents is provided, the method comprising S1 model training, S2 patent coding, S3 query coding and S4 retrieval matching;
S1, model training, namely constructing a training data set based on the patent text and IPC multi-level classification data in the patent library, and, based on a pre-training language model combined with a codebook data structure, training on the training data set with a comprehensive loss function combining reconstruction loss, IPC loss and commitment loss until convergence, to obtain a patent text semantic identification coding model;
S2, patent coding, namely carrying out semantic identification coding on all patents in the patent library with the patent text semantic identification coding model trained to convergence, and storing the generated identification-sequence coding data into an index database;
S3, query coding, namely coding the query text input by the user with the patent text semantic identification coding model trained to convergence, generating a semantic identification sequence;
S4, retrieval matching, namely searching the patent code index library for patents matching the query by applying the tree-based longest matching algorithm, and returning the Top-K results with the highest similarity after ranking by similarity.
Optionally, the training of the S1 model specifically includes:
s11, constructing a training data set, extracting patent document text and IPC classification data from a patent library, firstly cleaning the data, segmenting patent titles, abstracts, claims, description parts and drawings, filtering paragraph labels and drawing description data, sequentially combining other text fields, and associating the combined text with the IPC classification corresponding to the patent text to form training data;
s12, designing a model, wherein the model specifically comprises a patent text semantic identification coding model and a text reconstruction model; the text reconstruction model is used for assisting in training a patent text semantic identification coding model, and semantic information of the patent text can be characterized by assisting the patent text semantic identification coding model in a semantic identification sequence generated by the patent text; the patent text semantic identification coding model comprises a coding layer, a decoding layer and a codebook; the coding layer and the decoding layer are in a model framework taking a pre-training language model based on a transducer as a basic model, and the codebook is a codebook data structure designed aiming at the characteristics of the patent text; the coding layer of the pre-training language model T5 based on the transducer is selected as the coding layer, the decoding layer of the pre-training language model T5 based on the transducer is selected as the decoding layer, and the codebook is a codebook initialized by using patent data; the text reconstruction model is based on a pre-training language model T5 of a transducer as a basic model, and the model framework comprises an encoding layer and a decoding layer; the patent document is specifically denoted by GThe semantic identification coding model uses E 1 Representing coding layers in a patent text semantic identification coding model by D 1 Representing a decoding layer in a patent text semantic identification coding model by E t Representing a codebook, and representing a text reconstruction model by R;
s13, training is performed, namely training a model framework by a pointer on a data set prepared by constructing a training data set comprises initializing a codebook, designing a semantic identification sequence, training a model and optimizing a comprehensive loss function; finally training until the model converges, and storing and outputting the model parameters after training.
Optionally, in the training performed in S13, the method specifically includes:
the initialization codebook is used for carrying out text clustering by utilizing a K-Means algorithm aiming at all patent texts in the data set, clustering K categories, combining L categories of parts, major categories and minor categories in the patent IPC classification, and finally constructing a codebook index structure as E t
Semantic identification sequence design: the structure of the semantic identification sequence is designed using the sections, classes and subclasses of the patent IPC classification; specifically, the first three positions of the sequence represent the section, class and subclass respectively, and the semantic identifications at the remaining positions represent the semantic information of the patent text.
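The codebook initialization described above (K text clusters plus L IPC categories) might be sketched as follows, assuming embedding vectors already exist for the patent texts. The K-Means here is a minimal numpy implementation of Lloyd's algorithm, and the random initialization of the IPC entries is an assumption (the document only says the codebook is initialized with patent data); tiny K, L and D values are used, where the patent itself uses K=2000, L=786 and D=768.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm; returns k centroids of X with shape (n, d)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest centroid, then recompute the means
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def init_codebook(text_embs, n_text_clusters, n_ipc_categories, dim, seed=0):
    """Codebook Et of shape (M, D) with M = K + L: K text-cluster centroids
    stacked on L IPC-category entries (randomly initialized here, which is
    an assumption)."""
    rng = np.random.default_rng(seed)
    text_part = kmeans(text_embs, n_text_clusters)
    ipc_part = rng.normal(scale=0.02, size=(n_ipc_categories, dim))
    return np.vstack([text_part, ipc_part])

# tiny illustrative sizes; the patent uses K=2000, L=786, D=768
embs = np.random.default_rng(1).normal(size=(100, 8))
Et = init_codebook(embs, n_text_clusters=5, n_ipc_categories=3, dim=8)
# Et.shape == (8, 8): M = 5 + 3 entries of dimension D = 8
```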
Optionally, in the training performed in S13, the method specifically includes:
model training, training using training data sets to construct prepared training data consisting of pairs of data (d, d ipc ) Composition, where d represents the patent text, d ipc The IPC three-level classification of the patent is shown, and in the model training process, the integral input is (d, d) ipc ) The patent text d is input into the patent text semantic identification coding model G, and in the data flow of each time step t, the coding layer E in the patent text semantic identification coding model G is firstly passed through 1 And decoding layer D 1 The output is generated as d t
Wherein E is 1 Representing coding layers in a patent text semantic identification coding model, D 1 Representing a decoding layer in a patent text semantic identification coding model, d t Identifying the flow through the coding layer E in the coding model G for the patent text semantics of the current time step t 1 And decoding layer D 1 Post output, z <t Output of the semantic identification coding model G representing the patent text before the current time step t, d generated by each time step t Codebook E input to patent text semantic identification coding model G t Output generation z t
The single data training totally goes through T time steps, wherein, the patent text semantic identification coding model G outputs a semantic identification sequence Z for the received input patent text d:
generating a semantic identification sequence Z through T time-step patent text semantic identification coding models G as a whole, and processing the semantic identification sequence Z to generate Z 4→T ,Z 4→T Representing the fourth bit to the T bit in the semantic identification sequence; input Z 4→T To the text reconstruction model R, output generation:
Using a reconstruction model R for semantic identification sequences Z 4→T Reconstructing to generate a patent text d, wherein the result of the R prediction of the reconstruction model is thatIt is expected that the reconstruction decoding will generate a result closer to the original input patent text d, so that the original input patent text d is patentedLearning the original text d gradually approaches;
optimizing the comprehensive loss function, carrying out integrated calculation aiming at the loss function and integrally updating the model; the loss function designs three losses, namely a combined loss function of reconstruction loss, IPC loss and commitment loss to optimize a model, the overall loss is used as an optimization target for training, the random gradient descent method is used for optimizing the loss function, and model parameters are optimized in a counter-propagation mode until the model converges, so that an expected training result is obtained.
Optionally, the S2 patent code specifically includes:
s21, patent library document semantic identification coding, namely inputting a patent text set in a DOC by using a patent text semantic identification coding model G, traversing each patent text in the DOC, sending the patent text to the patent text semantic identification coding model G to generate a semantic identification sequence, and finally outputting all patent text semantic identification sequences; specifically traversing extracted patent text D from DOC i Inputting a patent text semantic identification coding model G, outputting and generating a corresponding semantic identification sequence Z i ,Z i =G(D i ) Until all patent texts are traversed and semantic identification sequences of all patent texts are generated
S22, constructing a semantic identification index library, constructing a corresponding index structure for semantic identification sequences of all patents generated by semantic identification codes of patent library documents, and storing the index structure into an index database.
Optionally, the S3 query code specifically includes:
generating semantic identification codes of the query text; and specifically, applying a patent text semantic identification coding model G to carry out semantic identification coding on a query text Q input by a user to generate a query semantic identification sequence.
Optionally, the step of S4 retrieving the matching specifically includes:
searching the patent matched with the query semantic identification sequence in the patent coding index library by using the longest matching algorithm based on the tree to input the query semantic identification sequence and the patent coding index libraryThe distance between candidate representing sequences is measured, and Top-K candidate patent results closest to the input representing sequences are returned; specifically, a query semantic identification sequence Z generated in a query encoding stage Q Performing tree-based longest matching algorithm calculation with semantic identification sequences of all patents in a patent coding index library, and returning a Top-K result with highest similarity after sorting according to the similarity; wherein Top-K is a set value.
In a second aspect, a generative search system for patents, the system comprising a model training module, a patent encoding module, a query encoding module, and a search matching module, wherein:
the model training module is used for constructing a training data set based on the patent text and IPC multi-level classification data in the patent library, carrying out training on a training data set based on a pre-training language model and combining a codebook data structure by using a comprehensive loss function combining reconstruction loss, IPC loss and commitment loss, and carrying out training until convergence to obtain a patent text semantic identification coding model;
the patent coding module is used for carrying out semantic identification coding on all patents in the patent library by utilizing the patent text semantic identification coding model trained to be converged, and storing the generated identification sequence coding data into the index database;
the query coding module is used for coding the query text input by the user by utilizing the patent text semantic identification coding model trained to be converged to generate a semantic identification sequence;
and the search matching module is used for searching the patents matched with the query in the patent code index library by applying the longest matching algorithm based on the tree, and returning the Top-K result with the highest similarity after sorting according to the similarity.
In a third aspect, a computer device is provided, comprising a memory storing a computer program and a processor that, when executing the computer program, implements the generative search method for patents of any of the first aspects described above.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements a generative search method for patents as described in any of the above first aspects.
The invention provides a generative retrieval method and system for patents. Based on a pre-training language model, a loss function is designed for the characteristics of patents; semantic identification coding of text content through an auto-encoder realizes self-supervised model training without labeled data; and unified fusion coding with the help of the IPC multi-level classification system overcomes the strict term-matching defect of traditional TF-IDF/BM25 and the computational-efficiency problem of language-model-based dense vector retrieval, with the strong generalization of generative models enabling accurate retrieval of newly published and newly filed patents. Through a pre-coding mechanism combined with a large-scale index database, the invention effectively reduces the computation and latency of the retrieval service, combines the efficiency of traditional retrieval methods with the semantic-understanding strengths of deep models, and improves both the recall and the precision of large-scale patent retrieval.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings used in their description are briefly introduced below. It will be apparent to those skilled in the art that the drawings described below are merely exemplary and that other drawings may be derived from them without undue effort.
FIG. 1 is a main logic flow diagram provided in an embodiment of the present application;
FIG. 2 is a diagram of a model training architecture provided in an embodiment of the present application;
FIG. 3 is a flow chart of patent code provided in an embodiment of the present application;
FIG. 4 is a flowchart of query encoding provided in an embodiment of the present application;
FIG. 5 is a flowchart of search matching provided in an embodiment of the present application;
FIG. 6 is a block diagram of a module of the generative search system of the present application;
fig. 7 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In the description of the present application, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus, or steps or elements added in further optimization schemes based on the inventive concept.
The invention provides a generative retrieval method and system for patents. Based on a pre-training language model, a generative patent coding model is trained for the characteristics of patents: the characteristics of the patent text are combined with the IPC multi-level classification system of the patents for unified fusion coding, and the designed training drives the model to convergence. All documents in the patent library are patent-coded with the trained generative patent coding model; in the patent retrieval stage, the same trained model performs query coding on the text to be queried, search matching is performed in the patent code library based on the query code and the patent codes, and the Top-K results are returned after ranking. The overall flow divides into model training, patent coding, query coding and search matching.
Model training: comprises training-data-set construction, model design and training execution. A training data set is constructed based on the patent text and IPC multi-level classification data in the patent library; the model design is based on a pre-training language model combined with a codebook data structure; the model is trained with a comprehensive loss function combining the reconstruction loss, the IPC loss and the commitment loss until convergence; the final model is the patent text semantic identification coding model.
Patent code: the method comprises the steps of patent library document semantic identification coding and semantic identification index table construction. And carrying out semantic identification coding on all patents in the patent library by utilizing the patent text semantic identification coding model trained to be converged, and storing the generated identification sequence coding data into an index database.
Query encoding: and coding the query text input by the user by utilizing the patent text semantic identification coding model trained to be converged to generate a semantic identification sequence.
Searching and matching: and searching patents matched with the query in a patent code index library by using a longest matching algorithm based on the tree, and returning a Top-K result with highest similarity after sorting according to the similarity.
As shown in fig. 1, a generative retrieval method for patents is provided; the method can be applied to a server and comprises S1 model training, S2 patent coding, S3 query coding and S4 search matching;
the method specifically comprises the following steps:
the S1 model training comprises the steps of S11 training data set construction, S12 model design and S13 training;
s11, training data set construction is to extract patent document text and IPC classification data from a patent library, firstly, perform data cleaning, divide patent titles, abstracts, claims, description parts and drawings, filter paragraph labels, drawing descriptions and other data, then sequentially combine other text fields, associate the combined text with the corresponding IPC classification of the patent text, and form training data, namely, comprise cleaned patent text content and corresponding IPC classification, and construct training data set for model training by the classified data;
the S12 model design is that the model consists of a patent text semantic identification coding model and a text reconstruction model. The text reconstruction model is used for assisting in training the patent text semantic identification coding model, so that semantic information of the patent text is extracted more abundantly, and finally, the semantic identification coding model can be used for assisting in representing a semantic identification sequence generated by the patent textSemantic information of patent text. The patent text semantic identification coding model comprises a coding layer, a decoding layer and a codebook. Wherein the encoding layer (Encoder) and decoding layer (Decoder) are the encoding layer and decoding layer in the model architecture based on a transducer-based pre-trained language model as a base model, and the codebook is a codebook data structure designed for the characteristics of the patent text. A code layer (Encoder) selects a code layer of a pre-training language model T5 based on a converter, and a code layer (Encoder) selects a code layer of the pre-training language model T5 based on the converter, wherein a codebook (codebook) is a codebook (codebook) initialized by using patent data; the text reconstruction model is based on a pre-trained language model T5 of a transducer as a basic model, and the model architecture comprises an encoding layer and a decoding layer. The patent text semantic identification coding model is specifically represented by G, and E is used 1 Representing coding layers in a patent text semantic identification coding model by D 1 Representing a decoding layer in a patent text semantic identification coding model by E t Representing a codebook, and representing a text reconstruction model by R;
s13, training is performed by means of training the model architecture by means of a data set prepared by constructing a training data set, wherein the training comprises the steps of initializing a codebook, designing a semantic identification sequence, training the model and optimizing a comprehensive loss function. Finally training until the model converges, and storing and outputting the model parameters after training.
Codebook initialization: for all patent texts in the data set, text clustering is performed with the K-Means algorithm into K categories, which are combined with the L categories of sections, classes and subclasses in the patent IPC classification to finally construct the codebook index structure Et, with Et ∈ R^(M×D), where M = K + L; in this patent K takes the value 2000, L takes the value 786, and D is the model hidden-layer dimension 768.
The semantic identification sequence design combines the characteristics of patent text, namely the patent IPC classification: the structure of the semantic identification sequence is designed using the sections, classes and subclasses of the IPC classification, the first three positions representing the section, class and subclass respectively, and the semantic identifications at the remaining positions representing the semantic information of the patent text.
As in fig. 2, a diagram of the model training architecture in the present application is given. Model training is performed by constructing prepared training data using a training data set, the training data being composed of data pairs (d, d ipc ) Composition, where d represents the patent text, d ipc The IPC three-level classification of the patent is shown, and in the model training process, the integral input is (d, d) ipc ) The patent text d is input into the patent text semantic identification coding model G, and in the data flow of each time step t, the coding layer E in the patent text semantic identification coding model G is firstly passed through 1 And decoding layer D 1 The output is generated as d t
Wherein E is 1 Representing coding layers in a patent text semantic identification coding model, D 1 Representing a decoding layer in a patent text semantic identification coding model, d t Identifying the flow through the coding layer E in the coding model G for the patent text semantics of the current time step t 1 And decoding layer D 1 Post output, z <t Output of the semantic identification coding model G representing the patent text before the current time step t, d generated by each time step t Codebook E input to patent text semantic identification coding model G t Output generation z t
The training of single data totally experiences T time steps, in the invention, T is taken 64, namely the length of a semantic representation sequence is also 64, wherein, for a received input patent text d, a patent text semantic identification coding model G is output as a semantic identification sequence Z:
through T timesGenerating a semantic identification sequence Z by the whole semantic identification coding model G of the intermittent patent text, and generating Z by processing the semantic identification sequence Z 4→T ,Z 4→T The fourth bit to the T bit in the semantic identification sequence is represented, namely the semantic identification sequence which does not contain the first three bits and represents the IPC class is not represented.
Z_{4→T} is then input to the text reconstruction model R, which outputs the reconstruction d̂ = R(Z_{4→T}).
The reconstruction model R reconstructs the patent text d from the semantic identification sequence Z_{4→T}; its prediction is d̂. Since the reconstructed and decoded result is expected to be close to the original input patent text d, the prediction gradually approaches the original patent text d during learning, and the semantic identification sequence generated by the patent text semantic identification coding model G therefore comes to carry richer, finer and more accurate semantic information.
The comprehensive loss function optimization integrates the loss functions designed for the whole task and updates the model as a whole. The invention designs three losses, and the model is optimized with the combined loss of the reconstruction loss (Reconstruction Loss), the IPC loss and the commitment loss (Commitment Loss): Loss = L_Rec + α·L_Com + β·L_IPC, where in the invention α = 0.95 and β = 0.8. Training takes the overall loss as the optimization target; the loss function is optimized with stochastic gradient descent and the model parameters are updated by back-propagation until the model converges, yielding the expected training result: the finally trained patent text semantic identification coding model G can generate a semantic identification sequence Z that fully represents the semantic information of the patent text.
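As a minimal numeric sketch of the combined objective (the helpers cross_entropy and total_loss are illustrative, not the patent's implementation; only the weights α = 0.95 and β = 0.8 and the form Loss = L_Rec + α·L_Com + β·L_IPC are taken from the text):

```python
import math

def cross_entropy(p_true, p_pred, eps=1e-12):
    """CE(p, q) = -sum_i p_i * log(q_i); used for both L_Rec and L_IPC."""
    return -sum(p * math.log(q + eps) for p, q in zip(p_true, p_pred))

def total_loss(L_rec, L_com, L_ipc, alpha=0.95, beta=0.8):
    """Loss = L_Rec + alpha * L_Com + beta * L_IPC, with the patent's weights."""
    return L_rec + alpha * L_com + beta * L_ipc

# With all three component losses equal to 1.0, the combined loss is 2.75.
loss = total_loss(1.0, 1.0, 1.0)
```

In practice the components would be computed over model outputs; here scalar placeholders stand in for them.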
The reconstruction loss L_Rec is expressed as:

L_Rec = CE(d, d̂)

where CE is the cross entropy (Cross Entropy) and d̂ = R(Z_{4→T}) is the output of the reconstruction model. The design purpose is that the semantic identification sequence generated by the patent text semantic identification coding model G contains richer semantic information.
The IPC loss L_IPC is expressed as:

L_IPC = CE(d_ipc, Z_{1,2,3})

where Z_{1,2,3} are the first three positions of the semantic identification sequence, representing the section, class and subclass in the IPC, and CE is again the cross entropy. The design purpose is to make the classification learning of the IPC section, class and subclass in the semantic identification sequence generated by the patent text semantic identification coding model G more accurate.
The commitment loss L_Com is expressed as:

L_Com = Σ_t ‖d_t − sg(z_t)‖₂²

where sg(·) denotes the stop-gradient operator. The design aim is to keep the patent text semantic identification coding model G from forgetting and to improve generation quality when producing the semantic representation sequence, so that the model updates, optimizes and converges better.
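A minimal sketch of the codebook lookup and the forward value of a commitment-style loss. The functions below are illustrative: in the forward pass the stop-gradient sg(·) is the identity, so the value reduces to the squared distance between each encoder output and its nearest codebook entry (the patent does not spell out the formula; this follows the standard VQ-VAE commitment form).

```python
def quantize(d_t, codebook):
    """Return the codebook entry nearest (in squared distance) to the
    encoder output d_t."""
    return min(codebook, key=lambda e: sum((x - y) ** 2 for x, y in zip(d_t, e)))

def commitment_value(encoder_outputs, codebook):
    """Forward value of sum_t ||d_t - sg(z_t)||^2 (sg is the identity in
    the forward pass)."""
    return sum(sum((x - y) ** 2 for x, y in zip(d, quantize(d, codebook)))
               for d in encoder_outputs)

codebook = [(0.0, 0.0), (1.0, 1.0)]
outs = [(0.25, 0.25)]                    # toy encoder output; nearest entry is (0, 0)
val = commitment_value(outs, codebook)   # 0.25**2 + 0.25**2 = 0.125
```

The gradient-blocking behavior of sg(·) matters only during back-propagation and is not modeled in this plain-Python sketch.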
The S2 patent coding comprises S21 semantic identification coding of patent library documents and S22 construction of the semantic identification index library.
S21, semantic identification coding of patent library documents applies the patent text semantic identification coding model G to generate a semantic identification sequence for each patent text in the patent library. The model G takes the patent text set DOC as input; each patent text in DOC is traversed and fed to G to generate its semantic identification sequence, and finally the semantic identification sequences of all patent texts are output. Specifically, each patent text D_i extracted by traversing DOC is input to the patent text semantic identification coding model G, which outputs the corresponding semantic identification sequence Z_i, Z_i = G(D_i), until all patent texts have been traversed and the semantic identification sequences of all patent texts, {Z_1, ..., Z_n}, have been generated. Here DOC is the text set of all patent documents in the patent library, DOC = {D_1, ..., D_n}, n is the number of patents in the library, D_i is an element of DOC representing a single patent text, and {Z_1, ..., Z_n} is the finally generated set of semantic identification sequences of all patent texts.
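The traversal in S21 amounts to the following loop. This is a sketch: the stand-in encoder toy_G is hypothetical, whereas in the patent G is the trained semantic identification coding model.

```python
def encode_corpus(DOC, G):
    """Compute Z_i = G(D_i) for every patent text D_i in the corpus DOC."""
    return {i: G(D_i) for i, D_i in enumerate(DOC, start=1)}

# Stand-in encoder (hypothetical): maps a text to a short sequence of token ids.
def toy_G(text):
    return [ord(c) % 7 for c in text[:4]]

DOC = ["rotor blade assembly", "battery cooling system"]
Z_all = encode_corpus(DOC, toy_G)   # {1: Z_1, 2: Z_2}
```

In the real system each Z_i would be a length-64 semantic identification sequence rather than these toy ids.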
S22, construction of the semantic identification index library builds a corresponding index structure for the semantic identification sequences of all patents generated by the patent library document coding and stores it in the index database: an index relation is constructed between each semantic identification sequence Z_i and its patent text data, finally producing the patent coding index library.
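The index relation of S22 can be sketched as a prefix tree (trie) keyed by the identifier sequence, a structure that also supports tree-based matching later on. The dict-based trie and the "$" leaf marker are implementation assumptions, not details from the patent.

```python
def build_index(sequences):
    """Insert each (patent_id, semantic identification sequence) pair into
    a nested-dict trie; a "$" leaf marker points back to the patent id."""
    root = {}
    for pid, seq in sequences.items():
        node = root
        for token in seq:
            node = node.setdefault(token, {})
        node["$"] = pid
    return root

# Toy index over two short identifier sequences (hypothetical ids and tokens).
index = build_index({101: [1, 2, 5], 102: [1, 3, 4]})
```

Sequences sharing a prefix share trie nodes, so lookups walk the tree token by token.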
A patent coding flow chart provided by an embodiment of the present application is shown in fig. 3.
S3, query coding generates the semantic identification coding of the query text. Specifically, the patent text semantic identification coding model G is applied to perform semantic identification coding on the query text Q input by the user, generating a query semantic identification sequence: the query text Q is input into the model G, which outputs the query semantic identification sequence Z_Q, Z_Q = G(Q). A query coding flow chart provided by an embodiment of the present application is shown in fig. 4.
S4, retrieval matching searches the patent coding index library for patents matching the query semantic identification sequence by applying the tree-based longest matching algorithm, using the distance between the input query semantic identification sequence and the candidate representation sequences in the index library as the measure, and returns the Top-K candidate patent results closest to the input representation sequence. Specifically, the query semantic identification sequence Z_Q generated in the query coding stage is matched against the semantic identification sequences of all patents in the patent coding index library with the tree-based longest matching algorithm, and the Top-K results with the highest similarity are returned after sorting by similarity. Top-K is a set value; in the present application Top-K takes the value 50. A retrieval matching flow chart provided by an embodiment of the present application is shown in fig. 5.
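One plausible reading of the tree-based longest matching retrieval is ranking candidates by the length of the longest common prefix with the query sequence. The sketch below works under that assumption (the patent does not spell the algorithm out), and for clarity it compares sequences directly instead of walking a trie.

```python
def longest_prefix_len(a, b):
    """Length of the longest common prefix of two identifier sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def retrieve(Z_q, corpus, top_k=50):
    """Return the ids of the top_k corpus sequences sharing the longest
    prefix with the query sequence Z_q (Top-K = 50 in the application)."""
    ranked = sorted(corpus,
                    key=lambda pid: longest_prefix_len(Z_q, corpus[pid]),
                    reverse=True)
    return ranked[:top_k]

# Toy corpus of three identifier sequences (hypothetical ids and tokens).
corpus = {101: [1, 2, 5], 102: [1, 2, 9], 103: [7, 7, 7]}
hits = retrieve([1, 2, 5], corpus, top_k=2)   # 101 matches on all three positions
```

Because the first three positions encode the IPC section, class and subclass, prefix matching naturally prioritizes patents in the same IPC branch before comparing content identifiers.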
In summary, the invention provides a generative retrieval method and system for patents that realize self-supervised model training without labeled data, perform unified fusion coding by means of the IPC multi-level classification system, construct semantic identification coding indexes for the full patent library through an auto-encoder-based generative model, effectively combine the efficiency of traditional word-vector retrieval with the accuracy of deep-model semantic retrieval, and realize large-scale semantic retrieval based on the similarity between the query coding sequence and the patent coding sequences.
In one embodiment, as shown in fig. 6, the present application provides a generative retrieval system for patents, the system comprising a model training module, a patent coding module, a query coding module and a retrieval matching module, wherein:
the model training module is used for constructing a training data set based on the patent texts and IPC multi-level classification data in the patent library, and for training on that data set, based on a pre-trained language model combined with a codebook data structure, with a comprehensive loss function combining reconstruction loss, IPC loss and commitment loss, until convergence, to obtain the patent text semantic identification coding model;

the patent coding module is used for performing semantic identification coding on all patents in the patent library with the patent text semantic identification coding model trained to convergence, and for storing the generated identification sequence coding data in the index database;

the query coding module is used for coding the query text input by the user with the patent text semantic identification coding model trained to convergence, generating a semantic identification sequence;

the retrieval matching module is used for searching the patent coding index library for patents matching the query by applying the tree-based longest matching algorithm, and for returning the Top-K results with the highest similarity after sorting by similarity.
For the specific implementation of each module of the generative retrieval system for patents, reference may be made to the above description of the method, which is not repeated here.
In one embodiment, a computer device is provided, which may be a server, whose internal structure may be as shown in fig. 7. The computer device includes a processor, a memory and a network interface connected by a system bus. The processor of the computer device provides computing and control capability, the network interface communicates with external terminals over a network connection, and the computer device loads and runs the computer program to realize the generative retrieval method for patents described above.
It will be appreciated by those skilled in the art that the structure shown in fig. 7 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer readable storage medium is also provided, on which a computer program is stored; when executed, the computer program implements all or part of the flow of the methods of the above embodiments.
The technical features of the above embodiments may be combined arbitrarily; for brevity, not all possible combinations of the technical features of the above embodiments are described. However, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.

Claims (9)

1. A generative retrieval method for patents, characterized by comprising S1 model training, S2 patent coding, S3 query coding and S4 retrieval matching;
S1, model training: constructing a training data set based on the patent texts and IPC multi-level classification data in a patent library, and training on the training data set, based on a pre-trained language model combined with a codebook data structure, with a comprehensive loss function combining reconstruction loss, IPC loss and commitment loss, until convergence, to obtain a patent text semantic identification coding model;

S2, patent coding: performing semantic identification coding on all patents in the patent library with the patent text semantic identification coding model trained to convergence, and storing the generated identification sequence coding data in an index database;

S3, query coding: coding the query text input by a user with the patent text semantic identification coding model trained to convergence, generating a semantic identification sequence;

S4, retrieval matching: searching the patent coding index library for patents matching the query by applying the tree-based longest matching algorithm, and returning the Top-K results with the highest similarity after sorting by similarity;
the S1 model training specifically comprises the following steps:
S11, training data set construction: extracting patent document texts and IPC classification data from the patent library, first cleaning the data, segmenting the patent title, abstract, claims, description and drawings, filtering out paragraph labels and drawing-description data, concatenating the remaining text fields in order, and associating the combined text with the IPC classification corresponding to the patent text to form the training data;
S12, model design, specifically comprising a patent text semantic identification coding model and a text reconstruction model; the text reconstruction model is used to assist the training of the patent text semantic identification coding model, so that the semantic identification sequence the latter generates for a patent text can characterize the semantic information of that text; the patent text semantic identification coding model comprises a coding layer, a decoding layer and a codebook; the coding layer and the decoding layer follow a model framework taking a Transformer-based pre-trained language model as the base model, and the codebook is a codebook data structure designed for the characteristics of patent text; the coding layer of the Transformer-based pre-trained language model T5 is selected as the coding layer, the decoding layer of the Transformer-based pre-trained language model T5 is selected as the decoding layer, and the codebook is initialized using patent data; the text reconstruction model takes the Transformer-based pre-trained language model T5 as its base model, its framework comprising a coding layer and a decoding layer; the patent text semantic identification coding model is denoted G, its coding layer E_1, its decoding layer D_1, the codebook E_t, and the text reconstruction model R;
S13, training is performed: the model framework is trained on the data set prepared in the training data set construction, comprising codebook initialization, semantic identification sequence design, model training and comprehensive loss function optimization; finally training proceeds until the model converges, and the trained model parameters are saved and output.
2. The method according to claim 1, wherein the step S13 of performing training specifically comprises:
The codebook initialization performs text clustering with the K-Means algorithm on all patent texts in the data set, clustering them into K categories, combines these with the L categories of sections, classes and subclasses in the patent IPC classification, and finally constructs the codebook index structure E_t.
The semantic identification sequence design is used for designing the structure of the semantic identification sequence by utilizing parts, major classes and minor classes in the patent IPC classification, wherein the first three bits of the semantic identification sequence are specifically designed to respectively represent the parts, the major classes and the minor classes, and the semantic identifications of the rest positions represent the semantic information of the patent text.
3. The method according to claim 1, wherein the step S13 of performing training specifically comprises:
Model training: the prepared training data constructed from the training data set consist of data pairs (d, d_ipc), where d represents the patent text and d_ipc the three-level IPC classification of the patent; during model training the overall input is (d, d_ipc): the patent text d is input into the patent text semantic identification coding model G, and in the data flow of each time step t it first passes through the coding layer E_1 and the decoding layer D_1 of G, which produce the output d_t;

wherein E_1 denotes the coding layer of the patent text semantic identification coding model, D_1 its decoding layer, d_t the output after flowing through E_1 and D_1 at the current time step t, and z_<t the outputs of G before the current time step t; the d_t generated at each time step is input to the codebook E_t of G, which outputs z_t;
M is the dimension of the codebook index structure, and training on a single sample spans T time steps in total; for the received input patent text d, the patent text semantic identification coding model G outputs the semantic identification sequence Z:

Z = G(d) = (z_1, z_2, ..., z_T)

D is the dimension of the model's hidden layer; the T time steps of the patent text semantic identification coding model G generate the semantic identification sequence Z as a whole, and Z is processed to obtain Z_{4→T}, the fourth through T-th positions of the sequence; Z_{4→T} is input to the text reconstruction model R, which outputs:

d̂ = R(Z_{4→T})
The reconstruction model R reconstructs the patent text d from the semantic identification sequence Z_{4→T}, its prediction being d̂; since the reconstructed and decoded result is expected to be close to the original input patent text d, the result gradually approaches the original patent text d;
comprehensive loss function optimization: the loss functions are computed in an integrated manner and the model is updated as a whole; three losses are designed, i.e. the model is optimized with the combined loss function of the reconstruction loss, the IPC loss and the commitment loss; the overall loss is taken as the optimization target for training, the loss function is optimized with stochastic gradient descent, and the model parameters are updated by back-propagation until the model converges, giving the expected training result.
4. The method of claim 1, wherein the S2 patent code specifically comprises:
S21, semantic identification coding of patent library documents: the patent text semantic identification coding model G takes the text set of all patent documents in the patent library as input, traverses each patent text in the library, feeds it to G to generate its semantic identification sequence, and finally outputs the semantic identification sequences of all patent texts; each patent text D_i extracted by traversing the text set of all patent documents in the library is input to the patent text semantic identification coding model G, which outputs the corresponding semantic identification sequence Z_i, Z_i = G(D_i), until all patent texts have been traversed and the semantic identification sequences of all patent texts have been generated;
S22, constructing a semantic identification index library, constructing a corresponding index structure for semantic identification sequences of all patents generated by semantic identification codes of patent library documents, and storing the index structure into an index database.
5. The method of claim 1, wherein the S3 query encoding specifically comprises:
generating the semantic identification coding of the query text; specifically, the patent text semantic identification coding model G is applied to perform semantic identification coding on the query text Q input by the user, generating the query semantic identification sequence.
6. The method according to claim 1, wherein said S4 retrieving matches specifically comprises:
searching the patent coding index library for patents matching the query semantic identification sequence with the tree-based longest matching algorithm, using the distance between the input query semantic identification sequence and the candidate representation sequences in the index library as the measure, and returning the Top-K candidate patent results closest to the input representation sequence; specifically, the query semantic identification sequence Z_Q generated in the query coding stage is matched against the semantic identification sequences of all patents in the patent coding index library with the tree-based longest matching algorithm, and the Top-K results with the highest similarity are returned after sorting by similarity; Top-K is a set value.
7. A generative retrieval system for patents, the system comprising a model training module, a patent coding module, a query coding module and a retrieval matching module, wherein:
the model training module is used for constructing a training data set based on the patent texts and IPC multi-level classification data in the patent library, and for training on that data set, based on a pre-trained language model combined with a codebook data structure, with a comprehensive loss function combining reconstruction loss, IPC loss and commitment loss, until convergence, to obtain the patent text semantic identification coding model;
the patent coding module is used for carrying out semantic identification coding on all patents in the patent library by utilizing the patent text semantic identification coding model trained to be converged, and storing the generated identification sequence coding data into the index database;
the query coding module is used for coding the query text input by the user by utilizing the patent text semantic identification coding model trained to be converged to generate a semantic identification sequence;
the searching and matching module is used for searching the patent matched with the query in the patent coding index library by applying the longest matching algorithm based on the tree, and returning a Top-K result with highest similarity after sorting according to the similarity;
the model training module specifically comprises:
training data set construction: extracting patent document texts and IPC classification data from the patent library, first cleaning the data, segmenting the patent title, abstract, claims, description and drawings, filtering out paragraph numbers and drawing-description data, concatenating the remaining text fields in order, and associating the combined text with the IPC classification corresponding to the patent text to form the training data;
model design, comprising a patent text semantic identification coding model and a text reconstruction model; the text reconstruction model is used to assist the training of the patent text semantic identification coding model, so that the semantic identification sequence the latter generates for a patent text can characterize the semantic information of that text; the patent text semantic identification coding model comprises a coding layer, a decoding layer and a codebook; the coding layer and the decoding layer follow a model framework taking a Transformer-based pre-trained language model as the base model, and the codebook is a codebook data structure designed for the characteristics of patent text; the coding layer of the Transformer-based pre-trained language model T5 is selected as the coding layer, the decoding layer of the Transformer-based pre-trained language model T5 is selected as the decoding layer, and the codebook is initialized using patent data; the text reconstruction model takes the Transformer-based pre-trained language model T5 as its base model, its framework comprising a coding layer and a decoding layer; the patent text semantic identification coding model is denoted G, its coding layer E_1, its decoding layer D_1, the codebook E_t, and the text reconstruction model R;
training is performed: the model framework is trained on the data set prepared in the training data set construction, comprising codebook initialization, semantic identification sequence design, model training and comprehensive loss function optimization; finally training proceeds until the model converges, and the trained model parameters are saved and output.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN202311732921.2A 2023-12-18 2023-12-18 Generating type retrieval method and system for patent Active CN117421393B (en)

Publications (2)

Publication Number  Publication Date
CN117421393A (en)  2024-01-19
CN117421393B (en)  2024-04-09





Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
TR01: Transfer of patent right (effective date of registration: 20240412). Patentee after: Beijing Zhiguagua Technology Co.,Ltd., No. 401-1, 4th floor, podium, buildings 3 and 4, No. 11 Changchun Bridge Road, Haidian District, Beijing 100089, China. Patentee before: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd., 806A, Building 1, Sixin Building, south side of Heiniucheng Road, Hexi District, Tianjin 300221, China.
CP03: Change of name, title or address. Patentee after: Beijing Xinghe Zhiyuan Technology Co.,Ltd., No. 401-1, 4th floor, podium, buildings 3 and 4, No. 11 Changchun Bridge Road, Haidian District, Beijing 100089, China. Patentee before: Beijing Zhiguagua Technology Co.,Ltd., at the same address.