WO2023195769A1 - Method for extracting similar patent documents using a neural network model, and apparatus for providing same - Google Patents
- Publication number
- WO2023195769A1 (PCT/KR2023/004594)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- similarity
- embedding
- patent document
- similar
- model
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/11—Patent retrieval
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/912—Applications of a database
- Y10S707/923—Intellectual property
- Y10S707/93—Intellectual property intellectual property analysis
- Y10S707/931—Patent comparison
Definitions
- the present invention relates to a method and device for extracting similar patent documents using a neural network model.
- natural language processing (NLP) methods are being designed to analyze structured and unstructured documents on the basis of embeddings (Korean Registered Patent Publication 10-2342055 (2021.12.17)).
- the purpose of the present invention is to provide a patent analysis method using a neural network model trained on patent data and patent judgment data from the Korean Intellectual Property Office or the courts.
- the purpose of the present invention is to provide a method that reduces the cost of patent analysis and supports faster decision-making by having a neural network model learn the characteristics of the work performed by patent experts.
- the purpose of the present invention is to provide a method for identifying prior art, determining the invalidity of a patent, or determining the possibility of patent infringement by a technology, by providing analysis results of patent documents similar to the technology specified in the queried patent document.
- a method for calculating similarity between patent documents based on embedding vectors according to the present invention includes receiving first and second embedding vectors of each of the patent documents, and calculating similarity between the patent documents based on the first and second embedding vectors, wherein the first and second embedding vectors are embedded using first and second embedding models, respectively, and the first embedding model preferably generates the first embedding vector based on a greater number of tokens than the maximum token count of the second embedding model.
- preferably, TL1 > N * TL2, where N is a natural number of 2 or more, TL1 is the maximum token count of the first embedding model, and TL2 is the maximum token count of the second embedding model.
- the patent documents include a first patent document that serves as a standard for determining similarity
- the second embedding vectors include a 2-1 embedding vector that embeds claims of the first patent document.
- the claims of the first patent document include a plurality of components, and the 2-1 embedding vector is an embedding vector for each component that is embedded based on at least one of the components.
- the patent documents include a second patent document that is subject to similarity judgment with the first patent document, and the second embedding vectors include a plurality of 2-2 embedding vectors each embedding sentences of the second patent document.
- in the step of calculating the similarity, it is preferable to identify, among the plurality of 2-2 embedding vectors, the embedding vector most similar to the embedding vector for each component.
- the step of calculating the similarity preferably uses a similarity judgment model trained with a first label value for documents similar to the patent document and a second label value for documents partially similar or not similar to the patent document, where the second label value is smaller than the first label value.
- the embedding vector-based similar patent document search method according to the present invention includes: calculating a first similarity based on the first embedding vectors of the query patent document and of each patent document in a first candidate patent document list satisfying the search conditions; extracting a second candidate patent document list from the first candidate patent document list using the calculated first similarity; calculating a second similarity based on the second embedding vectors of the query patent document and of each patent document in the second candidate patent document list; and providing similar patent documents from the second candidate patent document list using the calculated second similarity, wherein the first and second embedding vectors are embedded using the first and second embedding models, respectively.
- the first embedding model generates the first embedding vector based on a greater number of tokens than the maximum token count of the second embedding model.
- the claims of the query patent document include a plurality of components, and the 2-1 embedding vector of the query patent document is an embedding vector embedded based on at least one of the components.
- the step of calculating the second similarity preferably calculates the similarity by comparing the 2-1 embedding vector with a plurality of 2-2 embedding vectors that each embed sentences of the patent documents in the second candidate patent document list.
- a method of training a similarity judgment model for calculating the similarity of patent documents according to the present invention includes providing a first sample and a second sample to the similarity judgment model, and training the similarity judgment model by comparing the label values of the first and second samples with the similarity values output for the first and second samples, wherein the first sample includes target patent document information about a target patent document, first document information about a document similar to the target patent, and a first label value; the second sample includes the target patent document information, second document information about a patent document partially similar or not similar to the target patent, and a second label value; and the second label value is preferably smaller than the first label value.
- the second sample includes a 2-1 sample, the 2-1 sample includes the target patent document information, third document information about an arbitrary patent document belonging to the same technical field as the target patent, and a third label value, and the third label value is preferably smaller than the second label value.
- the second sample includes a 2-2 sample, the 2-2 sample includes the target patent document information, fourth document information about an arbitrary patent document belonging to a technical field different from that of the target patent, and a fourth label value, and the fourth label value is preferably smaller than the third label value.
- the non-similar patent document is a document that appears in the prosecution history of the target patent but is not cited as the basis for a specific ground of rejection or invalidation of the target patent.
- the similar patent document is at least one of the cited documents relied on by the Korean Intellectual Property Office, a trial board, or a court as the basis for rejecting or invalidating the target patent.
- the grounds for rejection or invalidation are preferably at least one of lack of novelty, violation of the first-to-file rule or the expanded first-to-file rule, and violation of 35 U.S.C. § 102.
- the partially similar patent documents are documents cited when the target patent is rejected or invalidated for lack of inventive step.
- the target patent document information is preferably generated based only on one or more independent claims of the target patent.
- the cost of patent analysis can be reduced and faster decision-making supported by having a neural network model learn patent documents and judgment data about patent documents.
- the learned model can be used to investigate the background technology of a newly developed technology or provide results of examining patent literature that may invalidate the target patent.
- FIG. 1 is a conceptual diagram showing a similar patent document extraction service according to an embodiment of the present invention.
- Figure 2 is a flowchart showing a method for extracting similar patent documents according to an embodiment of the present invention.
- Figure 3 is a block diagram showing the structure of a patent document database according to an embodiment of the present invention.
- FIGS. 4 to 8 are block diagrams showing the structure of a patent embedding model according to an embodiment of the present invention.
- Figure 9 is a flowchart showing a method for extracting similar patent documents according to an embodiment of the present invention.
- FIGS. 10 to 14 are block diagrams showing the structure of a patent similarity determination model according to an embodiment of the present invention.
- Figure 15 is a flowchart showing a method of learning a similarity judgment model according to an embodiment of the present invention.
- Figure 16 is a block diagram showing the hardware configuration of a similar patent document extraction server according to an embodiment of the present invention.
- FIG. 1 is a diagram illustrating a system for calculating similarity between patent documents according to an embodiment of the present invention.
- the system may consist of a user 10 who enters query information and a server 300 that extracts and provides similar patent document information corresponding to the query information entered by the user 10.
- the server 300 may operate including a database (DB) that manages patent documents and a neural network model 200 learned through the DB.
- the neural network model 200 may have a dual structure consisting of an embedding model that extracts the meanings inherent in patent documents and a similarity judgment model that calculates similarity from the embeddings output by the embedding model.
- the neural network model 200 obtains an embedding vector corresponding to a patent document from a database (DB) that stores and manages embedding vectors pre-computed by the embedding model for the patent documents, and the similarity judgment model calculates similarity based on that embedding vector.
- the embedding vector for each patent document can be computed in advance and stored in the database, so the similarity judgment model can retrieve it from the DB without generating an embedding vector at every training and inference step, saving server resources and reducing latency.
- the embedding vectors of the query patent document and prior patent documents extracted from the query information may be calculated from the embedding model in real time (on-the-fly) and input into the similarity model.
- an embedding vector can be extracted by performing a real-time preprocessing step that divides the query patent documents or prior patent documents as needed (e.g., splitting them into sentences) and inputting the preprocessed tokens into the embedding model. Additionally, upper-layer embedding vectors can be extracted by re-integrating the embedding vectors at the document or paragraph level.
- the embedding model flexibly processes and embeds patent documents in real time according to the characteristics of the input data, the user's intention, or the required input of the similarity model, and the similarity model calculates the similarity on that basis; the neural network model can therefore exhibit adaptive performance.
- this embodiment illustrates a method of extracting similar prior patent documents based on the query patent document
- prior patent documents are interpreted as prior documents in a broad sense and can include non-patent documents, such as papers or technical material published at academic societies or on preprint archives (arXiv), as well as various text documents posted on web communities such as GitHub. The various technical data collected can therefore be used to extract documents similar to the query patent document, by treating each item as a prior art document according to its date or time of publication and managing it in the database described later.
- the server 300 may extract query patent document information from query information input by the user (S100).
- query patent document information is an identification value of the patent document that serves as the standard for extracting similar patent documents, and may include information identifying the patent, such as its application, publication, or registration number, or the title of the invention.
- the query information may also include, as patent search conditions, date information such as the priority date (filing date) of the query patent, or information about the applicant, inventor, or right holder; the server 300 can thus set search conditions, for example searching only for patents published before that date.
- a patent classification code representing the technical field of the invention, such as IPC (International Patent Classification) or CPC (Cooperative Patent Classification), can also be entered as a search condition, so that similar patent documents are searched under that condition or compared first.
- the server 300 extracts prior patent documents to calculate the degree of similarity with the query patent document using the identification value and search conditions of the patent document extracted from the input query information.
- the server 300 receives document embedding vectors and sentence embedding vectors of each of the extracted patent documents (S200).
- the document embedding vector is a value that embeds the meaning of the entire patent document and may have a unique value for each document.
- it may be an embedding vector obtained by inputting the entire patent document, including the abstract, detailed description, and claims, into an embedding model.
- a sentence embedding vector is a vector that embeds the meaning of each sub-document unit that makes up the patent document, and can be generated with multiple values depending on the size or composition of the patent document.
- a sentence embedding vector may be an embedding vector obtained by inputting not only a sentence or paragraph, but also text in units smaller than a sentence or units larger than a paragraph into the embedding model.
- the server 300 calculates a global similarity from the document embedding vector of the query patent document and the document embedding vectors of prior patent documents, and at the same time allows a sentence-level similarity between the individual elements of the patent documents to be calculated based on the sentence embedding vectors.
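The two levels of comparison can be sketched as follows: a minimal Python example, assuming cosine similarity as the metric and numpy arrays as the embedding vectors (both are assumptions; the source does not fix either choice).

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def global_similarity(query_doc_vec: np.ndarray, prior_doc_vec: np.ndarray) -> float:
    # Document-level ("global") similarity from the first embedding vectors.
    return cosine(query_doc_vec, prior_doc_vec)

def sentence_level_similarity(query_sent_vecs, prior_sent_vecs) -> float:
    # For each query sentence vector, keep its best-matching prior sentence,
    # then average the per-sentence maxima into one sentence-level score.
    sims = [max(cosine(q, p) for p in prior_sent_vecs) for q in query_sent_vecs]
    return float(np.mean(sims))
```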
- the server 300 may include a patent document DB 312 for extracting prior patent documents in text form, a first embedding vector DB 314 that manages document-level first embedding vectors for each patent, and a second embedding vector DB 316 that manages sentence-level second embedding vectors for each patent.
- the database 310 may thus be composed of the patent document DB 312, the first embedding vector DB 314, and the second embedding vector DB 316.
- the database 310 can manage the patent's identification information as a unique index for the values in each DB, so each value corresponding to the patent identification information in the input query information can be extracted and used to determine similarity.
- identification information or text of specifications of prior patent documents to be compared with the query patent document can also be extracted from the patent document DB 312, and the patent application numbers of the prior patents are used as identification information of the prior patent documents.
- the first embedding vector of the prior patent documents can be extracted from the first embedding vector DB 314, or the second embedding vector can be extracted from the second embedding vector DB 316.
- the above embedding vectors can be generated in advance through an embedding model and managed in the DB as described above, or it is also possible to calculate the embedding vectors by dividing the text for each patent document from the patent document DB according to the required input format in real time.
- the first embedding model 322, which outputs the first embedding vector of a patent document as a single unit, can vectorize the features inherent in the entire input patent document by embedding them as values in a feature space that defines the features of patent documents. More specifically, the first embedding model 322 may determine the values of the embedding vectors so that the more similar two patent documents are, the closer their embedding vectors are located in the feature space.
- the second embedding model 324, which outputs the second embedding vectors of a patent document in sub-document units (e.g., sentence units), takes as input sentence texts obtained by dividing the patent document into individual sentences and can vectorize the patent document sentence by sentence. That is, the second embedding model 324 can extract a plurality of vector values for one patent document and enable a more detailed similarity judgment between the query patent document and a prior patent document based on those vectors.
- a preprocessing step can divide the patent document into unit texts in sub-document units that the second embedding model 324 can process, and the features of the preprocessed patent sentences can then be vectorized as values in the feature space in which they are defined.
- the preprocessing can split the text according to predetermined rules (e.g., the positions of periods (.) or semicolons (;)) or semantically.
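A minimal sketch of such a rule-based splitter, assuming the period/semicolon rule mentioned above (the function name and regex are illustrative, not from the source):

```python
import re

def split_into_units(text: str) -> list[str]:
    """Rule-based preprocessing: split a specification into sub-document
    units at periods or semicolons followed by whitespace."""
    parts = re.split(r"(?<=[.;])\s+", text)
    return [p.strip() for p in parts if p.strip()]

claim = "A device comprising: a processor; a memory coupled to the processor. The memory stores instructions."
print(split_into_units(claim))
# ['A device comprising: a processor;', 'a memory coupled to the processor.',
#  'The memory stores instructions.']
```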
- the embedding process according to this embodiment follows the hierarchical structure of a patent document, in which a document is composed of paragraphs and paragraphs are composed of sentences sharing a common technical idea, so the model for extracting the embedding vectors can also be implemented with a hierarchical structure.
- the first embedding model 322 may be configured to extract a high-dimensional first embedding vector by integrating the output of the second embedding model 324 for sentences in the patent document.
- patent document texts can be divided into a plurality of first strings in sub-document (e.g., sentence) units through a segmentation preprocessing step, and each of the plurality of first strings may be input to the second embedding model 324, which performs the embedding.
- the second embedding model 324 which performs sentence-level embedding, may output each second embedding vector for the input first string.
- the output second embedding vectors are integrated through an encoder (e.g., a Transformer or another type of neural network), and the encoder can in turn output a first embedding vector that condenses the plurality of second embedding vectors.
- the second embedding vectors input to the encoder may additionally include position information within the document, so that the meaning inherent in the context of the entire document is not lost and the relationships between sentences are considered; the encoder can then output a first embedding vector expressing the meaning inherent in the values of the second embedding vectors and their mutual relationships.
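A rough PyTorch sketch of this upper-layer encoder, assuming a Transformer encoder with learned sentence-position embeddings and mean pooling (dimensions, layer counts, and the pooling choice are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DocumentEncoder(nn.Module):
    """Aggregates sentence-level (second) embedding vectors, augmented with
    their positions in the document, into one document-level (first) vector."""

    def __init__(self, dim: int = 768, n_heads: int = 8, n_layers: int = 2, max_sents: int = 256):
        super().__init__()
        self.pos = nn.Embedding(max_sents, dim)  # sentence position in the document
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, sent_vecs: torch.Tensor) -> torch.Tensor:
        # sent_vecs: (batch, n_sentences, dim) second embedding vectors
        idx = torch.arange(sent_vecs.size(1), device=sent_vecs.device)
        h = self.encoder(sent_vecs + self.pos(idx))
        return h.mean(dim=1)  # pool into one first embedding vector per document
```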
- the second embedding model 324, which vectorizes and outputs sentence-level meaning, may be configured to output the second embedding vector using a third embedding model 326 that extracts information from a plurality of second strings in still smaller sub-document units (e.g., words or sub-word tokens) and thereby captures the meaning of the first string.
- the first string extracted from the patent document can be further divided into a plurality of second strings, and the second strings can be input to the third embedding model 326, which outputs embedding vectors in word units, for example, to produce the respective third embedding vectors.
- each third embedding vector carries the position information of its second string within the first string, so through the position information and the unique value of each vector, the encoder can output a second embedding vector that condenses the meaning of the sentence.
- the embedding models according to this embodiment may include, corresponding to the hierarchical structure of the document, a first embedding model that outputs the first embedding vector for the patent document at the highest (document) level, a second embedding model that outputs second embedding vectors for text in smaller units (e.g., paragraphs), and so on down to an N-th embedding model that outputs embedding vectors for text in still smaller units (e.g., sentences).
- the neural network model of FIG. 8 can embed text at the word, sentence, and paragraph level through a hierarchical structure.
- the first and second embedding models are similar in terms of structure, but there is a difference in the size of text that can be input.
- the first embedding model 322 embeds text longer than a paragraph, and can generate the first embedding vector based on a number of tokens greater than the maximum token count of the second embedding model 324.
- for example, if the second embedding model is a base BERT (Bidirectional Encoder Representations from Transformers) model with a maximum token count of 512, N sentences of up to 512 tokens each can be input to the second embedding model to obtain N second embedding vectors, which can in turn be input to the first embedding model. In this case, the maximum token counts of the first and second embedding models can be N * 512 and 512, respectively (N is an integer of 2 or more).
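A minimal sketch of the implied chunking step, assuming simple fixed-length windows over token ids (the function is hypothetical; the source only fixes the 512-token budget per chunk):

```python
def chunk_token_ids(token_ids: list[int], chunk_len: int = 512) -> list[list[int]]:
    """Split a long document's token ids into chunks of at most chunk_len
    tokens, so a base BERT-style second embedding model can embed each
    chunk; the chunk vectors then feed the first embedding model."""
    return [token_ids[i:i + chunk_len] for i in range(0, len(token_ids), chunk_len)]

ids = list(range(1300))        # stand-in for a tokenized patent specification
chunks = chunk_token_ids(ids)  # -> 3 chunks of 512, 512, and 276 tokens
```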
- each embedding model can be trained so that the more similar the contents of the input texts are, the more similar the embedded vector values become, and by propagating the training results of the lower layers to the upper layers along the hierarchical structure, the models can learn to identify the meaning of the upper layers more accurately.
- the first embedding model may be independently learned and provided without including the second embedding model.
- the first embedding model may be a BERT-based model with a large maximum token count (e.g., 4096), such as Longformer or BigBird.
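A hedged sketch of such a long-input first embedding model, assuming the public allenai/longformer-base-4096 checkpoint from Hugging Face transformers and mean pooling as one plausible way to obtain a single document vector:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

def embed_document(text: str) -> torch.Tensor:
    """Embed a whole patent document (up to 4096 tokens) in one pass."""
    inputs = tok(text, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Mean-pool the token states into a single document embedding.
    return out.last_hidden_state.mean(dim=1).squeeze(0)
```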
- the server 300 calculates the degree of similarity between patent documents based on the first and second embedding vectors (S300).
- a first candidate patent document list is generated from the prior patent documents that satisfy the search conditions in the input query information, and a first similarity is calculated based on the first embedding vectors of the query patent document and of each patent document in the first candidate patent document list (S1000).
- the similarity between each document can be calculated based on the first embedding vector of the document unit described above.
- the first similarity determination model 332 extracts the first embedding vectors of the query patent document and of the plurality of prior patent documents in the first candidate patent list from the first embedding vector DB 314, and determines the similarity between two documents based on the first embedding vector set composed of them.
- the first similarity judgment model 332 can determine the similarity between two documents based on at least one of the Euclidean distance, the Manhattan distance, the Mahalanobis distance, and the correlation coefficient between the first embedding vectors of the query patent document and a prior document.
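These candidate measures can be computed directly, for example with scipy; which one (or which combination) the first similarity judgment model actually uses is left open, so the sketch below simply reports all of them:

```python
import numpy as np
from scipy.spatial.distance import cityblock, euclidean, mahalanobis

def first_similarity_scores(q: np.ndarray, p: np.ndarray, cov_inv=None) -> dict:
    """Distance/association measures between two first embedding vectors."""
    scores = {
        "euclidean": euclidean(q, p),
        "manhattan": cityblock(q, p),
        "correlation": float(np.corrcoef(q, p)[0, 1]),
    }
    if cov_inv is not None:  # Mahalanobis needs an inverse covariance estimate
        scores["mahalanobis"] = mahalanobis(q, p, cov_inv)
    return scores
```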
- the similarity determination model can implement a model network that determines similarity sequentially, using the embedding vectors extracted hierarchically for the document-sentence structure, and calculate the similarity as described above.
- a second candidate patent list can be extracted (S2000) as the set of prior patent documents whose first similarity to the query patent document exceeds a threshold value, selected from the first candidate patent document list extracted from the patent document DB 312 according to the search conditions in the query information.
- the second embedding vector for the corresponding patent can be extracted from the second embedding vector DB 316 in order to calculate the sentence-level similarity between the query document and the prior patent documents in the second candidate patent list.
- a second embedding vector set consisting of the second embedding vectors of the query patent document and of the prior patents in the second candidate patent list is provided to the second similarity judgment model 334, and the second similarity judgment model 334 calculates a second similarity based on the embedding vectors extracted from the sentences of the query patent document and the sentence-level second embedding vectors of the prior patents (S3000).
- the second embedding vector of the query patent document used in this embodiment may be an embedding vector for each component that divides the claims of the query patent document into each component.
- the claims of the query patent document are extracted from the patent document DB 312 and input into the second embedding model 324, which extracts sentence-level embedding vectors for the extracted claims.
- the second embedding model 324 can extract the 2-1 embedding vectors by dividing the claims of the query patent document into sentences.
- the second similarity judgment model 334 can calculate the second similarity based on an embedding vector set composed of the 2-1 embedding vectors for the claims of the query patent document and the 2-2 embedding vectors for each sentence of the prior patent documents in the second candidate patent list.
- the second similarity determination model 334 can identify, among the plurality of 2-2 embedding vectors, the embedding vector most similar to the 2-1 embedding vector for a claim, and take the similarity to that most similar embedding vector as the final similarity.
- it is also possible to determine the final similarity as the average of the similarities of the embedding vectors whose similarity exceeds a threshold input by the user.
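A small sketch of both options, assuming cosine similarity (an assumption); the returned index doubles as the mapping information described next:

```python
import numpy as np

def component_similarity(comp_vec, sent_vecs, threshold=None):
    """Compare one claim-component (2-1) vector against all sentence (2-2)
    vectors of a prior document. Default: the best match is the final
    similarity; with a user threshold, average all matches above it."""
    sims = np.array([
        float(v @ comp_vec / (np.linalg.norm(v) * np.linalg.norm(comp_vec)))
        for v in sent_vecs
    ])
    best = int(sims.argmax())  # sentence index, reusable as mapping information
    if threshold is not None and (sims > threshold).any():
        return float(sims[sims > threshold].mean()), best
    return float(sims[best]), best
```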
- the second similarity judgment model 334 can also return, as mapping information, the location within the prior patent document of the sentence corresponding to the embedding vector used in calculating the final similarity, allowing the user to directly check the text on which the similarity judgment is based.
- it is also possible to additionally use a third similarity judgment model 336, which determines a third similarity based on keywords defined as specific words or sentences in the query patent document, their weights, and the number of appearances of those keywords in each prior patent document in the second candidate patent list.
- keywords can be set in units of not only words but also phrases or clauses composed of two or more words.
- the unit in which appearances of a set keyword are counted can also be set to a sentence, or to a paragraph consisting of sentences.
- the third similarity judgment model 336 takes as input the N keywords of the query patent with their weights and the keyword sets of the n candidate patents in the second candidate patent list, and calculates a third degree of similarity based on the weights and the number of appearances, in each candidate document, of the candidate patent keywords corresponding to the keywords of the query patent.
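A minimal sketch of such a keyword score, assuming a simple weighted occurrence count over lowercased text (the linear weighting and the dictionary format are illustrative assumptions):

```python
def third_similarity(keywords: dict[str, float], prior_text: str) -> float:
    """Weighted occurrence count of the query patent's keywords (words,
    phrases, or clauses) in one prior document's text."""
    text = prior_text.lower()
    return sum(w * text.count(kw.lower()) for kw, w in keywords.items())

score = third_similarity(
    {"embedding vector": 2.0, "neural network": 1.0},
    "An embedding vector is produced by the neural network; the neural network ...",
)  # 2.0 * 1 + 1.0 * 2 = 4.0
```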
- the second similarity calculated from the second embedding vectors of each prior document and the third similarity calculated from the keyword appearance counts in each prior patent document can both be reflected in the similarity judgment to finally generate a third candidate patent list.
- similar patent documents in the third candidate patent list can be provided to the user.
- it is also possible to correct the second similarity by reflecting, in the second similarity value, the weight set by the user for each keyword of a sentence of the query patent document, and to provide patent documents matching the user's intent as the third candidate patent list based on the corrected second similarity.
- similar patent document information may include mapping information that serves as the basis for determining similarity in addition to the identification value of the similar patent document.
- mapping information is information about sentences similar to the sentences of the query patent document; for example, it may include the location within the prior patent document of the one or N sentences most similar to the first component of a claim of the query patent document, together with the actual sentence content and the similarity judgment information.
- the similarity judgment model that extracts similar patent documents through a sequential extraction process can also be implemented as an integrated similarity model.
- one neural network can receive a set of sentence embedding vectors and a set of document embedding vectors of two patent documents and determine the similarity between the two documents based on these.
- it is also possible to perform the similarity calculations for the first and second embedding vector sets of the query patent document and a prior patent document in parallel, calculate the final similarity by averaging or weighting the individual similarity judgment results, and provide the similar patent documents to users.
- the similarity judgment model implemented as a single integrated model can also be implemented so that the similarity is calculated by inputting the actual text of both patent documents simultaneously, in cross-encoding form.
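A cross-encoding sketch using Hugging Face transformers, assuming a generic multilingual BERT checkpoint with a single-output regression head as a placeholder; in practice the model would be fine-tuned on the labeled pairs described below:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=1  # single similarity output
)

def cross_encode_similarity(query_text: str, prior_text: str) -> float:
    # Both texts enter the model together, so attention can cross documents.
    inputs = tok(query_text, prior_text, truncation=True, max_length=512,
                 return_tensors="pt")
    with torch.no_grad():
        return float(model(**inputs).logits.squeeze())
```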
- the similarity judgment model can not only use the embedding vector values directly, determining similarity through vector operations defined in the feature space such as the dot product or the Euclidean distance between vectors, but can also input the embedding vectors into a trained neural network to calculate a more accurate similarity.
- the similarity judgment model can be trained using data labeled with the similarity between embedding vectors; in this embodiment, the similarity judgment model is trained by contrastive learning on pairs of training data labeled with relative similarity values.
- a positive sample judged similar to a reference patent document and a negative sample judged dissimilar can be used as training data in pairs. That is, a positive sample with a first label value is created for a document similar to the patent document, and a negative sample with a second label value for a document partially similar or not similar to the patent document, where the first label value is greater than the second label value.
- the label value and sample are provided to the similarity judgment model (S20), and the similarity judgment model can learn based on the positive and negative samples (S30).
- the similarity judgment model determines similarity based on the embedding vectors of the patent document and the compared document; its internal layers can be trained so that the similarity calculated for a similar document has a small error with respect to the first label value, and the similarity calculated for a non-similar document has a small error with respect to the second label value.
- samples can be divided into grades in stages, and the internal layers of the similarity judgment model can be trained by configuring label values sequentially for each grade.
- depending on the similarity between the two documents/sentences, samples can be divided into four labels: positive, plus negative classes such as hard-negative, negative, and extreme-negative.
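One way to realize this graded scheme is to regress the model's output toward per-class label values, as in this sketch (the numeric labels and the MSE loss are illustrative assumptions; the source only requires the ordering of the label values):

```python
import torch
import torch.nn as nn

# Graded label values for the four sample classes; only their ordering
# (positive > hard-negative > negative > extreme-negative) comes from the text.
LABELS = {"positive": 1.0, "hard_negative": 0.5, "negative": 0.25, "extreme_negative": 0.0}

def train_step(model: nn.Module, optimizer, q_vec, cand_vec, sample_class: str) -> float:
    """One step: pull the model's similarity output toward the graded label.
    model is assumed to map two embedding vectors to a scalar of shape (1,)."""
    optimizer.zero_grad()
    pred = model(q_vec, cand_vec)
    target = torch.tensor([LABELS[sample_class]])
    loss = nn.functional.mse_loss(pred, target)
    loss.backward()
    optimizer.step()
    return float(loss)
```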
- a positive sample is at least one of the cited documents relied on by the Korean Intellectual Property Office, a trial board, or a court as grounds for rejecting or invalidating a patent, and may be a prior patent document cited in a judgment of identity (novelty).
- when the grounds for rejection or invalidation in the prosecution history are lack of novelty, violation of the first-to-file rule or the expanded first-to-file rule, or violation of 35 U.S.C. § 102, the cited prior literature can be turned into a positive sample.
- among the negative samples, partially similar patent documents correspond to hard-negative samples and may be documents cited when the target patent is rejected or invalidated for lack of inventive step.
- negative samples may also be documents that were not directly cited in an actual ground of rejection/invalidation but were referenced by the applicant, for example through an Information Disclosure Statement (IDS); more broadly, any patent literature in the same technical field, identified for example by IPC code, can be used as a sample.
- the server 300 may be implemented in the form of a computing device.
- each module constituting the server 300 may be implemented on a general-purpose computing processor and may include a processor 302, an input/output (I/O) device 304, a memory device 306, an interface 308, storage 312, and a bus 314.
- the processor 302, input/output (I/O) device 304, memory device 306, and/or interface 308 may be coupled to each other through the bus 314.
- the bus 314 corresponds to a path along which data moves.
- the processor 302 may include a Central Processing Unit (CPU), a Micro Processor Unit (MPU), a Micro Controller Unit (MCU), a Graphic Processing Unit (GPU), a microprocessor, a digital signal processor, a microcontroller, an Application Processor (AP), and logic elements capable of performing similar functions.
- the input/output I/O device 304 may include at least one of a keypad, keyboard, touch screen, and display device.
- the memory device 306 may store data and/or programs.
- the interface 308 may perform the function of transmitting data to or receiving data from a communication network.
- Interface 308 may be wired or wireless.
- the interface 308 may include an antenna or a wired or wireless transceiver.
- the memory device 306 is an operating memory for improving the operation of the processor 302 and may further include high-speed DRAM and/or SRAM.
- internal storage 312 stores programming and data configurations that provide the functionality of some or all of the modules described herein; for example, it may include logic to perform selected aspects of the similarity determination method described above.
- the memory device 306 loads from the storage 312 a program or application comprising a set of instructions for each step of the similarity determination method described above, and allows the processor to perform each step.
- the cost of patent analysis can be reduced and more rapid decision-making supported by having the neural network model learn patent documents and judgment data about the patent documents.
- the learned model can be used to investigate the background technology of a newly developed technology or provide results of a search for patent literature that may invalidate the target patent.
- the various embodiments described herein may be implemented in a recording medium readable by a computer or a similar device, for example using software, hardware, or a combination thereof.
- the embodiments described herein may be implemented using at least one of application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and other electrical units for performing functions. In some cases, the embodiments described herein may be implemented as the control module itself.
- embodiments such as procedures and functions described in this specification may be implemented as separate software modules.
- Each of the software modules may perform one or more functions and operations described herein.
- Software code can be implemented as a software application written in an appropriate programming language.
- the software code may be stored in a memory module and executed by a control module.
Abstract
The present invention relates to a method and apparatus for extracting similar patent documents using a neural network model. An embedding-vector-based method for calculating similarity between patent documents according to the present invention comprises the steps of: receiving first and second embedding vectors of each of the patent documents; and calculating the similarity between the patent documents based on the first and second embedding vectors, wherein the first and second embedding vectors are embedded using first and second embedding models, respectively, and the first embedding model preferably generates the first embedding vector based on a number of tokens greater than the maximum token count of the second embedding model. According to the present invention, a neural network model is trained with patent documents and judgment data about the patent documents, so the costs required for patent analysis are reduced and faster decision-making can be supported.
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR20220042194 | 2022-04-05 | | |
KR10-2022-0042213 | 2022-04-05 | | |
KR10-2022-0042194 | 2022-04-05 | | |
KR20220042213 | 2022-04-05 | | |
KR10-2022-0056072 | 2022-05-06 | | |
KR1020220056072A KR102606352B1 (ko) | 2022-04-05 | 2022-05-06 | Method for extracting similar patent documents using a neural network model and apparatus for providing same |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023195769A1 (fr) | 2023-10-12 |
Family
ID=88243215
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2023/004594 WO2023195769A1 (fr) | 2023-04-05 | Method for extracting similar patent documents using a neural network model, and apparatus for providing same |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR20230163983A (fr) |
WO (1) | WO2023195769A1 (fr) |
- 2023-04-05: WO application PCT/KR2023/004594 filed, published as WO2023195769A1
- 2023-11-21: KR application KR1020230162664 filed, published as KR20230163983A (active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2019040402A (ja) * | 2017-08-25 | 2019-03-14 | 和之 白井 | Patent requirement suitability prediction device and patent requirement suitability prediction program |
KR20200106108A (ko) * | 2019-02-25 | 2020-09-11 | 이진원 | Deep-learning-based patent information word embedding method and system |
KR102330190B1 (ko) * | 2019-07-02 | 2021-11-23 | 국민대학교산학협력단 | Apparatus and method for multi-vector document embedding through semantic decomposition of complex documents |
KR20210053539A (ko) * | 2019-11-04 | 2021-05-12 | 한국전자통신연구원 | Patent novelty determination system and method |
KR20210039917A (ko) * | 2020-03-20 | 2021-04-12 | (주)디앤아이파비스 | Method, apparatus, and system for determining the similarity of patent documents using an artificial intelligence model |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117316371A (zh) * | 2023-11-29 | 2023-12-29 | 杭州未名信科科技有限公司 | Method and apparatus for generating a case report form, electronic device, and storage medium |
CN117316371B (zh) * | 2023-11-29 | 2024-04-16 | 杭州未名信科科技有限公司 | Method and apparatus for generating a case report form, electronic device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
KR20230163983A (ko) | 2023-12-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Dehkharghani et al. | SentiTurkNet: a Turkish polarity lexicon for sentiment analysis | |
JP5356197B2 (ja) | Word semantic relation extraction device | |
WO2010036013A2 (fr) | Apparatus and method for extracting and analyzing opinions in web documents | |
WO2021100902A1 (fr) | Dialogue system response method based on sentence paraphrase recognition | |
WO2023195769A1 (fr) | Method for extracting similar patent documents using a neural network model, and apparatus for providing same | |
CN113157859B (zh) | Event detection method based on hypernym concept information | |
WO2013002436A1 (fr) | Method and device for ontology-based document classification | |
CN115858758A (zh) | Intelligent customer-service knowledge graph system with recognition of multiple types of unstructured data | |
WO2019098454A1 (fr) | Technique for generating and utilizing a virtual fingerprint representing text data | |
Golshan et al. | A study of recent contributions on information extraction | |
Zheng et al. | Dynamic knowledge-base alignment for coreference resolution | |
WO2018131955A1 (fr) | Method for analyzing digital contents | |
WO2021107449A1 (fr) | Method for providing a knowledge-graph-based marketing information analysis service using transliterated neologism conversion, and apparatus therefor | |
WO2022191368A1 (fr) | Data processing method and device for training a neural network that categorizes natural language intent | |
Gupta et al. | Designing and development of stemmer of Dogri using unsupervised learning | |
CN112052424A (zh) | Content review method and device | |
Zhong et al. | Domain-specific language models pre-trained on construction management systems corpora | |
WO2021107445A1 (fr) | Method for providing a newly coined word information service based on a knowledge graph and country-specific transliteration conversion, and apparatus therefor | |
WO2024071568A1 (fr) | Product marketing method based on customer preference prediction | |
CN117669543A (zh) | AI-based mind map generation method | |
WO2024019226A1 (fr) | Method for detecting malicious URLs | |
Leung et al. | Counting protests in news articles: A dataset and semi-automated data collection pipeline | |
WO2022114325A1 (fr) | Query quality extraction device and question similarity analysis method in natural language conversation | |
Mei‐fang et al. | Product online review analysis using fuzzy ontology | |
CN115221868A (zh) | Requirement item segmentation method | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23784993 Country of ref document: EP Kind code of ref document: A1 |