CN113196278A - Method for training a natural language search system, search system and corresponding use


Info

Publication number
CN113196278A
CN113196278A (application number CN201980082811.1A)
Authority
CN
China
Prior art keywords
training
block
graph
natural language
machine learning
Prior art date
Legal status
Pending
Application number
CN201980082811.1A
Other languages
Chinese (zh)
Inventor
S·阿维拉
Current Assignee
Iprelli Technologies Ltd
Original Assignee
Iprelli Technologies Ltd
Priority date
Filing date
Publication date
Application filed by Iprelli Technologies Ltd filed Critical Iprelli Technologies Ltd
Publication of CN113196278A publication Critical patent/CN113196278A/en
Pending legal-status Critical Current


Classifications

    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06F40/154 Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets
    • G06F16/322 Indexing structures; Trees
    • G06F16/355 Class or cluster creation or modification
    • G06N20/00 Machine learning
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/08 Learning methods
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G06Q50/184 Intellectual property management
    • G06F2216/11 Patent retrieval

Abstract

The present invention provides a method and system for training a machine-learning-based patent search or novelty evaluation system. The method includes providing a plurality of patent documents, each having a computer-identifiable claim block and a full specification block, the full specification block including at least a portion of the specification of the patent document. The method also includes providing a machine learning model and training it using a training data set that includes data from the patent documents, to form the trained machine learning model. According to the invention, the training includes using a claim block and full specification block pair derived from the same patent document as a training case of the training data set.

Description

Method for training a natural language search system, search system and corresponding use
Technical Field
The present invention relates to natural language processing. In particular, the present invention relates to machine-learning-based, for example neural-network-based, systems and methods for retrieving, comparing, or analyzing documents containing natural language. The documents may be technical or scientific documents, and in particular patent documents.
Background
Comparison of written technical concepts is required in many areas of business, industry, economy and culture. A specific example is the examination of a patent application, where one purpose is to determine whether a technical concept defined in a claim of the patent application semantically encompasses another technical concept defined in another document.
Currently, there are more and more search tools available for finding individual documents, but the analysis and comparison of the concepts disclosed by documents remains largely manual work involving human inference of the meaning of words, sentences, and larger linguistic entities.
Scientific research around natural language processing has produced tools for automatically parsing language by computer. These tools may be used, for example, to tokenize text, perform part-of-speech tagging and entity recognition, and identify dependencies between words or entities.
Scientific work has also been done on automatically analyzing patents by extracting key concepts from documents, for example for text summarization and technology trend analysis purposes.
Recently, word embeddings using multi-dimensional word vectors have become an important tool for mapping the meaning of words into a form that can be processed by a digital computer. This approach may be used by neural networks, such as recurrent neural networks, to provide computers with a deeper understanding of the content of documents.
Conventionally, patent searching is performed using keyword searches, which involve defining the correct keywords and their synonyms, morphological variants and the like, and creating a Boolean search strategy. This is time consuming and requires expertise. More recently, semantic searches have also been developed; these are inherently fuzzy and may involve the use of artificial intelligence techniques, which have proven powerful, for example, in machine translation applications. Semantic searches help to quickly find a large number of documents that are somehow related to a concept discussed in another document. However, they are of relatively limited use in, for example, patent novelty searching, because in practice their ability to assess novelty, i.e. to find documents disclosing specific content that falls under the general concept defined in a patent claim, is limited.
In summary, the available techniques are well suited for general searching and for extracting core concepts from text, e.g. for text summarization. However, they are not well suited to making detailed comparisons between the concepts disclosed in different documents across vast amounts of data, which is crucial, for example, for patent novelty searches or other technical comparison purposes.
In particular, to enable more efficient search and novelty evaluation tools, improved techniques for text analysis and comparison are needed.
Disclosure of Invention
It is an object of the present invention to address at least some of the above problems and to provide a novel system and method for improving the accuracy of technical searches. A particular object is to provide a solution that helps an automated system to better assess the novelty of concepts disclosed in documents with respect to each other and to better take into account the technical relationships between them.
It is a particular object to provide an improved machine learning based retrieval system and a method of training such a system.
Particular objects include providing a patent retrieval or novelty evaluation system with improved accuracy, and providing new uses for publicly available patent data.
According to one aspect, the invention provides a method of training a machine-learning-based patent search or novelty assessment system, the method comprising providing a plurality of patent documents, each of the patent documents having a computer-identifiable claim block and a full specification block, the full specification block comprising at least a portion of the specification of the patent document. The method also includes providing a machine learning model and training the machine learning model using a training data set that includes data from the patent documents, for forming the trained machine learning model. According to the invention, the training includes using a claim block and full specification block pair derived from a single, i.e. the same, patent document as a training case of the training data set.
The machine learning model is preferably capable of embedding claim blocks and full specification blocks into vectors. The training case described above (i.e. training sample) is a positive training sample, whereby a learning objective of the model may be to minimize the vector angle between the claim block and the full specification block. Other positive training samples may be claim blocks and full specification blocks that do not originate from the same document but are associated with each other via a database reference. Another learning objective may be to maximize, or at least keep clearly non-zero, the vector angle between claim blocks and full specification blocks derived from at least some different documents that are not associated with each other in this manner; these form the negative training samples.
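By way of illustration only (this sketch is not part of the original disclosure), such a vector-angle objective can be expressed with a cosine embedding loss; the encoder, feature dimensions and batch layout below are assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical encoder that maps pre-computed text features of a claim
# block or a full specification block to a fixed-size embedding vector.
encoder = nn.Sequential(nn.Linear(768, 300), nn.Tanh())

# CosineEmbeddingLoss drives the angle between two vectors towards zero
# for positive pairs (target +1) and past a margin for negative pairs (-1).
loss_fn = nn.CosineEmbeddingLoss(margin=0.3)

def training_step(claim_feats: torch.Tensor,
                  spec_feats: torch.Tensor,
                  target: torch.Tensor) -> torch.Tensor:
    """claim_feats, spec_feats: (batch, 768); target: (batch,) of +1/-1."""
    return loss_fn(encoder(claim_feats), encoder(spec_feats), target)

# Stand-in data: one same-document (positive) and one unrelated (negative) pair.
loss = training_step(torch.randn(2, 768), torch.randn(2, 768),
                     torch.tensor([1.0, -1.0]))
print(loss.item())
```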
According to one aspect, there is provided a machine learning-based patent retrieval or novelty assessment system, comprising: a machine learning training subsystem adapted to read patent claim blocks and full specification blocks of patent documents and utilize the same as training data; and a machine learning search engine using the trained machine learning model for finding a subset of patent documents in a larger set of patent documents. In the present invention, the machine learning training subsystem is configured to use claim blocks and full specification block pairs derived from the same patent document as training cases of the training data set.
In one aspect, a natural language search system is provided that includes a digital data storage for storing a plurality of natural language blocks and data graphs corresponding to the blocks. A first data processing means is provided, adapted to convert said blocks into said graphs, the graphs being stored in said storage. Each graph contains a plurality of nodes, preferably consecutively nested nodes, each node containing as its node value a natural language unit, or a portion thereof, extracted from the block. Also provided is a second data processing means for executing a machine learning algorithm capable of traversing the graphs and reading the node values, for forming a trained machine learning model based on the node structure and node values of the graphs. A third data processing means is adapted to read a fresh graph, or a fresh natural language block that is converted into a fresh graph, and to utilize the machine learning model for determining the subset of natural language blocks based on the fresh graph. The first and second data processing means are part of the machine learning training subsystem described above. The third data processing means is the machine learning search engine described above.
The graph may in particular be a recursive tree graph having meronym relationships between the node values of consecutive nodes.
The method and system are preferably neural network based, whereby the machine learning model is a neural network model.
More specifically, the invention is characterized by what is stated in the independent claims.
The present invention provides significant benefits. Although the patent novelty search data and citation data provided by patent authorities and patent applicants can be used to train neural networks, they have the disadvantage that the quality of the data varies. In particular, only some of the novelty objections raised by patent authorities are in fact valid novelty objections. Nevertheless, all citations made by patent examiners end up in public records and patent databases, from which it is impossible to discern, without manual evaluation, which citations are truly relevant. This reduces the reliability of the publicly available training data. The invention guarantees at least one truly relevant training case for each claim, and in particular for one or more independent claims. Thus, the neural network can be trained more accurately to find relevant prior art documents.
The same-document training cases disclosed herein may be the only positive training cases used (positive: indicating relevant prior art), or novelty search data and/or citation data may also be used to form additional training cases.
The present approach is also compatible with advanced training schemes such as data augmentation (augmentation), as will be discussed in detail later. The combination of these methods provides particularly good training results.
All this contributes to a more targeted search and a more accurate automatic novelty evaluation, requiring less manual work.
Tree graphs with meronym edges are particularly beneficial because they can be modified quickly and safely while preserving the technical coherence and semantic logic inside the graph.
The dependent claims relate to selected embodiments of the invention.
Selected embodiments of the present invention and their advantages are discussed in more detail below with reference to the accompanying drawings.
Drawings
FIG. 1A illustrates a block diagram of an exemplary retrieval system on a general level.
FIG. 1B shows a block diagram of a more detailed embodiment of the retrieval system, which includes a series of neural network-based search engines and their trainers.
FIG. 1C shows a block diagram of a patent retrieval system, according to one embodiment.
FIG. 2A illustrates a block diagram of an exemplary nested graph with only meronym/holonym relationships.
FIG. 2B illustrates a block diagram of an exemplary nested graph having both meronym/holonym and hyponym/hypernym relationships.
FIG. 3 shows a flow diagram of an exemplary graph parsing algorithm.
Fig. 4A shows a block diagram of patent search neural network training using patent search/citation data as training data.
Fig. 4B shows a block diagram of neural network training using claim-description graph pairs derived from the same patent document as training data.
Figure 4C shows a block diagram of neural network training using an augmented claim graph set as training data.
FIG. 5 illustrates the functionality of an exemplary graph feed user interface, according to one embodiment.
Detailed Description
Definitions
A "natural language unit" in this context refers to a chunk of text (chunk), or a vector representation of the chunk of text after embedding. The chunks may be single word or multi-word sub-concepts that appear one or more times in the original text stored in computer-readable form. The natural language units may be presented as a set of character values (commonly referred to as a "string" in computer science), or numerically as multidimensional vector values, or references to such values.
A "natural language block" refers to a data instance (data instance) containing a linguistically meaningful combination of natural language units, e.g., one or more complete or incomplete sentences of a language such as english. The natural language blocks may be represented as, for example, a single string of characters and stored in a file system and/or displayed to a user via a user interface.
A "document" refers to a machine-readable entity containing natural language content and associated with a machine-readable document identifier that is unique with respect to other documents within the system.
"patent document" refers to the natural language content of a patent application or issued patent. The patent document is associated in the present system with a publication number and/or another machine-readable unique document identifier assigned by a recognized patent authority of another country or regional patent office, such as EPO, WIPO or USPTO or another country or region. The term "claim" refers to the basic content of a claim of a patent document, in particular an independent claim. The term "full specification" means the contents of a patent document that cover at least a portion of the specification of the patent document. The entire specification may also cover other portions of the patent document, such as the abstract or the claims. The claims and the full specification are examples of natural language blocks.
"claims" are defined herein as natural language blocks that will be considered by the european patent office as claims on the filing date of this patent application. In particular, a "claim" is a computer-recognizable block of a natural language document identified by a machine-readable integer number therein, for example preceded by (part of) related information in a string format and/or as a markup file format (such as xml or html format).
"full specification" is defined herein as a computer-recognizable natural language block, computer-recognizable within a patent document that also contains at least one claim, and contains at least one other portion of the document than the claim. In addition, the "full specification" may be identifiable by related information in a markup document format (such as xml or html format).
An "edge relationship" in this context may in particular be a technical relationship extracted from a block and/or a semantic relationship derived from semantics using a relevant natural language unit. In particular, the edge relationship may be
-partial word relationships (also: partial word/whole word relationships); partial words: x is a moiety of Y; the whole word: y has X as its own part; for example: "wheel" is a part of the term "automobile",
-hyponym relations (also: hyponym/hypernym relations); the hyponyms: x is the lower position of Y; the hypernyms: x is the upper position of Y; for example: the term "electric vehicle" is a subordinate term of "vehicle", or
-synonym relation: x is the same as Y.
In some embodiments, edge relationships are defined between successive nested nodes of the recursive graph, each node containing a natural language unit as a node value.
Other possible technical relationships include topic relationships, meaning the role of a sub-concept of the text relative to one or more other sub-concepts, in addition to the relationships described above. At least some topic relationships may be defined between consecutively nested units. In one example, the topic relationship of a parent unit is defined in a child unit. One example of a topic relationship is the role class "function". For example, the function of a "handle" may be to "allow manipulation of an object". Such a topic relationship may be stored as a sub-unit of the "handle" unit, with which the "function" role is associated. A topic relationship may also be a general relationship without a predefined classification (or with a general classification such as "relationship") that the user is free to define. For example, a general relationship between a handle and a cup may be "[handle] attached to [cup] with an adhesive". Such a topic relationship may be stored as a sub-unit of the "handle" unit or the "cup" unit, or of both, preferably with mutual references to each other.
A relationship element is considered to define a relationship in a particular category or subclass of relationships if the relationship element links to computer executable code that, when run by a data processor, produces a natural language block that includes the relationship in that category or subclass.
"graph" or "data graph" refers to an instance of data that follows a generally non-linear recursive and/or network data pattern. The system can simultaneously contain several different graphs that follow the same data pattern and whose data originates from and/or is related to different sources. Indeed, the graphics may be stored in any suitable text or binary format, which allows for recursively storing the data items and/or storing the data items as a network. Graphics are in particular semantic graphics and/or technical graphics (describing semantic relationships and/or technical relationships between node values), and not syntactic graphics (which only describes linguistic relationships between node values). The graphic may be a tree graphic. A forest-shaped graph comprising a plurality of trees is herein considered to be a tree-shaped graph. In particular, the graph may be a technical tree graph.
"data schema" refers to rules according to which data, particularly natural language units and data associated therewith, such as information of technical relationships between the units, are organized.
"nesting" of natural language elements refers to the ability of the elements to have one or more children and one or more parents, as determined by a data schema. In one example, the unit may have one or more children and only a single parent. The root cell has no parent and the leaf cell has no children. Sibling units have the same parent. "continuous nesting" refers to nesting between a parent element and its immediate child element.
A "recursive" nesting or data pattern refers to a nesting or data pattern that allows nesting of natural language units containing data items.
A "(natural language) symbol" refers to a word or block of words in a larger block of natural language. The symbols may also contain metadata related to the word or phrase block, such as part-of-speech (POS) tags (labels) or syntactic dependency labels. A "set" of natural language symbols refers in particular to symbols that may be grouped based on their text values, POS tags or dependency labels, or any combination thereof, according to predetermined rules or fuzzy logic.
The terms "data storage device", "processing device" and "user interface device" mainly refer to software devices, i.e. computer executable code (instructions) that may be stored on a non-transitory computer readable medium and adapted to perform specified functions, respectively, when executed by a processor, in other words storing digital data, allowing a user to interact with said data, and processing said data. All of these components of the system may be carried in software that is executed by a local computer or by a web server through a locally installed web browser, e.g., supported by suitable hardware for executing the software components. The methods described herein are computer-implemented methods.
Description of selected embodiments
The following describes a natural language search system that includes a digital data storage for storing a plurality of natural language blocks and data graphs corresponding to the blocks. The storage may include one or more local or cloud data stores and may be file-based or query-language-based.
The first data processing means is a converter unit adapted to convert said blocks into said graphs. Each graph contains a plurality of nodes, each node containing as its node value a natural language unit extracted from the block. Edges defined between pairs of nodes express technical relationships between the nodes. For example, the edges, or some of them, may define a meronym relationship between two nodes.
In some implementations, the number of nodes containing a particular natural language unit value is, for at least some units, less than the number of occurrences of that natural language unit in the corresponding natural language block. In other words, the graph is a condensed representation of the original text, which may be implemented using, for example, the symbol recognition and matching method described later. By allowing multiple children per node, the essential technical content (and optionally the semantic content) of the text can still be maintained in the graph representation. Condensed graphs are also processed efficiently by graph-based neural network algorithms, which can thereby learn the essential content of the text better and faster than from a direct text representation. This method has proven particularly powerful in the comparison of technical texts, and in particular in claim-based searching of full patent specifications and automatic evaluation of claim novelty.
In some embodiments, the number of nodes containing a particular natural language unit is one. In other words, there are no duplicate nodes. While this may simplify the original content of the text, it produces graphs, suitable for patent searching and novelty evaluation, that can be processed very efficiently and still remain relatively expressive, at least when tree graphs are used.
In some embodiments, the graph is such a condensed graph at least for the nouns and noun chunks found in the original text. In particular, the graph may be a condensed graph of noun-valued nodes arranged according to their meronym relationships. In an average patent document, many noun terms appear tens or even hundreds of times throughout the text. With the present scheme, the content of such documents can be compressed to a fraction of the original size while being made more tractable for machine learning.
In some embodiments, a term that appears multiple times in the original natural language block appears exactly once in the corresponding graph.
Condensed graph representations are also beneficial because synonyms and coreferences (different expressions for the same thing in a particular context) can be taken into account when building the graph. This results in an even more compact graph. In some embodiments, a term that appears in the original natural language block in at least two different written forms appears exactly once in the corresponding graph.
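As a minimal sketch of such a condensed representation (the node structure, canonical-form table and sample phrases are illustrative assumptions, not taken from the disclosure), nodes can be keyed on a canonical form of each term so that repeated mentions and synonyms collapse into a single node:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    value: str                                      # canonical natural language unit
    children: list = field(default_factory=list)    # meronym children

# Hypothetical synonym/coreference table for one document.
CANONICAL = {"automobile": "car"}

def canonical(term: str) -> str:
    term = term.lower().strip()
    return CANONICAL.get(term, term)

nodes: dict[str, Node] = {}

def get_node(term: str) -> Node:
    """Return the single node for a term, creating it on first mention."""
    key = canonical(term)
    if key not in nodes:
        nodes[key] = Node(key)
    return nodes[key]

def add_meronym(whole: str, part: str) -> None:
    parent, child = get_node(whole), get_node(part)
    if child not in parent.children:
        parent.children.append(child)

# "A car comprising a body ... the automobile further comprising wheels"
add_meronym("car", "body")
add_meronym("automobile", "wheels")   # collapses into the existing "car" node
print([c.value for c in nodes["car"].children])  # ['body', 'wheels']
```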
The second data processing means is a neural network trainer for executing a neural network algorithm capable of iteratively traversing the graph structure and learning both from the internal structure of the graphs and from their node values, as defined by a loss function and by the training data cases defining the learning objective. The trainer typically receives as training data combinations of graphs, or augmented graphs derived therefrom, as specified by the training algorithm. The trainer outputs a trained neural network model.
It has been found that such supervised machine learning approaches, employing data in graph form as described herein, are exceptionally powerful in finding technically relevant documents among patent and scientific documents.
In some embodiments, the storage is further configured to store reference data that links at least some of the blocks to each other. The reference data is used by the trainer to derive the training data, i.e. to define the combinations of graphs used in training as positive or negative training cases (i.e. training samples). The learning objective of the trainer depends on this information.
The third data processing means is a search engine adapted to read a fresh graph, or a fresh natural language block, typically through a user interface or a network interface. Blocks are converted into graphs in the converter unit where necessary. The search engine uses the trained neural network model for determining a subset of the natural language blocks (or of the graphs derived therefrom) based on the fresh graph.
FIG. 1A illustrates one embodiment of the present system that is particularly suited to searching technical documents, such as patent documents or scientific documents. The system includes a document store 10A containing a plurality of natural language documents. A graph parser 12 is adapted to read documents from the document store 10A and convert them into a graph format, discussed in more detail later. The converted graphs are stored in a graph store 10B.
The system comprises a neural network trainer unit 14 which receives as training data a set of parsed graphs from the graph store together with information about their relationships to each other. For this purpose, a document reference data store 10C is provided, containing, for example, citation data about the documents and/or novelty search results. The trainer unit 14 runs a graph-based neural network algorithm that produces a neural network model for the neural-network-based search engine 16. The engine 16 uses the graphs from the graph store 10B as the target search set, and uses user data, typically text or a graph, obtained from the user interface 18 as the reference.
Search engine 16 may be, for example, a graph-to-vector search engine trained to find the vectors, corresponding to graphs in the graph store 10B, that lie closest to the vector formed from the user data. Search engine 16 may also be a classifier search engine, such as a binary classifier, that compares the user graph, or a vector derived from it, pairwise with graphs, or vectors derived from them, obtained from the graph store 10B.
Fig. 1B shows an embodiment of the system further comprising a text embedding unit 13 which converts the natural language units of the graphs into a multi-dimensional vector format. This is done both for the converted graphs from the graph store 10B and for graphs entered through the user interface 18. Typically, the vectors have at least 100 dimensions, such as 300 dimensions or more.
In one embodiment, also shown in FIG. 1B, the neural network search engine 16 is divided into two parts forming a cascade. For example, engine 16 includes a graph embedding engine 16A that converts graphs into a multidimensional vector format using a model trained by the graph embedding trainer 14A of the neural network trainer 14, using reference data from the document reference data store 10C. The user graph is compared in a vector comparison engine 16B with the vectors previously generated by the graph embedding engine 16A. As a result, a reduced subset of graphs closest to the user graph is found. This subset of graphs is further compared with the user graph by a graph classifier engine 16C to further narrow the set of relevant graphs. The graph classifier engine 16C is trained by the graph classifier trainer 14C using, for example, data from the document reference data store 10C as training data. This embodiment is beneficial because the vector comparison of pre-formed vectors by the vector comparison engine 16B is very fast, while the graph classifier engine has access to the detailed data content and structure of the graphs and can compare graphs accurately to find the differences between them. The graph embedding engine 16A and the vector comparison engine 16B thus act as an efficient pre-filter for the graph classifier engine 16C, reducing the amount of data the graph classifier engine 16C needs to process.
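A minimal sketch of this cascade (all function and parameter names are assumptions; `embed` and `classify` stand for the trained graph embedding and graph classifier models) might look as follows:

```python
import numpy as np

def search(user_graph, corpus_graphs, corpus_vecs, embed, classify, k=100):
    """Two-stage retrieval: fast vector pre-filter, then pairwise classifier.

    corpus_vecs: (N, d) array of pre-computed, L2-normalized graph vectors.
    """
    # Stage 1: cosine similarity against all pre-embedded graphs (fast).
    q = embed(user_graph)
    q = q / np.linalg.norm(q)
    sims = corpus_vecs @ q                  # (N,) cosine similarities
    top = np.argsort(-sims)[:k]             # k nearest candidates

    # Stage 2: accurate pairwise classification of the shortlist (slower).
    scored = [(classify(user_graph, corpus_graphs[i]), i) for i in top]
    scored.sort(reverse=True)
    return [i for _, i in scored]
```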
The graph embedding engine may convert the graphs into vectors having at least 100 dimensions, preferably 200 dimensions or more, and even 300 dimensions or more.
The neural network trainer 14 is divided into two parts, a graph embedding part and a graph classifier part, trained by the graph embedding trainer 14A and the graph classifier trainer 14C, respectively. The graph embedding trainer 14A forms a neural-network-based graph-to-vector model, the purpose of which is to produce neighboring vectors for graphs whose text content and internal structure are similar to each other. The graph classifier trainer 14C forms a classifier model capable of ranking pairs of graphs according to the similarity of their textual content and internal structure.
The user data obtained from the user interface 18, after embedding in the embedding unit 13, is fed to the graph embedding engine for vectorization, after which the vector comparison engine 16B looks up the set of closest vectors corresponding to graphs in the graph store 10B. This set of closest graphs is fed to the graph classifier engine 16C, which compares them one by one with the user graph using the trained graph classifier model to obtain accurate matches.
In some embodiments, the graph embedding engine 16A, as trained by the graph embedding trainer 14A, outputs vectors whose mutual angles are the smaller, the more similar the graphs are in terms of node content and node structure, as learned from the reference data using a learning objective dependent thereon. Through training, the vector angle of positive training cases (graphs describing the same concept) derived from the citation data can be minimized, while the vector angle of negative training cases (graphs describing different concepts) is maximized, or at least kept clearly non-zero.
The graph vectors may be chosen to have, for example, 200-1000 dimensions, such as 250-600 dimensions.
It has been found that such supervised machine learning models are able to efficiently evaluate the similarity of the technical concepts disclosed by the graphs and, by extension, by the natural language blocks from which the graphs are derived.
In some embodiments, the graph classifier engine 16C, as trained by the graph classifier trainer 14C, outputs a similarity score that is the higher, the more similar the compared graphs are in terms of node content and node structure, as learned from the reference data using a learning objective dependent thereon. Through training, the similarity score of positive training cases (graphs describing the same concept) derived from the citation data can be maximized, while the similarity score of negative training cases (graphs describing different concepts) is minimized.
Cosine similarity is one possible criterion for the similarity of graphs, or of the vectors derived from them.
It should be noted that the graph classifier trainer 14C and engine 16C are not mandatory; graph similarity may instead be evaluated directly from the angle between the vectors produced by the graph embedding engine. For this purpose, the one or more graph vectors nearest to a given fresh graph vector may be looked up using a fast vector index known per se.
The neural network used by the trainer 14 and the search engine 16, or by any one or both of the sub-trainers 14A, 14C or sub-engines 16A, 16C, may be a recurrent neural network, in particular one utilizing Long Short-Term Memory (LSTM) units. In the case of a tree-structured graph, the network may be a Tree-LSTM network, such as a Child-Sum Tree-LSTM network. The network may have one or more LSTM layers and one or more further network layers. The network may use an attention mechanism that relates portions of the graphs to each other, internally or externally, when training and/or running the model.
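For illustration, a single Child-Sum Tree-LSTM step, following the published formulation of Tai et al. (2015), could be sketched as below; the class itself is an assumption and not code from the disclosure:

```python
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    """One Child-Sum Tree-LSTM step (after Tai et al., 2015)."""

    def __init__(self, in_dim: int, mem_dim: int):
        super().__init__()
        self.iou = nn.Linear(in_dim, 3 * mem_dim)        # input/output/update gates
        self.iou_h = nn.Linear(mem_dim, 3 * mem_dim, bias=False)
        self.f = nn.Linear(in_dim, mem_dim)              # forget gate (per child)
        self.f_h = nn.Linear(mem_dim, mem_dim, bias=False)

    def forward(self, x, child_h, child_c):
        """x: (in_dim,) node embedding; child_h, child_c: (n_children, mem_dim)."""
        h_sum = child_h.sum(dim=0)                       # sum of children states
        i, o, u = (self.iou(x) + self.iou_h(h_sum)).chunk(3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        # One forget gate per child, conditioned on that child's hidden state.
        f = torch.sigmoid(self.f(x) + self.f_h(child_h))
        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c
```

Evaluating the cell bottom-up, from the leaves to the root, yields a root hidden state that can serve as the graph vector.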
Some additional embodiments of the present invention are described below in the context of a patent retrieval system, whereby the documents being processed are patent documents. The general embodiments and principles described above apply to the patent retrieval system.
In a certain embodiment, the system is configured to store in the storage natural language documents each containing a first natural language block and a second natural language block different from the first. The trainer may use a plurality of first graphs corresponding to the first blocks of first documents and, for each first graph, one or more second graphs based at least in part on second blocks of second documents different from the first document, as defined by the reference data. In this way, the neural network model learns from the interrelationships between different parts of different documents. Alternatively or additionally, the trainer may use a plurality of first graphs corresponding to the first blocks of first documents and, for each first graph, a second graph based at least in part on the second block of the same first document. In this way, the neural network model learns from the internal relationships of the data within a single document. These two learning schemes may be used separately or together by the patent search system described in detail below.
The condensed graph representation discussed above is particularly suitable for a patent search system, i.e. for claim graphs and full specification graphs, and in particular for the full specification graphs.
FIG. 1C shows a system including a patent document store 10A containing patent documents that include at least a computer-recognizable specification portion and a claims portion. The graph parser 12 is configured to parse claims with a claim graph parser 12A and full specifications with a full specification graph parser 12B. The parsed graphs are stored separately in the claim and full specification graph store 10B. The text embedding unit 13 prepares the graphs for processing in the neural network.
The citation data may comprise public patent application and patent search and/or examination data, and/or citation data between patent documents. In one embodiment, the citation data contains information on previous patent search results, i.e. earlier patent documents that have been considered novelty and/or inventive step bars against later-filed patent applications. The citation data is stored in the prior patent search and/or citation data store 10C.
The neural network trainer 14 uses the parsed and embedded graphs to form a neural network model trained particularly for patent search purposes. This is accomplished by using the patent search and/or citation data as input to the trainer 14. The objective is, for example, to minimize the vector angle between the claim graph of a patent application and the full specification graph of a patent document serving as a bar to its novelty, or to maximize their similarity score. In this way, applied to a large number (typically hundreds of thousands or millions) of claims, the model learns to evaluate the novelty of a claim over the prior art. The model is used by the search engine 16 on user graphs obtained through the user interface 18A to find the most likely novelty bars. The results may be shown in the search results view interface 18B.
The system of FIG. 1C may utilize a cascade of search engines. The engines may be trained with the same or different subsets of the training data obtained from the prior patent search and/or citation data store 10C. For example, a set of graphs may first be filtered from the complete prior art data set using a graph embedding engine trained on a large or complete set of citation data (i.e. positive and negative claim/full-specification pairs). The set of graphs filtered against the user graph is then classified for similarity in a classification engine, which may be trained on smaller sets of citation data (i.e. positive and negative claim/full-specification pairs) specific to, for example, a particular patent classification.
Next, a tree graph structure particularly suitable for a patent retrieval system is described with reference to fig. 2A and 2B.
FIG. 2A shows a tree graph with only meronym relationships as edge relationships. Text units A-D are arranged in the graph as linearly recursive nodes 10, 12, 14, 16, starting from the root node 10, and text unit E is arranged as a child of node 12, namely as child node 18, as derived from the natural language block shown. In the block, expressions such as "comprising", "having" and "including" are used to detect the meronym/holonym relationships.
FIG. 2B shows another tree graph having two different edge relationships, in this example a meronym relationship (first relationship) and a hyponym relationship (second relationship). Text units A-C are arranged as linearly recursive nodes 10, 12, 14 having meronym relationships. Text unit D is arranged as a child node 26 of parent node 14 with a hyponym relationship. Text unit E is arranged as a child node 24 of parent node 12 with a hyponym relationship. Text unit F is arranged as a child node 28 of node 24 with a meronym relationship. In the block, expressions such as "comprising" and "having" are used to detect the meronym relationships, and expressions such as "such as" and "for example" the hyponym relationships.
According to one embodiment, the first data processing means is adapted to transform a block into a graph by first identifying in the block a first set of natural language symbols (e.g. nouns and noun chunks) and a second set of natural language symbols (e.g. meronym and holonym expressions) different from the first set. A matcher is then run on the first and second symbol sets to form matched pairs of first-set symbols (e.g. "body" and "member" from "body comprising a member"). Finally, the first-set symbols are arranged as nodes of the graph using the matched pairs (e.g. "body" -(meronym edge)- "member").
In one embodiment, at least meronym edges are used in the graph, whereby the respective nodes contain natural language units having a meronym relationship with each other, as derived from the blocks.
In one embodiment, hyponym edges are used in the graph, whereby the respective nodes contain natural language units having a hyponym relationship with each other, as derived from the natural language blocks.
In one embodiment, reference edges are used in the graph: at least one node of the graph contains a reference to one or more other nodes in the same graph and, in addition, at least one natural language unit derived from the corresponding natural language block (e.g. "below [node id: X]"). In this way, graph space is saved and a simple, e.g. tree-shaped, graph structure can be maintained while still allowing expressive data content in the graph.
In some embodiments, the graph is a tree graph whose node values contain single-word or multi-word chunks derived from the natural language blocks, or from vectorized forms thereof, typically by a graph conversion unit using the part-of-speech tags and syntactic dependencies of the words.
Fig. 3 shows in detail one example of how the text-to-graph conversion may be implemented in the first data processing means. First, the text is read in step 31, and a first set of natural language symbols (such as nouns) and a second set of natural language symbols (such as symbols indicating a part/whole sense, e.g. "comprising") are detected in the text. This may be accomplished by tokenizing the text in step 32, part-of-speech (POS) tagging the tokens in step 33, and deriving their syntactic dependencies in step 34.
In one embodiment, the noun chunk pairs are arranged into a tree graph in step 38, with each meronym being a child of the corresponding holonym. The graph is saved in the graph store for further use in step 39, as discussed above.
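As a hedged sketch of these steps (using the spaCy library; the verb list, dependency handling and example sentence are simplifying assumptions, not the disclosed parser), meronym pairs can be extracted around part/whole verbs roughly as follows:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
PART_WHOLE_VERBS = {"comprise", "include", "contain", "have"}

def meronym_pairs(text: str):
    """Extract (holonym, meronym) noun pairs around part/whole verbs.

    Simplified matcher: the noun governing the verb (its subject, or the
    noun it modifies as a participial clause) is taken as the whole, and
    the verb's direct object as the part. Real text needs more robust rules.
    """
    doc = nlp(text)
    pairs = []
    for tok in doc:
        if tok.lemma_ in PART_WHOLE_VERBS:
            wholes = [c for c in tok.children if c.dep_ == "nsubj"]
            if not wholes and tok.dep_ in ("acl", "relcl"):
                wholes = [tok.head]              # "a car comprising a body"
            parts = [c for c in tok.children if c.dep_ in ("dobj", "obj")]
            pairs += [(w.lemma_, p.lemma_) for w in wholes for p in parts]
    return pairs

print(meronym_pairs("A car comprising a body, wherein the body includes a frame."))
# e.g. [('car', 'body'), ('body', 'frame')]
```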
In one embodiment, the graph formation step involves the use of a probabilistic graphical model (PGM), such as a Bayesian network, for inferring the preferred graph structure. For example, the probabilities of different edges of the graph may be calculated according to a Bayesian model, after which the edge probabilities are used to compute the most likely graph form.
In one embodiment, the graph formation step comprises feeding the text, typically in tokenized, POS-tagged and dependency-parsed form, into a neural-network-based technical parser that finds the relevant chunks in the text block and extracts the desired edge relationships between them, such as meronym and/or hyponym relationships.
In one embodiment, the graph is a tree graph comprising edge relationships arranged recursively according to an acyclic tree data schema. This allows the use of efficient tree-based neural network models of the recurrent or non-recurrent type. One example is the Tree-LSTM model.
In another embodiment, the graph is a network graph that allows cycles, i.e. edges between branches. This has the benefit of allowing complex edge relationships to be expressed.
In yet another embodiment, the graph is a forest of linear and/or non-linear branches having a length of one or more edges. Linear branches have the benefit that the tree or network building steps are avoided or significantly simplified, and the maximum amount of source data is available to the neural network.
In each model, the marginal likelihoods, if obtained from a PGM model, can be stored and used by the neural network.
It should be noted that the graph formation method described above with reference to FIG. 3 and elsewhere in this document may be implemented independently of the other methods and system parts described herein, to form and store a technically condensed representation of the technical content of a document, in particular of full patent specifications and claims.
Figs. 4A-4C illustrate different, mutually compatible, methods of training neural networks, particularly for patent search purposes.
In the general case, the term "patent document" may be replaced with "document" (having a unique computer-readable identifier among the other documents in the system), "claim" with "first computer-recognizable block", and "full specification" with "second computer-recognizable block at least partially different from the first block".
In the embodiment of FIG. 4A, a plurality of claim graphs 41A and, for each claim graph, the corresponding close prior art full specification graphs 42A, as related by the reference data, are used as training data by the neural network trainer 44A. They form positive training cases indicating that a low vector angle, or a high similarity score, between such graphs is to be achieved. In addition, for each claim graph, negative training cases, i.e. one or more remote prior art graphs, may be used as part of the training data. A high vector angle, or a low similarity score, between such graphs is to be achieved. The negative training cases may be randomized, for example, from the complete set of graphs.
According to one embodiment, the negative training cases are selected, in at least one stage of the training implemented by the neural network trainer 44A, from a subset of all possible negative training cases that are more difficult than the average of all possible negative training cases. For example, difficult negative training cases may be selected such that the claim graph and the specification graph are from the same patent classification (down to a predetermined classification level), or such that the neural network has previously failed (with a predetermined confidence) to correctly classify the specification graph as a negative case.
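A sketch of such difficulty-aware negative sampling is shown below; the attribute names (`doc_id`, `cited_ids`, `ipc_class`) and the `model.score` interface are hypothetical:

```python
import random

def sample_negatives(claim, corpus, model, n=5, hard_ratio=0.5):
    """Pick a mix of random and 'hard' negatives for one claim graph.

    Hard negatives: full specifications sharing the claim's patent class,
    or ones the current model still scores as similar. All names are
    illustrative; model.score is assumed to return a similarity in [0, 1].
    """
    unrelated = [s for s in corpus if s.doc_id not in claim.cited_ids]
    hard = [s for s in unrelated
            if s.ipc_class == claim.ipc_class or model.score(claim, s) > 0.5]
    n_hard = min(int(n * hard_ratio), len(hard))
    picks = random.sample(hard, n_hard)
    rest = [s for s in unrelated if s not in picks]
    picks += random.sample(rest, min(n - n_hard, len(rest)))
    return picks
```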
According to one embodiment, which can also be implemented independently of the other method and system parts described herein, the training of the present neural-network-based patent search or novelty assessment system is implemented by providing a plurality of patent documents, each having a computer-identifiable claim block and a full specification block, the full specification block including at least a portion of the specification of the patent document. The method also includes providing a neural network model and training it using a training data set comprising data from the patent documents, so as to form the trained neural network model. The training includes using claim block and full specification block pairs derived from the same patent document as training cases of the training data set.
Typically, these intra-document positive training cases form a small fraction of all training cases, e.g. 1%-25%, the rest comprising, for example, search report (examiner novelty citation) training cases.
The present machine learning model is generally configured to convert claims and full specifications into vectors, and a learning objective of the training may be to minimize the vector angle between the claim vector and the full specification vector of the same patent document. Another learning objective may be to maximize the vector angle between the claim vectors and full specification vectors of at least some pairs of different patent documents.
In the embodiment of FIG. 4B, multiple claim graphs 41A and full specification graphs 42A derived from the same patent documents are used as training data by the neural network trainer 44B. A claim's "own" full specification usually forms a perfect positive training case; in other words, a patent document is, by itself, an ideal novelty bar against its own claims. Thus, these graph pairs form positive training cases indicating that a low vector angle, or a high similarity score, between such graphs is to be achieved. Also in this scenario, reference data and/or negative training cases may be used.
Tests have shown that, simply by adding claim/specification pairs from the same document to training data based on real novelty searches, the classification accuracy on prior art improves by more than 15% when tested with test data pairs based on real novelty searches.
Typically, at least 80%, usually at least 90%, and in many cases 100% of the machine-readable content (natural language units, especially words) of a claim is found somewhere in the full specification of the same patent document. Thus, the claims and the full specification of a patent document are linked to each other not only by their cognitive content and their shared unique identifier (e.g. publication number), but also by their byte-level content.
According to an embodiment, which can also be implemented independently of the other method and system parts described herein, the training of the present neural-network-based patent search or novelty assessment engine includes deriving, from at least some original claim or full specification blocks, at least one reduced data instance corresponding partially to the original block, and using the reduced data instance together with the original claim or full specification block as training cases of the training data set.
In the embodiment of FIG. 4C, the positive training cases are augmented by forming a plurality of reduced claim graphs 41C''-41C'''' from an original claim graph 41C'. A reduced claim graph is a graph wherein (see the sketch after this list)
-at least one node is removed (e.g. phone-display-sensor -> phone-display),
-at least one node is moved to another, higher (more general) position in the branch (e.g. phone-display-sensor -> phone-(display, sensor)), and/or
-the natural language unit value of at least one node is replaced with a more general natural language unit value (e.g. phone-display-sensor -> electronics-display-sensor).
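The three reductions above could be sketched, under the hypothetical Node structure used earlier (none of these helper names come from the disclosure), as:

```python
import copy
import random

# Hypothetical generalization table; a real system might use hypernym data.
HYPERNYMS = {"phone": "electronics", "sensor": "component"}

def iter_nodes(node):
    yield node
    for c in node.children:
        yield from iter_nodes(c)

def augment(graph):
    """Return a reduced copy of a claim graph after one random edit.

    Assumes the illustrative Node structure sketched earlier (.value, .children).
    """
    g = copy.deepcopy(graph)
    nodes = list(iter_nodes(g))
    parents = [n for n in nodes if n.children]
    op = random.choice(["remove_node", "promote_node", "generalize_value"])
    if op == "remove_node" and parents:
        p = random.choice(parents)
        p.children.pop(random.randrange(len(p.children)))   # drop a subtree
    elif op == "promote_node":
        # phone-display-sensor -> phone-(display, sensor)
        for p in parents:
            for c in p.children:
                if c.children:
                    p.children.append(c.children.pop())
                    return g
    elif op == "generalize_value":
        n = random.choice(nodes)
        n.value = HYPERNYMS.get(n.value, n.value)           # broader term
    return g
```

Applying such an edit a handful of times per claim graph yields the augmentation set sizes mentioned below.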
Such an augmentation scheme allows the training set of the neural network to be expanded, resulting in a more accurate model. It also allows meaningful searching and novelty evaluation for so-called trivial inventions, defined with only a few nodes or in very general terms, which are rarely if ever seen in actual patent novelty search data. Data augmentation may be performed in conjunction with any one, or a combination, of the embodiments of FIGS. 4A and 4B. Also in this scenario, negative training cases may be used.
Negative training cases can also be augmented by removing or moving nodes, or replacing node values, in the full specification graphs.
A tree graph structure, such as one based on meronym relationships, is beneficial for the augmentation scheme, since it can be augmented in a straightforward and robust manner by deleting nodes or moving them to higher tree positions while the coherent technical and semantic logic inside the graph is still preserved. In this case, both the original data instances and the reduced data instances are graphs.
In one embodiment, a reduced graph is a graph in which at least one leaf node has been deleted relative to the original graph or another reduced graph. In one embodiment, all leaf nodes at a certain depth of the graph are deleted.
This augmentation may also be performed directly on the natural language blocks, in particular by deleting parts of the natural language blocks or changing the content of the natural language blocks partly to more general content.
The number of reduced data instances per original instance may be, for example, 1-10000, in particular 1-100. Good training results were achieved in claim amplifications with 2-50 amplification patterns.
In some embodiments, the search engine reads fresh natural language blocks, such as fresh claims, which are converted to fresh graphics by a converter, or reads fresh graphics directly through a user interface. A user interface suitable for direct graphical input is discussed next.
Fig. 5 illustrates the representation and modification of an exemplary graph on the display element 50 of the user interface. The display element 50 includes a plurality of editable data cells A-F, whose values are functionally connected to the corresponding natural language elements (e.g., elements a-f, respectively) of the underlying graph and are shown in respective user interface (UI) data elements 52, 54, 56, 54', 56". A UI data element may be, for example, a text field whose value can be edited with a keyboard once the element is activated. The UI data elements 52, 54, 56, 54', 56" are positioned horizontally and vertically on the display element 50 according to their position in the graph; the horizontal position corresponds to the depth of the cell in the graph.
The display element 50 may be, for example, a window, frame or panel of a web browser running a web application, or a graphical user interface window of a stand-alone program executable in a computer.
The user interface also includes a shifting engine that allows a natural language unit to be moved horizontally on the display element in response to a user input, the graph being modified accordingly. To illustrate this, fig. 5 shows data cell F (element 56") shifted one level to the left (arrow 59A). As a result, the original element 56", nested below element 54', ceases to exist, and a new element 54" comprising data cell F (with its original value) is formed, nested below the higher-level element 52. If data element 54' is then shifted two steps to the right (arrow 59B), data element 54' and its children are shifted to the right and nested beneath data element 56 as data elements 56" and 58. Each shift is reflected by a corresponding shift of nesting level in the underlying graph. Thus, when a cell is shifted to a different nesting level in the user interface, its children are retained and move with it in the graph.
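A minimal sketch of the left shift described above is given below, using a nested-dict cell model that is an assumption for illustration only:

```python
# Sketch: shifting a cell one nesting level up (left) in the UI re-parents it,
# together with its children, under its grandparent; the underlying graph is
# updated the same way, so no children are lost.
def shift_left(cell: dict, parent: dict, grandparent: dict) -> None:
    parent["children"].remove(cell)
    grandparent["children"].append(cell)

# Element 52 contains element 54', which contains data cell F (element 56"):
cell_f = {"value": "F", "children": []}
elem_54p = {"value": "D", "children": [cell_f]}
elem_52 = {"value": "A", "children": [elem_54p]}
shift_left(cell_f, elem_54p, elem_52)   # F now nests directly below element 52
```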
In some embodiments, the UI data elements include natural language helper elements, shown in relation to the editable data cells, which assist the user in entering natural language data. The content of a helper element may be formed using the relation unit associated with the natural language unit concerned and, optionally, the natural language unit of its parent element.
Instead of a graph-based user interface as illustrated in fig. 5, the user interface may allow a block of text, such as an independent claim, to be entered. The text block is then fed to a graph parser to obtain a graph that can be used in the further stages of the retrieval system.

Claims (17)

1. A computer-implemented method of training a machine learning-based patent search or novelty assessment system, comprising
- providing a plurality of patent documents, each of said patent documents having a computer-identifiable claim block and a computer-identifiable full specification block, said full specification block comprising at least a portion of the specification of the patent document,
- providing a machine learning model,
- training the machine learning model, using a training data set comprising data from said patent documents, to form a trained machine learning model,
wherein
-the training comprises using claim block and full specification block pairs derived from the same patent document as training cases of the training data set.
2. The method of claim 1, comprising using the claim block and full specification block pairs derived from the same patent document as positive training cases indicating positive search hits or negative novelty evaluation results.
3. The method of claim 1 or 2, comprising
- converting the claim blocks and the full specification blocks into graphs, the graphs comprising a plurality of nodes, each of the nodes comprising a natural language unit extracted from the respective block,
- using a graph-based neural network model, typically one capable of embedding the graphs into vectors,
- using claim graphs and full specification graphs derived from the same patent document as said training cases, the learning goal of the training typically being to minimize the vector angle between the claim graphs and the full specification graphs of the same document.
4. The method of claim 3, wherein the graph format is a recursive tree format comprising nested nodes having natural language data units as node values.
5. The method of claim 3 or 4, wherein the converting comprises
- identifying from the block a first set of natural language tokens and a second set of natural language tokens different from the first set of tokens,
- running a matcher on the first set of tokens and the second set of tokens to form matching pairs of tokens of the first set,
- arranging at least a part of the first set of tokens as consecutive nodes of the graph in accordance with the matching pairs.
6. The method of any of claims 3-5, wherein the graph comprises a plurality of edges, the respective nodes of each edge comprising natural language units that have a meronym or hyponym relationship with each other, as derived from the natural language blocks.
7. The method of any of the preceding claims, further comprising using second pairs of claim blocks and full specification blocks, derived from different patent documents, as training cases of the training data set.
8. The method according to any of the preceding claims, wherein the claim block comprises an independent claim of a patent document, such as the first independent claim.
9. The method of any preceding claim, wherein the claim block comprises a combination of an independent claim of a patent document and a claim dependent thereon.
10. A machine learning-based natural language document comparison system, comprising
- a machine learning training subsystem adapted to read a first block and a second block of a document, the second block being at least partially different from the first block, and to use the blocks as training data for forming a trained machine learning model,
-a machine learning retrieval engine using the trained machine learning model for finding a subset of documents in a larger set of documents,
wherein the machine learning training subsystem is configured to use a first block and a second block pair derived from the same document as training cases of the training data.
11. The system of claim 10, wherein
- said machine learning training subsystem is adapted to transform said first and second blocks into a first graph and a second graph, said first and second graphs containing a plurality of nodes, each of said nodes containing a natural language unit extracted from the respective block, and
-the machine learning training subsystem is adapted to use a graph-based neural network algorithm and to utilize a first graph and a second graph originating from the same document as training cases of the training data set.
12. The system of claim 11, wherein the graph comprises a plurality of edges, the respective nodes of each edge comprising natural language units that have a meronym or hyponym relationship with each other, as derived from the natural language blocks.
13. The system of any of claims 10-12, further using a second pair of first and second blocks, originating from different documents, as training cases of the training data set.
14. The system of any of claims 10-13, wherein the machine learning training subsystem is adapted to read a patent document as said document, whereby the first block is a claim block and the second block is a full specification block of the patent document.
15. Use of the claims and full specification of the same patent document as a training case for a machine learning based patent retrieval or novelty assessment system.
16. The use of claim 15, wherein the machine learning-based patent retrieval or novelty assessment system comprises a machine learning model configured to convert claims and full specifications into vectors, and wherein a learning goal of training of the model is to minimize a vector angle between a claim vector and a full specification vector of the same patent document.
17. The use of claim 16, further comprising using claims and full specifications of different patent documents as training cases, wherein a learning goal of the training of the model is to maximize a vector angle between claim vectors and full specification vectors of different patent documents.
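For illustration only (not part of the claims), the learning goal of claims 16-17 can be expressed as a standard contrastive objective over claim/specification vector pairs. The following PyTorch sketch, including the temperature value, is an assumption rather than the patented method:

```python
import torch
import torch.nn.functional as F

def claim_spec_loss(claim_vecs: torch.Tensor, spec_vecs: torch.Tensor) -> torch.Tensor:
    """claim_vecs[i] and spec_vecs[i] embed the same patent document.

    Minimizes the vector angle for same-document pairs (diagonal entries)
    while maximizing it for different-document pairs (off-diagonal)."""
    sim = F.cosine_similarity(claim_vecs.unsqueeze(1), spec_vecs.unsqueeze(0), dim=-1)
    targets = torch.arange(claim_vecs.size(0))
    return F.cross_entropy(sim / 0.1, targets)   # 0.1 is an assumed temperature
```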
CN201980082811.1A 2018-10-13 2019-10-13 Method for training a natural language search system, search system and corresponding use Pending CN113196278A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
FI20185865A FI20185865A1 (en) 2018-10-13 2018-10-13 Method of training a natural language search system, search system and corresponding use
FI20185865 2018-10-13
PCT/FI2019/050733 WO2020074788A1 (en) 2018-10-13 2019-10-13 Method of training a natural language search system, search system and corresponding use

Publications (1)

Publication Number Publication Date
CN113196278A 2021-07-30

Family

ID=68583453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980082811.1A Pending CN113196278A (en) 2018-10-13 2019-10-13 Method for training a natural language search system, search system and corresponding use

Country Status (6)

Country Link
US (1) US20210397790A1 (en)
EP (1) EP3864566A1 (en)
JP (1) JP2022513353A (en)
CN (1) CN113196278A (en)
FI (1) FI20185865A1 (en)
WO (1) WO2020074788A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220207240A1 (en) * 2019-05-18 2022-06-30 IPRally Technologies Oy System and method for analyzing similarity of natural language data
CN111539228B (en) * 2020-04-29 2023-08-08 支付宝(杭州)信息技术有限公司 Vector model training method and device and similarity determining method and device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100131513A1 (en) * 2008-10-23 2010-05-27 Lundberg Steven W Patent mapping
US8892547B2 (en) * 2011-10-03 2014-11-18 Black Hills Ip Holdings, Llc System and method for prior art analysis
US10810193B1 (en) * 2013-03-13 2020-10-20 Google Llc Querying a data graph using natural language queries
US10073890B1 (en) * 2015-08-03 2018-09-11 Marca Research & Development International, Llc Systems and methods for patent reference comparison in a combined semantical-probabilistic algorithm
US10831762B2 (en) * 2015-11-06 2020-11-10 International Business Machines Corporation Extracting and denoising concept mentions using distributed representations of concepts
US20180300323A1 (en) * 2017-04-17 2018-10-18 Lee & Hayes, PLLC Multi-Factor Document Analysis
US10817781B2 (en) * 2017-04-28 2020-10-27 SparkCognition, Inc. Generation of document classifiers
CN110019806B (en) * 2017-12-25 2021-08-06 中移动信息技术有限公司 Document clustering method and device
CN108717601B (en) * 2018-05-08 2022-05-06 西安交通大学 Multi-innovation method integration and fusion method for enterprise problem
US10891321B2 (en) * 2018-08-28 2021-01-12 American Chemical Society Systems and methods for performing a computer-implemented prior art search

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220108085A1 (en) * 2020-10-01 2022-04-07 Shrey Pathak Automated Patent Language Generation
CN114443863A (en) * 2022-04-07 2022-05-06 北京网藤科技有限公司 Attack vector generation method and system based on machine learning in industrial control network
CN114443863B (en) * 2022-04-07 2022-07-26 北京网藤科技有限公司 Attack vector generation method and system based on machine learning in industrial control network
CN116795789A (en) * 2023-08-24 2023-09-22 卓望信息技术(北京)有限公司 Method and device for automatically generating patent retrieval report
CN116795789B (en) * 2023-08-24 2024-04-19 卓望信息技术(北京)有限公司 Method and device for automatically generating patent retrieval report

Also Published As

Publication number Publication date
EP3864566A1 (en) 2021-08-18
JP2022513353A (en) 2022-02-07
WO2020074788A1 (en) 2020-04-16
FI20185865A1 (en) 2020-04-14
US20210397790A1 (en) 2021-12-23

Similar Documents

Publication Publication Date Title
US8751218B2 (en) Indexing content at semantic level
CN113168499A (en) Method for searching patent document
CN113196277A (en) System for retrieving natural language documents
CN113196278A (en) Method for training a natural language search system, search system and corresponding use
CN113191148A (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN112000802A (en) Software defect positioning method based on similarity integration
CN116108191A (en) Deep learning model recommendation method based on knowledge graph
Garrido et al. TM-gen: A topic map generator from text documents
Ribeiro et al. Discovering IMRaD structure with different classifiers
Frasconi et al. Text categorization for multi-page documents: A hybrid naive Bayes HMM approach
Gelman et al. A language-agnostic model for semantic source code labeling
Dawar et al. Comparing topic modeling and named entity recognition techniques for the semantic indexing of a landscape architecture textbook
US20220207240A1 (en) System and method for analyzing similarity of natural language data
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
Hovy et al. Extending metadata definitions by automatically extracting and organizing glossary definitions
CN113326348A (en) Blog quality evaluation method and tool
Guerram et al. A domain independent approach for ontology semantic enrichment
CN116186211B (en) Text aggressiveness detection and conversion method
Rodrigues et al. Domain adaptation of POS taggers without handcrafted features
Gebremeskel et al. Unlock Tigrigna NLP: Design and Development of Morphological Analyzer for Tigrigna Verbs Using Hybrid Approach
Imsombut et al. A Comparison of Statistical and Data Mining Techniques for Enrichment Ontology with Instances
Ramai et al. Using Automatic and Semi-automatic Methods for Digitizing the Dictionary of Trinidad and Tobago English/Creole into a Graph Database
CN117829140A (en) Automatic comparison method and system for regulations and regulations
Jiang et al. Effective use of phrases in language modeling to improve information retrieval
CN117195908A (en) Semantic-based high-creativity patent network identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination