CN110023924A - Device and method for semantic search - Google Patents
- Publication number
- CN110023924A CN110023924A CN201780069862.1A CN201780069862A CN110023924A CN 110023924 A CN110023924 A CN 110023924A CN 201780069862 A CN201780069862 A CN 201780069862A CN 110023924 A CN110023924 A CN 110023924A
- Authority
- CN
- China
- Prior art keywords
- text document
- text
- query
- document data
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
A computer-implemented method for comparing text documents is disclosed. The method comprises building a database containing first text document data associated with a plurality of first text documents. The method further comprises receiving a query, converting the query into second text document data, comparing the second text document data with the first text document data, and calculating at least one similarity measurement between the second text document data and the first text document data. Also disclosed is a computer-implemented method for processing similarity in text documents. That method comprises harmonizing at least one incoming query, normalizing the at least one harmonized incoming query, building at least one query vector using the at least one normalized harmonized query, and calculating at least one similarity measurement between the at least one query vector and at least one other text document, wherein the at least one other text document has undergone the preceding steps. A computer-implemented system is also disclosed. The system comprises at least one memory component adapted to store at least a database containing a plurality of first text document data associated with first text documents. The system can also comprise at least one input device adapted to receive a query, the query comprising a second text document and/or information identifying a second text document. The second text document is associated with second text document data comprised in the first text document data stored in the memory component. The system further comprises at least one processing component adapted to convert the query into second text document data and/or to retrieve second text document data associated with the query from the database stored in the at least one memory component. The processing component is further adapted to compare the second text document data with the first text document data stored in the at least one memory component. The system also comprises at least one output device adapted to return information identifying at least one similar first text document associated with the first text document data, the similar first text document being, among the first text documents, the most similar to the query.
Description
Technical field
The present invention relates to the fields of data analysis and data conversion, and in particular to semantic search. More precisely, the invention describes a search engine suitable for semantically comparing text documents.
Background technique
Since the advent of large file repositories, particularly on the internet, searching for similar information in repositories or databases holding vast amounts of data has become one of the hardest problems to solve. One solution is the brute-force approach of searching all available documents for keywords explicitly defined by the user. This approach is efficient in terms of processing power, but it has some limitations: depending on the topic at hand, the same keyword can refer to very different things, and synonyms or similar expressions force the search to be repeated many times to obtain all relevant findings.
In a more specific example of prior-art searching, searches for similar patents are usually performed through the IPC (International Patent Classification), through the CPC (Cooperative Patent Classification), or through the citations listed for each patent. This approach is likely to yield some relevant findings, but it may miss more recent (and not yet cited) similar information, or return too many less relevant findings (in the case of searching by IPC or CPC class).
A more careful way of assessing document similarity can be achieved by semantic search. Such a search takes into account synonyms, expressions consisting of more than one word, and the specific technical terminology of a given field, and combines all of them to carry out a more accurate similarity comparison. It can be performed by defining a multidimensional vector space whose vectors represent the different terms or texts used, and carrying out the similarity comparison directly in that vector space.
United States Patent 8,688,720 discloses a system for characterizing documents by clusters of conceptually related words. Upon receiving a document comprising a set of words, the system selects "candidate clusters" of conceptually related words that relate to the set of words. These candidate clusters are selected using a model that explains how the set of words might have been generated from clusters of conceptually related words. The system then constructs a set of components to characterize the document, where the set of components includes components for the candidate clusters, each component indicating the degree to which the corresponding candidate cluster relates to the set of words.
United States Patent 8,935,230 discloses a method, a machine-readable storage medium and a system for providing a self-learning semantic search engine. A semantic network can be built with an initial configuration. A search engine coupled to the semantic network can build an index and a semantic index. A user request for business data can be received, and the search engine can be accessed via a semantic scheduler. Based on that access, the search engine can update the index and the semantic index.
United States Patent Application 2014/280088 describes a system, and related methods, for searching a data set consisting of a collection of documents, a set of terms, and a vector associated with each term and each document. The methods involve converting a search query into vectors in the expanded vector space of terms and documents, and merging approximate vector matching search with term search to produce a result set, which can be ranked according to various measures of query relevance.
Summary of the invention
The present invention is set out in the claims and in the following description. Preferred embodiments are specifically set out in the dependent claims and in the description of the various embodiments. The features and accompanying details of the invention described above are further illustrated in the following examples, which are intended to further demonstrate the invention and are not intended to limit its scope in any way.
In view of the known art, it is therefore an object of the present invention to disclose a method and a device for performing semantic search using at least some of the following elements:
1) implementing (and in particular training) part-of-speech tagging specially designed for technical language, cleaning text, deleting stop-words, reducing words to stems and phrases, correcting misspellings, harmonizing language style, resolving synonyms, cleaning OCR (optical character recognition) errors, performing multi-component weighting, and using different similarity indices in different ways;
2) integrating the analyses and hypotheses of lexical and semantic algorithms;
3) considering and implementing different text-related information and different algorithms simultaneously;
4) analyzing texts across all technical fields;
5) implementing a connection between text similarity measurements and bibliographic features;
6) integrating text-based and bibliography-based methods of similarity determination.
In this document, the words "keyword", "term" and "semantic unit" may be used interchangeably. Furthermore, the words "keyword" or "term" can refer to an expression rather than a single word.
In a first embodiment, the invention discloses a computer-implemented method for comparing text documents. The method comprises building a database containing first text document data associated with a plurality of first text documents. The method further comprises receiving a query. The method also comprises converting the query into second text document data. The method further comprises comparing the second text document data with the first text document data and calculating at least one similarity measurement between the second text document data and the first text document data. This similarity measurement may, for example, comprise a similarity index. It can advantageously provide a quantifiable way of comparing text documents with one another.
It should be noted that the query may comprise the second text document, in which case this document can be converted into the second text document data. However, the query can also merely identify a second text document that has already been received into the database as part of the first text document data. In that case, the second text document data already exists and should simply be retrieved from the database and compared with the other data contained in the database.
This method allows text documents to be converted, in an efficient and reliable way, into data that can be analyzed and quantitatively compared with other data. Preferably, the conversion and comparison can be performed by a computing device, ideally in a parallelized manner. The described method can be implemented on a server accessible through a user interface. It can be used to allow users to identify similar text documents for various purposes.
In some preferred embodiments, the first text document data comprises document vectors generated from the keywords contained in the first text documents and/or from words semantically related to those keywords. That is, each first text document can be associated with a document vector stored in the database.
The database may or may not comprise the first text documents themselves. Storing only the document vectors associated with the first text documents advantageously saves storage space in the database. Conversely, it can be advantageous to also store the first text documents so that they can be retrieved easily and quickly, for example in response to a query.
For example, words semantically related to the keywords may comprise synonyms, hypernyms and/or hyponyms. External databases can be used to correctly identify semantically related words. These can be general and/or subject-specific.
In some embodiments, the query may comprise the second text document. Additionally or alternatively, the query may comprise information identifying a second text document associated with second text document data already stored in the memory component. In the second case, the second text document data associated with that second text document can simply be retrieved from the database and then compared with the remaining first text document data in the database. Note that in this case the second text document data may be comprised in the first text document data and matched in a different way to avoid confusion.
In some embodiments, converting the query into the second text document data may comprise harmonizing the query. In some preferred embodiments, harmonization may comprise correcting typing errors, selecting specific spelling and physical-unit conventions and adjusting the text accordingly, and/or expressing certain content in a standard fashion (for example chemical formulas, gene sequences and/or protein representations). This can advantageously allow more reliable comparisons between text documents that relate to the same subject but use different conventions or different units.
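Purely as an illustration of the harmonization step described above, the following sketch maps spelling variants and unit notations onto a single convention. The variant tables (`SPELLING`, `UNITS`) are hypothetical toy examples, not part of the disclosure; a real system would use far larger resources.

```python
import re

# Hypothetical harmonization tables; real systems would use much larger ones.
SPELLING = {"colour": "color", "normalise": "normalize", "fibre": "fiber"}
UNITS = {"millimetre": "mm", "millimeter": "mm",
         "centimetre": "cm", "centimeter": "cm"}

def harmonize(text: str) -> str:
    """Apply a single spelling convention and a single unit notation."""
    tokens = re.findall(r"[a-zA-Z]+|\d+|\S", text.lower())
    out = []
    for t in tokens:
        t = SPELLING.get(t, t)   # harmonize spelling variants
        t = UNITS.get(t, t)      # harmonize physical-unit notation
        out.append(t)
    return " ".join(out)

print(harmonize("The fibre is 10 millimetre long"))
# -> the fiber is 10 mm long
```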
In some embodiments, converting the query into the second text document data may comprise normalizing the query. In some preferred embodiments, normalization comprises identifying and deleting stop-words, reducing words to common stems, analyzing synonym stems, and/or identifying term sequences and compound words.
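As a minimal sketch of this normalization step, the following combines stop-word removal with a deliberately naive suffix stemmer. Both the stop-word list and the stemmer are toy stand-ins for the real resources the disclosure assumes.

```python
# Toy stop-word list; a real system would use a curated one.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def stem(word: str) -> str:
    """Very naive stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(tokens):
    """Remove stop-words and reduce the remaining words to stems."""
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(normalize(["the", "engines", "are", "rotating", "shafts"]))
# -> ['engin', 'rotat', 'shaft']
```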
In some embodiments, normalizing the query may comprise retrieving at least synonyms, hypernyms, hyponyms, stop-words and/or subject-specific stop-words from an external database, and generating a keyword list of the query based at least in part on these terms. There may be one or more external databases separated by subject. This can be advantageous because a word can carry different meanings depending on the subject. For example, an expression such as "delivery system" can have a completely different meaning depending on whether it is used in the context of logistics or of pharmaceuticals.
Accordingly, the corresponding synonyms, hypernyms, hyponyms and/or other semantically related words may also differ depending on the technical area in question. As another example, consider an embodiment in which the present invention is used as part of a semantic search tool for prior art, in particular for patent documents. Patent applications and grants contain very specific wording that recurs across documents on entirely different subjects. Words such as "claim", "comprising", "device" or "embodiment" can be considered patent-specific stop-words and can be deleted from the query. In embodiments where the database comprises patent documents, these specific stop-words can also be deleted from all first text documents while converting them into the first text document data (that is, while building or creating the database). In some embodiments, the keyword list of the query can be generated by deleting stop-words and/or subject-specific stop-words and by including at least one of the synonyms, hypernyms and hyponyms of the words comprised in the query.
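The keyword-list generation just described could be sketched as follows: general and subject-specific stop-words are dropped, and the surviving words are expanded with semantically related words. The synonym table here mocks the external database the disclosure assumes; all entries are hypothetical.

```python
GENERAL_STOPWORDS = {"the", "a", "of"}
PATENT_STOPWORDS = {"claim", "comprising", "device", "embodiment"}
SYNONYMS = {"vehicle": ["car", "automobile"]}  # stand-in for an external DB

def query_keywords(tokens, topic_stopwords):
    """Drop general and topic-specific stop-words, then expand the
    remaining words with their semantically related words."""
    keywords = []
    for t in tokens:
        if t in GENERAL_STOPWORDS or t in topic_stopwords:
            continue
        keywords.append(t)
        keywords.extend(SYNONYMS.get(t, []))
    return keywords

print(query_keywords(["a", "device", "comprising", "a", "vehicle", "engine"],
                     PATENT_STOPWORDS))
# -> ['vehicle', 'car', 'automobile', 'engine']
```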
In some embodiments, converting the query into the second text document data may comprise generating at least one query vector. For example, the query vector may comprise information about the keywords of the query. That is, the components of the query vector can correspond to the keywords of the query and/or their semantically related words (e.g. synonyms). Note that in this document "keyword" can refer to the actual words comprised in the query and/or their semantically related words (e.g. synonyms, hypernyms and/or hyponyms). In some such embodiments, the query vector can be generated by identifying the keywords and/or keyword synonyms in the query and identifying those keywords with components of vectors in a multidimensional vector space. In some embodiments, the query vector may comprise 100 to 500 components, preferably 200 to 400 components, even more preferably 200 to 300 components. That is, in some such embodiments, not every keyword and associated semantically related word is associated with a component of the query vector. For example, this may mean that keywords are first evaluated and weighted based on different parameters, and the keywords with low weights are then discarded. This can be particularly advantageous, because reducing the number of keywords contributing to the query vector can significantly reduce the computing power needed to manipulate the query vector, for example when comparing it with document vectors. Note that the document vectors can similarly comprise 100 to 500 components, preferably 200 to 400 components, even more preferably 200 to 300 components. The first document data comprised in the database and associated with the first text documents (in some embodiments comprising the document vectors) can be generated similarly to the query vector: by identifying keywords or semantic units and, based on an entropy associated with them, reducing their number to 100 or a few hundred per first text document.
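A sketch of the truncation idea above: each keyword is scored, low-weight keywords are discarded, and only the top-scoring ones become vector components. The weighting scheme here (raw counts times a per-term weight) is an assumption for illustration; the disclosure leaves the exact parameters open.

```python
from collections import Counter

def query_vector(keywords, weights, max_components=5):
    """Score each keyword, drop the low-weight ones, and keep the top
    `max_components` as the vector components (term -> weight).
    `weights` stands in for a topic-dependent weighting scheme."""
    counts = Counter(keywords)
    scored = {k: c * weights.get(k, 1.0) for k, c in counts.items()}
    top = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return dict(top[:max_components])

weights = {"semantic": 3.0, "search": 2.0}  # hypothetical weights
vec = query_vector(["semantic", "search", "semantic", "method", "the"],
                   weights, max_components=3)
print(vec)  # -> {'semantic': 6.0, 'search': 2.0, 'method': 1.0}
```

In a real implementation `max_components` would be in the 100-500 range described above; it is set to 3 here only to keep the example readable.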
In some preferred embodiments, weights can be assigned to the keywords. In such embodiments, the weights can be assigned based at least partly on the general subject of the query. That is, the same term, keyword and/or semantic unit can be assigned different weights depending on the context or subject of the text document. For example, the term "frequency" can be weighted differently depending on whether the query concerns telecommunications, where it may refer to the frequency of a wave, or pharmaceuticals, where it may refer to how often something occurs. In embodiments where the first text document data comprises document vectors, this can also apply to the document vectors associated with the first text documents. That is, the keywords, terms and/or semantic units comprised in the first text documents, and the semantically related words comprised among these, can be assigned different weights based on the subject. This is particularly advantageous because it enables more meaningful comparisons between the first text documents and the query. Note that the technical area to which a given text document belongs can be determined in several ways. If the documents in question comprise patent documents, their classification can be used; that is, the IPC and/or CPC classes of a given document can be used to assign it to a certain technical area. Another way is to identify terms, keywords and/or semantic units that are particularly common in a certain subject or area (an external database can be used for this purpose), and then assign the text document to a technical area based on the presence of these subject-specific terms.
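The subject-dependent weighting above could be realized with a simple lookup keyed by the technical area, as in the following sketch. The area labels and weight values are hypothetical illustrations of the "frequency" example, not values from the disclosure.

```python
# Hypothetical per-topic term weights illustrating subject-dependent weighting.
TOPIC_WEIGHTS = {
    "telecom": {"frequency": 3.0, "antenna": 2.5},
    "pharma":  {"frequency": 0.5, "dosage": 3.0},
}

def weight(term: str, topic: str) -> float:
    """Return the weight of a term for a given technical area,
    defaulting to a neutral weight of 1.0."""
    return TOPIC_WEIGHTS.get(topic, {}).get(term, 1.0)

print(weight("frequency", "telecom"))  # -> 3.0 (refers to a wave frequency)
print(weight("frequency", "pharma"))   # -> 0.5 (refers to how often)
```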
In some embodiments, calculating the similarity measurement comprises using at least one, or a combination, of a cosine index, a Jaccard index, a Dice index, a containment index, Pearson correlation coefficients, Levenshtein distance, Jaro-Winkler distance and/or the Needleman-Wunsch algorithm. That is, in embodiments where the first text document data notably comprises document vectors and the second text document data comprises a query vector, the two can be compared by calculating the distance between them in the multidimensional vector space. This can be done using several different distance definitions, and different distance definitions can serve different purposes.
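Three of the similarity measures named above can be sketched compactly; cosine operates on weighted vectors, while Jaccard and Dice operate on keyword sets. The sample vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts term -> weight)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def jaccard(a, b):
    """Jaccard index: intersection over union of two keyword sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    """Dice index: twice the intersection over the summed set sizes."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

u = {"semantic": 1.0, "search": 1.0}
v = {"semantic": 1.0, "index": 1.0}
print(round(cosine(u, v), 3))     # -> 0.5
print(round(jaccard(u, v), 3))    # -> 0.333
print(dice(u, v))                 # -> 0.5
```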
In some preferred embodiments, the method for comparing text documents further comprises verifying the at least one similarity measurement using at least one statistical algorithm. The method may further comprise outputting the at least one similarity measurement. Consider again the example of comparing patent documents. Patent applications and/or grants generally include references to other similar information. These references are usually cited in the document itself or later supplied by an examiner. Since the references serve as prior art, this may imply that they are very similar to the document. In this way, the similarity measurement between the query and a certain first text document can be tested by verifying the similarity measurements between the query and the references given in that first text document. If the similarity measurement is reliable, the verification can be expected to yield similar similarity measurements between the query and the references.
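One plausible reading of this verification step, sketched under stated assumptions: if a first text document cites references, the query's score against the document should be consistent with its scores against those references. The tolerance threshold and all scores below are hypothetical inputs, not values from the disclosure.

```python
def verify(sim_query_doc, sims_query_refs, tolerance=0.2):
    """Flag a similarity measurement as plausible if the query-document
    score is within `tolerance` of the mean query-reference score."""
    if not sims_query_refs:
        return True  # nothing to verify against
    mean_ref = sum(sims_query_refs) / len(sims_query_refs)
    return abs(sim_query_doc - mean_ref) <= tolerance

print(verify(0.8, [0.7, 0.75, 0.85]))  # -> True  (consistent with references)
print(verify(0.8, [0.1, 0.2]))         # -> False (inconsistent: re-examine)
```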
In some embodiments, the query can be received from a user interface, and the similarity measurement can be returned via the same interface. This interface may comprise an application, a program and/or a browser-based interface. That is, the method can be implemented as part of a program enabling users to compare the similarity of different text documents quantitatively and reliably.
In some embodiments, the database comprises text documents related to patent documents, and building the database and/or converting the query comprises deleting stop-words associated with patent-related text documents. As described above, these patent-specific stop-words may include words such as "claim", "device", "embodiment", "comprising" and similar words. In some embodiments, the patent-related stop-words can be deleted by calculating an entropy associated with the terms comprised in the first text document data and/or in the query, and deleting the terms with low entropy. This is discussed further below.
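The disclosure does not fix the exact entropy definition; one reading consistent with "deleting the terms with low entropy" is self-information based on document frequency, where a term occurring in nearly every document carries little information and behaves like a collection-specific stop-word. The following sketch and its sample documents are illustrative assumptions.

```python
import math

def term_information(term, documents):
    """Self-information of a term: log2(N / document frequency).
    Terms present in almost every document score near zero ("low
    entropy" in the disclosure's wording) and can be treated as
    collection-specific stop-words."""
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        return 0.0
    return math.log2(len(documents) / df)

docs = [{"claim", "engine"}, {"claim", "pump"}, {"claim", "engine", "valve"}]
print(term_information("claim", docs))             # -> 0.0 (ubiquitous: delete)
print(round(term_information("valve", docs), 3))   # -> 1.585 (informative: keep)
```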
In some preferred embodiments, the method may further comprise generating a term vector comprising keywords extracted from the plurality of first text documents. That is, the term vector can be generated based on the first text document data comprised in the database and associated with the first text documents. The term vector can be based on all the keywords, terms and/or semantic units comprised in all the first text documents. In embodiments where the first text document data comprises document vectors and the second text document data comprises a query vector, the components of the document vectors and of the query vector can be generated relative to the components of the term vector. That is, the term vector can provide the common ground against which the query and the first text documents are compared. In other words, the term vector can define the multidimensional vector space relative to which the comparison is carried out. This is particularly advantageous because it allows quantitative mathematical comparisons between different text documents.
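The shared term vector idea can be sketched as follows: one ordered list of terms, extracted from the whole collection, fixes the axes against which every document vector and query vector is expressed. The term list and counts below are hypothetical.

```python
# Hypothetical term vector extracted from the whole collection: it fixes
# the axes of the multidimensional vector space shared by all documents.
TERM_VECTOR = ["engine", "pump", "valve", "semantic"]

def to_components(keyword_counts):
    """Project a document's keyword counts onto the shared term vector."""
    return [float(keyword_counts.get(t, 0)) for t in TERM_VECTOR]

doc = to_components({"engine": 2, "valve": 1})
query = to_components({"engine": 1, "semantic": 3})
print(doc)    # -> [2.0, 0.0, 1.0, 0.0]
print(query)  # -> [1.0, 0.0, 0.0, 3.0]
```

Because both vectors live in the same space, any of the distance definitions discussed above can be applied to them directly.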
In some embodiments, the similarity measurement between the second text document data and the first text document data can be calculated by using the cosine index to compute the distance between the query vector and a document vector. As described above, the cosine index can be used to calculate distances in the multidimensional vector space. This is particularly advantageous because it can be reduced to the inner product of two vectors. Since this operation is easy to perform, it can substantially reduce the computation time of the comparison.
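The reduction to an inner product can be made concrete: if document vectors are normalized to unit length once, at database-build time, then each query-time comparison collapses to a single dot product. The sample vectors are illustrative.

```python
import math

def unit(v):
    """Normalize a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# With pre-normalized vectors, cosine similarity IS the inner product,
# which is why each query-time comparison is cheap.
doc = unit([1.0, 2.0, 2.0])
query = unit([2.0, 1.0, 2.0])
print(round(dot(doc, query), 4))  # -> 0.8889
```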
In a second embodiment, the invention discloses a computer-implemented method for processing similarity in text documents. The method comprises harmonizing at least one incoming query. It further comprises normalizing the at least one harmonized incoming query. The method also comprises building at least one query vector using the at least one normalized harmonized query. The method further comprises calculating at least one similarity measurement between the at least one query vector and at least one other text document, wherein the at least one other text document has undergone the preceding steps.
Note that the other text document can also be referred to as a first text document. Having undergone the preceding steps can mean that the other text document, or first text document, has been harmonized and normalized and that a document vector has been built for it.
This method advantageously allows an arbitrary query consisting of text to be converted into data that can be quantitatively compared with other data, so that the similarity of the query to that other data can be assessed. Preferably, this is performed by a computing device that holds data associated with various text documents in its memory and can retrieve that data in order to compare it with the incoming query. The text of the query can then be analyzed by algorithms implemented on the computing device using various techniques.
In some preferred embodiments, the text documents may comprise at least one, or a combination, of technical texts, scientific texts, patent texts and/or product descriptions.
In some embodiments, harmonization may comprise correcting typing errors, selecting specific spelling and physical-unit conventions and adjusting the text accordingly, and/or expressing certain content in a standard fashion (for example chemical formulas, gene sequences and/or protein representations).
In some embodiments, normalization may comprise identifying and deleting stop-words, reducing words to common stems, analyzing synonym stems, and/or identifying term sequences and compound words. In such embodiments, normalization can further comprise identifying and deleting stop-words associated with a certain type of text document, preferably by calculating the entropy of keywords across a plurality of text documents of that type and deleting the keywords with low entropy.
In some embodiments, calculating the similarity measurement may comprise using at least one, or a combination, of a cosine index, a Jaccard index, a Dice index, a containment index, Pearson correlation coefficients, Levenshtein distance, Jaro-Winkler distance and/or the Needleman-Wunsch algorithm. Such algorithms allow quantitative comparisons between text documents based on the distance, in the multidimensional vector space, between the data generated from the text documents.
In some embodiments, the method may further comprise verifying the at least one similarity measurement using at least one statistical algorithm. It may further comprise outputting the at least one similarity measurement.
Note that the first and second embodiments can be complementary. That is, embodiments presented as part of the first embodiment can form part of the second embodiment, and vice versa.
In a third embodiment, the invention discloses a computer-implemented system. The system comprises at least one memory component adapted to store at least a database comprising a plurality of first text document data associated with first text documents. The system can also comprise at least one input device adapted to receive a query. The query comprises a second text document and/or information identifying a second text document. The second text document is associated with second text document data comprised in the first text document data stored in the memory component. The system further comprises at least one processing component adapted to convert the query into second text document data and/or to retrieve second text document data associated with the query from the database stored in the at least one memory component. The processing component is further adapted to compare the second text document data with the first text document data stored in the at least one memory component. The system also comprises at least one output device adapted to return information identifying at least one similar first text document associated with the first text document data, the similar first text document being, among the first text documents, the most similar to the query.
Note that the query preferably takes one of two forms. In the first form, the query may comprise the second text document, in which case this second text document can be converted appropriately and associated with the second text document data. In the second form, the query may comprise a reference to a second text document already received into the database. For example, if the database comprises patent documents, the query may comprise a patent application number or grant number identifying a specific second text document. This can serve as the "information identifying the second text document". In the first case, the second text document data then comprises data associated with the second text document comprised in the query. In the second case, the second text document data can be retrieved from the database based on the identifying information of the query, and the second text document data can be comprised in the first text document data.
In other words, the system described herein is configured to receive any text-based query as input via the input device, verify whether the query can be associated with text document data already stored in memory, and then either retrieve that data or convert the query into such data. The system is further configured to compare the query with the other documents stored in memory. The comparison can be implemented by the processing component via different algorithms. The system can also output the comparison result via the output device, in the form of the text documents most closely associated with the query. The comparison itself can be performed at the level of the converted data (which, as outlined above and below, may comprise points in the multidimensional vector space), while the input and output can comprise the actual text documents or their identifiers (such as the title of a paper, or a patent number).
In some embodiments, the first text document data may comprise a plurality of document vectors and the second text document data may comprise a query vector. Note that, referring again to the two possible forms of the query, the query vector can either be generated from the text of the second text document comprised in the query, or be retrieved from the database. In the latter case, since the query vector is already stored in the database, it can be one of the document vectors. For clarity and consistency, the term "query vector" is used here for both cases. In preferred embodiments, each first text document can be associated with a document vector storable in the database. The database can store the first text documents together with the respective document vectors, or only the document vectors.
In some embodiments, the memory component may comprise first text document data associated with scientific papers and/or technical specifications and/or patent documents and/or product descriptions. In other words, the first text documents may comprise patent documents, scientific papers, and/or technical specifications. Preferably, the database may comprise at least first text document data related to patent documents.
In some embodiments, the second text document data can be obtained by coordinating and normalizing the second text document and creating at least one query vector. Coordination and normalization are described in more detail above and below.
In some embodiments, comparing the first text document data with the second text document data can yield a similarity index. In some such embodiments, the output component can return information associated with a plurality of first text documents, ordered by similarity index from most similar to least similar, starting with the first text document whose first text document data yields the highest similarity index with the second text document data. That is, the system may be adapted to output a list comprising a certain number of first text documents most similar to the query. This is particularly advantageous in the case where the first text documents comprise patent documents, as a method of performing a prior-art search. It should be noted that the output first text documents can be stored in the database, and/or be output as information identifying them (such as patent application or grant numbers), and/or be output as links to an external database where the documents are accessible. Furthermore, it can also be advantageous to output certain parts of the most similar first text documents. For example, one of the title and/or the abstract and/or the figures can be output.
In some embodiments, the similarity index can be based on a lexical and/or semantic comparison between the text documents. That is, the similarity index can quantitatively indicate the similarity between texts. This can, for example, refer to the number of keywords and/or semantic units present in both the query and a first text document. It should be noted that the similarity index can be obtained, for example, by calculating the distance between vectors in a vector space. However, the vectors themselves can be obtained based on lexical and/or semantic parameters. Therefore, the similarity index can also be considered to be based on those parameters.
In some embodiments, the processing component can identify keywords during the coordination and normalization of the incoming second text document. Keywords may comprise words significantly related to the content of the text document. Keywords may comprise word stems (obtained as part of normalization), compound words, and/or strings of semantically connected words. Keywords may also comprise words that are not actually present in the text document but are synonyms of, or otherwise semantically linked to, words comprised in the text document.
In some embodiments, the processing component can assign weights to keywords based on an entropy algorithm. That is, some keywords can be ranked higher due to the frequency with which they occur in the literature and/or their relevance in a particular technical field. In this case, the weights assigned to the keywords can be used when comparing the first text document data with the second text document data. That is, compared with keywords having lower weights, keywords with higher weights can contribute more to the similarity and/or similarity index between documents. This is particularly advantageous since the frequency and the specific meaning of a word within its context are taken into account when determining the similarity between texts. This can result in a more robust comparison metric.
In some embodiments, the processing component may be adapted to divide the second text document into at least two parts, preferably at least four parts, for parallelized computation. This is advantageous since it allows the processing speed to be improved and is therefore more efficient.
In some embodiments, the processing component may comprise at least two, preferably at least four, more preferably at least eight cores. This can further improve the speed at which queries can be processed.
In some embodiments, the processing component may be adapted to regularly update the first document data stored in the memory component. That is, the database can be updated with new first text documents.
In some embodiments, the input component can be further adapted to allow specifying the query by listing words and/or sentences that must and/or must not be comprised in the similar text documents. In other words, considering again the example of a prior-art search, it can be particularly useful to specify that certain words or statements must be comprised in the text documents similar to the query. Additionally or alternatively, it can be very useful to specify words that must not be comprised in the similar text documents.
In some embodiments, the input component can be further adapted to allow specifying the query by indicating the number of most similar text documents to be output.
In some embodiments, the memory component may comprise RAM (random access memory). This is discussed further in conjunction with Fig. 1.
In some embodiments, the memory component may further comprise a generated term vector, the term vector comprising keywords extracted from the plurality of first text documents. The term vector is described above in conjunction with the first embodiment. In some such embodiments, the processing component may be adapted to generate the components of the document vectors and of the query vector relative to the components of the term vector. In some such embodiments, wherein the first text document data comprises document vectors and the second text document data comprises a query vector, the processing component may be adapted to compare the second text document data with the first text document data using the cosine index to calculate the distance between the query vector and the document vectors.
In the following, one embodiment of the invention is discussed in more detail. In particular, it illustrates the concept of entropy as it can be used in the context of the invention, and gives one way of quantifying the similarity between different texts.

The entropy E(t) can be used to remove patent-specific stop-words, that is, words such as "claim", "device", "invention", "comprising" or other similar words. The following expression can be used:

In the above expression, n refers to the total number of patents and/or documents, i and j are indices referring to the patents and/or documents, f_it denotes the frequency of the term t in patent and/or document i, and the sum over f_jt refers to the frequency of the term t in all patents and/or documents. The value of E(t) falls between zero and one. Terms that are distributed very specifically and unevenly between the documents can be weighted with a high entropy. The higher the entropy, the more information the term can convey. Patent-specific stop-word lists can be calculated separately for the abstract, the claims, the title, the description, and for the combination of all of them. Since the claims of a patent are drafted very differently from, for example, the description, this distinction is important.
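The expression for E(t) itself is not reproduced in this text. A common entropy-weighting form consistent with the surrounding description (a value between zero and one that is high for terms concentrated in few documents) is E(t) = 1 + (1/log n) * sum_i p_it log p_it, with p_it = f_it / sum_j f_jt. The sketch below uses that form as an assumption; the exact expression of the patent may differ:

```python
import math

def entropy_weight(freqs):
    """E(t) = 1 + (1/log n) * sum_i p_it * log(p_it), with p_it = f_it / sum_j f_jt.
    freqs[i] is the frequency f_it of term t in document i.
    Returns a value in [0, 1]: 1 when the term appears in a single document,
    0 when it is spread evenly over all n documents."""
    n = len(freqs)
    total = sum(freqs)
    if total == 0 or n < 2:
        return 0.0
    acc = 0.0
    for f in freqs:
        if f > 0:
            p = f / total
            acc += p * math.log(p)
    return 1.0 + acc / math.log(n)

# A term spread evenly over 4 documents carries little information ...
print(round(entropy_weight([5, 5, 5, 5]), 6))   # -> 0.0
# ... while a term concentrated in one document carries the most.
print(round(entropy_weight([20, 0, 0, 0]), 6))  # -> 1.0
```

Under this form, very common words ("claim", "device") spread over nearly all patent documents score close to zero and can be discarded as stop-words.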
After the keywords have been identified by deleting the various stop-words and blocked words, they can be implemented in a vector space model. Documents may then be represented as objects in a multidimensional space. The dimensions can be characterised by the keywords or terms. In this way, each document can be described as a point and/or vector in the multidimensional space. The value of each component of this point can indicate the number of times the particular keyword or term is encountered in the document. In this way, a term vector T can be created which comprises, exactly once, all terms or keywords of all considered documents:

T = (t_1, t_2, ..., t_m)
That is, in total m terms or keywords may be comprised in all considered first text documents. Based on this vector, a term-document matrix (TDM) can be generated. The TDM can comprise, for each of the n documents and/or patents, a row vector indicating the weights of each component of the term vector T:

This means that a document i can be described by a numerical weight vector d_i, which can be referred to as the document vector. The document vector can relate to the weights as follows:

d_i = (w_i1, ..., w_im)
A shortened document vector in Boolean representation can, for example, look as follows:

d_i = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

Since the term vector comprises every term or keyword of all documents exactly once, most of the weight elements w_it of a document vector are zero. This leads to two problems when operating the vector space model: first, the null values occupy unnecessary memory; second, the manipulation of the vectors causes unnecessary multiplications by null values during the comparison of the text documents. It is therefore more advantageous and more practical to represent the document vector d_i as a set of coordinate-weight pairs (c_it; w_it). The document vector stated above can then be written as:

d_i = {(10; 1), (11; 1), (14; 1), (18; 1), (19; 1)}

The first part of each doublet indicates the coordinate c_it and describes the position and/or index in the term vector T. In this representation, the TDM matrix can comprise a doublet as each of its elements w_ij and can be considered a tensor.
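The coordinate-weight (doublet) representation described above can be sketched as follows. The helper names are illustrative assumptions; coordinates are counted from 1 so that the output matches the example d_i given in the text:

```python
def to_doublets(dense):
    """Store a mostly-zero document vector as (coordinate, weight) pairs.
    Coordinates are 1-based positions in the term vector T, matching
    the example in the text."""
    return [(i + 1, w) for i, w in enumerate(dense) if w != 0]

def dot_sparse(a, b):
    """Dot product of two doublet lists; zero entries are never touched."""
    b_map = dict(b)
    return sum(w * b_map.get(c, 0) for c, w in a)

# The shortened Boolean document vector from the text:
dense = [0] * 30
for pos in (10, 11, 14, 18, 19):
    dense[pos - 1] = 1

doublets = to_doublets(dense)
print(doublets)                        # -> [(10, 1), (11, 1), (14, 1), (18, 1), (19, 1)]
print(dot_sparse(doublets, doublets))  # -> 5
```

Only the non-zero entries are stored and multiplied, which addresses both of the problems (memory and useless multiplications) noted above.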
In this way, each document can be expressed as a vector in a vector space. In general, the term vector of an entire collection or database of documents may comprise a million or more components. However, each document can be converted into a document vector of about 100-500 components. That is, the number of keywords per document can be reduced in this way, so that a document vector may comprise about 100-500 keywords.

The vector space approach makes it possible to quantify different text documents based on the keywords present in the texts, by associating them with points and/or vectors in the multidimensional vector space. The differences between texts can then be compared by calculating their proximity in the vector space. This can, for example, be done using the cosine index CI.
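The cosine index CI referred to above is not reproduced in this text; a standard form consistent with the description is CI = (q . d) / (|q| * |d|), which is 1 for documents with identical keyword profiles and 0 for documents sharing no keywords. The sketch below assumes that form:

```python
import math

def cosine_index(q, d):
    """CI = (q . d) / (|q| * |d|) for two dense keyword-weight vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    nq = math.sqrt(sum(a * a for a in q))
    nd = math.sqrt(sum(b * b for b in d))
    if nq == 0 or nd == 0:
        return 0.0
    return dot / (nq * nd)

print(round(cosine_index([1, 1, 0], [1, 1, 0]), 9))  # identical profiles -> 1.0
print(round(cosine_index([1, 0, 0], [0, 1, 1]), 9))  # no shared keywords -> 0.0
```

Because CI depends only on the angle between the vectors, it is insensitive to document length, which makes it a natural proximity measure in this keyword space.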
Brief Description of the Drawings
The skilled person will understand that the drawings described below are for illustration purposes only. The drawings are not intended to limit the scope of the present teaching in any way.

Fig. 1 shows an embodiment of a device for semantic search according to an aspect of the invention.

Fig. 1b schematically depicts an embodiment of converting a query into text document data.

Fig. 1c schematically depicts an embodiment of a visualization of the vector space model.

Fig. 2 depicts an embodiment of a method for semantic search according to an aspect of the invention.
Detailed Description

In the following, exemplary embodiments of the invention are described with reference to the drawings. These examples are provided to give a further understanding of the invention, without limiting its scope.

In the following description, a series of features and/or steps are described. The skilled person will appreciate that, unless required by the context, the order of the features and steps is not critical for the resulting configuration and its effect. Further, it will be apparent to the skilled person that, irrespective of the order of the features and steps, time delays may or may not be present between some or all of the described steps.
Referring to Fig. 1, an example of a setup of the invention is shown. The drawing depicts a computer-implemented system 10 according to an aspect of the invention.
The computer-implemented system 10 comprises a memory component 20. The memory component 20 may comprise standard computer memory (such as RAM). Additionally or alternatively, the memory component 20 may comprise non-volatile memory, such as a hard disk drive, memory on a server, flash memory, an optical drive, FeRAM, CBRAM, PRAM, SONOS, RRAM, racetrack memory, NRAM, 3D XPoint, and/or millipede memory.
The memory component 20 may comprise first text document data 21. The first text document data 21 may comprise document vectors. The document vectors can be constructed from text documents. That is, each text document can be mapped to a document vector via the keywords identified in the document. One document vector may comprise 100-500 components (that is, dimensions) comprising individual keywords.
The computer-implemented system 10 can further comprise a processing component 30. The processing component 30 may be adapted to receive second text document data 31 and compare it with the first document data 21. The second text document data 31 can also comprise a document vector. It may, for example, comprise a user-defined query and/or a user-provided identification of a text document (such as a patent number). The second text document data 31 may comprise a document vector that is already part of the first text document data 21. For example, a user interface can be used to search for patents and/or patent applications similar to a specific patent and/or patent application that is already part of the database of the computer-implemented system 10 (that is, already part of the first text document data 21 in the memory component 20).
The processing component 30 may be adapted to receive a query 41 from an input component 40. That is, the query 41 can, for example, be typed in via a user interface in an application, a program, and/or a browser-based interface, which in this case serves as the input component 40. The query 41 may comprise the text of a second text document and/or a specific identification (as described above, this may, for example, comprise a patent and/or patent application number). Once the query 41 has been received, the processing component 30 can convert it into second text document data 31, for example by identifying all keywords in the query, deleting stop-words and blocked words, and generating a document vector of the query. As described above, if the query identifies a document that is already part of the database (of the first text document data 21) in the memory component 20, the processing component 30 can simply retrieve the document vector linked to it as the second text document data 31. The processing component 30 can then compare the second text document data 31 with all the first text document data in the memory component 20. Preferably, the most similar documents can be identified (via their respective document vectors) based on the distance between the document vectors in the multidimensional vector space.
Once the most similar documents within the first text document data 21 have been identified, the processing component can send the result to an output component 50. The output component 50 can then output at least one similar first text document 51 associated with the first text document data 21 most similar to the query 41. Of course, the output component 50 can also output a plurality of similar first text documents 51 sorted based on their similarity to the query 41. For example, the output component 50 may comprise an interface accessible via a computing device, such as a program, an application, and/or a browser-based interface.
Fig. 1b schematically depicts an embodiment of converting a query 41 into text document data. This process can be carried out in a processing component 30, which may comprise, for example, a CPU associated with a computing device. Additionally or alternatively, the processing component may, for example, comprise multiple CPUs for parallel processing and/or a CPU with multiple cores. The query 41 can be sent from the input component 40 (not shown here) to the processing component 30. The query 41 can first be coordinated to obtain a coordinated query 43. The process of coordination is described above. The coordinated query 43 can then be normalized in order to obtain a normalized coordinated query 45. The process of normalization is also described in more detail above.
The normalized coordinated query 45 (which is simply the coordinated query 43 after normalization) can then be converted into a query vector 47. The query vector 47 can be generated by combining the keywords or "terms" of the normalized coordinated query 45 with components or dimensions of the multidimensional vector space. The query vector 47 can then be compared with document vectors 27, which can be stored in the memory component 20 (not shown here).
It should be noted that the document vectors 27 can refer to the first text document data 21 in this document. For clarity, the term "document vectors" can be used so that the technical reader understands that a plurality of different document vectors is meant. The comparison between the query vector 47 and the document vectors 27 can, for example, be based on the distance in the multidimensional vector space. Of course, for this comparison, both the query vector 47 and the document vectors 27 should be in the same vector space, that is, a space defined by the same dimensions. To achieve this, the database comprised in the memory component 20 (not shown) may comprise a term vector. The term vector may comprise one component, or one dimension, for each term or keyword present in all first text documents stored in the database. The query vector 47 and the document vectors 27 can then indicate, relative to the dimensions or components of the term vector, which keywords or terms are present in each specific document or in the query 41. In this way, a unique and consistent vector space is generated. This is explained in more detail above.
Fig. 1c schematically depicts an embodiment of a visualization of the vector space model. It should be noted that this illustration serves clarity purposes only and does not correspond to the mathematical description of the vector space model. The term vector 7 is schematically shown as a circle. The term vector 7 may comprise a plurality of keywords or terms. These keywords or terms can be extracted from a plurality of text documents. In a preferred embodiment, the term vector 7 comprises all keywords from all text documents comprised in the database (that is, all keywords from the first text documents). This is indicated by the large circle in the drawing. The query vector 47 can be generated from the keywords of the query 41 (not shown here). It should be noted that, in this schematic, the query vector 47 is completely contained within the term vector 7, meaning that all keywords comprised in the query 41 are also contained in the first text documents, which are comprised in the database and from which the term vector 7 is generated. However, this is not necessarily the case. It is also entirely possible that the query 41 comprises keywords not comprised in the first text documents, so that the query vector 47 need not lie completely within the vector space generated by the keywords of the term vector 7. If this is the case, however, the keywords of the query 41 not comprised in the term vector 7 will have no similarity with any first text document, and they can therefore be ignored for the purpose of finding the most similar first text documents. Therefore, the query vector 47 can be considered to be generated using only the keywords already considered in the term vector 7. It should be noted that synonyms of keywords can be used for the semantic similarity comparison.
The document vector 27 is depicted as having an intersection with the query vector 47. This means that they comprise some identical keywords and/or synonyms thereof. Therefore, a non-zero similarity measure can be generated between the query vector 47 and the document vector 27. The document vector 27', however, is depicted as having no intersection with the query vector 47. This means that the query 41 and the text document associated with the document vector 27' do not share any keywords or synonyms thereof. This may mean that a null similarity measure can be assigned to the query vector 47 and the document vector 27'.
Fig. 2 schematically illustrates an embodiment of a method for the semantic processing of similarities in text documents according to an aspect of the invention. The figure shows a flow chart describing the steps of comparing an incoming document with an existing pool or database of stored documents.
As an example scenario, consider a user with a certain text, which can, for example, be a patent and/or a patent application. The user requires a so-called "prior-art search". That is, the user needs to obtain or search for other patent documents whose content is close to the text in their possession. The user can then use the invention in the following way. They can send or upload the text document in question to the system. This can, for example, be done via an interface. In one embodiment, the system as described herein may comprise an application-based or browser-based interface for receiving queries. The user can then use the interface to send the query to the system, at which point the following steps can take place.
In S1, the incoming text document or query can be coordinated. That is, misspellings can be corrected. Furthermore, the spelling can be normalized. For example, one convention can be selected from the British and the American spelling conventions, and all words that differ between the two conventions can be converted into the selected one. That is, words such as "colour" and "theatre" can be converted into "color" and "theater", or vice versa. Furthermore, coordination may comprise converting different physical units into one standard and/or one specific physical unit. For example, inches can be converted into metres, pounds can be converted into kilograms, and so on. Furthermore, coordination may comprise converting formulas, such as chemical formulas, gene sequences, and/or protein representations, into standard notation.
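The coordination step can be illustrated with a minimal sketch. The spelling table and unit conversions below are toy assumptions standing in for the full dictionaries such a system would use:

```python
# Illustrative coordination: harmonise spelling variants and physical units
# before vectorisation. Mapping tables here are toy assumptions.

SPELLING = {"colour": "color", "theatre": "theater"}
INCH_TO_M = 0.0254

def coordinate(text):
    out = []
    tokens = text.lower().split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # Unit conversion: "<number> inch(es)" -> metres.
        if i + 1 < len(tokens) and tokens[i + 1] in ("inch", "inches"):
            out.append(f"{float(tok) * INCH_TO_M:.4f} m")
            i += 2
            continue
        # Spelling harmonisation (British -> American here).
        out.append(SPELLING.get(tok, tok))
        i += 1
    return " ".join(out)

print(coordinate("Colour display of 2 inch width"))
# -> color display of 0.0508 m width
```

After this step, "colour" and "color", or "2 inch" and "0.0508 m", map to the same keywords and are no longer counted as different terms.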
In S2, the incoming text document can be normalized. This may comprise isolating the stop-words comprised in the text of the document and deleting them. Stop-words may comprise words such as "and", "first", "however". Stop-words can also be specific to the type of text document to be analyzed. For example, patent documents comprise words present in most patent text documents, such as "claims", "embodiment", "device". Such words can likewise be identified and deleted during the normalization step.
Furthermore, normalization may comprise reducing words to their stems. That is, words such as "computer" and "computing" can, for example, be reduced to their common stem. The stems can then be analyzed for synonyms. Furthermore, word sequences and compound words can be recognized during the normalization step. That is, a word such as "folder" can be recognized and not separated for stemming purposes, so as to keep the meaning of the compound word together.
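The normalization step (stop-word deletion and stemming) can likewise be sketched minimally. The word lists and the crude suffix-stripping rule below are illustrative assumptions, not the actual normalization used:

```python
# Illustrative normalisation: remove generic and patent-specific stop-words,
# then reduce the remaining words to a crude stem.

STOP_WORDS = {"and", "first", "however", "the", "a", "of"}
PATENT_STOP_WORDS = {"claim", "claims", "embodiment", "apparatus"}
SUFFIXES = ("ation", "ing", "er", "s")  # checked longest-ish first

def stem(word):
    """Strip one known suffix, keeping at least a 4-letter stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            return word[: -len(suf)]
    return word

def normalize(tokens):
    kept = [t for t in tokens if t not in STOP_WORDS | PATENT_STOP_WORDS]
    return [stem(t) for t in kept]

print(normalize(["the", "computer", "and", "computing", "claims", "device"]))
# -> ['comput', 'comput', 'device']
```

Note how "computer" and "computing" collapse to the common stem "comput", exactly the behaviour described above; a production system would use a proper stemmer instead of this toy rule.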
In S3, a document vector can be constructed using the previously coordinated and/or normalized text document. The document vector can be a multidimensional vector comprising information about which "terms", that is, stems and their synonyms, are comprised in the text document. This is further illustrated above. It should be noted that, in some embodiments, the document vector may also comprise a tensor.
In S4, the generated document vector can be used to calculate a similarity measure between the incoming text document and the stored text documents. That is, the incoming text document, or rather its document vector, can be compared with a database comprising text documents previously converted into document vectors. It should be noted that, in order to compare different document vectors with a common baseline, there may be one "term vector" comprising all "terms" (that is, words and/or stems and/or synonyms) comprised in all text documents of the database.
Each document vector can then simply indicate which of the terms comprised in the term vector are present in the given document. The term vector can then define a multidimensional vector space, wherein each term may comprise one dimension. Each document vector can be represented, or visualized, as a point or vector in this multidimensional vector space. To compare the document vector generated from the incoming text document with each document vector comprised in the database, the distance between them can be calculated. It should be noted that calculating the distance between vectors in the vector space can be one way, or part of one way, of obtaining a similarity measure between the incoming document and the stored text documents. However, other ways of doing so, based on lexical and/or semantic analysis, may also exist. Furthermore, further variables may also be included in the similarity measure. For example, keywords can be weighted based on the frequency with which they occur in a document and/or based on the technical field of the document; the weights can then be integrated into the document vectors and thus play a role in the similarity measure. Furthermore, further variables of the text documents can be used. In the specific example of patent documents, these may comprise IPC classes, CPC classes, applicants, inventors, patent attorneys, citations, references, co-citation and co-reference information, and image information.
In S5, the similarity measure can be output. For example, several text documents can be output, sorted by their similarity measure with respect to the originally input text document or query. Returning to the example of the application-based and/or browser-based interface given above, the similarity measure can be output via the same interface. That is, for example, a list of text documents similar to the incoming text document or query can be displayed by the application and/or browser, ordered in some way, such as starting from the most similar document. It should be noted that "outputting the similarity measure" can herein refer to outputting at least the one or more documents determined to be most similar to the query.
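The output step can be sketched as a simple ranking of stored documents by their similarity measure; the document identifiers and scores below are made up for illustration:

```python
# Illustrative output step: return the N stored documents ranked by their
# similarity to the query, most similar first.

def rank_results(scores, top_n):
    """scores: {document_id: similarity_measure}. Returns the top_n pairs."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ordered[:top_n]

scores = {"EP100": 0.91, "US200": 0.42, "WO300": 0.77}
print(rank_results(scores, 2))  # -> [('EP100', 0.91), ('WO300', 0.77)]
```

In the patent scenario, the returned identifiers would be patent application or grant numbers, optionally accompanied by the title, abstract, or figures of each hit.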
As used herein, including in the claims, unless the context indicates otherwise, singular forms of the terms are to be construed as also including the plural form, and vice versa. Hence, it should be noted that, as used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise.
Throughout the description and claims, the terms "include", "comprise", "have", "contain" and their variations should be understood as meaning "including but not limited to", and are not intended to exclude other components.
Where terms, features, values and ranges etc. are used in conjunction with terms such as about, around, generally, substantially, essentially, at least etc. (that is, "about 3" shall also cover exactly 3, and "substantially constant" shall also cover strictly constant), the present invention also covers the exact terms, features, values and ranges etc.
The term "at least one" should be understood as meaning "one or more", and therefore includes both embodiments comprising one and embodiments comprising a plurality of components. Furthermore, dependent claims referring to independent claims describing features with "at least one" have the same meaning both when the feature is referred to as "the" and as "the at least one".
It will be appreciated that variations to the foregoing embodiments of the invention can be made while still falling within the scope of the invention. Unless stated otherwise, features disclosed in the description can be replaced by alternative features serving the same, an equivalent, or a similar purpose. Thus, unless stated otherwise, each feature disclosed represents one example of a generic series of equivalent or similar features.
Unless otherwise indicated, the use of exemplary language, such as "for instance", "such as", "for example" and the like, is merely intended to better illustrate the invention and does not indicate a limitation on the scope of the invention. Any steps described in the description can be performed in any order or simultaneously, unless the context clearly indicates otherwise.
All of the features and/or steps disclosed in the description can be combined in any combination, except for combinations in which at least some of the features and/or steps are mutually exclusive. In particular, preferred features of the invention are applicable to all aspects of the invention and can be used in any combination.
Claims (45)

1. A computer-implemented method for comparing text documents, comprising the following steps:
a) establishing a database comprising first text document data (21) associated with a plurality of first text documents; and
b) receiving a query (41); and
c) converting the query (41) into second text document data (31); and
d) comparing the second text document data (31) with the first text document data (21) and calculating at least one similarity measure between the second text document data (31) and the first text document data (21).
2. The method according to the preceding claim, wherein the first text document data (21) comprises document vectors (27) generated from the keywords comprised in the first text documents and/or from words semantically related to said keywords.

3. The method according to any of the preceding claims, wherein the query (41) comprises a second text document and/or information identifying a second text document associated with the second text document data (31) comprised in the first text document data (21) stored in the memory component (20).
4. method according to any of the preceding claims, wherein the inquiry (41) is converted to second text
This document data (31) includes coordinating the inquiry (41).
5. method according to any of the preceding claims, wherein the inquiry is converted to the second text text
File data (31) includes normalizing the inquiry (41).
6. the method according to preceding claims, wherein the normalization inquiry (41) include from external data base at least
Synonym, hypernym, hyponym, stop-word and/or the specific stop-word of theme are retrieved, and is based at least partially on and is retrieved
The word arrived generates the lists of keywords of the inquiry (41).
7. the method according to preceding claims, wherein by delete stop-word and/or the specific stop-word of theme and
At least one of synonym, hypernym and hyponym of word comprising inquiry arrange to generate the keyword of the inquiry (41)
Table.
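Claims 6 and 7 describe generating a keyword list by deleting stop words and expanding the remaining words with semantically related terms. A minimal sketch, assuming small in-memory word lists in place of the external database; all word lists and names here are illustrative:

```python
# Sketch of claims 6-7: delete (subject-specific) stop words, then expand the
# surviving words with synonyms/hypernyms retrieved from a word database.
STOP_WORDS = {"a", "the", "for", "of", "with"}
TOPIC_STOP_WORDS = {"method", "apparatus"}        # subject-specific stop words
RELATED_WORDS = {"car": ["automobile", "vehicle"]}  # stand-in for external DB

def keyword_list(query):
    keywords = []
    for word in query.lower().split():
        if word in STOP_WORDS or word in TOPIC_STOP_WORDS:
            continue                              # delete stop words
        keywords.append(word)
        keywords.extend(RELATED_WORDS.get(word, []))  # add related words
    return keywords

kws = keyword_list("a method for steering the car")
```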
8. The method according to any preceding claim, wherein converting the query (41) into the second text document data (31) comprises generating at least one query vector (47).
9. The method according to the preceding claim, wherein the query vector (47) is generated by identifying keywords and/or synonyms of keywords from the query (41) and representing the keywords by components of vectors in a multi-dimensional vector space.
10. The method according to the preceding claim, wherein the query vector (47) comprises 100 to 500 components, preferably 200 to 400 components, even more preferably 200 to 300 components.
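Claims 8 to 10 describe a query vector with a fixed number of components in a multi-dimensional vector space. The sketch below fixes 300 components (within the preferred 200 to 300 range) and uses the hashing trick to map each keyword to a component; the hashing scheme is an assumption for illustration, not part of the claims:

```python
# Sketch of claims 8-10: a fixed-size query vector over keyword components.
N_COMPONENTS = 300  # within the preferred 200-300 component range

def query_vector(keywords):
    vec = [0.0] * N_COMPONENTS
    for kw in keywords:
        # Hashing trick: each keyword increments one of the 300 components.
        vec[hash(kw) % N_COMPONENTS] += 1.0
    return vec

qv = query_vector(["steering", "car", "automobile"])
```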
11. The method according to any preceding claim having the features of claim 9, wherein weights are assigned to the keywords.
12. The method according to the preceding claim, wherein the weights are assigned based at least in part on the general subject of the query (41).
13. The method according to any preceding claim, wherein calculating the similarity measure comprises applying at least one of the following, or a combination thereof: cosine index, Jaccard index, Dice index, containment index, Pearson correlation coefficient, Levenshtein distance, Jaro-Winkler distance and/or the Needleman-Wunsch algorithm.
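Several of the similarity measures named in claim 13 (cosine, Jaccard, Dice, and the Levenshtein edit distance) admit compact textbook implementations. The versions below are illustrative sketches, not the patent's own code:

```python
# Textbook implementations of four of the measures named in claim 13.
import math

def cosine(a, b):
    # Cosine index between two numeric vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def jaccard(s, t):
    # Jaccard index between two term sets.
    return len(s & t) / len(s | t) if s | t else 0.0

def dice(s, t):
    # Dice index between two term sets.
    return 2 * len(s & t) / (len(s) + len(t)) if s or t else 0.0

def levenshtein(s, t):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]
```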
14. The method according to any preceding claim, further comprising the following steps after step d):
f) validating the at least one similarity measure using at least one statistical algorithm; and
g) outputting the at least one similarity measure.
15. The method according to the preceding claim, wherein the query (41) is received from a user interface and the similarity measure is returned via said interface.
16. The method according to any preceding claim, wherein the database comprises text documents related to patent documents, and wherein building the database and/or converting the query (41) comprises deleting stop words associated with the patent-related text documents.
17. The method according to the preceding claim, wherein the patent-related stop words are deleted by calculating an entropy associated with the terms contained in the first text document data (21) and/or in the query (41) and deleting terms having a low entropy.
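The entropy-based deletion of claim 17 can be sketched as follows. The claim does not pin down the exact entropy definition, so the self-information measure below, under which terms occurring in (almost) every document score near zero ("low entropy"), is an assumption chosen so that ubiquitous patent terms are the ones deleted:

```python
# Sketch of claim 17: flag low-entropy terms across a document collection as
# patent-specific stop words. The threshold and entropy definition are
# illustrative assumptions.
import math
from collections import Counter

def low_entropy_terms(documents, threshold=0.5):
    n = len(documents)
    doc_sets = [set(d.lower().split()) for d in documents]
    df = Counter(t for s in doc_sets for t in s)   # document frequency
    # Self-information of seeing the term in a document: terms present in
    # (almost) every document, such as "wherein" in patents, score near 0.
    return {t for t, k in df.items() if -math.log2(k / n) < threshold}

docs = [
    "a method wherein the widget rotates",
    "a method wherein the gear meshes",
    "a method wherein the pump circulates fluid",
]
patent_stop_words = low_entropy_terms(docs)
```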
18. The method according to any preceding claim, further comprising generating a term vector (7) comprising keywords extracted from the plurality of first text documents.
19. The method according to the preceding claim having the features of claims 2 and 8, wherein the components of the document vectors (27) and of the query vector (47) are generated relative to the components of the term vector (7).
20. The method according to any preceding claim having the features of claims 2 and 8, wherein the similarity measure between the second text document data (31) and the first text document data (21) is calculated by computing the distance between the query vector (47) and the document vectors (27) using the cosine index.
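Claims 18 to 20 can be sketched together: a term vector of corpus keywords fixes the component order, document and query vectors are built relative to it, and the similarity measure is the cosine index between them. The keyword list and counting scheme are illustrative assumptions:

```python
# Sketch of claims 18-20: document vector (27) and query vector (47) built
# relative to the components of a shared term vector (7), compared by cosine.
import math

TERM_VECTOR = ["engine", "gear", "pump", "valve"]  # keywords from the corpus

def to_vector(words):
    # One component per term-vector keyword; value = occurrence count.
    return [float(words.count(t)) for t in TERM_VECTOR]

def cosine_index(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

doc_vec = to_vector("engine gear gear valve".split())   # document vector (27)
query_vec = to_vector("gear valve".split())             # query vector (47)
sim = cosine_index(query_vec, doc_vec)
```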
21. A computer-implemented method for processing similarity in text documents, comprising:
a) harmonizing at least one incoming query (41); and
b) normalizing the at least one incoming harmonized query (43); and
c) constructing at least one query vector (47) using the at least one normalized harmonized query (45); and
d) calculating at least one similarity measure between the at least one query vector (47) and at least one further text document, wherein the at least one further text document has undergone the preceding steps.
22. The method according to the preceding claim, wherein the text documents comprise at least one of technical texts, scientific texts, patent texts and/or product descriptions, or a combination thereof.
23. The method according to any one of the two preceding claims, wherein harmonizing comprises correcting typing errors, selecting a specific spelling convention and a physical unit convention and adjusting the text based on the specific spelling convention and the physical unit convention, and/or representing formulae (for example chemical formulae, gene sequences and/or protein representations) in a standard fashion.
24. The method according to any one of the preceding claims 21 to 23, wherein normalizing comprises identifying and deleting stop words, reducing words to common stems, resolving synonyms to stems, and/or identifying word sequences and compound words.
25. The method according to the preceding claim, wherein normalizing further comprises identifying and deleting stop words associated with a certain type of text document, preferably by calculating the entropy of the terms in a plurality of text documents of that type and deleting words having a low entropy.
26. The method according to any one of claims 21 to 25, wherein calculating the similarity measure comprises applying at least one of the following, or a combination thereof: cosine index, Jaccard index, Dice index, containment index, Pearson correlation coefficient, Levenshtein distance, Jaro-Winkler distance and/or the Needleman-Wunsch algorithm.
27. The method according to any one of claims 21 to 26, further comprising the following steps after step d):
f) validating the at least one similarity measure using at least one statistical algorithm; and
g) outputting the at least one similarity measure.
28. A computer-implemented system (10) according to any one of the preceding claims, comprising:
a) at least one memory component (20) adapted to store at least a database comprising the first text document data (21) associated with a plurality of first text documents;
b) at least one input device (40) adapted to receive a query (41), the query (41) comprising a second text document and/or information identifying a second text document, the second text document being associated with second text document data (31) contained in the first text document data (21) stored in the memory component (20); and
c) at least one processing component (30) adapted to convert the query (41) into the second text document data (31) and/or to retrieve from the at least one memory component (20) the second text document data (31) associated with the query (41), and to compare the second text document data (31) with the first text document data (21) stored in the at least one memory component (20);
d) at least one output device (50) adapted to return information identifying at least one similar first text document (51) associated with the first text document data (21), the similar first text document (51) being the one of the first text documents most similar to the query (41).
29. The system according to the preceding claim, wherein the first text document data (21) comprises a plurality of document vectors (27), and wherein the second text document data (31) comprises a query vector (47).
30. The system according to any one of the preceding claims 28 to 29, wherein the memory component (20) comprises first text document data (21) associated with scientific papers and/or technical specifications and/or patent documents and/or product descriptions.
31. The system according to any one of the preceding claims 28 to 30, wherein the second text document data (31) is obtained by harmonizing and normalizing the second text document and constructing the at least one query vector (47).
32. The system according to any one of the preceding claims 28 to 31, wherein the comparison between the first text document data (21) and the second text document data (31) generates a similarity index.
33. The system according to the preceding claim, wherein the output device (50) returns information associated with a plurality of first text documents, ordered by the similarity index from most similar to least similar, the most similar first text document being the one whose first text document data (21) yields the highest similarity index with the second text document data (31).
34. The system according to any one of the preceding claims 28 to 33, wherein the similarity index is based on lexical and/or semantic comparisons between the text documents.
35. The system according to any one of the preceding claims 28 to 34, wherein the processing component (30) identifies keywords while harmonizing and normalizing an incoming second text document.
36. The system according to any one of the preceding claims 28 to 35, wherein the processing component (30) assigns weights to keywords based on an entropy algorithm.
37. The system according to any one of the preceding claims 28 to 36, wherein the processing component (30) is adapted to parallelize computation by dividing the second text document into at least two parts, preferably into at least four parts.
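The parallelization of claim 37 can be sketched by splitting the incoming document into parts and processing them concurrently. The thread pool and the word-count placeholder stand in for the actual per-part computation, which the claim leaves open:

```python
# Sketch of claim 37: divide the second text document into (at least) four
# parts and process the parts in parallel.
from concurrent.futures import ThreadPoolExecutor

def split_into_parts(text, n_parts=4):
    words = text.split()
    size = -(-len(words) // n_parts)   # ceiling division -> n_parts chunks
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def process_part(part):
    # Placeholder per-part computation (here: count the words in the part).
    return len(part.split())

text = "one two three four five six seven eight"
parts = split_into_parts(text, n_parts=4)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_part, parts))
```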
38. The system according to any one of the preceding claims, wherein the processing component (30) comprises at least two cores, preferably at least four cores, more preferably at least eight cores.
39. The system according to any one of the preceding claims 28 to 38, wherein the processing component (30) is adapted to regularly update the first document data (21) stored in the memory component (20).
40. The system according to any one of the preceding claims 28 to 39, wherein the input device (40) is further adapted to allow specifying the query (41) by listing similar text documents and/or words and/or sentences that must or must not be included.
41. The system according to any one of the preceding claims 28 to 40, wherein the input device (40) is further adapted to allow specifying the query (41) by specifying the number of most similar text documents to be output.
42. The system according to any one of the preceding claims 28 to 41, wherein the memory component (20) comprises RAM (random access memory).
43. The system according to any one of the preceding claims 28 to 42, wherein the memory component (20) further comprises a term vector (7), the term vector comprising keywords extracted from the plurality of first text documents.
44. The system according to the preceding claim having the features of claim 29, wherein the processing component (30) is adapted to generate the components of the document vectors (27) and of the query vector (47) relative to the components of the term vector (7).
45. The system according to any one of the preceding claims 28 to 44 having the features of claim 29, wherein the processing component (30) is adapted to calculate the distance between the query vector (47) and the document vectors (27) by comparing the second text document data (31) with the first text document data (21) using the cosine index.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP16198539 | 2016-11-11 | ||
EP16198539.5 | 2016-11-11 | ||
PCT/EP2017/078674 WO2018087190A1 (en) | 2016-11-11 | 2017-11-08 | Apparatus and method for semantic search |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110023924A true CN110023924A (en) | 2019-07-16 |
Family
ID=57288265
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201780069862.1A Pending CN110023924A (en) | 2016-11-11 | 2017-11-08 | Device and method for semantic search |
Country Status (6)
Country | Link |
---|---|
US (1) | US20190347281A1 (en) |
EP (1) | EP3539018A1 (en) |
JP (1) | JP7089513B2 (en) |
CN (1) | CN110023924A (en) |
AU (1) | AU2017358691A1 (en) |
WO (1) | WO2018087190A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111710387A (en) * | 2020-04-30 | 2020-09-25 | 上海数创医疗科技有限公司 | Quality control method for electrocardiogram diagnosis report |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11762989B2 (en) | 2015-06-05 | 2023-09-19 | Bottomline Technologies Inc. | Securing electronic data by automatically destroying misdirected transmissions |
US20170163664A1 (en) | 2015-12-04 | 2017-06-08 | Bottomline Technologies (De) Inc. | Method to secure protected content on a mobile device |
US11163955B2 (en) | 2016-06-03 | 2021-11-02 | Bottomline Technologies, Inc. | Identifying non-exactly matching text |
US11416713B1 (en) | 2019-03-18 | 2022-08-16 | Bottomline Technologies, Inc. | Distributed predictive analytics data set |
US11030222B2 (en) * | 2019-04-09 | 2021-06-08 | Fair Isaac Corporation | Similarity sharding |
US11232267B2 (en) * | 2019-05-24 | 2022-01-25 | Tencent America LLC | Proximity information retrieval boost method for medical knowledge question answering systems |
US11042555B1 (en) | 2019-06-28 | 2021-06-22 | Bottomline Technologies, Inc. | Two step algorithm for non-exact matching of large datasets |
US11269841B1 (en) | 2019-10-17 | 2022-03-08 | Bottomline Technologies, Inc. | Method and apparatus for non-exact matching of addresses |
CN111339261A (en) * | 2020-03-17 | 2020-06-26 | 北京香侬慧语科技有限责任公司 | Document extraction method and system based on pre-training model |
US11526551B2 (en) * | 2020-04-10 | 2022-12-13 | Salesforce, Inc. | Search query generation based on audio processing |
US11449870B2 (en) | 2020-08-05 | 2022-09-20 | Bottomline Technologies Ltd. | Fraud detection rule optimization |
US11694276B1 (en) | 2021-08-27 | 2023-07-04 | Bottomline Technologies, Inc. | Process for automatically matching datasets |
US11544798B1 (en) | 2021-08-27 | 2023-01-03 | Bottomline Technologies, Inc. | Interactive animated user interface of a step-wise visual path of circles across a line for invoice management |
CN113987115A (en) * | 2021-09-26 | 2022-01-28 | 润联智慧科技(西安)有限公司 | Text similarity calculation method, device, equipment and storage medium |
CN113806491B (en) * | 2021-09-28 | 2024-06-25 | 上海航空工业(集团)有限公司 | Information processing method, device, equipment and medium |
US20230281396A1 (en) * | 2022-03-03 | 2023-09-07 | International Business Machines Corporation | Message mapping and combination for intent classification |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5974412A (en) * | 1997-09-24 | 1999-10-26 | Sapient Health Network | Intelligent query system for automatically indexing information in a database and automatically categorizing users |
JP2003157270A (en) * | 2001-11-22 | 2003-05-30 | Ntt Data Technology Corp | Method and system for retrieving patent literature |
US20030172058A1 (en) * | 2002-03-07 | 2003-09-11 | Fujitsu Limited | Document similarity calculation apparatus, clustering apparatus, and document extraction apparatus |
US7409383B1 (en) * | 2004-03-31 | 2008-08-05 | Google Inc. | Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems |
US20090190839A1 (en) * | 2008-01-29 | 2009-07-30 | Higgins Derrick C | System and method for handling the confounding effect of document length on vector-based similarity scores |
CN104765779A (en) * | 2015-03-20 | 2015-07-08 | 浙江大学 | Patent document inquiry extension method based on YAGO2s |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002063192A (en) * | 2000-08-22 | 2002-02-28 | Patolis Corp | Patent document system |
US7383258B2 (en) | 2002-10-03 | 2008-06-03 | Google, Inc. | Method and apparatus for characterizing documents based on clusters of related words |
JP4534666B2 (en) * | 2004-08-24 | 2010-09-01 | 富士ゼロックス株式会社 | Text sentence search device and text sentence search program |
US20110082839A1 (en) * | 2009-10-02 | 2011-04-07 | Foundationip, Llc | Generating intellectual property intelligence using a patent search engine |
JP5578137B2 (en) * | 2011-05-25 | 2014-08-27 | 富士通株式会社 | Search program, apparatus and method |
US8935230B2 (en) | 2011-08-25 | 2015-01-13 | Sap Se | Self-learning semantic search engine |
US20140280088A1 (en) | 2013-03-15 | 2014-09-18 | Luminoso Technologies, Inc. | Combined term and vector proximity text search |
- 2017
- 2017-11-08 CN CN201780069862.1A patent/CN110023924A/en active Pending
- 2017-11-08 WO PCT/EP2017/078674 patent/WO2018087190A1/en unknown
- 2017-11-08 US US16/348,825 patent/US20190347281A1/en not_active Abandoned
- 2017-11-08 AU AU2017358691A patent/AU2017358691A1/en not_active Abandoned
- 2017-11-08 EP EP17798181.8A patent/EP3539018A1/en not_active Ceased
- 2017-11-08 JP JP2019525873A patent/JP7089513B2/en active Active
Non-Patent Citations (1)
Title |
---|
MILOS RADOVANOVIC ET AL: "On the Existence of Obstinate Results in Vector Space Models", 《PROCEEDINGS OF THE 33RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL》 * |
Also Published As
Publication number | Publication date |
---|---|
JP7089513B2 (en) | 2022-06-22 |
JP2020500371A (en) | 2020-01-09 |
AU2017358691A1 (en) | 2019-05-23 |
EP3539018A1 (en) | 2019-09-18 |
US20190347281A1 (en) | 2019-11-14 |
WO2018087190A1 (en) | 2018-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110023924A (en) | Device and method for semantic search | |
Bhagavatula et al. | Content-based citation recommendation | |
US11900064B2 (en) | Neural network-based semantic information retrieval | |
CA2523128C (en) | Information retrieval and text mining using distributed latent semantic indexing | |
Wang et al. | Targeted disambiguation of ad-hoc, homogeneous sets of named entities | |
US20160283564A1 (en) | Predictive visual search enginge | |
CN108875065B (en) | Indonesia news webpage recommendation method based on content | |
Thanda et al. | A Document Retrieval System for Math Queries. | |
Deng et al. | A distributed PDP model based on spectral clustering for improving evaluation performance | |
Peng et al. | Hierarchical visual-textual knowledge distillation for life-long correlation learning | |
Zoupanos et al. | Efficient comparison of sentence embeddings | |
CN111143400A (en) | Full-stack type retrieval method, system, engine and electronic equipment | |
Rao et al. | An efficient semantic ranked keyword search of big data using map reduce | |
Prajapati et al. | Extreme multi-label learning: a large scale classification approach in machine learning | |
Wang | A semi-supervised learning approach for ontology matching | |
Laddha et al. | Novel concept of query-similarity and meta-processor for semantic search | |
Brázdil | Dimensionality reduction methods for vector spaces | |
Gisolf et al. | Search and Explore Strategies for Interactive Analysis of Real-Life Image Collections with Unknown and Unique Categories | |
Huybrechts et al. | Learning to rank with deep neural networks | |
Moraes et al. | Design principles and a software reference architecture for big data question answering systems | |
Abbasi et al. | Introducing triple play for improved resource retrieval in collaborative tagging systems | |
Premjith et al. | Metaheuristic Optimization Using Sentence Level Semantics for Extractive Document Summarization | |
Zhang et al. | A Content-Based Dataset Recommendation System for Biomedical Datasets | |
Sudha et al. | Efficient diversity aware retrieval system for handling medical queries | |
Elshater et al. | Web service discovery for large scale iot deployments |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190716 |