CN110023924A - Device and method for semantic search - Google Patents
- Publication number
- CN110023924A CN110023924A CN201780069862.1A CN201780069862A CN110023924A CN 110023924 A CN110023924 A CN 110023924A CN 201780069862 A CN201780069862 A CN 201780069862A CN 110023924 A CN110023924 A CN 110023924A
- Authority
- CN
- China
- Prior art keywords
- text document
- text
- query
- document data
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
A computer-implemented method for comparing text documents is disclosed. The method comprises building a database containing first text document data associated with a plurality of first text documents. The method further comprises receiving a query, converting the query into second text document data, comparing the second text document data with the first text document data, and calculating at least one similarity measurement between the second text document data and the first text document data. Also disclosed is a computer-implemented method for processing similarity in text documents. That method comprises harmonizing at least one incoming query, normalizing the at least one harmonized incoming query, building at least one query vector using the at least one normalized harmonized query, and calculating at least one similarity measurement between the at least one query vector and at least one other text document, wherein the at least one other text document has undergone the preceding steps. A computer-implemented system is also disclosed. The system comprises at least one memory component adapted to store at least a database containing a plurality of first text document data associated with first text documents. The system can also comprise at least one input device adapted to receive a query, the query comprising a second text document and/or information identifying a second text document. The second text document is associated with second text document data comprised in the first text document data stored in the memory component. The system further comprises at least one processing component adapted to convert the query into second text document data and/or to retrieve second text document data associated with the query from the database stored in the at least one memory component. The processing component is further adapted to compare the second text document data with the first text document data stored in the at least one memory component. The system also comprises at least one output device adapted to return information identifying at least one similar first text document associated with the first text document data, the similar first text document being, among the first text documents, the most similar to the query.
Description
Technical field
The present invention relates to the fields of data analysis and data conversion, and in particular to semantic search. More precisely, the invention describes a search engine suitable for semantically comparing text documents.
Background technique
Since the advent of large file repositories, particularly on the internet, searching for similar information in repositories or databases holding vast amounts of data has become one of the hardest problems to solve. One solution is the brute-force approach of searching all available documents for keywords explicitly defined by the user. This approach is efficient in terms of processing power, but it has some limitations: depending on the topic at hand, the same keyword can refer to very different things, and synonyms or similar expressions force the search to be repeated many times to obtain all relevant findings.
In a more specific example of prior-art searching, searches for similar patents are usually performed through the IPC (International Patent Classification), through the CPC (Cooperative Patent Classification), or through the citations listed for each patent. This approach is likely to yield some relevant findings, but it may miss more recent (and not yet cited) similar information, or return too many less relevant findings (in the case of searching by IPC or CPC class).
A more careful way of assessing document similarity can be achieved by semantic search. Such a search takes into account synonyms, expressions consisting of more than one word, and the specific technical terminology of a given field, and combines all of them to carry out a more accurate similarity comparison. It can be performed by defining a multidimensional vector space whose vectors represent the different terms or texts used, and carrying out the similarity comparison directly in that vector space.
United States Patent 8,688,720 discloses a system for characterizing documents by clusters of conceptually related words. Upon receiving a document comprising a set of words, the system selects "candidate clusters" of conceptually related words that relate to the set of words. These candidate clusters are selected using a model that explains how the set of words might have been generated from clusters of conceptually related words. The system then constructs a set of components to characterize the document, where the set of components includes components for the candidate clusters, each component indicating the degree to which the corresponding candidate cluster relates to the set of words.
United States Patent 8,935,230 discloses a method, a machine-readable storage medium and a system for providing a self-learning semantic search engine. A semantic network can be built with an initial configuration. A search engine coupled to the semantic network can build an index and a semantic index. A user request for business data can be received, and the search engine can be accessed via a semantic scheduler. Based on that access, the search engine can update the index and the semantic index.
United States Patent Application 2014/280088 describes a system, and related methods, for searching a data set consisting of a collection of documents, a set of terms, and a vector associated with each term and each document. The methods involve converting a search query into vectors in the expanded vector space of terms and documents, and merging approximate vector matching search with term search to produce a result set, which can be ranked according to various measures of query relevance.
Summary of the invention
The present invention is set out in the claims and in the following description. Preferred embodiments are specifically set out in the dependent claims and in the description of the various embodiments. The features and accompanying details of the invention described above are further illustrated in the following examples, which are intended to further demonstrate the invention and are not intended to limit its scope in any way.
In view of the known art, it is therefore an object of the present invention to disclose a method and a device for performing semantic search using at least some of the following elements:
1) implementing (and in particular training) part-of-speech tagging specially designed for technical language, cleaning text, deleting stop-words, reducing words to stems and phrases, correcting misspellings, harmonizing language style, resolving synonyms, cleaning OCR (optical character recognition) errors, performing multi-component weighting, and using different similarity indices in different ways;
2) integrating the analyses and hypotheses of lexical and semantic algorithms;
3) considering and implementing different text-related information and different algorithms simultaneously;
4) analyzing texts across all technical fields;
5) implementing a connection between text similarity measurements and bibliographic features;
6) integrating text-based and bibliography-based methods of similarity determination.
In this document, the words "keyword", "term" and "semantic unit" may be used interchangeably. Furthermore, the words "keyword" or "term" can refer to an expression rather than a single word.
In a first embodiment, the invention discloses a computer-implemented method for comparing text documents. The method comprises building a database containing first text document data associated with a plurality of first text documents. The method further comprises receiving a query. The method also comprises converting the query into second text document data. The method further comprises comparing the second text document data with the first text document data and calculating at least one similarity measurement between the second text document data and the first text document data. This similarity measurement may, for example, comprise a similarity index. It can advantageously provide a quantifiable way of comparing text documents with one another.
It should be noted that the query may comprise the second text document, in which case this document can be converted into the second text document data. However, the query can also merely identify a second text document that has already been received into the database as part of the first text document data. In that case, the second text document data already exists and should simply be retrieved from the database and compared with the other data contained in the database.
This method allows text documents to be converted, in an efficient and reliable way, into data that can be analyzed and quantitatively compared with other data. Preferably, the conversion and comparison can be performed by a computing device, ideally in a parallelized manner. The described method can be implemented on a server accessible through a user interface. It can be used to allow users to identify similar text documents for various purposes.
In some preferred embodiments, the first text document data comprises document vectors generated from the keywords contained in the first text documents and/or from words semantically related to those keywords. That is, each first text document can be associated with a document vector stored in the database.
The database may or may not comprise the first text documents themselves. Storing only the document vectors associated with the first text documents advantageously saves storage space in the database. Conversely, it can be advantageous to also store the first text documents so that they can be retrieved easily and quickly, for example in response to a query.
For example, words semantically related to the keywords may comprise synonyms, hypernyms and/or hyponyms. External databases can be used to correctly identify semantically related words. These can be general and/or subject-specific.
In some embodiments, the query may comprise the second text document. Additionally or alternatively, the query may comprise information identifying a second text document associated with second text document data already stored in the memory component. In the second case, the second text document data associated with that second text document can simply be retrieved from the database and then compared with the remaining first text document data in the database. Note that in this case the second text document data may be comprised in the first text document data and matched in a different way to avoid confusion.
In some embodiments, converting the query into the second text document data may comprise harmonizing the query. In some preferred embodiments, harmonization may comprise correcting typing errors, selecting specific spelling and physical-unit conventions and adjusting the text accordingly, and/or expressing certain content in a standard fashion (for example chemical formulas, gene sequences and/or protein representations). This can advantageously allow more reliable comparisons between text documents that relate to the same subject but use different conventions or different units.
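Purely as an illustration of the harmonization step described above, the following sketch maps spelling variants and unit notations onto a single convention. The variant tables (`SPELLING`, `UNITS`) are hypothetical toy examples, not part of the disclosure; a real system would use far larger resources.

```python
import re

# Hypothetical harmonization tables; real systems would use much larger ones.
SPELLING = {"colour": "color", "normalise": "normalize", "fibre": "fiber"}
UNITS = {"millimetre": "mm", "millimeter": "mm",
         "centimetre": "cm", "centimeter": "cm"}

def harmonize(text: str) -> str:
    """Apply a single spelling convention and a single unit notation."""
    tokens = re.findall(r"[a-zA-Z]+|\d+|\S", text.lower())
    out = []
    for t in tokens:
        t = SPELLING.get(t, t)   # harmonize spelling variants
        t = UNITS.get(t, t)      # harmonize physical-unit notation
        out.append(t)
    return " ".join(out)

print(harmonize("The fibre is 10 millimetre long"))
# -> the fiber is 10 mm long
```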
In some embodiments, converting the query into the second text document data may comprise normalizing the query. In some preferred embodiments, normalization comprises identifying and deleting stop-words, reducing words to common stems, analyzing synonym stems, and/or identifying term sequences and compound words.
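As a minimal sketch of this normalization step, the following combines stop-word removal with a deliberately naive suffix stemmer. Both the stop-word list and the stemmer are toy stand-ins for the real resources the disclosure assumes.

```python
# Toy stop-word list; a real system would use a curated one.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def stem(word: str) -> str:
    """Very naive stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(tokens):
    """Remove stop-words and reduce the remaining words to stems."""
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(normalize(["the", "engines", "are", "rotating", "shafts"]))
# -> ['engin', 'rotat', 'shaft']
```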
In some embodiments, normalizing the query may comprise retrieving at least synonyms, hypernyms, hyponyms, stop-words and/or subject-specific stop-words from an external database, and generating a keyword list of the query based at least in part on these terms. There may be one or more external databases separated by subject. This can be advantageous because a word can carry different meanings depending on the subject. For example, an expression such as "delivery system" can have a completely different meaning depending on whether it is used in the context of logistics or of pharmaceuticals.
Accordingly, the corresponding synonyms, hypernyms, hyponyms and/or other semantically related words may also differ depending on the technical area in question. As another example, consider an embodiment in which the present invention is used as part of a semantic search tool for prior art, in particular for patent documents. Patent applications and grants contain very specific wording that recurs across documents on entirely different subjects. Words such as "claim", "comprising", "device" or "embodiment" can be considered patent-specific stop-words and can be deleted from the query. In embodiments where the database comprises patent documents, these specific stop-words can also be deleted from all first text documents while converting them into the first text document data (that is, while building or creating the database). In some embodiments, the keyword list of the query can be generated by deleting stop-words and/or subject-specific stop-words and by including at least one of the synonyms, hypernyms and hyponyms of the words comprised in the query.
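The keyword-list generation just described could be sketched as follows: general and subject-specific stop-words are dropped, and the surviving words are expanded with semantically related words. The synonym table here mocks the external database the disclosure assumes; all entries are hypothetical.

```python
GENERAL_STOPWORDS = {"the", "a", "of"}
PATENT_STOPWORDS = {"claim", "comprising", "device", "embodiment"}
SYNONYMS = {"vehicle": ["car", "automobile"]}  # stand-in for an external DB

def query_keywords(tokens, topic_stopwords):
    """Drop general and topic-specific stop-words, then expand the
    remaining words with their semantically related words."""
    keywords = []
    for t in tokens:
        if t in GENERAL_STOPWORDS or t in topic_stopwords:
            continue
        keywords.append(t)
        keywords.extend(SYNONYMS.get(t, []))
    return keywords

print(query_keywords(["a", "device", "comprising", "a", "vehicle", "engine"],
                     PATENT_STOPWORDS))
# -> ['vehicle', 'car', 'automobile', 'engine']
```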
In some embodiments, converting the query into the second text document data may comprise generating at least one query vector. For example, the query vector may comprise information about the keywords of the query. That is, the components of the query vector can correspond to the keywords of the query and/or their semantically related words (e.g. synonyms). Note that in this document "keyword" can refer to the actual words comprised in the query and/or their semantically related words (e.g. synonyms, hypernyms and/or hyponyms). In some such embodiments, the query vector can be generated by identifying the keywords and/or keyword synonyms in the query and identifying those keywords with components of vectors in a multidimensional vector space. In some embodiments, the query vector may comprise 100 to 500 components, preferably 200 to 400 components, even more preferably 200 to 300 components. That is, in some such embodiments, not every keyword and associated semantically related word is associated with a component of the query vector. For example, this may mean that keywords are first evaluated and weighted based on different parameters, and the keywords with low weights are then discarded. This can be particularly advantageous, because reducing the number of keywords contributing to the query vector can significantly reduce the computing power needed to manipulate the query vector, for example when comparing it with document vectors. Note that the document vectors can similarly comprise 100 to 500 components, preferably 200 to 400 components, even more preferably 200 to 300 components. The first document data comprised in the database and associated with the first text documents (in some embodiments comprising the document vectors) can be generated similarly to the query vector: by identifying keywords or semantic units and, based on an entropy associated with them, reducing their number to 100 or a few hundred per first text document.
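A sketch of the truncation idea above: each keyword is scored, low-weight keywords are discarded, and only the top-scoring ones become vector components. The weighting scheme here (raw counts times a per-term weight) is an assumption for illustration; the disclosure leaves the exact parameters open.

```python
from collections import Counter

def query_vector(keywords, weights, max_components=5):
    """Score each keyword, drop the low-weight ones, and keep the top
    `max_components` as the vector components (term -> weight).
    `weights` stands in for a topic-dependent weighting scheme."""
    counts = Counter(keywords)
    scored = {k: c * weights.get(k, 1.0) for k, c in counts.items()}
    top = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return dict(top[:max_components])

weights = {"semantic": 3.0, "search": 2.0}  # hypothetical weights
vec = query_vector(["semantic", "search", "semantic", "method", "the"],
                   weights, max_components=3)
print(vec)  # -> {'semantic': 6.0, 'search': 2.0, 'method': 1.0}
```

In a real implementation `max_components` would be in the 100-500 range described above; it is set to 3 here only to keep the example readable.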
In some preferred embodiments, weights can be assigned to the keywords. In such embodiments, the weights can be assigned based at least partly on the general subject of the query. That is, the same term, keyword and/or semantic unit can be assigned different weights depending on the context or subject of the text document. For example, the term "frequency" can be weighted differently depending on whether the query concerns telecommunications, where it may refer to the frequency of a wave, or pharmaceuticals, where it may refer to how often something occurs. In embodiments where the first text document data comprises document vectors, this can also apply to the document vectors associated with the first text documents. That is, the keywords, terms and/or semantic units comprised in the first text documents, and the semantically related words comprised among these, can be assigned different weights based on the subject. This is particularly advantageous because it enables more meaningful comparisons between the first text documents and the query. Note that the technical area to which a given text document belongs can be determined in several ways. If the documents in question comprise patent documents, their classification can be used; that is, the IPC and/or CPC classes of a given document can be used to assign it to a certain technical area. Another way is to identify terms, keywords and/or semantic units that are particularly common in a certain subject or area (an external database can be used for this purpose), and then assign the text document to a technical area based on the presence of these subject-specific terms.
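The subject-dependent weighting above could be realized with a simple lookup keyed by the technical area, as in the following sketch. The area labels and weight values are hypothetical illustrations of the "frequency" example, not values from the disclosure.

```python
# Hypothetical per-topic term weights illustrating subject-dependent weighting.
TOPIC_WEIGHTS = {
    "telecom": {"frequency": 3.0, "antenna": 2.5},
    "pharma":  {"frequency": 0.5, "dosage": 3.0},
}

def weight(term: str, topic: str) -> float:
    """Return the weight of a term for a given technical area,
    defaulting to a neutral weight of 1.0."""
    return TOPIC_WEIGHTS.get(topic, {}).get(term, 1.0)

print(weight("frequency", "telecom"))  # -> 3.0 (refers to a wave frequency)
print(weight("frequency", "pharma"))   # -> 0.5 (refers to how often)
```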
In some embodiments, calculating the similarity measurement comprises using at least one, or a combination, of a cosine index, a Jaccard index, a Dice index, a containment index, Pearson correlation coefficients, Levenshtein distance, Jaro-Winkler distance and/or the Needleman-Wunsch algorithm. That is, in embodiments where the first text document data notably comprises document vectors and the second text document data comprises a query vector, the two can be compared by calculating the distance between them in the multidimensional vector space. This can be done using several different distance definitions, and different distance definitions can serve different purposes.
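Three of the similarity measures named above can be sketched compactly; cosine operates on weighted vectors, while Jaccard and Dice operate on keyword sets. The sample vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts term -> weight)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def jaccard(a, b):
    """Jaccard index: intersection over union of two keyword sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    """Dice index: twice the intersection over the summed set sizes."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

u = {"semantic": 1.0, "search": 1.0}
v = {"semantic": 1.0, "index": 1.0}
print(round(cosine(u, v), 3))     # -> 0.5
print(round(jaccard(u, v), 3))    # -> 0.333
print(dice(u, v))                 # -> 0.5
```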
In some preferred embodiments, the method for comparing text documents further comprises verifying the at least one similarity measurement using at least one statistical algorithm. The method may further comprise outputting the at least one similarity measurement. Consider again the example of comparing patent documents. Patent applications and/or grants generally include references to other similar information. These references are usually cited in the document itself or later supplied by an examiner. Since the references serve as prior art, this may imply that they are very similar to the document. In this way, the similarity measurement between the query and a certain first text document can be tested by verifying the similarity measurements between the query and the references given in that first text document. If the similarity measurement is reliable, the verification can be expected to yield similar similarity measurements between the query and the references.
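One plausible reading of this verification step, sketched under stated assumptions: if a first text document cites references, the query's score against the document should be consistent with its scores against those references. The tolerance threshold and all scores below are hypothetical inputs, not values from the disclosure.

```python
def verify(sim_query_doc, sims_query_refs, tolerance=0.2):
    """Flag a similarity measurement as plausible if the query-document
    score is within `tolerance` of the mean query-reference score."""
    if not sims_query_refs:
        return True  # nothing to verify against
    mean_ref = sum(sims_query_refs) / len(sims_query_refs)
    return abs(sim_query_doc - mean_ref) <= tolerance

print(verify(0.8, [0.7, 0.75, 0.85]))  # -> True  (consistent with references)
print(verify(0.8, [0.1, 0.2]))         # -> False (inconsistent: re-examine)
```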
In some embodiments, the query can be received from a user interface, and the similarity measurement can be returned via the same interface. This interface may comprise an application, a program and/or a browser-based interface. That is, the method can be implemented as part of a program enabling users to compare the similarity of different text documents quantitatively and reliably.
In some embodiments, the database comprises text documents related to patent documents, and building the database and/or converting the query comprises deleting stop-words associated with patent-related text documents. As described above, these patent-specific stop-words may include words such as "claim", "device", "embodiment", "comprising" and similar words. In some embodiments, the patent-related stop-words can be deleted by calculating an entropy associated with the terms comprised in the first text document data and/or in the query, and deleting the terms with low entropy. This is discussed further below.
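The disclosure does not fix the exact entropy definition; one reading consistent with "deleting the terms with low entropy" is self-information based on document frequency, where a term occurring in nearly every document carries little information and behaves like a collection-specific stop-word. The following sketch and its sample documents are illustrative assumptions.

```python
import math

def term_information(term, documents):
    """Self-information of a term: log2(N / document frequency).
    Terms present in almost every document score near zero ("low
    entropy" in the disclosure's wording) and can be treated as
    collection-specific stop-words."""
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        return 0.0
    return math.log2(len(documents) / df)

docs = [{"claim", "engine"}, {"claim", "pump"}, {"claim", "engine", "valve"}]
print(term_information("claim", docs))             # -> 0.0 (ubiquitous: delete)
print(round(term_information("valve", docs), 3))   # -> 1.585 (informative: keep)
```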
In some preferred embodiments, the method may further comprise generating a term vector comprising keywords extracted from the plurality of first text documents. That is, the term vector can be generated based on the first text document data comprised in the database and associated with the first text documents. The term vector can be based on all the keywords, terms and/or semantic units comprised in all the first text documents. In embodiments where the first text document data comprises document vectors and the second text document data comprises a query vector, the components of the document vectors and of the query vector can be generated relative to the components of the term vector. That is, the term vector can provide the common ground against which the query and the first text documents are compared. In other words, the term vector can define the multidimensional vector space relative to which the comparison is carried out. This is particularly advantageous because it allows quantitative mathematical comparisons between different text documents.
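The shared term vector idea can be sketched as follows: one ordered list of terms, extracted from the whole collection, fixes the axes against which every document vector and query vector is expressed. The term list and counts below are hypothetical.

```python
# Hypothetical term vector extracted from the whole collection: it fixes
# the axes of the multidimensional vector space shared by all documents.
TERM_VECTOR = ["engine", "pump", "valve", "semantic"]

def to_components(keyword_counts):
    """Project a document's keyword counts onto the shared term vector."""
    return [float(keyword_counts.get(t, 0)) for t in TERM_VECTOR]

doc = to_components({"engine": 2, "valve": 1})
query = to_components({"engine": 1, "semantic": 3})
print(doc)    # -> [2.0, 0.0, 1.0, 0.0]
print(query)  # -> [1.0, 0.0, 0.0, 3.0]
```

Because both vectors live in the same space, any of the distance definitions discussed above can be applied to them directly.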
In some embodiments, the similarity measurement between the second text document data and the first text document data can be calculated by using the cosine index to compute the distance between the query vector and a document vector. As described above, the cosine index can be used to calculate distances in the multidimensional vector space. This is particularly advantageous because it can be reduced to the inner product of two vectors. Since this operation is easy to perform, it can substantially reduce the computation time of the comparison.
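The reduction to an inner product can be made concrete: if document vectors are normalized to unit length once, at database-build time, then each query-time comparison collapses to a single dot product. The sample vectors are illustrative.

```python
import math

def unit(v):
    """Normalize a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# With pre-normalized vectors, cosine similarity IS the inner product,
# which is why each query-time comparison is cheap.
doc = unit([1.0, 2.0, 2.0])
query = unit([2.0, 1.0, 2.0])
print(round(dot(doc, query), 4))  # -> 0.8889
```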
In a second embodiment, the invention discloses a computer-implemented method for processing similarity in text documents. The method comprises harmonizing at least one incoming query. It further comprises normalizing the at least one harmonized incoming query. The method also comprises building at least one query vector using the at least one normalized harmonized query. The method further comprises calculating at least one similarity measurement between the at least one query vector and at least one other text document, wherein the at least one other text document has undergone the preceding steps.
Note that the other text document can also be referred to as a first text document. Having undergone the preceding steps can mean that the other text document, or first text document, has been harmonized and normalized and that a document vector has been built for it.
This method advantageously allows an arbitrary query consisting of text to be converted into data that can be quantitatively compared with other data, so that the similarity of the query to that other data can be assessed. Preferably, this is performed by a computing device that holds data associated with various text documents in its memory and can retrieve that data in order to compare it with the incoming query. The text of the query can then be analyzed by algorithms implemented on the computing device using various techniques.
In some preferred embodiments, the text documents may comprise at least one, or a combination, of technical texts, scientific texts, patent texts and/or product descriptions.
In some embodiments, harmonization may comprise correcting typing errors, selecting specific spelling and physical-unit conventions and adjusting the text accordingly, and/or expressing certain content in a standard fashion (for example chemical formulas, gene sequences and/or protein representations).
In some embodiments, normalization may comprise identifying and deleting stop-words, reducing words to common stems, analyzing synonym stems, and/or identifying term sequences and compound words. In such embodiments, normalization can further comprise identifying and deleting stop-words associated with a certain type of text document, preferably by calculating the entropy of keywords across a plurality of text documents of that type and deleting the keywords with low entropy.
In some embodiments, calculating the similarity measurement may comprise using at least one, or a combination, of a cosine index, a Jaccard index, a Dice index, a containment index, Pearson correlation coefficients, Levenshtein distance, Jaro-Winkler distance and/or the Needleman-Wunsch algorithm. Such algorithms allow quantitative comparisons between text documents based on the distance, in the multidimensional vector space, between the data generated from the text documents.
In some embodiments, the method may further comprise verifying the at least one similarity measurement using at least one statistical algorithm. It may further comprise outputting the at least one similarity measurement.
Note that the first and second embodiments can be complementary. That is, embodiments presented as part of the first embodiment can form part of the second embodiment, and vice versa.
In a third embodiment, the invention discloses a computer-implemented system. The system comprises at least one memory component adapted to store at least a database comprising a plurality of first text document data associated with first text documents. The system can also comprise at least one input device adapted to receive a query. The query comprises a second text document and/or information identifying a second text document. The second text document is associated with second text document data comprised in the first text document data stored in the memory component. The system further comprises at least one processing component adapted to convert the query into second text document data and/or to retrieve second text document data associated with the query from the database stored in the at least one memory component. The processing component is further adapted to compare the second text document data with the first text document data stored in the at least one memory component. The system also comprises at least one output device adapted to return information identifying at least one similar first text document associated with the first text document data, the similar first text document being, among the first text documents, the most similar to the query.
Note that the query preferably takes one of two forms. In the first form, the query may comprise the second text document, in which case this second text document can be converted appropriately and associated with the second text document data. In the second form, the query may comprise a reference to a second text document already received into the database. For example, if the database comprises patent documents, the query may comprise a patent application number or grant number identifying a specific second text document. This can serve as the "information identifying the second text document". In the first case, the second text document data then comprises data associated with the second text document comprised in the query. In the second case, the second text document data can be retrieved from the database based on the identifying information of the query, and the second text document data can be comprised in the first text document data.
In other words, the system described herein is configured to receive any text-based query as input via the input device, verify whether the query can be associated with text document data already stored in memory, and then either retrieve that data or convert the query into such data. The system is further configured to compare the query with the other documents stored in memory. The comparison can be implemented by the processing component via different algorithms. The system can also output the comparison result via the output device, in the form of the text documents most closely associated with the query. The comparison itself can be performed at the level of the converted data (which, as outlined above and below, may comprise points in the multidimensional vector space), while the input and output can comprise the actual text documents or their identifiers (such as the title of a paper, or a patent number).
In some embodiments, the first text document data may comprise a plurality of document vectors and the second text document data may comprise a query vector. Note that, referring again to the two possible forms of the query, the query vector can either be generated from the text of the second text document comprised in the query, or be retrieved from the database. In the latter case, since the query vector is already stored in the database, it can be one of the document vectors. For clarity and consistency, the term "query vector" is used here for both cases. In preferred embodiments, each first text document can be associated with a document vector storable in the database. The database can store the first text documents together with the respective document vectors, or only the document vectors.
In some embodiments, the memory component may comprise first text document data associated with scientific papers and/or technical specifications and/or patent documents and/or product descriptions. In other words, the first text documents may comprise patent documents, scientific papers, and/or technical specifications. Preferably, the database may comprise at least first text document data related to patent documents.
In some embodiments, the second text document data can be obtained by coordinating and normalizing the second text document and creating at least one query vector. Coordination and normalization are described in more detail above and below.
In some embodiments, comparing the first text document data with the second text document data can yield a similarity index. In some such embodiments, the output component can return information associated with a plurality of first text documents, ordered by similarity index from most similar to least similar, starting with the first text document whose first text document data yields the highest similarity index with the second text document data. That is, the system may be adapted to output a list comprising a certain number of first text documents most similar to the query. This is particularly advantageous in the case where the first text documents comprise patent documents, as a method of performing a prior-art search. It should be noted that the output first text documents can be stored in the database, and/or be output as information identifying them (such as patent application or grant numbers), and/or be output as links to an external database where the documents are accessible. Furthermore, it can also be advantageous to output certain parts of the most similar first text documents. For example, one of the title and/or the abstract and/or the figures can be output.
In some embodiments, the similarity index can be based on a lexical and/or semantic comparison between the text documents. That is, the similarity index can quantitatively indicate the similarity between texts. This can, for example, refer to the number of keywords and/or semantic units present in both the query and a first text document. It should be noted that the similarity index can be obtained, for example, by calculating the distance between vectors in a vector space. However, the vectors themselves can be obtained based on lexical and/or semantic parameters. Therefore, the similarity index can also be considered to be based on those parameters.
In some embodiments, the processing component can identify keywords during the coordination and normalization of the incoming second text document. Keywords may comprise words significantly related to the content of the text document. Keywords may comprise word stems (obtained as part of normalization), compound words, and/or strings of semantically connected words. Keywords may also comprise words that are not actually present in the text document but are synonyms of, or otherwise semantically linked to, words comprised in the text document.
In some embodiments, the processing component can assign weights to keywords based on an entropy algorithm. That is, some keywords can be ranked higher due to the frequency with which they occur in the literature and/or their relevance in a particular technical field. In this case, the weights assigned to the keywords can be used when comparing the first text document data with the second text document data. That is, compared with keywords having lower weights, keywords with higher weights can contribute more to the similarity and/or similarity index between documents. This is particularly advantageous since the frequency and the specific meaning of a word within its context are taken into account when determining the similarity between texts. This can result in a more robust comparison metric.
In some embodiments, the processing component may be adapted to divide the second text document into at least two parts, preferably at least four parts, for parallelized computation. This is advantageous since it allows the processing speed to be improved and is therefore more efficient.
In some embodiments, the processing component may comprise at least two, preferably at least four, more preferably at least eight cores. This can further improve the speed at which queries can be processed.
In some embodiments, the processing component may be adapted to regularly update the first document data stored in the memory component. That is, the database can be updated with new first text documents.
In some embodiments, the input component can be further adapted to allow specifying the query by listing words and/or sentences that must and/or must not be comprised in the similar text documents. In other words, considering again the example of a prior-art search, it can be particularly useful to specify that certain words or statements must be comprised in the text documents similar to the query. Additionally or alternatively, it can be very useful to specify words that must not be comprised in the similar text documents.
In some embodiments, the input component can be further adapted to allow specifying the query by indicating the number of most similar text documents to be output.
In some embodiments, the memory component may comprise RAM (random access memory). This is discussed further in conjunction with Fig. 1.
In some embodiments, the memory component may further comprise a generated term vector, the term vector comprising keywords extracted from the plurality of first text documents. The term vector is described above in conjunction with the first embodiment. In some such embodiments, the processing component may be adapted to generate the components of the document vectors and of the query vector relative to the components of the term vector. In some such embodiments, wherein the first text document data comprises document vectors and the second text document data comprises a query vector, the processing component may be adapted to compare the second text document data with the first text document data using the cosine index to calculate the distance between the query vector and the document vectors.
In the following, one embodiment of the invention is discussed in more detail. In particular, it illustrates the concept of entropy as it can be used in the context of the invention, and gives one way of quantifying the similarity between different texts.

The entropy E(t) can be used to remove patent-specific stop-words, that is, words such as "claim", "device", "invention", "comprising" or other similar words. The following expression can be used:

In the above expression, n refers to the total number of patents and/or documents, i and j are indices referring to the patents and/or documents, f_it denotes the frequency of the term t in patent and/or document i, and the sum over f_jt refers to the frequency of the term t in all patents and/or documents. The value of E(t) falls between zero and one. Terms that are distributed very specifically and unevenly between the documents can be weighted with a high entropy. The higher the entropy, the more information the term can convey. Patent-specific stop-word lists can be calculated separately for the abstract, the claims, the title, the description, and for the combination of all of them. Since the claims of a patent are drafted very differently from, for example, the description, this distinction is important.
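The expression for E(t) itself is not reproduced in this text. A common entropy-weighting form consistent with the surrounding description (a value between zero and one that is high for terms concentrated in few documents) is E(t) = 1 + (1/log n) * sum_i p_it log p_it, with p_it = f_it / sum_j f_jt. The sketch below uses that form as an assumption; the exact expression of the patent may differ:

```python
import math

def entropy_weight(freqs):
    """E(t) = 1 + (1/log n) * sum_i p_it * log(p_it), with p_it = f_it / sum_j f_jt.
    freqs[i] is the frequency f_it of term t in document i.
    Returns a value in [0, 1]: 1 when the term appears in a single document,
    0 when it is spread evenly over all n documents."""
    n = len(freqs)
    total = sum(freqs)
    if total == 0 or n < 2:
        return 0.0
    acc = 0.0
    for f in freqs:
        if f > 0:
            p = f / total
            acc += p * math.log(p)
    return 1.0 + acc / math.log(n)

# A term spread evenly over 4 documents carries little information ...
print(round(entropy_weight([5, 5, 5, 5]), 6))   # -> 0.0
# ... while a term concentrated in one document carries the most.
print(round(entropy_weight([20, 0, 0, 0]), 6))  # -> 1.0
```

Under this form, very common words ("claim", "device") spread over nearly all patent documents score close to zero and can be discarded as stop-words.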
After the keywords have been identified by deleting the various stop-words and blocked words, they can be implemented in a vector space model. Documents may then be represented as objects in a multidimensional space. The dimensions can be characterised by the keywords or terms. In this way, each document can be described as a point and/or vector in the multidimensional space. The value of each component of this point can indicate the number of times the particular keyword or term is encountered in the document. In this way, a term vector T can be created which comprises, exactly once, all terms or keywords of all considered documents:

T = (t_1, t_2, ..., t_m)
That is, in total m terms or keywords may be comprised in all considered first text documents. Based on this vector, a term-document matrix (TDM) can be generated. The TDM can comprise, for each of the n documents and/or patents, a row vector indicating the weights of each component of the term vector T:

This means that a document i can be described by a numerical weight vector d_i, which can be referred to as the document vector. The document vector can relate to the weights as follows:

d_i = (w_i1, ..., w_im)
A shortened document vector in Boolean representation can, for example, look as follows:

d_i = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

Since the term vector comprises every term or keyword of all documents exactly once, most of the weight elements w_it of a document vector are zero. This leads to two problems when operating the vector space model: first, the null values occupy unnecessary memory; second, the manipulation of the vectors causes unnecessary multiplications by null values during the comparison of the text documents. It is therefore more advantageous and more practical to represent the document vector d_i as a set of coordinate-weight pairs (c_it; w_it). The document vector stated above can then be written as:

d_i = {(10; 1), (11; 1), (14; 1), (18; 1), (19; 1)}

The first part of each doublet indicates the coordinate c_it and describes the position and/or index in the term vector T. In this representation, the TDM matrix can comprise a doublet as each of its elements w_ij and can be considered a tensor.
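The coordinate-weight (doublet) representation described above can be sketched as follows. The helper names are illustrative assumptions; coordinates are counted from 1 so that the output matches the example d_i given in the text:

```python
def to_doublets(dense):
    """Store a mostly-zero document vector as (coordinate, weight) pairs.
    Coordinates are 1-based positions in the term vector T, matching
    the example in the text."""
    return [(i + 1, w) for i, w in enumerate(dense) if w != 0]

def dot_sparse(a, b):
    """Dot product of two doublet lists; zero entries are never touched."""
    b_map = dict(b)
    return sum(w * b_map.get(c, 0) for c, w in a)

# The shortened Boolean document vector from the text:
dense = [0] * 30
for pos in (10, 11, 14, 18, 19):
    dense[pos - 1] = 1

doublets = to_doublets(dense)
print(doublets)                        # -> [(10, 1), (11, 1), (14, 1), (18, 1), (19, 1)]
print(dot_sparse(doublets, doublets))  # -> 5
```

Only the non-zero entries are stored and multiplied, which addresses both of the problems (memory and useless multiplications) noted above.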
In this way, each document can be expressed as a vector in a vector space. In general, the term vector of an entire collection or database of documents may comprise a million or more components. However, each document can be converted into a document vector of about 100-500 components. That is, the number of keywords per document can be reduced in this way, so that a document vector may comprise about 100-500 keywords.

The vector space approach makes it possible to quantify different text documents based on the keywords present in the texts, by associating them with points and/or vectors in the multidimensional vector space. The differences between texts can then be compared by calculating their proximity in the vector space. This can, for example, be done using the cosine index CI.
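The cosine index CI referred to above is not reproduced in this text; a standard form consistent with the description is CI = (q . d) / (|q| * |d|), which is 1 for documents with identical keyword profiles and 0 for documents sharing no keywords. The sketch below assumes that form:

```python
import math

def cosine_index(q, d):
    """CI = (q . d) / (|q| * |d|) for two dense keyword-weight vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    nq = math.sqrt(sum(a * a for a in q))
    nd = math.sqrt(sum(b * b for b in d))
    if nq == 0 or nd == 0:
        return 0.0
    return dot / (nq * nd)

print(round(cosine_index([1, 1, 0], [1, 1, 0]), 9))  # identical profiles -> 1.0
print(round(cosine_index([1, 0, 0], [0, 1, 1]), 9))  # no shared keywords -> 0.0
```

Because CI depends only on the angle between the vectors, it is insensitive to document length, which makes it a natural proximity measure in this keyword space.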
Brief Description of the Drawings
The skilled person will understand that the drawings described below are for illustration purposes only. The drawings are not intended to limit the scope of the present teaching in any way.

Fig. 1 shows an embodiment of a device for semantic search according to an aspect of the invention.

Fig. 1b schematically depicts an embodiment of converting a query into text document data.

Fig. 1c schematically depicts an embodiment of a visualization of the vector space model.

Fig. 2 depicts an embodiment of a method for semantic search according to an aspect of the invention.
Detailed Description

In the following, exemplary embodiments of the invention are described with reference to the drawings. These examples are provided to give a further understanding of the invention, without limiting its scope.

In the following description, a series of features and/or steps are described. The skilled person will appreciate that, unless required by the context, the order of the features and steps is not critical for the resulting configuration and its effect. Further, it will be apparent to the skilled person that, irrespective of the order of the features and steps, time delays may or may not be present between some or all of the described steps.
Referring to Fig. 1, an example of a setup of the invention is shown. The drawing depicts a computer-implemented system 10 according to an aspect of the invention.
The computer-implemented system 10 comprises a memory component 20. The memory component 20 may comprise standard computer memory (such as RAM). Additionally or alternatively, the memory component 20 may comprise non-volatile memory, such as a hard disk drive, memory on a server, flash memory, an optical drive, FeRAM, CBRAM, PRAM, SONOS, RRAM, racetrack memory, NRAM, 3D XPoint, and/or millipede memory.
The memory component 20 may comprise first text document data 21. The first text document data 21 may comprise document vectors. The document vectors can be constructed from text documents. That is, each text document can be mapped to a document vector via the keywords identified in the document. One document vector may comprise 100-500 components (that is, dimensions) comprising individual keywords.
The computer-implemented system 10 can further comprise a processing component 30. The processing component 30 may be adapted to receive second text document data 31 and compare it with the first document data 21. The second text document data 31 can also comprise a document vector. It may, for example, comprise a user-defined query and/or a user-provided identification of a text document (such as a patent number). The second text document data 31 may comprise a document vector that is already part of the first text document data 21. For example, a user interface can be used to search for patents and/or patent applications similar to a specific patent and/or patent application that is already part of the database of the computer-implemented system 10 (that is, already part of the first text document data 21 in the memory component 20).
The processing component 30 may be adapted to receive a query 41 from an input component 40. That is, the query 41 can, for example, be typed in via a user interface in an application, a program, and/or a browser-based interface, which in this case serves as the input component 40. The query 41 may comprise the text of a second text document and/or a specific identification (as described above, this may, for example, comprise a patent and/or patent application number). Once the query 41 has been received, the processing component 30 can convert it into second text document data 31, for example by identifying all keywords in the query, deleting stop-words and blocked words, and generating a document vector of the query. As described above, if the query identifies a document that is already part of the database (of the first text document data 21) in the memory component 20, the processing component 30 can simply retrieve the document vector linked to it as the second text document data 31. The processing component 30 can then compare the second text document data 31 with all the first text document data in the memory component 20. Preferably, the most similar documents can be identified (via their respective document vectors) based on the distance between the document vectors in the multidimensional vector space.
Once the most similar documents within the first text document data 21 have been identified, the processing component can send the result to an output component 50. The output component 50 can then output at least one similar first text document 51 associated with the first text document data 21 most similar to the query 41. Of course, the output component 50 can also output a plurality of similar first text documents 51 sorted based on their similarity to the query 41. For example, the output component 50 may comprise an interface accessible via a computing device, such as a program, an application, and/or a browser-based interface.
Fig. 1b schematically depicts an embodiment of converting a query 41 into text document data. This process can be carried out in a processing component 30, which may comprise, for example, a CPU associated with a computing device. Additionally or alternatively, the processing component may, for example, comprise multiple CPUs for parallel processing and/or a CPU with multiple cores. The query 41 can be sent from the input component 40 (not shown here) to the processing component 30. The query 41 can first be coordinated to obtain a coordinated query 43. The process of coordination is described above. The coordinated query 43 can then be normalized in order to obtain a normalized coordinated query 45. The process of normalization is also described in more detail above.
The normalized coordinated query 45 (which is simply the coordinated query 43 after normalization) can then be converted into a query vector 47. The query vector 47 can be generated by combining the keywords or "terms" of the normalized coordinated query 45 with components or dimensions of the multidimensional vector space. The query vector 47 can then be compared with document vectors 27, which can be stored in the memory component 20 (not shown here).
It should be noted that the document vectors 27 can refer to the first text document data 21 in this document. For clarity, the term "document vectors" can be used so that the technical reader understands that a plurality of different document vectors is meant. The comparison between the query vector 47 and the document vectors 27 can, for example, be based on the distance in the multidimensional vector space. Of course, for this comparison, both the query vector 47 and the document vectors 27 should be in the same vector space, that is, a space defined by the same dimensions. To achieve this, the database comprised in the memory component 20 (not shown) may comprise a term vector. The term vector may comprise one component, or one dimension, for each term or keyword present in all first text documents stored in the database. The query vector 47 and the document vectors 27 can then indicate, relative to the dimensions or components of the term vector, which keywords or terms are present in each specific document or in the query 41. In this way, a unique and consistent vector space is generated. This is explained in more detail above.
Fig. 1c schematically depicts an embodiment of a visualization of the vector space model. It should be noted that this illustration serves clarity purposes only and does not correspond to the mathematical description of the vector space model. The term vector 7 is schematically shown as a circle. The term vector 7 may comprise a plurality of keywords or terms. These keywords or terms can be extracted from a plurality of text documents. In a preferred embodiment, the term vector 7 comprises all keywords from all text documents comprised in the database (that is, all keywords from the first text documents). This is indicated by the large circle in the drawing. The query vector 47 can be generated from the keywords of the query 41 (not shown here). It should be noted that, in this schematic, the query vector 47 is completely contained within the term vector 7, meaning that all keywords comprised in the query 41 are also contained in the first text documents, which are comprised in the database and from which the term vector 7 is generated. However, this is not necessarily the case. It is also entirely possible that the query 41 comprises keywords not comprised in the first text documents, so that the query vector 47 need not lie completely within the vector space generated by the keywords of the term vector 7. If this is the case, however, the keywords of the query 41 not comprised in the term vector 7 will have no similarity with any first text document, and they can therefore be ignored for the purpose of finding the most similar first text documents. Therefore, the query vector 47 can be considered to be generated using only the keywords already considered in the term vector 7. It should be noted that synonyms of keywords can be used for the semantic similarity comparison.
The document vector 27 is depicted as having an intersection with the query vector 47. This means that they comprise some identical keywords and/or synonyms thereof. Therefore, a non-zero similarity measure can be generated between the query vector 47 and the document vector 27. The document vector 27', however, is depicted as having no intersection with the query vector 47. This means that the query 41 and the text document associated with the document vector 27' do not share any keywords or synonyms thereof. This may mean that a null similarity measure can be assigned to the query vector 47 and the document vector 27'.
Fig. 2 schematically illustrates an embodiment of a method for the semantic processing of similarities in text documents according to an aspect of the invention. The figure shows a flow chart describing the steps of comparing an incoming document with an existing pool or database of stored documents.
As an example scenario, consider a user with a certain text, which can, for example, be a patent and/or a patent application. The user requires a so-called "prior-art search". That is, the user needs to obtain or search for other patent documents whose content is close to the text in their possession. The user can then use the invention in the following way. They can send or upload the text document in question to the system. This can, for example, be done via an interface. In one embodiment, the system as described herein may comprise an application-based or browser-based interface for receiving queries. The user can then use the interface to send the query to the system, at which point the following steps can take place.
In S1, the incoming text document or query can be coordinated. That is, misspellings can be corrected. Furthermore, the spelling can be normalized. For example, one convention can be selected from the British and the American spelling conventions, and all words that differ between the two conventions can be converted into the selected one. That is, words such as "colour" and "theatre" can be converted into "color" and "theater", or vice versa. Furthermore, coordination may comprise converting different physical units into one standard and/or one specific physical unit. For example, inches can be converted into metres, pounds can be converted into kilograms, and so on. Furthermore, coordination may comprise converting formulas, such as chemical formulas, gene sequences, and/or protein representations, into standard notation.
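The coordination step can be illustrated with a minimal sketch. The spelling table and unit conversions below are toy assumptions standing in for the full dictionaries such a system would use:

```python
# Illustrative coordination: harmonise spelling variants and physical units
# before vectorisation. Mapping tables here are toy assumptions.

SPELLING = {"colour": "color", "theatre": "theater"}
INCH_TO_M = 0.0254

def coordinate(text):
    out = []
    tokens = text.lower().split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # Unit conversion: "<number> inch(es)" -> metres.
        if i + 1 < len(tokens) and tokens[i + 1] in ("inch", "inches"):
            out.append(f"{float(tok) * INCH_TO_M:.4f} m")
            i += 2
            continue
        # Spelling harmonisation (British -> American here).
        out.append(SPELLING.get(tok, tok))
        i += 1
    return " ".join(out)

print(coordinate("Colour display of 2 inch width"))
# -> color display of 0.0508 m width
```

After this step, "colour" and "color", or "2 inch" and "0.0508 m", map to the same keywords and are no longer counted as different terms.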
In S2, the incoming text document can be normalized. This may comprise isolating the stop-words comprised in the text of the document and deleting them. Stop-words may comprise words such as "and", "first", "however". Stop-words can also be specific to the type of text document to be analyzed. For example, patent documents comprise words present in most patent text documents, such as "claims", "embodiment", "device". Such words can likewise be identified and deleted during the normalization step.
Furthermore, normalization may comprise reducing words to their stems. That is, words such as "computer" and "computing" can, for example, be reduced to their common stem. The stems can then be analyzed for synonyms. Furthermore, word sequences and compound words can be recognized during the normalization step. That is, a word such as "folder" can be recognized and not separated for stemming purposes, so as to keep the meaning of the compound word together.
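The normalization step (stop-word deletion and stemming) can likewise be sketched minimally. The word lists and the crude suffix-stripping rule below are illustrative assumptions, not the actual normalization used:

```python
# Illustrative normalisation: remove generic and patent-specific stop-words,
# then reduce the remaining words to a crude stem.

STOP_WORDS = {"and", "first", "however", "the", "a", "of"}
PATENT_STOP_WORDS = {"claim", "claims", "embodiment", "apparatus"}
SUFFIXES = ("ation", "ing", "er", "s")  # checked longest-ish first

def stem(word):
    """Strip one known suffix, keeping at least a 4-letter stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            return word[: -len(suf)]
    return word

def normalize(tokens):
    kept = [t for t in tokens if t not in STOP_WORDS | PATENT_STOP_WORDS]
    return [stem(t) for t in kept]

print(normalize(["the", "computer", "and", "computing", "claims", "device"]))
# -> ['comput', 'comput', 'device']
```

Note how "computer" and "computing" collapse to the common stem "comput", exactly the behaviour described above; a production system would use a proper stemmer instead of this toy rule.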
In S3, a document vector can be constructed using the previously coordinated and/or normalized text document. The document vector can be a multidimensional vector comprising information about which "terms", that is, stems and their synonyms, are comprised in the text document. This is further illustrated above. It should be noted that, in some embodiments, the document vector may also comprise a tensor.
In S4, the generated document vector can be used to calculate a similarity measure between the incoming text document and the stored text documents. That is, the incoming text document, or rather its document vector, can be compared with a database comprising text documents previously converted into document vectors. It should be noted that, in order to compare different document vectors with a common baseline, there may be one "term vector" comprising all "terms" (that is, words and/or stems and/or synonyms) comprised in all text documents of the database.
Each document vector can then simply indicate which of the terms comprised in the term vector are present in the given document. The term vector can then define a multidimensional vector space, wherein each term may comprise one dimension. Each document vector can be represented, or visualized, as a point or vector in this multidimensional vector space. To compare the document vector generated from the incoming text document with each document vector comprised in the database, the distance between them can be calculated. It should be noted that calculating the distance between vectors in the vector space can be one way, or part of one way, of obtaining a similarity measure between the incoming document and the stored text documents. However, other ways of doing so, based on lexical and/or semantic analysis, may also exist. Furthermore, further variables may also be included in the similarity measure. For example, keywords can be weighted based on the frequency with which they occur in a document and/or based on the technical field of the document; the weights can then be integrated into the document vectors and thus play a role in the similarity measure. Furthermore, further variables of the text documents can be used. In the specific example of patent documents, these may comprise IPC classes, CPC classes, applicants, inventors, patent attorneys, citations, references, co-citation and co-reference information, and image information.
In S5, the similarity measure can be output. For example, several text documents can be output, sorted by their similarity measure with respect to the originally input text document or query. Returning to the example of the application-based and/or browser-based interface given above, the similarity measure can be output via the same interface. That is, for example, a list of text documents similar to the incoming text document or query can be displayed by the application and/or browser, ordered in some way, such as starting from the most similar document. It should be noted that "outputting the similarity measure" can herein refer to outputting at least the one or more documents determined to be most similar to the query.
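The output step can be sketched as a simple ranking of stored documents by their similarity measure; the document identifiers and scores below are made up for illustration:

```python
# Illustrative output step: return the N stored documents ranked by their
# similarity to the query, most similar first.

def rank_results(scores, top_n):
    """scores: {document_id: similarity_measure}. Returns the top_n pairs."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ordered[:top_n]

scores = {"EP100": 0.91, "US200": 0.42, "WO300": 0.77}
print(rank_results(scores, 2))  # -> [('EP100', 0.91), ('WO300', 0.77)]
```

In the patent scenario, the returned identifiers would be patent application or grant numbers, optionally accompanied by the title, abstract, or figures of each hit.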
As used herein, including in the claims, unless the context indicates otherwise, singular forms of the terms are to be construed as also including the plural form, and vice versa. Hence, it should be noted that, as used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise.
Throughout the description and claims, the terms "include", "comprise", "have", "contain" and their variations should be understood as meaning "including but not limited to", and are not intended to exclude other components.
Where terms, features, values and ranges etc. are used in conjunction with terms such as about, around, generally, substantially, essentially, at least etc. (that is, "about 3" shall also cover exactly 3, and "substantially constant" shall also cover strictly constant), the present invention also covers the exact terms, features, values and ranges etc.
The term "at least one" should be understood as meaning "one or more", and therefore includes both embodiments comprising one and embodiments comprising a plurality of components. Furthermore, dependent claims referring to independent claims describing features with "at least one" have the same meaning both when the feature is referred to as "the" and as "the at least one".
It will be appreciated that variations to the foregoing embodiments of the invention can be made while still falling within the scope of the invention. Unless stated otherwise, features disclosed in the description can be replaced by alternative features serving the same, an equivalent, or a similar purpose. Thus, unless stated otherwise, each feature disclosed represents one example of a generic series of equivalent or similar features.
Unless otherwise indicated, the use of exemplary language, such as "for instance", "such as", "for example" and the like, is merely intended to better illustrate the invention and does not indicate a limitation on the scope of the invention. Any steps described in the description can be performed in any order or simultaneously, unless the context clearly indicates otherwise.
All of the features and/or steps disclosed in the description can be combined in any combination, except for combinations in which at least some of the features and/or steps are mutually exclusive. In particular, preferred features of the invention are applicable to all aspects of the invention and can be used in any combination.
Claims (45)

1. A computer-implemented method for comparing text documents, comprising the following steps:
a) establishing a database comprising first text document data (21) associated with a plurality of first text documents; and
b) receiving a query (41); and
c) converting the query (41) into second text document data (31); and
d) comparing the second text document data (31) with the first text document data (21) and calculating at least one similarity measure between the second text document data (31) and the first text document data (21).
2. The method according to the preceding claim, wherein the first text document data (21) comprises document vectors (27) generated from the keywords comprised in the first text documents and/or from words semantically related to said keywords.

3. The method according to any of the preceding claims, wherein the query (41) comprises a second text document and/or information identifying a second text document associated with the second text document data (31) comprised in the first text document data (21) stored in the memory component (20).
4. method according to any of the preceding claims, wherein the inquiry (41) is converted to second text
This document data (31) includes coordinating the inquiry (41).
5. method according to any of the preceding claims, wherein the inquiry is converted to the second text text
File data (31) includes normalizing the inquiry (41).
6. the method according to preceding claims, wherein the normalization inquiry (41) include from external data base at least
Synonym, hypernym, hyponym, stop-word and/or the specific stop-word of theme are retrieved, and is based at least partially on and is retrieved
The word arrived generates the lists of keywords of the inquiry (41).
7. the method according to preceding claims, wherein by delete stop-word and/or the specific stop-word of theme and
At least one of synonym, hypernym and hyponym of word comprising inquiry arrange to generate the keyword of the inquiry (41)
Table.
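Claims 6 and 7 describe generating a keyword list by deleting stop words and expanding the remaining words with semantically related terms. A minimal sketch, assuming small in-memory word lists in place of the external database; all word lists and names here are illustrative:

```python
# Sketch of claims 6-7: delete (subject-specific) stop words, then expand the
# surviving words with synonyms/hypernyms retrieved from a word database.
STOP_WORDS = {"a", "the", "for", "of", "with"}
TOPIC_STOP_WORDS = {"method", "apparatus"}        # subject-specific stop words
RELATED_WORDS = {"car": ["automobile", "vehicle"]}  # stand-in for external DB

def keyword_list(query):
    keywords = []
    for word in query.lower().split():
        if word in STOP_WORDS or word in TOPIC_STOP_WORDS:
            continue                              # delete stop words
        keywords.append(word)
        keywords.extend(RELATED_WORDS.get(word, []))  # add related words
    return keywords

kws = keyword_list("a method for steering the car")
```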
8. The method according to any preceding claim, wherein converting the query (41) into the second text document data (31) comprises generating at least one query vector (47).
9. The method according to the preceding claim, wherein the query vector (47) is generated by identifying keywords and/or synonyms of keywords from the query (41) and representing the keywords by components of vectors in a multi-dimensional vector space.
10. The method according to the preceding claim, wherein the query vector (47) comprises 100 to 500 components, preferably 200 to 400 components, even more preferably 200 to 300 components.
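Claims 8 to 10 describe a query vector with a fixed number of components in a multi-dimensional vector space. The sketch below fixes 300 components (within the preferred 200 to 300 range) and uses the hashing trick to map each keyword to a component; the hashing scheme is an assumption for illustration, not part of the claims:

```python
# Sketch of claims 8-10: a fixed-size query vector over keyword components.
N_COMPONENTS = 300  # within the preferred 200-300 component range

def query_vector(keywords):
    vec = [0.0] * N_COMPONENTS
    for kw in keywords:
        # Hashing trick: each keyword increments one of the 300 components.
        vec[hash(kw) % N_COMPONENTS] += 1.0
    return vec

qv = query_vector(["steering", "car", "automobile"])
```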
11. The method according to any preceding claim having the features of claim 9, wherein weights are assigned to the keywords.
12. The method according to the preceding claim, wherein the weights are assigned based at least in part on the general subject of the query (41).
13. The method according to any preceding claim, wherein calculating the similarity measure comprises applying at least one of the following, or a combination thereof: cosine index, Jaccard index, Dice index, containment index, Pearson correlation coefficient, Levenshtein distance, Jaro-Winkler distance and/or the Needleman-Wunsch algorithm.
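Several of the similarity measures named in claim 13 (cosine, Jaccard, Dice, and the Levenshtein edit distance) admit compact textbook implementations. The versions below are illustrative sketches, not the patent's own code:

```python
# Textbook implementations of four of the measures named in claim 13.
import math

def cosine(a, b):
    # Cosine index between two numeric vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def jaccard(s, t):
    # Jaccard index between two term sets.
    return len(s & t) / len(s | t) if s | t else 0.0

def dice(s, t):
    # Dice index between two term sets.
    return 2 * len(s & t) / (len(s) + len(t)) if s or t else 0.0

def levenshtein(s, t):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]
```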
14. The method according to any preceding claim, further comprising the following steps after step d):
f) validating the at least one similarity measure using at least one statistical algorithm; and
g) outputting the at least one similarity measure.
15. The method according to the preceding claim, wherein the query (41) is received from a user interface and the similarity measure is returned via said interface.
16. The method according to any preceding claim, wherein the database comprises text documents related to patent documents, and wherein building the database and/or converting the query (41) comprises deleting stop words associated with the patent-related text documents.
17. The method according to the preceding claim, wherein the patent-related stop words are deleted by calculating an entropy associated with the terms contained in the first text document data (21) and/or in the query (41) and deleting terms having a low entropy.
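The entropy-based deletion of claim 17 can be sketched as follows. The claim does not pin down the exact entropy definition, so the self-information measure below, under which terms occurring in (almost) every document score near zero ("low entropy"), is an assumption chosen so that ubiquitous patent terms are the ones deleted:

```python
# Sketch of claim 17: flag low-entropy terms across a document collection as
# patent-specific stop words. The threshold and entropy definition are
# illustrative assumptions.
import math
from collections import Counter

def low_entropy_terms(documents, threshold=0.5):
    n = len(documents)
    doc_sets = [set(d.lower().split()) for d in documents]
    df = Counter(t for s in doc_sets for t in s)   # document frequency
    # Self-information of seeing the term in a document: terms present in
    # (almost) every document, such as "wherein" in patents, score near 0.
    return {t for t, k in df.items() if -math.log2(k / n) < threshold}

docs = [
    "a method wherein the widget rotates",
    "a method wherein the gear meshes",
    "a method wherein the pump circulates fluid",
]
patent_stop_words = low_entropy_terms(docs)
```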
18. The method according to any preceding claim, further comprising generating a term vector (7) comprising keywords extracted from the plurality of first text documents.
19. The method according to the preceding claim having the features of claims 2 and 8, wherein the components of the document vectors (27) and of the query vector (47) are generated relative to the components of the term vector (7).
20. The method according to any preceding claim having the features of claims 2 and 8, wherein the similarity measure between the second text document data (31) and the first text document data (21) is calculated by computing the distance between the query vector (47) and the document vectors (27) using the cosine index.
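Claims 18 to 20 can be sketched together: a term vector of corpus keywords fixes the component order, document and query vectors are built relative to it, and the similarity measure is the cosine index between them. The keyword list and counting scheme are illustrative assumptions:

```python
# Sketch of claims 18-20: document vector (27) and query vector (47) built
# relative to the components of a shared term vector (7), compared by cosine.
import math

TERM_VECTOR = ["engine", "gear", "pump", "valve"]  # keywords from the corpus

def to_vector(words):
    # One component per term-vector keyword; value = occurrence count.
    return [float(words.count(t)) for t in TERM_VECTOR]

def cosine_index(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

doc_vec = to_vector("engine gear gear valve".split())   # document vector (27)
query_vec = to_vector("gear valve".split())             # query vector (47)
sim = cosine_index(query_vec, doc_vec)
```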
21. A computer-implemented method for processing similarity in text documents, comprising:
a) harmonizing at least one incoming query (41); and
b) normalizing the at least one incoming harmonized query (43); and
c) constructing at least one query vector (47) using the at least one normalized harmonized query (45); and
d) calculating at least one similarity measure between the at least one query vector (47) and at least one further text document, wherein the at least one further text document has undergone the preceding steps.
22. The method according to the preceding claim, wherein the text documents comprise at least one of technical texts, scientific texts, patent texts and/or product descriptions, or a combination thereof.
23. The method according to any one of the two preceding claims, wherein harmonizing comprises correcting typing errors, selecting a specific spelling convention and a physical unit convention and adjusting the text based on the specific spelling convention and the physical unit convention, and/or representing formulae (for example chemical formulae, gene sequences and/or protein representations) in a standard fashion.
24. The method according to any one of the preceding claims 21 to 23, wherein normalizing comprises identifying and deleting stop words, reducing words to common stems, resolving synonyms to stems, and/or identifying word sequences and compound words.
25. The method according to the preceding claim, wherein normalizing further comprises identifying and deleting stop words associated with a certain type of text document, preferably by calculating the entropy of the terms in a plurality of text documents of that type and deleting words having a low entropy.
26. The method according to any one of claims 21 to 25, wherein calculating the similarity measure comprises applying at least one of the following, or a combination thereof: cosine index, Jaccard index, Dice index, containment index, Pearson correlation coefficient, Levenshtein distance, Jaro-Winkler distance and/or the Needleman-Wunsch algorithm.
27. The method according to any one of claims 21 to 26, further comprising the following steps after step d):
f) validating the at least one similarity measure using at least one statistical algorithm; and
g) outputting the at least one similarity measure.
28. A computer-implemented system (10) according to any one of the preceding claims, comprising:
a) at least one memory component (20) adapted to store at least a database comprising the first text document data (21) associated with a plurality of first text documents;
b) at least one input device (40) adapted to receive a query (41), the query (41) comprising a second text document and/or information identifying a second text document, the second text document being associated with second text document data (31) contained in the first text document data (21) stored in the memory component (20); and
c) at least one processing component (30) adapted to convert the query (41) into the second text document data (31) and/or to retrieve from the at least one memory component (20) the second text document data (31) associated with the query (41), and to compare the second text document data (31) with the first text document data (21) stored in the at least one memory component (20);
d) at least one output device (50) adapted to return information identifying at least one similar first text document (51) associated with the first text document data (21), the similar first text document (51) being the one of the first text documents most similar to the query (41).
29. The system according to the preceding claim, wherein the first text document data (21) comprises a plurality of document vectors (27), and wherein the second text document data (31) comprises a query vector (47).
30. The system according to any one of the preceding claims 28 to 29, wherein the memory component (20) comprises first text document data (21) associated with scientific papers and/or technical specifications and/or patent documents and/or product descriptions.
31. The system according to any one of the preceding claims 28 to 30, wherein the second text document data (31) is obtained by harmonizing and normalizing the second text document and constructing the at least one query vector (47).
32. The system according to any one of the preceding claims 28 to 31, wherein the comparison between the first text document data (21) and the second text document data (31) generates a similarity index.
33. The system according to the preceding claim, wherein the output device (50) returns information associated with a plurality of first text documents, ordered by the similarity index from most similar to least similar, the most similar first text document being the one whose first text document data (21) yields the highest similarity index with the second text document data (31).
34. The system according to any one of the preceding claims 28 to 33, wherein the similarity index is based on lexical and/or semantic comparisons between the text documents.
35. The system according to any one of the preceding claims 28 to 34, wherein the processing component (30) identifies keywords while harmonizing and normalizing an incoming second text document.
36. The system according to any one of the preceding claims 28 to 35, wherein the processing component (30) assigns weights to keywords based on an entropy algorithm.
37. The system according to any one of the preceding claims 28 to 36, wherein the processing component (30) is adapted to parallelize computation by dividing the second text document into at least two parts, preferably into at least four parts.
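The parallelization of claim 37 can be sketched by splitting the incoming document into parts and processing them concurrently. The thread pool and the word-count placeholder stand in for the actual per-part computation, which the claim leaves open:

```python
# Sketch of claim 37: divide the second text document into (at least) four
# parts and process the parts in parallel.
from concurrent.futures import ThreadPoolExecutor

def split_into_parts(text, n_parts=4):
    words = text.split()
    size = -(-len(words) // n_parts)   # ceiling division -> n_parts chunks
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def process_part(part):
    # Placeholder per-part computation (here: count the words in the part).
    return len(part.split())

text = "one two three four five six seven eight"
parts = split_into_parts(text, n_parts=4)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_part, parts))
```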
38. The system according to any one of the preceding claims, wherein the processing component (30) comprises at least two cores, preferably at least four cores, more preferably at least eight cores.
39. The system according to any one of the preceding claims 28 to 38, wherein the processing component (30) is adapted to regularly update the first document data (21) stored in the memory component (20).
40. The system according to any one of the preceding claims 28 to 39, wherein the input device (40) is further adapted to allow specifying the query (41) by listing similar text documents and/or words and/or sentences that must or must not be included.
41. The system according to any one of the preceding claims 28 to 40, wherein the input device (40) is further adapted to allow specifying the query (41) by specifying the number of most similar text documents to be output.
42. The system according to any one of the preceding claims 28 to 41, wherein the memory component (20) comprises RAM (random access memory).
43. The system according to any one of the preceding claims 28 to 42, wherein the memory component (20) further comprises a term vector (7), the term vector comprising keywords extracted from the plurality of first text documents.
44. The system according to the preceding claim having the features of claim 29, wherein the processing component (30) is adapted to generate the components of the document vectors (27) and of the query vector (47) relative to the components of the term vector (7).
45. The system according to any one of the preceding claims 28 to 44 having the features of claim 29, wherein the processing component (30) is adapted to calculate the distance between the query vector (47) and the document vectors (27) by comparing the second text document data (31) with the first text document data (21) using the cosine index.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP16198539 | 2016-11-11 | ||
EP16198539.5 | 2016-11-11 | ||
PCT/EP2017/078674 WO2018087190A1 (en) | 2016-11-11 | 2017-11-08 | Apparatus and method for semantic search |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110023924A true CN110023924A (en) | 2019-07-16 |
Family
ID=57288265
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201780069862.1A Pending CN110023924A (en) | 2016-11-11 | 2017-11-08 | Device and method for semantic search |
Country Status (6)
Country | Link |
---|---|
US (1) | US20190347281A1 (en) |
EP (1) | EP3539018A1 (en) |
JP (1) | JP7089513B2 (en) |
CN (1) | CN110023924A (en) |
AU (1) | AU2017358691A1 (en) |
WO (1) | WO2018087190A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111710387A (en) * | 2020-04-30 | 2020-09-25 | 上海数创医疗科技有限公司 | Quality control method for electrocardiogram diagnosis report |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11762989B2 (en) | 2015-06-05 | 2023-09-19 | Bottomline Technologies Inc. | Securing electronic data by automatically destroying misdirected transmissions |
US20170163664A1 (en) | 2015-12-04 | 2017-06-08 | Bottomline Technologies (De) Inc. | Method to secure protected content on a mobile device |
US11163955B2 (en) | 2016-06-03 | 2021-11-02 | Bottomline Technologies, Inc. | Identifying non-exactly matching text |
US11416713B1 (en) | 2019-03-18 | 2022-08-16 | Bottomline Technologies, Inc. | Distributed predictive analytics data set |
US11030222B2 (en) * | 2019-04-09 | 2021-06-08 | Fair Isaac Corporation | Similarity sharding |
US11232267B2 (en) * | 2019-05-24 | 2022-01-25 | Tencent America LLC | Proximity information retrieval boost method for medical knowledge question answering systems |
US11042555B1 (en) | 2019-06-28 | 2021-06-22 | Bottomline Technologies, Inc. | Two step algorithm for non-exact matching of large datasets |
US11269841B1 (en) | 2019-10-17 | 2022-03-08 | Bottomline Technologies, Inc. | Method and apparatus for non-exact matching of addresses |
CN111339261A (en) * | 2020-03-17 | 2020-06-26 | 北京香侬慧语科技有限责任公司 | Document extraction method and system based on pre-training model |
US11526551B2 (en) * | 2020-04-10 | 2022-12-13 | Salesforce, Inc. | Search query generation based on audio processing |
US11449870B2 (en) | 2020-08-05 | 2022-09-20 | Bottomline Technologies Ltd. | Fraud detection rule optimization |
US11694276B1 (en) | 2021-08-27 | 2023-07-04 | Bottomline Technologies, Inc. | Process for automatically matching datasets |
US11544798B1 (en) | 2021-08-27 | 2023-01-03 | Bottomline Technologies, Inc. | Interactive animated user interface of a step-wise visual path of circles across a line for invoice management |
CN113987115A (en) * | 2021-09-26 | 2022-01-28 | 润联智慧科技(西安)有限公司 | Text similarity calculation method, device, equipment and storage medium |
CN113806491B (en) * | 2021-09-28 | 2024-06-25 | 上海航空工业(集团)有限公司 | Information processing method, device, equipment and medium |
US20230281396A1 (en) * | 2022-03-03 | 2023-09-07 | International Business Machines Corporation | Message mapping and combination for intent classification |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5974412A (en) * | 1997-09-24 | 1999-10-26 | Sapient Health Network | Intelligent query system for automatically indexing information in a database and automatically categorizing users |
JP2003157270A (en) * | 2001-11-22 | 2003-05-30 | Ntt Data Technology Corp | Method and system for retrieving patent literature |
US20030172058A1 (en) * | 2002-03-07 | 2003-09-11 | Fujitsu Limited | Document similarity calculation apparatus, clustering apparatus, and document extraction apparatus |
US7409383B1 (en) * | 2004-03-31 | 2008-08-05 | Google Inc. | Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems |
US20090190839A1 (en) * | 2008-01-29 | 2009-07-30 | Higgins Derrick C | System and method for handling the confounding effect of document length on vector-based similarity scores |
CN104765779A (en) * | 2015-03-20 | 2015-07-08 | 浙江大学 | Patent document inquiry extension method based on YAGO2s |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002063192A (en) * | 2000-08-22 | 2002-02-28 | Patolis Corp | Patent document system |
US7383258B2 (en) | 2002-10-03 | 2008-06-03 | Google, Inc. | Method and apparatus for characterizing documents based on clusters of related words |
JP4534666B2 (en) * | 2004-08-24 | 2010-09-01 | 富士ゼロックス株式会社 | Text sentence search device and text sentence search program |
US20110082839A1 (en) * | 2009-10-02 | 2011-04-07 | Foundationip, Llc | Generating intellectual property intelligence using a patent search engine |
JP5578137B2 (en) * | 2011-05-25 | 2014-08-27 | 富士通株式会社 | Search program, apparatus and method |
US8935230B2 (en) | 2011-08-25 | 2015-01-13 | Sap Se | Self-learning semantic search engine |
US20140280088A1 (en) | 2013-03-15 | 2014-09-18 | Luminoso Technologies, Inc. | Combined term and vector proximity text search |
- 2017
- 2017-11-08 CN CN201780069862.1A patent/CN110023924A/en active Pending
- 2017-11-08 WO PCT/EP2017/078674 patent/WO2018087190A1/en unknown
- 2017-11-08 US US16/348,825 patent/US20190347281A1/en not_active Abandoned
- 2017-11-08 AU AU2017358691A patent/AU2017358691A1/en not_active Abandoned
- 2017-11-08 EP EP17798181.8A patent/EP3539018A1/en not_active Ceased
- 2017-11-08 JP JP2019525873A patent/JP7089513B2/en active Active
Non-Patent Citations (1)
Title |
---|
MILOS RADOVANOVIC ET AL: "On the Existence of Obstinate Results in Vector Space Models", 《PROCEEDINGS OF THE 33RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL》 * |
Also Published As
Publication number | Publication date |
---|---|
JP7089513B2 (en) | 2022-06-22 |
JP2020500371A (en) | 2020-01-09 |
AU2017358691A1 (en) | 2019-05-23 |
EP3539018A1 (en) | 2019-09-18 |
US20190347281A1 (en) | 2019-11-14 |
WO2018087190A1 (en) | 2018-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110023924A (en) | Device and method for semantic search | |
Bhagavatula et al. | Content-based citation recommendation | |
US11900064B2 (en) | Neural network-based semantic information retrieval | |
CA2523128C (en) | Information retrieval and text mining using distributed latent semantic indexing | |
Wang et al. | Targeted disambiguation of ad-hoc, homogeneous sets of named entities | |
US20160283564A1 (en) | Predictive visual search enginge | |
CN108875065B (en) | Indonesia news webpage recommendation method based on content | |
Thanda et al. | A Document Retrieval System for Math Queries. | |
Deng et al. | A distributed PDP model based on spectral clustering for improving evaluation performance | |
Peng et al. | Hierarchical visual-textual knowledge distillation for life-long correlation learning | |
Zoupanos et al. | Efficient comparison of sentence embeddings | |
CN111143400A (en) | Full-stack type retrieval method, system, engine and electronic equipment | |
Rao et al. | An efficient semantic ranked keyword search of big data using map reduce | |
Prajapati et al. | Extreme multi-label learning: a large scale classification approach in machine learning | |
Wang | A semi-supervised learning approach for ontology matching | |
Laddha et al. | Novel concept of query-similarity and meta-processor for semantic search | |
Brázdil | Dimensionality reduction methods for vector spaces | |
Gisolf et al. | Search and Explore Strategies for Interactive Analysis of Real-Life Image Collections with Unknown and Unique Categories | |
Huybrechts et al. | Learning to rank with deep neural networks | |
Moraes et al. | Design principles and a software reference architecture for big data question answering systems | |
Abbasi et al. | Introducing triple play for improved resource retrieval in collaborative tagging systems | |
Premjith et al. | Metaheuristic Optimization Using Sentence Level Semantics for Extractive Document Summarization | |
Zhang et al. | A Content-Based Dataset Recommendation System for Biomedical Datasets | |
Sudha et al. | Efficient diversity aware retrieval system for handling medical queries | |
Elshater et al. | Web service discovery for large scale iot deployments |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190716 |