NZ794252A - System and Method for Finding Similar Documents Based on Semantic Factual Similarity
Publication number: NZ794252A
Authority: NZ (New Zealand)
Prior art keywords: library, triples, documents, facts, triple
Abstract
The present disclosure is directed towards systems and methods for finding documents that are similar to a reference text. The inventive systems and methods examine a set of collected documents to determine the facts present in those documents by, for example, extracting triplets and expanding them. A user's input reference text is similarly examined to extract and expand triplets therein and the facts identified with respect to the reference text are used as a basis to find documents having similar facts. The present disclosure is also related to systems and methods for mining facts from documents relating to a primary source such as a piece of legislation and using the mined facts to improve the results of subsequent searches.
Description
SYSTEM AND METHOD FOR FINDING SIMILAR DOCUMENTS
BASED ON SEMANTIC FACTUAL SIMILARITY
This application for letters patent disclosure document describes inventive aspects
that include various novel innovations (hereinafter “disclosure”) and contains material that is
subject to copyright, mask work, and/or other intellectual property protection. The respective
owners of such intellectual property have no objection to the facsimile reproduction of the
disclosure by anyone as it appears in published Patent Office file/records, but otherwise
reserve all rights.
REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of and priority to U.S. Provisional Application
No. 62/426,727, filed November 28, 2016, and U.S. Provisional Application No. 62/550,839,
filed August 28, 2017, which are both hereby incorporated by reference in their entireties.
BACKGROUND
The present innovations generally address tools for finding documents that are
similar to a reference. Previously, in order to find documents of interest, researchers were
required to carefully craft search strategies for obtaining the information sought. In many
cases, substantial skill and experience on the part of the researcher were needed in order to
craft a search that would successfully and efficiently obtain the information sought. For
example, a researcher’s experience with information classification systems and even foreknowledge
of a document’s exact contents were sometimes required in order to find some
documents.
At a basic level, one previous approach for finding documents provided a word
search in which a user can search for all documents containing a certain word or phrase. The
results may be filtered or otherwise restricted (e.g., by date, author, county of origin, etc.) to
yield a result set. More advanced searches were possible using Boolean and other operators,
but still these searches required skill and/or advanced knowledge of the documents sought in
order to be successful.
[0005] Other previous approaches took the basic word search a step further by
performing an initial analysis of documents available for searching to identify a relative
importance of words or topics relating to the documents. For example, documents ingested
into a research collection or library may be analyzed to produce a vector space model for
each document representing the relative importance of various index terms that are related to
the document. A particular example is the term frequency-inverse document frequency
model (“tf-idf”). Subsequent word searches produce results based on the predetermined
importance of search terms within result documents. In other examples, conceptual topics are
identified in documents (manually and/or through the use of computer software) and searches
may be performed on the previously identified topics or the topics may be browsed.
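By way of non-limiting illustration only, the tf-idf weighting mentioned above may be sketched as follows. This is a minimal sketch of one common textbook variant (term frequency multiplied by the logarithm of the inverse document frequency); the function name `tf_idf`, the whitespace tokenization and the sample documents are assumptions made for the example and are not taken from any particular prior system.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical two-document library (each document must be non-empty).
docs = ["the jet overran the runway", "the court dismissed the appeal"]
vectors = tf_idf(docs)
```

In this sketch a term that occurs in every document, such as "the", receives a weight of zero, while a term confined to one document receives a positive weight. The weighting reflects word importance only and carries no notion of the facts a document states.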
However, there still remains a need for a system and method for finding
documents based on semantic similarity between the documents. The new tools for finding
documents in this manner presented herein improve access to such documents, make
searching for documents that are similar to a reference quicker, more efficient, less prone to
error and yield a more comprehensive, yet more precisely targeted result set of documents
than was previously possible.
In order to develop a reader's understanding of the innovations, disclosures have
been compiled into a single description to illustrate and clarify how aspects of these
innovations operate independently, interoperate as between individual innovations, and/or
cooperate collectively. The application goes on to further describe the interrelations and
synergies as between the various innovations; all of which is to further compliance with 35
U.S.C. §112.
BRIEF SUMMARY
The present invention provides a system and method for finding and retrieving
documents that are similar to a reference, and in particular where the similarity is determined
based at least in part on the semantic similarity of facts present in both.
In one aspect, a method for finding documents comprises ingesting at least two
library documents by extracting and indexing library triples therefrom, receiving a reference
text string, extracting at least one reference triple from the reference text string, identifying
one or more library triples similar to the at least one reference triple, and returning a list of
one or more result library documents based on the identified library triples.
In some implementations, the method further comprises expanding the library
triples based on a semantic corpus to obtain expanded library triples and indexing the
expanded library triples while maintaining a record of the library document from which the
library triples used to obtain them were extracted, wherein the identifying step includes
identifying one or more expanded library triples similar to the at least one reference triple
and the list of one or more result library documents returned by the returning step is based on
the identified library triples and expanded library triples.
In other implementations, the method further comprises expanding the at least one
reference triple based on a semantic corpus to obtain at least one expanded reference triple,
wherein the identifying step includes identifying one or more library triples similar to the at
least one expanded reference triple.
In other implementations, the expanding step includes forming multi-word tokens
as components of a library triple based on a semantic corpus.
[0013] In other implementations, the expanding step includes forming multi-word tokens
as components of a reference triple based on a semantic corpus.
In other implementations, the returned list is ranked based on a similarity between
the identified library triples in each listed library document and the one or more reference
triples.
[0015] In other implementations, the method further comprises scoring library
documents from which identified library triples were extracted based on an aggregation of
similarity scores between each identified library triple and its corresponding reference triple.
In other implementations, the list that is returned includes only library documents
having a similarity score above a predefined threshold.
[0017] In other implementations, the listed library documents are ranked according to
their similarity scores.
In other implementations, the method further comprises receiving a second
reference text string after returning the list, extracting at least one second reference triple
from the second reference text string, identifying one or more library triples similar to the at
least one second reference triple, and returning an updated list of one or more result library
reference documents based on the library triples identified with respect to both the first
reference triples and second reference triples.
In another aspect, a method for mining facts from a body of documents
comprises ingesting two or more library documents by extracting and indexing library triples
therefrom that relate to a primary source, grouping similar triples into one or more fact
groups, ingesting a later document after the two or more library documents by extracting
later triples therefrom that relate to a primary source, and grouping the later triples into the
one or more fact groups based on a similarity between the later triples and the library triples
previously comprising the one or more fact groups.
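For illustration only, the grouping of later triples into existing fact groups may be sketched as follows. This is a minimal sketch under stated assumptions: the token-overlap (Jaccard) measure is a simple stand-in for the semantic similarity described herein, and the names `jaccard` and `add_to_groups`, the 0.5 threshold and the sample triples are all hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity between two triples (stand-in for a semantic measure)."""
    ta, tb = set(" ".join(a).split()), set(" ".join(b).split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def add_to_groups(groups, triple, threshold=0.5):
    """Place a triple in the most similar existing fact group, or start a new group."""
    best_group, best_score = None, 0.0
    for group in groups:
        score = max(jaccard(triple, member) for member in group)
        if score > best_score:
            best_group, best_score = group, score
    if best_group is not None and best_score >= threshold:
        best_group.append(triple)
    else:
        groups.append([triple])
    return groups

# Triples from the initially ingested library documents (hypothetical).
groups = []
for t in [("driver", "run", "red light"), ("tenant", "withhold", "rent")]:
    add_to_groups(groups, t)

# A triple from a later document joins the fact group it most resembles.
add_to_groups(groups, ("driver", "run", "stop light"))
```

Here the later triple shares three of five distinct tokens with the first group and is therefore added to it, while the two initial triples share no tokens and remain in separate groups.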
[0020] In some implementations, the method further comprises receiving a reference text
string, extracting at least one reference triple from the reference text string, expanding the at
least one reference triple based on the one or more fact groups to obtain at least one
expanded reference triple, identifying one or more library triples similar to the at least one
expanded reference triple, and returning a list of one or more result library documents based
on the identified library triples.
In other implementations, the method further comprises receiving a reference text
string, extracting at least one reference triple from the reference text string, expanding the at
least one reference triple based on the one or more fact groups to obtain at least one
expanded reference triple, identifying one or more library triples similar to the at least one
expanded reference triple, and returning a list of one or more primary sources based on the
identified library triples.
In another aspect, a method for finding documents relating to a primary source
comprises ingesting two or more library documents by extracting and indexing library triples
therefrom that relate to a primary source, receiving a reference text string, extracting at least
one reference triple from the reference text string, identifying one or more library triples
similar to the at least one reference triple, and returning a list of one or more primary sources
based on the identified library triples.
In another aspect, a measure of similarity between two documents is based on a
combination of one or more of the semantic similarity between the different components of
the facts that are extracted from each document, the sequence of the facts in both documents
and how much they agree on, the semantic similarity between sentences in both documents,
other metadata that describe the documents such as their topics and references to other
documents and/or authorities, and/or the weights of each of these factors, determined by the
user, to reflect their significance, which results in adjusting the overall similarity score of a
given document.
In some implementations, the method further comprises optimizing the search
process to avoid computing the similarity to each document in the document collection by
indexing the semantically expanded facts from the document collection and scoring and/or
ranking the results from the index lookups to compute an overall relevance score for each
document and present the results ordered accordingly.
In another aspect, a new search workflow is implemented as a browser extension
allowing for seamless integration of the search functionality without leaving the current
document context. Search results may be displayed in the browser extension window to
accompany the current context without disrupting it.
[0026] In another aspect, a new interactive search workflow where users enter facts or
statements line by line and the results view is updated automatically in real-time to show the
documents that are most relevant to the current list of statements.
In another aspect, a system and method for mining facts that are extracted from a
collection of legal documents comprises extracting and mining facts from documents that
cite a particular law, grouping similar facts into fact groups according to their semantic
similarity and treating a fact group as a single item in the mining process, and utilizing the
overall frequency of mentions of a fact in the whole corpus to avoid generating generally
popular facts as relevant.
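As a non-limiting sketch of how corpus-wide frequency might be used to avoid surfacing generally popular facts, the share of citing documents that mention a fact group may be discounted by how common that fact group is in the whole corpus. The particular formula below (support multiplied by a logarithmic inverse frequency), the function name `fact_relevance` and the counts are assumptions chosen for illustration; the disclosure does not prescribe a specific formula.

```python
import math

def fact_relevance(citing_with_fact, citing_total, corpus_with_fact, corpus_total):
    """Share of citing documents that mention the fact group, discounted by
    how common the fact group is across the whole corpus.

    Assumes corpus_with_fact >= 1 and citing_total >= 1.
    """
    support = citing_with_fact / citing_total
    rarity = math.log(corpus_total / corpus_with_fact)
    return support * rarity

# Hypothetical counts for two fact groups among 10 documents citing one law,
# within a corpus of 1000 documents.
specific_fact = fact_relevance(8, 10, 20, 1000)    # rare in the corpus overall
popular_fact = fact_relevance(9, 10, 900, 1000)    # mentioned almost everywhere
```

Under these assumed counts the specific fact scores well above the popular fact, even though the popular fact appears in slightly more of the citing documents.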
In another aspect, a new method for semantically expanding terms in search
queries is guided by the dataset generated as described above to restrict and guide the
expansion only to semantically similar terms that are related to the same legislation, and
hence, have similar legal implications. For example, search queries consist mainly of facts
to be searched; facts in a search query are matched against the dataset to find the most relevant
laws; the matched conceptual fact groups are retrieved to use for expansion; and the terms of a
fact are expanded utilizing other facts in the matched conceptual fact groups that: (a) mention
the same law; (b) are most relevant to the fact in the search query; and (c) are most relevant
to the target law.
In some implementations, the method further comprises extracting facts from the
search query text and using them to query the dataset to find relevant laws, and the retrieved
laws are ranked by aggregating the scores of their relevance to the facts in the
search query.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings illustrate various non-limiting, example, innovative
aspects in accordance with the present descriptions:
Fig. 1 is a schematic diagram illustrating the high-level architecture of how one
embodiment of an exemplary system may be implemented;
Fig. 2 is a flow chart that shows an exemplary embodiment of preprocessing
which may run offline;
[0033] Fig. 3 is a flow chart that shows an exemplary embodiment of a fact extraction
process or module such as those depicted in Figs. 1 and 2;
Fig. 4 is a flow chart that describes in more detail the process of expanding facts
semantically;
Fig. 5 shows a block diagram illustrating embodiments of a Factual Similarity
System controller according to an exemplary embodiment.
Fig. 6 is a flow chart that shows an online or real-time phase in which the present
system and method can be used to find documents that are similar to a particular reference
document or snippet of text;
Figs. 7-10 are screenshots illustrating exemplary applications of the present
system and method;
Fig. 11 is a schematic diagram illustrating an exemplary overview of a process
that generates a target dataset;
Fig. 12 is a flow chart that illustrates an exemplary extraction process according
to an exemplary embodiment;
[0040] Fig. 13 is a flow chart that illustrates an exemplary flow of a fact extraction
process;
Fig. 14 is a flow chart that depicts an exemplary process of expanding facts
semantically;
Fig. 15 is a flow chart that illustrates an exemplary fact mining process according
to an exemplary embodiment;
Fig. 16 is a flow chart that illustrates an exemplary process of semantically
expanding fact terms; and
Fig. 17 is a flow chart that illustrates an exemplary application utilizing a
legislation-related fact dataset to find relevant laws and statutes that apply to an input fact
scenario.
DETAILED DESCRIPTION
Embodiments of systems and methods for finding similar documents based on
semantic factual similarity are described herein. While aspects of the described systems and
methods can be implemented in any number of different configurations, the embodiments are
described in the context of the following exemplary configurations. The descriptions and
details of well-known components and structures are omitted for simplicity of the
description, but would be readily familiar to those having ordinary skill in the art.
The description and figures merely illustrate exemplary embodiments of the
inventive systems and methods. It will thus be appreciated that those skilled in the art will be
able to devise various arrangements that, although not explicitly described or shown herein,
embody the principles of the present subject matter. Furthermore, all examples recited herein
are intended to be for illustrative purposes only to aid the reader in understanding the
principles of the present subject matter and the concepts contributed by the inventors to
furthering the art, and are to be construed as being without limitation to such specifically
recited examples and conditions. Moreover, all statements herein reciting principles, aspects,
and embodiments of the present subject matter, as well as specific examples thereof, are
intended to encompass all equivalents thereof.
In general, the systems and methods described herein may relate to improvements
to aspects of using computers to find similar documents based on semantic factual similarity.
These improvements not only improve the functioning of how such a computer (or any
number of computers employed in a search for similar documents) is able to operate to serve
the user’s research goals, but also improve the accuracy, efficiency and usefulness of the
search results that are returned to the searcher. The inventive search tools described herein
generally are configured to receive a reference text from a user and to compare the reference
text to the text of cataloged documents to find similar documents to the reference text. The
comparison may be accomplished by, for example, extracting, expanding and indexing facts
from documents to be catalogued and comparing these against facts extracted and expanded
from the reference texts input by users.
[0048] The tools described herein are particularly suited to legal documents and research
and are generally discussed in that context; however, it will be appreciated that many other
types of documents, research and researchers will benefit from the inventive tools disclosed
and claimed herein.
One of the goals of legal research is to find precedents. In common law, judges
use precedents such as past decisions to guide their current decisions. Lawyers also use
precedents to support their arguments or build case strategies, among other tasks.
Finding legal precedents is one example of an application of the systems and
methods described herein in which a goal is to find relevant cases with similar facts to a
current situation. In an exemplary process, the semantic factual similarity measure described
herein is used as a tool to enable legal researchers to find precedents.
Fig. 1 is a schematic diagram illustrating the high-level architecture of how one
embodiment of an exemplary system may be implemented. It shows the different system
components and the operations that may be done in the preprocessing phase (offline) and at
runtime (online). Of course, various tasks may also be performed at any time or
continuously. For example, new documents 102 may be ingested at the same time or after a
user enters a reference text 104 through their browser extension 106 in online operation. In
one example, a search operation is exposed via a web service 108 that can be accessed and
interacted with remotely, e.g., through a browser extension. For example, a browser
extension may be configured to serve as a remote web client that performs HTTP GET/POST
operations to a REST web service that is hosted and provided by a server.
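Purely as an illustration of the client side of such an arrangement, a remote client might package the reference text as a JSON body of an HTTP POST request. The sketch below only constructs the request and does not send it; the endpoint URL, the `reference_text` field name and the function name `build_search_request` are hypothetical and are not part of the described web service.

```python
import json
import urllib.request

def build_search_request(reference_text, base_url="http://localhost:8000/search"):
    """Build (without sending) a POST request carrying the reference text as JSON.

    The URL and the JSON field name are placeholders for this example.
    """
    body = json.dumps({"reference_text": reference_text}).encode("utf-8")
    return urllib.request.Request(
        base_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_search_request("Air France jet that overran the runway")
# Passing `request` to urllib.request.urlopen would transmit it to a running
# service, whose response could then be rendered in the extension window.
```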
Fig. 2 shows an exemplary embodiment of preprocessing which may run offline.
The goal of this process is to build 202 an index 204 on the semantically expanded 400 facts
that are extracted 300 from ingested documents 206.
Fig. 3 shows an exemplary embodiment of a fact extraction process or module
300 such as those depicted in Figs. 1 and 2. The extraction module may be configured to
receive an input text 302, clean it (e.g., to remove tags and headers) 304 and split it into
sentences 306. In one example, full case documents may be retrieved from Westlaw (a legal
research service). In this example, cleaning and preprocessing may include isolating the body
of a case from the document. Each sentence may then be sent to a triple extraction process or
module 308, which may be configured to analyze the structure of the sentence (e.g., attach
part-of-speech tags) and produce generic triples in the format subject-predicate-object based
on the structure of the sentence. The extracted sentences and triples (“facts”) may then be
stored in a database 310 for later analysis. The database may retain a record of the
provenance or source (e.g., a source document or a location within a source document) of
each sentence and triple for later analysis.
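The cleaning, sentence splitting and provenance-preserving storage described above may be sketched as follows. This is a minimal sketch and not the described module: the regular-expression cleaner and splitter are deliberately naive, `toy_extract_triples` is a placeholder standing in for a real structure-based triple extractor, and the table layout and all names are assumptions made for the example.

```python
import re
import sqlite3

def clean(text):
    """Strip markup tags and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", text)).strip()

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def toy_extract_triples(sentence):
    """Placeholder extractor: treats a three-word sentence as subject-predicate-object."""
    words = sentence.rstrip(".!?").split()
    return [tuple(words)] if len(words) == 3 else []

def ingest(db, doc_id, raw_text, extract=toy_extract_triples):
    """Store every extracted triple with its source document and sentence position."""
    for pos, sentence in enumerate(split_sentences(clean(raw_text))):
        for subj, pred, obj in extract(sentence):
            db.execute(
                "INSERT INTO facts VALUES (?, ?, ?, ?, ?, ?)",
                (doc_id, pos, sentence, subj, pred, obj),
            )

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE facts (doc_id, sentence_no, sentence, subject, predicate, object)"
)
ingest(db, "case-1", "<p>Jet overran runway. It was raining heavily that night.</p>")
rows = db.execute(
    "SELECT doc_id, sentence_no, subject, predicate, object FROM facts"
).fetchall()
```

Keeping the document identifier and sentence position alongside each triple is what later allows a matched triple to be traced back to, and highlighted in, its source document.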
Fig. 4 describes in more detail the process of expanding facts semantically. This
segment of the process is intended to ensure that the semantics of the facts are captured
regardless of how they are expressed in the text. The semantic expansion module 400
expands the extracted facts.
[0055] The semantic expansion process 400 takes the extracted sentences and triples
as input 402 and tokenizes 404 the text of their components (e.g., of the subject, predicate, or
object) into multiple-word tokens whenever valid. The multi-word tokenization 404
determines the permissible combination of words to preserve the original meaning because
the meaning of each separate word might be different from the meaning of the multi-word
combination. This is done by looking up candidate word combinations in a specific
semantic corpus, ontology, dictionary or thesaurus 406. An example of such an
external semantic corpus 406 may be built by analyzing large text collections or other
(domain-specific) ontologies that are manually curated to control the expansion of tokens.
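One simple way to realize such a lookup, offered only as an illustrative sketch, is a greedy longest-match scan over candidate word combinations. The small `lexicon` set below stands in for the semantic corpus, ontology, dictionary or thesaurus; the underscore-joined token format, the `max_len` limit and the function name `multiword_tokenize` are assumptions made for the example.

```python
def multiword_tokenize(words, lexicon, max_len=4):
    """Greedily merge the longest run of words that appears in the lexicon.

    Single words are always kept, so every input word ends up in exactly one token.
    """
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate first, shrinking down to a single word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = "_".join(words[i:i + n])
            if n == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += n
                break
    return tokens

# Hypothetical lexicon of permissible multi-word combinations.
lexicon = {"machine_gun", "pearson_international_airport"}
tokens = multiword_tokenize("fire at pearson international airport".split(), lexicon)
```

In this sketch the three words naming the airport are kept together as one token, while "fire" and "at" remain separate because no combination containing them appears in the lexicon.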
Each component of the extracted triples and sentences (subjects, predicates,
objects, and multi-word tokens) is then expanded 408 using the same or a different domain-specific
corpus 410 to produce synonyms, hypernyms and other similar words (expanded
tokens) 412. These expanded facts and sentences may then be indexed to allow search and
analytics on this expanded data.
In an online or real-time phase, shown generally in Fig. 6, the present system and
method can be used to find documents that are similar to a particular reference document or
snippet of text. Given the input reference text 602, the fact extraction 300 produces a set of
triples present in the reference text as described in Fig. 3 which are fed to the semantic
expansion process 400 to find related terms, just as with the ingested documents as described
above with reference to Fig. 4. The expanded facts 412 are then used to search 604 in the pre-
built index 606, and the results of the search may then be aggregated to filter, rank and score
608 the retrieved documents and then the results 610 are returned accordingly.
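The index build and lookup may be illustrated with a small in-memory inverted index. This is a sketch under stated assumptions rather than the described index 606: triples are represented as dictionaries mapping each field to its list of expanded tokens, ranking is a plain count of matched fields, and the names `build_index` and `search` and the sample data are hypothetical.

```python
from collections import defaultdict

def build_index(library):
    """Map each (field, token) pair to the set of documents whose triples contain it."""
    index = defaultdict(set)
    for doc_id, triples in library.items():
        for triple in triples:
            for field, tokens in triple.items():
                for token in tokens:
                    index[(field, token)].add(doc_id)
    return index

def search(index, reference_triple):
    """Rank documents by how many reference fields have at least one matching token."""
    hits = defaultdict(int)
    for field, tokens in reference_triple.items():
        matched = set()
        for token in tokens:
            matched |= index.get((field, token), set())
        for doc_id in matched:
            hits[doc_id] += 1
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical library of already expanded triples, keyed by document.
library = {
    "case-A": [{"subject": ["flight", "aircraft"],
                "predicate": ["overshoot", "overrun"],
                "object": ["runway"]}],
    "case-B": [{"subject": ["tenant"], "predicate": ["withhold"], "object": ["rent"]}],
}
index = build_index(library)
reference = {"subject": ["jet", "aircraft"], "predicate": ["overrun"], "object": ["runway"]}
results = search(index, reference)
```

Because the expansions are indexed ahead of time, a reference that says "jet ... overrun" can still reach a document that says "flight ... overshoot" without comparing the reference against every document in the collection.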
Figs. 7-10 illustrate an exemplary application of the present system and method.
In Fig. 7, a user may select a phrase of interest 702 from a reference document (“Air France
jet that overran the runway and caught fire at Pearson International Airport”) and be
presented with a list of result documents 704 that are similar to the selected text, the
similarity being determined by a comparison of the extracted and expanded facts from the
reference text and the potentially relevant, previously ingested documents. The search may
be integrated into a browser extension to allow for seamless integration with a user’s
research workflow without interrupting the current context. For example, a user may
highlight the text of interest in their browser window and click on a browser extension icon
706 to cause similar result documents to be displayed in an extension window 708 ranked
by their relevance.
In the example shown in Fig. 7, the selected text 702 may be processed to extract
the following triples:
Subject           Predicate        Object
Air France jet    overrun          runway
Air France jet    catch            fire
Air France jet    catch fire at    Pearson International Airport
In an exemplary expansion process, the tokens in the extracted triples may be
normalized to their base forms using stemming and lemmatization techniques (e.g., “caught”
is changed to “catch”). The tokens of each component of the triples are then expanded
semantically using the same corpus that was used in the offline process. Taking the second
triple as an example, the triple object “fire” is expanded to [“ignite”, “flame”, “explosion”,
“gunfire”, “machine_gun”, … ] and the predicate “catch” is expanded to [“capture”, “find”,
“chase”, “bait”, “arrest”, “stop”, … ]. These terms are grouped according to their relation
to the original token.
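The normalization and expansion of the second triple may be sketched as follows. The two lookup tables are tiny hand-written stand-ins for a lemmatizer and for the semantic corpus, populated only with terms quoted in this description; the names `LEMMAS`, `EXPANSIONS`, `normalize` and `expand_triple` are assumptions made for the example, and the flat lists do not attempt to reproduce any grouping by relation.

```python
# Stand-in for stemming/lemmatization: surface form -> base form.
LEMMAS = {"caught": "catch", "overran": "overrun"}

# Stand-in for the semantic corpus: base form -> related terms.
EXPANSIONS = {
    "fire": ["flame", "explosion", "gunfire", "machine_gun"],
    "catch": ["capture", "find", "chase", "bait", "arrest", "stop"],
}

def normalize(token):
    """Lower-case a token and map it to its base form when one is known."""
    token = token.lower()
    return LEMMAS.get(token, token)

def expand_triple(triple):
    """Return, per component, the normalized token followed by its expansions."""
    fields = ("subject", "predicate", "object")
    expanded = {}
    for field, text in zip(fields, triple):
        base = normalize(text)
        expanded[field] = [base] + EXPANSIONS.get(base, [])
    return expanded

expanded = expand_triple(("Air France jet", "caught", "fire"))
```

Placing the normalized original token first in each list keeps it distinguishable from its expansions, which is convenient if exact matches are later weighted more heavily than expanded ones.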
[0061] Given the extracted triples and sentences and their expanded tokens, the next step
is the semantic similarity calculation. The expanded triples are used to query the pre-built
index to find other similar triples in the index. Different fields of a triple and its expansions
are used in multiple queries with different weights, which weights may be customizable by
the user or may be adaptively set based on current or prior use of the particular user or of a
group of (or all) users. The retrieved triples may be weighted according to which fields
matched and how similar they are. Again, the weighting may be customizable by the user or
may be adaptively set based on current or prior use of the particular user or of a group of (or
all) users. The results are then aggregated and may be ranked according to multiple factors
including their relevance scores and weights of the matched fields. This cumulative relevance
score may be used to rank the retrieved case documents.
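A non-limiting sketch of field-weighted matching between a reference triple and a retrieved triple is given below. The specific weight values, the rule that an exact match counts fully while a match through an expansion counts half, and the names `FIELD_WEIGHTS`, `EXACT`, `EXPANDED` and `triple_score` are all assumptions for illustration; in practice such weights may be user-set or adaptively set as described above.

```python
# Hypothetical per-field weights and match-type multipliers.
FIELD_WEIGHTS = {"subject": 1.0, "predicate": 2.0, "object": 1.5}
EXACT, EXPANDED = 1.0, 0.5

def triple_score(reference, candidate):
    """Score a candidate triple against a reference triple, field by field.

    Each reference field is a list whose first item is the original token and
    whose remaining items are its expansions.
    """
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        ref_base, *ref_expansions = reference[field]
        cand_tokens = set(candidate[field])
        if ref_base in cand_tokens:
            score += weight * EXACT          # original token matched
        elif cand_tokens & set(ref_expansions):
            score += weight * EXPANDED       # matched only through an expansion
    return score

reference = {"subject": ["jet", "aircraft"],
             "predicate": ["catch", "capture"],
             "object": ["fire", "flame"]}
candidate = {"subject": ["aircraft"], "predicate": ["catch"], "object": ["flame"]}
score = triple_score(reference, candidate)
```

With these assumed values the candidate earns 0.5 for the subject (expansion match), 2.0 for the predicate (exact match) and 0.75 for the object (expansion match).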
In one particular non-limiting example, the triples extracted and/or expanded from
the reference text (reference s) are compared to indexed triples that were previously
extracted and/or expanded from the cataloged library of potential result documents (result
triples) and a similarity score is ted between pairs of r triples. For example,
reference triple A may be determined to be 30% similar to result triple Y and 80% similar to
result triple Z. Next, all result documents containing result triple Y or Z are identified and a
similarity score for each result document is calculated based on the presence and/or
prevalence of result triples Y and/or Z in the result documents. If more than one reference
triple is ted and/or expanded from the reference text, result documents are again
identified and scored in a like fashion for each reference triple and document similarity
scores may be aggregated for all reference triples. The aggregated document similarity scores
may be used to rank and/or filter the result documents returned to the user.
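The scoring just described (reference triple A being 30% similar to result triple Y and 80% similar to result triple Z) can be sketched as follows. The function `score_documents`, the document identifiers, and the choice to combine presence and prevalence as similarity multiplied by occurrence count are assumptions for illustration, not the claimed scoring method.

```python
from collections import defaultdict


def score_documents(triple_similarities, triple_to_docs):
    """Aggregate triple-level similarity into a ranked list of documents.

    triple_similarities: {reference_triple: {result_triple: similarity in [0, 1]}}
    triple_to_docs: {result_triple: {doc_id: number of occurrences}}
    """
    doc_scores = defaultdict(float)
    for matches in triple_similarities.values():
        for result_triple, similarity in matches.items():
            for doc_id, count in triple_to_docs.get(result_triple, {}).items():
                # Assumption: prevalence is rewarded linearly.
                doc_scores[doc_id] += similarity * count
    return sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)


similarities = {"A": {"Y": 0.3, "Z": 0.8}}
occurrences = {"Y": {"doc1": 1, "doc2": 2}, "Z": {"doc2": 1, "doc3": 1}}
ranking = score_documents(similarities, occurrences)
```

With this toy data, the document containing both Y and Z ranks first, followed by the document containing only Z and then the document containing only Y. Additional reference triples simply add further entries to `similarities`, and their contributions accumulate per document.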
[0063] User-settable weights for the similarity scoring include but are not limited to the
semantic similarity between the different components of the facts that are extracted from a
reference and a library document, the sequence of the facts in a reference and a library
document and how much they agree on, the semantic similarity between sentences in a
reference and a library document, as well as other metadata that describe the reference or the
library document such as their topics and references to other documents and/or authorities.
As shown in Fig. 7, the retrieved result documents may be displayed by the
browser extension as a list ordered according to their relevance scores. The user can expand a
particular document listing 710 to show the reasoning for the inclusion of this document in
the results, i.e., explain what makes the document similar to the selected text by highlighting
the similar sentences 712 that contain related facts.
Fig. 8 shows a list of result documents. As discussed above, users are provided
with the functionality to expand a document item to explain why it is deemed similar to the
highlighted reference text. Matching sentences from both the selected reference text 702 and
the result document 712 may be highlighted in different colors.
For example, Fig. 8 shows that the highlighted sentence “Air France jet overran
the runway and caught fire at Pearson International Airport” 702 is similar to the two
sentences “On August 2, 2005 an Air France flight landed in a severe thunderstorm at
Toronto's Pearson International Airport.” and “It overshot the runway, pitched into a ravine,
and burst into flames.” 712. The first sentence is related to the fact that the case discusses an
Air France flight that landed in Toronto Pearson International Airport, while the second
sentence is related to the fact that the aircraft overran the runway and was consumed by fire.
The second sentence depicts how the semantic similarity aspect of the presented invention
captures the similarity between “caught fire” and “burst into flames”. The two phrases
describe a similar concept even though they are expressed in different ways.
In another exemplary input method, shown generally in Fig. 9, a researcher may
interactively and dynamically enter or remove reference text 902 while result documents are
concurrently identified and displayed in an adjacent result window 904. For example, as a
researcher enters the facts of a case (or a potential case to be litigated) line by line, the
system shows a list of similar cases that is updated as the researcher enters more details.
Each piece of added information (e.g., word, triplet, sentence, line) may be used to issue new
search queries to refine the search results and re-rank the retrieved cases to better match the
new input. Similarly, if reference text is removed, the search and ranking may be redone at
any interval during or after removal to refocus the results on the remainder of the reference
text.
Fig. 10 shows an updated view of Fig. 9 as the user adds another sentence. The
list of relevant documents 904 is automatically updated to match the new input, which is
reflected in the similarity between the input text and the relevant case. This usage allows
for an exploratory and interactive approach of finding relevant documents.
[0069] In another embodiment, the present system and method may be adapted for
particular use in a context involving a set of core documents and a set of subordinate
documents that relate to and cite the core documents. One such context is present in the legal
field, involving legislative documents such as laws, codes, etc. (core documents) that are
interpreted, applied, argued over, and cited by subordinate documents such as case decisions,
legal briefs, secondary sources, etc. (subordinate documents). By examining the facts in the
subordinate documents citing the core documents, a map may be built and exploited between
facts (derived from the subordinate documents) and particular portions of the core documents
(e.g., a particular statute).
For example, the present disclosure provides a new system and method for mining
facts from a collection of legal documents to find sets of semantically similar facts that are
most relevant to laws. Facts may be mined pivoted around citations to ent laws and
legislations that are cited in the same legal document in which the facts appear. The present
system and method may be configured to produce a dataset that maps each law to a list of
facts that are sorted according to their relevance to the law and their frequency of mentions in
the cases that cite it.
It is one objective of the present disclosure to use the generated dataset in guiding
the query expansion when searching for documents in a corpus of legal documents. The
dataset is used to restrict and guide the semantic expansion of fact terms to other terms that
are semantically similar to the original terms and are related to the same legislation, i.e., have
similar legal implications.
It is another objective of the present disclosure to utilize the generated dataset to
search for the laws that are most relevant to a specific case based on the facts that are
extracted from the case and querying the generated dataset.
The mining process may be configured to produce a dataset that contains laws and
a set of facts most relevant to the laws. This method is focused on the legal domain where
legal documents cite related laws, i.e., the fact mining operation is pivoted around the laws
that are cited across a collection of legal documents. The end goal is to use this dataset to
control and guide the semantic expansion of the facts that appear in a search query to other
terms that are both semantically similar and follow the same laws, and accordingly have the
same legal implications. This produces a legislation-aware semantic expansion as opposed to
the general purpose semantic expansion that relies on the linguistic semantics of a term.
Two exemplary applications are described where the generated dataset can be
utilized. However, these example applications do not encompass all possible applications of
this technology, but are used as a reference for describing the content of the generated dataset
and how it can power downstream applications.
There are two main types of sources of legal documents: primary sources and
secondary sources. Primary sources include statements of the law, such as court decisions,
statutes, and legislative bills. Secondary sources are materials that interpret a legislation or a
statute, explain or discuss legal issues, or analyze the laws. Examples of secondary sources
are law reviews, legal news, books about law, encyclopedias, and legal memoranda. They
include extensive citations to primary sources and give summaries and conclusions about
different legal issues.
Laws and statutes describe the legislation relating to a particular subject matter
and they are interpreted and applied by courts and judges as they rule in particular factual
scenarios. The text of a legislation itself states some rules that should be followed or should
not be broken. When a legal document (e.g., a case decision or a memorandum) cites a
statute, it is because there is a legal issue that is relevant to the rules of the cited statute. The
documents that cite a specific legislation usually contain facts that are related to that
legislation.
[0077] The present invention may be configured to extract and mine facts from the legal
documents that cite legislations in order to find facts that appear frequently in these
documents and use this as an identifier of a set of legislation-related facts that are relevant to
a particular legislation.
Fig. 11 is a schematic diagram illustrating an exemplary overview of the process
that generates a target dataset (i.e., Legislation-Related Facts 1102). At a high level, the
process is divided into extraction 1104 and fact mining 1106. The legal database 1108
contains a collection of legal documents of different types (e.g., legal memoranda,
encyclopedias, and cases) and is also used to store the citations between documents. The
facts database 1110 stores the facts that are extracted from the documents, and the facts are
also indexed in a facts index 1112. Figs. 16 and 17 explain how downstream applications
utilize this dataset.
The fact mining process may be configured to run in an offline phase to generate
the target dataset of legislations and relevant facts. Of course, as described with reference to
the embodiment of Fig. 1, such “offline” processes may be conducted at any time, including
during and after a user invokes the system to begin a search.
The extraction process runs on the ed legal documents that are stored in the
Legal DB 1108. The goal of the extraction process that is depicted in Fig. 12 is two-fold:
identifying citations of laws in the documents and extracting facts from the text of the
documents.
[0081] The citation extraction process 1202 identifies mentions of laws, statutes, and
legislations in general. For example, the system may be configured to employ one or more
Natural Language Processing tools that combine -defined rules with machine learning
techniques to detect mentions of laws (citations) in the text. Optionally, there is a human-based
post-processing phase that is done by experienced content s to verify the
correctness of the extracted content and generate high quality data. Facts may also be
extracted 1300 as described below with reference to Fig. 13 and the extraction results may be
populated in the database. The extracted facts may be semantically expanded 1400 as
described below with reference to Fig. 14 and the semantically expanded facts may be
indexed 1204 in an inverted index 1112 to enable efficient search.
Fig. 13 describes an exemplary flow of a fact extraction process 1300. The text
body 1302 of a document is extracted, pre-processed, and cleaned 1304 (e.g., to remove tags
and headers) in preparation for extraction. The text is split into sentences 1306. Using a triple
extraction module 1308, facts in the form of triples are extracted from sentences, where each
sentence can produce multiple triples. The triples are in the format (subject, predicate,
object). These triples are stored in a database 1110 for further analysis and to maintain the
provenance of facts.
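The flow of Fig. 13 can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: the `Fact` record, the regular expressions used for cleaning and sentence splitting, and the `extract_triples` callable (a placeholder for the triple extraction module 1308, which is not specified here) are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    doc_id: str       # provenance: source document
    sentence_no: int  # provenance: sentence within the document


def clean(text):
    """Remove markup tags (a stand-in for the cleaning step 1304)."""
    return re.sub(r"<[^>]+>", " ", text)


def split_sentences(text):
    """Naive sentence splitter on terminal punctuation (step 1306)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_facts(doc_id, raw_text, extract_triples):
    """Run clean -> split -> triple extraction, keeping provenance."""
    facts = []
    for number, sentence in enumerate(split_sentences(clean(raw_text))):
        for subject, predicate, obj in extract_triples(sentence):
            facts.append(Fact(subject, predicate, obj, doc_id, number))
    return facts


def toy_extractor(sentence):
    # Placeholder only: a real module may emit several triples per sentence.
    return [("truck", "strike", "deer")] if "deer" in sentence else []


raw = "<p>The truck admittedly struck a deer.</p> <p>There was no traffic.</p>"
facts = extract_facts("doc-1", raw, toy_extractor)
```

Storing the document identifier and sentence number with each triple is one simple way to maintain the provenance of facts for later analysis.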
To further explain the output of the fact extraction process, consider the following
snippets of text that are retrieved from le legal documents including court decisions
and legal memoranda. Shown below is a sample output of the fact extraction; the extracted
triples are referred to later to explain the mining process. Each table contains the processed
snippet of text and the triples (subject, predicate, object) that were extracted from it. The left
column includes an ID of the snippet and IDs of the extracted triples to refer to them later.
S1    “The plaintiff was a passenger on the motorcycle driven by her husband,
      the defendant, when the motorcycle collided with a deer.”
t1    plaintiff be passenger
t2    plaintiff be a passenger on motorcycle
t3    motorcycle drive by she husband
t4    motorcycle collide with deer
S2    “There was no traffic in the area when the vehicle hit the moose”
t5    there be no traffic
t6    there be no traffic in the area
t7    vehicle hit moose
S3    “The truck admittedly struck a deer”
t8    truck strike deer
S4    “The left front corner of the truck struck the deer,
      propelling it towards the west shoulder.”
t9    truck have left front corner
t10   truck strike deer
t11   left front corner of the truck strike deer
The tokens in the extracted triples may be normalized to their base forms using
stemming and lemmatization techniques (e.g., “struck” is normalized to “strike”).
The semantic expansion module expands the extracted triples. Fig. 14 describes in
more detail an exemplary process 1400 of expanding facts 1402 semantically. The multi-word
tokenization 1404 determines the correct combination of words to preserve their
meaning because the meaning of each separate word might be different from the meaning of
the multi-word combination. This may be done by looking up candidate multi-word
combinations in a domain-specific semantic corpus, ontology, dictionary or thesaurus 1406.
Such an external semantic corpus may be built by analyzing large text collections or other
(domain-specific) ontologies that are manually curated to control the expansion of tokens.
Each component of the extracted triples and sentences (subjects, predicates, objects, and
multi-word tokens) may then be expanded 1408 using the same or different domain-specific
corpus 1410 to produce synonyms, hypernyms and other similar words (expanded tokens)
1412. These expanded facts and sentences are then indexed to allow search and analytics on
this data.
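The two steps of Fig. 14, multi-word tokenization followed by token expansion, can be sketched as below. The greedy longest-match strategy, the function names, and the tiny phrase list and thesaurus are illustrative assumptions; an actual embodiment would consult a domain-specific semantic corpus or ontology.

```python
def tokenize_multiword(words, phrases, max_len=3):
    """Group adjacent words into known multi-word tokens (longest match first)."""
    tokens, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            # A single word is always accepted; longer spans must be known phrases.
            if n == 1 or candidate in phrases:
                tokens.append(candidate)
                i += n
                break
    return tokens


def expand(tokens, thesaurus):
    """Map each token to itself plus any related terms found in the thesaurus."""
    return {token: [token] + thesaurus.get(token, []) for token in tokens}


# Hypothetical lookup resources standing in for corpora 1406 and 1410.
phrases = {"motor vehicle"}
thesaurus = {"motor vehicle": ["car", "truck"], "moose": ["deer"]}

tokens = tokenize_multiword(["motor", "vehicle", "hit", "moose"], phrases)
expanded = expand(tokens, thesaurus)
```

Here "motor vehicle" is kept as one token, so it is expanded as a unit rather than as the unrelated words "motor" and "vehicle".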
After preprocessing all documents to identify citations of legislation or other
primary sources, extract fact triples, and index facts, the mining process may be applied to
the extracted and indexed data. The fact mining module may be configured to implement
frequent itemset mining algorithms, for example where a database transaction that contains
items corresponds to a legal document that contains facts and the items correspond to
extracted facts. However, the goal is to group semantically similar facts together as a single
item called a fact group. Therefore, one may choose not to rely on mere equality between
facts. Instead of calculating the frequency of equal (identical) facts, one may calculate the
support of a fact group. This requires constructing fact groups that contain semantically
similar facts.
In order to mine facts that are related to a particular legislation, simple scoping
1502 and filtering 1504 processes may be applied first to identify facts that were extracted
from the legal documents that cite the particular legislation. This limits the set of facts to
those relevant to a user’s current line of inquiry. In the example discussed herein and with
respect to the s, it is assumed that all extracted facts are relevant for the mining
process.
[0088] The process of fact mining (shown generally in Fig. 15) may include grouping
facts 1506 into groups that contain semantically similar facts. Comparing facts to one another
may not scale. Therefore, a facts index 1508 may be used to find facts that are most similar
1510 to a particular fact. As a part of the fact grouping process 1506, the input facts to be
grouped may be scanned. For each fact, a check may be conducted to determine if there is a
fact group that is already constructed and contains that fact. If no matching groups are found,
a search may be conducted of the facts index to find the most semantically similar facts
based on the terms in the original fact and the semantically expanded and indexed terms in
the facts index 1508. A fact group may then be constructed from the returned results for all
the facts that have a relevance score that is above a user-defined threshold. It is possible that
this grouping mechanism may produce redundant groups, in which case redundant groups
that have substantially common facts may be merged.
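The grouping procedure of this paragraph can be sketched as follows. The callable `find_similar` stands in for a query against the facts index 1508, and the threshold values, the overlap measure used to decide that two groups are redundant, and the single-pass merge are assumptions for illustration.

```python
def merge_redundant(groups, min_overlap):
    """Merge groups that share most of their facts (single pass, first fit)."""
    merged = []
    for group in groups:
        for existing in merged:
            overlap = len(group & existing) / min(len(group), len(existing))
            if overlap >= min_overlap:
                existing |= group  # fold the redundant group into the existing one
                break
        else:
            merged.append(set(group))
    return merged


def build_fact_groups(facts, find_similar, threshold=0.5, min_overlap=0.8):
    """Scan facts once; query the index only for facts not yet in any group."""
    groups = []
    for fact in facts:
        if any(fact in group for group in groups):
            continue
        similar = {f for f, score in find_similar(fact) if score >= threshold}
        groups.append({fact} | similar)
    return merge_redundant(groups, min_overlap)


facts = [
    "plaintiff be passenger",
    "plaintiff be a passenger on motorcycle",
    "motorcycle collide with deer",
    "vehicle hit moose",
    "truck strike deer",
]
# Illustrative relevance scores; in practice these would come from the index.
toy_index = {
    facts[0]: [(facts[1], 0.75)],
    facts[2]: [(facts[3], 1.0), (facts[4], 1.0), (facts[1], 0.2)],
}
groups = build_fact_groups(facts, lambda fact: toy_index.get(fact, []))
```

Because facts already assigned to a group are skipped, the index is queried only twice for these five facts instead of comparing every pair.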
Continuing on the present example, each extracted fact from t1 to t11 may be
examined to search for the most relevant facts, constructing a fact group from the retrieved
results, unless the fact is already used in one of the pre-constructed fact groups. For example,
using t1 and t4 as queries, the following two fact groups FG1 and FG2 may be constructed:
FG1   plaintiff be passenger
      plaintiff be a passenger on motorcycle
FG2   motorcycle collide with deer
      vehicle hit moose
      truck strike deer
The next step is computing the support 1512 for each fact group. The original
facts may be scanned again, and the support (frequency of mentions) of all the fact groups
that the current fact belongs to may be incremented. In the given example, the support
for FG1 is 2 since it will be matched by {t1, t2}, and the support for FG2 is 5 since it will be
matched by {t4, t7, t8, t10, t11}. Therefore, FG2 has the highest frequency among the
constructed fact groups.
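The support values in this example can be reproduced with a small sketch. The stop-word list, the concept map (a stand-in for semantic expansion), the Jaccard-based similarity, and the 0.5 threshold are all assumptions chosen for illustration; they are not the matching logic of the disclosed system.

```python
STOPWORDS = {"a", "the", "of", "on", "in", "with", "by"}

# Hypothetical concept map standing in for the semantic expansion of tokens.
CONCEPTS = {
    "motorcycle": "VEHICLE", "vehicle": "VEHICLE", "truck": "VEHICLE",
    "collide": "IMPACT", "hit": "IMPACT", "strike": "IMPACT",
    "deer": "WILDLIFE", "moose": "WILDLIFE",
}


def concepts(fact):
    """Reduce a fact to a set of concepts, ignoring stop words."""
    return {CONCEPTS.get(word, word) for word in fact.split() if word not in STOPWORDS}


def similarity(a, b):
    """Jaccard overlap of the two concept sets."""
    ca, cb = concepts(a), concepts(b)
    return len(ca & cb) / len(ca | cb)


def compute_support(facts, groups, threshold=0.5):
    """Count, per fact group, how many extracted facts match any group member."""
    support = {name: 0 for name in groups}
    for fact in facts.values():
        for name, members in groups.items():
            if any(similarity(fact, member) >= threshold for member in members):
                support[name] += 1
    return support


FACTS = {
    "t1": "plaintiff be passenger",
    "t2": "plaintiff be a passenger on motorcycle",
    "t3": "motorcycle drive by she husband",
    "t4": "motorcycle collide with deer",
    "t5": "there be no traffic",
    "t6": "there be no traffic in the area",
    "t7": "vehicle hit moose",
    "t8": "truck strike deer",
    "t9": "truck have left front corner",
    "t10": "truck strike deer",
    "t11": "left front corner of the truck strike deer",
}
GROUPS = {
    "FG1": [FACTS["t1"], FACTS["t2"]],
    "FG2": [FACTS["t4"], FACTS["t7"], FACTS["t8"]],
}
support = compute_support(FACTS, GROUPS)
```

Under these assumptions FG1 is matched by t1 and t2, and FG2 by t4, t7, t8, t10, and t11, giving supports of 2 and 5 as stated above. Note that t8 and t10 are identical facts from different snippets and are counted separately, since support measures frequency of mentions.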
The generated dataset (legislation-related facts) 1514 can be used to support
multiple applications. One target application is performing a legislation-aware semantic
expansion. A user might run a search query that contains facts, and the goal is to find cases
that have similar facts. A part of the process is to semantically expand the facts in order to
match more relevant cases. However, when expanding facts, the expansion must be aware of
the legislation. Instead of using general-purpose ontologies to find semantically similar
terms, the legislation-related facts dataset may be used.
[0092] An exemplary process of semantically expanding fact terms is described generally
in Fig. 16. It starts by extracting 1602 facts 1604 from the search query (input text) 1606,
which are used as queries 1608 to a dataset of legislation-related facts 1610. The goal is to
retrieve fact groups 1612 to which the search query facts (input facts) 1604 belong. Then, the
facts comparison and expansion module 1614 may be configured to compare the input facts
1604 with the matched fact groups 1612 in order to produce other facts 1616 that are
semantically similar. The facts comparison and expansion module 1614 compares the
components of the input fact (subject, predicate, object) 1604 against the components of each
fact in the matched fact groups 1612. After finding most similar facts (or identical facts if
available), the module 1614 finds other facts from the same fact groups and expands each
component separately, producing other similar facts 1616.
As an example, assume that the search query is “Plaintiff’s car struck a moose on
the highway”. One triple that is extracted from this query is (Plaintiff’s car, strike, moose).
When matched against the fact groups in a legislation-related fact dataset, FG2 is retrieved as
the most relevant Fact Group. The Facts Comparison and Expansion module compares the
query triple to other triples within FG2, and expands “car” to [“car”, “vehicle”, “truck”,
“motorcycle”] and expands “moose” to [“moose”, “deer”]. These form the terms in the new
search queries that will be used instead of the terms in the original search query. This
restriction of expanded terms based on the legislation-related fact dataset has significant
legal implications since “moose” and “deer” are considered wildlife and do not have owners,
as opposed to “cow” or “horse” which have other legal implications. A general-purpose
semantic expansion tool cannot make this distinction.
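The component-wise expansion in this example can be sketched as follows. The function `expand_with_group` is hypothetical, the fact group is represented as structured triples, and the subject is assumed to have already been reduced to its head noun ("car"); the order of the expanded terms is incidental.

```python
def expand_with_group(query_triple, fact_group):
    """Expand each component of a query triple with the terms that appear
    in the same position within the matched fact group."""
    expanded = []
    for position, term in enumerate(query_triple):
        terms = [term]
        for fact in fact_group:
            if fact[position] not in terms:
                terms.append(fact[position])
        expanded.append(terms)
    return expanded


# FG2 from the example above, as (subject, predicate, object) triples.
FG2 = [
    ("motorcycle", "collide with", "deer"),
    ("vehicle", "hit", "moose"),
    ("truck", "strike", "deer"),
]
subjects, predicates, objects = expand_with_group(("car", "strike", "moose"), FG2)
```

Because the candidate terms come only from facts mined around the same legislation, "moose" expands to "deer" but never to "cow" or "horse", which a general-purpose thesaurus might suggest.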
Another application that utilizes a legislation-related fact dataset is finding
relevant laws and statutes that apply to an input fact scenario. Fig. 17 depicts the high-level
flowchart of this process. Given an input text 1702, the fact extraction module 1704 extracts
facts from the text. The facts are used as queries 1706 to the legislation-related facts database
1708 in order to find the most relevant fact groups. The resulting fact groups from using each
fact as a query are aggregated in order to find laws that are holistically most relevant to the
set of extracted facts 1710. This application is useful for legal researchers who need to know
which laws are most relevant to a particular factual scenario and use these laws and statutes
to support their arguments.
Following up on the same example query discussed above, the extracted triple
(Plaintiff’s car, strike, moose) matches FG2, which has a high support among the cases that
discuss hitting a wildlife animal on the highway. These cases usually cite the Highway
Traffic Act, RSNL 1990, c H-3 that is related to driving under the speed limit.
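The aggregation step of Fig. 17 can be sketched as below. The function `rank_laws`, the relevance values, the support counts, and the second statute name are hypothetical and serve only to show how fact-group matches might be combined into a per-law score.

```python
from collections import Counter


def rank_laws(query_facts, match_groups, group_to_laws):
    """Aggregate fact-group matches over all query facts into law scores.

    match_groups(fact) -> {group_id: relevance of the group to the fact}
    group_to_laws: {group_id: {law: support of the group among cases citing the law}}
    """
    scores = Counter()
    for fact in query_facts:
        for group_id, relevance in match_groups(fact).items():
            for law, support in group_to_laws.get(group_id, {}).items():
                # Assumption: relevance and support combine multiplicatively.
                scores[law] += relevance * support
    return scores.most_common()


# Toy dataset; only the Highway Traffic Act citation comes from the example above.
group_to_laws = {
    "FG2": {"Highway Traffic Act, RSNL 1990, c H-3": 5},
    "FG1": {"Hypothetical Passenger Liability Act": 2},
}
matches = {("car", "strike", "moose"): {"FG2": 0.9, "FG1": 0.1}}

ranked = rank_laws(list(matches), lambda fact: matches[fact], group_to_laws)
```

Summing over every extracted fact is what makes the result holistic: a law supported by several of the query's facts outranks one supported by a single strong match.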
An Exemplary System
i. Factual Similarity System Controller
Fig. 5 shows a block diagram illustrating embodiments of a Factual Similarity
System controller. In this embodiment, the Factual Similarity System controller 501 may
serve to aggregate, process, store, search, serve, identify, instruct, generate, match, and/or
facilitate interactions with a computer, and/or other related data.
Typically, users, which may be people and/or other systems, may engage
information technology systems (e.g., computers) to facilitate information processing. In
turn, computers employ processors to process information; such processors 503 may be
referred to as central processing units (CPU). One form of processor is referred to as a
microprocessor. CPUs use communicative circuits to pass binary encoded signals acting as
instructions to enable various operations. These instructions may be operational and/or data
instructions containing and/or referencing other instructions and data in various processor
accessible and operable areas of memory 529 (e.g., registers, cache memory, random access
memory, etc.). Such communicative instructions may be stored and/or transmitted in batches
(e.g., batches of instructions) as programs and/or data components to facilitate desired
operations. These stored instruction codes, e.g., programs, may engage the CPU circuit
components and other motherboard and/or system components to perform desired operations.
One type of program is a computer operating system, which may be executed by the CPU on a
computer; the operating system enables and facilitates users to access and operate computer
information technology and resources. Some resources that may be employed in information
technology systems include: input and output mechanisms through which data may pass into
and out of a computer; memory storage into which data may be saved; and processors by
which information may be processed. These information technology systems may be used to
collect data for later retrieval, analysis, and manipulation, which may be facilitated through a
database program. These information technology systems provide interfaces that allow users
to access and operate various system components.
In one embodiment, the Factual Similarity System controller 501 may be
connected to and/or communicate with entities such as, but not limited to: one or more users
from user input devices 511; peripheral devices 512; an optional cryptographic processor
device 528; and/or a communications network 513.
Networks are commonly thought to comprise the interconnection and
interoperation of clients, servers, and intermediary nodes in a graph topology. It should be
noted that the term “server” as used throughout this application refers generally to a
computer, other device, program, or combination thereof that processes and responds to the
requests of remote users across a communications network. Servers serve their information to
requesting “clients.” The term “client” as used herein refers generally to a computer,
program, other device, user and/or combination thereof that is capable of processing and
making requests and obtaining and processing any responses from servers across a
communications network. A computer, other device, program, or combination thereof that
facilitates, processes information and requests, and/or furthers the passage of information
from a source user to a destination user is commonly referred to as a “node.” Networks are
generally thought to facilitate the transfer of information from source points to destinations.
A node specifically tasked with furthering the passage of information from a source to a
destination is commonly called a “router.” There are many forms of networks such as Local
Area Networks (LANs), Pico networks, Wide Area Networks (WANs), Wireless Networks
(WLANs), etc. For example, the Internet is generally accepted as being an interconnection of
a multitude of networks whereby remote clients and servers may access and interoperate with
one another.
[0100] The Factual Similarity System controller 501 may be based on computer systems
that may comprise, but are not limited to, components such as: a computer systemization 502
connected to memory 529.
ii. Computer Systemization
A computer systemization 502 may comprise a clock 530, central processing unit
(“CPU(s)” and/or “processor(s)” (these terms are used interchangeably throughout the
disclosure unless noted to the contrary)) 503, a memory 529 (e.g., a read only memory
(ROM) 506, a random access memory (RAM) 505, etc.), and/or an interface bus 507, and
most frequently, although not necessarily, are all interconnected and/or communicating
through a system bus 504 on one or more (mother)board(s) 502 having conductive and/or
otherwise transportive circuit pathways through which instructions (e.g., binary encoded
signals) may travel to effectuate communications, operations, storage, etc. The computer
systemization may be connected to a power source 586; e.g., optionally the power source
may be internal. Optionally, a cryptographic processor 526 and/or transceivers (e.g., ICs) 574
may be connected to the system bus. In another embodiment, the cryptographic processor
and/or transceivers may be connected as either internal and/or external peripheral devices
512 via the interface bus I/O. In turn, the transceivers may be connected to antenna(s) 575,
thereby effectuating wireless transmission and reception of various communication and/or
sensor protocols; for example the antenna(s) may connect to: a Texas Instruments WiLink
WL1283 transceiver chip (e.g., providing 802.11n, Bluetooth 3.0, FM, global positioning
system (GPS) (thereby allowing Factual Similarity System controller to determine its
location)); Broadcom BCM4329FKUBG transceiver chip (e.g., providing 802.11n, Bluetooth
2.1 + EDR, FM, etc.); a Broadcom BCM4750IUB8 receiver chip (e.g., GPS); an Infineon
Technologies X-Gold 618-PMB9800 (e.g., providing 2G/3G HSDPA/HSUPA
communications); and/or the like. The system clock typically has a crystal oscillator and
generates a base signal through the computer systemization’s circuit pathways. The clock is
typically coupled to the system bus and various clock multipliers that will increase or
decrease the base operating frequency for other components interconnected in the computer
systemization. The clock and various components in a computer systemization drive signals
embodying information throughout the system. Such transmission and reception of
instructions embodying information throughout a computer systemization may be commonly
referred to as communications. These communicative instructions may further be
transmitted, received, and the cause of return and/or reply communications beyond the
instant computer systemization to: communications networks, input devices, other computer
systemizations, peripheral devices, and/or the like. It should be understood that in alternative
embodiments, any of the above components may be connected directly to one another,
connected to the CPU, and/or organized in numerous variations employed as exemplified by
various computer systems.
The CPU comprises at least one high-speed data processor adequate to execute
program components for executing user and/or system-generated requests. Often, the
processors themselves will incorporate various specialized processing units, such as, but not
limited to: integrated system (bus) controllers, memory management control units, floating
point units, and even specialized processing sub-units like graphics processing units, digital
signal processing units, and/or the like. Additionally, processors may include internal fast
access addressable memory, and be capable of mapping and addressing memory 529 beyond
the processor itself; internal memory may include, but is not limited to: fast registers, various
levels of cache memory (e.g., level 1, 2, 3, etc.), RAM, etc. The processor may access this
memory through the use of a memory address space that is accessible via instruction address,
which the processor can construct and decode allowing it to access a circuit path to a specific
memory address space having a memory state. The CPU may be a microprocessor such as:
AMD’s Athlon, Duron and/or Opteron; ARM’s application, embedded and secure
processors; IBM and/or Motorola’s DragonBall and PowerPC; IBM’s and Sony’s Cell
processor; Intel’s Celeron, Core (2) Duo, Itanium, Pentium, Xeon, and/or XScale; and/or the
like processor(s). The CPU interacts with memory through instruction passing through
conductive and/or transportive conduits (e.g., (printed) electronic and/or optic circuits) to
execute stored instructions (i.e., program code) according to conventional data processing
techniques. Such instruction passing facilitates communication within the Factual Similarity
System controller and beyond through various interfaces. Should processing requirements
dictate a greater amount of speed and/or capacity, distributed processors (e.g., Distributed
Factual Similarity System), mainframe, multi-core, parallel, and/or super-computer
architectures may similarly be employed. Alternatively, should deployment requirements
dictate greater portability, smaller Personal Digital Assistants (PDAs) may be employed.
Depending on the particular implementation, features of the Factual Similarity
System may be achieved by implementing a microcontroller such as CAST’s R8051XC2
microcontroller; Intel’s MCS 51 (i.e., 8051 microcontroller); and/or the like. Also, to
implement certain features of the Factual Similarity System, some feature implementations
may rely on embedded components, such as: Application-Specific Integrated Circuit
("ASIC"), Digital Signal Processing ("DSP"), Field Programmable Gate Array ("FPGA"),
and/or the like embedded technology. For example, any of the Factual Similarity System
component collection (distributed or otherwise) and/or features may be implemented via the
microprocessor and/or via embedded components; e.g., via ASIC, coprocessor, DSP, FPGA,
and/or the like. Alternately, some implementations of the Factual Similarity System may be
implemented with embedded components that are configured and used to achieve a variety of
features or signal processing.
Depending on the particular implementation, the embedded components may
include software solutions, hardware solutions, and/or some combination of both
hardware/software solutions. For example, Factual Similarity System features discussed
herein may be achieved through implementing FPGAs, which are semiconductor devices
containing programmable logic components called "logic blocks", and programmable
interconnects, such as the high performance FPGA Virtex series and/or the low cost Spartan
series manufactured by Xilinx. Logic blocks and interconnects can be programmed by the
customer or designer, after the FPGA is manufactured, to implement any of the Factual
Similarity System features. A hierarchy of programmable interconnects allows logic blocks to
be interconnected as needed by the Factual Similarity System designer/administrator,
somewhat like a one-chip programmable breadboard. An FPGA's logic blocks can be
programmed to perform the operation of basic logic gates such as AND and XOR, or more
complex combinational operators such as decoders or mathematical operations. In most
FPGAs, the logic blocks also include memory elements, which may be circuit flip-flops or
more complete blocks of memory. In some circumstances, the Factual Similarity System may
be developed on regular FPGAs and then migrated into a fixed version that more resembles
ASIC implementations. Alternate or coordinating implementations may migrate Factual
Similarity System controller features to a final ASIC instead of or in addition to FPGAs.
Depending on the implementation all of the aforementioned embedded components and
microprocessors may be considered the “CPU” and/or “processor” for the Factual Similarity
System.
iii. Power Source
The power source 586 may be of any standard form for powering small electronic
circuit board devices such as the following power cells: alkaline, lithium hydride, lithium
ion, lithium polymer, nickel cadmium, solar cells, and/or the like. Other types of AC or DC
power sources may be used as well. In the case of solar cells, in one embodiment, the case
provides an aperture through which the solar cell may capture photonic energy. The power
cell 586 is connected to at least one of the interconnected subsequent components of the
Factual Similarity System thereby providing an electric current to all subsequent
components. In one example, the power source 586 is connected to the system bus
component 504. In an alternative embodiment, an outside power source 586 is provided
through a connection across the I/O 508 interface. For example, a USB and/or IEEE 1394
connection carries both data and power across the connection and is therefore a suitable
source of power.
iv. Interface Adapters
Interface bus(ses) 507 may accept, connect, and/or communicate to a number of
interface adapters, conventionally although not necessarily in the form of adapter cards, such
as but not limited to: input output interfaces (I/O) 508, storage interfaces 509, network
interfaces 510, and/or the like. Optionally, cryptographic processor interfaces 527 similarly
may be connected to the interface bus. The interface bus provides for the communications of
interface adapters with one r as well as with other components of the computer
systemization. Interface adapters are adapted for a compatible interface bus. Interface
adapters conventionally connect to the interface bus via a slot architecture. tional slot
architectures may be employed, such as, but not d to: Accelerated Graphics Port
(AGP), Card Bus, (Extended) Industry Standard Architecture ((E)ISA), Micro Channel
Architecture (MCA), NuBus, Peripheral Component Interconnect ded) (PCI(X)), PCI
Express, Personal Computer Memory Card International Association (PCMCIA), and/or the
like.
Storage interfaces 509 may accept, communicate, and/or connect to a number of
storage devices such as, but not limited to: storage devices 514, removable disc devices,
and/or the like. Storage interfaces may employ connection protocols such as, but not limited
to: (Ultra) (Serial) Advanced Technology Attachment (Packet Interface) ((Ultra) (Serial)
ATA(PI)), (Enhanced) Integrated Drive Electronics ((E)IDE), Institute of Electrical and
Electronics Engineers (IEEE) 1394, fiber channel, Small Computer Systems Interface
(SCSI), Universal Serial Bus (USB), and/or the like.
Network interfaces 510 may accept, communicate, and/or connect to a
communications network 513. Through a communications network 513, the Factual
Similarity System controller is accessible through remote clients 533b (e.g., computers with
web browsers) by users 533a. Network interfaces may employ connection protocols such as,
but not limited to: direct connect, Ethernet (thick, thin, twisted pair 10/100/1000 Base T,
and/or the like), Token Ring, wireless connection such as IEEE 802.11a-x, and/or the like.
Should processing requirements dictate a greater amount of speed and/or capacity, distributed
network controllers (e.g., Distributed Factual Similarity System), architectures may similarly
be employed to pool, load balance, and/or otherwise increase the communicative bandwidth
required by the Factual Similarity System controller. A communications network may be any
one and/or the combination of the following: a direct interconnection; the Internet; a Local
Area Network (LAN); a Metropolitan Area Network (MAN); an Operating Missions as
Nodes on the Internet (OMNI); a secured custom connection; a Wide Area Network (WAN);
a wireless network (e.g., employing protocols such as, but not limited to a Wireless
Application Protocol (WAP), I-mode, and/or the like); and/or the like. A network interface
may be regarded as a specialized form of an input output interface. Further, multiple network
interfaces 510 may be used to engage with various communications network types 513. For
example, multiple network interfaces may be employed to allow for the communication over
broadcast, multicast, and/or unicast networks.
Input Output interfaces (I/O) 508 may accept, communicate, and/or connect to
user input devices 511, peripheral devices 512, cryptographic processor devices 528, and/or
the like. I/O may employ connection protocols such as, but not limited to: audio: analog,
digital, monaural, RCA, stereo, and/or the like; data: Apple Desktop Bus (ADB), IEEE
1394a-b, serial, universal serial bus (USB); infrared; joystick; keyboard; midi; optical; PC
AT; PS/2; parallel; radio; video interface: Apple Desktop Connector (ADC), BNC, coaxial,
component, composite, digital, Digital Visual Interface (DVI), high-definition multimedia
interface (HDMI), RCA, RF antennae, S-Video, VGA, and/or the like; wireless transceivers:
802.11a/b/g/n/x; Bluetooth; cellular (e.g., code division multiple access (CDMA), high speed
packet access (HSPA(+)), high-speed downlink packet access (HSDPA), global system for
mobile communications (GSM), long term evolution (LTE), WiMax, etc.); and/or the like.
One typical output device may include a video display, which typically comprises a Cathode
Ray Tube (CRT) or Liquid Crystal Display (LCD) based monitor with an interface (e.g., DVI
circuitry and cable) that accepts signals from a video interface. The video
interface composites information generated by a computer systemization and generates video
signals based on the composited information in a video memory frame. Another output
device is a television set, which accepts signals from a video interface. Typically, the video
interface provides the composited video information through a video connection interface
that accepts a video display interface (e.g., an RCA composite video connector accepting an
RCA composite video cable; a DVI connector accepting a DVI display cable, etc.).
User input devices 511 often are a type of peripheral device 512 (see below) and
may include: card readers, dongles, finger print readers, gloves, graphics tablets, joysticks,
keyboards, microphones, mouse (mice), remote controls, retina readers, touch screens (e.g.,
capacitive, resistive, etc.), trackballs, trackpads, sensors (e.g., accelerometers, ambient light,
GPS, gyroscopes, proximity, etc.), styluses, and/or the like.
Peripheral devices 512 may be connected and/or communicate to I/O and/or other
facilities of the like such as network interfaces, storage interfaces, directly to the interface
bus, system bus, the CPU, and/or the like. Peripheral devices may be external, internal and/or
part of the Factual Similarity System controller. Peripheral devices may include: antenna,
audio devices (e.g., line-in, line-out, microphone input, speakers, etc.), cameras (e.g., still,
video, webcam, etc.), dongles (e.g., for copy protection, ensuring secure transactions with a
digital signature, and/or the like), external processors (for added capabilities; e.g., crypto
devices 528), force-feedback devices (e.g., vibrating motors), network interfaces, printers,
scanners, storage devices, transceivers (e.g., cellular, GPS, etc.), video devices (e.g., goggles,
monitors, etc.), video sources, visors, and/or the like. Peripheral devices often include types
of input devices (e.g., cameras).
It should be noted that although user input devices and peripheral devices may be
employed, the Factual Similarity System controller may be embodied as an embedded,
dedicated, and/or monitor-less (i.e., headless) device, wherein access would be provided over
a network interface connection.
Cryptographic units such as, but not limited to, microcontrollers, processors 526,
interfaces 527, and/or devices 528 may be attached, and/or communicate with the Factual
Similarity System controller. A MC68HC16 microcontroller, manufactured by Motorola Inc.,
may be used for and/or within cryptographic units. The MC68HC16 microcontroller utilizes
a 16-bit multiply-and-accumulate instruction in the 16 MHz configuration and requires less
than one second to perform a 512-bit RSA private key operation. Cryptographic units support
the authentication of communications from interacting agents, as well as allowing for
anonymous transactions. Cryptographic units may also be configured as part of the CPU.
Equivalent microcontrollers and/or processors may also be used. Other commercially
available specialized cryptographic processors include: Broadcom’s CryptoNetX and other
Security Processors; nCipher’s nShield; SafeNet’s Luna PCI (e.g., 7100) series; Semaphore
Communications’ 40 MHz Roadrunner 184; Sun’s Cryptographic Accelerators (e.g.,
Accelerator 6000 PCIe Board, Accelerator 500 Daughtercard); Via Nano Processor (e.g.,
L2100, L2200, U2400) line, which is capable of performing 500+ MB/s of cryptographic
instructions; VLSI Technology’s 33 MHz 6868; and/or the like.
v. Memory
[0114] Generally, any mechanization and/or embodiment allowing a processor to affect
the storage and/or retrieval of information is regarded as memory 529. However, memory is
a fungible technology and resource, thus, any number of memory embodiments may be
employed in lieu of or in concert with one another. It is to be understood that the Factual
Similarity System controller and/or a computer systemization may employ various forms of
memory 529. For example, a computer systemization may be configured wherein the
operation of on-chip CPU memory (e.g., registers), RAM, ROM, and any other storage
devices are provided by a paper punch tape or paper punch card mechanism; however, such
an embodiment would result in an extremely slow rate of operation. In a typical
configuration, memory 529 will include ROM 506, RAM 505, and a storage device 514. A
storage device 514 may be any conventional computer system storage. Storage devices may
include a drum; a (fixed and/or removable) magnetic disk drive; a magneto-optical drive; an
optical drive (i.e., Blu-ray, CD ROM/RAM/Recordable (R)/ReWritable (RW), DVD R/RW,
HD DVD R/RW etc.); an array of devices (e.g., Redundant Array of Independent Disks
(RAID)); solid state memory devices (USB memory, solid state drives (SSD), etc.); other
processor-readable storage mediums; and/or other devices of the like. Thus, a computer
systemization generally requires and makes use of memory.
vi. Component Collection
The memory 529 may contain a collection of program and/or database
components and/or data such as, but not limited to: operating system component(s) 515
(operating system); information server component(s) 516 (information server); user interface
component(s) 517 (user interface); Web browser component(s) 518 (Web browser);
database(s) 519; mail server component(s) 521; mail client component(s) 522; cryptographic
server component(s) 520 (cryptographic server); the Factual Similarity System component(s)
535; the fact extraction component 541; the triplet expansion component 542, the web
service component 543; the browser extension component 544; the semantic similarity
calculation component 545; the ranking component 546; the index ing component 547
and/or the like (i.e., collectively a component collection). These components may be stored
and accessed from the storage devices and/or from storage devices accessible through an
interface bus. Although non-conventional program components such as those in the
component collection, typically, are stored in a local storage device 514, they may also be
loaded and/or stored in memory such as: peripheral devices, RAM, remote storage facilities
through a communications network, ROM, various forms of memory, and/or the like. Also,
while the components are described separately herein, it will be understood that they may be
combined and/or subdivided in any compatible manner.
vii. Operating System
The operating system component 515 is an executable program component
facilitating the execution of the Factual Similarity System controller. Typically, the operating
system facilitates access of I/O, network interfaces, peripheral devices, storage devices,
and/or the like. The operating system may be a highly fault tolerant, scalable, and secure
system such as: Apple Macintosh OS X (Server); AT&T Plan 9; Be OS; Unix and Unix-like
system distributions (such as AT&T’s UNIX; Berkeley Software Distribution (BSD)
variations such as FreeBSD, NetBSD, OpenBSD, and/or the like; Linux distributions such as
Red Hat, , and/or the like); and/or the like operating systems. However, more limited
and/or less secure operating systems also may be employed such as Apple Macintosh OS,
IBM OS/2, Microsoft DOS, Microsoft Windows
/8/7/2003/2000/98/95/3.1/CE/Millennium/NT/Vista/XP (Server), Palm OS, and/or the like.
An operating system may communicate to and/or with other components in a component
collection, including itself, and/or the like. Most frequently, the operating system
communicates with other program components, user interfaces, and/or the like. For example,
the operating system may contain, communicate, generate, obtain, and/or provide program
component, system, user, and/or data communications, requests, and/or responses. The
operating system, once executed by the CPU, may enable the interaction with
communications networks, data, I/O, peripheral devices, program components, memory, user
input devices, and/or the like. The operating system may provide communications protocols
that allow the Factual Similarity System controller to communicate with other entities
through a communications network 513. Various communication protocols may be used by
the Factual Similarity System controller as a subcarrier transport mechanism for interaction,
such as, but not limited to: multicast, TCP/IP, UDP, unicast, and/or the like.
viii. Information Server
An information server component 516 is a stored program component that is
executed by a CPU. The information server may be a conventional Internet information
server such as, but not limited to Apache Software Foundation’s Apache, Microsoft’s
Internet Information Server, and/or the like. The information server may allow for the
execution of program components through facilities such as Active Server Page (ASP),
ActiveX, (ANSI) (Objective-) C (++), C# and/or .NET, Common Gateway Interface (CGI)
scripts, dynamic (D) hypertext markup language (HTML), FLASH, Java, JavaScript,
Practical Extraction Report Language (PERL), Hypertext Pre-Processor (PHP), pipes,
Python, wireless application protocol (WAP), WebObjects, and/or the like. The information
server may support secure communications protocols such as, but not limited to, File
Transfer Protocol (FTP); HyperText Transfer Protocol (HTTP); Secure Hypertext Transfer
Protocol (HTTPS), Secure Socket Layer (SSL), messaging protocols (e.g., America Online
(AOL) Instant Messenger (AIM), Application Exchange (APEX), ICQ, Internet Relay Chat
(IRC), Microsoft Network (MSN) Messenger Service, Presence and Instant Messaging
Protocol (PRIM), Internet Engineering Task Force’s (IETF’s) Session Initiation Protocol
(SIP), SIP for Instant Messaging and Presence Leveraging Extensions (SIMPLE), open
XML-based Extensible Messaging and Presence Protocol (XMPP) (i.e., Jabber or Open
Mobile Alliance’s (OMA’s) Instant Messaging and Presence Service (IMPS)), Yahoo!
Instant Messenger Service, and/or the like. The information server provides results in the
form of Web pages to Web browsers, and allows for the manipulated generation of the Web
pages through interaction with other program components. After a Domain Name System
(DNS) resolution portion of an HTTP request is resolved to a particular information server,
the information server resolves requests for information at specified locations on the Factual
Similarity System controller based on the remainder of the HTTP request. For example, a
request such as http://123.124.125.126/myInformation.html might have the IP portion of the
request “123.124.125.126” resolved by a DNS server to an information server at that IP
address; that information server might in turn further parse the http request for the
“/myInformation.html” portion of the request and resolve it to a location in memory
containing the information “myInformation.html.” Additionally, other information serving
protocols may be employed across various ports, e.g., FTP communications across port 21,
and/or the like. An information server may communicate to and/or with other components in
a component collection, including itself, and/or facilities of the like. Most frequently, the
information server communicates with the Factual Similarity System databases 519,
operating systems, other program components, user interfaces, Web browsers, and/or the
like.
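The split between the address portion and the remainder of an HTTP request described above can be sketched as follows. This is a minimal illustration only, assuming Python and its standard urllib.parse module; the function name split_request is hypothetical and is not part of the Factual Similarity System. Because the example address is already a literal IP address, no DNS lookup is performed in the sketch.

```python
from urllib.parse import urlsplit


def split_request(url):
    """Split a URL into the address portion and the remainder (the path).

    Hypothetical helper for illustration: the address portion is what a
    DNS step would resolve to an information server, and the path is what
    that server would then resolve to a stored location.
    """
    parts = urlsplit(url)
    return parts.hostname, parts.path


host, path = split_request("http://123.124.125.126/myInformation.html")
print(host)  # 123.124.125.126
print(path)  # /myInformation.html
```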
Access to the Factual Similarity System database may be achieved through a
number of database bridge mechanisms such as through scripting languages as enumerated
below (e.g., CGI) and through inter-application communication channels as enumerated
below (e.g., CORBA, WebObjects, etc.). Any data requests through a Web browser are
parsed through the bridge mechanism into appropriate grammars as required by the Factual
Similarity System. In one embodiment, the information server would provide a Web form
accessible by a Web browser. Entries made into supplied fields in the Web form are tagged
as having been entered into the particular fields, and parsed as such. The entered terms are
then passed along with the field tags, which act to instruct the parser to generate queries
directed to appropriate tables and/or fields. In one embodiment, the parser may generate
queries in standard SQL by instantiating a search string with the proper join/select commands
based on the tagged text entries, wherein the resulting command is provided over the bridge
mechanism to the Factual Similarity System as a query. Upon generating query results from
the query, the results are passed over the bridge mechanism, and may be parsed for
formatting and generation of a new results Web page by the bridge mechanism. Such a new
results Web page is then provided to the information server, which may supply it to the
requesting Web browser.
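One way the tagged form entries described above might be turned into a SQL query is sketched below. This is an illustration under stated assumptions, not the parser of the Factual Similarity System: the table names (documents, authors), the field tags, the FIELD_MAP mapping, and the function build_query are all hypothetical, and Python's standard sqlite3 module stands in for the database behind the bridge mechanism.

```python
import sqlite3

# Hypothetical mapping from Web form field tags to (table, column) pairs.
FIELD_MAP = {
    "title": ("documents", "title"),
    "author": ("authors", "name"),
}


def build_query(tagged_entries):
    """Build a parameterized SELECT with a join from tagged form entries."""
    clauses, params = [], []
    for tag, value in tagged_entries.items():
        table, column = FIELD_MAP[tag]  # raises KeyError for unknown tags
        clauses.append(f"{table}.{column} = ?")
        params.append(value)  # values are bound, never spliced into the SQL
    sql = ("SELECT documents.title, authors.name FROM documents "
           "JOIN authors ON documents.author_id = authors.id")
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# Small in-memory database with made-up rows, used only to run the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    INSERT INTO authors VALUES (1, 'A. Author');
    INSERT INTO documents VALUES (1, 'Report', 1);
""")
sql, params = build_query({"author": "A. Author"})
rows = conn.execute(sql, params).fetchall()
print(rows)
```

Binding the entered terms as parameters, rather than concatenating them into the search string, is a design choice of this sketch; the source text does not specify how the search string is instantiated.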
Also, an information server may contain, communicate, generate, obtain, and/or
provide program component, system, user, and/or data communications, requests, and/or
responses.
ix. User Interface
Computer interfaces in some respects are similar to automobile operation
interfaces. Automobile operation interface elements such as steering wheels, gearshifts, and
speedometers facilitate the access, operation, and display of automobile resources, and status.
Computer interaction interface elements such as check boxes, cursors, menus, scrollers, and
windows (collectively and commonly referred to as widgets) similarly facilitate the access,
capabilities, operation, and display of data and computer hardware and operating system
resources, and status. Operation interfaces are commonly called user interfaces. Graphical
user interfaces (GUIs) such as the Apple Macintosh Operating System’s Aqua, IBM’s OS/2,
Microsoft’s Windows 2000/2003/3.1/95/98/CE/Millennium/NT/XP/Vista/7 (i.e., Aero),
Unix’s X-Windows (e.g., which may include additional Unix graphic interface libraries and
layers such as K Desktop Environment (KDE), mythTV and GNU Network Object Model
Environment (GNOME)), web interface libraries (e.g., ActiveX, AJAX, (D)HTML, FLASH,
Java, JavaScript, etc. interface libraries such as, but not limited to, Dojo, jQuery(UI),
MooTools, Prototype, script.aculo.us, SWFObject, Yahoo! User Interface, any of which may
be used and) provide a baseline and means of accessing and displaying information
graphically to users.
A user interface component 517 is a stored program component that is executed
by a CPU. The user interface may be a conventional graphic user interface as provided by,
with, and/or atop operating systems and/or operating environments such as already discussed.
The user interface may allow for the display, execution, interaction, manipulation, and/or
operation of program components and/or system facilities through textual and/or graphical
facilities. The user interface provides a facility through which users may affect, interact,
and/or operate a computer system. A user interface may communicate to and/or with other
components in a component collection, including itself, and/or facilities of the like. Most
frequently, the user interface communicates with operating systems, other program
components, and/or the like. The user interface may contain, communicate, generate, obtain,
and/or provide program component, system, user, and/or data communications, requests,
and/or responses.
x. Web Browser
A Web browser component 518 is a stored program component that is executed
by a CPU. The Web browser may be a conventional hypertext viewing application such as
Microsoft Internet Explorer or Netscape Navigator. Secure Web browsing may be supplied
with 128-bit (or greater) encryption by way of HTTPS, SSL, and/or the like. Web browsers
allow for the execution of program components through facilities such as ActiveX,
AJAX, (D)HTML, FLASH, Java, JavaScript, web browser plug-in APIs (e.g., Firefox, Safari
Plug-in, and/or the like APIs), and/or the like. Web browsers and like information access
tools may be integrated into PDAs, cellular telephones, and/or other mobile devices. A Web
browser may communicate to and/or with other components in a component collection,
including itself, and/or facilities of the like. Most frequently, the Web browser communicates
with information servers, operating systems, integrated program components (e.g., plug-ins),
and/or the like; e.g., it may contain, communicate, generate, obtain, and/or provide program
component, system, user, and/or data communications, requests, and/or responses. Also, in
place of a Web browser and information server, a combined application may be developed to
perform similar operations of both. The combined application would similarly affect the
obtaining and the provision of information to users, user agents, and/or the like from the
Factual Similarity System enabled nodes. The combined application may be nugatory on
systems employing standard Web browsers.
xi. Mail Server
[0123] A mail server component 521 is a stored program component that is executed by a
CPU 503. The mail server may be a conventional Internet mail server such as, but not limited
to sendmail, Microsoft Exchange, and/or the like. The mail server may allow for the
execution of program components through facilities such as ASP, ActiveX, (ANSI)
(Objective-) C (++), C# and/or .NET, CGI scripts, Java, JavaScript, PERL, PHP, pipes,
Python, WebObjects, and/or the like. The mail server may support communications protocols
such as, but not limited to: Internet message access protocol (IMAP), Messaging Application
Programming Interface (MAPI)/Microsoft Exchange, post office protocol (POP3), simple
mail transfer protocol (SMTP), and/or the like. The mail server can route, forward, and
process incoming and outgoing mail messages that have been sent, relayed and/or otherwise
traversing through and/or to the Factual Similarity System. Mail may also take the form of
messages sent from one Factual Similarity System user to another that is not in the form of
traditional email but is more akin to direct messaging or the like conventionally enabled by
social networks.
Access to the Factual Similarity System mail may be achieved through a number
of APIs offered by the individual Web server components and/or the operating system.
Also, a mail server may contain, communicate, generate, obtain, and/or provide
program component, system, user, and/or data communications, requests, information, and/or
responses.
xii. Mail Client
[0126] A mail client component 522 is a stored program component that is executed by a
CPU 503. The mail client may be a conventional mail viewing application such as Apple
Mail, Microsoft Entourage, Microsoft Outlook, Microsoft Outlook Express, Mozilla,
Thunderbird, and/or the like. Mail clients may support a number of transfer protocols, such
as: IMAP, Microsoft Exchange, POP3, SMTP, and/or the like. A mail client may
communicate to and/or with other components in a component collection, including itself,
and/or facilities of the like. Most frequently, the mail client communicates with mail servers,
operating systems, other mail clients, and/or the like; e.g., it may contain, communicate,
generate, obtain, and/or provide program component, system, user, and/or data
communications, requests, information, and/or responses. Generally, the mail client provides
a facility to compose and transmit electronic mail messages.
xiii. Cryptographic Server
A cryptographic server component 520 is a stored program component that is
executed by a CPU 503, cryptographic processor 526, cryptographic processor interface 527,
cryptographic processor device 528, and/or the like. Cryptographic processor interfaces will
allow for expedition of encryption and/or decryption requests by the cryptographic
component; however, the cryptographic component, alternatively, may run on a conventional
CPU. The cryptographic component allows for the encryption and/or decryption of provided
data. The cryptographic component allows for both symmetric and asymmetric (e.g., Pretty
Good Privacy (PGP)) encryption and/or decryption. The cryptographic component may
employ cryptographic techniques such as, but not limited to: digital certificates (e.g., X.509
authentication framework), digital signatures, dual signatures, enveloping, password access
protection, public key management, and/or the like. The cryptographic component will
facilitate numerous (encryption and/or decryption) security protocols such as, but not limited
to: checksum, Data Encryption Standard (DES), Elliptical Curve Encryption (ECC),
International Data Encryption Algorithm (IDEA), Message Digest 5 (MD5, which is a one
way hash operation), passwords, Rivest Cipher (RC5), Rijndael, RSA (which is an Internet
encryption and authentication system that uses an algorithm developed in 1977 by Ron
Rivest, Adi Shamir, and Leonard Adleman), Secure Hash Algorithm (SHA), Secure Socket
Layer (SSL), Secure Hypertext Transfer Protocol (HTTPS), and/or the like. Employing such
encryption security protocols, the Factual Similarity System may encrypt all incoming and/or
outgoing communications and may serve as a node within a virtual private network (VPN)
with a wider communications network. The cryptographic component facilitates the process
of “security authorization” whereby access to a resource is inhibited by a security protocol
wherein the cryptographic component effects authorized access to the secured resource. In
addition, the cryptographic component may provide unique identifiers of content, e.g.,
employing an MD5 hash to obtain a unique signature for a digital audio file. A
cryptographic component may communicate to and/or with other components in a
component collection, including itself, and/or facilities of the like. The cryptographic
component supports encryption schemes allowing for the secure transmission of information
across a communications network to enable the Factual Similarity System component to
engage in secure transactions if so desired. The cryptographic component facilitates the
secure accessing of resources on the Factual Similarity System and facilitates the access of
secured resources on remote systems; i.e., it may act as a client and/or server of secured
resources. Most frequently, the cryptographic component communicates with information
servers, operating systems, other program components, and/or the like. The cryptographic
component may contain, communicate, generate, obtain, and/or provide program component,
system, user, and/or data communications, requests, and/or responses.
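The use of an MD5 hash as a content identifier, mentioned above, can be sketched as follows. This is a minimal illustration assuming Python's standard hashlib module; the function name content_signature is hypothetical, and the byte string stands in for the contents of a digital audio file. Note that MD5 is not collision resistant, so such a signature is suited to identifying content rather than to security-critical uses.

```python
import hashlib


def content_signature(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Stand-in bytes for illustration; a real caller would read the file contents.
audio_bytes = b"example audio payload"
signature = content_signature(audio_bytes)
print(signature)
```

The same input always yields the same 32-character signature, which is what allows it to serve as an identifier of the content.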
xiv. The Factual Similarity System Databases
The Factual Similarity System databases component 519 may be embodied in one
database and its stored data, may be embodied in two or more distinct databases and their
stored data, or may be partially or wholly embodied in an unstructured manner. For the
purposes of simplicity of description, discussion of the Factual Similarity System databases
component 519 herein may refer to such component in the singular tense, however this is not
to be considered as limiting the Factual Similarity System databases to an embodiment in
which they reside in a single database. The database is a stored program component, which is
executed by the CPU; the stored program component portion configuring the CPU to process
the stored data. The database may be a conventional, fault tolerant, relational, scalable,
secure database such as Oracle or Sybase. Relational databases are an extension of a flat file.
Relational databases consist of a series of related tables. The tables are interconnected via a
key field. Use of the key field allows the combination of the tables by indexing against the
key field; i.e., the key fields act as dimensional pivot points for combining information from
various tables. Relationships generally identify links maintained between tables by matching
primary keys. Primary keys represent fields that uniquely identify the rows of a table in a
relational database. More precisely, they uniquely identify rows of a table on the “one” side
of a one-to-many relationship.
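The role of a key field in combining related tables can be sketched as follows. This is an illustration only, assuming Python's standard sqlite3 module; the tables sources and triples, their columns, and the sample rows are hypothetical and are not taken from the databases 519a-f.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "One" side: each row is uniquely identified by its primary key.
    CREATE TABLE sources (source_id INTEGER PRIMARY KEY, name TEXT);
    -- "Many" side: several rows may carry the same source_id key field.
    CREATE TABLE triples (
        triple_id INTEGER PRIMARY KEY,
        source_id INTEGER REFERENCES sources(source_id),
        subject TEXT, predicate TEXT, object TEXT
    );
    INSERT INTO sources VALUES (1, 'doc-a'), (2, 'doc-b');
    INSERT INTO triples VALUES
        (10, 1, 'deer', 'eats', 'grass'),
        (11, 1, 'deer', 'lives_in', 'forest'),
        (12, 2, 'wolf', 'hunts', 'deer');
""")

# Combine the two tables by matching on the shared key field.
joined = conn.execute("""
    SELECT s.name, t.subject, t.predicate, t.object
    FROM sources AS s JOIN triples AS t ON t.source_id = s.source_id
    ORDER BY t.triple_id
""").fetchall()
for row in joined:
    print(row)
```

Here source_id acts as the pivot point: it is unique on the “one” side (sources) and repeated on the “many” side (triples).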
Alternatively, the Factual Similarity System database may be implemented using
various standard data-structures, such as an array, hash, (linked) list, struct, structured text
file (e.g., XML), table, and/or the like. Such data-structures may be stored in memory and/or
in (structured) files. In another alternative, an object-oriented database may be used, such as
Frontier, ObjectStore, Poet, Zope, and/or the like. Object databases can include a number of
object collections that are grouped and/or linked together by common attributes; they may be
related to other object collections by some common attributes. Object-oriented databases
perform similarly to relational databases with the exception that objects are not just pieces of
data but may have other types of capabilities encapsulated within a given object. If the
Factual Similarity System database is implemented as a data-structure, the use of the Factual
Similarity System database 519 may be integrated into another component such as the
Factual Similarity System component 535. Also, the database may be implemented as a mix
of data structures, objects, and relational structures. Databases may be consolidated and/or
distributed in countless variations through standard data processing techniques. Portions of
databases, e.g., tables, may be exported and/or imported and thus decentralized and/or
integrated.
In one embodiment, the database component 519 may include several included
databases or tables 519a-f, examples of which are described above.
In one embodiment, the Factual Similarity System database 519 may interact with
other database systems. For example, employing a distributed database system, queries and
data access by a search Factual Similarity System component may treat the combination of
the Factual Similarity System databases 519 and an integrated data security layer database as a
single database entity.
In one embodiment, user programs may contain various user interface primitives,
which may serve to update the Factual Similarity System. Also, various accounts may require
custom database tables depending upon the environments and the types of clients the Factual
Similarity System may need to serve. It should be noted that any unique fields may be
designated as a key field throughout. In an alternative embodiment, these tables have been
decentralized into their own databases and their respective database controllers (i.e.,
individual database controllers for each of the above tables). Employing standard data
processing techniques, one may further distribute the databases over several computer
systemizations and/or storage devices. Similarly, configurations of the decentralized database
controllers may be varied by consolidating and/or distributing the various database
components 519a-f. The Factual Similarity System may be configured to keep track of
various settings, inputs, and parameters via database controllers.
The Factual Similarity System database may communicate to and/or with other
components in a component collection, including itself, and/or facilities of the like. Most
frequently, the Factual Similarity System database communicates with the Factual Similarity
System component, other program components, and/or the like. The database may contain,
retain, and provide information regarding other nodes and data.
xv. The Factual Similarity Systems
The Factual Similarity System component 535 is a stored program component
that is executed by a CPU. In one embodiment, the Factual Similarity System component
incorporates any and/or all combinations of the aspects of the Factual Similarity System that
was discussed in the previous figures. As such, the Factual Similarity System affects
accessing, obtaining and the provision of information, services, transactions, and/or the like
across various communications networks. The features and embodiments of the Factual
Similarity System discussed herein increase network efficiency by reducing data transfer
requirements through the use of more efficient data structures and mechanisms for their transfer and
storage. As a consequence, more data may be transferred in less time, and latencies with
regard to transactions, are also reduced. In many cases, such reduction in storage, transfer
time, bandwidth requirements, latencies, etc., will reduce the capacity and structural
infrastructure requirements to support the Factual Similarity System’s features and facilities,
and in many cases reduce the costs, energy consumption/requirements, and extend the life of
Factual Similarity System’s underlying infrastructure; this has the added benefit of making
the Factual Similarity System more reliable. Similarly, many of the features and mechanisms
are designed to be easier for users to use and access, thereby broadening the audience that
may enjoy/employ and exploit the feature sets of the Factual Similarity System; such ease of
use also helps to increase the reliability of the Factual Similarity System. In addition, the
feature sets include heightened security as noted via the Cryptographic components 520, 526,
528 and throughout, making access to the features and data more reliable and secure.
The Factual Similarity System component enabling access of information
between nodes may be developed by employing standard development tools and languages
such as, but not limited to: Apache components, Assembly, ActiveX, binary executables,
(ANSI) (Objective-) C (++), C# and/or .NET, database adapters, CGI scripts, Java,
JavaScript, mapping tools, procedural and object oriented development tools, PERL, PHP,
Python, shell scripts, SQL commands, web application server extensions, web development
environments and libraries (e.g., Microsoft’s ActiveX; Adobe AIR, FLEX & FLASH; AJAX;
(D)HTML; Dojo, Java; JavaScript; jQuery(UI); MooTools; Prototype; script.aculo.us;
Simple Object Access Protocol (SOAP); SWFObject; Yahoo! User Interface; and/or the
like), WebObjects, and/or the like. In one embodiment, the Factual Similarity System server
employs a cryptographic server to encrypt and decrypt communications. The Factual
Similarity System component may communicate to and/or with other components in a
component collection, including itself, and/or facilities of the like. Most frequently, the
Factual Similarity System component communicates with the Factual Similarity System
database, operating systems, other program components, and/or the like. The Factual
Similarity System may contain, communicate, generate, obtain, and/or provide program
component, system, user, and/or data communications, requests, and/or responses.
xvi. Distributed Factual Similarity Systems
The structure and/or operation of any of the Factual Similarity System node
controller components may be combined, consolidated, and/or distributed in any number of
ways to facilitate development and/or deployment. Similarly, the component collection may
be combined in any number of ways to facilitate deployment and/or development. To
accomplish this, one may integrate the components into a common code base or in a facility
that can dynamically load the components on demand in an integrated fashion.
The component collection may be consolidated and/or distributed in countless
variations through standard data processing and/or development techniques. Multiple
instances of any one of the program components in the program component collection may
be instantiated on a single node, and/or across numerous nodes to improve performance
through load-balancing and/or data-processing techniques. Furthermore, single instances may
also be distributed across multiple controllers and/or storage devices; e.g., databases. All
program component instances and controllers working in concert may do so through standard
data processing communication techniques.
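As an illustration of instantiating multiple instances of a component and spreading work across them, the following is a minimal Python sketch of round-robin load balancing. It is not taken from the disclosure: the class names `ComponentInstance` and `RoundRobinBalancer`, the node names, and the request strings are all hypothetical, and round-robin is only one of many possible load-balancing techniques.

```python
from itertools import cycle


class ComponentInstance:
    """Hypothetical stand-in for one instance of a program component."""

    def __init__(self, node_name):
        self.node_name = node_name
        self.handled = []  # requests this instance has processed

    def handle(self, request):
        self.handled.append(request)
        return f"{self.node_name} handled {request}"


class RoundRobinBalancer:
    """Hands each incoming request to the next instance in turn."""

    def __init__(self, instances):
        if not instances:
            raise ValueError("at least one instance is required")
        self._instances = cycle(instances)

    def dispatch(self, request):
        return next(self._instances).handle(request)


# Three instances (assumed to live on three nodes) share six requests.
instances = [ComponentInstance(f"node-{i}") for i in range(3)]
balancer = RoundRobinBalancer(instances)
results = [balancer.dispatch(f"request-{n}") for n in range(6)]
```

Each instance ends up with two of the six requests. A deployment could substitute a different policy (for example, least-loaded) without changing the callers of `dispatch`.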
[0138] The configuration of the Factual Similarity System controller will depend on the
context of system deployment. Factors such as, but not limited to, the budget, capacity,
location, and/or use of the underlying hardware resources may affect deployment
requirements and configuration. Regardless of whether the configuration results in more
consolidated and/or integrated program components, results in a more distributed series of
program components, and/or results in some combination between a consolidated and
distributed configuration, data may be communicated, obtained, and/or provided. Instances
of components consolidated into a common code base from the program component
collection may communicate, obtain, and/or provide data. This may be accomplished through
intra-application data processing communication techniques such as, but not limited to: data
referencing (e.g., pointers), internal messaging, object instance variable communication,
shared memory space, variable passing, and/or the like.
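The intra-application techniques listed above (internal messaging and shared memory space) can be sketched in Python as two components in one process exchanging messages over an in-memory queue. This is an illustrative assumption rather than the disclosed implementation: the component names `extractor` and `indexer`, the message layout, and the `None` sentinel are all hypothetical.

```python
import queue
import threading


def extractor(outbox):
    """Hypothetical producer component: emits messages, then a sentinel."""
    for text in ("alpha", "beta", "gamma"):
        outbox.put({"type": "text", "payload": text})
    outbox.put(None)  # sentinel: no more messages


def indexer(inbox, index):
    """Hypothetical consumer component: writes payloads into shared state."""
    while True:
        message = inbox.get()
        if message is None:
            break
        index.append(message["payload"].upper())


mailbox = queue.Queue()   # internal messaging channel
shared_index = []         # object shared by reference between components
consumer = threading.Thread(target=indexer, args=(mailbox, shared_index))
consumer.start()
extractor(mailbox)
consumer.join()
```

After `join()` returns, `shared_index` holds the three payloads in the order they were sent, because a single consumer drains a first-in, first-out queue.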
If component collection components are discrete, separate, and/or external to one
another, then communicating, obtaining, and/or providing data with and/or to other
components may be accomplished through inter-application data processing
communication techniques such as, but not limited to: Application Program Interfaces (API)
information passage; (Distributed) Component Object Model ((D)COM), (Distributed) Object
Linking and Embedding ((D)OLE), and/or the like; Common Object Request Broker
Architecture (CORBA), Jini local and remote application program interfaces, JavaScript
Object Notation (JSON), Remote Method Invocation (RMI), SOAP, process pipes, shared
files, and/or the like. Messages sent between discrete components for inter-application
communication or within memory spaces of a singular component for intra-application
communication may be facilitated through the creation and parsing of a grammar.
A grammar may be developed by using development tools such as lex, yacc, XML, and/or
the like, which allow for grammar generation and parsing capabilities, which in turn may
form the basis of communication messages within and between components.
For example, a grammar may be arranged to recognize the tokens of an HTTP
post command, e.g.:
w3c -post http://... Value1
where Value1 is discerned as being a parameter because “http://” is part of the grammar
syntax, and what follows is considered part of the post value. Similarly, with such a
grammar, a variable “Value1” may be inserted into an “http://” post command and then sent.
The grammar syntax itself may be presented as structured data that is interpreted and/or
otherwise used to generate the parsing mechanism (e.g., a syntax description text file as
processed by lex, yacc, etc.). Also, once the parsing mechanism is generated and/or
instantiated, it itself may process and/or parse structured data such as, but not limited to:
character (e.g., tab) delineated text, HTML, structured text streams, XML, and/or the like
structured data. In another embodiment, inter-application data processing protocols
themselves may have integrated and/or readily available parsers (e.g., JSON, SOAP, and/or
like parsers) that may be employed to parse (e.g., communications) data. Further, the parsing
grammar may be used beyond message parsing, but may also be used to parse: databases,
data collections, data stores, structured data, and/or the like. Again, the desired configuration
will depend upon the context, environment, and requirements of system deployment.
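The post-command grammar described above can be sketched with a small regular-expression tokenizer in Python. This is a hedged illustration, not the lex/yacc grammar the disclosure refers to: the token names, the functions `tokenize`, `parse_post`, and `build_post`, and the URL `http://example.com/form` are assumptions introduced here, since the source elides the actual URL.

```python
import re

# Token grammar: each named group is one terminal symbol (names are assumed).
TOKEN_PATTERN = re.compile(
    r"\s*(?:(?P<COMMAND>w3c)\b"
    r"|(?P<FLAG>-post)\b"
    r"|(?P<URL>http://\S+)"
    r"|(?P<VALUE>\S+))"
)


def tokenize(command_line):
    """Split a command line into (token kind, token text) pairs."""
    tokens = []
    position = 0
    while position < len(command_line):
        match = TOKEN_PATTERN.match(command_line, position)
        if match is None:
            break  # only trailing whitespace remains
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        position = match.end()
    return tokens


def parse_post(command_line):
    """Parse `w3c -post <url> <value...>`; raise ValueError on a mismatch."""
    tokens = tokenize(command_line)
    kinds = [kind for kind, _ in tokens]
    if kinds[:3] != ["COMMAND", "FLAG", "URL"] or len(tokens) < 4:
        raise ValueError("not a recognized post command")
    # Everything after the URL is treated as the post value.
    return {"url": tokens[2][1], "value": " ".join(text for _, text in tokens[3:])}


def build_post(url, value):
    """Inverse direction: insert a value into a post command before sending."""
    return f"w3c -post {url} {value}"


parsed = parse_post("w3c -post http://example.com/form Value1")
```

Here `parsed` separates the URL from the post value because the `http://` prefix is part of the grammar, and `build_post` shows the reverse direction of inserting a value into a command.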
For example, in some implementations, the Factual Similarity System controller
may be executing a PHP script implementing a Secure Sockets Layer (“SSL”) socket server
via the information server, which listens to incoming communications on a server port to
which a client may send data, e.g., data encoded in JSON format. Upon identifying an
incoming communication, the PHP script may read the incoming message from the client
device, parse the received JSON-encoded text data to extract information from the JSON-
encoded text data into PHP script variables, and store the data (e.g., client identifying
information, etc.) and/or extracted information in a relational database accessible using the
Structured Query Language (“SQL”). An exemplary listing, written substantially in the form
of PHP/SQL commands, to accept JSON-encoded input data from a client device via an SSL
connection, parse the data to extract variables, and store the data to a database, is provided
below:
<?PHP
header('Content-Type: text/plain');
// set ip address and port to listen to for incoming data
$address = '192.168.0.100';
$port = 255;
// create a server-side SSL socket, listen for/accept incoming communication
$sock = socket_create(AF_INET, SOCK_STREAM, 0);
socket_bind($sock, $address, $port) or die('Could not bind to address');
socket_listen($sock);
$client = socket_accept($sock);
// read input data from client device in 1024 byte blocks until end of message
$data = "";
do {
    $input = socket_read($client, 1024);
    $data .= $input;
} while($input != "");
// parse data to extract variables
$obj = json_decode($data, true);
// store input data in a database
mysql_connect("201.408.185.132",$DBserver,$password); // access database server
mysql_select("CLIENT_DB.SQL"); // select database to append
mysql_query("INSERT INTO UserTable (transmission)
VALUES ($data)"); // add data to UserTable table in a CLIENT database
mysql_close("CLIENT_DB.SQL"); // close connection to database
?>
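The parse-and-store steps of the listing above can also be sketched in Python. This is a minimal sketch under stated assumptions, not the disclosed implementation: it omits the socket server, substitutes the standard-library `sqlite3` module for the MySQL server, and uses parameter binding instead of interpolating client data into the SQL text. The table name `UserTable` and column `transmission` come from the listing; the function names and the sample message are hypothetical.

```python
import json
import sqlite3


def open_client_db(path=":memory:"):
    """Open the (assumed) client database and make sure UserTable exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS UserTable ("
        "id INTEGER PRIMARY KEY, transmission TEXT NOT NULL)"
    )
    return conn


def store_transmission(conn, raw_message):
    """Decode a JSON message from bytes, store it, and return the decoded object."""
    obj = json.loads(raw_message.decode("utf-8"))  # raises ValueError on bad input
    # The `?` placeholder keeps client-supplied data out of the SQL text.
    conn.execute(
        "INSERT INTO UserTable (transmission) VALUES (?)",
        (json.dumps(obj, sort_keys=True),),
    )
    conn.commit()
    return obj


conn = open_client_db()
decoded = store_transmission(
    conn, b'{"client_id": "abc123", "query": "deer crossing"}'
)
rows = conn.execute("SELECT transmission FROM UserTable").fetchall()
```

Storing the re-serialized JSON rather than the raw bytes means only well-formed messages reach the table; malformed input fails at `json.loads` before any SQL runs.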
Also, the following resources may be used to provide example embodiments
regarding SOAP parser implementation:
http://www.xav.com/perl/site/lib/SOAP/Parser.html
http://publib.boulder.ibm.com/infocenter/tivihelp/v2r1/index.jsp?topic=/com.ibm.IBMDI.doc/referenceguide295.htm
and other parser implementations:
http://publib.boulder.ibm.com/infocenter/tivihelp/v2r1/index.jsp?topic=/com.ibm.doc/referenceguide259.htm
all of which are hereby expressly incorporated by reference.
A. Conclusion
FIGS. 1 through 23 are conceptual illustrations allowing for an explanation of the
present disclosure. It should be understood that various aspects of the embodiments of the
present disclosure could be implemented in hardware, firmware, software, or combinations
thereof. In such embodiments, the various components and/or steps would be implemented in
hardware, firmware, and/or software to perform the functions of the present disclosure. That
is, the same piece of hardware, firmware, or module of software could perform one or more
of the illustrated blocks (e.g., components or steps).
In software implementations, computer software (e.g., programs or other
instructions) and/or data is stored on a machine readable medium as part of a computer
program product, and is loaded into a computer system or other device or machine via a
removable storage drive, hard drive, or communications interface. Computer programs (also
called computer control logic or computer readable program code) are stored in a main
and/or secondary memory, and executed by one or more processors (controllers, or the like)
to cause the one or more processors to perform the functions of the disclosure as described
herein. In this document, the terms “machine readable medium,” “computer program
medium” and “computer usable medium” are used to generally refer to media such as a
random access memory (RAM); a read only memory (ROM); a removable storage unit (e.g.,
a magnetic or optical disc, flash memory device, or the like); a hard disk; or the like.
[0145] Notably, the figures and examples above are not meant to limit the scope of the
present disclosure to a single embodiment, as other embodiments are possible by way of
interchange of some or all of the described or illustrated elements. Moreover, where certain
elements of the present disclosure can be partially or fully implemented using known
components, only those portions of such known components that are necessary for an
understanding of the present disclosure are described, and detailed descriptions of other
portions of such known components are omitted so as not to obscure the disclosure. In the
present specification, an embodiment showing a singular component should not necessarily
be limited to other embodiments including a plurality of the same component, and vice-versa,
unless explicitly stated otherwise herein. Moreover, the applicants do not intend for any term
in the specification or claims to be ascribed an uncommon or special meaning unless
explicitly set forth as such. Further, the present disclosure encompasses present and future
known equivalents to the known components referred to herein by way of illustration.
The foregoing description of the specific embodiments so fully reveals the general
nature of the disclosure that others can, by applying knowledge within the skill of the
relevant art(s), readily modify and/or adapt for various applications such specific
embodiments, without undue experimentation, without departing from the general concept of
the present disclosure. Such adaptations and modifications are therefore intended to be within
the meaning and range of equivalents of the disclosed embodiments, based on the teaching
and guidance presented herein. It is to be understood that the phraseology or terminology
herein is for the purpose of description and not of limitation, such that the terminology or
phraseology of the present specification is to be interpreted by the skilled artisan in light of
the teachings and guidance presented herein, in combination with the knowledge of one
skilled in the relevant art(s).
In order to address various issues and advance the art, the entirety of this
application for LEGAL FACTUAL SIMILARITY SYSTEM (including the Cover Page,
Title, Headings, Cross-Reference to Related Application, Background, Brief Summary, Brief
Description of the Drawings, Detailed Description, Claims, Figures, and otherwise) shows,
by way of illustration, various embodiments in which the claimed innovations may be
practiced. The advantages and features of the application are of a representative sample of
embodiments only, and are not exhaustive and/or exclusive. They are presented only to assist
in understanding and teach the claimed principles. It should be understood that they are not
representative of all claimed innovations. As such, certain aspects of the disclosure have not
been discussed herein. That alternate embodiments may not have been presented for a
specific portion of the innovations or that further undescribed alternate embodiments may be
available for a portion is not to be considered a disclaimer of those alternate embodiments. It
will be appreciated that many of those undescribed embodiments incorporate the same
principles of the innovations and others are equivalent. Thus, it is to be understood that other
embodiments may be utilized and functional, logical, operational, organizational, structural
and/or topological modifications may be made without departing from the scope and/or spirit
of the disclosure. As such, all examples and/or embodiments are deemed to be non-limiting
throughout this disclosure. Also, no inference should be drawn regarding those embodiments
discussed herein relative to those not discussed herein other than it is as such for purposes of
reducing space and repetition. For instance, it is to be understood that the logical and/or
topological structure of any combination of any program components (a component
collection), other components and/or any present feature sets as described in the figures
and/or throughout are not limited to a fixed operating order and/or arrangement, but rather,
any disclosed order is exemplary and all equivalents, regardless of order, are contemplated
by the disclosure. Furthermore, it is to be understood that such features are not limited to
serial execution, but rather, any number of threads, processes, services, servers, and/or the
like that may execute asynchronously, concurrently, in parallel, simultaneously,
synchronously, and/or the like are contemplated by the disclosure. As such, some of these
features may be mutually contradictory, in that they cannot be simultaneously present in a
single embodiment. Similarly, some features are applicable to one aspect of the innovations,
and inapplicable to others. In addition, the disclosure includes other innovations not presently
claimed. Applicant reserves all rights in those presently unclaimed innovations including the
right to claim such innovations, file additional applications, continuations, continuations in
part, divisions, and/or the like thereof. As such, it should be understood that advantages,
embodiments, examples, functional, features, logical, operational, organizational, structural,
topological, and/or other aspects of the disclosure are not to be considered limitations on the
disclosure as defined by the claims or limitations on equivalents to the claims. It is to be
understood that, depending on the particular needs and/or characteristics of an individual
and/or enterprise user, database configuration and/or relational model, data type, data
transmission and/or network framework, syntax structure, and/or the like, various
embodiments may be implemented that enable a great deal of flexibility and customization.
For example, aspects may be adapted for video, audio or any other content. While various
embodiments and discussions have included reference to applications in the legal industry, it
is to be understood that the embodiments described herein may be readily configured and/or
customized for a wide variety of other applications and/or implementations.
Claims (14)
1. A method for finding documents, comprising: ingesting at least two library documents by extracting and indexing library triples therefrom; receiving a reference text string; extracting at least one reference triple from the reference text string; identifying one or more library triples similar to the at least one reference triple; and returning a list of one or more result library documents based on the identified library triples.
2. The method of claim 1, further comprising: expanding the library triples based on a semantic corpus to obtain expanded library triples; and indexing the expanded library triples while maintaining a record of the library document from which the library triples used to obtain them were extracted, wherein the identifying step includes identifying one or more expanded library triples similar to the at least one reference triple and the list of one or more result library documents returned by the returning step is based on the identified library triples and expanded library triples.
3. The method of claim 1, further comprising: expanding the at least one reference triple based on a semantic corpus to obtain at least one expanded reference triple, wherein the identifying step includes identifying one or more library triples similar to the at least one expanded reference triple.
4. The method of claim 2, wherein the expanding step includes forming multi-word tokens as components of a library triple based on a semantic corpus.
5. The method of claim 3, wherein the expanding step includes forming multi-word tokens as components of a reference triple based on a semantic corpus.
6. The method of claim 1, wherein the returned list is ranked based on a similarity between the identified library triples in each listed library document and the one or more reference triples.
7. The method of claim 1, further comprising scoring library documents from which identified library triples were extracted based on an aggregation of similarity scores between each identified library triple and its corresponding reference triple.
8. The method of claim 7, wherein the list that is returned includes only library documents having a similarity score above a predefined threshold.
9. The method of claim 7, wherein the listed library documents are ranked according to their similarity scores.
10. The method of claim 1, further comprising: receiving a second reference text string after returning the list; extracting at least one second reference triple from the second reference text string; identifying one or more library triples similar to the at least one second reference triple; returning an updated list of one or more result library reference documents based on the library triples identified with respect to both the first reference triples and second reference triples.
11. A method for mining facts from a body of documents, comprising: ingesting two or more library documents by extracting and indexing library triples therefrom that relate to a primary source; grouping similar triples into one or more fact groups; ingesting a later document after the two or more library documents by extracting later triples therefrom that relate to a primary source; and grouping the later triples into the one or more fact groups based on a similarity between the later triples and the library triples previously comprising the one or more fact groups.
12. The method of claim 11, further comprising: receiving a reference text string; extracting at least one reference triple from the reference text string; expanding the at least one reference triple based on the one or more fact groups to obtain at least one expanded reference triple; identifying one or more library triples similar to the at least one expanded reference triple; and returning a list of one or more result library documents based on the identified library triples.
13. The method of claim 11, further comprising: receiving a reference text string; extracting at least one reference triple from the reference text string; expanding the at least one reference triple based on the one or more fact groups to obtain at least one expanded reference triple; identifying one or more library triples similar to the at least one expanded reference triple; and returning a list of one or more primary sources based on the identified library triples.
14. A method for finding documents relating to a primary source, comprising: ingesting two or more library documents by extracting and indexing library triples therefrom that relate to a primary source; receiving a reference text string; extracting at least one reference triple from the reference text string; identifying one or more library triples similar to the at least one reference triple; and returning a list of one or more primary sources based on the identified library triples.
The present disclosure is directed towards systems and methods for finding documents that are similar to a reference text. The inventive systems and methods examine a set of collected documents to determine the facts present in those documents by, for example, extracting triplets and expanding them. A user’s input reference text is similarly examined to extract and expand triplets therein and the facts identified with respect to the reference text are used as a basis to find documents having similar facts. The present disclosure is also related to systems and methods for mining facts from documents relating to a primary source such as a piece of legislation and using the mined facts to improve the results of subsequent searches.
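The ingest, identify, score, and rank steps recited in claims 1 and 7 through 9 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the source: triples are taken to be already extracted as (subject, predicate, object) tuples, `triple_similarity` is a deliberately simple stand-in measure (the share of matching components), and the class name `TripleIndex`, the document identifiers, the sample triples, and the threshold value are all hypothetical.

```python
from collections import defaultdict


def triple_similarity(a, b):
    """Assumed measure: share of matching components, case-insensitive."""
    return sum(x.lower() == y.lower() for x, y in zip(a, b)) / 3


class TripleIndex:
    """Keeps library triples together with the document each came from."""

    def __init__(self):
        self._entries = []  # (triple, document id) pairs

    def ingest(self, doc_id, triples):
        for triple in triples:
            self._entries.append((tuple(triple), doc_id))

    def find_documents(self, reference_triples, threshold=0.5):
        """Return (document id, score) pairs above threshold, best first."""
        if not reference_triples:
            return []
        scores = defaultdict(float)
        for library_triple, doc_id in self._entries:
            best = max(triple_similarity(library_triple, ref)
                       for ref in reference_triples)
            if best > 0:
                scores[doc_id] += best  # aggregate per-document similarity
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        return [(doc_id, score) for doc_id, score in ranked if score > threshold]


index = TripleIndex()
index.ingest("doc-A", [("driver", "struck", "deer"), ("deer", "crossed", "highway")])
index.ingest("doc-B", [("driver", "struck", "pedestrian")])
index.ingest("doc-C", [("tenant", "breached", "lease")])
results = index.find_documents([("driver", "struck", "deer")])
```

With these sample triples, `doc-A` ranks first on an exact match, `doc-B` follows on a partial match, and `doc-C` is excluded by the threshold. A fuller system would replace `triple_similarity` with a semantic measure, for example one informed by the expanded triples of claims 2 and 3.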
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US62/426,727 | 2016-11-28 | ||
US62/550,839 | 2017-08-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
NZ794252A (en) | 2022-11-25 |