US20060212288A1 - Topic specific language models built from large numbers of documents - Google Patents

Info

Publication number
US20060212288A1
Authority
US
Grant status
Application
Patent type
Prior art keywords
model
language
documents
information
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US11384226
Other versions
US7739286B2 (en )
Inventor
Abhinav Sethy
Panayiotis Georgiou
Shrikanth Narayanan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Southern California (USC)
Original Assignee
University of Southern California (USC)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 - Handling natural language data
    • G06F17/27 - Automatic analysis, e.g. parsing
    • G06F17/2705 - Parsing
    • G06F17/2715 - Statistical methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 - Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197 - Probabilistic grammars, e.g. word n-grams
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models

Abstract

Forming and/or improving a language model based on data from a large collection of documents, such as web data. The collection of documents is queried using queries that are formed from the language model. The language model is subsequently improved using the information thus obtained. The improvement is used to improve the query. As data is received from the collection of documents, it is compared to a rejection model, that models what rejected documents typically look like. Any document that meets the test is then rejected. The documents that remain are characterized to determine whether they add information to the language model, whether they are relevant, and whether they should be independently rejected. Rejected documents are used to update the rejection model; accepted documents are used to update the language model. Each iteration improves the language model, and the documents may be analyzed again using the improved language model.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • [0001]
    This application claims priority to U.S. provisional application Ser. No. 60/663,141, filed on Mar. 17, 2005. The disclosure of the prior application is considered part of (and is incorporated by reference in) the disclosure of this application.
  • FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
  • [0002]
    The U.S. Government may have certain rights in this invention pursuant to Grant No. N66001-02-C-6023 awarded by DARPA.
  • BACKGROUND
  • [0003]
    Natural language processing (NLP) systems, such as speech recognition, machine translation, or other text to text applications, typically rely on language models to allow a machine to recognize speech. The performance of these systems can be improved by customizing the model for a specific domain and/or application. A typical way of forming such a model is to base the model on text resources. For example, a model for a specific domain may be based on text resources that are specific to that domain.
  • [0004]
    Sometimes, text for a target domain might be available from an institution that maintains a repository of texts, such as NIST or LDC. Other times, the data is simply collected manually.
  • [0005]
    Manual collection of data may be very difficult, and may add to system turnaround time and cost. Moreover, the amount of available data for a specific domain may be quite limited. In order to limit the effects of minimal domain specific data, a topic independent language model is often merged with a topic-specific language model generated from the limited in-domain data. This operation may form a hybrid model. The hybrid model may be smoothed to form a final topic specific language model.
  • [0006]
    This approach, however, is often less accurate compared to an approach where effective amounts of in-domain data are available.
  • SUMMARY
  • [0007]
    The present application describes a technique of using a publicly available network, such as the World Wide Web, to automatically gather data for building domain specific language models. Data is gathered that includes usable parts mixed with unusable parts. The gathered data is characterized and weighted according to its usability and relevance.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • [0008]
    In the drawings:
  • [0009]
    FIG. 1 shows a block diagram of a machine used to form and use the model;
  • [0010]
    FIG. 2 shows a flowchart of operation.
  • DETAILED DESCRIPTION
  • [0011]
    The amount of data that is available and indexed on the World Wide Web is immense. More than 10 billion pages have been indexed by Google alone. However, each web page may typically have data related to many different topics and actions, e.g., links, advertisements and others. It may be a challenge to access clean text that is relevant to a particular application.
  • [0012]
    The techniques disclosed herein make use of a query based search engine to retrieve documents for a specific domain. A collection of documents can be the Internet, or can be any large database, e.g., a database with 10,000 or more documents; or 100,000 or more documents. Examples of databases which can be used to provide such documents may include news archives, corporate e-mail or other e-mail databases, corporate documents, medical histories, and any other collection of documents. Query based search engines retrieve documents which contain the specific query terms that were requested. Even though the document might contain the query terms of interest, the document might not be useful overall for modeling the domain of interest. In reality, large sections of the returned documents may not be relevant to the particular task. The embodiment describes classifying the documents according to multiple different techniques. The sections which are not relevant are considered as noise. The retrieved data is then selectively weighted to maximize the gain of the relevant parts. The techniques described herein are intended to be used with documents obtained from the Web, which is described as the embodiment. However, it should be understood that the same techniques can be used with any documents of any type, where it is expected that parts of the document will be relevant, and other parts of the document will be partly or wholly irrelevant.
  • [0013]
    The obtained set of documents for a domain of interest can then be used as an update to an already existing language model. This may be used in cases where an existing speech recognition system handles new content, in applications such as broadcast news applications. However, if the set of documents is too small to support building a robust language model, then the new data may be weighted more heavily.
  • [0014]
    An initial topic model represents the topic of the item being trained. A generic, topic independent language model and corresponding documents on which it is built, are also contemplated as an alternative embodiment.
  • [0015]
    Two language models are used: one is topic dependent, and the other is a topic independent, or background, model. The models are used to generate speech queries using the relative entropy measure. The queries are used on the Internet to return downloaded data.
  • [0016]
    The downloaded data from those speech queries is weighted at the utterance level, using a soft clustering technique. The weighting is used to create a rejection model. The rejection model is used to determine “documents”, that is, collections of information, that should be rejected as a whole. Hence, this system classifies the information in two different ways: at the document level, and at some level of word cluster less than the document level, called the utterance level. The different levels may include phrases, words, sentences, subsets of sentences or clusters of sentences, as well as complete documents.
  • [0017]
    Low-scoring downloaded Web data helps reject the spurious text. Other documents associated with the retrieved documents, such as in line advertisements, and cross links, may also be rejected.
  • [0018]
    A first test reviews the documents at the utterance level, that is, by phrases that are somehow matched together. Utterance processing may be supplemented using document classification techniques such as TFIDF and naïve Bayes to generate query words and provide document level weights.
  • [0019]
    FIG. 1 illustrates a computer system which may be used to form the model. Computer 110 accesses the Internet 120, to form the model 100 within a memory. The memory may be internal to or external to the computer.
  • [0020]
    The computer 110 operates according to the flowchart of FIG. 2. 200 represents the computer generating queries to the Internet 120. The queries are generated by comparing the topic language model with the background language model. The comparison may use relative entropy computation. For example, the relative entropy computation between two discrete distributions may compare densities across all the possible symbols in the alphabet. A direct relative entropy implementation for an n-gram language model would require V^n computations, where V is the vocabulary size. Unfortunately, this direct implementation would make even medium-size trigram language models (15,000-20,000 words) computationally prohibitive.
  • [0021]
    Real world n-gram language models may be conceptualized as tree structures. A majority of the densities of those n-gram models may be reduced to probabilities corresponding to (n−1)-grams. This makes it possible to compute the relative entropy between two language models in O(L) computations, where L is the number of language model terms actually present in the two language models. The technique described in “measuring convergence . . . ” recursively calculates the relative entropy for an n-gram model using the relative entropy for the (n−1)-gram model. The computation provides the relative entropy conditioned on word sequence histories h, that is, the relative entropy between the n-grams represented by p(x|h) and q(x|h), where h is the history on which the probability of seeing the word x is conditioned, p is the topic language model, and q is the background language model being evaluated with respect to p.
  • [0022]
    Histories with large relative entropies form the best candidates for becoming key phrases or keywords. These histories have been found to have good discriminative power. The probability p(h) can be analyzed to ensure that it is higher than the corresponding q(h), to verify qualification as keywords or phrases.
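By way of illustration only, the ranking of histories described above may be sketched in Python as follows. The sketch assumes toy bigram tables stored as nested dictionaries and uses a brute-force sum per history rather than the O(L) tree recursion mentioned above. The function names, the `floor` probability assigned to words missing from the background model, and the data layout are hypothetical and are not taken from the disclosure.

```python
import math


def history_relative_entropy(p_cond, q_cond, floor=1e-6):
    """Return sum_x p(x|h) * ln(p(x|h) / q(x|h)) for a single history h.

    p_cond, q_cond: dicts mapping a word x to its conditional probability.
    floor: assumed probability for words absent from q_cond (illustrative).
    """
    total = 0.0
    for x, p in p_cond.items():
        if p > 0.0:
            total += p * math.log(p / q_cond.get(x, floor))
    return total


def rank_histories(topic, background, p_hist, q_hist):
    """Rank histories by relative entropy, keeping only those with p(h) > q(h)."""
    scored = []
    for h, p_cond in topic.items():
        # Skip histories that are not more probable under the topic model.
        if p_hist.get(h, 0.0) <= q_hist.get(h, 0.0):
            continue
        re_h = history_relative_entropy(p_cond, background.get(h, {}))
        scored.append((h, re_h))
    # Largest relative entropy first: best key phrase candidates.
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Histories at the head of the returned list would be the candidates for key phrases or keywords; the check on p(h) against q(h) mirrors the qualification step described above.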
  • [0023]
    An embodiment is described herein, modeling language for movies. In the movie model, some key phrases are relevant, phrases such as “the movie”, “on screen”, “the characters”. However, many key phrases contain functional words such as “is great” or “at times”. These query phrases may be useful on their own, but may be more effective when combined with keywords. For example, “is great”+“actor” may be an effective query.
  • [0024]
    FIG. 2 shows a flowchart of the operation. At 200, a list of query key phrases and keywords is generated using the language model, as described herein. Importantly, as the language model improves from these techniques, the queries also improve. A keyword list is also generated, using a document classification system, based on the information between words and class labels. The key phrases and keywords are merged with the keyword list.
  • [0025]
    In the embodiment, a random selection of five query words, with a few randomly chosen key phrases, is used as the search query. The queries are sent to Google's SOAP API. A relevant set of URLs is returned. The query itself is a mix of keywords and key phrases, and the mix may be individualized based on the task at hand. For example, conversational styles may use more key phrases than keywords.
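As a hedged sketch of how such a query might be assembled, the following Python fragment draws five keywords and two key phrases at random and joins them into one query string. The function name, the default of two key phrases, and the convention of wrapping phrases in double quotes are assumptions made for illustration; no call to any search API is shown.

```python
import random


def build_query(keywords, key_phrases, n_keywords=5, n_phrases=2, seed=None):
    """Build one search query from randomly chosen keywords and key phrases.

    Key phrases are wrapped in double quotes (an assumed phrase-search
    convention); keywords are appended as bare terms.
    """
    rng = random.Random(seed)
    words = rng.sample(keywords, min(n_keywords, len(keywords)))
    phrases = rng.sample(key_phrases, min(n_phrases, len(key_phrases)))
    quoted = ['"%s"' % phrase for phrase in phrases]
    return " ".join(quoted + words)
```

The ratio of `n_phrases` to `n_keywords` is the knob that would be adjusted per task, for example raising `n_phrases` for conversational styles.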
  • [0026]
    The URLs are downloaded and converted to text. This operation is generically shown as 210.
  • [0027]
    At 220, each of the utterances receives a likelihood score, and low-scoring documents and utterances are rejected.
  • [0028]
    Downloaded data from the World Wide Web includes substantial amounts of unusable information, including, as described above, advertising, links and embedded subsections from other content within the otherwise-relevant content. These items typically add undesired terms to the vocabulary, and may have a detrimental effect on the performance of the language model.
  • [0029]
    The rejection model is initialized in the first iteration of data downloads. The rejection model 220 subsequently rejects information based on this model.
  • [0030]
    Documents whose scores are very low compared to an average document score for the background and topic model are classed as being rejected. A language model is built based on the rejected documents. Subsequent iterations then classify the documents as to their closeness to the rejected documents. Utterances with high likelihood matched to the rejection language model are included within the rejection model. The model may also include domain information for rejection, e.g., a list of bad domains, such as URLs and Web domains that result in large sets of rejected documents. This may form a block list that rejects future downloads. Conversely, a green list may be formed based on web sites with high scores that may be marked as potential sources for so-called blind recursive crawling of data.
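One possible way to derive such lists is sketched below in Python, purely as an illustration. The sketch tallies, per web domain, how many downloaded documents were rejected, and uses the rejection rate as a stand-in for the site score. The thresholds `block_rate`, `green_rate` and `min_docs`, and the use of a low rejection rate as a proxy for a high score, are assumptions that do not come from the disclosure.

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_domain_lists(results, block_rate=0.8, green_rate=0.1, min_docs=3):
    """Form a block list and a green list of web domains.

    results: iterable of (url, was_rejected) pairs from earlier iterations.
    A domain is only listed once at least min_docs documents were seen.
    """
    totals = defaultdict(int)
    rejected = defaultdict(int)
    for url, was_rejected in results:
        domain = urlparse(url).netloc.lower()
        totals[domain] += 1
        if was_rejected:
            rejected[domain] += 1

    block, green = set(), set()
    for domain, n in totals.items():
        if n < min_docs:
            continue  # too little evidence for this domain
        rate = rejected[domain] / n
        if rate >= block_rate:
            block.add(domain)   # future downloads would be refused
        elif rate <= green_rate:
            green.add(domain)   # candidate source for recursive crawling
    return block, green
```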
  • [0031]
    The rejection model helps maintain a clean vocabulary and removes noise from the training cycle.
  • [0032]
    The likelihood scores are calculated using background (“B”) 221, topic(“T”) 222 and rejection(“R”) 223 language models. 220 determines if the utterance scores high on the rejection model, or if the utterance has low scores on both background and topic models with respect to the average. If there is either a high score from the rejection model, or a low background and topic score, then the document or utterance is rejected.
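The decision at 220 may be sketched, under stated assumptions, as the following Python predicate. It assumes per-word log-likelihoods for the utterance under the three models; treating "high on the rejection model" as scoring above both other models, and "low with respect to the average" as falling more than a fixed `margin` below the running averages, are illustrative choices rather than definitions from the disclosure.

```python
def should_reject(ll_b, ll_t, ll_r, avg_b, avg_t, margin=2.0):
    """Decide whether an utterance (or document) is rejected.

    ll_b, ll_t, ll_r: log-likelihoods under the background, topic and
    rejection models. avg_b, avg_t: average log-likelihoods observed so far.
    margin: assumed gap below the average that counts as a low score.
    """
    # Assumed reading of "scores high on the rejection model".
    high_on_rejection = ll_r > max(ll_b, ll_t)
    # Low on both the background and the topic model relative to the average.
    low_on_both = (ll_b < avg_b - margin) and (ll_t < avg_t - margin)
    return high_on_rejection or low_on_both
```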
  • [0033]
    Otherwise, at 230, the classification and relevance weights for the utterance are calculated according to

    $$\mathrm{score}(T) = \frac{P(\mathrm{utt} \mid T)\, DW(T)}{P(\mathrm{utt} \mid T)\, DW(T) + P(\mathrm{utt} \mid B)\, DW(B)} \qquad (1)$$

    $$\mathrm{score}(B) = \frac{P(\mathrm{utt} \mid B)\, DW(B)}{P(\mathrm{utt} \mid T)\, DW(T) + P(\mathrm{utt} \mid B)\, DW(B)} \qquad (2)$$

    where T is the topic model, B is the background model, DW(T) is the document weight for the topic, and DW(B) is the document weight for the background.
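A minimal sketch of equations (1) and (2) in Python follows. It assumes that the utterance likelihoods are available as log-probabilities and that both document weights are strictly positive; the computation is carried out in the log domain to avoid underflow on long utterances. The function name is hypothetical.

```python
import math


def relevance_scores(logp_t, logp_b, dw_t, dw_b):
    """Return (score(T), score(B)) as in equations (1) and (2).

    logp_t, logp_b: log P(utt | T) and log P(utt | B).
    dw_t, dw_b: document weights for topic and background (must be > 0).
    """
    a = logp_t + math.log(dw_t)
    b = logp_b + math.log(dw_b)
    shift = max(a, b)  # subtract the maximum before exponentiating
    e_t = math.exp(a - shift)
    e_b = math.exp(b - shift)
    z = e_t + e_b
    return e_t / z, e_b / z
```

The two scores sum to one, so each utterance is softly assigned between the topic and the background rather than given a hard label.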
  • [0034]
    A document level weight is also obtained as explained herein. The document level weight is included as a trade-off between the relevance of the entire document and that of the given utterances.
  • [0035]
    The utterances, as weighted in this way, are grouped into a number of bins according to their weight for the topic model and the background model. The binned data is then used to create language models. These are combined according to their assigned weights to create an update topic model and an update background model.
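The grouping step might be sketched as follows in Python. Equal-width bins over the score range [0, 1] and the default of five bins are assumptions made for illustration; the disclosure does not state how bin boundaries are chosen.

```python
def bin_utterances(scored_utterances, n_bins=5):
    """Group utterances into equal-width bins by their topic score.

    scored_utterances: iterable of (utterance, score) with score in [0, 1].
    Returns a list of n_bins lists, lowest scores first.
    """
    bins = [[] for _ in range(n_bins)]
    for utterance, score in scored_utterances:
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append(utterance)
    return bins
```

A language model would then be built from each bin and the per-bin models combined according to the weight assigned to the bin.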
  • [0036]
    The update models are subsequently merged with the initial models, using a merge weight that is determined by iterative perplexity minimization on a held-out set. The new data is added to the model at 250, and hence the training set is enhanced at each iteration as new downloaded documents are included.
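As an illustrative sketch only, the choice of the interpolation weight can be shown in Python with toy unigram models. A simple grid search over candidate weights is substituted here for the iterative minimization; the function names, the `floor` probability and the grid resolution are assumptions.

```python
import math


def perplexity(initial, update, lam, heldout, floor=1e-9):
    """Perplexity of the interpolated unigram model on the held-out words.

    The merged probability is lam * initial(w) + (1 - lam) * update(w).
    floor: assumed minimum probability, to keep the logarithm finite.
    """
    log_sum = 0.0
    for word in heldout:
        p = lam * initial.get(word, 0.0) + (1.0 - lam) * update.get(word, 0.0)
        log_sum += math.log(max(p, floor))
    return math.exp(-log_sum / len(heldout))


def best_merge_weight(initial, update, heldout, steps=20):
    """Grid-search the merge weight that minimizes held-out perplexity."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda lam: perplexity(initial, update, lam, heldout))
```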
  • [0037]
    The document classification system may classify documents to correspond to topic, background or rejected data. These classifications may then be used to train a document classification system. The training may use the TFIDF/Naive Bayes measure, included in the CMU BOW toolkit. The document weights are used in conjunction with the utterance weights that have been calculated at 240. Document weights are calculated for each of the background class, the topic class and the rejection class. Moreover, mutual information between the background, topic and rejection class labels is used to select keywords using the relative entropy measure. The keyword selection process chooses words which have high discrimination power for document classification and high conditional occurrence probability for the topic model.
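The keyword selection idea can be illustrated with a small Python sketch that computes the mutual information between the presence of a word in a document and the document's class label. This is a generic textbook computation under assumed inputs (documents as sets of words with a label each); it does not reproduce the toolkit named above, and the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(docs):
    """Mutual information (in nats) between word presence and class label.

    docs: list of (set_of_words, label) pairs.
    Returns a dict mapping each word to its mutual information.
    """
    n = len(docs)
    label_counts = Counter(label for _, label in docs)
    vocab = set().union(*(words for words, _ in docs))
    result = {}
    for w in vocab:
        joint = Counter((w in words, label) for words, label in docs)
        present = sum(1 for words, _ in docs if w in words)
        total = 0.0
        for (has_w, label), count in joint.items():
            p_xy = count / n
            p_x = (present if has_w else n - present) / n
            p_y = label_counts[label] / n
            total += p_xy * math.log(p_xy / (p_x * p_y))
        result[w] = total
    return result
```

Words with high mutual information discriminate well between classes; a selection step would additionally require a high conditional occurrence probability under the topic model.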
  • [0038]
    The techniques described above, including an initial rejection of documents that match the rejection model or have low utterance relevancy, are carried out prior to adding the document to the training set.
  • [0039]
    A simple linear interpolation model may be used for merging the Web data language model with the existing topic models. More complex techniques, such as class based model interpolation, can also be used.
  • [0040]
    The language model may use a bin based approach as described. Alternatively, fractional counting can be used to build the language models from weighted utterances directly instead of the bin based approach.
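Fractional counting may be sketched, for the unigram case and by way of example only, as follows in Python. Each word occurrence contributes the weight of its utterance instead of a full count of one; whitespace tokenization and the function name are assumptions.

```python
from collections import defaultdict


def fractional_unigram(weighted_utterances):
    """Estimate unigram probabilities from weighted utterances.

    weighted_utterances: iterable of (text, weight) pairs, where the weight
    is the relevance score assigned to the utterance.
    """
    counts = defaultdict(float)
    for text, weight in weighted_utterances:
        for word in text.split():
            counts[word] += weight  # fractional rather than unit count
    total = sum(counts.values())
    if total <= 0.0:
        return {}
    return {word: count / total for word, count in counts.items()}
```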
  • [0041]
    The system is shown as an endless loop used for the iteration. For the first iteration, the system uses a dummy rejection module. A termination condition for the iteration loop can also be set.
  • [0042]
    In a specific example, the training is carried out using a system designed for movie domain related text such as movie reviews, news and the like.
  • [0043]
    Initial data for training the model is generated from the movie web site IMDB.com. The background model used an interpolated model with data from multiple different models, including SWB (4M words), WSJ (2M words), and the Gutenberg project (2M words). Pruning was then used to reduce the background model to a vocabulary of about 40,000 words.
  • [0044]
    The test set is collected from a random selection of movie reviews and news from MSN, Amazon and other sites. The test noted that the query generation process worked well with as few as 20,000 words. Hence, the final merged model performed well even with small amounts of data. The final merged model is not critically dependent on the size of the seed set. Moreover, the best results were obtained with five keywords and two key phrases, although other counts of keywords and key phrases could also be adequately used.
  • [0045]
    It was found that the rejection model removes about 8% of the downloaded documents and about 6% of the utterances from the remaining document set in its first iteration. The rejected data size increases on subsequent iterations. By the end of the process, an average of 10% of the documents and 13% of the utterances have been removed in all.
  • [0046]
    Finally, for data weighting, the initial seed set size of 300 words got its best performance from five weight bins, five keywords and two key phrases as the query structure, and data filtering.
  • [0047]
    Another embodiment describes an additional aspect related to whether new information really adds to the language model—the issue of distributional similarity. A technique described herein uses an incremental greedy selection scheme based on relative entropy. This technique selects information, e.g. a sentence or phrase, if and only if adding the phrase to the already selected set of information reduces the relative entropy with respect to the in-domain data distribution.
  • [0048]
    Denote the language model built from in-domain data as $P$, and let $P_{init}$ be a language model for initialization purposes, which is estimated by bagging samples from the same in-domain data. The technique is described herein using unigram probabilities. However, the disclosed technique can also be used with higher-order n-grams.
  • [0049]
    Let $W_0(i)$ be an initial set of counts for the words $i$ in the vocabulary $V$, initialized using $P_{init}$. The count of word $i$ in the $j$th sentence $s_j$ of the webdata is denoted as $m_{ij}$.
  • [0050]
    Let $n_j = \sum_i m_{ij}$ be the number of words in the sentence, and let $N_j = \sum_i W_j(i)$ be the total number of words already selected.
  • [0051]
    The relative entropy of the maximum likelihood estimate of the language model of the selected sentences to the initial model $P$ is given by

    $$H(j-1) = \sum_i P(i) \ln \frac{P(i)}{W_j(i)/N_j}$$
  • [0052]
    If we select the sentence $s_j$, the updated relative entropy is

    $$H(j) = \sum_i P(i) \ln \frac{P(i)}{(W_j(i)+m_{ij})/(N_j+n_j)}$$
  • [0053]
    Direct computation of relative entropy using the above expressions for every sentence in the webdata will have a very high computational cost, since O(V) computations per sentence in the webdata would be required. The number of sentences in the webdata can be very large and can easily be on the order of $10^8$ to $10^9$. The computation cost for moderate vocabularies (around $10^5$) would be very large, on the order of $O(10^{14})$. If bigrams and trigrams are included, the computation becomes infeasible.
  • [0054]
    The summation $H(j)$ can be split into

    $$H(j) = \sum_i P(i) \ln P(i) - \sum_i P(i) \ln \frac{W_j(i)+m_{ij}}{N_j+n_j} = H(j-1) + \underbrace{\ln \frac{N_j+n_j}{N_j}}_{T_1} - \underbrace{\sum_{i,\, m_{ij} \neq 0} P(i) \ln \frac{W_j(i)+m_{ij}}{W_j(i)}}_{T_2}$$
  • [0055]
    Intuitively, the term $T_1$ represents the decrease in probability mass caused by adding $n_j$ more words to the corpus, and the term $T_2$ measures the improvement in probability, weighted by the in-domain distribution $P$, for words with non-zero $m_{ij}$.
  • [0056]
    The relative entropy will decrease with selection of sentence $s_j$ if $T_1 < T_2$. To make the selection more refined, a condition $T_1 + thr(j) < T_2$ can be used, where $thr(j)$ is a function of $j$. A good choice for $thr(j)$ is a function that declines at the same or similar rate (e.g., within 10 or 20%) as the ratio $\ln \frac{N_j+n_j}{N_j} \approx \frac{n_j}{N_j} \approx 1/kj$, where $k$ is the average number of words for every sentence.
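A minimal Python sketch of this greedy selection rule, for the unigram case, is given below. It takes the relative entropy as the sum over i of P(i) ln(P(i)/Q(i)), where Q is the maximum likelihood estimate from the selected counts, and accepts a sentence when T1 + thr(j) < T2. The function name, the default threshold of zero, and the choice to ignore tokens outside the initial vocabulary are assumptions made for illustration.

```python
import math
from collections import Counter


def select_sentences(sentences, p, init_counts, thr=lambda j: 0.0):
    """Greedily select sentences that reduce the relative entropy to P.

    sentences: list of token lists (the webdata).
    p: in-domain unigram probabilities P(i).
    init_counts: initial counts W_0(i), positive for every vocabulary word.
    thr: optional threshold function of the sentence index j.
    Returns (selected_sentences, final_counts).
    """
    w = dict(init_counts)
    n_total = sum(w.values())  # N_j, the number of words selected so far
    selected = []
    for j, sentence in enumerate(sentences, start=1):
        # m_ij for in-vocabulary words only (assumption: others are ignored).
        m = Counter(tok for tok in sentence if tok in w)
        n_j = sum(m.values())
        if n_j == 0:
            continue
        t1 = math.log((n_total + n_j) / n_total)
        t2 = sum(p.get(i, 0.0) * math.log((w[i] + c) / w[i]) for i, c in m.items())
        if t1 + thr(j) < t2:
            for i, c in m.items():
                w[i] += c
            n_total += n_j
            selected.append(sentence)
    return selected, w
```

Each decision touches only the words of the current sentence, so the cost per sentence is on the order of the sentence length rather than the vocabulary size.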
  • [0057]
    This technique becomes better at selecting the right sentences as the size of the already selected corpus, $N_j$, increases and the relative entropy $H(j)$ decreases. The initial set of sentences selected might not be as useful or relevant to the task. However, after doing one round of selection from the webdata, the selected sentences may be re-entered into the corpus and scanned again. This simple heuristic helps to significantly reduce the bias towards selecting more in the initial part of the process. The corpus may also be randomly permuted a few times to generate more subsets.
  • [0058]
    Use of the maximum likelihood estimation for estimating the intermediate language models for $W_j$ may simplify the entropy calculation, which reduces the order from O(V) to O(k). However, maximum likelihood estimation of language models is relatively poor when compared to smoothing based estimation. To balance the computation cost and estimation accuracy, the counts $W_j$ may be modified, e.g., by using Kneser-Ney smoothing periodically after a fixed number of sentences. The general structure and techniques, and more specific embodiments which can be used to effect different ways of carrying out the more general goals, are described herein.
  • [0059]
    Although only a few embodiments have been disclosed in detail above, other embodiments are possible and the inventor(s) intend these to be encompassed within this specification. The specification describes specific examples to accomplish a more general goal that may be accomplished in another way. This disclosure is intended to be exemplary, and the claims are intended to cover any modification or alternative which might be predictable to a person having ordinary skill in the art. For example, this system can be used for other text-to-text applications, including automated summarization, or for any other application that uses a natural language model.
  • [0060]
    Also, the inventor(s) intend that only those claims which use the words “means for” are intended to be interpreted under 35 USC 112, sixth paragraph. Moreover, no limitations from the specification are intended to be read into any claims, unless those limitations are expressly included in the claims.
  • [0061]
    The processing described herein is carried out on a computer. The computer may be any kind of computer, either general purpose, or some specific purpose computer such as a workstation. The computer may be a Pentium class computer, running Windows XP or Linux, or may be a Macintosh computer. The computer may also be a handheld computer, such as a PDA, cellphone, or laptop. The programs may be written in C, or Java, Brew or any other programming language. The programs may be resident on a storage medium, e.g., magnetic or optical, e.g. the computer hard drive, a removable disk or media such as a memory stick or SD media, or other removable medium. The program may also be run over a network, for example, with a server or other machine sending signals to the local machine, which allows the local machine to carry out the operations described herein.

Claims (33)

  1. 1. A computer system comprising:
    a query element which produces queries to a large database of documents; and
    a language model part, which receives information responsive to said queries, and uses said information for said language model, by using some, but not all, of said information, for said language model.
  2. 2. A system as in claim 1, wherein said query element uses the language model to form said queries.
  3. 3. A system as in claim 2, wherein the query element forms new queries as the language model is improved based on new information from said database of documents.
  4. 4. A system as in claim 1, wherein said language model part includes a rejection model, which compares said information from said queries to other information which has already been determined as not being used for said language model.
  5. 5. A system as in claim 4, wherein said language model part also includes a background model which classifies said information as background, and a topic model which classifies said information as being specific to a specific topic of the language model.
  6. 6. A system as in claim 1, wherein said language model part models the information at least at a document level which includes a collection of sentences, and at an utterance level which includes less than said collection of sentences.
  7. 7. A system as in claim 1, wherein said language model part analyzes said information, and adds said information to a language model only if said information reduces a total relative entropy with respect to the data distribution.
  8. 8. A system as in claim 7, wherein said language model compares new information to previously obtained information, and uses said new information only when it is not sufficiently similar to previously obtained information.
  9. 9. A system as in claim 3, wherein said language model includes a background model which is not specific to a topic of the language model, a topic model which is specific to a specified topic of the language model, and wherein said query element uses both said background and topic models to form said queries.
  10. 10. A system as in claim 9, wherein said language model further includes a rejection model which models information that should not be added to said language model, and wherein said rejection model forms said rejection model based on information that scores poorly using both said background model and said topic model.
  11. 11. A system as in claim 1, wherein said large database of documents is the Internet, and said query element queries at least one part of the Internet.
  12. 12. A system as in claim 1, wherein said large database of documents is a collection of documents within a company.
  13. 13. A system as in claim 1, wherein said large database of documents is a database with more than 10,000 documents.
  14. 14. A method, comprising:
    querying a large database of documents which includes more than 10,000 documents, and includes documents which are directed to a plurality of different topics;
    receiving information responsive to said querying; and
    using said information for a language model by classifying said documents, and using some, but not all, of said information, for said language model.
  15. 15. A method as in claim 14, wherein said querying comprises using the language model to form queries.
  16. 16. A method as in claim 14, wherein said using comprises using said information to improve said language model, and using an improved language model to form new queries.
  17. 17. A method as in claim 16, wherein said using further comprises using a background model which is generic to a plurality of topics, a topic specific model, which is specific to a specified topic.
  18. 18. A method as in claim 14, further comprising using said information both at a document level and at an utterance level, wherein a document level includes a plurality of sentences, and said utterance level includes less than said plurality of sentences.
  19. 19. A method as in claim 14, wherein said using comprises analyzing said information to determine if said information adds new information to said language model, and adding said information only if said information adds said new information to said language model.
  20. 20. A method as in claim 14, wherein said large database of documents is the Internet.
  21. 21. A method as in claim 14, wherein said using comprises using a rejection model, which models information received responsive to said querying, that has not been added to the language model, and rejecting subsequent information based on said rejection model.
  22. 22. A method comprising:
    using a language model to form queries to a plurality of documents, which documents include at least some documents that have information about a topic, and at least other documents which do not have information about said topic;
    receiving information from said documents, responsive to said queries; and
    classifying said information and using said information to modify said language model.
  23. A method as in claim 22, wherein said using comprises using the information to improve the language model, and using an improved language model to form new queries.
  24. A method, comprising:
    accessing a plurality of documents which includes some documents that include information about a topic and other documents that do not include information about said topic;
    comparing information from said documents to a rejection model which represents a model of information that is not sufficiently relevant to said topic to use as a language model for said topic;
    rejecting information which is not sufficiently relevant; and using information which is sufficiently relevant for said language model.
  25. A method as in claim 24, further comprising using said language model to form a query which queries the Internet, and using responses from said query as documents used by said obtaining.
  26. A method as in claim 24, wherein said comparing compares both documents as a whole and also compares utterances within the documents.
  27. A method as in claim 24, further comprising weighting parts of the documents according to relevance to the topic.
  28. A method as in claim 24, wherein said language model includes a background language model, representative of a topic independent model, and a topic language model representative of the topic.
  29. A method as in claim 28, wherein said comparing comprises determining a relative entropy comparison of the background model and the topic model.
  30. A method as in claim 24, further comprising forming a topic specific language model using said information which is sufficiently relevant, and using said topic specific language model to update said language model.
  31. A method as in claim 30, further comprising using said language model for speech recognition.
  32. A method as in claim 25, further comprising forming a list of Internet sites based on said rejecting, and rejecting use of Internet sites that are on said list.
  33. A method as in claim 32, wherein said forming comprises identifying queries and web addresses URLS which provide data for improving the language model, by measuring the gains in model at each of a plurality of iterations.
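
The filtering recited in claims 24, 28 and 29 (comparing retrieved text against a topic model and a background model, and keeping only text that is sufficiently relevant) can be illustrated with a minimal sketch. This is not the claimed implementation: the unigram models, the add-one smoothing, the per-word log-likelihood-ratio score, the zero threshold, and every function name and sample sentence below are assumptions chosen only to keep the example small.

```python
import math
from collections import Counter


def unigram_model(sentences, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary (assumed model form)."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def relative_entropy(p, q):
    """Relative entropy D(p || q) in nats; p and q must share one vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def topic_score(utterance, topic, background):
    """Average per-word log-likelihood ratio of the topic model to the background model."""
    words = [w for w in utterance.split() if w in topic]
    if not words:
        return float("-inf")  # nothing in vocabulary: treat as not relevant
    return sum(math.log(topic[w] / background[w]) for w in words) / len(words)


def filter_utterances(candidates, topic, background, threshold=0.0):
    """Keep utterances scoring above the (hypothetical) threshold; reject the rest."""
    return [u for u in candidates if topic_score(u, topic, background) > threshold]


# Hypothetical seed data standing in for topic text and generic text.
topic_seed = ["the patient has a fever", "the doctor examined the patient"]
background_seed = ["the market closed higher today", "the team won the game"]
vocab = {w for s in topic_seed + background_seed for w in s.split()}

topic_lm = unigram_model(topic_seed, vocab)
background_lm = unigram_model(background_seed, vocab)

candidates = ["the doctor has a patient", "the team won today"]
kept = filter_utterances(candidates, topic_lm, background_lm)
divergence = relative_entropy(topic_lm, background_lm)
print(kept, divergence)
```

In this sketch the utterance-level check of claim 26 corresponds to calling `topic_score` on individual sentences; a document-level check would apply the same score to the concatenated text. A real system would presumably use higher-order n-gram models and a tuned threshold.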
US11384226 2005-03-17 2006-03-17 Topic specific language models built from large numbers of documents Expired - Fee Related US7739286B2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US66314105 2005-03-17 2005-03-17
US11384226 US7739286B2 (en) 2005-03-17 2006-03-17 Topic specific language models built from large numbers of documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11384226 US7739286B2 (en) 2005-03-17 2006-03-17 Topic specific language models built from large numbers of documents

Publications (2)

Publication Number Publication Date
US20060212288A1 (en) 2006-09-21
US7739286B2 (en) 2010-06-15

Family

ID=36992483

Family Applications (1)

Application Number Title Priority Date Filing Date
US11384226 Expired - Fee Related US7739286B2 (en) 2005-03-17 2006-03-17 Topic specific language models built from large numbers of documents

Country Status (2)

Country Link
US (1) US7739286B2 (en)
WO (1) WO2006099621A3 (en)

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070233487A1 (en) * 2006-04-03 2007-10-04 Cohen Michael H Automatic language model update
US20070294077A1 (en) * 2006-05-22 2007-12-20 Shrikanth Narayanan Socially Cognizant Translation by Detecting and Transforming Elements of Politeness and Respect
US20080003551A1 (en) * 2006-05-16 2008-01-03 University Of Southern California Teaching Language Through Interactive Translation
US20080071518A1 (en) * 2006-05-18 2008-03-20 University Of Southern California Communication System Using Mixed Translating While in Multilingual Communication
US20090150308A1 (en) * 2007-12-07 2009-06-11 Microsoft Corporation Maximum entropy model parameterization
US20100169385A1 (en) * 2008-12-29 2010-07-01 Robert Rubinoff Merging of Multiple Data Sets
US20100250614A1 (en) * 2009-03-31 2010-09-30 Comcast Cable Holdings, Llc Storing and searching encoded data
US20100293195A1 (en) * 2009-05-12 2010-11-18 Comcast Interactive Media, Llc Disambiguation and Tagging of Entities
US20110004462A1 (en) * 2009-07-01 2011-01-06 Comcast Interactive Media, Llc Generating Topic-Specific Language Models
US20110172988A1 (en) * 2010-01-08 2011-07-14 Microsoft Corporation Adaptive construction of a statistical language model
US8032356B2 (en) 2006-05-25 2011-10-04 University Of Southern California Spoken translation system using meta information strings
JP2012008554A (en) * 2010-05-24 2012-01-12 Denso Corp Voice recognition device
US8176016B1 (en) * 2006-11-17 2012-05-08 At&T Intellectual Property Ii, L.P. Method and apparatus for rapid identification of column heterogeneity
US20120191694A1 (en) * 2010-12-09 2012-07-26 Apple Inc. Generation of topic-based language models for an app search engine
US20120253799A1 (en) * 2011-03-28 2012-10-04 At&T Intellectual Property I, L.P. System and method for rapid customization of speech recognition models
US8296142B2 (en) 2011-01-21 2012-10-23 Google Inc. Speech recognition using dock context
US8352246B1 (en) 2010-12-30 2013-01-08 Google Inc. Adjusting language models
US20130110501A1 (en) * 2010-05-20 2013-05-02 Nec Corporation Perplexity calculation device
US8527520B2 (en) 2000-07-06 2013-09-03 Streamsage, Inc. Method and system for indexing and searching timed media information based upon relevant intervals
US8527534B2 (en) 2010-03-18 2013-09-03 Microsoft Corporation Bootstrap and adapt a document search engine
US8713016B2 (en) 2008-12-24 2014-04-29 Comcast Interactive Media, Llc Method and apparatus for organizing segments of media assets and determining relevance of segments to a query
US8725496B2 (en) 2011-07-26 2014-05-13 International Business Machines Corporation Customization of a natural language processing engine
US8751217B2 (en) 2009-12-23 2014-06-10 Google Inc. Multi-modal input on an electronic device
US8965763B1 (en) * 2012-02-02 2015-02-24 Google Inc. Discriminative language modeling for automatic speech recognition with a weak acoustic model and distributed training
US9081760B2 (en) 2011-03-08 2015-07-14 At&T Intellectual Property I, L.P. System and method for building diverse language models
US20150278192A1 (en) * 2014-03-25 2015-10-01 Nice-Systems Ltd Language model adaptation based on filtered data
US20160004763A1 (en) * 2010-06-07 2016-01-07 Quora, Inc. Methods and systems for merging topics assigned to content items in an online application
US9348915B2 (en) 2009-03-12 2016-05-24 Comcast Interactive Media, Llc Ranking search results
US9412365B2 (en) 2014-03-24 2016-08-09 Google Inc. Enhanced maximum entropy models
US9442933B2 (en) 2008-12-24 2016-09-13 Comcast Interactive Media, Llc Identification of segments within audio, video, and multimedia items
US20160336006A1 (en) * 2015-05-13 2016-11-17 Microsoft Technology Licensing, Llc Discriminative data selection for language modeling
US9842592B2 (en) 2014-02-12 2017-12-12 Google Inc. Language models using non-linguistic context

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8370127B2 (en) * 2006-06-16 2013-02-05 Nuance Communications, Inc. Systems and methods for building asset based natural language call routing application with limited resources
GB0623932D0 (en) * 2006-11-29 2007-01-10 Ibm Data modelling of class independent recognition models
US8185528B2 (en) * 2008-06-23 2012-05-22 Yahoo! Inc. Assigning human-understandable labels to web pages
US8768686B2 (en) 2010-05-13 2014-07-01 International Business Machines Corporation Machine translation with side information
US8812321B2 (en) * 2010-09-30 2014-08-19 At&T Intellectual Property I, L.P. System and method for combining speech recognition outputs from a plurality of domain-specific speech recognizers via machine learning
US20120253784A1 (en) * 2011-03-31 2012-10-04 International Business Machines Corporation Language translation based on nearby devices
US9916306B2 (en) 2012-10-19 2018-03-13 Sdl Inc. Statistical linguistic analysis of source content
US9734826B2 (en) 2015-03-11 2017-08-15 Microsoft Technology Licensing, Llc Token-level interpolation for class-based language models

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6418431B1 (en) * 1998-03-30 2002-07-09 Microsoft Corporation Information retrieval and speech recognition based on language models
US6430551B1 (en) * 1997-10-08 2002-08-06 Koninklijke Philips Electroncis N.V. Vocabulary and/or language model training
US20030046263A1 (en) * 2001-08-31 2003-03-06 Maria Castellanos Method and system for mining a document containing dirty text
US20040254920A1 (en) * 2003-06-16 2004-12-16 Brill Eric D. Systems and methods that employ a distributional analysis on a query log to improve search results
US20050010556A1 (en) * 2002-11-27 2005-01-13 Kathleen Phelan Method and apparatus for information retrieval
US6990628B1 (en) * 1999-06-14 2006-01-24 Yahoo! Inc. Method and apparatus for measuring similarity among electronic documents
US7406458B1 (en) * 2002-09-17 2008-07-29 Yahoo! Inc. Generating descriptions of matching resources based on the kind, quality, and relevance of available sources of information about the matching resources

Cited By (67)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9244973B2 (en) 2000-07-06 2016-01-26 Streamsage, Inc. Method and system for indexing and searching timed media information based upon relevance intervals
US8706735B2 (en) * 2000-07-06 2014-04-22 Streamsage, Inc. Method and system for indexing and searching timed media information based upon relevance intervals
US8527520B2 (en) 2000-07-06 2013-09-03 Streamsage, Inc. Method and system for indexing and searching timed media information based upon relevant intervals
US9542393B2 (en) 2000-07-06 2017-01-10 Streamsage, Inc. Method and system for indexing and searching timed media information based upon relevance intervals
US8423359B2 (en) 2006-04-03 2013-04-16 Google Inc. Automatic language model update
US20110213613A1 (en) * 2006-04-03 2011-09-01 Google Inc., a CA corporation Automatic Language Model Update
US9159316B2 (en) 2006-04-03 2015-10-13 Google Inc. Automatic language model update
US7756708B2 (en) * 2006-04-03 2010-07-13 Google Inc. Automatic language model update
US8447600B2 (en) 2006-04-03 2013-05-21 Google Inc. Automatic language model update
US20070233487A1 (en) * 2006-04-03 2007-10-04 Cohen Michael H Automatic language model update
US20080003551A1 (en) * 2006-05-16 2008-01-03 University Of Southern California Teaching Language Through Interactive Translation
US20110207095A1 (en) * 2006-05-16 2011-08-25 University Of Southern California Teaching Language Through Interactive Translation
US20080071518A1 (en) * 2006-05-18 2008-03-20 University Of Southern California Communication System Using Mixed Translating While in Multilingual Communication
US8706471B2 (en) 2006-05-18 2014-04-22 University Of Southern California Communication system using mixed translating while in multilingual communication
US20070294077A1 (en) * 2006-05-22 2007-12-20 Shrikanth Narayanan Socially Cognizant Translation by Detecting and Transforming Elements of Politeness and Respect
US8032355B2 (en) 2006-05-22 2011-10-04 University Of Southern California Socially cognizant translation by detecting and transforming elements of politeness and respect
US8032356B2 (en) 2006-05-25 2011-10-04 University Of Southern California Spoken translation system using meta information strings
US8176016B1 (en) * 2006-11-17 2012-05-08 At&T Intellectual Property Ii, L.P. Method and apparatus for rapid identification of column heterogeneity
US20090150308A1 (en) * 2007-12-07 2009-06-11 Microsoft Corporation Maximum entropy model parameterization
US7925602B2 (en) 2007-12-07 2011-04-12 Microsoft Corporation Maximum entropy model classfier that uses gaussian mean values
US9477712B2 (en) 2008-12-24 2016-10-25 Comcast Interactive Media, Llc Searching for segments based on an ontology
US9442933B2 (en) 2008-12-24 2016-09-13 Comcast Interactive Media, Llc Identification of segments within audio, video, and multimedia items
US8713016B2 (en) 2008-12-24 2014-04-29 Comcast Interactive Media, Llc Method and apparatus for organizing segments of media assets and determining relevance of segments to a query
US20100169385A1 (en) * 2008-12-29 2010-07-01 Robert Rubinoff Merging of Multiple Data Sets
US9348915B2 (en) 2009-03-12 2016-05-24 Comcast Interactive Media, Llc Ranking search results
US20100250614A1 (en) * 2009-03-31 2010-09-30 Comcast Cable Holdings, Llc Storing and searching encoded data
US20100293195A1 (en) * 2009-05-12 2010-11-18 Comcast Interactive Media, Llc Disambiguation and Tagging of Entities
US9626424B2 (en) 2009-05-12 2017-04-18 Comcast Interactive Media, Llc Disambiguation and tagging of entities
US8533223B2 (en) 2009-05-12 2013-09-10 Comcast Interactive Media, LLC. Disambiguation and tagging of entities
US20110004462A1 (en) * 2009-07-01 2011-01-06 Comcast Interactive Media, Llc Generating Topic-Specific Language Models
US9892730B2 (en) * 2009-07-01 2018-02-13 Comcast Interactive Media, Llc Generating topic-specific language models
US9251791B2 (en) 2009-12-23 2016-02-02 Google Inc. Multi-modal input on an electronic device
US9047870B2 (en) 2009-12-23 2015-06-02 Google Inc. Context based language model selection
US9495127B2 (en) 2009-12-23 2016-11-15 Google Inc. Language model selection for speech-to-text conversion
US8751217B2 (en) 2009-12-23 2014-06-10 Google Inc. Multi-modal input on an electronic device
US9031830B2 (en) 2009-12-23 2015-05-12 Google Inc. Multi-modal input on an electronic device
US20110172988A1 (en) * 2010-01-08 2011-07-14 Microsoft Corporation Adaptive construction of a statistical language model
US9448990B2 (en) 2010-01-08 2016-09-20 Microsoft Technology Licensing, Llc Adaptive construction of a statistical language model
US8577670B2 (en) 2010-01-08 2013-11-05 Microsoft Corporation Adaptive construction of a statistical language model
US8527534B2 (en) 2010-03-18 2013-09-03 Microsoft Corporation Bootstrap and adapt a document search engine
US9075774B2 (en) * 2010-05-20 2015-07-07 Nec Corporation Perplexity calculation device
US20130110501A1 (en) * 2010-05-20 2013-05-02 Nec Corporation Perplexity calculation device
JP2012008554A (en) * 2010-05-24 2012-01-12 Denso Corp Voice recognition device
US9852211B2 (en) * 2010-06-07 2017-12-26 Quora, Inc. Methods and systems for merging topics assigned to content items in an online application
US20160004763A1 (en) * 2010-06-07 2016-01-07 Quora, Inc. Methods and systems for merging topics assigned to content items in an online application
US9805022B2 (en) * 2010-12-09 2017-10-31 Apple Inc. Generation of topic-based language models for an app search engine
US20120191694A1 (en) * 2010-12-09 2012-07-26 Apple Inc. Generation of topic-based language models for an app search engine
US8352245B1 (en) 2010-12-30 2013-01-08 Google Inc. Adjusting language models
US9076445B1 (en) 2010-12-30 2015-07-07 Google Inc. Adjusting language models using context information
US8352246B1 (en) 2010-12-30 2013-01-08 Google Inc. Adjusting language models
US9542945B2 (en) 2010-12-30 2017-01-10 Google Inc. Adjusting language models based on topics identified using context
US8296142B2 (en) 2011-01-21 2012-10-23 Google Inc. Speech recognition using dock context
US8396709B2 (en) 2011-01-21 2013-03-12 Google Inc. Speech recognition using device docking context
US9727557B2 (en) 2011-03-08 2017-08-08 Nuance Communications, Inc. System and method for building diverse language models
US9396183B2 (en) 2011-03-08 2016-07-19 At&T Intellectual Property I, L.P. System and method for building diverse language models
US9081760B2 (en) 2011-03-08 2015-07-14 At&T Intellectual Property I, L.P. System and method for building diverse language models
US20120253799A1 (en) * 2011-03-28 2012-10-04 At&T Intellectual Property I, L.P. System and method for rapid customization of speech recognition models
US9679561B2 (en) * 2011-03-28 2017-06-13 Nuance Communications, Inc. System and method for rapid customization of speech recognition models
US8725496B2 (en) 2011-07-26 2014-05-13 International Business Machines Corporation Customization of a natural language processing engine
US8965763B1 (en) * 2012-02-02 2015-02-24 Google Inc. Discriminative language modeling for automatic speech recognition with a weak acoustic model and distributed training
US9842592B2 (en) 2014-02-12 2017-12-12 Google Inc. Language models using non-linguistic context
US9412365B2 (en) 2014-03-24 2016-08-09 Google Inc. Enhanced maximum entropy models
US20150278192A1 (en) * 2014-03-25 2015-10-01 Nice-Systems Ltd Language model adaptation based on filtered data
US9564122B2 (en) * 2014-03-25 2017-02-07 Nice Ltd. Language model adaptation based on filtered data
US9761220B2 (en) * 2015-05-13 2017-09-12 Microsoft Technology Licensing, Llc Language modeling based on spoken and unspeakable corpuses
WO2016183110A1 (en) * 2015-05-13 2016-11-17 Microsoft Technology Licensing, Llc Discriminative data selection for language modeling
US20160336006A1 (en) * 2015-05-13 2016-11-17 Microsoft Technology Licensing, Llc Discriminative data selection for language modeling

Also Published As

Publication number Publication date Type
WO2006099621A3 (en) 2009-04-16 application
WO2006099621A2 (en) 2006-09-21 application
US7739286B2 (en) 2010-06-15 grant

Similar Documents

Publication Publication Date Title
Chaovalit et al. Movie review mining: A comparison between supervised and unsupervised classification approaches
Porteous et al. Fast collapsed gibbs sampling for latent dirichlet allocation
Chelba et al. Retrieval and browsing of spoken content
Blei et al. Topic segmentation with an aspect hidden Markov model
Lee Similarity-based approaches to natural language processing
Li Learning to rank for information retrieval and natural language processing
Rosenfeld Adaptive statistical language modeling: A maximum entropy approach
US5828999A (en) Method and system for deriving a large-span semantic language model for large-vocabulary recognition systems
US7827125B1 (en) Learning based on feedback for contextual personalized information retrieval
Craven et al. Learning to construct knowledge bases from the World Wide Web
US6418431B1 (en) Information retrieval and speech recognition based on language models
Cohen et al. Exploiting dictionaries in named entity extraction: combining semi-markov extraction processes and data integration methods
US20080059187A1 (en) Retrieval of Documents Using Language Models
US6606620B1 (en) Method and system for classifying semi-structured documents
US7421418B2 (en) Method and apparatus for fundamental operations on token sequences: computing similarity, extracting term values, and searching efficiently
US20050021323A1 (en) Method and apparatus for identifying translations
US20050234952A1 (en) Content propagation for enhanced document retrieval
US6868411B2 (en) Fuzzy text categorizer
US20050108200A1 (en) Category based, extensible and interactive system for document retrieval
US20040254795A1 (en) Speech input search system
US20080091670A1 (en) Search phrase refinement by search term replacement
McCallum et al. Automating the construction of internet portals with machine learning
US20030028512A1 (en) System and method of finding documents related to other documents and of finding related words in response to a query to refine a search
US20090248661A1 (en) Identifying relevant information sources from user activity
US20060206306A1 (en) Text mining apparatus and associated methods

Legal Events

Date Code Title Description
AS Assignment

Owner name: SOUTHERN CALIFORNIA, UNIVERSITY OF, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SETHY, ABHINAV;GEORGIOU, PANAYIOTIS;NARAYANAN, SHRIKANTH;REEL/FRAME:017583/0651

Effective date: 20060424

CC Certificate of correction
REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
FP Expired due to failure to pay maintenance fee

Effective date: 20140615